
Dougherty

Introduction to Econometrics,
5th edition
Chapter 5: Dummy Variables

© Christopher Dougherty, 2016. All rights reserved.


DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Figure: COST plotted against N, showing parallel cost functions for occupational schools (intercept b1') and regular schools (intercept b1).]

This sequence explains how you can include qualitative explanatory variables in your
regression model.


Suppose that you have data on the annual recurrent expenditure, COST, and the number of
students enrolled, N, for a sample of secondary schools, of which there are two types:
regular and occupational.

The occupational schools aim to provide skills for specific occupations and they tend to be
relatively expensive to run because they need to maintain specialized workshops.


One way of dealing with the difference in the costs would be to run separate regressions
for the two types of school.


However, this would have the drawback that you would be running regressions with two
small samples instead of one large one, which would reduce the precision of the
estimates of the coefficients.

Regular school:       COST = b1 + b2N + u
Occupational school:  COST = b1' + b2N + u

Another way of handling the difference would be to hypothesize that the cost function for
occupational schools has an intercept b1' that is greater than that for regular schools.


Effectively, we are hypothesizing that the annual overhead cost is different for the two
types of school, but the marginal cost is the same. The marginal cost assumption is not
very plausible and we will relax it in due course.

Let us define d to be the difference in the intercepts: d = b1' – b1.

Occupational school:  COST = b1 + d + b2N + u

Then b1' = b1 + d and we can rewrite the cost function for occupational schools as shown.

Regular school (OCC = 0):       COST = b1 + b2N + u
Occupational school (OCC = 1):  COST = b1 + d + b2N + u
Combined equation:              COST = b1 + d OCC + b2N + u

We can now combine the two cost functions by defining a dummy variable OCC that has
value 0 for regular schools and 1 for occupational schools.


Dummy variables always have two values, 0 or 1. If OCC is equal to 0, the cost function
becomes that for regular schools. If OCC is equal to 1, the cost function becomes that for
occupational schools.
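
As a minimal sketch of how the combined equation collapses to the two separate cost functions, the Python fragment below evaluates COST = b1 + d OCC + b2N for both values of OCC. The parameter values are purely illustrative assumptions (they are not estimates from this sequence), the disturbance term u is left out, and the function name is hypothetical.

```python
def combined_cost(n, occ, b1, b2, d):
    """Deterministic part of COST = b1 + d*OCC + b2*N (disturbance u omitted)."""
    if occ not in (0, 1):
        raise ValueError("OCC is a dummy variable and must be 0 or 1")
    return b1 + d * occ + b2 * n


# Illustrative parameter values only; they are assumptions, not estimates.
B1, B2, D = 50_000.0, 300.0, 100_000.0

regular = combined_cost(500, 0, B1, B2, D)       # b1 + b2*N
occupational = combined_cost(500, 1, B1, B2, D)  # (b1 + d) + b2*N

# The gap between the two types of school is d at every enrolment level.
print(regular, occupational, occupational - regular)
```

Whatever value of N is supplied, the difference between the two results is d, which is what the parallel cost functions in the diagram express.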

[Figure: scatter diagram of COST (yuan, vertical axis) against N (horizontal axis), with occupational and regular schools marked separately.]

We will now fit a function of this type using actual data for a sample of 74 secondary
schools in Shanghai.

School  Type          COST (yuan)    N   OCC

 1      Occupational      345,000  623     1
 2      Occupational      537,000  653     1
 3      Regular           170,000  400     0
 4      Occupational      526,000  663     1
 5      Regular           100,000  563     0
 6      Regular            28,000  236     0
 7      Regular           160,000  307     0
 8      Occupational       45,000  173     1
 9      Occupational      120,000  146     1
10      Occupational       61,000   99     1

The table shows the data for the first 10 schools in the sample. The annual cost is
measured in yuan, one yuan being worth about 20 cents U.S. at the time. N is the number
of students in the school.

OCC is the dummy variable for the type of school.
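
The coding of OCC can be sketched in a few lines of Python. The fragment below is only an illustration: it uses the ten schools listed in the table (reading the cost of school 4 as 526,000 yuan), and the variable names are hypothetical.

```python
# The ten schools listed in the table: (type, COST in yuan, N).
schools = [
    ("Occupational", 345_000, 623),
    ("Occupational", 537_000, 653),
    ("Regular", 170_000, 400),
    ("Occupational", 526_000, 663),
    ("Regular", 100_000, 563),
    ("Regular", 28_000, 236),
    ("Regular", 160_000, 307),
    ("Occupational", 45_000, 173),
    ("Occupational", 120_000, 146),
    ("Occupational", 61_000, 99),
]

# OCC = 1 for occupational schools, 0 for regular schools.
occ = [1 if school_type == "Occupational" else 0 for school_type, _, _ in schools]
print(occ)
```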

. reg COST N OCC

  Source |       SS       df       MS              Number of obs =      74
---------+------------------------------           F(  2,    71) =   56.86
   Model |  9.0582e+11     2  4.5291e+11           Prob > F      =  0.0000
Residual |  5.6553e+11    71  7.9652e+09           R-squared     =  0.6156
---------+------------------------------           Adj R-squared =  0.6048
   Total |  1.4713e+12    73  2.0155e+10           Root MSE      =   89248

------------------------------------------------------------------------------
    COST |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       N |   331.4493   39.75844      8.337   0.000       252.1732    410.7254
     OCC |   133259.1   20827.59      6.398   0.000       91730.06    174788.1
   _cons |  -33612.55   23573.47     -1.426   0.158      -80616.71    13391.61
------------------------------------------------------------------------------

We now run the regression of COST on N and OCC, treating OCC just like any other
explanatory variable, despite its artificial nature. The Stata output is shown.
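
To illustrate that the dummy variable enters the calculation like any other regressor, here is a rough Python sketch of ordinary least squares using only the standard library. It is not the Stata routine, and it is fitted to the ten schools listed in the table rather than the full sample of 74, so its estimates should not be expected to match the output above. The helper names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(y, columns):
    """OLS coefficients for y on an intercept plus the given regressor columns."""
    x = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    k = len(x[0])
    xtx = [[sum(row[p] * row[q] for row in x) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(x, y)) for p in range(k)]
    return solve(xtx, xty)  # [intercept, then one coefficient per column, in order]


# First ten schools from the table (COST in yuan, N, OCC).
cost = [345000, 537000, 170000, 526000, 100000, 28000, 160000, 45000, 120000, 61000]
n = [623, 653, 400, 663, 563, 236, 307, 173, 146, 99]
occ = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]

# OCC is passed in exactly the same way as N.
intercept, b_n, b_occ = ols(cost, [n, occ])
print(f"_cons = {intercept:.2f}, N = {b_n:.4f}, OCC = {b_occ:.2f}")
```

The normal equations are solved directly here to keep the sketch self-contained; in practice one would use a statistics package.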


We will begin by interpreting the regression coefficients.

$\widehat{COST} = -34{,}000 + 133{,}000\,OCC + 331N$

The regression results have been rewritten in equation form. From it we can derive cost
functions for the two types of school by setting OCC equal to 0 or 1.

Regular school (OCC = 0):  $\widehat{COST} = -34{,}000 + 331N$

If OCC is equal to 0, we get the equation for regular schools, as shown. It implies that the
marginal cost per student per year is 331 yuan and that the annual overhead cost is ‒34,000
yuan.

Obviously, a negative intercept (that is, a negative overhead cost) does not make any sense,
and it suggests that the model is misspecified in some way. We will come back to this later.


The coefficient of the dummy variable is an estimate of d, the extra annual overhead cost of
an occupational school.

Occupational school (OCC = 1):  $\widehat{COST} = -34{,}000 + 133{,}000 + 331N = 99{,}000 + 331N$

Putting OCC equal to 1, we estimate the annual overhead cost of an occupational school to
be 99,000 yuan. The marginal cost is the same as for regular schools. It must be, given the
model specification.
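
The arithmetic behind the two fitted cost functions can be checked with a short Python sketch. It uses the rounded coefficients from the fitted equation; the function and constant names are hypothetical.

```python
# Rounded coefficients from the fitted equation in the text.
INTERCEPT = -34_000   # estimate of b1
COEF_OCC = 133_000    # estimate of d
COEF_N = 331          # estimate of b2 (marginal cost per student)


def fitted_cost(n, occ):
    """Fitted annual cost in yuan for a school with n students."""
    return INTERCEPT + COEF_OCC * occ + COEF_N * n


# Setting N = 0 isolates the estimated overhead cost of each type of school.
print(fitted_cost(0, occ=0))  # regular school intercept
print(fitted_cost(0, occ=1))  # occupational school intercept

# One extra student adds the same amount for both types of school.
print(fitted_cost(501, 0) - fitted_cost(500, 0))
print(fitted_cost(501, 1) - fitted_cost(500, 1))
```

Because OCC shifts only the intercept, the last two differences are necessarily equal.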

[Figure: scatter diagram of COST (yuan) against N, with the fitted cost functions for occupational schools and regular schools.]

The scatter diagram shows the data and the two cost functions derived from the regression
results.


In addition to the estimates of the coefficients, the regression results will include standard
errors and the usual diagnostic statistics.


We will perform a t test on the coefficient of the dummy variable. Our null hypothesis is H0:
d = 0 and our alternative hypothesis is H1: d ≠ 0.

In words, our null hypothesis is that there is no difference in the overhead costs of the two
types of school. The t statistic is 6.40, so the null hypothesis is rejected at the 0.1%
significance level.


We can perform t tests on the other coefficients in the usual way. The t statistic for the
coefficient of N is 8.34, so we conclude that the marginal cost is (very) significantly
different from 0.

In the case of the intercept, the t statistic is –1.43, so we do not reject the null hypothesis
H0: b1 = 0.
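
Each of these t statistics is simply the coefficient divided by its standard error. The Python sketch below reproduces them from the Stata output. The critical values are an assumption: they are the standard two-sided table values for 60 degrees of freedom, used here as a slightly conservative stand-in for the 71 degrees of freedom of the regression. The helper name is hypothetical.

```python
# Values copied from the Stata output: (coefficient, standard error).
estimates = {
    "N": (331.4493, 39.75844),
    "OCC": (133259.1, 20827.59),
    "_cons": (-33612.55, 23573.47),
}

# Assumption: two-sided critical values of t for 60 degrees of freedom, from
# standard tables. They are slightly larger than the values for 71.
CRITICAL_VALUES_60_DF = {"5%": 2.000, "1%": 2.660, "0.1%": 3.460}


def strictest_rejection(t):
    """Return the strictest listed level at which |t| exceeds the critical value."""
    rejected = [level for level, crit in CRITICAL_VALUES_60_DF.items() if abs(t) > crit]
    return rejected[-1] if rejected else None


t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
for name, t in t_stats.items():
    level = strictest_rejection(t)
    verdict = f"reject H0 at the {level} level" if level else "do not reject H0 at the 5% level"
    print(f"{name:>6}: t = {t:7.3f} -> {verdict}")
```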


Thus one explanation of the nonsensical negative overhead cost of regular schools might
be that they do not actually have any overheads and our estimate is a random number.


A more realistic version of this hypothesis is that b1 is positive but small (as you can see,
the 95 percent confidence interval includes positive values) and the error term is
responsible for the negative estimate.
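
The point about the confidence interval can be verified directly from the intercept row of the Stata output. In the Python sketch below, the critical value of t is not assumed but backed out from the reported interval; the variable names are hypothetical.

```python
# Intercept row of the Stata output.
coef, se = -33612.55, 23573.47
lower, upper = -80616.71, 13391.61

half_width = (upper - lower) / 2
midpoint = (upper + lower) / 2
implied_t = half_width / se  # critical value implied by the reported interval

print(f"midpoint = {midpoint:.2f}, half-width = {half_width:.2f}")
print(f"implied critical value of t = {implied_t:.3f}")
print("interval contains zero:", lower < 0 < upper)
print("interval contains positive values:", upper > 0)
```

The interval is centred on the estimate and extends well into positive territory, so a small positive b1 is compatible with the data.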

As already noted, a further possibility is that the model is misspecified in some way. We
will continue to develop the model in the next sequence.

Copyright Christopher Dougherty 2016

These slideshows may be downloaded by anyone, anywhere for personal use.


Subject to respect for copyright and, where appropriate, attribution, they may be
used as a resource for teaching an econometrics course. There is no need to
refer to the author.

The content of this slideshow comes from Section 5.1 of C. Dougherty,
Introduction to Econometrics, fifth edition 2016, Oxford University Press.
Additional (free) resources for both students and instructors may be
downloaded from the OUP Online Resource Centre
www.oxfordtextbooks.co.uk/orc/dougherty5e/.

Individuals studying econometrics on their own who feel that they might benefit
from participation in a formal course should consider the London School of
Economics summer school course
EC212 Introduction to Econometrics
http://www2.lse.ac.uk/study/summerSchools/summerSchool/Home.aspx
or the University of London International Programmes distance learning course
EC2020 Elements of Econometrics
www.londoninternational.ac.uk/lse.

2016.05.03
