
DUMMY VARIABLE

DUMMY VARIABLE REGRESSION MODELS

• THE NATURE OF DUMMY VARIABLES


• In regression analysis the dependent variable, or regressand, is frequently
influenced not only by ratio scale variables (e.g., income, output, prices, costs,
height, temperature) but also by variables that are essentially qualitative, or
nominal scale, in nature, such as sex, race, color, religion, nationality, geographical
region, political upheavals, and party affiliation.
• For example, holding all other factors constant, female workers are found to earn
less than their male counterparts, and nonwhite workers are found to earn less than whites.
• Dummy variables can be incorporated in regression models just as easily as quantitative
variables. As a matter of fact, a regression model may contain regressors that are
exclusively dummy, or qualitative, in nature. Such models are called Analysis of
Variance (ANOVA) models.
ANOVA Models
Public School Teachers’ salaries by Geographical Region
• The example concerns the average salary of public school
teachers in the 50 states and the District of Columbia for the year
1985.
• These 51 areas are classified into three geographical regions:
Northeast & North Central (21 states in all), South (17 states in
all) and West (13 states in all).
• The following model is considered:

Yi = β1 + β2D2i + β3D3i + ui

where
Yi = (average) salary of public school teachers in state i
D2i = 1 if the state is in the Northeast or North Central
    = 0 otherwise (i.e., in other regions of the country)
D3i = 1 if the state is in the South
    = 0 otherwise (i.e., in other regions of the country)
ANOVA Example

Mean salary of public school teachers in the Northeast and North Central:

E(Yi | D2i = 1, D3i = 0) = β1 + β2

Mean salary of public school teachers in the South:

E(Yi | D2i = 0, D3i = 1) = β1 + β3

Mean salary of public school teachers in the West:

E(Yi | D2i = 0, D3i = 0) = β1

Estimated regression:

Ŷi = 26158.62 - 1734.473 D2i - 3264.615 D3i

* indicates the p value


ANOVA Model
• The mean salary of teachers in the West is about $26159, that of
teachers in the Northeast and North Central is lower by about
$1734, and that of teachers in the South is lower by about $3265.
• The mean salaries in the latter two regions, Northeast & North Central
and South, are therefore about $24424 and $22894, respectively.
• The differential intercept for the Northeast & North Central is not
statistically significant, as its p value is about 23%, whereas that for the
South is statistically significant, as its p value is only about 3.5%.
• The overall conclusion is that statistically the mean salaries of public
school teachers in the West and the Northeast & North Central are
about the same, but the mean salary of teachers in the South is
statistically significantly lower by about $3265.
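The link between dummy coefficients and group means can be checked numerically. The sketch below is only an illustration on made-up toy numbers (not the 1985 salary data); the group labels and values are assumptions. It fits the model with an intercept and two dummies by ordinary least squares and compares the coefficients with the sample means of each group.

```python
import numpy as np

# Hypothetical toy data (NOT the teachers' salary data): an outcome y for
# three groups, with "West" as the omitted benchmark category.
region = np.array(["West", "West", "NE_NC", "NE_NC", "NE_NC", "South", "South"])
y = np.array([26.0, 27.0, 24.0, 25.0, 23.0, 22.0, 24.0])

# Two dummies for three categories: D2 (NE_NC) and D3 (South).
d2 = (region == "NE_NC").astype(float)
d3 = (region == "South").astype(float)
X = np.column_stack([np.ones_like(y), d2, d3])

# Ordinary least squares: minimise ||y - X b||^2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

mean_west = y[region == "West"].mean()
mean_ne_nc = y[region == "NE_NC"].mean()
mean_south = y[region == "South"].mean()

print("b1 (benchmark mean):", b[0], "vs", mean_west)
print("b1 + b2:", b[0] + b[1], "vs", mean_ne_nc)
print("b1 + b3:", b[0] + b[2], "vs", mean_south)
```

The intercept reproduces the benchmark mean, and each dummy coefficient is the difference between that group's mean and the benchmark mean.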
Cautions in the use of dummy variables
1) If a qualitative variable has m categories, introduce only (m – 1) dummy
variables. If you do not follow this rule, you will fall into what is called the
dummy variable trap, that is, the situation of perfect multicollinearity. For
each qualitative regressor, the number of dummy variables introduced
must be one less than the number of categories of that variable.
2) The category for which no dummy variable is assigned is known as the
base, benchmark, comparison, reference, or omitted category.
3) The coefficients attached to the dummy variables are known as the
differential intercept coefficients, because they tell by how much the
intercept of the category that receives the value of 1 differs from the
intercept coefficient of the benchmark category.
4) If a qualitative variable has more than one category, as in our illustrative
example, the choice of the benchmark category is strictly up to the
researcher.
5) There is a way to circumvent this trap: introduce as many
dummy variables as the number of categories of that variable,
provided we do not introduce the intercept in such a model.
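The dummy variable trap can be seen directly in the rank of the design matrix. The following is a minimal sketch on a hypothetical six-observation sample (the region labels are illustrative assumptions): with an intercept and all three dummies the columns are linearly dependent, because the dummies sum to the intercept column.

```python
import numpy as np

# Hypothetical sample with three categories (m = 3).
region = np.array(["West", "NE_NC", "South", "West", "NE_NC", "South"])
ones = np.ones(len(region))
d1 = (region == "West").astype(float)
d2 = (region == "NE_NC").astype(float)
d3 = (region == "South").astype(float)

# Intercept + m dummies: 4 columns, but d1 + d2 + d3 equals the intercept column.
X_trap = np.column_stack([ones, d1, d2, d3])
# Intercept + (m - 1) dummies: West is the benchmark category.
X_ok = np.column_stack([ones, d2, d3])
# m dummies and no intercept: the alternative described in point 5.
X_noint = np.column_stack([d1, d2, d3])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
rank_noint = np.linalg.matrix_rank(X_noint)

print("columns / rank with trap:", X_trap.shape[1], rank_trap)
print("columns / rank with m - 1 dummies:", X_ok.shape[1], rank_ok)
print("columns / rank with no intercept:", X_noint.shape[1], rank_noint)
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined, which is what perfect multicollinearity means in practice.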

Yi = β1D1i + β2D2i + β3D3i + ui

where D1i = 1 if the state is in the West and 0 otherwise, and D2i and D3i are defined as before. In this model:

• β1 = mean salary of teachers in the West.
• β2 = mean salary of teachers in the Northeast and North Central.
• β3 = mean salary of teachers in the South.

Ŷi = 26158.62 D1i + 24424.14 D2i + 22894 D3i
se = (1128.523) (887.9170) (986.8645)
t = (23.1795)* (27.5072)* (23.1987)*
R2 = 0.090
* indicates that the p values of these t ratios are very small.

Here the dummy coefficients give the mean values in the three regions directly: West,
Northeast and North Central, and South.
ANOVA MODELS WITH TWO QUALITATIVE VARIABLES

Example: Hourly Wages In Relation To Marital Status and Region of Residence.

where Y = hourly wages ($)
D2 = marital status; 1 = married, 0 = otherwise
D3 = region of residence; 1 = South, 0 = otherwise
and * denotes the p values
ANOVA Model with Two Qualitative Variables
• Here there are two qualitative regressors, each with two categories.
• The benchmark category is unmarried, non-South residence.
• In other words, unmarried persons who do not live in the South are
the omitted category.
• The mean hourly wage in this benchmark category is about $8.81.
• The mean hourly wage of those who are married is higher by about
$1.10.
• For those who live in the South, the mean hourly wage is lower by about $1.6.
• Important message: identify the benchmark category first and then
make all comparisons against it.
REGRESSION WITH A MIXTURE OF QUANTITATIVE AND
QUALITATIVE REGRESSORS: THE ANCOVA MODELS

• Regression models containing a mixture of quantitative
and qualitative variables are called analysis of covariance
(ANCOVA) models.
• ANCOVA models are an extension of the ANOVA models in
that they provide a method of statistically controlling the
effects of quantitative regressors, called covariates or
control variables, in a model that includes both quantitative
and qualitative, or dummy, variables.
Teacher’s salary in relation to region and spending on
public school per pupil

• This is an extension of the previous example, public school
teachers’ salaries by geographical region.
• In this example we want to see the impact of
expenditure on public schools by local
authorities.
• Here again the West is our benchmark
category.
• Besides the two qualitative regressors, a
quantitative variable X has been introduced.
Yi = β1 + β2D2i + β3D3i + β4Xi + ui

where Yi = average annual salary of public school teachers in state i ($)
Xi = spending on public schools per pupil ($)
D2i = 1, if the state is in the Northeast or North Central
    = 0, otherwise
D3i = 1, if the state is in the South
    = 0, otherwise
Example:
Teacher’s salary in relation to region and spending on public school per
pupil

Where, * indicates p values less than 5 percent and ** indicates p values greater
than 5 percent.
Teacher’s salary in relation to region and spending on
public school per pupil

• As public expenditure per pupil goes up by a dollar, a public
school teacher’s salary goes up, on average,
by about $3.29.
• The differential intercept coefficient is
significant for the Northeast and North Central
region, but not for the South.
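An ANCOVA model of this kind amounts to parallel regression lines: one common slope on X and a separate intercept per region. The sketch below illustrates this on noise-free toy data. All numbers in it (the "true" coefficients and the spending values) are hypothetical choices for illustration, not the estimates from the teachers' salary example.

```python
import numpy as np

# Hypothetical parameters b1, b2, b3, b4 chosen for illustration only.
b_true = np.array([13000.0, -1500.0, -1000.0, 3.0])

# Hypothetical spending per pupil and region labels (West is the benchmark).
spending = np.array([3000.0, 3500.0, 4000.0, 3200.0, 3800.0,
                     4100.0, 2900.0, 3600.0, 4200.0])
region = np.array(["West"] * 3 + ["NE_NC"] * 3 + ["South"] * 3)

d2 = (region == "NE_NC").astype(float)
d3 = (region == "South").astype(float)
X = np.column_stack([np.ones_like(spending), d2, d3, spending])

# Noise-free outcome, so OLS should recover b_true up to rounding error.
salary = X @ b_true
b_hat, *_ = np.linalg.lstsq(X, salary, rcond=None)

# One intercept per region, one common slope on spending.
intercepts = {
    "West": b_hat[0],
    "NE_NC": b_hat[0] + b_hat[1],
    "South": b_hat[0] + b_hat[2],
}
slope = b_hat[3]

print("intercepts by region:", intercepts)
print("common slope on spending:", slope)
```

The dummy coefficients shift the line up or down for each region, while the effect of spending is assumed to be the same in all regions.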
INTERACTION EFFECTS USING DUMMY VARIABLE

• Yi = α1 + α2D2i + α3D3i + βXi + ui
• Y = hourly wages in dollars
• X = education (years of schooling)
• D2 = 1 if female, 0 otherwise
• D3 = 1 if nonwhite and non-Hispanic, 0
otherwise.
• In this model gender and race are qualitative regressors and education
is a quantitative regressor.
• Implicit in this model is the assumption that the differential effect
of the gender dummy D2 is constant across the two categories of
race and the differential effect of the race dummy D3 is also
constant across the two sexes.
• If the mean wage is higher for males than for females, this is so
whether they are nonwhite/non-Hispanic or not. Likewise, if nonwhite/non-Hispanic
workers have lower mean wages, this is so whether they are
female or male.
• In many applications such an assumption may be untenable. A
female nonwhite/non-Hispanic may earn lower wages than a male
nonwhite/non-Hispanic.
• There may be interaction between the two qualitative variables D2
and D3.
• Then the effect of these qualitative variables may not be additive
but multiplicative.
INTERACTION EFFECTS USING DUMMY VARIABLE
• Yi = α1 + α2D2i + α3D3i + α4D2iD3i + βXi + ui
• α2 = differential effect of being female
• α3 = differential effect of being nonwhite/non-Hispanic
• α4 = differential effect of being a female
nonwhite/non-Hispanic
The mean hourly wage of female nonwhite/non-Hispanics
differs (by α4) from the mean hourly
wage of females or of nonwhite/non-Hispanics.
Interaction effects using Dummy variables

• E(Yi | D2i = 1, D3i = 1, Xi) = (α1 + α2 + α3 + α4) + βXi
• This equation shows that the mean hourly wage
of female nonwhite/non-Hispanics differs by
α4 from the mean hourly wage of females or of
nonwhite/non-Hispanics.
• If, for instance, all three differential dummy
coefficients are negative, it would imply that female
nonwhite/non-Hispanic workers earn much lower
mean hourly wages than female or nonwhite/non-Hispanic
workers as compared with the base category,
which in this example is male white or Hispanic.
Interaction effects using Dummy variables
Model without interaction
• Ŷi = -0.2610 – 2.3606 D2i – 1.7327 D3i + 0.8028 Xi
t = (-0.2357)** (-5.4873)* (-2.1803)* (9.9094)*
R2 = 0.2032 n = 528
where * indicates p values less than 5% and ** indicates p values greater
than 5%.
Model with interaction
• Ŷi = -0.2610 – 2.3606 D2i – 1.7327 D3i + 2.1289 D2iD3i + 0.8028 Xi
t = (-0.2357)** (-5.4873)* (-2.1803)* (1.7420)** (9.9094)*
R2 = 0.2032 n = 528
• The two additive dummy variables are still statistically
significant, but the interactive dummy is not significant at the
conventional 5% level.
• Holding the level of education constant, adding the three dummy
coefficients gives:
• -1.964 (= -2.3606 - 1.7327 + 2.1289),
• which means that the mean hourly wage of female nonwhite/non-Hispanics
is lower by about $1.96, which is between the value
of -2.3606 (gender difference alone) and -1.7327 (race
difference alone).
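The arithmetic for the four gender-by-race cells can be laid out as a short calculation. This is a sketch using the three dummy coefficients from the estimated model with interaction; the function name `wage_differential` is a hypothetical helper introduced here, and education is held constant so it drops out of the comparison.

```python
# Dummy coefficients from the estimated model with interaction.
A2 = -2.3606   # female
A3 = -1.7327   # nonwhite/non-Hispanic
A4 = 2.1289    # interaction: female AND nonwhite/non-Hispanic


def wage_differential(d2: int, d3: int) -> float:
    """Difference in mean hourly wage from the base category
    (male, white or Hispanic), holding education constant."""
    return A2 * d2 + A3 * d3 + A4 * d2 * d3


for d2 in (0, 1):
    for d3 in (0, 1):
        print(f"D2={d2}, D3={d3}: {wage_differential(d2, d3):+.4f}")
```

The interaction term only contributes when both dummies equal 1, which is why the combined differential is not simply the sum of the gender and race effects.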
The use of Dummy variables in seasonal analysis
• Yt = a1D1t + a2D2t + a3D3t + a4D4t + ut
where Yt = sales of refrigerators (in thousands) and
the D’s are dummy variables, taking a value of 1 in the relevant
quarter and 0 otherwise.
• If there is any seasonal effect in a given quarter,
it will be indicated by a statistically
significant t value of the dummy coefficient for
that quarter.
• Ŷt = 1222.125 D1t + 1467.500 D2t + 1569.750 D3t + 1160.000 D4t
t = (20.3720) (24.4622) (26.1666) (19.3364)
R2 = 0.5317
The use of Dummy variables in seasonal analysis
• Each estimated a coefficient represents the average sales of refrigerators in
the corresponding quarter.
• The sales of refrigerators in the first quarter, in thousands of units, are about
1222, those in the second quarter about 1468, those in the third quarter about
1570, and those in the fourth quarter about 1160.
• If we instead include an intercept and only three dummies, treating the first
quarter as the benchmark, the estimated equation is:
• Ŷt = 1222.125 + 245.3750 D2t + 347.6250 D3t - 62.1250 D4t
• t = (20.3720)* (2.8922)* (4.0974)* (-0.7322)**
R2 = 0.5318
Here mean sales in the fourth quarter are not statistically different from mean sales
in the first quarter, as the fourth-quarter dummy coefficient is not statistically significant.
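The two seasonal regressions carry the same information in different parameterisations. As a quick check, the sketch below starts from the four quarterly means of the no-intercept model and derives the intercept and differential intercepts of the model that uses the first quarter as benchmark (the variable names are illustrative).

```python
# Quarterly mean sales (thousands of units) from the no-intercept regression.
quarter_means = {1: 1222.125, 2: 1467.500, 3: 1569.750, 4: 1160.000}

# With the first quarter as benchmark, the intercept is its mean and each
# dummy coefficient is that quarter's mean minus the benchmark mean.
intercept = quarter_means[1]
differentials = {
    q: mean - intercept for q, mean in quarter_means.items() if q != 1
}

print("intercept:", intercept)
for q, diff in differentials.items():
    print(f"quarter {q} differential: {diff:+.4f}")
```

The differences reproduce the coefficients 245.3750, 347.6250 and -62.1250 reported for the second, third and fourth quarters.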
The use of Dummy variables in seasonal analysis
• Expenditure on durable goods is an important factor
influencing the demand for refrigerators. After including this
factor as a regressor X, the new output is as follows:
• Ŷt = 456.2440 + 242.4976 D2t + 325.2643 D3t – 86.0804 D4t + 2.7734 Xt
t = (2.5593)* (3.6951)* (4.9421)* (-1.3073)** (4.4496)*
R2 = 0.7298
• The differential intercept coefficients for the second and third quarters are
statistically different from that of the first quarter, but the intercepts of the
fourth quarter and the first quarter are statistically about the same.
• The coefficient of X (durable goods expenditure) of about 2.77 tells us that,
allowing for seasonal effects, if expenditure on durable goods goes up by a
dollar, sales of refrigerators go up on average by about 2.77 units.
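To see how the seasonal dummies and the quantitative regressor combine, the estimated equation can be evaluated quarter by quarter. This is a minimal sketch: the coefficients are those of the estimated equation with X, while the function name `predicted_sales` and the value of X used below are hypothetical and chosen only to show the mechanics.

```python
# Coefficients of the estimated seasonal regression that includes X.
INTERCEPT = 456.2440
QUARTER_SHIFT = {1: 0.0, 2: 242.4976, 3: 325.2643, 4: -86.0804}
SLOPE_X = 2.7734


def predicted_sales(quarter: int, durable_spending: float) -> float:
    """Fitted refrigerator sales for a given quarter and level of X."""
    return INTERCEPT + QUARTER_SHIFT[quarter] + SLOPE_X * durable_spending


x = 300.0  # hypothetical level of durable goods expenditure (same units as X)
for q in (1, 2, 3, 4):
    print(f"quarter {q}: {predicted_sales(q, x):.2f}")
```

At any fixed X, the gap between two quarters equals the difference of their dummy coefficients, and a one-unit rise in X shifts the fitted value in every quarter by the same 2.7734.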
