
Vid Adrison

Outline
Background
Rules in Generating Dummy Variable
Another Simulation
Dummy Variables for More than Two Categories
Interaction Term
Policy Analysis: Difference in Difference Approach
Background
Qualitative information
Some information is qualitative in nature rather than quantitative.
Gender: Male Vs Female
Location: Seaside Vs Other
War: Peace time Vs War time
Degree of education: High School vs Other
If there is a statistical difference among categories, why
don't we run a separate regression on each sub-sample?
Example: Regress Wage for male workers only, and then do the
same regression for female workers
We can take such an approach, but sometimes the number of
observations in each sub-sample is insufficient
Rules in Generating Dummy Variable
# of Dummy variables = # of Categories - 1
Example: Gender consists only two categories: (1) Male and (2)
Female
Suppose we want to investigate if wage discrimination between
gender exists.
Wage = B0 + B1 Experience + B2 Education + B3 Gender + error
Where Gender = 1 if male, and Gender = 0 if female
In this case, female is used as the base
If B3 is statistically different from zero, then we can say that wage
discrimination exists
What is the effect if we use a different base?
It only changes the sign of the B3 parameter (the intercept B0 shifts
accordingly); the interpretation is unchanged
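The effect of switching the base group can be checked with a minimal sketch. It assumes a simplified regression of wage on a constant and the gender dummy only (no experience or education), in which case the OLS intercept equals the base-group mean and the dummy coefficient equals the gap between group means. The wage figures and the function name are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical toy wages, invented purely for illustration.
male_wages = [12.0, 14.0, 16.0]
female_wages = [10.0, 11.0, 12.0]


def dummy_only_ols(base, other):
    """OLS of wage on a constant and one dummy (dummy = 1 for `other`).

    With a single dummy regressor, the intercept is the base-group mean
    and the dummy coefficient is the difference between group means.
    """
    b0 = mean(base)
    b3 = mean(other) - mean(base)
    return b0, b3


# Female as base (Gender = 1 if male).
b0_female_base, b3_female_base = dummy_only_ols(female_wages, male_wages)
# Male as base (Gender = 1 if female).
b0_male_base, b3_male_base = dummy_only_ols(male_wages, female_wages)

print(b0_female_base, b3_female_base)
print(b0_male_base, b3_male_base)
```

The dummy coefficient flips sign while keeping its magnitude, and the new intercept equals the old intercept plus the old dummy coefficient.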
Another Simulation
Generate random variable representing independent variable:
Experience = round(2+rand()*18,0)
Education = round(12+rand()*12,0)
Gender = round(rand(),0)
Error=-0.5 + rand()
Generate dependent variable based on independent variable and the
desired parameter values
Log(wage) = 1.5 + 0.5*log(experience) + 0.3*log(education) + 0.2*gender +
error
Export the simulated dataset into statistical software, and run a
regression of wage on experience, education and gender, in double-log
specification
Why don't we use log(gender) in the regression?
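The spreadsheet formulas above can be mirrored in Python as a rough sketch. It assumes `random.random()` stands in for `rand()`, uses an arbitrary seed, and estimates the double-log regression with a small hand-written normal-equations solver (`ols` is a helper written for this sketch, not a library function). The estimates will differ from the slide output because the random draws differ.

```python
import math
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


random.seed(0)  # arbitrary seed, assumed only for reproducibility
n = 200

X, y = [], []
for _ in range(n):
    # Same formulas as the slide, with random.random() in place of rand().
    experience = round(2 + random.random() * 18)
    education = round(12 + random.random() * 12)
    gender = round(random.random())
    error = -0.5 + random.random()
    log_wage = (1.5 + 0.5 * math.log(experience)
                + 0.3 * math.log(education) + 0.2 * gender + error)
    X.append([1.0, math.log(experience), math.log(education), float(gender)])
    y.append(log_wage)

b = ols(X, y)
for name, value in zip(["C", "LOG(EXPERIENCE)", "LOG(EDUCATION)", "GENDER"], b):
    print(f"{name:16s} {value: .4f}")
```

With 200 observations the estimates should land near the chosen parameters 1.5, 0.5, 0.3 and 0.2, but not exactly on them.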
Simulation-based Regression
               EXPERIENCE   EDUCATION     GENDER       ERROR
Mean             11.26500    17.66000   0.515000   -0.007427
Median           11.00000    17.00000   1.000000    0.011996
Maximum          20.00000    24.00000   1.000000    0.485267
Minimum          2.000000    12.00000   0.000000   -0.494393
Std. Dev.        5.096533    3.496502   0.501029    0.282415
Skewness         0.004184    0.169399  -0.060027   -0.003171
Kurtosis         1.914534    1.791708   1.003603    1.801743

Jarque-Bera      9.819220    13.12295   33.33344    11.96551
Probability      0.007375    0.001414   0.000000    0.002522

Observations          200         200        200         200


Dependent Variable: LOG(WAGE)
Method: Least Squares
Date: 03/01/10 Time: 16:03
Sample: 1 200
Included observations: 200

Variable            Coefficient   Std. Error   t-Statistic    Prob.
C                      1.655749     0.297988      5.556429   0.0000
LOG(EXPERIENCE)        0.499068     0.035470      14.07008   0.0000
LOG(EDUCATION)         0.236338     0.100591      2.349504   0.0198
GENDER                 0.239788     0.040182      5.967484   0.0000

R-squared              0.543603   Mean dependent var         3.594928
Adjusted R-squared     0.536617   S.D. dependent var         0.416514
S.E. of regression     0.283530   Akaike info criterion      0.336802
Sum squared resid      15.75634   Schwarz criterion          0.402769
Log likelihood        -29.68024   F-statistic                77.81693
Durbin-Watson stat     1.858693   Prob(F-statistic)          0.000000


Simulation-based Regression
The previous regression results indicate that wage is
determined by three factors: experience, education
and gender
Since we define male = 1 and female = 0, the estimated gender
coefficient of 0.24 indicates that male workers receive roughly 24%
higher wages than their female counterparts (the base group) given
the same experience and education; the value built into the
simulation was 0.2, i.e. about 20%
In this case, we find evidence of gender-based wage discrimination
If we use a different base, i.e. female = 1 and male = 0, then the
parameter for gender changes sign to negative. Thus, the
interpretation remains the same
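Reading a dummy coefficient in a log(wage) equation as a percentage is an approximation that works best for small coefficients. As a side note, the exact proportional difference implied by a coefficient b is exp(b) - 1. The helper name `pct_effect` is hypothetical; the two inputs are the simulated parameter (0.2) and the estimate from the regression output (0.239788).

```python
import math


def pct_effect(b):
    """Exact percentage difference implied by a dummy coefficient b
    when the dependent variable is log(wage): 100 * (exp(b) - 1)."""
    return 100.0 * (math.exp(b) - 1.0)


true_effect = pct_effect(0.2)            # parameter used in the simulation
estimated_effect = pct_effect(0.239788)  # GENDER coefficient from the output

print(f"true parameter 0.2      -> {true_effect:.2f}% higher wage")
print(f"estimate 0.239788       -> {estimated_effect:.2f}% higher wage")
```

The gap between the coefficient and the exact figure widens as the coefficient grows, so the shortcut is safest for coefficients close to zero.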
What is the interpretation if we use wage instead of
log(wage) in the dependent variable?
Dummy Variables for More than Two Categories
Education   High school   Bachelor   Master   PhD
12          ?             0          0        0
16          ?             1          0        0
18          ?             0          1        0
24          ?             0          0        1
Suppose that the effect of education is measured by the
degree completed rather than by years of education
High school Vs Undergraduate Vs Master Vs Doctoral
How many dummy variables do we have to generate?
# of Dummy variables = # of Categories - 1 = 4 - 1 = 3 dummy variables
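The three dummies can be generated as in the following minimal sketch. It assumes the years-to-degree mapping shown in the table (12, 16, 18, 24) and assumes high school is the base group, so it gets no dummy of its own. The names `DEGREE_BY_YEARS`, `DUMMIES` and `degree_dummies` are hypothetical.

```python
# Years of education mapped to the degree completed, as in the table.
DEGREE_BY_YEARS = {12: "High school", 16: "Bachelor", 18: "Master", 24: "PhD"}

# Four categories -> three dummies. High school is assumed to be the base
# group, so it is represented by all three dummies being zero.
DUMMIES = ["Bachelor", "Master", "PhD"]


def degree_dummies(years):
    """Return the [Bachelor, Master, PhD] dummy values for a worker."""
    degree = DEGREE_BY_YEARS[years]
    return [1 if degree == name else 0 for name in DUMMIES]


for years in (12, 16, 18, 24):
    print(years, degree_dummies(years))
```

Adding a fourth dummy for high school alongside the constant would make the regressors perfectly collinear, which is why one category is left out.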
Interaction Term
In the previous example, the discrimination is based on the starting wage
(intercept).
Any additional experience or education has the same effect on wage
for both groups
What if additional experience for male workers results in a different
additional wage than for female workers?
In this case, the slope (for experience) will be different between male and
female
We have to create an interaction term of Gender and Experience, and
add it to the regression
Log(Wage) = B0 + B1 log(Experience) + B2 log(Education) +
B3 Gender + B4 Gender x log(Experience) + error
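The interaction model above can be tried on simulated data, as a sketch under stated assumptions. The true value B4 = 0.1 is an assumed number chosen for this illustration, the sample size of 2000 is larger than in the slides to tighten the estimates, and `ols` is a small helper written for this sketch rather than a library function.

```python
import math
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


random.seed(1)  # arbitrary seed for reproducibility

X, y = [], []
for _ in range(2000):
    log_exp = math.log(random.randint(2, 20))
    log_edu = math.log(random.randint(12, 24))
    gender = random.randint(0, 1)            # 1 = male, 0 = female
    error = random.uniform(-0.5, 0.5)
    # Regressors: constant, log(Exp), log(Edu), Gender, Gender x log(Exp).
    X.append([1.0, log_exp, log_edu, float(gender), gender * log_exp])
    # Assumed true parameters; B4 = 0.1 is hypothetical.
    y.append(1.5 + 0.5 * log_exp + 0.3 * log_edu
             + 0.2 * gender + 0.1 * gender * log_exp + error)

b0, b1, b2, b3, b4 = ols(X, y)

female_slope = b1        # experience elasticity when Gender = 0
male_slope = b1 + b4     # experience elasticity when Gender = 1
print(f"female slope: {female_slope:.3f}, male slope: {male_slope:.3f}")
```

B3 still shifts the intercept for male workers, while B4 is the extra experience elasticity they receive on top of B1.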
Question (Homework??):
Suppose someone is creating an interaction term of two continuous
variables, for instance Z = Experience x Education, and then add Z into the
above regression function. How can we interpret the parameter for Z ?
Policy Analysis Using Dummy Variable
Suppose a manager wants to investigate the effectiveness of
training on worker productivity.
He draws a sample of 100 employees, 50 of whom participated in
training, while the rest didn't
The regression:
Productivity = B0 + B1 Experience + B2 Education + B3 Training + error
Where Training = 1 if the employee participated in a training, 0
otherwise
Is this approach appropriate? Not quite
We only observe those with training, and those without, at a SINGLE
point in time. However, we don't see the changes in productivity
BEFORE and AFTER the training
Policy Analysis Using Dummy Variable
How do we do the analysis?
We must have data from before and after the training. Thus, we need at
least T=0 (pre) and T=1 (post)
Create an interaction term between the participation indicator and the
time dummy
Regression Model:
Productivity = B0 + B1 Z+ B2 Training + B3 Time + B4 Training*Time + error
Treatment group: those with training (Training=1)
Control group: those without training (Training=0)
             Treatment                               Control                                 Difference
Pre          B0 + B1 Z + B2 (1) + B3 (0) + B4 (1)(0)  B0 + B1 Z + B2 (0) + B3 (0) + B4 (0)(0)  B2
Post         B0 + B1 Z + B2 (1) + B3 (1) + B4 (1)(1)  B0 + B1 Z + B2 (0) + B3 (1) + B4 (0)(1)  B2 + B4
Difference   B3 + B4                                 B3                                      B4
Policy Analysis Using Dummy Variable
Using the previous matrix, we can see that:
B2 reflects the difference between the treatment and control groups in
the initial period
B3 reflects the change in the control group between the initial and
end periods (time effect)
B4 reflects how much more the treatment group changed than the control
group between the initial and end periods, i.e. the effect of the training
This approach is known as Difference-in-Difference
(DID) analysis
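The DID logic can be illustrated with group means. This sketch assumes the model without the covariate Z, in which case the four coefficients are exact functions of the four cell means. The productivity figures are hypothetical, invented only to show the arithmetic.

```python
from statistics import mean

# Hypothetical productivity observations by (group, period).
productivity = {
    ("treatment", "pre"): [10.0, 12.0],
    ("treatment", "post"): [15.0, 17.0],
    ("control", "pre"): [9.0, 11.0],
    ("control", "post"): [11.0, 13.0],
}
m = {cell: mean(values) for cell, values in productivity.items()}

# Coefficients of Productivity = B0 + B2 Training + B3 Time + B4 Training*Time.
b0 = m[("control", "pre")]                             # control group, pre
b2 = m[("treatment", "pre")] - m[("control", "pre")]   # initial group gap
b3 = m[("control", "post")] - m[("control", "pre")]    # time effect
b4 = ((m[("treatment", "post")] - m[("treatment", "pre")])
      - (m[("control", "post")] - m[("control", "pre")]))  # DID estimate

print(f"B0={b0}, B2={b2}, B3={b3}, B4={b4}")
```

In this toy data the treatment group improves by 5 and the control group by 2, so the training effect B4 is 3. A simple post-period comparison would give B2 + B4 = 4, mixing the initial gap into the estimate.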
If someone wants to investigate whether the effectiveness of
training differs between male and female workers, how should
he do it?
Difference-in-Difference-in-Difference (DIDID)
As an exercise, write the representative equation for such a regression
Homework (One week)
Generate 200 random observations of experience, years of
education, gender and training
Experience: ranges from 0 to 20 years (round number)
Years of Education: ranges from 6 to 24 (round number)
Gender = 1 if male, 0 if female
Training = 1 if participated, 0 otherwise
Create wage data such that the regression results indicate:
the elasticity of wage with respect to
Experience is 0.6
Years of education is 0.3
Male workers receive 15% higher wages compared to their female
counterparts with the same characteristics
Those who participated in training receive 5% higher wages than
those who didn't participate
Homework (One week)
NOTE:
Syntax for generating the random data (independent and dependent
variables) must be provided (10 points)
Descriptive statistics of the generated data must be provided (5 points)
Regression results must be provided (5 points)
Statistical analysis of the regression results must be performed. This
includes:
Evaluation of the explanatory power of the regression (5 points)
Interpretation of each parameter and its respective significance
(15 points)
Perform Wald tests:
Test if the elasticity of wage with respect to experience = 0.6 (5
points)
Test if the elasticity of wage with respect to years of education
= 0.3 (5 points)
