
DUMMY VARIABLES: INDEPENDENT & DEPENDENT DUMMY VARIABLES

Dummy variables are independent variables which take the value of either 0 or 1. Just as a "dummy" is a stand-in for a real person, in quantitative analysis a dummy variable is a numeric stand-in for a qualitative fact or a logical proposition. For example, a model to estimate demand for electricity in a geographical area might include the average temperature, the average number of daylight hours, the total structure square footage, the number of businesses, the number of residences, and so forth. It might be more useful, however, if the model could produce appropriate results for each month or each season. Using the number of the month, such as 12 for December, would be silly, because it implies that the demand for electricity differs greatly between December and January, which is month 1. It also implies that winter occurs during the same months everywhere, which would preclude the use of the model in the opposite polar hemisphere. Thus, another way to represent qualitative concepts such as season, male or female, or smoker or non-smoker is required for many models to make sense.

In a regression model, a dummy variable with a value of 0 causes its coefficient to disappear from the equation. Conversely, a value of 1 causes the coefficient to function as a supplemental intercept, because of the identity property of multiplication by 1. This type of specification in a linear regression model is useful to define subsets of observations that have different intercepts and/or slopes without creating separate models. In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients. Examples of these results are in Section 3.

In addition to the direct benefits to statistical analysis, representing information in the form of dummy variables makes it easier to turn the model into a decision tool. Consider a risk manager who needs to assign credit limits to businesses. The age of the business is almost always significant in assessing risk. If the risk manager has to assign a different credit limit for each year in business, the tool becomes extremely complicated and difficult to use, because some businesses are several hundred years old. Bivariate analysis of the relationship between age of business and default usually yields a small number of groups that are far more statistically significant than each year evaluated separately.

In regression analysis, a dummy variable (also known as an indicator variable, or just a dummy) is one that takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy can thus be thought of as a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming). The addition of dummy variables always increases model fit (the coefficient of determination), but at the cost of fewer degrees of freedom and a loss of generality of the model; too many dummy variables result in a model that does not support any general conclusions.

Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: $D_1 = 1$ if the observation is for summer, and 0 otherwise; $D_2 = 1$ if and only if autumn, otherwise 0; $D_3 = 1$ if and only if winter, otherwise 0; and $D_4 = 1$ if and only if spring, otherwise 0. In the panel data fixed effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each of the periods in a pooled time series. However, in such regressions either the constant term has to be removed or one of the dummies has to be removed, making the omitted category the base category against which the others are assessed, for the following reason: if dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to, and hence perfectly correlated with, the vector-of-ones variable whose coefficient is the constant term. If the vector-of-ones variable were also present, this would result in perfect multicollinearity, so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.

Describing qualitative data
Far from all of the data of interest to econometricians is quantitative. For instance, the gender of individuals, whether they are married, the industry of firms, and countries or regions are all considered qualitative. To include them in a regression, we use dummy variables. In many cases, the information can be described as being true or false, or as a characteristic being present or absent. In those cases, it is easy to set up a binary variable, or dummy variable, taking the values 0 and 1. For instance, male is usually set to 1 when the individual is male and 0 when female, while if instead we define female we would likely do the opposite. Both are clearer than a generic gender variable. The choice does not matter to the results, but it does matter to their interpretation!

Describing categories or ranges
Dummy variables are also useful to describe categories. Indeed, even if a variable is not binary, if it takes a finite number of values then it can be described by a complete set of dummy variables. For instance, if eye colour can be brown, blue, green, or red, we can have one dummy variable for each of these colours, taking the value 1 whenever an individual has eyes of that colour. (The case would be more complex for Bowie, whose eyes appeared to be two different colours.) Notice that summing all the variables in a complete set should give you 1 for every observation. This technique can also be useful for quantitative data which you do not believe should be treated as one continuous variable. Dummy variables are 'discrete' and 'qualitative' (e.g., male or female, in the labour force or not, working under a collective or individual employment contract, renting or owning your home), and their units of measurement are meaningless. Normally 1 is assigned to the presence of some characteristic or attribute, and 0 to its absence. A dummy variable for each of several ranges allows you to distinguish the effects of what you might see as thresholds. Example 1: in the Mincer equation, we often use dummy variables for high school dropouts, high school graduates, etc. Example 2: a regression model of labour market discrimination by gender:
$$Y_i = \beta_0 + \beta_1 S_i + \beta_2 G_i + \varepsilon_i$$

where
$Y_i$ = annual earnings
$S_i$ = years of education
$G_i$ = 1 if the $i$th person is male, 0 if the $i$th person is female.

There are no special estimation issues as long as the regression meets all the classical assumptions; only the nature of the independent variables has changed. The expected salary of a female is:
$$E(Y_i \mid S_i, G_i = 0) = \beta_0 + \beta_1 S_i$$

The expected salary of a male is:


$$E(Y_i \mid S_i, G_i = 1) = \beta_0 + \beta_1 S_i + \beta_2 = (\beta_0 + \beta_2) + \beta_1 S_i$$

since $E(\varepsilon_i \mid S_i, G_i) = 0$. Testing for discrimination (i.e., $H_0: \beta_2 = 0$) is a test for a difference in the intercept terms.

[Figure: intercept shift. Two parallel lines with common slope $\beta_1$: men, $wage = (\beta_0 + \beta_2) + \beta_1 S_i$; women, $wage = \beta_0 + \beta_1 S_i$; the vertical gap between the lines is $\beta_2$.]
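To make the intercept shift concrete, here is a small sketch in R (the language used later in these notes). The sample size and coefficient values are invented for illustration; they are not estimates from any real data set.

    # Simulate earnings data with a pure intercept shift for males
    set.seed(1)
    n <- 200
    S <- sample(8:20, n, replace = TRUE)    # years of education
    G <- rbinom(n, 1, 0.5)                  # 1 = male, 0 = female
    Y <- 5000 + 800 * S + 3000 * G + rnorm(n, sd = 2000)   # "true" beta2 = 3000
    fit <- lm(Y ~ S + G)
    summary(fit)    # the coefficient on G estimates the intercept shift beta2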

Dummy Variable Trap: Suppose we estimate the following:


$$Y_i = \beta_1 + \beta_2 S_i + \beta_3 F_i + \beta_4 M_i + \varepsilon_i$$

where
$F_i$ = 1 if the $i$th person is female, 0 if the $i$th person is male
$M_i$ = 1 if the $i$th person is male, 0 if the $i$th person is female.

This is known as the 'Dummy Variable Trap': we are including redundant information in the regression. Suppose the sample looks like this:

Constant   Fi   Mi
1          1    0
1          0    1
1          1    0
1          0    1
1          1    0
1          1    0
1          0    1

The problem is that the two dummies are a linear function of the constant (i.e., $F_i + M_i = 1$): perfect multicollinearity. This violates Assumption (6), and the estimated coefficients and their standard errors cannot be computed. The solution is simple: drop one of the dummy variables or drop the constant term. Rule of thumb: if you have $m$ categories, use $m - 1$ dummies.
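The trap is easy to demonstrate in R with a toy data set matching the table above (the outcome values below are arbitrary):

    # Dummy variable trap: F_dum + M_dum = 1 for every observation
    F_dum <- c(1, 0, 1, 0, 1, 1, 0)
    M_dum <- c(0, 1, 0, 1, 0, 0, 1)
    y <- c(10, 12, 9, 13, 11, 10, 14)    # arbitrary outcome values
    lm(y ~ F_dum + M_dum)    # one coefficient is reported as NA (perfect multicollinearity)
    lm(y ~ F_dum)            # the fix: drop one dummy; males become the baseline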
Slope dummy variables: We could allow for differences in these returns by adding an 'interacted' variable:
$$Y_i = \beta_0 + \beta_1 S_i + \beta_2 G_i + \beta_3 G_i S_i + \varepsilon_i$$

This is a more 'flexible' specification. The expected salary of a female is:


$$E(Y_i \mid S_i, G_i = 0) = \beta_0 + \beta_1 S_i$$

The expected salary of a male is:


$$E(Y_i \mid S_i, G_i = 1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) S_i$$

We now have both a 'composite' intercept term and a 'composite' slope coefficient for males.

If $\beta_2 > 0$, the male regression line has a higher intercept; if $\beta_3 > 0$, it also has a steeper slope.
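A self-contained R sketch of the interacted specification, again with invented coefficient values:

    # Slope dummy: let the return to schooling differ by gender
    set.seed(2)
    n <- 200
    S <- sample(8:20, n, replace = TRUE)
    G <- rbinom(n, 1, 0.5)
    Y <- 5000 + 800 * S + 3000 * G + 150 * G * S + rnorm(n, sd = 2000)
    fit <- lm(Y ~ S * G)    # expands to S + G + S:G
    coef(fit)    # "G" estimates beta2 (intercept shift), "S:G" estimates beta3 (slope shift)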

Using a set of dummy variables
What happens if we use a complete set of dummy variables? The four dummies sum to one, hence we have perfect collinearity, and the regression will not be able to identify the coefficients properly. It is as if we had a single variable always equal to one (like the intercept). One possible way out is to drop the intercept; each dummy coefficient is then interpreted as the intercept for that specific group. Another (more common) possibility is to drop one variable in the set. This becomes the baseline, and the other dummy coefficients read directly as differences from this baseline. (Example from Alesina, Algan, Cahuc and Giuliano (2009).)

Dummy variables in R
By default, R will automatically remove the last dummy variable if you provide a complete set. However, you are well advised to do it yourself, as this will help with interpretation, and also because other software may not be as kind. There are many ways to create dummy variables from qualitative data.
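As a minimal sketch (the data here are invented), factor() and model.matrix() show how R expands a qualitative variable into dummies:

    # Creating dummies from a qualitative variable
    season <- factor(c("winter", "spring", "summer", "autumn", "winter", "summer"))
    model.matrix(~ season)        # treatment coding: one level omitted, intercept kept
    model.matrix(~ season - 1)    # no intercept: one column per season (complete set)
    # In a regression, R handles the factor directly and omits one level as the baseline:
    # lm(y ~ season)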

Fixed effects
Dummy variables are also frequently used as fixed effects. Typically, we might add time fixed effects to our regression to capture structural changes underlying our regression. For instance, this could be a dummy variable for each year or each period (minus one). In many cases, it is also useful to define a set of individual fixed effects to capture all unobserved individual characteristics. This might lead to a potentially large number of dummy variables, which is usually not a problem with modern computers. However, you must have several observations for each individual, or you will not have degrees of freedom!

Dummy Dependent Variable Models
In this chapter we introduce models that are designed to deal with situations in which our dependent variable is a dummy variable; that is, it takes either the value 0 or the value 1. Such models are very useful in that they allow us to address questions for which there is a yes or no answer.

1. Linear Probability Model
In the case of a dummy dependent variable model we have:
$$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$$

where $y_i = 0$ or $1$ and $E(\varepsilon_i) = 0$. What would happen if we simply estimated the slope coefficients of this model using OLS? What would the coefficients mean? Would they be unbiased? Are they efficient? A regression model in which the dependent variable takes on the two values 0 or 1 is called a linear probability model. To see its properties, note the following.

a) Since the mean error is zero, we know that $E(y_i) = \beta_1 + \beta_2 x_i$.

b) Now, if we define $p_i = \operatorname{prob}(y_i = 1)$ and $1 - p_i = \operatorname{prob}(y_i = 0)$, then $E(y_i) = 1 \cdot p_i + 0 \cdot (1 - p_i) = p_i$. Therefore, our model is $p_i = \beta_1 + \beta_2 x_i$, and the estimated slope coefficients tell us the impact of a unit change in the explanatory variable on the probability that $y_i = 1$.
c) The predicted values from the estimated model, $\hat{p}_i = b_1 + b_2 x_i$, provide predictions, based on chosen values for the explanatory variables, of the probability that $y_i = 1$. There is, however, nothing in the estimation strategy that constrains the resulting predictions from being negative or larger than 1, which is clearly an unfortunate characteristic of the approach.
d) Since $E(\varepsilon_i) = 0$ and the errors are uncorrelated with the explanatory variables (by assumption), it is easy to show that the OLS estimators are unbiased. The errors, however, are heteroscedastic. A simple way to see this is to consider an example. Suppose that the dependent variable takes the value 1 if the individual buys a Rolex watch and 0 otherwise, and suppose the explanatory variable is income. For low levels of income, it is likely that all of the observations are zeros; in this case, there is no scatter around the line. For higher levels of income there are some zeros and some ones, that is, some scatter around the line. Thus, the errors are heteroscedastic. This suggests two empirical strategies. First, since the OLS estimators are unbiased but yield incorrect standard errors, we might simply use OLS and then use the White correction to produce correct standard errors. Second, we might use feasible weighted least squares, weighting each observation by an estimate of the error standard deviation $\sqrt{\hat{p}_i(1 - \hat{p}_i)}$.
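A sketch of the first strategy in R, using simulated data and the widely used sandwich and lmtest packages (assumed to be installed):

    # Linear probability model with White (heteroscedasticity-robust) standard errors
    library(sandwich)
    library(lmtest)
    set.seed(3)
    income <- runif(500, 10, 200)                        # invented income data
    p_true <- pmin(pmax(-0.1 + 0.005 * income, 0), 1)    # "true" purchase probability
    buy <- rbinom(500, 1, p_true)
    lpm <- lm(buy ~ income)
    coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))      # White-corrected standard errors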

2. Logit and Probit Models
One potential criticism of the linear probability model (beyond those mentioned above) is that it assumes the probability that $y_i = 1$ is linearly related to the explanatory variable(s). We might, however, expect the relation to be nonlinear. For example, increasing the income of the very poor or the very rich will probably have little effect on whether they buy an automobile, while it could have a nonzero effect on other income groups. Two models that are nonlinear, yet provide predicted probabilities between 0 and 1, are the logit and probit models. The difference between the linear probability model and the nonlinear logit and probit models can be explained using an example. To motivate these models, suppose that our dummy dependent variable depends on an unobserved (latent) utility index $y^*$. For example, if the variable $y$ is discrete, taking the value 1 if someone buys a car and 0 otherwise, then we can imagine a continuous variable $y^*$ that reflects a person's desire to buy the car. It seems reasonable that $y^*$ would vary continuously with some explanatory variable like income. More formally, suppose
$$y_i^* = \beta_1 + \beta_2 x_i + \varepsilon_i$$

and
$y_i = 1$ if $y_i^* \ge 0$ (i.e., the utility index is high enough)
$y_i = 0$ if $y_i^* < 0$ (i.e., the utility index is not high enough)

Then:
$$p_i = \operatorname{prob}(y_i = 1) = \operatorname{prob}(y_i^* \ge 0) = \operatorname{prob}(\beta_1 + \beta_2 x_i + \varepsilon_i \ge 0)$$
$$= \operatorname{prob}(\varepsilon_i \ge -\beta_1 - \beta_2 x_i) = 1 - F(-\beta_1 - \beta_2 x_i) = F(\beta_1 + \beta_2 x_i) \text{ if } F \text{ is symmetric,}$$
where $F$ is the c.d.f. of $\varepsilon_i$.

Given this, our basic problem is selecting $F$, the cumulative distribution function for the error term. It is here that the logit and probit models differ. As a practical matter, we are likely interested in estimating the $\beta$s in the model. This is typically done using a Maximum Likelihood Estimator (MLE). To outline the MLE in this context, recognize that each outcome $y_i$ has the density function $f(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$. That is, each $y_i$ takes on either the value 0 or 1, with probabilities $f(0) = 1 - p_i$ and $f(1) = p_i$. Then the likelihood function is:
$$L = f(y_1, y_2, \ldots, y_n) = f(y_1) f(y_2) \cdots f(y_n)$$
$$= [p_1^{y_1}(1 - p_1)^{1 - y_1}][p_2^{y_2}(1 - p_2)^{1 - y_2}] \cdots [p_n^{y_n}(1 - p_n)^{1 - y_n}]$$
$$= \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$

and
$$\ln L = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$$
which, given $p_i = F(\beta_1 + \beta_2 x_i)$, becomes
$$\ln L = \sum_{i=1}^{n} \left[ y_i \ln F(\beta_1 + \beta_2 x_i) + (1 - y_i) \ln\left(1 - F(\beta_1 + \beta_2 x_i)\right) \right]$$

Analytically, the next step would be to take the partial derivatives of the log-likelihood function with respect to the $\beta$s, set them equal to zero, and solve for the MLEs. This could be a very messy calculation depending on the functional form of $F$. In practice, the computer solves this problem for us.
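To see what the computer is doing, here is a hand-rolled version of this maximization for the logit case, using R's general-purpose optimizer on simulated data. This is a sketch for intuition, not production code:

    # Logit MLE by direct maximization of ln L
    set.seed(4)
    x <- rnorm(300)
    y <- rbinom(300, 1, plogis(-0.5 + 1.2 * x))    # "true" beta1 = -0.5, beta2 = 1.2
    negloglik <- function(b) {
      p <- plogis(b[1] + b[2] * x)                 # p_i = F(beta1 + beta2 * x_i)
      -sum(y * log(p) + (1 - y) * log(1 - p))      # minus ln L, since optim() minimizes
    }
    optim(c(0, 0), negloglik)$par    # close to coef(glm(y ~ x, family = binomial))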
2.1. Logit Model
For the logit model we specify
$$p(y_i = 1) = F(\beta_1 + \beta_2 x_i) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 x_i)}}$$

It can be seen that $p(y_i = 1) \to 0$ as $\beta_1 + \beta_2 x_i \to -\infty$ and, similarly, $p(y_i = 1) \to 1$ as $\beta_1 + \beta_2 x_i \to +\infty$. Thus, unlike the linear probability model, probabilities from the logit will lie between 0 and 1. A complication arises in interpreting the estimated $\beta$s. In the case of a linear probability model, a coefficient $b$ measures the ceteris paribus effect of a change in the explanatory variable on the probability that $y$ equals 1. In the logit model we can see that
$$\frac{\partial \operatorname{prob}(y_i = 1)}{\partial x_i} = \frac{\partial F(b_1 + b_2 x_i)}{\partial x_i} = b_2 \frac{e^{-(b_1 + b_2 x_i)}}{\left[1 + e^{-(b_1 + b_2 x_i)}\right]^2}$$

Notice that the derivative is nonlinear and depends on the value of $x$. It is common to evaluate the derivative at the mean of $x$ so that a single derivative can be presented.
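For a fitted logit in R, this derivative at the mean of x is one line with the logistic density dlogis(); the data here are simulated as in the earlier sketch:

    # Logit marginal effect evaluated at the mean of x
    set.seed(4)
    x <- rnorm(300)
    y <- rbinom(300, 1, plogis(-0.5 + 1.2 * x))
    b <- coef(glm(y ~ x, family = binomial))
    unname(b[2] * dlogis(b[1] + b[2] * mean(x)))    # b2 * f(b1 + b2 * xbar)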
Odds Ratio
$$p(y_i = 1) = F(\beta_1 + \beta_2 x_i) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 x_i)}}$$

For ease of exposition, we write the above equation as
$$p_i = \frac{1}{1 + e^{-z_i}} = \frac{e^{z_i}}{1 + e^{z_i}}, \qquad \text{where } z_i = \beta_1 + \beta_2 x_i.$$
To avoid the possibility that the predicted values might fall outside the probability interval from 0 to 1, we model the ratio $p_i / (1 - p_i)$. This ratio is the likelihood, or odds, of obtaining a successful outcome (the ratio of the probability that a family will own a car to the probability that it will not own a car).
$$\frac{p_i}{1 - p_i} = \frac{1 + e^{z_i}}{1 + e^{-z_i}} = e^{z_i}$$

If we take the natural log of the above equation, we obtain
$$L_i = \ln\left(\frac{p_i}{1 - p_i}\right) = z_i = \beta_1 + \beta_2 x_i$$
that is, $L$, the log of the odds ratio, is not only linear in $x$ but also linear in the parameters. $L$ is called the logit, and hence the name logit model. The logit model cannot be estimated using OLS. Instead, we use the MLE discussed in the previous section, an iterative estimation technique that is especially useful for equations that are nonlinear in the coefficients. MLE is inherently different from least squares in that it chooses the coefficient estimates that maximize the likelihood of the sample data set being observed. Interestingly, OLS and MLE are not necessarily different: for a linear equation that meets the classical assumptions (including the normality assumption), the MLEs are identical to the OLS estimates. Once the logit has been estimated, hypothesis testing and econometric analysis can be undertaken in much the same way as for linear equations. When interpreting the coefficients, however, be careful to recall that they represent the impact of a one-unit increase in the independent variable in question, holding the other explanatory variables constant, on the log of the odds of a given choice, not on the probability itself. We can, however, always compute the probability at a given level of the variable in question.
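In practice, this means odds ratios and predicted probabilities are each one line in R after estimation; a sketch with glm(), again on invented data:

    # From logit coefficients to odds ratios and probabilities
    set.seed(5)
    x <- rnorm(300)
    y <- rbinom(300, 1, plogis(-0.5 + 1.2 * x))
    fit <- glm(y ~ x, family = binomial)
    exp(coef(fit))    # odds ratios: multiplicative effect of a one-unit change on p/(1-p)
    predict(fit, newdata = data.frame(x = 1), type = "response")    # p(y = 1) at x = 1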
2.2. Probit Model
In the case of the probit model, we assume that $\varepsilon_i \sim N(0, \sigma^2)$; that is, we assume the error in the utility index model is normally distributed. In this case,
$$p(y_i = 1) = F\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right)$$

where $F$ is the standard normal cumulative distribution function. That is,

$$p(y_i = 1) = F\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right) = \int_{-\infty}^{(\beta_1 + \beta_2 x_i)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

In practice, the c.d.f.s of the logit and the probit look quite similar to one another. Once again, calculating the derivative is moderately complicated. In this case,
$$\frac{\partial \operatorname{prob}(y_i = 1)}{\partial x_i} = \frac{\partial F\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right)}{\partial x_i} = f\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right) \frac{\beta_2}{\sigma}$$

where $f$ is the density function of the standard normal distribution. As in the logit case, the derivative is nonlinear and is often evaluated at the mean of the explanatory variables. In the case of dummy explanatory variables, it is common to estimate the derivative as the probability that $y_i = 1$ when the dummy variable is 1 (other variables set to their means) minus the probability that $y_i = 1$ when the dummy variable is 0 (other variables set to their means). That is, you simply calculate how the predicted probability changes when the dummy variable of interest switches from 0 to 1.

Which Is Better? Logit or Probit
Fortunately, from an empirical standpoint, logits and probits typically yield very similar estimates of the relevant derivatives. This is because the cumulative distribution functions for the logit and probit are similar, differing slightly only in the tails of their respective distributions. Thus, the derivatives are different only if there are enough observations in the tails of the distribution. While the derivatives are usually similar, it is important to remember that the parameter estimates associated with logit and probit models are not. A simple approximation suggests that multiplying the logit estimates by 0.625 makes them comparable to the probit estimates.
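The 0.625 rule of thumb is easy to check by fitting both models to the same data; the simulation below is illustrative, and the exact factor varies from sample to sample:

    # Comparing logit and probit estimates on the same data
    set.seed(6)
    x <- rnorm(500)
    y <- rbinom(500, 1, pnorm(0.3 + 0.8 * x))
    logit  <- glm(y ~ x, family = binomial(link = "logit"))
    probit <- glm(y ~ x, family = binomial(link = "probit"))
    coef(logit) * 0.625    # roughly comparable to the probit estimates below
    coef(probit)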

Example:
We estimate the relationship between the openness of a country, $Y$, and the country's per capita income in dollars, $X$, in 1992. We hypothesize that higher per capita income should be associated with free trade, and test this at the 5% significance level. The variable $Y$ takes the value 1 for free trade and 0 otherwise. Since the dependent variable is binary, we set up the index function
$$Y^* = \beta_1 + \beta_2 X_i$$

If $Y^* \ge 0$, $Y = 1$ (open); if $Y^* < 0$, $Y = 0$ (not open). Probit estimation gives the following results:

Dependent Variable: Y
Method: ML - Binary Probit (Quadratic hill climbing)
Date: 05/27/04   Time: 13:54
Sample(adjusted): 1 20
Included observations: 20 after adjusting endpoints
Convergence achieved after 7 iterations
Covariance matrix computed using second derivatives

Variable      Coefficient    Std. Error    z-Statistic    Prob.
C             -1.994188      0.824708      -2.418044      0.0156
X              0.001003      0.000471       2.129488      0.0332

Mean dependent var      0.500000    S.D. dependent var       0.512989
S.E. of regression      0.337280    Akaike info criterion    0.886471
Sum squared resid       2.047636    Schwarz criterion        0.986045
Log likelihood         -6.864713    Hannan-Quinn criter.     0.905909
Restr. log likelihood -13.862960    Avg. log likelihood     -0.343236
LR statistic (1 df)    13.996460    McFadden R-squared       0.504816
Probability(LR stat)    0.000183

The slope is significant at the 5% level. The interpretation of $b_2$ changes in a probit model: $b_2$ is the effect of $X$ on $Y^*$. The marginal effect of $X$ on $p(Y_i = 1)$ is easier to interpret and is given by $f(b_1 + b_2 \bar{X}) \cdot b_2$:

$$f\left(-1.9942 + 0.001(3469.5)\right)(0.001) = 0.0001$$
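This arithmetic can be checked in one line with R's standard normal density, using the estimated coefficients and taking 3469.5 from the worked example above (presumably the sample mean of X):

    dnorm(-1.994188 + 0.001003 * 3469.5) * 0.001003    # approximately 0.0001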

To test the fit of the model (analogous to $R^2$), the maximized log-likelihood value ($\ln L$) can be compared with the maximized log-likelihood of a model containing only a constant ($\ln L_0$) in the likelihood ratio index:
$$LRI = 1 - \frac{\ln L}{\ln L_0} = 1 - \frac{-6.8647}{-13.8629} = 0.50$$

Logit estimation gives the following results:

Dependent Variable: Y
Method: ML - Binary Logit (Quadratic hill climbing)
Date: 05/27/04   Time: 14:12
Sample(adjusted): 1 20
Included observations: 20 after adjusting endpoints
Convergence achieved after 7 iterations
Covariance matrix computed using second derivatives

Variable      Coefficient    Std. Error    z-Statistic    Prob.
C             -3.604997      1.681068      -2.144465      0.0320
X              0.001796      0.000900       1.995415      0.0460

Mean dependent var      0.500000    S.D. dependent var       0.512989
S.E. of regression      0.333745    Akaike info criterion    0.876647
Sum squared resid       2.004939    Schwarz criterion        0.976220
Log likelihood         -6.766465    Hannan-Quinn criter.     0.896084
Restr. log likelihood -13.862960    Avg. log likelihood     -0.338323
LR statistic (1 df)    14.192940    McFadden R-squared       0.511903
Probability(LR stat)    0.000165

As you can see from the output, the slope coefficient is significant at the 5% level. The coefficients are proportionally larger in absolute value than in the probit model, but the marginal effects and significance are similar:
$$\frac{\partial \operatorname{prob}(y_i = 1)}{\partial x_i} = f(b_1 + b_2 \bar{X}) \cdot b_2 = \frac{e^{-3.605 + 0.0018(3469.5)}}{\left(1 + e^{-3.605 + 0.0018(3469.5)}\right)^2}(0.0018) = 0.0001$$

This can be interpreted as the marginal effect of per capita income on the expected value of $Y$.
$$LRI = 1 - \frac{\ln L}{\ln L_0} = 1 - \frac{-6.7664}{-13.8629} = 0.51$$

