
Business Analysis and Econometric Applications

2. PART 2 – Short Answer Type


1. What is a Dummy variable trap? How can one avoid it?
Ans. A Dummy variable is an artificial variable generated to represent an attribute with
two or more distinct categories / levels.
The dummy variable trap is a scenario in which the independent variables are perfectly
multicollinear: one variable can be predicted exactly from the others (here, from the
other dummies together with the constant term).
Let's look at the following examples to illustrate the dummy variable trap :-
(Categorical Data = 2) :- Consider a data set where an attribute can take 2 values (male or
female).
D1 = Dummy variable to represent male (1 if male, 0 otherwise).
D2 = Dummy variable to represent female (1 if female, 0 otherwise).
D1 and D2 are dependent on each other: if D1 is 1 ("True"), then D2 is 0 ("False"). These
variables are perfectly collinear, which means :-
D1 + D2 = 1 (eqn 1)
D2 = 1 - D1 (eqn 2)
Let’s consider our linear regression equation where y is the ‘dependent variable’ :-
y= b0 +b1 *D1+ b2 * D2 (eqn 3)
Substituting the value of D2 from (eqn 2) in the (eqn 3):-
y= b0+b1*D1 +b2*(1-D1)
y= b0+b1*D1+b2-b2*D1
y=(b0+b2) +(b1-b2)D1 (eqn 4)
You can see that the b2 value associated with D2 got absorbed into the constant term
(b0 + b2) and into the coefficient of D1 (b1 - b2).
So, to avoid the dummy variable trap, we do not need to declare a dummy variable for
every category as we did in (eqn 3).
In order to avoid the dummy variable trap, we always declare one less dummy variable
(n-1) than the number of categories (n).
No. of dummy variables = no. of categories - 1.
In the above case, we have two categories (Male or Female), so we can have only
1 dummy variable, as shown by (eqn 4). Hence the new equation should be
y= b0 +b1 *D1
(Categorical Data > 2) :- Consider a data set with more than 2 values, for example a
person's nationality (Germany, France or Australia). As per the above rule, we need to
declare only 2 dummy variables as the category count is 3. But suppose we declare
3 variables in our equation :-
Dummy variables are :-
D1 = 1 if the person is from Germany, 0 otherwise.
D2 = 1 if the person is from France, 0 otherwise.
D3 = 1 if the person is from Australia, 0 otherwise.
Since the three are perfectly multicollinear, the following equation holds :-
D1+D2+D3=1
D3=[1-(D1+D2)] (eqn 5)
Let's see our linear regression equation :-
y= b0+b1*D1+b2*D2+b3*D3 (eqn 6)
y= b0+b1*D1+b2*D2+b3*[1-(D1+D2)]
y=(b0+b3)+D1(b1-b3)+D2(b2-b3) (eqn 7)
So we can see that the coefficient for D3, i.e. b3, got absorbed into the constant term and
the other two coefficients, so we can avoid declaring the third variable. Hence the above
rule holds, and the simplified equation will be :-
y= b0+b1*D1+b2*D2
This is the method we can follow to avoid the dummy variable trap.
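The two-category argument above can be checked numerically. The following is a minimal sketch in plain Python; the data rows and coefficient values are hypothetical and chosen only for illustration. It shows that, with an intercept and both dummies, two different coefficient sets give exactly the same fitted values, which is why the coefficients cannot be identified.

```python
# Illustration of the dummy variable trap with hypothetical data.
# Each row is (D1, D2): D1 = 1 for male, D2 = 1 for female, so D1 + D2 = 1.
rows = [(1, 0), (0, 1), (1, 0), (0, 1), (0, 1)]


def predict(b0, b1, b2, d1, d2):
    """Regression with an intercept AND both dummies (the form of eqn 3)."""
    return b0 + b1 * d1 + b2 * d2


# A first set of coefficients (hypothetical values).
coef_a = (10.0, 4.0, 1.0)

# A second set, obtained by moving a constant c from the dummies to the intercept.
c = 2.5
coef_b = (coef_a[0] + c, coef_a[1] - c, coef_a[2] - c)

pred_a = [predict(*coef_a, d1, d2) for d1, d2 in rows]
pred_b = [predict(*coef_b, d1, d2) for d1, d2 in rows]

# Both sets fit the data identically, so b0, b1 and b2 are not uniquely determined.
same_fit = all(abs(a - b) < 1e-12 for a, b in zip(pred_a, pred_b))

# Dropping D2 (n - 1 dummies) gives the identified form of eqn 4:
# intercept = b0 + b2, coefficient on D1 = b1 - b2.
reduced = (coef_a[0] + coef_a[2], coef_a[1] - coef_a[2])
pred_reduced = [reduced[0] + reduced[1] * d1 for d1, _ in rows]

print(same_fit)
print(pred_a == pred_reduced)
```

Any value of `c` produces the same fitted values, whereas the reduced model with one dummy has a single intercept and a single slope that reproduce them.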

2. What is the difference between fixed and random effects models?


Ans.
Random and Fixed Variables
A “fixed variable” is one that is assumed to be measured without error. It is also assumed
that the values of a fixed variable in one study are the same as the values of the fixed
variable in another study. “Random variables” are assumed to be values that are drawn
from a larger population of values and thus will represent them. You can think of the
values of random variables as representing a random sample of all possible values or
instances of that variable. Thus, we expect to generalize the results obtained with a
random variable to all other possible instances of that value (e.g., a job candidate with a
strong résumé). Most of the time in ANOVA and regression analysis we assume the
independent variables are fixed.
Random vs. Fixed: Variables

Definition:
Random variable: (1) is assumed to be measured with measurement error. The scores are a
function of a true score and random error; (2) the values come from and are intended to
generalize to a much larger population of possible values with a certain probability
distribution (e.g., normal distribution); (3) the number of values in the study is small
relative to the values of the variable as it appears in the population it is drawn from.
Fixed variable: (1) assumed to be measured without measurement error; (2) desired
generalization to population or other studies is to the same values; (3) the variable used
in the study contains all or most of the variable's values in the population.
It is important to distinguish between a variable that is varying and a variable that is
random. A fixed variable can have different values; it is not necessarily invariant (equal)
across groups.

Example:
Random variable: photographs representing individuals with differing levels of
attractiveness manipulated in an experiment, a subset of census tracts.
Fixed variable: gender, race, or intervention vs. control group.

Use in Multilevel Regression (MLR):
Predictor variables in MLR generally assumed to be fixed.

Random vs. Fixed: Effects

Definition:
Random effect: (1) different statistical model of regression or ANOVA model which assumes
that an independent variable is random; (2) generally used if the levels of the independent
variable are thought to be a small subset of the possible values which one wishes to
generalize to; (3) will probably produce larger standard errors (less powerful).
Fixed effect: (1) statistical model typically used in regression and ANOVA assuming
independent variable is fixed; (2) generalization of the results apply to similar values of
independent variable in the population or in other studies; (3) will probably produce
smaller standard errors (more powerful).

Example:
Random effect: random effects ANOVA, random effects regression.
Fixed effect: fixed effects ANOVA, fixed effects regression.

Use in Multilevel Regression (MLR):
Intercept only models in MLR are equivalent to random effects ANOVA, and inclusion of one
or more level-1 predictors makes the model equivalent to a random effects ANCOVA when
slopes do not vary across groups.

Random vs. Fixed: Coefficients

Definition:
Random coefficient: term applies only to MLR analyses in which intercepts, slopes, and
variances can be assumed to be random. MLR analyses most typically assume random
coefficients. One can conceptualize the coefficients obtained from the level-1 regressions
as a type of random variable which comes from and generalizes to a distribution of possible
values. Groups are conceived of as a subset of the possible groups.
Fixed coefficient: a coefficient can be fixed to be non-varying (invariant) across groups by
setting the between-group variance to zero.
Random coefficients must be variable across groups. Conceptually, fixed coefficients may be
invariant or varying across groups.

Example:
Random coefficient: the level-2 predictor, average income, is used to predict school
performance in each school. Intercept values for school performance are assumed to be a
sample of the intercepts from a larger population of schools.
Fixed coefficient: slopes or intercepts constrained to be equal over different schools.

Use in Multilevel Regression (MLR):
Both used in MLR. Slopes and intercept values can be considered to be fixed or random,
depending on researchers' assumptions and how the model is specified. The average intercept
or slope is referred to as a "fixed effect." Variances of the slopes and intercepts (if
allowed to vary across groups) are called "random coefficients."
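
As a compact illustration of the fixed versus random coefficient distinction, a two-level random-intercept model might be written as below. The notation (i for individuals, j for groups, and the symbols \gamma_{00}, u_{0j}, \tau^2) is an assumption borrowed from common multilevel-regression usage and does not come from the answer itself.

$$
\begin{aligned}
\text{Level 1:}\quad & y_{ij} = \beta_{0j} + \beta_1 x_{ij} + e_{ij} \\
\text{Level 2:}\quad & \beta_{0j} = \gamma_{00} + u_{0j}, \qquad u_{0j} \sim N(0, \tau^2)
\end{aligned}
$$

In this sketch \gamma_{00} is the average intercept, and \tau^2 is the between-group variance of the intercepts. Setting \tau^2 = 0 fixes the intercept to the same value in every group. The slope \beta_1 carries no group subscript, so it is a fixed coefficient here.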
1. PART 3 – Long Answer Type (IE)
1. (1) What is the model that you fit to this data? Write the regression equation for this
model, define each variable and explain what would be the sign of the coefficients and
why? (10 M)
Ans.

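[Figure: Strength plotted against water-cement ratio (%). The curve consists of two straight-line segments labelled I and II, with slopes labelled m1 and m2.]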

Observations from graph –


• The function is well represented by a combination of 2 straight lines
• The slope is negative throughout
• There is a change in the slope of the graph at a W-C ratio of 70%
Hence it makes sense to use a piecewise linear regression model

Strength is regressed on the water-cement ratio. Here, the regressor
is X = Water-Cement Ratio, while the regressand is Y = Strength

The equation is given by,


Y = A + B*X + C*(X – X’) *D
where,
Y = Estimator for strength
X = Water – Cement Ratio %
X’ = Threshold value (70% in this case)
D = 1 if X > X’ or 0 otherwise
A, B & C are the coefficients of regression
The usual properties of linear regression hold: Strength given X = Y (given X) + u (given X),
where the expected value of u is zero.

When D = 0,
The equation is of the form
Y = A + B*X
This represents the first part of the graph. From visual inspection of the graph, we
can see that A is positive and B is negative. This equation gives the strength when
the ratio is at or below the threshold ratio.
When D = 1,
Y = A + B*X + C*(X - X’)
Y = (A - C*X’) + (B + C)*X
This represents the second part of the graph. Here, the slope (B + C) is negative, as
evident from the graph. Since B is already negative (from the previous equation) and the
slope of this part is more negative than that of the first part, we conclude that C is
also negative. X’ is 70% in this case.
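
The piecewise model Y = A + B*X + C*(X - X’)*D can be estimated by ordinary least squares once the hinge term (X - X’)*D is built as an extra regressor. The sketch below assumes NumPy is available and uses hypothetical, noise-free data generated from made-up coefficients, so the recovered values can be compared with the ones used to generate the data. It is an illustration only, not the data from the question.

```python
import numpy as np

# Hypothetical coefficients used only to generate illustrative data.
A_true, B_true, C_true = 100.0, -0.5, -1.0
threshold = 70.0  # X' in the text

x = np.arange(40.0, 101.0, 5.0)      # water-cement ratio (%)
d = (x > threshold).astype(float)    # D = 1 if X > X', 0 otherwise
hinge = (x - threshold) * d          # the regressor (X - X') * D
y = A_true + B_true * x + C_true * hinge

# Design matrix with columns [1, X, (X - X') * D]
design = np.column_stack([np.ones_like(x), x, hinge])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
A_hat, B_hat, C_hat = coef

slope_below = B_hat          # slope of the first segment (D = 0)
slope_above = B_hat + C_hat  # slope of the second segment (D = 1)

print(A_hat, B_hat, C_hat)
print(slope_below, slope_above)
```

With a negative C the second segment is steeper than the first, and because the hinge term is zero at X = X’ the two segments meet at the threshold.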

1. (2) Suppose you ignore the scatter plot as shown above and fit the following regression
equation: Strength= α+β water/cement +u. (5 M) How will you interpret? What is wrong
with this model?
Ans.
Given,
Strength = A + B * Water-Cement Ratio + u

This is a simple (two-variable) linear regression of strength on the W-C ratio. To fit this
equation to the scatter plot, A should be positive and B negative: B is the average change
in strength per one percentage point increase in the W-C ratio over the whole range, and u
is the residual between the actual and estimated values.
• In this scenario the residuals will be larger, so the sum of the absolute values of u, or
of the squared values of u, will be higher than for the piecewise model. A larger share of
the total variation is then left unexplained, so the R-square value falls. Thus, this model
is a poorer fit for the scenario.
In the above model, the threshold value cannot be observed. One of the aims of the
experiment is to determine the critical value of the water-cement ratio and use it as a
specification for mixing cement, so the whole purpose of the experiment is defeated. A
simple regression does not capture the sharp change in slope, so a piecewise linear
function is used for this type of regression to obtain a better R-square value. A
higher-degree regression, which can be estimated by maximum likelihood, can also be used to
trace the mean value of the samples more accurately.
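
The loss of fit from ignoring the kink can be illustrated by comparing R-square for the two specifications on the same data. The sketch below assumes NumPy and uses simulated data with a made-up kink at 70% and made-up coefficients and noise level; none of these numbers come from the question.

```python
import numpy as np


def r_squared(design, y):
    """R-square of an OLS fit of y on the columns of design (intercept included)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Simulated data: hypothetical kink at 70% with a steeper slope beyond it.
rng = np.random.default_rng(0)
x = np.linspace(40.0, 100.0, 61)
hinge = np.where(x > 70.0, x - 70.0, 0.0)
y = 100.0 - 0.5 * x - 1.0 * hinge + rng.normal(0.0, 1.0, x.size)

ones = np.ones_like(x)
r2_simple = r_squared(np.column_stack([ones, x]), y)            # Strength = A + B*X + u
r2_piecewise = r_squared(np.column_stack([ones, x, hinge]), y)  # adds C*(X - X')*D

print(r2_simple, r2_piecewise)
```

Because the simple model is nested in the piecewise one, its R-square cannot exceed that of the piecewise model on the same sample. The size of the gap depends on how sharp the change in slope is.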
2. (1) What econometric technique should be used by the team to solve this problem?
And why? What would be the dependent variable? (6 M)
Ans. The bank wants to decide whether or not it should offer a loan to a customer.
Therefore the dependent variable is qualitative (binary) in nature, that is, yes or no.
We have three models to solve this kind of problem:
1) LPM (linear probability model)
2) Logit
3) Probit
The LPM has several problems, such as:
1. Non-normality of the error term
2. Heteroscedasticity of the error term, etc.
Many researchers therefore use either the logit or the probit model.
Logit is preferred for its mathematical simplicity.
The equation will look like
$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_1 + \beta_2 x_i + \cdots$$
p_i is the probability of a "yes" (a loan being offered); with grouped data it is estimated
by the proportion of "yes" outcomes in the sample.
1 - p_i is the probability of a "no" (a loan not being offered).
In logit, $\ln\left(\frac{p_i}{1 - p_i}\right)$, the log of the odds, is used as the dependent variable.
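
The link between the log-odds on the left-hand side and the probability p_i can be made concrete with a few lines of Python. This is a minimal sketch using only the standard library. The coefficient values and the regressor value are hypothetical, chosen so that the linear index is zero.

```python
import math


def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Probability implied by a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and regressor value, for illustration only.
beta1, beta2 = -4.0, 0.08
x_i = 50.0

z_i = beta1 + beta2 * x_i   # linear index beta1 + beta2 * x_i
p_i = inverse_logit(z_i)    # implied probability of a "yes"

print(z_i, p_i)
```

The two functions are inverses of each other, so a fitted log-odds value can always be converted back into a probability between 0 and 1, which the linear probability model does not guarantee.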

2.(2) What specific characteristics of the credit card history and personal details would
affect the dependent variable and how? Explain (6 M)

Ans. Specific characteristics of the credit card history that affect loan offers are:
1. Number of payment defaults
This is normally captured by the CIBIL score. The better the score, the better a
candidate the person is for a loan offer.
2. Number of loans already taken and repaid on time
Specific characteristics of the personal details that would affect the dependent variable are:
1. Salary - the higher the salary, the higher the probability of paying bills on time
2. Age - the lower the age, the more time is left before retirement, and hence the higher
the paying potential
3. Number of dependents - the more dependents, the greater the chance of defaulting on
payment.

2.(3) Suppose one of the independent variables is income of the customer. How can we
obtain the marginal effect of income on the dependent variable? (3 M)

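Ans. With the income of the customer as one of the independent variables, the model is

$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_1 + \beta_2 I_i + \cdots$$ (where I_i is the income)

The marginal effect of income is the change in the dependent variable per unit change in
income.
Therefore, if I increases by one unit, the change in the log-odds is

$$\Delta \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_2$$

Taking the antilog, the odds p_i / (1 - p_i) are multiplied by $e^{\beta_2}$.

To obtain the effect on the probability itself, write $z_i = \beta_1 + \beta_2 I_i + \cdots$ and solve for p_i:

$$\frac{p_i}{1 - p_i} = e^{z_i}$$
$$p_i = e^{z_i}(1 - p_i)$$
$$p_i = \frac{e^{z_i}}{1 + e^{z_i}}$$

Marginal effect of income on the probability:

$$\frac{\partial p_i}{\partial I_i} = \beta_2 \, p_i (1 - p_i)$$

This effect is not constant: it depends on the level of p_i and is largest when p_i = 0.5.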
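
For the logit model, a standard expression for the marginal effect of income on the probability is beta2 * p * (1 - p). The sketch below checks that expression against a numerical derivative, using only the Python standard library. The coefficient values and the income level are hypothetical, and a single regressor is assumed for simplicity.

```python
import math


def prob(income, beta1, beta2):
    """Logit probability with income as the only regressor (an assumption)."""
    z = beta1 + beta2 * income
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effect(income, beta1, beta2):
    """Analytic marginal effect dp/dI = beta2 * p * (1 - p)."""
    p = prob(income, beta1, beta2)
    return beta2 * p * (1.0 - p)


# Hypothetical coefficient values and income level, for illustration only.
beta1, beta2 = -3.0, 0.05
income = 60.0

analytic = marginal_effect(income, beta1, beta2)

# Central finite difference as an independent check of the derivative.
h = 1e-6
numeric = (prob(income + h, beta1, beta2) - prob(income - h, beta1, beta2)) / (2.0 * h)

# Multiplicative change in the odds for a one-unit rise in income.
odds_ratio = math.exp(beta2)

print(analytic, numeric, odds_ratio)
```

Evaluating `marginal_effect` at different income levels shows that the effect varies with p, which is why it is usually reported at the sample mean or averaged over observations.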
