
TUTORIAL 1 (WEEK 2): An Overview of Regression Analysis

Activities:

Step 1: Create a Google Slide. You are required to use this Google Slide for all activities until the
end of the semester.

Step 2: In the first slide, write your name and your tutorial group.

Step 3: In the second slide, briefly introduce yourself (including what you want me to call you in
class) and attach a photo of yourself.

Step 4: Give one example of a deterministic model and one example of a probabilistic (stochastic) model
from any discipline (such as physics, chemistry, business, or biology). It would be appreciated
if you could use a video to support your example. If you cannot find a video, that is fine too.
Due date: Before Next Tutorial.

Tutorial Exercises:
The end of Chapter 1 pp 42-47 (Using Econometrics A Practical Guide, AH Studenmund
7th edition), Question 3.

Question 3:
Decide whether you would expect relationships between the following pairs of dependent
and independent variables (respectively) to be positive, negative or ambiguous. Explain
your reasoning.
a. Aggregate net investment in the United States in a given year and GDP in that
year.
b. The amount of hair on the head of a male professor and the age of that professor.
c. The number of acres of wheat planted in a season and the price of wheat at the
beginning of that season.
d. The net investment and the real rate of interest in the same year and country.
e. The growth rate of GDP in a year and the average hair length in that year.
f. The quantity of canned tuna demanded and the price of a can of tuna.

Answer: (a) Positive, (b) negative, (c) positive, (d) negative, (e) ambiguous, (f) negative.
TUTORIAL 2 (WEEK 3): Ordinary Least Squares

Activities:

Note: please use the same Google Slide that you used for Activity 1.

Step 1: Data Collection


Ask 20 students of your gender how tall they are (in centimeters) and how much they weigh (in
kg). Do not include names in the data. Enter the data in an Excel spreadsheet;

Step 2: Estimation
Using the formulas, estimate the coefficients, 𝛽0 and 𝛽1;

Step 3: Write the estimated equation and interpret the estimated coefficients.

Step 4: Calculate the residuals and check whether the sum of the residuals equals zero.

Due date: Before Next Tutorial.


Discussion Questions:

Discussion No. 1
a. Why do we need regression analysis? Why not simply use the mean value of the regressand
as its best value?
Answer: We need regression analysis to estimate the relationship between variables. The mean is
not a good tool because it ignores the explanatory variables, so the errors around it can be very large.
b. What do we mean by a linear regression model?
Answer: A linear regression model is a model in which the variables and coefficients appear in
their simplest form.

Tutorial Exercises:

The end of Chapter 2 pp 75-80 (Using Econometrics A Practical Guide, AH Studenmund 7th
edition), Questions 2, and 4.

Question 2:
Just as you are about to estimate a regression (due tomorrow), massive sunspots cause magnetic
interference that ruins all electrically powered machines (e.g. computers). Instead of giving up and
flunking, you decide to calculate estimates from your data (on per capita income in thousands of
US dollars as a function of the percent of the labor force in agriculture in 10 developed countries)
using methods like those used in Section 1 without a computer. Your data are:

Country A B C D E F G H I J
Per Capita Income 6 8 8 7 7 12 9 8 9 10
% in Agriculture 9 10 8 7 10 4 5 5 6 7

a. Calculate 𝛽̂0 and 𝛽̂1.


Answer:

Y = per capita income, X = % in agriculture

Country    Y      X      Y − Ȳ    X − X̄    (X − X̄)(Y − Ȳ)    (X − X̄)²
A          6      9      -2.4     1.9      -4.56             3.61
B          8      10     -0.4     2.9      -1.16             8.41
C          8      8      -0.4     0.9      -0.36             0.81
D          7      7      -1.4     -0.1     0.14              0.01
E          7      10     -1.4     2.9      -4.06             8.41
F          12     4      3.6      -3.1     -11.16            9.61
G          9      5      0.6      -2.1     -1.26             4.41
H          8      5      -0.4     -2.1     0.84              4.41
I          9      6      0.6      -1.1     -0.66             1.21
J          10     7      1.6      -0.1     -0.16             0.01
Average    8.4    7.1             Sum      -22.4             40.9

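$$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{-22.4}{40.9} = -0.5477 \approx -0.55$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 8.4 - (-0.5477)(7.1) = 12.29$$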
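The hand calculation for Question 2 can be checked with a few lines of code. The following is a minimal sketch in plain Python using the ten observations from the question; the variable names are illustrative and are not part of the textbook.

```python
# Per capita income (Y, thousands of US dollars) and % of labor force in agriculture (X)
income = [6, 8, 8, 7, 7, 12, 9, 8, 9, 10]
agri = [9, 10, 8, 7, 10, 4, 5, 5, 6, 7]

n = len(income)
mean_y = sum(income) / n
mean_x = sum(agri) / n

# Sum of cross-products and sum of squared deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(agri, income))
sxx = sum((x - mean_x) ** 2 for x in agri)

beta1 = sxy / sxx                  # slope estimate
beta0 = mean_y - beta1 * mean_x    # intercept estimate

# Residuals: actual Y minus fitted Y; with an intercept they sum to zero
residuals = [y - (beta0 + beta1 * x) for x, y in zip(agri, income)]

print(round(beta1, 2), round(beta0, 2))
print(abs(sum(residuals)) < 1e-9)
```

Run as written, the slope is about -0.55 and the intercept about 12.29, and the residuals sum to zero up to floating-point error. The same pattern can be reused for the height and weight data collected in the activity.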

Question 4:
Consider the following two least squares estimates of the relationship between interest rates and
the federal budget deficit in the United States:
Model A: 𝑌̂1 = 0.103 − 0.079𝑋1
Where: 𝑌1 = the interest rate on Aaa corporate bonds
𝑋1 = the federal budget deficit as a percentage of GNP
(quarterly model : N=56)

Model T: 𝑌̂2 = 0.089 + 0.369𝑋2 + 0.887𝑋3


Where: 𝑌2 = the interest rate on 3-month Treasury bills
𝑋2 = the federal budget deficit in billions of dollars
𝑋3 = the rate of inflation (in percent)
(quarterly model : N=38)

a. What does “least squares estimates” mean? What is being estimated? What is being
squared? In what sense are the squares “least”?
Answer: The intercept and the slope coefficients are being estimated. The residuals are being squared.
The squares are “least” in the sense that the sum of the squared residuals is being minimized.
b. Based on economic theory, what signs would you have expected for the estimated slope
coefficients of the two models?
Answer: Positive.
c. Interpret the estimated coefficients for both models.
Answer:
Model A: if the federal budget deficit as a percentage of GNP increases by 1 percentage point, the interest rate
on Aaa corporate bonds drops by 0.079 percentage points.
Model T:
if the federal budget deficit increases by 1 billion dollars, the interest rate on 3-month
Treasury bills increases by 0.369 percentage points, holding other variables constant.
if the rate of inflation increases by 1 percentage point, the interest rate on 3-month Treasury bills increases by
0.887 percentage points, holding other variables constant.
TUTORIAL 3 (WEEK 4): Ordinary Least Squares: Classical Assumptions

Activities:

Note: please use the same Google Slide that you used for Activity 1 and 2.

Step 1: Choose an economic topic of your interest and find a journal article related to that topic.
You can log in to https://taylorslibrary.taylors.edu.my and download economic articles based on
your chosen topic. The articles should have been published between 2010 and 2018. Briefly
explain why you have chosen this topic (identify the key issue that your research aims to
address).

Step 2: Review the literature and develop the theoretical model


After reading the related article, determine the dependent and independent variables and the
country that you choose to study.

Step 3: Model specification


Select the independent variables and write the general and specific model.

Step 4: Hypothesize the expected signs of the coefficients


Hypothesize the expected signs of the slope coefficients of your model. Explain your reasoning;
if you really do not know, read about it. That’s what literature reviews are for.

Step 5: Data collection


Collect the data for the dependent and independent variables and explain the data source for your
study.

Step 6: Estimate and evaluate the equation


Estimate your equation using EViews and print out the results. Then evaluate your results by
answering the following questions:
a) Do the signs of the coefficients meet the expectations you developed in Step 4?
b) What is R̄²? What is R²? Interpret them. Are they different? If yes, why?
c) Interpret the estimated coefficients.

Due date: Before Next Tutorial.


Discussion Questions:

Discussion No. 1
a. What is the conditional expectation function or the population regression function?
b. What is the difference between the population and sample regression functions? Is this a
distinction without difference?
c. What is the role of the stochastic error term in regression analysis? What is the difference
between the stochastic error term and the residual?
Answer:
a. The population regression function is an equation that shows the relationship between the
dependent variable (DV) and the independent variables (IVs) in the population.
b. The population regression function shows the relationship between the DV and the IVs in the
population; its coefficients cannot be calculated directly because we do not have data for the whole
population. The sample regression function shows the relationship between the DV and the IVs in the
sample, and we can estimate its coefficients because we have data for the sample.
c. The stochastic error term is the difference between the actual value of the DV and the value given
by the population regression function, and it cannot be observed. Residuals are the differences between
the actual values of the DV and the values estimated from the sample; they can be calculated and minimized.
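The difference between the error term and the residual is easier to see in a simulation, where the population regression function is known by construction. The sketch below is only an illustration: the "true" coefficients, the sample size, and the error distribution are all assumptions chosen for the example, not values from the textbook.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population regression function: Y = 2 + 0.5*X + error
TRUE_B0, TRUE_B1 = 2.0, 0.5
xs = [float(i) for i in range(1, 21)]
errors = [rng.gauss(0, 1) for _ in xs]  # unobservable in real data
ys = [TRUE_B0 + TRUE_B1 * x + e for x, e in zip(xs, errors)]

# Sample regression function estimated by OLS from the 20 observations
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print("true coefficients:", TRUE_B0, TRUE_B1)
print("estimates:", round(b0, 3), round(b1, 3))
print("sum of errors:", round(sum(errors), 3))
print("sum of residuals is zero:", abs(sum(residuals)) < 1e-9)
```

The residuals are computed from the estimated line and sum to zero by construction, while the errors are measured from the true line and generally do not. With real data only the residuals are available.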

Discussion No. 3
The distinction between the stochastic error term and the residual is one of the most difficult
concepts to master in this chapter. List at least three differences between the stochastic error term
and the residual.

Answer:
The stochastic error term                        The residual
Population                                       Sample
Belongs to the population regression             Belongs to the sample regression function,
function, which shows the relationship           which shows the relationship between the
between the DV and IVs in the population         DV and IVs in the sample
Cannot be calculated                             Can be calculated
No data                                          Have data

Tutorial Exercises:

The end of Chapter 2 pp 75-80 (Using Econometrics A Practical Guide, AH Studenmund
7th edition), Questions 4 and 6.
Question 4:
Let’s return to the height-weight example on page 71 and recall what happened when we added a
nonsensical variable that measured the student’s campus post office box number (MAIL) to the
equation. The estimated equation changed from:
$$\widehat{WEIGHT} = 103.40 + 6.38\,HEIGHT$$
to:
$$\widehat{WEIGHT} = 102.35 + 6.36\,HEIGHT + 0.02\,MAIL$$
a. The estimated coefficient of HEIGHT changed when we added MAIL to the equation.
Does that make sense? Why?
b. In theory, someone’s weight has nothing to do with their campus mail box number, yet R²
went up from 0.74 to 0.75 when MAIL was added to the equation! How is it possible that
adding a nonsensical variable to an equation can increase R²?
c. Adding the nonsensical variable to the equation decreases R̄² from 0.73 to 0.72. Explain
how it’s possible that R̄² can go down at the same time that R² goes up.
d. If a person’s campus mail box number truly is unrelated to their weight, shouldn’t the
estimated coefficient of the variable equal exactly 0.00? How is it possible for a nonsensical
variable to get a nonzero estimated coefficient?
Answer:
(a) Yes. The new coefficient represents the impact of HEIGHT on WEIGHT, holding
MAIL constant, while the original coefficient did not hold MAIL constant. We’d
expect the estimated coefficient to change (even if only slightly) because of this new
constraint.
(b) One weakness of R² is that adding a variable will usually decrease (and will never
increase) the summed squared residuals no matter how nonsensical the variable is. As
a result, adding a nonsensical variable will usually increase (and will never decrease)
R².
(c) R̄² is adjusted for degrees of freedom and R² isn’t, so it’s completely possible that
the two measures could move in opposite directions when a variable is added to an
equation.
(d) The coefficient is indeed equal to zero in theory, but in any given sample the
observed values for MAIL may provide some minor explanatory power beyond that
provided by HEIGHT. As a result, it’s typical to get a nonzero estimated coefficient
even for the most nonsensical of variables.
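The degrees-of-freedom adjustment in part (c) can be made concrete with the usual formula R̄² = 1 − (1 − R²)(N − 1)/(N − K − 1), where K is the number of slope coefficients. The sketch below assumes a sample of N = 20 students; that sample size is an assumption, since it is not stated in this handout.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k slope coefficients."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

N = 20  # assumed sample size; not stated in this handout

before = adjusted_r2(0.74, N, k=1)  # WEIGHT on HEIGHT only
after = adjusted_r2(0.75, N, k=2)   # WEIGHT on HEIGHT and MAIL
print(round(before, 2), round(after, 2))  # 0.73 0.72
```

Under that assumption the small gain in R² (0.74 to 0.75) is outweighed by the lost degree of freedom, so the adjusted figure falls from 0.73 to 0.72, which matches the numbers quoted in the question.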
TUTORIAL 4 (WEEK 5): Hypothesis Testing

Activities:

Note:
1. Please use the same Google Slide that you used for Activity 1, 2 and 3.
2. You have to use the model and collected data in class activity 3 for this activity.

Step 1: Create hypothesis


Review the literature and set up the null and alternative hypotheses about the independent
variables.

Step 2: Choose a level of significance and a critical t-value

Step 3: Run the regression and obtain an estimated t-value

Step 4: Apply the decision rule in order to reject or not reject the null hypothesis.

Step 5: Interpret the estimated coefficients

Step 6: Calculate the confidence intervals for each coefficient.

Due date: Before Next Tutorial.


Tutorial Exercises:
The end of Chapter 5 pp 166-172 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 6.

Question 1:
Write the meaning of each of the following terms.
a. alternative hypothesis
The outcome the researcher does expect.

b. confidence interval
A confidence interval is a range that contains the true value of an item a specified
percentage of the time.

c. critical value
The critical value effectively separates the “acceptance”/non-rejection region from the
rejection region when testing a null hypothesis.

d. decision rule
The rule to apply when testing a single regression coefficient is:
Reject H0 if |tk| > tc and if tk also has the sign implied by HA.
Do not reject H0 otherwise.

e. level of significance
The level of significance indicates the probability of observing an estimated t-value
greater than the critical t-value if the null hypothesis were correct.

f. null hypothesis
The outcome that the researcher does not expect (almost always includes an equality
sign).

g. Type I Error
Rejecting a true null hypothesis.

h. Type II Error
Not rejecting a false null hypothesis.

i. p-value
A p-value, or marginal significance level, is the probability of observing a t-score that
size or larger (in absolute value) if the null hypothesis were true.

Question 3:
To get more experience with the t-test, let’s return to the model of alcohol consumption that we
developed in Exercise 11 of Chapter 4.
$$\widehat{DRINKS}_i = 13.00 + 11.36\,ADVICE_i - 0.20\,EDUC_i + 2.85\,DIVSEP_i + 14.20\,UNEMP_i$$
Standard errors: ADVICE (2.12), EDUC (0.31), DIVSEP (2.55), UNEMP (5.16)
N = 500, R̄² = 0.07
where: 𝐷𝑅𝐼𝑁𝐾𝑆𝑖 = drinks consumed by the ith individual in the last two weeks
𝐴𝐷𝑉𝐼𝐶𝐸𝑖 = 1 if a physician had advised the ith individual to cut back on drinking alcohol,
0 otherwise
𝐸𝐷𝑈𝐶𝑖 = years of schooling of ith individual
𝐷𝐼𝑉𝑆𝐸𝑃𝑖 = 1 if the ith individual was divorced or separated, 0 otherwise.
𝑈𝑁𝐸𝑀𝑃𝑖 = 1 if the ith individual was unemployed, 0 otherwise
a. It seems reasonable to expect positive coefficient for 𝐷𝐼𝑉𝑆𝐸𝑃 and 𝑈𝑁𝐸𝑀𝑃. Create and test
appropriate hypotheses for the coefficients of 𝐷𝐼𝑉𝑆𝐸𝑃 and 𝑈𝑁𝐸𝑀𝑃 at the 5-percent level.
b. Create and run a two-sided hypothesis test around zero for the coefficient of 𝐸𝐷𝑈𝐶 at the 1-percent
level. Why might a two-sided test be appropriate for this coefficient?
c. Most physicians would expect that if they urged patients to drink less alcohol, that’s what
patients actually would do (holding constant the other variables in the equation). Create and
test appropriate hypotheses for the coefficient of 𝐴𝐷𝑉𝐼𝐶𝐸 at 10-percent level.
d. Does your answer to part c cause you to wonder if perhaps you should change your hypotheses
in part c? Explain.

Answer:
(a) DIVSEP: H0: β ≤ 0, HA: β > 0. Cannot reject H0 since |1.11| < 1.658, even though 1.11 has the
sign of HA.
UNEMP: H0: β ≤ 0, HA: β > 0. Reject H0 since |2.75| > 1.658 and 2.75 has the sign of HA.
(b) EDUC: H0: β = 0, HA: β ≠ 0. Cannot reject H0 since |−0.65| < 2.617.
(c) ADVICE: H0: β ≥ 0, HA: β < 0. Cannot reject H0: although |5.37| > 1.289, 5.37 does not have the sign
of HA. The estimated link is positive, which raises the concern that the estimate may not be reliable.
(d) No. We’d still expect ADVICE to have a negative impact on DRINKS in this structural equation. The
problem is that the two variables almost surely are simultaneously determined, since a physician would be
more likely to advise a patient to drink less if that patient was drinking quite a bit. This simultaneity
violates Classical Assumption III. We’ll learn it in lecture 8.
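The decision rule from Question 1 (reject H0 only if |t| exceeds the critical value and, for a one-sided test, t has the sign implied by HA) can be applied mechanically. The sketch below uses the coefficients, standard errors, and critical values quoted in Question 3; the function name and its arguments are illustrative choices, not textbook notation.

```python
def t_test(coef, se, t_crit, expected_sign=None):
    """Return (t, reject) under the decision rule used in this tutorial.

    expected_sign is +1 or -1 for a one-sided test, None for a two-sided test.
    """
    t = coef / se
    reject = abs(t) > t_crit
    if expected_sign is not None:
        # One-sided: the t-score must also have the sign implied by HA
        reject = reject and (t * expected_sign > 0)
    return t, reject

# (coefficient, standard error, critical value, expected sign) from Question 3
tests = {
    "DIVSEP": (2.85, 2.55, 1.658, +1),
    "UNEMP": (14.20, 5.16, 1.658, +1),
    "EDUC": (-0.20, 0.31, 2.617, None),
    "ADVICE": (11.36, 2.12, 1.289, -1),
}
for name, args in tests.items():
    t, reject = t_test(*args)
    print(f"{name}: t = {t:.2f}, reject H0: {reject}")
```

Only UNEMP leads to a rejection. The computed t-scores can differ from the printed answers in the second decimal place because the reported coefficients and standard errors are rounded.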

Question 6:
Suppose that you estimate a model of house prices to determine the impact of having beach
frontage on the value of house. You do some research, and you decide to use the size of the lot
instead of the size of the house for a number of theoretical and data availability reasons. Your
results (standard errors in parentheses) are:
$$\widehat{PRICE}_i = 40 + 35.0\,LOT_i - 2.0\,AGE_i + 10.0\,BED_i - 4.0\,FIRE_i + 100\,BEACH_i$$
Standard errors: LOT (5.0), AGE (1.0), BED (10.0), FIRE (4.0), BEACH (10)
N = 30, R̄² = 0.63
where: 𝑃𝑅𝐼𝐶𝐸𝑖 = the price of the ith house (in thousands of dollars)
𝐿𝑂𝑇𝑖 = the size of the lot of the ith house (in thousands of square feet)
𝐴𝐺𝐸𝑖 = the age of the ith house in years
𝐵𝐸𝐷𝑖 = the number of bedrooms in the ith house
𝐹𝐼𝑅𝐸𝑖 = a dummy variable for a fireplace (1= yes for the ith house)
𝐵𝐸𝐴𝐶𝐻𝑖 = a dummy for having beach frontage (1= yes for the ith house)
a. You expect the variables 𝐿𝑂𝑇, 𝐵𝐸𝐷, and 𝐵𝐸𝐴𝐶𝐻 to have positive coefficients. Create and
test the appropriate hypotheses to evaluate these expectations at the 5-percent level.
b. You expect 𝐴𝐺𝐸 to have a negative coefficient. Create and test the appropriate hypotheses
to evaluate this expectation at the 10-percent level.
c. At first you expect 𝐹𝐼𝑅𝐸 to have a positive coefficient, but one of your friends says that
fireplaces are messy and are a pain to keep clean, so you’re not sure. Run a two-sided t-test
around zero to test this expectation at the 5-percent level.
d. What problems appear to exist in your equation? (Hint: Do you have any unexpected signs?
Do you have any coefficients that are not significantly different from zero?)
e. Which of the problems that you outlined in part d is the most worrisome? Explain your
answer.

Answer:
(a) For all three, H0: β ≤ 0, HA: β > 0, and the critical 5% one-sided t-value for 24 degrees of
freedom is 1.711. For LOT, we can reject H0 because |+7.0| > 1.711 and +7.0 is positive. For
BED, we cannot reject H0 because |+1.0| < 1.711, even though +1.0 is positive. For BEACH,
we can reject H0 because |+10.0| > 1.711 and +10.0 is positive.
(b) H0: β ≥ 0, HA: β < 0, and the critical 10% one-sided t-value for 24 degrees of freedom is
1.318, so we reject H0 because |−2.0| > 1.318 and −2.0 is negative.
(c) H0: β = 0, HA: β ≠ 0, and the critical 5% two-sided t-value for 24 degrees of freedom is
2.064, so we cannot reject H0 because |−1.0| < 2.064. Note that we don’t check the sign
because the test is two-sided and both signs are in the alternative hypothesis.
(d) The main problems are that the coefficients of BED and FIRE are insignificantly different
from zero.
(e) Given that we weren’t sure what sign to expect for the coefficient of FIRE, the insignificant
coefficient for BED is the most worrisome.
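Step 6 of the activity asks for confidence intervals, which use the same ingredients as the t-test: the estimate plus or minus the two-sided critical value times the standard error. As a sketch, the code below applies this to the Question 6 estimates with the 5% two-sided critical value for 24 degrees of freedom quoted in part (c); the helper name is illustrative.

```python
def confidence_interval(coef, se, t_crit):
    """Two-sided confidence interval: coef +/- t_crit * se."""
    margin = t_crit * se
    return coef - margin, coef + margin

T_CRIT_95 = 2.064  # 5% two-sided critical value for 24 d.f., as used in part (c)

# (estimated coefficient, standard error) from the house price equation
estimates = {
    "LOT": (35.0, 5.0),
    "AGE": (-2.0, 1.0),
    "BED": (10.0, 10.0),
    "FIRE": (-4.0, 4.0),
    "BEACH": (100.0, 10.0),
}
for name, (coef, se) in estimates.items():
    low, high = confidence_interval(coef, se, T_CRIT_95)
    print(f"{name}: [{low:.2f}, {high:.2f}]")
```

The intervals for BED and FIRE contain zero, which corresponds to failing to reject a zero coefficient in a two-sided test at the 5-percent level.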
TUTORIAL 5 (WEEK 6): Specification: Choosing the Independent Variables

Tutorial Exercises:

The end of Chapter 6 pp 196-202 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 5.

Question 1:
Write the meaning of each of the following terms.
a. expected sign
The sign that the researcher expects for the coefficient.

b. irrelevant variable
This refers to the case of including a variable in an equation when it does not belong there.

c. omitted variable
An important explanatory variable that has been left out.

d. omitted variable bias
Omitting a relevant variable is usually evidence that the entire equation is suspect,
because of the likely bias of the coefficients. This is called omitted variable bias.

e. sensitivity analysis
This essentially consists of purposely running a number of alternative specifications to
determine whether particular results are robust (not statistical flukes) to a change in
specification.

f. sequential specification search
Specifying different regressions until estimates with the desired properties are obtained.

g. specification errors
A specification error results when one of these choices is made incorrectly:
– independent variables
– functional form
– form of the stochastic error term

h. the four specification criteria
Four criteria help decide whether a given variable belongs in the equation:
1. Theory: Is the variable’s place in the equation unambiguous and theoretically sound?
2. t-Test: Is the variable’s estimated coefficient significant in the expected direction?
3. R̄²: Does the overall fit of the equation (adjusted for degrees of freedom) improve
when the variable is added to the equation?
4. Bias: Do other variables’ coefficients change significantly when the variable is
added to the equation?

Question 3:
Consider the following annual model of the death rate (per million population) due to coronary
heart disease in the United States (𝑌𝑡 ):

$$\hat{Y}_t = 140 + 10.0\,C_t + 4.0\,E_t - 1.0\,M_t$$
Standard errors: C (2.5), E (1.0), M (0.5)
N = 31 (1975 − 2005), R̄² = 0.678
where: 𝐶𝑡 = per capita cigarette consumption (pounds of tobacco) in year t
𝐸𝑡 = per capita consumption of edible saturated fats (pounds of butter, margarine, and
lard) in year t
𝑀𝑡 = per capita consumption of meat (pounds) in year t
a. Create and test appropriate hypotheses at the 10-percent level. What, if anything, seems to
be wrong with the estimated coefficient of M?
b. The most likely cause of a coefficient that is significant in the unexpected direction is
omitted variable bias. Which of the following variables could possibly be an omitted
variable that is causing 𝛽̂𝑀 ’s unexpected sign? Explain. (Hint: Be sure to analyze expected
bias in your explanation.)
𝐵𝑡 = per capita consumption of hard liquor (gallons) in year t
𝐹𝑡 = the average fat content (percentage) of the meat that was consumed in year t
𝑊𝑡 = per capita consumption of wine and beer (gallons) in year t
𝑅𝑡 = per capita number of miles run in year t
𝐻𝑡 = per capita open-heart surgeries in year t
𝑂𝑡 = per capita amount of oat bran eaten in year t
Answer:
(a) Coefficient:          C         E         M
    Hypothesized sign:    +         +         +
    t-value:              4.0       4.0       −2.0
    tc = 1.314            reject    reject    do not reject
    (10% one-sided with 27 d.f.)
The problem with the coefficient of M is that it is significant in the unexpected direction, one
indicator of a possible omitted variable.
(b) The coefficient of M is unexpectedly negative, so we’re looking for a variable the omission
of which would cause negative bias in the estimate of 𝛽𝑀. We thus need a variable that is
negatively correlated with meat consumption with a positive expected coefficient or a
variable that is positively correlated with meat consumption with a negative expected
coefficient. For the six variables listed, the expected bias is:

Possible Omitted Variable    Expected Sign of Coefficient    Correlation with M    Direction of Bias
B *

F
W *

R
H
O
*Indicates a weak expected sign or correlation.
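The reasoning in part (b) follows a simple sign rule: the expected direction of the bias is the product of the expected sign of the omitted variable's coefficient and the sign of its correlation with the included variable. The following is a minimal sketch of that rule; the function name and the +1/−1 encoding are choices made for this example.

```python
def bias_direction(sign_beta_omitted, sign_corr):
    """Expected sign of the bias from omitting a variable.

    Both arguments are +1 or -1: the expected sign of the omitted variable's
    coefficient and the sign of its correlation with the included variable.
    """
    product = sign_beta_omitted * sign_corr
    return "positive" if product > 0 else "negative"

# The two cases named in the answer both produce negative bias
print(bias_direction(+1, -1))  # positive coefficient, negative correlation
print(bias_direction(-1, +1))  # negative coefficient, positive correlation
```

Each candidate variable in the list can be run through the same rule once its expected sign and its correlation with meat consumption have been argued from theory.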

Question 5:
Assume that you have been hired by the surgeon general of the United States to study the
determinants of smoking behavior and that you estimate the following cross-sectional model
based on data for all 50 states (standard errors in parentheses):

$$\hat{C}_i = 100 - 9.0\,E_i + 1.0\,I_i - 0.04\,T_i - 3.0\,V_i + 1.5\,R_i$$
Standard errors: E (3.0), I (1.0), T (0.04), V (1.0), R (0.5)
N = 50 (states), R̄² = 0.40
where: 𝐶𝑖 = the number of cigarettes consumed per day per person in the ith state
𝐸𝑖 = the average years of education for persons over 21 in the ith state
𝐼𝑖 = the average income in the ith state (thousands of dollars)
𝑇𝑖 = the tax per package of cigarettes in the ith state(cents)
𝑉𝑖 = the number of video ads against smoking aired on the three major networks in the ith
state
𝑅𝑖 = the number of radio ads against smoking aired on the five largest radio networks in
the ith state
a. Develop and test (at the 5-percent level) appropriate hypotheses for the coefficients of the
variables in this equation.
b. Do you appear to have any irrelevant variables? Do you appear to have any omitted
variables? Explain your answer.
c. Let’s assume that your answer to part b was yes to both. Which problem is more important
to solve first: irrelevant variables or omitted variables? Why?
d. One of the purposes of running the equation was to determine the effectiveness of
antismoking advertising on television and radio. What is your conclusion?
e. The surgeon general decided that tax rates are irrelevant to cigarette smoking and orders
you to drop the variable from your equation. Given the following results, use our four
specification criteria to decide whether you agree with her conclusion, carefully explain
your reasoning (standard errors in parentheses).

$$\hat{C}_i = 101 - 9.0\,E_i + 1.0\,I_i - 3.5\,V_i + 1.6\,R_i$$
Standard errors: E (3.0), I (0.9), V (1.0), R (0.5)
N = 50 (states), R̄² = 0.40
f. In answering part e, you surely noticed that the R̄² figures were identical. Does this surprise
you? Why or why not?

Answer:

(a) Coefficient:           E        I         T         V        R
    Hypothesized sign:
    Calculated t-score:    −3.0     1.0       −1.0      −3.0     3.0
    tc = 1.682, so:        sig.     insig.    insig.    sig.     sig. but unexp. sign
(b) Both income and tax rate are potential irrelevant variables not only because of the sizes of the
t-scores but also because of theory. The significant unexpected sign for R is a clear
indication that there is a potential omitted variable.
(c) It’s prudent to attempt to solve an omitted variable problem before worrying about irrelevant
variables because of the bias that omitted variables cause.
(d) The equation appears to show that television advertising is effective and radio advertising
isn’t, but you shouldn’t jump to this conclusion. Improving the specification could change
this result. In particular, although it’s possible that radio advertising has little impact on
smoking, it’s very hard to believe that a radio antismoking campaign could cause a
significant increase in cigarette consumption!
(e) Theory: Given the fairly price-inelastic demand for cigarettes, it’s possible that T is
irrelevant.
t-score: The estimated coefficient isn’t significantly different from zero in the expected
direction.
R̄²: R̄² remains constant, which is exactly what will happen whenever a variable with a
t-score with an absolute value of 1 is removed from (or added to) an equation.
Do other coefficients change?: None of the other estimated coefficients change significantly
when T is dropped, indicating that dropping T caused no bias.
Conclusion: Based on these four criteria, it’s reasonable to conclude that T is an irrelevant
variable.
(f) You should not have been surprised. If a variable’s coefficient has a t-score of exactly 1.00,
then taking that variable out of an equation will not change R̄².
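The claim in part (f) can be verified numerically. For a single dropped variable, the F-statistic for the restriction equals the square of its t-score, so t² = (RSS_dropped − RSS_full) / (RSS_full / d.f.). The sketch below uses hypothetical sums of squares (the values 1000 and 560 are made up for illustration) and compares R̄² before and after dropping a variable with different t-scores.

```python
def adj_r2(rss, tss, n, k):
    """Adjusted R-squared from residual and total sums of squares."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

# Hypothetical sums of squares for an equation with k = 5 slopes and n = 50
n, k = 50, 5
tss, rss_full = 1000.0, 560.0
df = n - k - 1

def rss_after_dropping(t):
    """Residual sum of squares once a variable with t-score t is removed.

    Uses F = t**2 = (rss_dropped - rss_full) / (rss_full / df).
    """
    return rss_full * (1 + t ** 2 / df)

full = adj_r2(rss_full, tss, n, k)
for t in (0.5, 1.0, 2.0):
    dropped = adj_r2(rss_after_dropping(t), tss, n, k - 1)
    print(t, round(full, 4), round(dropped, 4))
```

Dropping a variable with |t| below 1 raises R̄², dropping one with |t| above 1 lowers it, and at exactly 1 the two figures coincide.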
TUTORIAL 6 (WEEK 7): Specification: Choosing the right independent variables

Tutorial Exercises:
The end of Chapter 7 pp 228-235 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 5.

Question 1:
Write the meaning of each of the following terms.
a. double-log functional form
The natural log of Y is the dependent variable and the natural log of X is the independent
variable.

b. elasticity
The percentage change in the dependent variable caused by a 1-percent increase in the
independent variable, holding the other variables in the equation constant.

c. intercept dummy
A dummy variable is a variable that takes on the values of 0 or 1, depending on whether a
condition for a qualitative attribute (such as gender) is met. An intercept dummy shifts the
intercept of the equation and leaves the slopes unchanged.

d. lag
Many econometric equations include one or more lagged independent variables like X1t-1,
where “t–1” indicates that the observation of X1 is from the time period previous to time
period t (we will learn more about lags later in the course).

e. linear in the coefficients
An equation is linear in the coefficients only if the coefficients appear in their simplest
form. They:
– are not raised to any powers (other than one)
– are not multiplied or divided by other coefficients
– do not themselves include some sort of function (like logs or exponents)

f. linear in variables
An equation is linear in the variables if plotting the function in terms of X and Y
generates a straight line.

g. log
A log (or logarithm) is the exponent to which a given base must be raised in order to
produce a specific number (refer to slide 6).

h. natural log
The symbol for a natural log is “ln,” so ln(x) = b means that (2.71828)^b = x.

i. polynomial functional form
Polynomial functional forms express Y as a function of independent variables, some of
which are raised to powers other than 1 (refer to slide 6).

j. semilog functional form
The semilog functional form is a variant of the double-log equation in which some but
not all of the variables (dependent and independent) are expressed in terms of their
natural logs.
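The link between the double-log form and elasticity can be illustrated numerically: in a double-log equation the slope coefficient is the elasticity, so a 1-percent increase in X changes Y by roughly that many percent. The equation below is hypothetical (the coefficients 1.0 and 0.8 are invented for the example), and the same snippet also checks the natural log definition.

```python
import math

# Hypothetical double-log equation: ln(Y) = 1.0 + 0.8 * ln(X)
B0, B1 = 1.0, 0.8

def predict_y(x):
    """Predicted Y from the double-log equation."""
    return math.exp(B0 + B1 * math.log(x))

x = 100.0
pct_change_y = (predict_y(x * 1.01) / predict_y(x) - 1) * 100
print(round(pct_change_y, 3))  # close to the slope coefficient, 0.8

# ln(x) = b means e**b = x
b = math.log(5.0)
print(abs(math.exp(b) - 5.0) < 1e-12)
```

The percentage change does not depend on the starting value of X, which is what makes the elasticity constant in the double-log form.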

Question 3:
Suppose you have been hired by a union that wants to convince workers in local dry cleaning
establishments that joining the union will improve their well-being. As your first assignment,
your boss asks you to build a model of wages for dry cleaning workers that measures the impact
of union membership on those wages. Your first equation (standard errors in parentheses) is:
$$\hat{W}_i = -11.4 + 0.30\,A_i - 0.003\,A_i^2 + 1.00\,S_i + 1.20\,U_i$$
Standard errors: A (0.10), A² (0.002), S (0.20), U (1.00)
N = 34, R̄² = 0.14
where: 𝑊𝑖 = the hourly wage (in dollars) of ith worker
𝐴𝑖 = the age of ith worker
𝑆𝑖 = the number of years of education of the ith worker
𝑈𝑖 = a dummy variable = 1 if the ith worker is a union member, 0 otherwise
a. Evaluate the equation. How do the sign and significance of the coefficients compare with
your expectation?
b. What is the meaning of the A² term? What relationship between A and W does it imply?
c. Do you think that you should have used the log of W as your dependent variable? Why or
why not?
d. On the basis of your regression, should the workers be convinced that joining the union
will improve their well-being? Why or why not?

Answer:

(a) The estimated coefficients all are in the expected direction, and those for A and S are significant. R̄²
seems fairly low, even for a cross-sectional data set of this nature.

(b) It implies that wages rise and then fall with respect to age but does not imply perfect collinearity.

(c) With a semilog left functional form (lnY), a slope coefficient represents the percentage change in the
dependent variable caused by a one-unit increase in the independent variable (holding constant all
the other independent variables). Since pay raises are often discussed in percentage terms, such a
functional form frequently is used to model wage rates and salaries.

(d) The poor fit and the insignificant estimated coefficient of union membership are both reasons for being
extremely cautious about using this regression to draw any conclusions about union membership.
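The "rise and then fall" pattern in part (b) of Question 3 can be made concrete by locating the turning point of the quadratic in age, which is where the marginal effect of age equals zero. The sketch below uses the estimated coefficients on A and A² from the equation; the function names are illustrative.

```python
# Estimated coefficients on A and A^2 from Question 3
B_AGE, B_AGE_SQ = 0.30, -0.003

def age_effect(age):
    """Contribution of age to the predicted hourly wage, other terms held fixed."""
    return B_AGE * age + B_AGE_SQ * age ** 2

def marginal_effect(age):
    """Derivative of the predicted wage with respect to age."""
    return B_AGE + 2 * B_AGE_SQ * age

turning_point = -B_AGE / (2 * B_AGE_SQ)
print(round(turning_point, 1))  # 50.0
print(marginal_effect(30) > 0, marginal_effect(60) < 0)
```

According to these estimates, predicted wages increase with age up to about 50 and decline afterwards, holding education and union membership constant.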
Question 5:
Walter Primeaux used slope dummies to help test his hypothesis that monopolies tend to advertise
less intensively than do duopolies in the electric utility industry. His estimated equation was
(t-scores in parentheses):

$$\hat{Y}_i = 0.15 + 5.0\,S_i + 0.015\,G_i + 0.35\,D_i$$
t-scores: S (4.5), G (0.4), D (2.9)
N = 350, R̄² = 0.456
where: 𝑌𝑖 = advertising and promotional expense (in dollars) per 1,000 residential kilowatt hours
(KWH) of the ith electric utility
𝑆𝑖 = number of residential customers of the ith electric utility (hundreds of thousands)
𝐺𝑖 = annual percentage growth in residential KWH of the ith electric utility
𝐷𝑖 = a dummy variable equal to 1 if the ith electric utility is a duopoly, 0 if a monopoly
a. Carefully explain the economic meaning of each of the three slope coefficients.
b. Hypothesize and test the relevant null hypotheses with the t-test at the 5-percent level of
significance. (Hint: Primeaux expected positive coefficients for all three.)

Answer:

(a) 𝛽𝑆: an increase of 100,000 residential customers will cause an increase of $5.00 in advertising
and promotional expense per 1,000 residential kilowatt hours, holding constant G and D.

𝛽𝐺: a 1-percentage-point increase in annual growth in residential KWH will cause a $0.015 increase in
advertising and promotional expense per 1,000 residential kilowatt hours (KWH), holding constant S and D.
𝛽𝐷: if the electric utility is a duopoly, advertising and promotional expense per 1,000
residential kilowatt hours (KWH) is $0.35 higher than for a monopoly, holding constant S and G.

(b) Coefficient:          S         G                D
    Hypothesized sign:    +         +                +
    t-value:              4.5       0.4              2.9
    tc = 1.645            reject    do not reject    reject
    (5% one-sided with infinite d.f.)
TUTORIAL 7 (WEEK 8): Multicollinearity

Activities:

Note: please use the same Google Slide that you used for Activity 1.

Step 1: Diagnose and explain whether your model has a multicollinearity problem

Step 2: Explain how you are going to solve the problem.

Due date: Before Next Tutorial.

Tutorial Exercises:
The end of Chapter 8 pp 259-262 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, and 5.

Question 1:
Write the meaning of each of the following terms.
a. dominant variable
A special case is that of a dominant variable: an explanatory variable is
definitionally related to the dependent variable
b. imperfect multicollinearity
Imperfect multicollinearity occurs when two (or more) explanatory variables are
imperfectly linearly related, as in:
X1i = α0 + α1X2i + ui
c. perfect multicollinearity
Perfect multicollinearity violates Classical Assumption VI, which specifies that no
explanatory variable is a perfect linear function of any other explanatory variables

d. redundant variable
Two or more independent variables in the equation measure essentially the same thing. In such a
case the multicollinear variables are not irrelevant, since any one of them is quite probably
theoretically and statistically sound. Instead, the variables might be called redundant;
only one of them is needed to represent the effect on the dependent variable that all of them
currently represent.
e. simple correlation coefficient
A measure that helps detect the degree of multicollinearity in a given application.
It shows the extent to which two variables are linearly related.
f. Variance inflation factor
A measure that helps detect the degree of multicollinearity in a given application.
It shows the extent to which a given explanatory variable can be explained by all the other
explanatory variables in the equation.
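
The two diagnostics in parts e and f are linked. A minimal Python sketch follows, assuming the standard formula VIF = 1 / (1 − R²), where R² comes from the auxiliary regression of one explanatory variable on all the others. The function name `vif_from_r2` is illustrative. The value r = 0.94 is borrowed from Question 5(c) below and is used as if the equation had only two explanatory variables, in which case the auxiliary R² equals the squared simple correlation; with three or more explanatory variables the auxiliary regression has to be estimated.

```python
def vif_from_r2(r2_aux):
    """Variance inflation factor for one explanatory variable.

    r2_aux is the R-squared of the auxiliary regression of that variable
    on all the other explanatory variables in the equation.
    """
    if not 0.0 <= r2_aux < 1.0:
        raise ValueError("auxiliary R-squared must lie in [0, 1)")
    return 1.0 / (1.0 - r2_aux)


# Illustrative two-regressor case: the auxiliary R-squared is simply r squared.
r = 0.94
vif = vif_from_r2(r ** 2)
print(round(vif, 2))
```

A VIF of 1 means no collinearity at all; the larger the VIF, the more the variance of the estimated coefficient has been inflated.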

Question 5:
You have been hired by Dean of Students Office to help reduce damage done to dorms by rowdy
students and your first step is to build a cross-sectional model of last term’s damage to each
dorm as a function of the attributes of that dorm (standard errors in parenthesis):
𝐷̂𝑖 = 210 + 733 𝐹𝑖 − 0.805 𝑆𝑖 + 74 𝐴𝑖
           (253)     (0.752)    (12.4)
𝑁 = 33   𝑅̅² = 0.84
where: 𝐷𝑖 = the amount of damage (in dollars) done to the ith dorm last term
𝐹𝑖 = the percentage of the ith dorm residents who are first-year students
𝑆𝑖 = the number of students who live in the ith dorm
𝐴𝑖 = the number of incidents involving alcohol that were reported to the Dean of Students
Office from the ith dorm last term (incidents involving alcohol may or may not involve
damage to the dorm)
a. Hypothesize signs, calculate t-scores, and test hypotheses for this result (5-percent level).
b. What problems (omitted variables, irrelevant variables, or multicollinearity) appear to exist
in this equation? Why?
c. Suppose that you were now told that the simple correlation coefficient between 𝑆𝑖 and 𝐴𝑖
was 0.94; would that change your answer? How?
d. Is it possible that the unexpected sign of 𝛽̂𝑠 could have been caused by multicollinearity?
Why?
Answer:

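(a) Coefficient:               F       S                           A
    Hypothesized sign:         +       +                           +
    Calculated t-score:        2.90    −1.07                       5.97
    Decision (tC = 1.699):     sig.    insig. (unexpected sign)    sig.
    (tC is the 5% one-sided critical value with 33 − 3 − 1 = 29 d.f.)
(b) All three are possibilities.
(c) Yes. A simple correlation coefficient of 0.94 between S and A makes multicollinearity a
much stronger possibility.
(d) Yes; multicollinearity widens the sampling distribution of 𝛽̂𝑠, which makes an estimate with
the unexpected sign more likely.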
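
The t-scores in part (a) are each estimated coefficient divided by its standard error. The sketch below recomputes them from the reported equation and applies a one-sided test. The critical value is an assumption read from a standard t-table (5% one-sided, 29 d.f.), and the helper names are illustrative.

```python
# Reported coefficients and standard errors from the dorm-damage equation.
coefficients = {"F": 733.0, "S": -0.805, "A": 74.0}
standard_errors = {"F": 253.0, "S": 0.752, "A": 12.4}

# Assumption: 5% one-sided critical t-value for N - K - 1 = 33 - 3 - 1 = 29 d.f.,
# taken from a standard t-table.
T_CRITICAL = 1.699


def t_score(coefficient, standard_error):
    """t-score for H0: beta = 0."""
    return coefficient / standard_error


def significant_positive(t, t_critical):
    """One-sided test of H0: beta <= 0 against HA: beta > 0."""
    return t > t_critical


t_scores = {name: t_score(coefficients[name], standard_errors[name])
            for name in coefficients}
decisions = {name: significant_positive(t, T_CRITICAL)
             for name, t in t_scores.items()}

for name in coefficients:
    label = "sig." if decisions[name] else "insig."
    print(name, round(t_scores[name], 2), label)
```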
TUTORIAL 8 (WEEK 9): Serial Correlation

Activities:

Note: please use the same Google Slide that you used for Activity 1.

Step 1: after estimating your model, run the Durbin-Watson Test


Conduct a Durbin-Watson test for positive serial correlation.
Carefully write down the null and alternative hypotheses.
Run a Durbin-Watson test for positive serial correlation at the 5-percent level. What are
the upper and lower critical values in this case? What can you conclude? Explain.

Due date: Before Next Tutorial.

Tutorial Exercises:
The end of Chapter 9 pp 315-320 (Using Econometrics A Practical Guide, AH
Studenmund 6th edition), Questions 1, 3, and 7.

Question 1:
Write the meaning of each of the following terms.
a. Durbin-Watson test
A test for first-order serial correlation in the error term that uses the Durbin–Watson d
statistic.
b. first-order autocorrelation coefficient
In the equation εt = ρεt–1 + ut, ρ is the first-order autocorrelation coefficient.
The magnitude of ρ indicates the strength of the serial correlation and the sign of ρ
indicates the nature of the serial correlation in an equation.
c. first-order serial correlation
The most commonly assumed kind of serial correlation is first-order serial correlation, in
which the current value of the error term is a function of the previous value of the error
term:
εt = ρεt–1 + ut
d. Generalized Least Squares
A method that addresses the serial correlation problem by transforming the equation.
e. impure serial correlation
Impure serial correlation is serial correlation that is caused by a specification error such
as an omitted variable and/or an incorrect functional form.
f. negative serial correlation
Negative serial correlation implies that the error term has a tendency to switch signs from
negative to positive and back again in consecutive observations.
g. Newey-West standard errors
Newey–West standard errors take account of serial correlation by correcting the standard
errors without changing the estimated coefficients.
h. positive serial correlation
Positive serial correlation implies that the error term tends to have the same sign from one
time period to the next.
i. pure serial correlation
Pure serial correlation occurs when Classical Assumption IV, which assumes uncorrelated
observations of the error term, is violated (in a correctly specified equation!).

Question 3:
Recall from section 9.5 that switching the order of a data set will not change its coefficient
estimates. A revised order will change the Durbin-Watson statistic, however. To see both these
points, run regression (𝐻𝑆 = 𝛽0 + 𝛽1 𝑃 + 𝜖) and compare the coefficient estimates and DW
statistics for this data set:
Year Housing starts Population
1 9090 2200
2 8942 2222
3 9755 2244
4 10327 2289
5 10513 2290

in the following three orders (in terms of year):


a. 1, 2, 3, 4, 5
b. 5, 4, 3, 2, 1
c. 2, 4, 3, 5, 1
Answer:
The coefficient estimates for all three orders are the same:
HSt = −28187 + 16.86 Pt
The Durbin-Watson d results differ, however:
(a) DW = 3.08
(b) DW = 3.08
(c) DW = 0.64
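
The result can be checked by hand with a few lines of Python. This is a sketch using only the five observations in the table above; the helper names `ols_simple` and `durbin_watson` are illustrative and not taken from any package.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


housing_starts = {1: 9090, 2: 8942, 3: 9755, 4: 10327, 5: 10513}
population = {1: 2200, 2: 2222, 3: 2244, 4: 2289, 5: 2290}
orders = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [2, 4, 3, 5, 1]}

results = {}
for label, years in orders.items():
    p = [population[year] for year in years]
    hs = [housing_starts[year] for year in years]
    intercept, slope = ols_simple(p, hs)
    residuals = [h - (intercept + slope * x) for x, h in zip(p, hs)]
    results[label] = (intercept, slope, durbin_watson(residuals))
    print(label, round(intercept), round(slope, 2), round(results[label][2], 2))
```

Reordering leaves the sums that define the OLS estimates unchanged, so the coefficients are identical in all three orders. Only the pairing of adjacent residuals changes, and that pairing is all the Durbin-Watson statistic depends on.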

Question 7:
You are hired by Farmer Vin, a famous producer of bacon and ham, to test the possibility that
feeding pigs at night allows them to grow faster than feeding them during the day. You take 200
pigs (from newborn piglets to extremely old porkers) and randomly assign them to feeding only
during the day or feeding only at night and, after six months, end up with the following (admittedly
very hypothetical) equation:
𝑊̂𝑖 = 12 + 3.5 𝐺𝑖 + 7.0 𝐷𝑖 − 0.25 𝐹𝑖
          (1.0)     (1.0)     (0.10)
𝑁 = 200   𝑅̅² = 0.70   𝐷𝑊 = 0.50
where: 𝑊𝑖 = the percentage weight gain of the ith pig
𝐺𝑖 = a dummy variable equal to 1 if the ith pig is male, 0 otherwise
𝐷𝑖 = a dummy variable equal to 1 if the ith pig was fed only at night, 0 if only during the
day
𝐹𝑖 = the amount of food (pounds) eaten per day by the ith pig
a. Test for serial correlation at the 5-percent level in this equation.
b. What econometric problems appear to exist in this equation? (Hint: Be sure to make and
test appropriate hypotheses about the slope coefficients.)
c. The goal of your experiment is to determine whether feeding at night represents a
significant improvement over feeding during the day. What can you conclude?
d. The observations are ordered from the youngest pig to the oldest pig. Does this information
change any of your answers to the previous parts of this question? Is this ordering a mistake?
Explain your answer.

Answer:

(a) This is a cross-sectional dataset and we normally wouldn’t expect autocorrelation, but we’ll
test anyway since that’s what the question calls for. The lower critical value dL for a 5%
one-sided test with K = 3 is approximately 1.61, substantially higher than the DW of 0.50.
(Sample sizes in Table B-4 only go up to 100, but the critical values at those sample sizes
turn out to be reasonable estimates of those at 200.) As a result, we can reject the null
hypothesis of no positive serial correlation, which in this case seems like evidence of impure
serial correlation caused by an omitted variable or an incorrect functional form.
(b) Coefficient:               G         D         F
    Hypothesized sign:         +         +         –?
    t-value:                   3.5       7.0       –2.5
    Decision (tC = 1.645):     reject    reject    reject
    (tC is the 5% one-sided critical value with infinite d.f.)
We certainly have impure serial correlation. In addition, some students will conclude that F
has a coefficient that is significant in the unexpected direction. (As it turns out, the negative
coefficient could have been anticipated because the dependent variable is in percentage terms
but F is in aggregate terms. We’d guess that the more food a pig eats, the bigger it is,
meaning that its chances of growing at a high rate are low, thus the negative sign.)
(c) The coefficient of D is significant in the expected direction, but given the problems with this
equation, we’d be hesitant to conclude much of anything just yet.

(d) In this case, the accidental ordering was a lucky stroke (not a mistake), because it allowed us
to realize that younger pigs will gain weight at a higher rate than their older counterparts. If
the data are ordered by age, positive residuals will be clustered at one end of the dataset,
while negative ones will be clustered at the other end, giving the appearance of serial
correlation.
TUTORIAL 9 (WEEK 10): Heteroskedasticity

Activities:

Note: please use the same Google Slide that you used for Activity 1.

Step 1: after estimating your model, plot the residuals from your OLS regression against your
independent variable/s (if you have more than one independent variable, plot the graph against
each of them one by one). Do the residuals look heteroscedastic? Explain.

Step 2: conduct a Breusch-Pagan Test for Heteroscedasticity.


Use all the right-hand variables in the original model to run the Breusch-Pagan auxiliary
regression.
Write the null and alternative hypotheses, compare the test statistic, and conduct the test at
5-percent level. Does heteroscedasticity appear to be present?

Step 3: conduct a White Test for Heteroscedasticity


How many variables are on the right-hand of the auxiliary regression?
What is the chi-square critical value?
Write the null and alternative hypotheses, compare the test statistic, and conduct the test at
5-percent level. Does heteroscedasticity appear to be present?

Step 4: how can you solve the problem?

Due date: Before Next Tutorial.


Tutorial Exercises:
The end of Chapter 10 pp 349-355 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 4.

Question 1:
Write the meaning of each of the following terms.
a. the Breusch-Pagan test
The Breusch-Pagan test is a method of testing for heteroskedasticity in the error term by
investigating whether the squared residuals can be explained by possible proportionality
factors.
b. heteroskedasticity
Classical Assumption V, which assumes constant variance of the error term, is violated.
c. impure heteroskedasticity
Impure heteroskedasticity is heteroskedasticity that is caused by a specification error.
Impure heteroskedasticity almost always originates from an omitted variable (rather than an
incorrect functional form).
d. heteroskedasticity-corrected standard errors
Heteroskedasticity-corrected standard errors take account of heteroskedasticity by correcting
the standard errors without changing the estimated coefficients.
e. pure heteroskedasticity
Pure heteroskedasticity occurs when Classical Assumption V, which assumes constant variance
of the error term, is violated (in a correctly specified equation!).
f. proportionality factor
Perhaps the most frequently specified model of pure heteroskedasticity relates the variance
of the error term to an exogenous variable Zi, which is called the proportionality factor and
which may or may not be in the equation.
g. the White test
The White test is a test for heteroskedasticity that regresses the squared residuals on the
explanatory variables, their squares, and their cross-products.

Question 3:
Consider the following estimated regression equation for average annual hours worked (per capita)
for young (16-21 years) black men in 94 standard metropolitan statistical areas (standard errors in
parenthesis):

𝐵̂𝑖 = 300 + 0.5 𝑊𝑖 − 7.5 𝑈𝑖 − 18.3 𝑙𝑛𝑃𝑖
           (0.05)    (7.5)     (6.1)
𝑁 = 94   𝑅̅² = 0.64   𝐷𝑊 = 2.00
where:
𝐵𝑖 = average annual hours worked (per capita) by young (age 16-21) black men in the ith
city
𝑊𝑖 = average annual hours worked (per capita) by young white men in the ith city
𝑈𝑖 = black unemployment rate in the ith city
𝐿𝑛𝑃𝑖 = natural log of the black population of the ith city
a. Develop and test (5-percent level) your own hypotheses with respect to the individual
estimated slope coefficients.
b. Since this is a cross-sectional model, is it reasonable to worry about heteroskedasticity?
c. Suppose you ran a Breusch-Pagan test and found 𝑅² = 0.08. Does this support or refute
your answer to part b? (Hint: Be sure to complete the Breusch-Pagan test.)
Answer:

(a) Coefficient:              W         U                lnP
    Hypothesized sign:        +         −                ?
    t-value:                  10.0      −1.0             −3.0
    Decision (tC = 1.66):     reject    do not reject    reject
    (tC is the 5% one-sided critical value with 90 d.f., interpolating)
(b) Heteroskedasticity seems reasonably unlikely, despite the cross-sectional nature of the
dataset, because the dependent variable is stated in per capita terms.
(c) The Breusch-Pagan statistic is N × R² = 94 × 0.08 = 7.52. The 5% critical chi-square value
with 3 degrees of freedom is 7.81, so we cannot reject the null hypothesis of homoskedasticity.
This supports the answer to part b.
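
The last step of the Breusch-Pagan test is simple arithmetic: multiply the sample size by the R² of the auxiliary regression and compare the product with a chi-square critical value whose degrees of freedom equal the number of slope coefficients in the auxiliary regression. A minimal sketch, assuming 5% critical values copied from a standard chi-square table (the dictionary and function names are illustrative):

```python
# Assumption: 5% chi-square critical values from a standard statistical table.
CHI2_CRITICAL_5PCT = {1: 3.84, 2: 5.99, 3: 7.81}


def breusch_pagan_statistic(n_obs, r2_aux):
    """N times R-squared of the auxiliary regression of squared residuals."""
    return n_obs * r2_aux


n_obs = 94       # number of metropolitan areas in Question 3
r2_aux = 0.08    # R-squared of the auxiliary regression given in part c
degrees_of_freedom = 3   # W, U and lnP are used in the auxiliary regression

statistic = breusch_pagan_statistic(n_obs, r2_aux)
critical = CHI2_CRITICAL_5PCT[degrees_of_freedom]
reject_homoskedasticity = statistic > critical

print(round(statistic, 2), critical, reject_homoskedasticity)
```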

Question 4:
A. Ando and F. Modigliani collected the following data on the income and consumption of non-
self-employed homeowners:
Income Bracket ($) Average Income ($) Average Consumption ($)
0-999 556 2760
1000-1999 1622 1930
2000-2999 2664 2740
3000-3999 3587 3515
4000-4999 4535 4350
5000-5999 5538 5320
6000-7499 6585 6250
7500-9999 8582 7460
10000- above 14033 11500

a. Run a regression to explain average consumption as a function of average income.


b. Use the Breusch-Pagan test to test the residuals from the equation you ran in part a for
heteroskedasticity at the 5-percent level.
c. Run a 5-percent White test on the same residuals.
d. If the test runs in parts b or c show evidence of heteroskedasticity, then what, if anything,
should be done about it?

Answer:

(a) 𝐶𝑂̂𝑖 = 1273.2 + 0.72 𝐼𝑖     t = 16.21     R² = 0.97
where: CO = average consumption
I = average income.
(b) Using a Park test on the residuals from part a:
𝐿𝑛(𝑒𝑖²) = 29.54 − 2.34 𝐿𝑛𝐼𝑖     t = −2.49     𝑅̅² = 0.39
tC = 2.365 at the 5% level (two-tailed), so we can reject the null hypothesis of
homoskedasticity.
(c) The White test confirms the Park test result of heteroskedasticity.
(d) Most econometricians would switch to HC standard errors or redefine the variables if the
tests indicated heteroskedasticity.
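
The regressions in parts (a) and (b) can be reproduced from the nine observations in the table above. The sketch below estimates consumption on income, then regresses the log of the squared residuals on the log of income (a Park-style auxiliary regression, which is what the answer key reports). The helper `ols_simple` is illustrative, and only coefficient estimates are computed, not t-statistics.

```python
import math


def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


income = [556, 1622, 2664, 3587, 4535, 5538, 6585, 8582, 14033]
consumption = [2760, 1930, 2740, 3515, 4350, 5320, 6250, 7460, 11500]

# Part (a): consumption as a function of income.
b0, b1 = ols_simple(income, consumption)
residuals = [c - (b0 + b1 * i) for i, c in zip(income, consumption)]

# Part (b): ln(e^2) regressed on ln(income).
log_squared_residuals = [math.log(e ** 2) for e in residuals]
log_income = [math.log(i) for i in income]
a0, a1 = ols_simple(log_income, log_squared_residuals)

print(round(b0, 1), round(b1, 2))
print(round(a0, 2), round(a1, 2))
```

A negative slope in the auxiliary regression says that the squared residuals shrink as income rises, which here is driven mainly by the large residual in the lowest income bracket.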
TUTORIAL 10 (WEEK 11): Introduction to Time Series Analysis: Part I

Tutorial Exercises:
The end of Chapter 12 pp 404-407 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 4.

Question 1:
Write the meaning of each of the following terms.
a. distributed lag model
An (ad hoc) distributed lag model explains
the current value of Y as a function of current
and past values of X, thus “distributing” the
impact of X over a number of time periods
b. dynamic model
A dynamic model explains the current value
of Y as a function of current value of X and
past value of Y.

Question 3 (the data needed to answer this question is provided on TIMeS):


You have been hired to determine the impact of advertising on gross sales revenue for “Four
Musketeers” candy bar. Four Musketeers has the same price and more or less the same ingredients
as competing candy bars, so it seems likely that only advertising affects sales. You decide to build
a model of sales as a function of advertising, but you are not sure whether a distributed lag model
or dynamic model is appropriate.
Using data on Four Musketeers candy bar from the Table uploaded on TIMeS, estimate both of
the following equations from 1985-2009 and compare the lag structures implied by the estimated
coefficients.
a. distributed lag model (4 lags)
b. a dynamic model
Answer:
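(a) Distributed lag model, dependent variable SALESt, estimated coefficients:
Constant    ADt    ADt–1    ADt–2    ADt–3    ADt–4
243         5.2    1.9      3.1      1.0      3.3

(b) Dynamic model, dependent variable SALESt, estimated coefficients:
Constant    ADt     SALESt–1
38.86       2.98    0.79
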

The lag structure in the ad hoc distributed lag equation makes no economic sense, because the
estimated coefficients don’t follow the smoothly declining pattern that economic theory would suggest
and that results from using a dynamic model.
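
To see the smoothly declining pattern that a dynamic model implies, note that a coefficient β0 on ADt and a coefficient λ on SALESt–1 together imply a weight of β0·λ^k on advertising k periods ago. The sketch below uses the two slope estimates from part (b); treating both as positive is an assumption, since the signs are not legible in the answer as printed, and the function name is illustrative.

```python
def implied_lag_weights(beta0, lam, n_lags):
    """Weights on X(t), X(t-1), ..., X(t-n_lags) implied by a dynamic model."""
    return [beta0 * lam ** k for k in range(n_lags + 1)]


BETA_AD = 2.98      # coefficient on current advertising (assumed positive)
LAMBDA = 0.79       # coefficient on lagged sales (assumed positive)

weights = implied_lag_weights(BETA_AD, LAMBDA, 4)
long_run_multiplier = BETA_AD / (1 - LAMBDA)

for k, w in enumerate(weights):
    print(k, round(w, 2))
print(round(long_run_multiplier, 2))
```

Each weight is 0.79 times the one before it, so the impact of advertising fades geometrically, unlike the ad hoc distributed lag estimates, which move up and down from one lag to the next.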
Question 4 (the data needed to answer this question is provided on TIMeS):
Test for serial correlation in the estimated dynamic model you got as your answer to Question 3
part b.
Answer:
LM = N × R² = 24 × 0.005622 = 0.135 < 3.84, the 5% critical chi-square value with one degree
of freedom, so we cannot reject the null hypothesis of no serial correlation.
TUTORIAL 11 (WEEK 12): Introduction to Time Series Analysis, Part II

Tutorial Exercises:
The end of Chapter 12 pp 404-407 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, and 6.
Question 1:
Write the meaning of each of the following terms.
a. cointegration
Cointegration means that even though individual variables might be nonstationary, it’s
possible for linear combinations of nonstationary variables to be stationary.
b. Dickey-Fuller test
A test of whether a variable is stationary; the null hypothesis is that the variable is
nonstationary (has a unit root).
c. nonstationary
If one or more of the stationarity properties listed in part f is not met, then Xt is
nonstationary.
d. random walk
Yt = Yt–1 + vt (12.23)
This is a random walk: the expected value of Yt does not converge on any value, meaning
that it is nonstationary.
e. spurious correlation
A strong relationship between two or more variables that is not caused by a real underlying
causal relationship.
f. stationary
A time-series variable, Xt, is stationary if:
1. the mean of Xt is constant over time,
2. the variance of Xt is constant over time, and
3. the simple correlation coefficient between Xt and Xt–k depends on the length of the lag
(k) but on no other variable (for all k)
g. unit root
Consider the case where Yt is generated by an equation that includes only past values of
itself (an autoregressive equation):
Yt = γYt–1 + vt
The circumstance where γ = 1 in the above equation (or similar equations) is called a
unit root.
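
A small simulation makes the difference between γ = 1 and |γ| < 1 concrete. The sketch below generates many paths of Yt = γYt–1 + vt with standard normal shocks and measures the variance across paths at two dates. All settings (number of paths, seed, γ = 0.5 for the stationary case) are arbitrary illustrative choices.

```python
import random


def simulate_paths(gamma, n_paths, n_steps, seed=0):
    """Simulate Y(t) = gamma * Y(t-1) + v(t), starting from Y = 0."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        y = 0.0
        path = []
        for _ in range(n_steps):
            y = gamma * y + rng.gauss(0.0, 1.0)
            path.append(y)
        paths.append(path)
    return paths


def variance_across_paths(paths, t):
    """Sample variance of Y at time index t, taken across simulated paths."""
    values = [path[t] for path in paths]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


random_walk = simulate_paths(gamma=1.0, n_paths=2000, n_steps=100)
stationary = simulate_paths(gamma=0.5, n_paths=2000, n_steps=100)

rw_var_early = variance_across_paths(random_walk, 9)
rw_var_late = variance_across_paths(random_walk, 99)
st_var_early = variance_across_paths(stationary, 9)
st_var_late = variance_across_paths(stationary, 99)

print("random walk:", round(rw_var_early, 1), round(rw_var_late, 1))
print("stationary: ", round(st_var_early, 2), round(st_var_late, 2))
```

With a unit root the variance keeps growing with time, which violates the second stationarity property; with γ = 0.5 it settles near a constant.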

Question 6 (the data needed to answer this question is provided on TIMeS):


Run 5-percent Dickey-Fuller tests for the following variables from the chicken demand equation,
using dataset CHICK9 on TIMeS, and determine which variables, if any, you think are
nonstationary.
a. 𝑌𝑡
b. 𝑃𝐶𝑡
c. 𝑃𝐵𝑡
d. 𝑌𝐷𝑡
Answer:

(a) Y: t = 6.54 and tC = 3.12, so we cannot reject the null hypothesis of nonstationarity at the
5% level (the sign of t does not agree with HA).
(b) PC: |t| = 0.36 and tC = 3.12, so we cannot reject the null hypothesis of nonstationarity at the
5% level.
(c) PB: |t| = 0.02 and tC = 3.12, so we cannot reject the null hypothesis of nonstationarity at the
5% level.
(d) YD: t = 12.55 and tC = 3.12, so we cannot reject the null hypothesis of nonstationarity at the
5% level (the sign of t does not agree with HA).
Thus all the variables in the chicken demand equation are nonstationary.
TUTORIAL 12 (WEEK 13): Introduction to Panel Data Analysis

Tutorial Exercises:
The end of Chapter 16 pp 503-507 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, and 5.

Question 1:
Write the meaning of each of the following terms.
a. fixed effect model
Most researchers use the fixed effects model,
which allows each cross-sectional unit to have
a different intercept

b. panel data
Panel (or longitudinal) data combine time-
series and cross-sectional data such that
observations on the same variables from the
same cross sectional sample are followed
over two or more different time periods

c. random effect model


• The random effects model instead is based on the assumption that the intercept for each
cross-sectional unit is drawn from a distribution (that is centered around a mean
intercept)
• Thus each intercept is a random draw from an “intercept distribution” and therefore is
independent of the error term for any particular observation
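
One way to see what "a different intercept for each unit" does is the within transformation: subtracting each unit's own mean from its observations gives the same slope as including a dummy for every unit. The sketch below uses a tiny, entirely hypothetical panel (two cities, two periods, made-up prices and quantities) in which the pooled slope is positive but the slope within each city is negative. Function names are illustrative.

```python
def ols_slope(x, y):
    """Slope of the OLS regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean from its own observations."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


# Hypothetical data: city B has both higher prices and higher demand.
city = ["A", "A", "B", "B"]
price = [10.0, 12.0, 20.0, 22.0]
quantity = [2.0, 1.8, 4.0, 3.8]

pooled_slope = ols_slope(price, quantity)
within_slope = ols_slope(demean_by_group(price, city),
                         demean_by_group(quantity, city))

print(round(pooled_slope, 3), round(within_slope, 3))
```

The pooled regression mixes differences between cities with the price effect inside each city; removing the city means leaves only the latter.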

Question 5:
Suppose that you are interested in the effect of price on demand for a “salon” haircut and that
you collect the data in the Excel file on TIMeS for four US cities for 2003 and 2008.
a. Estimate a cross-sectional OLS regression of per capita quantity as a function of average
price for 2003.
b. Now estimate a cross-sectional OLS regression on data for 2008. How is the result
different?
c. Now estimate a fixed effects model and a random effects model on the combined data and
compare your results with parts a and b. Apply the Hausman test to decide which model
offers the best approach to answering your question.
Answer:
(a) The estimated slope is positive, which certainly runs counter to our expectations.
Estimated coefficients: constant 1.41; P 0.0457 (standard error 0.014, t = 3.28)
N = 4   R² = 0.76

(b) While the fit and the size of the estimated coefficients differ from those in part (a), the sign of
the estimated slope coefficient continues to be unexpected.
Estimated coefficients: constant 0.22; P 0.0237 (standard error 0.014, t = 1.64)
N = 4   R² = 0.36

(c) As expected, the sign reverses.


𝑄̂ = 2.58 − 0.02𝑃
𝑁=8 𝑅̅ 2 = 0.99
(d) As expected, the fixed effects model is superior.
TUTORIAL 13 (WEEK 14): Qualitative Dependent Variable (Dummy Dependent Variable)

Tutorial Exercises:
The end of Chapter 13 pp 425-428 (Using Econometrics A Practical Guide, AH
Studenmund 7th edition), Questions 1, 3, and 5.

Question 1:
Write the meaning of each of the following terms.
a. binomial logit
The binomial logit is an estimation technique for equations with dummy dependent variables that
avoids the unboundedness problem of the linear probability model

b. binomial probit
Similar to the logit model, the binomial probit model is an estimation technique for
equations with dummy dependent variables that avoids the unboundedness problem of the linear
probability model. However, rather than the logistic function, this model uses a variant of the
cumulative normal distribution

c. interpreting estimated logit coefficients


1. Change an average observation:
– Create an “average” observation by plugging the means of all the independent
variables into the estimated logit equation and then calculating an “average” 𝐷̂𝑖
– Then increase the independent variable of interest by one unit and recalculate the 𝐷̂𝑖
– The difference between the two 𝐷̂𝑖s then gives the marginal effect
2. Use a partial derivative:
– Taking a derivative of the logit yields the result that the change in the expected
value of 𝐷𝑖 caused by a one-unit increase in 𝑋1𝑖, holding constant the other independent
variables in the equation, equals 𝛽̂1 𝐷̂𝑖 (1 − 𝐷̂𝑖)
– To use this formula, simply plug in your estimates of 𝛽1 and 𝐷𝑖
– From this, again, the marginal impact of X does indeed depend on the value of 𝐷̂𝑖
3. Use a rough estimate of 0.25:
– Plugging in 𝐷̂𝑖 = 0.5 into the previous equation, we get the (more handy!) result that
multiplying a logit coefficient by 0.25 (or dividing by 4) yields an equivalent linear
probability model coefficient
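
The three approaches can be compared side by side. The sketch below uses a logit with a single explanatory variable and entirely hypothetical numbers (intercept −1.0, slope 0.8, sample mean of X equal to 1.0); none of these values come from the tutorial questions.

```python
import math


def logistic(z):
    """Logistic function: maps the logit index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical estimated logit: L = B0 + B1 * X
B0, B1 = -1.0, 0.8
X_MEAN = 1.0

# 1. Change an average observation by one unit.
d_average = logistic(B0 + B1 * X_MEAN)
effect_unit_change = logistic(B0 + B1 * (X_MEAN + 1.0)) - d_average

# 2. Partial derivative evaluated at the average observation.
effect_derivative = B1 * d_average * (1.0 - d_average)

# 3. Rough estimate: the derivative evaluated at a probability of 0.5.
effect_rough = 0.25 * B1

print(round(effect_unit_change, 4),
      round(effect_derivative, 4),
      round(effect_rough, 4))
```

Because D(1 − D) can never exceed 0.25, the rough estimate is an upper bound on the derivative-based effect; the two are close only when the predicted probability is near 0.5.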

d. linear probability model


The linear probability model is simply running OLS for a regression, where the dependent
variable is a dummy (i.e. binary) variable

e. maximum likelihood
maximum likelihood (ML) is an iterative estimation technique that is especially useful for
equations that are nonlinear in the coefficients
Question 3:
Because their college had just upgraded its residence halls, two seniors decided to build a model
of the decision to live on campus. They collected data from 533 upper-class students (first-year
students were required to live on campus) and estimated the following equation:

𝐿: Pr̂(𝐷𝑖 = 1) = 3.26 − 2.0 𝑈𝑁𝐼𝑇𝑖 − 0.13 𝐴𝐿𝐶𝑂𝑖 − 0.99 𝑌𝐸𝐴𝑅𝑖 − 0.39 𝐺𝑅𝐸𝐾𝑖
                        (0.04)       (0.08)        (0.12)        (0.21)

𝑁 = 533   𝑅̅𝑃² = 0.668   iterations = 4


where: 𝐷𝑖 = 1 if the ith student lived on campus, 0 otherwise
𝑈𝑁𝐼𝑇𝑖 = the number of academic units the ith student was taking
𝐴𝐿𝐶𝑂𝑖 = the nights per week that the ith student consumed alcohol
𝑌𝐸𝐴𝑅𝑖 = 2 if the ith student was a sophomore, 3 if a junior, and 4 if a senior
𝐺𝑅𝐸𝐾𝑖 = 1 if the ith student was a member of a fraternity/sorority, 0 otherwise
a. The two seniors expected UNIT to have a positive coefficient and the other variables to
have negative coefficients. Do the results support these hypotheses?
b. What problem do you see with the definition of YEAR variable? What constraint does this
definition place on the estimated coefficient?
c. Carefully state the meaning of the coefficient of ALCO and analyze the size of the
coefficient.
Answer:
(a) Coefficient:               UNIT      ALCO      YEAR      GREK
    Hypothesized sign:         +         −         −         −
    Calculated t-score:        0.84      −1.55     −8.25     −1.38
    Decision (tC = 1.289):     insig.    insig.    sig.      sig.

(b) Defining YEAR this way constrains the coefficients of three classes to be related to each other
when there is no reason to expect that to be the case. For example, the definition forces a
junior to be exactly 1.33 times more likely to live off campus than a sophomore when there is
no reason to expect this relationship. In fact, we’d expect seniors to be by far more likely to
live off campus than juniors or sophomores, and this definition wouldn’t allow that to
happen.
A much better approach would have been to define two dummy variables, one equal to 1 for seniors (0
otherwise) and one equal to 1 for juniors (0 otherwise), which would make being a
sophomore the omitted condition. We’d expect a positive coefficient for each variable, with
the coefficient of senior being substantially larger than the coefficient of junior.
(c) The estimate of ALCO tells us that for each additional night (per week) that a student
consumes alcohol, the log of the odds that that student will live on campus will decrease by
0.13, holding constant the other independent variables in the equation. If we divide 0.13 by 4,
this turns out to be equivalent to saying that for each additional night (per week) that a
student consumes alcohol, the probability of that student living on campus will decrease by 3.25
percentage points, holding constant the other independent variables in the equation. This is a
little lower than we might have expected, but it certainly is plausible.

Question 5: (the data needed to answer this question is provided on TIMeS):


In 2008, Goldman and Romley studied hospital demand by analyzing how 8,721 Medicare-covered
pneumonia patients chose from among 117 hospitals in the greater Los Angeles area. The authors
concluded that clinical quality (as measured by a low pneumonia mortality rate) played a smaller
role in hospital choice than did a variety of other factors.
Let’s focus on a subset of the Goldman-Romley sample: the 499 patients who chose either the UCLA
Medical Center or the nearby Cedars Sinai Medical Center. Typically, economists would expect
price to have a major influence on such a choice, but Medicare patients pay roughly the same price
no matter what hospital they choose. Instead, factors like the distance the patient lives from the
hospital and the age and income of the patient become potentially important factors.
𝐷𝑖 = 𝑓(𝐷𝐼𝑆𝑇𝐴𝑁𝐶𝐸𝑖 , 𝐼𝑁𝐶𝑂𝑀𝐸𝑖 , 𝑂𝐿𝐷𝑖 )
where: 𝐷𝑖 = 1 if the ith patient chose Cedars Sinai, 0 if they chose UCLA
𝐷𝐼𝑆𝑇𝐴𝑁𝐶𝐸𝑖 = the distance from the ith patient’s home (according to zip code) to Cedars
Sinai minus the distance from that point to the UCLA Medical Center (in miles)
𝐼𝑁𝐶𝑂𝑀𝐸𝑖 = the income of the ith patient (thousand dollars)
𝑂𝐿𝐷𝑖 = 1 if the ith patient was older than 75, 0 otherwise
a. Estimate the model using both the linear probability model and the binomial logit.
b. Create and test appropriate hypotheses.
c. Carefully state the meaning of the estimated coefficients for both models.
Answer:

(a) Linear Probability Model:


𝐷̂𝑖 = 1.19 − 0.071 𝐷𝐼𝑆𝑇𝐴𝑁𝐶𝐸𝑖 − 0.01 𝐼𝑁𝐶𝑂𝑀𝐸𝑖 − 0.05 𝑂𝐿𝐷𝑖
            (0.007)             (0.006)           (0.04)
Logit Model:

𝐿: Pr̂(𝐷𝑖 = 1) = 4.41 − 0.38 𝐷𝐼𝑆𝑇𝐴𝑁𝐶𝐸𝑖 − 0.072 𝐼𝑁𝐶𝑂𝑀𝐸𝑖 − 0.29 𝑂𝐿𝐷𝑖
                        (0.05)             (0.036)           (0.31)
(b) The trick here is getting the expected sign right, because it won’t be obvious to everyone that
DISTANCE can be negative (if the patient lives closer to Cedars Sinai than to UCLA). Then we
use the t-test to check the significance.
(c) The point here is:
Students must know that they have to interpret the DV as a probability,
and that the estimated coefficients in the logit model must be divided by 4 first and then be
interpreted.
Therefore:
The linear probability model:
For every extra mile that it takes a patient to get to Cedars Sinai as compared to UCLA, the
probability that the patient chooses Cedars Sinai decreases by 7.1 percentage points, holding
other variables constant.
If the patient’s income increases by one thousand dollars, the probability that the patient
chooses Cedars Sinai decreases by 1 percentage point, holding other variables constant.
If the patient was over 75 years old, the probability that the patient chooses Cedars Sinai is
lower by 5 percentage points, holding other variables constant.

The logit model:

For every extra mile that it takes a patient to get to Cedars Sinai as compared to UCLA, the
probability that the patient chooses Cedars Sinai decreases by 9.5 percentage points, holding
other variables constant.
If the patient’s income increases by one thousand dollars, the probability that the patient
chooses Cedars Sinai decreases by 1.8 percentage points, holding other variables constant.
If the patient was over 75 years old, the probability that the patient chooses Cedars Sinai is
lower by 7.3 percentage points, holding other variables constant.
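
The logit figures above come from the divide-by-4 rule of thumb. A short sketch that applies the rule to the reported logit coefficients and sets the results beside the linear probability model estimates (the dictionary names are illustrative):

```python
# Estimated coefficients as reported in part (a).
logit_coefficients = {"DISTANCE": -0.38, "INCOME": -0.072, "OLD": -0.29}
lpm_coefficients = {"DISTANCE": -0.071, "INCOME": -0.01, "OLD": -0.05}

# Rule of thumb: a logit coefficient divided by 4 approximates the change in
# probability for a one-unit change in the variable (exact only at P = 0.5).
rough_effects = {name: b / 4 for name, b in logit_coefficients.items()}

for name in logit_coefficients:
    print(name,
          "logit/4:", round(100 * rough_effects[name], 2), "pct. points;",
          "LPM:", round(100 * lpm_coefficients[name], 2), "pct. points")
```

The two sets of estimates agree in sign and are of the same order of magnitude, which is what the rule of thumb is meant to deliver.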
