Fall 2019
Your solutions should be typed and well organized. You need to explain / show all of the steps you used to
arrive at your answer. Submit your work through Blackboard as a Word or pdf file.
1. Indicate whether the following statements are true or false, along with a brief explanation.
c. A dummy variable trap occurs when you include a nonsensical variable in your
regression and it appears to be statistically significant.
e. Seasonality is an important issue when working with yearly time series data.
Solution:
(a) False: the term dummy variable is synonymous with indicator variable. It refers to a binary
variable.
(b) False: the coding of a variable is based on the arbitrary wishes of a researcher; the coding
has no impact on the variable’s statistical significance in a regression.
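(c) False: a dummy variable trap occurs when a researcher includes m dummies in a regression,
along with an intercept, when there are only m groups being studied. The m dummies sum to the
intercept column, which produces perfect multicollinearity. The remedy is to include only
m − 1 dummies and let the omitted group serve as the base category.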
(d) True: multicollinearity can be a problem with specifying interaction terms in a regression.
(e) False: by definition, seasonality refers to regular movements in a variable within a year that
repeat every year. With annual data, each observation already covers a full year, so seasonality
is not an issue.
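The dummy variable trap in (c) can be illustrated with a small numerical check. The sketch below is only an illustration under assumed inputs: the six observations, the three groups, and the helper matrix_rank are hypothetical and not part of the assignment. It shows that an intercept plus all m dummies gives a design matrix that is not of full column rank, while dropping one dummy restores full rank.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    a = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(a[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, len(a)):
            factor = a[r][col] / a[rank][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[rank])]
        rank += 1
    return rank


# Hypothetical sample: 6 observations spread across m = 3 groups.
groups = [0, 0, 1, 1, 2, 2]
m = 3
dummies = [[1 if g == j else 0 for j in range(m)] for g in groups]

# Trap: intercept plus all m dummies (the dummies sum to the intercept column).
X_trap = [[1] + row for row in dummies]

# Fix: drop the first dummy so that group 0 becomes the base category.
X_ok = [[1] + row[1:] for row in dummies]

print(len(X_trap[0]), matrix_rank(X_trap))  # 4 columns, rank 3
print(len(X_ok[0]), matrix_rank(X_ok))      # 3 columns, rank 3
```

Because the first design matrix has four columns but rank three, its coefficients cannot be estimated uniquely; this is why statistical software drops one of the dummies automatically.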
2. Use the data file BOOKCOST posted on Blackboard under Assignments. A major publishing
company would like to develop an equation that will help it in determining the cost of books
that it publishes. It has a sample of 200 books that have been published recently. Of the 200
books in the sample, 80 are hardcover and 120 are softcover. Hardcover books are priced at a
premium, so some adjustment for this will need to be made. The variables in the data are as
follows:
cost the cost of producing the book
pages the number of pages in the book
softcover a dummy variable coded as: 0 = hardcover; 1 = softcover
a. Regress cost on pages and softcover. Show the regression output.
i. How much does an additional page cost the publisher? Does this cost depend on
whether softcover or hardcover is being produced? If so, how?
ii. How much more/less, on average, does it cost to publish a softcover than a
hardcover?
b. Regress COST on PAGES, SOFTCOVER, and an interaction term. Show the regression
output.
i. How much does an additional page cost the publisher? Does this cost depend on
whether softcover or hardcover is being produced? If so, how?
Solution:
(i) Each additional page costs the publisher about 1.4 cents. The cost of each additional page
is the same for hardcover books as it is for softcover.
(ii) A softcover book costs $10.60 less, on average, than a hardcover book.
(b) First, the interaction term needs to be created in Stata (generate abc = pages*softcover).
Then the regression command is: regress cost pages softcover abc.
      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(3, 196)       =    126.01
       Model |  6495.29131         3   2165.0971   Prob > F        =    0.0000
    Residual |  3367.73006       196  17.1822962   R-squared       =    0.6585
-------------+----------------------------------   Adj R-squared   =    0.6533
       Total |  9863.02137       199  49.5629214   Root MSE        =    4.1452
(i) The cost depends on which type of cover is being produced. For a softcover, each additional page
costs the publisher about 0.8 cents. For a hardcover, each additional page costs the publisher about
1.9 cents.
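With the interaction term, the cost of an additional page is the coefficient on pages plus the coefficient on the interaction term multiplied by softcover. The sketch below illustrates that calculation. The two slope values are assumptions backed out of the rounded figures in the solution (1.9 cents for hardcover, 0.8 cents for softcover); they are not the Stata estimates, which are not reproduced here.

```python
# Assumed, rounded values implied by the solution text (dollars per page);
# these are NOT the actual Stata coefficient estimates.
B_PAGES = 0.019          # slope on pages when softcover = 0 (hardcover)
B_INTERACTION = -0.011   # change in the slope when softcover = 1 (0.008 - 0.019)


def marginal_page_cost(softcover: int) -> float:
    """Cost of one more page: d(cost)/d(pages) = b_pages + b_interaction * softcover."""
    return B_PAGES + B_INTERACTION * softcover


hard = marginal_page_cost(0)
soft = marginal_page_cost(1)
print(f"hardcover: {hard * 100:.1f} cents per page")
print(f"softcover: {soft * 100:.1f} cents per page")
```

A negative interaction coefficient is what makes the per-page cost lower for softcover books than for hardcover books.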
3. Use the data file BEERPROD posted on Blackboard under Assignments. This file contains monthly
U.S. beer production in millions of barrels for January 1983 through December 1991. The
objective in this problem is to develop an extrapolative model for beer production.
a. Construct a model which uses a linear trend, monthly dummies, and a lagged dependent
variable (see footnote 1). Show Stata’s regression output in your solutions.
b. Discuss the results of this model. Be sure to explain if and how beer production is
trending. Also, explain which months are associated with the highest and lowest beer
production, and by how much.
c. What does your model predict beer production will be for January 1992? Remember to
show me your work.
Solution:
(a) The regression output from having regressed beer production on a linear time trend, monthly
dummies, and a lagged dependent variable:
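      Source |       SS           df       MS      Number of obs   =       107
-------------+----------------------------------   F(13, 93)       =     94.02
       Model |  333.215024        13  25.6319249   Prob > F        =    0.0000
    Residual |  25.3532024        93   .27261508   R-squared       =    0.9293
-------------+----------------------------------   Adj R-squared   =    0.9194
       Total |  358.568226       106  3.38271912   Root MSE        =    .52213

------------------------------------------------------------------------------
    beerprod |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       month |
          2  |  -.9342516   .3416837    -2.73   0.007    -1.612768   -.2557355
          3  |    1.08746   .3161917     3.44   0.001      .459566    1.715354
          4  |   .7974535    .460326     1.73   0.087    -.1166628     1.71157
          5  |   1.866525   .4717713     3.96   0.000     .9296807    2.803369
          6  |   1.847002   .5685113     3.25   0.002      .718051    2.975952
          7  |   1.698118   .5886596     2.88   0.005     .5291566    2.867079
          8  |   1.143453   .5795242     1.97   0.051    -.0073668    2.294274
          9  |  -1.236346   .5277144    -2.34   0.021    -2.284283   -.1884103
         10  |  -.5326708    .328952    -1.62   0.109    -1.185904    .1205626
         11  |  -2.287266   .3389528    -6.75   0.000     -2.96036   -1.614173
         12  |  -2.440852   .2588506    -9.43   0.000    -2.954878   -1.926826
             |
    beerprod |
         L1. |   .2199934   .1010703     2.18   0.032     .0192879     .420699
------------------------------------------------------------------------------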
(b) The model “explains” almost 93% of the variation in beer production. All of the explanatory
variables are statistically significant at the 5% level except for the dummies on April
(p = 0.087), August (p = 0.051, which narrowly misses the 5% threshold), and October (p = 0.109).
The linear trend is statistically significant, but the size of the coefficient is quite small
(about 0.0078 million barrels per month), suggesting that the trend is not of much economic
importance. The regression model uses January as the base category. Thus, holding the trend and
the previous month's production constant, beer production is highest in May (month 5), with an
additional 1.87 million barrels of beer produced in this month relative to January. Beer
production is lowest in December (month 12), when production is about 2.4 million barrels less
than production in January.
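The summary statistics in the regression output can be cross-checked by hand. The sketch below recomputes the sample size, degrees of freedom, R-squared, adjusted R-squared, and F statistic from the sums of squares reported in part (a); the variable names are illustrative only.

```python
# Sums of squares copied from the regression output in part (a).
ss_model, ss_resid, ss_total = 333.215024, 25.3532024, 358.568226

# January 1983 through December 1991 is 9 years of monthly data.
months_in_sample = (1991 - 1983 + 1) * 12       # 108
usable_obs = months_in_sample - 1               # the first month has no lagged value

n_regressors = 1 + 11 + 1                       # trend + 11 month dummies + lag
df_resid = usable_obs - n_regressors - 1        # subtract one more for the intercept

r2 = ss_model / ss_total
adj_r2 = 1 - (1 - r2) * (usable_obs - 1) / df_resid
f_stat = (ss_model / n_regressors) / (ss_resid / df_resid)

print(usable_obs, df_resid)
print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2))
```

The results agree with the reported output: 107 observations, F(13, 93) = 94.02, R-squared of 0.9293, and adjusted R-squared of 0.9194.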
Footnote 1: This is time series data, so you first need to format the data as such: (i) generate time =
tm(1983m1)+_n-1; (ii) format time %tm; and (iii) tsset time, monthly. Next, create a variable “month” that
contains the values 1, 2, 3, …, 12 and then repeats this sequence: egen month = fill(1 2 3 4 5 6 7 8 9 10
11 12 1 2 3 4 5 6 7 8 9 10 11 12). Now you can execute the regression command using Stata’s factor-variable
notation as a shortcut for the monthly dummies: regress beerprod time i.month L.beerprod.
(c) For January 1992, the trend equals 109 (counting January 1983 as period 1, so that December 1991
is period 108); the value of beer production in the previous month (December 1991) is 13.64; and
all of the monthly seasonal dummies equal zero since January is the base category. The
predicted beer production is about 14 million barrels for January 1992:
$$\widehat{bp}_t = 10.159 + 0.0078 \times trend_t + 0.22 \times bp_{t-1}$$
$$\widehat{bp}_t = 10.159 + 0.0078 \times 109 + 0.22 \times 13.64 = 14.01$$
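The prediction can be reproduced with a few lines of arithmetic. The sketch below uses the rounded coefficients from the equation above; the function name predict_beer and the month_effect argument are illustrative assumptions, not part of the assignment.

```python
def predict_beer(trend: int, lagged_bp: float, month_effect: float = 0.0) -> float:
    """One-step-ahead prediction of beer production (millions of barrels).

    month_effect is the coefficient on the relevant monthly dummy; it is zero
    for January, the base category.
    """
    intercept, b_trend, b_lag = 10.159, 0.0078, 0.22  # rounded coefficients
    return intercept + b_trend * trend + b_lag * lagged_bp + month_effect


# January 1992: trend = 109, December 1991 production = 13.64, no dummy effect.
jan_1992 = predict_beer(trend=109, lagged_bp=13.64)
print(round(jan_1992, 2))  # 14.01
```

For any month other than January, the coefficient on that month's dummy would be passed as month_effect.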
4. The purpose of this question is to help prepare you for the research project. Please read the
section “Literature review” below before answering (a). Similarly, please read the section
“Formatted regression results” before answering (b).
a. Consider the abstract to Webber and Ehrenberg (2010). Use a few sentences to
summarize what you learned about the relationship between university expenditures
and graduation rates.
b. Open Stata and load the practice dataset on 1978 automobiles (command: sysuse
auto.dta). Use Stata to construct a formatted regression table that contains the
following regressions:
regress mpg weight
regress mpg weight foreign
regress mpg weight foreign headroom
regress mpg weight foreign headroom trunk
Paste your formatted regression table into your solutions. What is the relationship
between weight and mpg, if any? How does this relationship change as you add control
variables to the model?
Footnote 2: In the parlance of regression analysis, your Y variable is a university’s graduation rate and the X variable is
university expenditures (or a type thereof).
Footnote 3: Webber, D. A., and R. G. Ehrenberg (2010). “Do expenditures other than instructional expenditures affect graduation
rates and persistence rates in American higher education?” Economics of Education Review, 29(6), 947–958.
*READ THIS BEFORE ANSWERING PART (B)*
Formatted regression results.
It is considered sloppy—and thus a big no-no—to merely copy/paste regression results from Stata (or
whatever program) directly into your research paper. Instead, regression results are usually formatted in
columns: the left-most column contains the names of the explanatory variables, and the adjoining column(s)
contain (i) the estimated coefficients, (ii) the standard errors in parentheses, and (iii) statistical
significance at the 10%, 5%, or 1% level, indicated by asterisks.
For example, open Webber and Ehrenberg’s working paper and go to Table 2 (on p. 31). This table
summarizes the results from 3 regressions all of which use graduation rate as the dependent variable.
The left-most column contains the explanatory variables (STUDENT, ACADEMIC, etc.). Column (1)
contains the regression results when graduation rate (Y) is regressed on 5 explanatory variables
(STUDENT, ACADEMIC, RESEARCH, INSTRUCTION, and PELL) plus the intercept/constant. Specifically,
column (1) shows the estimated coefficient and the standard error in parentheses. Asterisks are used to
show statistical significance at the 10% level (*); 5% level (**); and 1% level (***). If there is not an
asterisk next to an estimated coefficient, then it is not statistically significant (i.e., the t statistic is so
small that we cannot reject 𝐻0 : 𝛽 = 0).
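The asterisk convention described above can be written as a simple rule. The sketch below is a hypothetical helper, not something outreg2 exposes; it maps a p-value to asterisks and formats one table cell, using three rows from the beer production output in Question 3 as sample inputs.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the usual asterisks: *** 1%, ** 5%, * 10%."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""


def format_cell(coef: float, std_err: float, p_value: float) -> str:
    """Coefficient with stars, followed by the standard error in parentheses."""
    return f"{coef:.3f}{significance_stars(p_value)} ({std_err:.3f})"


# Coefficient, standard error, and p-value taken from the Question 3 output.
print(format_cell(0.2199934, 0.1010703, 0.032))   # 0.220** (0.101)
print(format_cell(0.7974535, 0.460326, 0.087))    # 0.797* (0.460)
print(format_cell(-0.5326708, 0.328952, 0.109))   # -0.533 (0.329)
```

A coefficient with no asterisk, such as the last one, is not statistically significant at the 10% level.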
Column (2) shows a similar regression except that some additional explanatory variables are used in the
regression (HBCU, HISPANIC, etc). And then column (3) regresses graduation rate on all of the
explanatory variables shown. What is the purpose of this? Remember that when you add an explanatory
variable you are now controlling for it.
For example, the variable STUDENT represents student service expenditures at a university (e.g., student
organizations, student health services). Column (1) shows that STUDENT is estimated to be positively
related with graduation rates (i.e., the coefficient is 0.263), and this is statistically significant at the 1%
level. However, columns (2) and (3) show us how this relationship between STUDENT and graduation
rates changes as more and more variables are controlled for. In this case, STUDENT remains statistically
significant (albeit at the 5% level in column (3)), but the size of the estimated coefficient decreases,
which is expected since more and more variables are being controlled for.
Creating a regression table analogous to Table 2 in Webber and Ehrenberg’s paper takes time.
Fortunately, Stata has a command, called outreg2, which will do much of the formatting for you.
i. First, the outreg2 command needs to be installed on your computer (you need only do
this once on a computer). Run the command: ssc install outreg2
ii. Run your first regression. Afterwards, run the command: outreg2 using
Stata_outreg2.doc, replace
iii. Run subsequent regressions. After each regression, run the command: outreg2
using Stata_outreg2.doc, append
After you run all of the regression and outreg2 commands, look in the Results window in Stata. You
should see “Stata_outreg2.doc” in blue font. Click this link, and a Word document opens containing a
formatted regression table.
Solution:
(a) SUMMARY: Webber and Ehrenberg (2010) test whether different types of university spending
(i.e., instruction, academic support, student services, and research) are related with a
university’s graduation rate. The study finds that spending on student services is positively
related with graduation rates, especially for universities with low graduation rates to begin with.
(b) The formatted regression table is shown below. The estimated relationship between mpg and
weight is negative in all four regressions. Regression (1) shows that, without any controls, a
1,000-pound increase in a car’s weight is associated with 6.01 fewer miles per gallon. The magnitude
of this relationship weakens slightly as more control variables are added to the model (except
for (3)). In regression (4), which has the full set of controls, a 1,000-pound increase in a car’s
weight is associated with 5.5 fewer miles per gallon.
Observations 74 74 74 74
R-squared 0.652 0.652 0.663 0.671
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1