
ECON 556X Problem Set 2

Ivan Korolev∗

Due Date: in class on April 8, 2019


Note: please submit hard copies of your answers and email me your code. Submissions
without code will be considered incomplete and will receive only partial credit, even if
all answers are correct.
Reminder: Collaboration on homework assignments is encouraged. However, you are
required to work out the details yourself. Identical assignments are not allowed and will
be penalized. Late answers cannot be accepted.

1. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order
to compute the LOOCV test error estimate. Alternatively, one could compute those
quantities using just the glm() and predict.glm() functions, and a for loop. You
will now take this approach in order to compute the LOOCV error for a simple logistic
regression model on the Weekly data set. Recall that in the context of classification
problems, the LOOCV error is given in (5.4).

(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using
all but the first observation.
(c) Use the model from (b) to predict the direction of the first observation. You can do
this by predicting that the first observation will go up if P(Direction="Up"|Lag1,
Lag2) > 0.5. Was this observation correctly classified?
(d) Write a for loop from i = 1 to i = n, where n is the number of observations in
the data set, that performs each of the following steps:
i. Fit a logistic regression model using all but the ith observation to predict
Direction using Lag1 and Lag2.
ii. Compute the posterior probability of the market moving up for the ith
observation.

∗ Department of Economics, Binghamton University. E-mail: ikorolev@binghamton.edu. The problems
are borrowed from ISLR.

iii. Use the posterior probability for the ith observation in order to predict whether
or not the market moves up.
iv. Determine whether or not an error was made in predicting the direction for
the ith observation. If an error was made, then indicate this as a 1, and
otherwise indicate it as a 0.
(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV
estimate for the test error. Comment on the results.
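As a hint, the loop in part (d) can be sketched as follows. This is one possible skeleton, not the only valid approach; it assumes the Weekly data ships with the ISLR package and that "Up" is the second factor level of Direction, so predict() with type = "response" returns P(Direction = "Up"):

```r
# Starter skeleton for part (d); assumes the Weekly data from the ISLR package.
library(ISLR)

n <- nrow(Weekly)
errs <- rep(0, n)
for (i in 1:n) {
  # i. Fit the model leaving out the ith observation
  fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  # ii. Posterior probability of "Up" for the held-out observation
  prob <- predict(fit, newdata = Weekly[i, ], type = "response")
  # iii. Predict "Up" if the probability exceeds 0.5
  pred <- ifelse(prob > 0.5, "Up", "Down")
  # iv. Record 1 if the prediction was wrong, 0 otherwise
  errs[i] <- as.numeric(pred != Weekly$Direction[i])
}
mean(errs)  # LOOCV error estimate for part (e)
```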

2. We will now perform cross-validation on a simulated data set.

(a) Generate a simulated data set as follows:

> set.seed(1)
> x=rnorm(100)
> y=x-2*x^2+rnorm(100)

In this data set, what is n and what is p? Write out the model used to generate
the data in equation form.
(b) Create a scatterplot of X against Y. Comment on what you find.
(c) Set a random seed, and then compute the LOOCV errors that result from fitting
the following four models using least squares:
i. Y = β0 + β1 X + ε
ii. Y = β0 + β1 X + β2 X^2 + ε
iii. Y = β0 + β1 X + β2 X^2 + β3 X^3 + ε
iv. Y = β0 + β1 X + β2 X^2 + β3 X^3 + β4 X^4 + ε
Note you may find it helpful to use the data.frame() function to create a single
data set containing both X and Y .
(d) Repeat (c) using another random seed, and report your results. Are your results
the same as what you got in (c)? Why?
(e) Which of the models in (c) had the smallest LOOCV error? Is this what you
expected? Explain your answer.
(f) Comment on the statistical significance of the coefficient estimates that result
from fitting each of the models in (c) using least squares. Do these results agree
with the conclusions drawn based on the cross-validation results?
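As a hint for part (c), the LOOCV errors can be computed with cv.glm() from the boot package, whose default (K equal to n) gives leave-one-out cross-validation. A minimal sketch, assuming the simulated data from part (a):

```r
# LOOCV errors for polynomial fits of degree 1 through 4; uses the boot package.
library(boot)

set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
df <- data.frame(x, y)

cv.errs <- rep(0, 4)
for (d in 1:4) {
  fit <- glm(y ~ poly(x, d), data = df)      # glm() with gaussian family = least squares
  cv.errs[d] <- cv.glm(df, fit)$delta[1]     # default K = n, i.e. LOOCV
}
cv.errs
```

Because LOOCV fits every n leave-one-out splits deterministically, the results in part (d) should not depend on the random seed set before calling cv.glm().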

3. We will now consider the Boston housing data set, from the MASS library.

(a) Based on this data set, provide an estimate for the population mean of medv. Call
this estimate µ̂.
(b) Provide an estimate of the standard error of µ̂. Interpret this result. Hint: We can
compute the standard error of the sample mean by dividing the sample standard
deviation by the square root of the number of observations.
(c) Now estimate the standard error of µ̂ using the bootstrap. How does this compare
to your answer from (b)?
(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for
the mean of medv. Compare it to the results obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula [µ̂ −
2SE(µ̂), µ̂ + 2SE(µ̂)].
(e) Based on this dataset, provide an estimate, µ̂med , for the median value of medv in
the population.
(f) We now would like to estimate the standard error of µ̂med . Unfortunately, there
is no simple formula for computing the standard error of the median. Instead,
estimate the standard error of the median using the bootstrap. Comment on your
findings.
(g) Based on this data set, provide an estimate for the tenth percentile of medv in
Boston suburbs. Call this quantity µ̂0.1 . (You can use the quantile() function.)
(h) Use the bootstrap to estimate the standard error of µ̂0.1 . Comment on your
findings.
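As a hint for parts (c), (f), and (h), the boot() function from the boot package takes a statistic written as a function of the data and a resampling index. A minimal sketch for the median in part (f); the other statistics follow the same pattern:

```r
# Bootstrap standard error of the sample median of medv; uses MASS and boot.
library(MASS)   # provides the Boston data set
library(boot)

# Statistic as a function of the data and a bootstrap resampling index
med.fn <- function(data, index) median(data$medv[index])

set.seed(1)
boot(Boston, med.fn, R = 1000)  # "std. error" column estimates SE(mu_hat_med)
```

Replacing median() with mean() or with quantile(..., probs = 0.1) adapts the same call to parts (c) and (h).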

4. In this exercise, we will generate simulated data, and will then use this data to perform
best subset selection.

(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as
a noise vector ε of length n = 100.
(b) Generate a response vector Y of length n = 100 according to the model

Y = β0 + β1 X + β2 X^2 + β3 X^3 + ε,

where β0 , β1 , β2 , and β3 are constants of your choice.


(c) Use the regsubsets() function to perform best subset selection in order to choose
the best model containing the predictors X, X^2, ..., X^10. What is the best model
obtained according to Cp, BIC, and adjusted R^2? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.
Note you will need to use the data.frame() function to create a single data set
containing both X and Y .
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise
selection. How does your answer compare to the results in (c)?
(e) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as
predictors. Use cross-validation to select the optimal value of λ. Create plots of the
cross-validation error as a function of λ. Report the resulting coefficient estimates,
and discuss the results obtained.
(f) Now generate a response vector Y according to the model

Y = β0 + β7 X^7 + ε

and perform best subset selection and the lasso. Discuss the results obtained.
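As a hint for part (c), regsubsets() comes from the leaps package, and the ten polynomial terms can be built with poly(..., raw = TRUE). A minimal sketch, assuming x, y, and the true coefficients were generated in parts (a) and (b):

```r
# Best subset selection over X, X^2, ..., X^10; uses the leaps package.
library(leaps)

# x and y are assumed to come from parts (a)-(b)
df <- data.frame(y = y, x = x)
fit <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10)
summ <- summary(fit)

# Model size preferred by each criterion
which.min(summ$cp)
which.min(summ$bic)
which.max(summ$adjr2)

coef(fit, which.min(summ$bic))  # coefficients of, e.g., the BIC-best model
```

For parts (d) and (e), the same formula works with regsubsets(..., method = "forward") or method = "backward", and with glmnet::cv.glmnet() for the lasso.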

5. In this exercise, we will predict the number of applications received using the other
variables in the College data set.

(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error
obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation.
Report the test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report
the test error obtained, along with the number of non-zero coefficient estimates.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
(g) Comment on the results obtained. How accurately can we predict the number
of college applications received? Is there much difference among the test errors
resulting from these five approaches?
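As a hint for part (c), the glmnet package handles ridge regression (alpha = 0) and the lasso (alpha = 1), with cv.glmnet() selecting λ by cross-validation. A minimal sketch under an assumed 50/50 train/test split; the split proportion is a choice, not part of the problem:

```r
# Ridge regression on the College data with lambda chosen by CV; uses ISLR and glmnet.
library(ISLR)
library(glmnet)

set.seed(1)
train <- sample(nrow(College), nrow(College) / 2)   # assumed 50/50 split

# glmnet requires a numeric design matrix; drop the intercept column
x <- model.matrix(Apps ~ ., College)[, -1]
y <- College$Apps

cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)  # alpha = 0 gives ridge
pred <- predict(cv.out, s = "lambda.min", newx = x[-train, ])
mean((pred - y[-train])^2)  # test MSE for part (c)
```

Setting alpha = 1 in the same call gives the lasso for part (d); pcr() and plsr() from the pls package cover parts (e) and (f).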

6. We have seen that as the number of features used in a model increases, the training
error will necessarily decrease, but the test error may not. We will now explore this in
a simulated data set.

(a) Generate a data set with p = 20 features, n = 1,000 observations, and an
associated quantitative response vector generated according to the model
Y = X′β + ε = ∑_{j=1}^{p} Xj βj + ε,

where β has some elements that are exactly equal to zero.


(b) Split your dataset into a training set containing 100 observations and a test set
containing 900 observations.
(c) Perform best subset selection on the training set, and plot the training set MSE
associated with the best model of each size.
(d) Plot the test set MSE associated with the best model of each size.
(e) For which model size does the test set MSE take on its minimum value? Comment
on your results. If it takes on its minimum value for a model containing only an
intercept or a model containing all of the features, then play around with the way
that you are generating the data in (a) until you come up with a scenario in which
the test set MSE is minimized for an intermediate model size.
(f) How does the model at which the test set MSE is minimized compare to the true
model used to generate the data?
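One possible way to carry out parts (a) and (b); which coefficients are zeroed out, and the seed, are arbitrary choices you are free to change:

```r
# Simulated data for parts (a)-(b): p = 20 features, n = 1000 observations,
# with some true coefficients exactly zero.
set.seed(1)
n <- 1000
p <- 20

X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)

beta <- rnorm(p)
beta[c(2, 5, 9, 13, 19)] <- 0        # arbitrary subset of exact zeros
y <- drop(X %*% beta + rnorm(n))

train <- sample(n, 100)              # 100 training, 900 test observations
```

From here, regsubsets(..., nvmax = p) on the training rows gives the best model of each size for parts (c) and (d).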

7. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best
subset selection, the lasso, ridge regression, and PCR. Present and discuss results
for the approaches that you consider.
(b) Propose a model (or set of models) that seem to perform well on this data set,
and justify your answer. Make sure that you are evaluating model performance
using validation set error, cross-validation, or some other reasonable alternative,
as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why
not?
