
STATS 330 / STATS 762

Midterm Test Version 1


8:00-9:00, Tue 17th September, 2013

Instructions
Answer ALL 15 questions on the answer sheet provided. Start by writing your name
and UPI in the fields given.

All questions have a single correct answer and carry the same mark value.

If you give more than one answer to any question you will receive zero marks for that
question.

Incorrect answers are not penalised.

1. Suppose we have a data set consisting of three continuous variables W, X and Y.
We want to fit a regression model using Y as the response. Five types of plot are
listed below. Which of the following plots is the most suitable to assess whether the
explanatory variables need transformations?

(a) A GAM plot


(b) An Added variable plot
(c) A Coplot
(d) A Box-Cox plot
(e) A Leverage-Residual plot

[Figure: coplot with panels conditioned on W ("Given : W"); vertical axis: Y.]

Figure 1: Plot for Question 2.

2. Figure 1 shows a plot of the data for Question 1. Which of the following statements is
not correct:

(a) The values of Y are smaller than 30.


(b) For fixed X, as W goes up, the response tends to go up.
(c) The regression coefficient of X is negative.
(d) The plot indicates that a linear model is appropriate.
(e) The fifth panel shows a high-leverage point.

3. In a linear regression model, which of the following is the most important
assumption?

(a) The errors are uncorrelated.


(b) The mean of the response is a linear function of the explanatory variables.
(c) The variances are equal.
(d) The observations are independent.
(e) The responses are normally distributed.

[Figure: four panels (High/groundnut, High/soybean, Low/groundnut, Low/soybean); vertical axis: Chicken weight; horizontal axis: Protein level (0, 1, 2).]

Figure 2: Plot for Question 4.

4. Figure 2 shows a plot of the chicken weight data set discussed in lecture. The response
is the weight of 24 chickens, which is thought to depend on the type of diet (groundnut
or soybean), amount of fish solubles (High or Low) and protein level (0, 1, 2). Which of
the following statements is correct:

(a) A protein level of 0 affects all diets the same.


(b) Chicken weights decrease with increased protein levels in a soybean diet.
(c) A high fish soluble content diet leads to lower chicken weights compared to
a low fish soluble content diet.
(d) Chicken weights decrease with increased protein levels in a groundnut diet.
(e) Groundnut-fed chickens are heavier than soybean-fed chickens.

5. Which of the following statements is correct:

(a) In exceptional cases, R^2 can be negative.
(b) Deleting an observation from the data always increases R^2.
(c) R^2 is the most important statistic to assess a model.
(d) R^2 will be small if serial correlation is present in the data.
(e) The adjusted R^2 is never larger than R^2.

6. In a regression, a pair of explanatory variables, X1 and X2, have a correlation of 0.95
and p-values considerably greater than 0.05. Which of the following is the worst
interpretation?

(a) One or more of the VIFs will be large.


(b) Either X1 or X2 could be deleted from the regression.
(c) Both variables are unimportant in the regression.
(d) It is possible that X1 is related to the response.
(e) It is possible that X2 is related to the response.

7. The data for the following questions come from a study of the operation of a plant
oxidising ammonia into nitric acid. The measurements have been taken over the duration
of 21 days. The variables measured are

air.flow: Rate of plant operation


water.temp: Temperature of cooling water circulating (°C)
acid.conc: Concentration of acid circulating (%)
stack.loss: Percentage of incoming ammonia escaping unabsorbed (%, response).

Examine the R output below and select the most correct statement on the basis
of this output only.

Call:
lm(formula = stack.loss ~ air.flow + water.temp + acid.conc)
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.61416    8.90213   0.406  0.68982
air.flow     0.07156    0.01349   5.307  5.8e-05 ***
water.temp   0.12953    0.03680   3.520  0.00263 **
acid.conc   -0.15212    0.15629  -0.973  0.34405
---
Residual standard error: 0.3243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

(a) If the variables acid.conc and water.temp are held constant, the amount of
ammonia escaping tends to be smaller with increased air.flow.
(b) The estimate of the error variance is 0.3243.
(c) The variable acid.conc should be kept in the model.
(d) If the variables acid.conc and air.flow are held constant, the amount of
ammonia escaping tends to be higher with increased water.temp.
(e) The higher acid.conc, the more ammonia escapes.

[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]

Figure 3: Diagnostic plot for Question 8.

8. Figure 3 shows diagnostic plots for the model given above. What is most strongly
indicated? Hint: The threshold for hat leverage is 3(k + 1)/n.

(a) There are several high-leverage points.


(b) The response has positive autocorrelation.
(c) There is a low-level outlier.
(d) The model looks planar.
(e) Some of the variables might need transforming.

[Figure: Box-Cox plot; vertical axis: log-Likelihood, with the 95% level marked.]

Figure 4: Box-Cox plot for Question 9.

9. After seeing the diagnostic plots in the previous question, we check whether we should
transform the response. The resulting Box-Cox plot is shown in Figure 4. What should
we do?

(a) We should transform using the log.


(b) We shouldn't transform the response; we should transform the explanatory variables.
(c) We should transform using the reciprocal.
(d) We should do nothing, no transformation is indicated.
(e) We should square the response.

10. After taking some corrective action, a new model was fitted. Some influence plots for
the new model are shown in Figure 5. Which of the following statements is not a
correct interpretation of these plots?

[Figure: eight index plots against Obs. number: dfb.1_, dfb.ar.f, dfb.wtr., dfb.acd., DFFITS, ABS(COV RATIO-1), Cook's D, Hats.]

Figure 5: Influence plots for Question 10.


(a) Observation 17 is having an effect on the standard errors of the estimated
coefficients.
(b) Observation 21 is having an effect on the coefficient for air.flow.
(c) Observation 21 is having an effect on the fitted values.
(d) Observation 17 is a high-leverage point.
(e) Observation 21 is having an effect on the coefficient for water.temp.

11. We look at all possible regressions for the above data set. Which of the following
statements is not correct, based on the output?

   rssp sigma2 adjRsq     Cp    AIC    BIC    CV air.flow water.temp acid.conc
1 0.066  0.003  0.848 12.383 33.383 35.472 0.008        1          0         0
2 0.038  0.002  0.907  2.000 23.000 26.134 0.005        1          1         0
3 0.038  0.002  0.901  4.000 25.000 29.178 0.006        1          1         1

(a) Mallows Cp and Cross Validation prefer the same model.


(b) AIC and BIC prefer the same model.
(c) The full model is preferred based on all criteria.
(d) sigma^2 cannot distinguish between the models stack.loss ~ air.flow + water.temp
and the full model.
(e) The model stack.loss ~ air.flow is not a good model.

[Figure: four scatter plots of y against x, panels (i) to (iv), with labelled points A, B, C and D.]

Figure 6: Plots for Question 12.

12. Figure 6 shows four scatter plots of a variable y vs. x for four different data sets. In
each case a regression of y on x was fitted. Which of the following is false?

(a) In plot (i) removing point A will increase R^2.
(b) In plot (ii) point B is not likely to be influential.
(c) In plot (iv) point D is likely to be very influential.
(d) In plot (iii) removing point C will increase R^2.
(e) In plot (ii) point B is a high-leverage point.

13. Suppose we have a continuous response Y, a categorical explanatory variable A, and
a continuous explanatory variable X. Which of the following is false?

(a) The code lm(Y ~ A) fits a one-way analysis of variance model.
(b) In the non-parallel lines model, the constant term corresponds to the slope of
the baseline.
(c) The code lm(Y ~ X + A) fits a parallel lines model.
(d) The coefficient of X when fitting the parallel lines model represents the slope of
the lines.
(e) The code lm(Y ~ X) fits a simple linear regression model.

14. The last two questions concern a data set, where the response (the percent conversion
of n-heptane to acetylene) is related to a categorical variable ratio (the ratio between
n-heptane and acetylene with levels low, medium, high) and a continuous variable
temperature (in °C). 16 measurements have been taken. We first fitted a parallel
lines model with the following results:

Call:
lm(formula = percent.conv ~ temp + ratio)
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.8723     2.6040   6.479 3.03e-05 ***
temp          0.3557     0.1767   2.013 0.067083 .
ratiomedium  12.7956     2.4727   5.175 0.000231 ***
ratiohigh    26.6930     2.4959  10.695 1.73e-07 ***
---
Residual standard error: 3.76 on 12 degrees of freedom
Multiple R-squared: 0.9201, Adjusted R-squared: 0.9002
F-statistic: 46.08 on 3 and 12 DF, p-value: 7.348e-07

Which of the following statements is false?

(a) The fitted line for medium ratio is 13.8974 below the fitted line for high ratio.
(b) The fitted line for low ratio has intercept 16.8723.
(c) The slope is not significantly different from 0.
(d) The fitted slope for all ratios is the same.
(e) The fitted line for medium ratio has slope 29.6679.

15. Finally, we fit the non-parallel lines model and compare both models.

> summary(acet.full)
Call:
lm(formula = percent.conv ~ temp * ratio)
---
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.7663     1.5847   4.901 0.000622 ***
temp               1.2484     0.1425   8.758 5.28e-06 ***
ratiomedium       19.0732     2.0083   9.497 2.55e-06 ***
ratiohigh         45.2763     2.1285  21.271 1.17e-09 ***
temp:ratiomedium  -0.6732     0.1670  -4.031 0.002397 **
temp:ratiohigh    -1.5948     0.1731  -9.215 3.34e-06 ***
---
Residual standard error: 1.26 on 10 degrees of freedom
Multiple R-squared: 0.9925, Adjusted R-squared: 0.9888
F-statistic: 265.4 on 5 and 10 DF, p-value: 2.721e-10

> anova(acet.par,acet.full)
Analysis of Variance Table

Model 1: percent.conv ~ temp + ratio
Model 2: percent.conv ~ temp * ratio
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     12 169.620
2     10  15.886  2    153.73 48.388 7.205e-06 ***

Which of the following statements is false?

(a) There is strong evidence that the lines for the ratios are non-parallel.
(b) Under the parallel lines model, the estimated percentage of conversion for a
medium mixture at a temperature of 15 °C is 22.20718%.
(c) The parallel lines model has a residual sum of squares that is 153.73 higher than
that of the non-parallel lines model.
(d) Under the non-parallel lines model, the estimated percentage of conversion for a
low ratio mixture at a temperature of 10 °C is 20.25032%.
(e) The fitted line for a high ratio in the non-parallel lines model has slope -0.3464.

STATS 330 / STATS 762

Midterm Test Version 1


Tue 17th September, 2013

Fill in your name and upi

Name:

UPI:

a b c d e
1 O O O O O
2 O O O O O
3 O O O O O
4 O O O O O
5 O O O O O
6 O O O O O
7 O O O O O
8 O O O O O
9 O O O O O
10 O O O O O
11 O O O O O
12 O O O O O
13 O O O O O
14 O O O O O
15 O O O O O
