2013 Test V1

STATS 330 / STATS 762
Midterm Test Version 1

8:00-9:00, Tue 17th September, 2013
Instructions
Answer ALL 15 questions on the answer sheet provided. Start by writing your name
and UPI in the fields given.
All questions have a single correct answer and carry the same mark value.
If you give more than one answer to any question you will receive zero marks for that
question.
Incorrect answers are not penalised.
1. Suppose we have a data set consisting of three continuous variables W, X and Y.
We want to fit a regression model using Y as the response. Five types of plot are
listed below. Which of the following plots is the most suitable to assess whether the
explanatory variables need transformations?
(a) A GAM plot
(b) An added variable plot
(c) A coplot
(d) A Box-Cox plot
(e) A leverage-residual plot
[Figure 1 image not reproduced. Panels of Y, conditioned on W (panel header "Given : W").]
Figure 1: Plot for Question 2.
2. Figure 1 shows a plot of the data for Question 1. Which of the following statements is
not correct:
(a) The values of Y are smaller than 30.
(b) For fixed X, as W goes up, the response tends to go up.
(c) The regression coefficient of X is negative.
(d) The plot indicates that a linear model is appropriate.
(e) The fifth panel shows a high-leverage point.
3. In a linear regression model, which of the following is the most important
assumption?
(a) The errors are uncorrelated.
(b) The mean of the response is a linear function of the explanatory variables.
(c) The variances are equal.
(d) The observations are independent.
(e) The responses are normally distributed.
[Figure 2 image not reproduced. Four panels (High groundnut, High soybean, Low groundnut, Low soybean) showing Chicken weight against Protein level.]
Figure 2: Plot for Question 4.
4. Figure 2 shows a plot of the chicken weight data set discussed in lecture. The response
is the weight of 24 chickens, which is thought to depend on the type of diet (groundnut
or soybean), the amount of fish solubles (High or Low) and the protein level (0, 1, 2).
Which of the following statements is correct:
(a) A protein level of 0 affects all diets the same.
(b) Chicken weights decrease with increased protein levels in a soybean diet.
(c) A high fish soluble content diet leads to lower chicken weights compared to
a low fish soluble content diet.
(d) Chicken weights decrease with increased protein levels in a groundnut diet.
(e) Groundnut-fed chickens are heavier than soybean-fed chickens.
5. Which of the following statements is correct:
(a) In exceptional cases, R² can be negative.
(b) Deleting an observation from the data always increases R².
(c) R² is the most important statistic to assess a model.
(d) R² will be small if serial correlation is present in the data.
(e) The adjusted R² is never larger than R².
6. In a regression, a pair of explanatory variables, X1 and X2, have a correlation of 0.95
and p-values considerably greater than 0.05. Which of the following is the worst
interpretation?
(a) One or more of the VIFs will be large.
(b) Either X1 or X2 could be deleted from the regression.
(c) Both variables are unimportant in the regression.
(d) It is possible that X1 is related to the response.
(e) It is possible that X2 is related to the response.
7. The data for the following questions come from a study of the operation of a plant
oxidising ammonia into nitric acid. The measurements were taken over a period
of 21 days. The variables measured are
air.flow: Rate of plant operation
water.temp: Temperature of cooling water circulating (°C)
acid.conc: Concentration of acid circulating (%)
stack.loss: Percentage of incoming ammonia escaping unabsorbed (%, response).
Examine the R output below and select the most correct statement on the basis
of this output only.
Call:
lm(formula = stack.loss ~ air.flow + water.temp + acid.conc)
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.61416    8.90213   0.406  0.68982
air.flow     0.07156    0.01349   5.307  5.8e-05 ***
water.temp   0.12953    0.03680   3.520  0.00263 **
acid.conc   -0.15212    0.15629  -0.973  0.34405
---
Residual standard error: 0.3243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
(a) If the variables acid.conc and water.temp are held constant, the amount of
ammonia escaping tends to be smaller with increased air.flow.
(b) The estimate of the error variance is 0.3243.
(c) The variable acid.conc should be kept in the model.
(d) If the variables acid.conc and air.flow are held constant, the amount of
ammonia escaping tends to be higher with increased water.temp.
(e) The higher acid.conc, the more ammonia escapes.
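As a reading aid for the summary output above, the following sketch recomputes three of the printed quantities from the others. It uses only numbers copied from that output; the variable names are illustrative assumptions and not part of the original analysis.

```r
# Numbers copied from the printed summary; no data set is needed.
rse <- 0.3243   # residual standard error
r2  <- 0.9136   # multiple R-squared
n   <- 21       # observations (one per day)
k   <- 3        # explanatory variables

# The residual standard error is on the scale of the response;
# squaring it gives the estimate of the error variance.
sigma2_hat <- rse^2

# Adjusted R-squared penalises R-squared for the number of terms.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Each t value is the estimate divided by its standard error.
t_air_flow <- 0.07156 / 0.01349

round(c(sigma2_hat = sigma2_hat, adj_r2 = adj_r2, t_air_flow = t_air_flow), 4)
```

The recomputed values agree with the printed ones up to rounding of the inputs.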
[Figure 3 image not reproduced. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours).]
Figure 3: Diagnostic plot for Question 8.
8. Figure 3 shows diagnostic plots for the model given above. What is most strongly
indicated? Hint: The threshold for hat leverage is 3(k + 1)/n.
(a) There are several high-leverage points.
(b) The response has positive autocorrelation.
(c) There is a low-level outlier.
(d) The model looks planar.
(e) Some of the variables might need transforming.
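The hint's threshold can be evaluated directly for the model of Question 7 (k = 3 explanatory variables, n = 21 observations). The sketch below does that and then shows, on simulated data only, how hat values could be compared with the threshold. The data frame `d` and the model `fit` are hypothetical stand-ins, not the plant data.

```r
# Threshold from the hint, using k and n from the model of Question 7.
k <- 3                        # explanatory variables
n <- 21                       # observations
threshold <- 3 * (k + 1) / n  # 12/21, about 0.571

# Illustration on simulated data (NOT the plant data): flag hat values
# that exceed the threshold.
set.seed(330)
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + d$x1 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3, data = d)
h <- hatvalues(fit)
which(h > threshold)          # indices of high-leverage points, possibly none
```

The hat values of a linear model with an intercept sum to k + 1, so their average is (k + 1)/n and the threshold is three times that average.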
[Figure 4 image not reproduced. Box-Cox plot: log-likelihood curve with the 95% level marked.]
Figure 4: Box-Cox plot for Question 9.
9. After seeing the diagnostic plots in the previous question, we check whether we should
transform the response. The resulting Box-Cox plot is shown in Figure 4. What should
we do?
(a) We should transform using the log.
(b) We shouldn't transform the response; we should transform the explanatory
variables.
(c) We should transform using the reciprocal.
(d) We should do nothing, no transformation is indicated.
(e) We should square the response.
10. After taking some corrective action, a new model was fitted. Some influence plots for
the new model are shown in Figure 5. Which of the following statements is not a
correct interpretation of these plots?
[Figure 5 image not reproduced. Eight index plots against Obs. number: dfb.1_, dfb.ar.f, dfb.wtr., dfb.acd., DFFITS, ABS(COV RATIO-1), Cook's D, Hats.]
Figure 5: Influence plots for Question 10.
(a) Observation 17 is having an effect on the standard errors of the estimated
coefficients.
(b) Observation 21 is having an effect on the coefficient for air.flow.
(c) Observation 21 is having an effect on the fitted values.
(d) Observation 17 is a high-leverage point.
(e) Observation 21 is having an effect on the coefficient for water.temp.
11. We look at all possible regressions for the above data set. Which of the following
statements is not correct, based on the output?
   rssp sigma2 adjRsq     Cp    AIC    BIC    CV air.flow water.temp acid.conc
1 0.066  0.003  0.848 12.383 33.383 35.472 0.008        1          0         0
2 0.038  0.002  0.907  2.000 23.000 26.134 0.005        1          1         0
3 0.038  0.002  0.901  4.000 25.000 29.178 0.006        1          1         1
(a) Mallows' Cp and cross-validation prefer the same model.
(b) AIC and BIC prefer the same model.
(c) The full model is preferred based on all criteria.
(d) σ̂² cannot distinguish between the models stack.loss ~ air.flow + water.temp
and the full model.
(e) The model stack.loss ~ air.flow is not a good model.
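The comparison of criteria in the table above can be made mechanical. The following sketch re-enters the printed values in a data frame (the names `apr` and `best` are illustrative assumptions) and reports, for each criterion, the row number of the model it prefers.

```r
# All-possible-regressions table, copied from the printed output.
apr <- data.frame(
  model  = c("air.flow", "air.flow + water.temp",
             "air.flow + water.temp + acid.conc"),
  adjRsq = c(0.848, 0.907, 0.901),
  Cp     = c(12.383, 2.000, 4.000),
  AIC    = c(33.383, 23.000, 25.000),
  BIC    = c(35.472, 26.134, 29.178),
  CV     = c(0.008, 0.005, 0.006)
)

# Smaller is better for Cp, AIC, BIC and CV; larger is better for adjRsq.
best <- c(
  adjRsq = which.max(apr$adjRsq),
  Cp     = which.min(apr$Cp),
  AIC    = which.min(apr$AIC),
  BIC    = which.min(apr$BIC),
  CV     = which.min(apr$CV)
)
best
```

Each entry of `best` is a row index into `apr`, so `apr$model[best]` lists the preferred model per criterion.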
[Figure 6 image not reproduced. Four scatter plots of y against x, panels (i) to (iv), with labelled points A in (i), B in (ii), C in (iii) and D in (iv).]
Figure 6: Plots for Question 12.
12. Figure 6 shows four scatter plots of a variable y vs. x for four different data sets. In
each case a regression of y on x was fitted. Which of the following is false?
(a) In plot (i), removing point A will increase R².
(b) In plot (ii), point B is not likely to be influential.
(c) In plot (iv), point D is likely to be very influential.
(d) In plot (iii), removing point C will increase R².
(e) In plot (ii), point B is a high-leverage point.
13. Suppose we have a continuous response Y, a categorical explanatory variable A, and
a continuous explanatory variable X. Which of the following is false?
(a) The code lm(Y ~ A) fits a one-way analysis of variance model.
(b) In the non-parallel lines model, the constant term corresponds to the slope of
the baseline.
(c) The code lm(Y ~ X + A) fits a parallel lines model.
(d) The coefficient of X when fitting the parallel lines model represents the slope of
the lines.
(e) The code lm(Y ~ X) fits a simple linear regression model.
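The model formulas named in this question can be tried out on simulated data. The sketch below is purely illustrative: the data, the factor levels `a1` to `a3` and the object names are assumptions. The non-parallel lines model is written here as `Y ~ X * A`, by analogy with the `temp * ratio` formula used later in the paper.

```r
# Simulated data, purely illustrative: a three-level factor A,
# a continuous X, and a response Y built from both.
set.seed(1)
A <- factor(rep(c("a1", "a2", "a3"), each = 10))
X <- runif(30, 0, 10)
Y <- 2 + 0.5 * X + c(0, 1, 3)[as.integer(A)] + rnorm(30, sd = 0.3)

fit_anova    <- lm(Y ~ A)      # categorical explanatory variable only
fit_slr      <- lm(Y ~ X)      # continuous explanatory variable only
fit_parallel <- lm(Y ~ X + A)  # one coefficient for X plus an offset per level
fit_nonpar   <- lm(Y ~ X * A)  # adds one X-by-level interaction per level

# Coefficient names show which terms each model estimates.
names(coef(fit_parallel))
names(coef(fit_nonpar))
```

Comparing the two sets of coefficient names shows which extra terms the interaction formula introduces relative to the additive one.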
14. The last two questions concern a data set where the response (the percent conversion
of n-heptane to acetylene) is related to a categorical variable ratio (the ratio between
n-heptane and acetylene, with levels low, medium, high) and a continuous variable
temperature (in °C). 14 measurements have been taken. We first fitted a parallel
lines model with the following results:
Call:
lm(formula = percent.conv ~ temp + ratio)
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.8723     2.6040   6.479 3.03e-05 ***
temp          0.3557     0.1767   2.013 0.067083 .
ratiomedium  12.7956     2.4727   5.175 0.000231 ***
ratiohigh    26.6930     2.4959  10.695 1.73e-07 ***
---
Which of the following statements is false?
(a) The fitted line for medium ratio is 13.8974 below the fitted line for high ratio.
(b) The fitted line for low ratio has intercept 16.8723.
(c) The slope is not significantly different from 0.
(d) The fitted slope for all ratios is the same.
(e) The fitted line for medium ratio has slope 29.6679.
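The quantities in the statements above can be derived from the printed coefficients alone. The following sketch copies those four estimates into a named vector (the names `b`, `intercepts` and `common_slope` are illustrative assumptions) and computes the intercept of each fitted line.

```r
# Coefficients copied from the parallel lines output.
b <- c("(Intercept)" = 16.8723, temp = 0.3557,
       ratiomedium = 12.7956, ratiohigh = 26.6930)

# The baseline level (low) uses the intercept as is;
# the other levels add their own offset to it.
intercepts <- c(
  low    = b[["(Intercept)"]],
  medium = b[["(Intercept)"]] + b[["ratiomedium"]],
  high   = b[["(Intercept)"]] + b[["ratiohigh"]]
)

# The coefficient of temp is shared by all three lines.
common_slope <- b[["temp"]]

intercepts
intercepts[["high"]] - intercepts[["medium"]]  # vertical gap between two lines
```

Because the slope is shared, the vertical gap between any two fitted lines is the same at every temperature.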
15. Finally, we fit the non-parallel lines model and compare both models.
> summary(acet.full)
Call:
lm(formula = percent.conv ~ temp * ratio)
---
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.7663     1.5847   4.901 0.000622 ***
temp               1.2484     0.1425   8.758 5.28e-06 ***
ratiomedium       19.0732     2.0083   9.497 2.55e-06 ***
ratiohigh         45.2763     2.1285  21.271 1.17e-09 ***
temp:ratiomedium  -0.6732     0.1670  -4.031 0.002397 **
temp:ratiohigh    -1.5948     0.1731  -9.215 3.34e-06 ***
---
> anova(acet.par,acet.full)
Analysis of Variance Table
Model 1: percent.conv ~ temp + ratio
Model 2: percent.conv ~ temp * ratio
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     12 169.620
2     10  15.886  2    153.73 48.388 7.205e-06 ***
Which of the following statements is false?
(a) There is strong evidence that the lines for the ratios are non-parallel.
(b) Under the parallel lines model, the estimated percentage of conversion for a
medium mixture at a temperature of 15 °C is 22.20718%.
(c) The parallel lines model has a residual sum of squares that is 153.73 higher than
that of the non-parallel lines model.
(d) Under the non-parallel lines model, the estimated percentage of conversion for a
low ratio mixture at a temperature of 10 °C is 20.25032%.
(e) The fitted line for a high ratio in the non-parallel lines model has slope -0.3464.
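The arithmetic behind the two outputs above can be checked from the printed numbers. The sketch below is a minimal illustration with assumed object names: it derives the slope of each line in the non-parallel model, one fitted value for the baseline level, and the F statistic of the anova table from the two residual sums of squares.

```r
# Coefficients copied from summary(acet.full).
b <- c("(Intercept)" = 7.7663, temp = 1.2484,
       ratiomedium = 19.0732, ratiohigh = 45.2763,
       "temp:ratiomedium" = -0.6732, "temp:ratiohigh" = -1.5948)

# Slope for each level = baseline slope + that level's interaction term.
slopes <- c(
  low    = b[["temp"]],
  medium = b[["temp"]] + b[["temp:ratiomedium"]],
  high   = b[["temp"]] + b[["temp:ratiohigh"]]
)

# Fitted value for the baseline (low) level at temperature 10.
fit_low_10 <- b[["(Intercept)"]] + b[["temp"]] * 10

# F statistic from the residual sums of squares and degrees of freedom
# printed in the anova table.
rss_par  <- 169.620; df_par  <- 12
rss_full <- 15.886;  df_full <- 10
f_stat <- ((rss_par - rss_full) / (df_par - df_full)) / (rss_full / df_full)

round(slopes, 4)
round(c(fit_low_10 = fit_low_10, f_stat = f_stat), 3)
```

Small differences from the printed values in the last decimal places come from using coefficients and sums of squares that were already rounded in the output.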
STATS 330 / STATS 762
Midterm Test Version 1

Tue 17th September, 2013
Fill in your name and upi
Name:
UPI:
a b c d e
1 O O O O O
2 O O O O O
3 O O O O O
4 O O O O O
5 O O O O O
6 O O O O O
7 O O O O O
8 O O O O O
9 O O O O O
10 O O O O O
11 O O O O O
12 O O O O O
13 O O O O O
14 O O O O O
15 O O O O O
