
UNIVERSITY OF CAPE TOWN
DEPARTMENT OF STATISTICAL SCIENCES
STA2020F: BUSINESS STATISTICS
Test 2 - memo

Internal examiners: Hannah Gerber
Date: 24 April 2011
Total number of questions: 1
Time: 1 hour 30 minutes
Total number of pages: 13 (6 + 1 + 6)
Total marks: 50

Instructions:

and formulae have been provided.

1.

2.

3.

4.

5.

6.

[...] and for the stated values [...] mg CO. (2)

H0: all regression coefficients are 0 vs. H1: at least one regression coefficient is not 0. We have a p-value which is less than any reasonable level of significance. We therefore conclude that we have sufficient evidence to say that at least one of the explanatory variables is linearly related to the response variable, CO content. The model is valid. Note: the F-stat value has been removed, it is 78.9838. (4)

H0: the nicotine coefficient is 0 vs. H1: the nicotine coefficient is not 0. We have a p-value which is much greater than any reasonable level of significance. We therefore fail to reject H0 and we have insufficient evidence to say that the nicotine variable is significant in the regression model. (4)

The nicotine coefficient is -2.6317, which means that on average for every additional mg of nicotine, there is 2.6317 mg less CO content, assuming all other variables remain constant. It is worth noting though that the nicotine variable is very highly correlated with the tar variable (0.9766), which strictly speaking means we shouldn’t interpret this variable since multicollinearity is a concern. (3)

The CI about the expected value: We expect the true mean CO content associated with [...], [...] and [...] to occur in the interval [...], with 95% confidence. (3)

The PI: For a single observation at [...], we expect the true CO content to have a value between -0.3348 and 19.0533, with 95% confidence. (2)

Notes: do not penalize if [...] was used in interpretation (due to typo) and be lenient on the level of confidence since it wasn’t stated in the question.

It is 0.9259 and this means that there is a very strong linear relationship between the nicotine and CO content. (2)
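
The multicollinearity remark about nicotine and tar can be given a rough size. The following is a minimal sketch, assuming only the pairwise correlation of 0.9766 quoted in the memo. With two predictors the variance inflation factor is 1 / (1 - r^2); with further predictors in the model the true VIF can only be at least this large, so the result is a lower bound. The function name is illustrative and is not taken from the course material.

```python
def vif_from_pairwise_r(r: float) -> float:
    """Variance inflation factor implied by one pairwise correlation r.

    This is exact for a model with two predictors and a lower bound
    when more predictors are present.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# 0.9766 is the nicotine-tar correlation quoted in the memo.
vif_nicotine = vif_from_pairwise_r(0.9766)
print(round(vif_nicotine, 1))
```

The value comes out at roughly 21.6, well above the threshold of 10 that is often quoted as a rule of thumb, which is consistent with the caution against interpreting the nicotine coefficient.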


7. Missing values (actual values, allow for rounding):

a) 0.9070

b) 3

c) 21

d) 2.0901 (no mark allocated since provided on memo)

e) 78.9838

f) 3.4618 ½

g) 3.9736 ½

h) -10.7433

i) 5.4800

j) 0.9735 
(8)
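
Several of the missing values in question 7 are tied together by the ANOVA identities, so they can be cross-checked. The sketch below rests on assumptions the memo does not state: that b) 3 and c) 21 are the regression and residual degrees of freedom, that d) 2.0901 is the mean square error, and that e) 78.9838 is the F statistic. If those readings are right, the other ANOVA quantities follow directly. The function name is illustrative.

```python
def anova_from_f(f_stat: float, mse: float, df_reg: int, df_res: int) -> dict:
    """Rebuild the remaining ANOVA quantities from F, MSE and the two df."""
    msr = f_stat * mse      # F = MSR / MSE
    ssr = msr * df_reg      # MSR = SSR / df_reg
    sse = mse * df_res      # MSE = SSE / df_res
    sst = ssr + sse
    return {"MSR": msr, "SSR": ssr, "SSE": sse, "SST": sst, "R2": ssr / sst}


# Values read off the memo; their roles are assumed as described in the lead-in.
table = anova_from_f(f_stat=78.9838, mse=2.0901, df_reg=3, df_res=21)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```

Note that R^2 depends only on F and the degrees of freedom here, because the MSE cancels in SSR / SST.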
8. Consider the all subsets regression:
a)
Any one of the following combinations, with the following reason:
[...], [...] and [...] – model has the highest [...]
OR
[...] and [...] – simplest model with a relatively high [...]
OR
[...] and [...] – simplest model with a relatively high [...]
OR
[...] – simplest model with a relatively high [...]
(2)
b)
[...] – simplest model with highest [...].
(2)
c)
The R² measure increases automatically as the number of variables increases. The adjusted R² accounts for the number of variables in the regression model, giving us a better indication of whether or not a new variable truly contributes to explaining the variability in the response variable.
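
A small numerical sketch of this point. The inputs are assumptions: R^2 = 0.9186 with n = 25 observations and 3 explanatory variables is what an F statistic of 78.9838 on 3 and 21 degrees of freedom would imply, and the four-variable model is purely hypothetical.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k explanatory variables and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


three_vars = adjusted_r2(0.9186, n=25, k=3)
# Hypothetical fourth variable that raises R^2 by only 0.0001.
four_vars = adjusted_r2(0.9187, n=25, k=4)
print(round(three_vars, 4), round(four_vars, 4))
```

R^2 rises when the fourth variable is added, yet the adjusted value falls, which is the behaviour the answer describes.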

(2)

9. The [...] variable, since it has the highest correlation to the CO content (0.9575).

(2)

10. The [...] variable, since it has the highest p-value (0.9735).

(2)

11. Note: Need to clearly indicate that the assumptions are associated with the random ERROR (not the residuals). If this is not clear, mark the assumption portion of the question out of 3; full marks can still be obtained for the assessments.

a) Errors are normally distributed (one of q-q plot, histogram, Chi-square test, Lilliefors, K-S test, Shapiro-Wilk test)

b) Errors have an expected value (or mean) of 0 (one of residual plot or t-test about the mean)

c) Errors have a constant but unknown variance (Residual plot)

d) Errors are independent of each other (Durbin-Watson test)
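
For the independence check in d), the Durbin-Watson statistic is simple enough to compute by hand. The sketch below uses made-up residuals, since the memo does not list any; values near 2 suggest no first-order autocorrelation, values towards 0 suggest positive autocorrelation and values towards 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals: list) -> float:
    """Sum of squared successive differences divided by the sum of squares."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return numerator / denominator


# Hypothetical residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every step
constant = [2.0, 2.0, 2.0]             # perfectly persistent
print(durbin_watson(alternating), durbin_watson(constant))
```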

(8)

12. Some concern about heteroscedasticity as the variability seems to become larger as the predicted responses increase.

(2)
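
The visual impression of a widening spread can be backed by a rough number. The sketch below is an informal screen, not a formal test: it orders the residuals by predicted value, splits them into a lower and an upper half, and compares the mean squared residuals. The data are hypothetical, as the memo does not reproduce the residual plot values.

```python
def spread_ratio(fitted: list, residuals: list) -> float:
    """Mean squared residual in the upper half of fitted values over the lower half."""
    if len(fitted) != len(residuals) or len(fitted) < 4:
        raise ValueError("need two equal-length lists with at least four points")
    ordered = [e for _, e in sorted(zip(fitted, residuals), key=lambda p: p[0])]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]   # middle point dropped if n is odd
    ms_low = sum(e * e for e in low) / len(low)
    ms_high = sum(e * e for e in high) / len(high)
    if ms_low == 0:
        raise ValueError("lower-half residuals are all zero")
    return ms_high / ms_low


# Hypothetical predicted values and residuals, for illustration only.
fitted = [5.0, 8.0, 11.0, 14.0, 17.0, 20.0]
residuals = [0.5, -0.5, 0.5, -2.0, 2.0, -2.0]
print(spread_ratio(fitted, residuals))
```

A ratio well above 1 points in the same direction as the plot: more variability at larger predicted responses.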

13. Errors seem to be relatively normally distributed as there is a straight line through the origin.


(2)
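
How straight the q-q line is can be summarised by the correlation between the ordered residuals and the matching standard normal quantiles. This is a minimal sketch using only the Python standard library; the plotting positions (i - 0.5) / n are one common choice among several, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def qq_points(residuals: list) -> list:
    """Pair each ordered residual with its theoretical standard normal quantile."""
    n = len(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(residuals)))


def qq_correlation(residuals: list) -> float:
    """Pearson correlation of the q-q points; values near 1 mean a straight line."""
    points = qq_points(residuals)
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


# Hypothetical residuals: one roughly symmetric set, one with a single large outlier.
symmetric = [-1.8, -1.1, -0.6, -0.2, 0.0, 0.3, 0.6, 1.0, 1.7]
skewed = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 50.0]
print(round(qq_correlation(symmetric), 3), round(qq_correlation(skewed), 3))
```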