
VARIABLE SELECTION AND MODEL BUILDING

In most practical problems, the analyst has a large pool of possible candidate regressors, of which
only a few are likely to be important. Finding an appropriate subset of regressors for the model is
called the variable selection problem.

If we include more regressor variables in the model, R² will increase, but the variance of the
predicted response will also increase. Therefore, using suitable criteria, we need to develop the
best regression equation for the sample data. The chosen model must then be scrutinized through
residual analysis, checks for outliers and influential observations, and an assessment of its
prediction capability.

By deleting variables from the model, we may improve the precision of the parameter estimates of
the retained variables and reduce the variance of the predicted response. Deleting variables
potentially introduces bias into the estimates of the coefficients of the retained variables and of
the response, but the mean square error (MSE) of these biased estimates can be smaller than the
variance of the unbiased estimates from the full model. That is, the amount of bias introduced is
smaller than the accompanying reduction in variance.

EVALUATION CRITERIA

We assume that there are k candidate regressor variables X_1, X_2, ..., X_k and n ≥ k + 1 observations
on these regressors and the response Y. The full model, containing all k regressors, is

Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ij} + \varepsilon_i ,    i = 1, 2, ..., n

Let p denote the number of terms (including the intercept) retained in the model; that is, the subset
model contains p − 1 of the original regressor variables.

Coefficient of Multiple Determination

A measure of the adequacy of a regression model is the coefficient of multiple determination, R².
Let R_p^2 denote the coefficient of multiple determination for a subset regression model with p terms,
that is, p − 1 regressor variables and an intercept β_0. Computationally,

R_p^2 = \frac{SS_R(p)}{SS_T} = 1 - \frac{SS_E(p)}{SS_T}

where SS_R(p) and SS_E(p) denote the regression sum of squares and the residual sum of squares,
respectively, for a p-term subset model. Now R_p^2 increases as p increases and attains its maximum
when p = k + 1. Therefore, the analyst uses this criterion by adding regressors to the model up to the
point where an additional variable is not useful, in that it provides only a small increase in R_p^2.
In general, it is not straightforward to use R² as a criterion for choosing the number of regressors to
include in the model, and there is no specified optimum value of R² for a subset regression model.
Models with high values of R_p^2 are preferred.
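
As an illustration, the sketch below computes R_p^2 from SS_E(p) and SS_T for a candidate subset. It is only a minimal numpy sketch on simulated data; the arrays X and y and the helper subset_r2 are illustrative, not from the text.

    import numpy as np

    def subset_r2(X, y, cols):
        """R_p^2 for the subset model built from the columns in `cols` plus an intercept."""
        n = len(y)
        Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        sse = np.sum((y - Xp @ beta) ** 2)       # SS_E(p)
        sst = np.sum((y - y.mean()) ** 2)        # SS_T
        return 1.0 - sse / sst

    # hypothetical data: 20 observations on 3 candidate regressors
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = 2 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=20)

    print(subset_r2(X, y, [0]))      # one regressor
    print(subset_r2(X, y, [0, 1]))   # adding a regressor can only increase R_p^2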

ADJUSTED R2

We prefer to use the adjusted R2 statistic defined for a p-term equation as

 n 1 
, p  1   (1  Rp )
2 2
RAdj
n p
The R_{Adj,p}^2 statistic does not necessarily increase as additional regressors are introduced into the
model. It can be shown that if s regressors are added to the model, R_{Adj,p+s}^2 will exceed R_{Adj,p}^2 if
and only if the partial F statistic for testing the significance of the s additional regressors exceeds 1.
Consequently, one criterion for selecting an optimum subset model is to choose the model that
has the maximum R_{Adj,p}^2.
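
This statement can be checked numerically. The following sketch (assuming statsmodels is available; the data are simulated, not from the text) adds one regressor at a time to a base model and prints its partial F together with the change in adjusted R²; the adjusted R² rises exactly when the partial F exceeds 1.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 30
    X = rng.normal(size=(n, 3))
    y = 1 + 2 * X[:, 0] + rng.normal(size=n)    # x2 and x3 are pure noise regressors

    base = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()
    for j in (1, 2):
        full = sm.OLS(y, sm.add_constant(X[:, [0, j]])).fit()
        partial_F = (base.ssr - full.ssr) / full.mse_resid   # F for the single added regressor
        print(f"x{j + 1}: partial F = {partial_F:.3f}, "
              f"adjusted R2: {base.rsquared_adj:.4f} -> {full.rsquared_adj:.4f}")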

RESIDUAL MEAN SQUARE

The residual mean square for a subset regression model,

MS_E(p) = \frac{SS_E(p)}{n - p} ,

may be used as a model evaluation criterion. SS_E(p) always decreases as p increases, but MS_E(p)
initially decreases, then stabilizes, and eventually may increase. The eventual increase in MS_E(p)
occurs when the reduction in SS_E(p) from adding a regressor to the model is not sufficient to
compensate for the loss of one degree of freedom in the denominator. The value of p is chosen where
the smallest MS_E(p) occurs.

The subset regression model that minimizes MS_E(p) will also maximize R_{Adj,p}^2, since

R_{Adj,p}^2 = 1 - \left(\frac{n-1}{n-p}\right)\left(1 - R_p^2\right) = 1 - \frac{n-1}{n-p}\,\frac{SS_E(p)}{SS_T} = 1 - \frac{MS_E(p)}{SS_T/(n-1)}

Thus, the criteria minimum MS_E(p) and maximum R_{Adj,p}^2 are equivalent.
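
The equivalence can also be verified directly: ranking all candidate subsets by MS_E(p) and by adjusted R² picks out the same model. A minimal numpy sketch on simulated data (all names here are illustrative):

    import itertools
    import numpy as np

    def mse_and_adj_r2(X, y, cols):
        """Return (MS_E(p), R_Adj,p^2) for the subset model using columns `cols` plus an intercept."""
        n = len(y)
        Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        p = Xp.shape[1]                              # terms, including the intercept
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        sse = np.sum((y - Xp @ beta) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        mse = sse / (n - p)
        return mse, 1.0 - mse / (sst / (n - 1))      # the identity derived above

    rng = np.random.default_rng(2)
    X = rng.normal(size=(25, 3))
    y = 3 + X[:, 0] - 2 * X[:, 1] + rng.normal(size=25)

    subsets = [cols for r in range(1, 4) for cols in itertools.combinations(range(3), r)]
    scored = [(cols, *mse_and_adj_r2(X, y, cols)) for cols in subsets]
    print(min(scored, key=lambda t: t[1])[0])        # subset with the smallest MS_E(p)
    print(max(scored, key=lambda t: t[2])[0])        # the same subset maximizes adjusted R^2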
MALLOWS' Cp STATISTIC

Mallows' Cp statistic is computed as

C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p

Where ˆ 2 is the mean square error for the full model with K regressors and SSE(p) is the sum of
squares for error for a subset model with p parameters. The sum of squares for error for the subset
model can be written as E[SSE ( p)]  (n  p) p2 . If the error variance  p2 for a p-term model is
close to the error variance ˆ 2 for a full model, then Cp-statistic will be closer to P. Generally, small
values of Cp are desirable.
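
A sketch of the computation (numpy, simulated data; σ̂² is taken from the full model and the helper name cp_statistic is illustrative):

    import itertools
    import numpy as np

    def cp_statistic(X, y, cols, sigma2_full):
        """Mallows' C_p for the subset model using columns `cols` plus an intercept."""
        n = len(y)
        Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        p = Xp.shape[1]
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        sse_p = np.sum((y - Xp @ beta) ** 2)
        return sse_p / sigma2_full - n + 2 * p

    rng = np.random.default_rng(3)
    n, k = 30, 4
    X = rng.normal(size=(n, k))
    y = 5 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=n)

    # sigma^2 estimated as the mean square error of the full model with all k regressors
    X_full = np.column_stack([np.ones(n), X])
    beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    sigma2_full = np.sum((y - X_full @ beta_full) ** 2) / (n - k - 1)

    for r in range(1, k + 1):
        for cols in itertools.combinations(range(k), r):
            cp = cp_statistic(X, y, cols, sigma2_full)
            print(cols, round(cp, 2))    # low-bias subsets give C_p close to p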

The basic objective of a regression model is to describe a process or a complex system. We would
generally like to describe the system with as few regressors as possible while still explaining a
substantial portion of the variability in the response variable. We also use the model for prediction
of future observations, and hence we choose a model for which the mean square error of prediction is
small. If the regression model is to be used for control, then it is important to estimate the
regression coefficients accurately, which implies that the standard errors of the regression
coefficients should be small. Considering all these aspects, one has to finalize the subset
regression model.

No single criterion is always best for subset selection. Mallows' Cp criterion is favoured because it
tends to simplify the decision about how many variables to retain in the final model.

COMPUTATIONAL TECHNIQUES FOR VARIABLE SELECTION

1. All Possible Regressions

This procedure requires that the analyst fit all the regression equations involving one candidate
regressor, two candidate regressors, and so on. These equations are evaluated according to some
suitable criterion and the "best" regression model is selected. If there are K candidate regressors,
then there are 2^K possible regression equations to be estimated and examined (including the
intercept-only model). Clearly, the number of equations to be examined increases rapidly as the
number of candidate regressors increases.
Example: The Hald Cement Data

Hald presents data on the heat evolved in calories per gram of cement (Y) as a function of the
amount of each of four ingredients in the mix: tricalcium aluminate (X1), tricalcium silicate (X2),
tetracalcium aluminoferrite (X3), and dicalcium silicate (X4). Develop a suitable regression model.

Obs No.   yi     x1i   x2i   x3i   x4i
1 78.5 7 26 6 60
2 74.3 1 29 15 52
3 104.3 11 56 8 20
4 87.6 11 31 8 47
5 95.9 7 52 6 33
6 109.2 11 55 9 22
7 102.7 3 71 17 6
8 72.5 1 31 22 44
9 93.1 2 54 18 22
10 115.9 21 47 4 26
11 83.8 1 40 23 34
12 113.3 11 66 9 12
13 109.4 10 68 8 12

For the Hald cement data, the final model obtained by the all-possible-regressions procedure is

\hat{Y} = 52.5773 + 1.4683 X_1 + 0.6623 X_2
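
The procedure can be sketched with the data from the table above (assuming statsmodels is available; ranking here is by Cp, with adjusted R² shown alongside). The subset {X1, X2} should head the list, in line with the fitted equation reported above.

    import itertools
    import numpy as np
    import statsmodels.api as sm

    # Hald cement data from the table above
    y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
                  72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
    X = np.array([[ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20],
                  [11, 31,  8, 47], [ 7, 52,  6, 33], [11, 55,  9, 22],
                  [ 3, 71, 17,  6], [ 1, 31, 22, 44], [ 2, 54, 18, 22],
                  [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
                  [10, 68,  8, 12]], dtype=float)

    n, k = X.shape
    sigma2_full = sm.OLS(y, sm.add_constant(X)).fit().mse_resid   # full-model MSE

    models = []
    for r in range(1, k + 1):
        for cols in itertools.combinations(range(k), r):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            p = r + 1                                             # terms, including the intercept
            cp = fit.ssr / sigma2_full - n + 2 * p
            models.append((cols, fit.rsquared_adj, cp))

    # rank the 2^k - 1 non-empty subset models by Cp
    for cols, adj_r2, cp in sorted(models, key=lambda t: t[2])[:5]:
        print([f"x{j + 1}" for j in cols], f"adj R2 = {adj_r2:.4f}", f"Cp = {cp:.2f}")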

2. Stepwise Regression Methods

Three different procedures are available for selecting a subset model by adding or deleting regressors
one at a time: (i) forward selection, (ii) backward elimination, and (iii) stepwise regression, which is
a combination of forward selection and backward elimination.

Forward Selection: This procedure begins with the assumption that there are no regressors in the
model other than the intercept. An effort is made to find an optimal subset by inserting regressors
into the model one at a time. The first regressor selected for entry into the equation is the one that
has the largest simple correlation with the response variable Y. Suppose that this regressor is X1. This
is also the regressor that will produce the largest value of F-statistic for testing significance of
regression. This regressor is entered if the F-statistic exceeds a preselected F- value, say F-to-enter.
The second regressor chosen for entry is the one that now has the largest correlation with Y after
adjusting for the effect of the first regressor entered (X1) on Y. These correlations are referred to as
partial correlations. They are the simple correlations between the residuals from the regression
model ŷ = β̂_0 + β̂_1 x_1 and the residuals from the regressions of the other candidate regressors on X1,
say x̂_j = α̂_0j + α̂_1j x_1, j = 2, 3, ..., K.
Suppose that at step 2 the regressor with the highest partial correlation with Y is X2. This implies
that the largest partial F-statistic is

F = \frac{SS_R(x_2 \mid x_1)}{MS_E(x_1, x_2)}

If this F-value exceeds F-to-enter, then X2 is added to the model. In general, at each step the
regressor having the highest partial correlation with Y is added to the model if its partial F-statistic
exceeds the preselected entry level F-to-enter. The procedure terminates either when the partial F-
statistic at a particular step does not exceed F-to-enter or when the last candidate regressor is added
to the model.

Minitab reports t-statistics for entering or removing variables. This is a perfectly acceptable variation
of the procedure, because t²_{α/2, ν} = F_{α, 1, ν}.

For the Hald cement data, the final model obtained by the forward selection procedure is

\hat{Y} = 71.6483 + 1.4519 X_1 + 0.4161 X_2 - 0.2365 X_4
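
A compact sketch of the procedure (assuming statsmodels; the data are simulated, and the F-to-enter value of 4.0 is only a common rule-of-thumb choice, not taken from the text):

    import numpy as np
    import statsmodels.api as sm

    def forward_selection(X, y, f_to_enter=4.0):
        """Add the regressor with the largest partial F at each step; stop when
        no remaining candidate's partial F exceeds f_to_enter."""
        n, k = X.shape
        selected, remaining = [], list(range(k))
        current = sm.OLS(y, np.ones(n)).fit()            # intercept-only model
        while remaining:
            best = None
            for j in remaining:
                trial = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
                partial_F = (current.ssr - trial.ssr) / trial.mse_resid
                if best is None or partial_F > best[1]:
                    best = (j, partial_F, trial)
            if best[1] < f_to_enter:
                break
            selected.append(best[0])
            remaining.remove(best[0])
            current = best[2]
        return selected, current

    # simulated data with two active regressors out of five
    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 5))
    y = 1 + 2 * X[:, 0] - X[:, 2] + rng.normal(size=40)
    selected, fit = forward_selection(X, y)
    print("entered:", [f"x{j + 1}" for j in selected])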

Backward Elimination: Forward selection begins with no regressors in the model and attempts to
insert variables until a suitable model is obtained. Backward elimination attempts to find a good
model by working in the opposite direction. That is, it starts with a model inclusive of all K
regressors. Then, the partial F statistic (or equivalent t-statistic) is computed for each regressor as if
it were the last variable to enter the model. The smallest of these partial F-statistics is compared
with a preselected value F-to-remove. If the smallest partial F-value is less than F-to-remove, then
that regressor is removed from the model. Now, a regression model with the remaining K − 1 regressors is fit, the
partial F-statistics for this new model calculated, and the procedure repeated. The backward
elimination procedure terminates when the smallest partial F-value is not less than the preselected
cutoff value F-to-remove.

For the Hald cement data, the final model obtained by the backward elimination procedure is

\hat{Y} = 52.5773 + 1.4683 X_1 + 0.6623 X_2
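
A matching sketch for backward elimination (again statsmodels on simulated data, with F-to-remove = 4.0 as an illustrative cutoff); it uses the fact noted above that the partial F for a single regressor equals the square of its t statistic:

    import numpy as np
    import statsmodels.api as sm

    def backward_elimination(X, y, f_to_remove=4.0):
        """Drop the regressor with the smallest partial F until every remaining
        partial F is at least f_to_remove."""
        kept = list(range(X.shape[1]))
        while kept:
            fit = sm.OLS(y, sm.add_constant(X[:, kept])).fit()
            partial_F = fit.tvalues[1:] ** 2             # t^2 = partial F; skip the intercept
            worst = int(np.argmin(partial_F))
            if partial_F[worst] >= f_to_remove:
                return kept, fit
            del kept[worst]
        return kept, sm.OLS(y, np.ones(len(y))).fit()    # nothing survived: intercept only

    rng = np.random.default_rng(5)
    X = rng.normal(size=(40, 5))
    y = 1 + 2 * X[:, 0] - X[:, 2] + rng.normal(size=40)
    kept, fit = backward_elimination(X, y)
    print("retained:", [f"x{j + 1}" for j in kept])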

Stepwise Regression: Stepwise regression is a modification of forward selection in which, at each
step, all regressors previously entered into the model are reassessed via their partial F (or t)
statistics. A regressor added at an earlier step may now be redundant because of the relationships
between it and the regressors now in the equation. If the partial F (or t) statistic for a variable is
less than the F-to-remove (or t-to-remove) value, that variable is dropped from the model. Stepwise
regression therefore requires two cutoff values, one for entering variables and one for removing them.
Generally, the F-to-enter value is chosen to be the same as the F-to-remove value, although this is not
necessary.
For the Hald cement data, the final model obtained by the stepwise regression procedure is

\hat{Y} = 52.5773 + 1.4683 X_1 + 0.6623 X_2

This is the same equation identified by the all-possible-regressions and backward elimination
procedures.
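
A simplified sketch combining the two steps (statsmodels, simulated data, equal F-to-enter and F-to-remove cutoffs of 4.0; real implementations add safeguards against cycling):

    import numpy as np
    import statsmodels.api as sm

    def stepwise(X, y, f_to_enter=4.0, f_to_remove=4.0):
        """Forward step by largest partial F, then drop any previously entered
        regressor whose partial F has fallen below f_to_remove."""
        n, k = X.shape
        selected = []
        while True:
            current = (sm.OLS(y, sm.add_constant(X[:, selected])).fit()
                       if selected else sm.OLS(y, np.ones(n)).fit())
            best = None
            for j in (j for j in range(k) if j not in selected):
                trial = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
                partial_F = (current.ssr - trial.ssr) / trial.mse_resid
                if best is None or partial_F > best[1]:
                    best = (j, partial_F)
            if best is None or best[1] < f_to_enter:
                return selected
            selected.append(best[0])
            while len(selected) > 1:                      # backward check on the current subset
                fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
                partial_F = fit.tvalues[1:] ** 2
                worst = int(np.argmin(partial_F))
                if partial_F[worst] >= f_to_remove:
                    break
                del selected[worst]

    rng = np.random.default_rng(6)
    X = rng.normal(size=(40, 5))
    y = 1 + 2 * X[:, 0] - X[:, 2] + rng.normal(size=40)
    print("final subset:", [f"x{j + 1}" for j in stepwise(X, y)])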

The stepwise methods are fast, easy to implement, and readily available in many software packages.
They may be used first to screen for a few important regressors, eliminating those that clearly have
negligible effects. The all-possible-regressions approach may then be applied to the reduced set of
candidate regressors. A proper application of the all-possible-regressions approach should produce a
few final candidate models. At this point, it is critical to perform thorough residual and other
diagnostic analyses of each of these final models.

Although the equation fits the data well and passes the usual diagnostic checks, there is no
assurance that it will predict new observations accurately. The predictive ability of a model must be
assessed by observing its performance on new data not used to build the model. This is called
validation of regression model. A model must be validated before giving it to the actual user.

There is a distinction between model adequacy checking and model validation. Model adequacy
checking includes residual analysis, testing for lack of fit, searching for high-leverage or overly
influential observations, and other internal analyses that investigate the fit of the regression
equation to the available data. Model validation, by contrast, is directed toward determining whether
the model will function successfully in its intended operating environment. Most often, a regression
model is validated by collecting new data and comparing the model's predictions with the observed
responses. Planned experiments used to collect data for checking the predictive performance of a
model are generally called confirmation runs.
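
As a sketch of the idea, the code below (statsmodels, simulated data; the split into estimation and confirmation sets is illustrative) fits a model on one set of observations and checks its predictive ability on data that were not used to build it:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)

    def response(X):
        # hypothetical true process generating the data
        return 2 + 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=len(X))

    X_fit, X_new = rng.normal(size=(40, 3)), rng.normal(size=(15, 3))
    y_fit, y_new = response(X_fit), response(X_new)

    model = sm.OLS(y_fit, sm.add_constant(X_fit)).fit()       # built on the estimation data only

    # predictive performance on the new ("confirmation") observations
    y_pred = model.predict(sm.add_constant(X_new))
    sse_new = np.sum((y_new - y_pred) ** 2)
    pred_r2 = 1 - sse_new / np.sum((y_new - y_new.mean()) ** 2)
    print(f"in-sample R2 = {model.rsquared:.3f}, prediction R2 on new data = {pred_r2:.3f}")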

AIC, AICC, and BIC


Akaike (1973) derived a criterion from information theory, known as the Akaike
information criterion (AIC). It has the following general form:

AIC = -2 * log-likelihood + 2 * (number of parameters).

When applied to Gaussian or normal linear models, it becomes, up to a constant,

AIC ≈ n * log(SSE) + 2p.

The first part, n * log(SSE), measures the goodness of fit of the model, which is penalized by the
model complexity in the second part, 2p. The constant 2 in the penalty term is often referred to as
the complexity or penalty parameter. A smaller AIC indicates a better candidate model.

Observing that AIC tends to overfit when the sample size is relatively small, Hurvich
and Tsai (1989) proposed a bias-corrected version, called AICC, which is given by

AICC ≈ n * log(SSE) + n * (n + p + 1) / (n - p - 3).

Within the Bayesian framework, Schwarz (1978) developed another criterion, labeled BIC for Bayesian
information criterion (also called SIC for Schwarz information criterion or SBC for Schwarz-Bayesian
criterion). The BIC, given by

BIC ≈ n * log(SSE) + log(n) * p,

applies a larger penalty for overfitting. The information-based criteria have received
wide popularity in statistical applications mainly because of their easy extension to
other regression models. There are many other information-based criteria introduced
in the literature.
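
A sketch computing the three criteria for every candidate subset, using the SSE-based forms quoted above (numpy, simulated data; p counts the regressors plus the intercept, and the helper name info_criteria is illustrative):

    import itertools
    import numpy as np

    def info_criteria(X, y, cols):
        """AIC, AICC and BIC (up to constants) for the subset model using columns `cols`."""
        n = len(y)
        Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        p = Xp.shape[1]
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        sse = np.sum((y - Xp @ beta) ** 2)
        aic = n * np.log(sse) + 2 * p
        aicc = n * np.log(sse) + n * (n + p + 1) / (n - p - 3)
        bic = n * np.log(sse) + np.log(n) * p
        return aic, aicc, bic

    rng = np.random.default_rng(8)
    X = rng.normal(size=(30, 4))
    y = 1 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=30)

    for r in range(1, 5):
        for cols in itertools.combinations(range(4), r):
            aic, aicc, bic = info_criteria(X, y, cols)
            print(cols, f"AIC = {aic:.1f}", f"AICC = {aicc:.1f}", f"BIC = {bic:.1f}")
    # under each criterion, the smallest value marks the preferred candidate model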

In large samples, a model selection criterion is said to be asymptotically efficient if it selects the
model with minimum mean squared error of prediction, and consistent if it selects the true model with
probability tending to one. No criterion is both consistent and asymptotically efficient. According to
this categorization, MS_E(p), R², adjusted R², Cp, AIC, and AICC are all asymptotically efficient
criteria, while BIC is a consistent one.

Among many other factors, the performance of these criteria depends on the available sample size and
the signal-to-noise ratio. Based on extensive simulation studies, McQuarrie and Tsai (1998) provided
some general advice on the use of various model selection criteria, indicating that AIC and Cp work
best for moderate-sized samples, AICC provides the most effective selection with small samples, and
BIC is most suitable for large samples with relatively strong signals.
Example: The Gorman and Toman Asphalt Data

Gorman and Toman present data concerning the rut depth of 31 asphalt pavements prepared under
different conditions specified by five regressors. The variables are: Y is the rut depth per million
wheel passes, X1 is the viscosity of the asphalt, X2 is the percentage of asphalt in the surface course,
X3 is the percentage of asphalt in the base course, X4 is the percentage of fines in the surface course,
and X5 is the percentage of voids in the surface course. Develop a suitable model.

Obs No. Y X1 X2 X3 X4 X5
1 6.75 2.8 4.68 4.87 8.4 4.916
2 13.00 1.4 5.19 4.50 6.5 4.563
3 14.75 1.4 4.82 4.73 7.9 5.321
4 12.60 3.3 4.85 4.76 8.3 4.865
5 8.25 1.7 4.86 4.95 8.4 3.776
6 10.67 2.9 5.16 4.45 7.4 4.397
7 7.28 3.7 4.82 5.05 6.8 4.867
8 12.67 1.7 4.86 4.70 8.6 4.828
9 12.58 0.9 4.78 4.84 6.7 4.865
10 20.60 0.7 5.16 4.76 7.7 4.034
11 3.58 6.0 4.57 4.82 7.4 5.450
12 7.00 4.3 4.61 4.65 6.7 4.853
13 26.20 0.6 5.07 5.10 7.5 4.257
14 11.67 1.8 4.66 5.09 8.2 5.144
15 7.67 6.0 5.42 4.41 5.8 3.718
16 12.25 4.4 5.01 4.74 7.1 4.715
17 0.76 88.0 4.97 4.66 6.5 4.625
18 1.35 62.0 4.01 4.72 8.0 4.977
19 1.44 50.0 4.96 4.90 6.8 4.322
20 1.60 58.0 5.20 4.70 8.2 5.087
21 1.10 90.0 4.80 4.60 6.6 5.971
22 0.85 66.0 4.98 4.69 6.4 4.647
23 1.20 140.0 5.35 4.76 7.3 5.115
24 0.56 240.0 5.04 4.80 7.8 5.939
25 0.72 420.0 4.80 4.80 7.4 5.916
26 0.47 500.0 4.83 4.60 6.7 5.471
27 0.33 180.0 4.66 4.72 7.2 4.602
28 0.26 270.0 4.67 4.50 6.3 5.043
29 0.76 170.0 4.72 4.70 6.8 5.075
30 0.80 98.0 5.00 5.07 7.2 4.334
31 2.00 35.0 4.70 4.80 7.7 5.705
