Chapter 17
Polynomial Regression
For instance, we look at the plot of residuals versus the fitted values. Sometimes, a plot of the response versus a predictor may also show some curvature in that relationship. Such plots may suggest there is a nonlinear relationship. If we believe there is a nonlinear relationship between the response and predictor(s), then one way to account for it is through a polynomial regression model:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \ldots + \beta_h X^h + \epsilon, \qquad (17.1)$$
where h is called the degree of the polynomial. For lower degrees, the
relationship has a specific name (i.e., h = 2 is called quadratic, h = 3 is
called cubic, h = 4 is called quartic, and so on). As for a bit of semantics, it was noted at the beginning of the previous course that nonlinear regression refers to models that are nonlinear in the parameters; even though the regression function in (17.1) is curved in X, the model is still linear in the β's. In matrix form it can be written as $Y = X\beta + \epsilon$,
where the entries in Y and X would consist of the raw data. So as you can
see, we are in a setting where the analysis techniques used in multiple linear
regression (e.g., OLS) are applicable here.
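As a quick illustration, here is a minimal R sketch of fitting such a model by OLS; the data frame `dat` with columns `x` and `y` is hypothetical and not from the text.

##########
# A minimal sketch: fitting a degree-2 polynomial regression by OLS.
# 'dat' is a hypothetical data frame with predictor x and response y.
fit2 <- lm(y ~ x + I(x^2), data = dat)        # raw polynomial terms
summary(fit2)

# Equivalent fit using orthogonal polynomial terms, which reduces the
# collinearity between the power terms:
fit2.orth <- lm(y ~ poly(x, degree = 2), data = dat)
##########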
Figure 17.1: (a) Scatterplot of the quadratic data with the OLS line. (b)
Residual plot for the OLS fit. (c) Histogram of the residuals. (d) NPP for
the Studentized residuals.
i xi yi i xi yi i xi yi
1 6.6 -45.4 21 8.4 -106.5 41 8 -95.8
2 10.1 -176.6 22 7.2 -63 42 8.9 -126.2
3 8.9 -127.1 23 13.2 -362.2 43 10.1 -179.5
4 6 -31.1 24 7.1 -61 44 11.5 -252.6
5 13.3 -366.6 25 10.4 -194 45 12.9 -338.5
6 6.9 -53.3 26 10.8 -216.4 46 8.1 -97.3
7 9 -131.1 27 11.9 -278.1 47 14.9 -480.5
8 12.6 -320.9 28 9.7 -162.7 48 13.7 -393.6
9 10.6 -204.8 29 5.4 -21.3 49 7.8 -87.6
10 10.3 -189.2 30 12.1 -284.8 50 8.5 -105.4
11 14.1 -421.2 31 12.1 -287.5
12 8.6 -113.1 32 12.1 -290.8
13 14.9 -482.3 33 9.2 -137.4
14 6.5 -42.9 34 6.7 -47.7
15 9.3 -144.8 35 12.1 -292.3
16 5.2 -14.2 36 13.2 -356.4
17 10.7 -211.3 37 11 -228.5
18 7.5 -75.4 38 13.1 -354.4
19 14.9 -482.7 39 9.2 -137.2
20 12.2 -295.6 40 13.2 -361.6
Table 17.1: The simulated degree-2 polynomial data set with n = 50 observations.
• In general, you should obey the hierarchy principle, which says that
if your model includes X h and X h is shown to be a statistically signif-
icant predictor of Y , then your model should also include each X j for
all j < h, whether or not the coefficients for these lower-order terms
are significant.
• The number of factor levels must be greater than the order of the model
(i.e., p > h).
Figure 17.2: (a) The points of a square portion of a design with factor levels coded at ±1. This is how a 2² factorial design is coded. (b) Illustration of the axial (or star) points of a design at (+a,0), (−a,0), (0,−a), and (0,+a). (c) A diagram which shows the combination of the previous two diagrams with the design center at (0,0). This final diagram is how a composite design is coded.
Figure 17.3: (a) The points of a cube portion of a design with factor levels coded at the corners of the cube. This is how a 2³ factorial design is coded. (b) Illustration of the axial (or star) points of this design. (c) A diagram which shows the combination of the previous two diagrams with the design center at (0,0,0). This final diagram is how a composite design is coded.
If an interaction term appears in the model (e.g., $X_i^{h_1} X_j^{h_2}$ such that $h_2 \le h_1$), then the hierarchy principle says that at least the main factor effects for powers $1, \ldots, h_1$ must appear in the model, that all $h_1$-order interactions with the factor powers of $1, \ldots, h_2$ must appear in the model, and that all order interactions less than $h_1$ must appear in the model. Luckily, response surface regression models (and polynomial models for that matter) rarely go beyond h = 3.
For the next step, an ANOVA table is usually constructed to assess the
significance of the model. Since the factor levels are all essentially treated as
categorical variables, the designed experiment will usually result in replicates
for certain factor level combinations. This is unlike multiple regression where
the predictors are usually assumed to be continuous and no predictor level
combinations are assumed to be replicated. Thus, a formal lack of fit test
is also usually incorporated. Furthermore, the SSR is also broken down
into the components making up the full model, so you can formally test the
contribution of those components to the fit of your model.
An example of a response surface regression ANOVA is given in Table
17.3. Since it is not possible to compactly show a generic ANOVA table nor
to compactly express the formulas, this example is for a quadratic model
with linear interaction terms. The formulas will be similar to their respec-
tive quantities defined earlier. For this example, assume that there are k factors.
Table 17.2: A table showing all of the terms that could be included in a
response surface regression model. In the above, the indices for the factor
are given by i = 1, . . . , k and j = 1, . . . , k.
SSLIN = SSR(X1 , X2 , . . . , Xk ).
Source df SS MS F
Regression q−1 SSR MSR MSR/MSE
Linear k SSLIN MSLIN MSLIN/MSE
Quadratic k SSQUAD MSQUAD MSQUAD/MSE
Interaction q − 2k − 1 SSINT MSINT MSINT/MSE
Error n−q SSE MSE
Lack of Fit m−q SSLOF MSLOF MSLOF/MSPE
Pure Error n−m SSPE MSPE
Total n−1 SSTO
Table 17.3: ANOVA table for a response surface regression model with linear,
quadratic, and linear interaction terms.
17.3 Examples
Example 1: Yield Data Set
This data set of size n = 15 contains measurements of yield from an exper-
iment done at five different temperature levels. The variables are y = yield
and x = temperature in degrees Fahrenheit. Table 17.4 gives the data used
for this analysis. Figure 17.4 gives a scatterplot of the raw data and then another scatterplot with lines pertaining to a linear fit and a quadratic fit overlaid. Obviously the trend of these data is better suited to a quadratic fit.
Here we have the linear fit results:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.306306 0.469075 4.917 0.000282 ***
temp 0.006757 0.005873 1.151 0.270641
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
i Temperature Yield
1 50 3.3
2 50 2.8
3 50 2.9
4 70 2.3
5 70 2.6
6 70 2.1
7 80 2.5
8 80 2.9
9 80 2.4
10 90 3.0
11 90 3.1
12 90 2.8
13 100 3.3
14 100 3.5
15 100 3.0
Table 17.4: The yield data set.
And here are the quadratic fit results:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9604811 1.2589183 6.323 3.81e-05 ***
temp -0.1537113 0.0349408 -4.399 0.000867 ***
temp2 0.0010756 0.0002329 4.618 0.000592 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
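A sketch of R calls that would produce fits of the above form; the data frame name `yield.df` (with columns `temp` and `yield` from Table 17.4) is a placeholder, not taken from the text.

##########
# Linear and quadratic fits for the yield data (hypothetical object names)
lin.fit  <- lm(yield ~ temp, data = yield.df)              # linear fit
quad.fit <- lm(yield ~ temp + I(temp^2), data = yield.df)  # quadratic fit

summary(lin.fit)
summary(quad.fit)
anova(lin.fit, quad.fit)   # F-test comparing the nested fits
##########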
Figure 17.4: The yield data set with (a) a linear fit and (b) a quadratic fit.
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
Regression 2 1.47656 0.73828 12.36 0.001218 **
Residuals 12 0.71677 0.05973
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##########
Example 2: Odor Data Set
Table 17.5: The odor data set measurements with the factor levels already coded.
First we will fit a response surface regression model consisting of all of the
first-order and second-order terms. The summary of this fit is given below:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.667 10.840 -2.829 0.0222 *
temp -12.125 6.638 -1.827 0.1052
ratio -17.000 6.638 -2.561 0.0336 *
As you can see, the square of height is the least statistically significant, so
we will drop that term and rerun the analysis. The summary of this new fit
is given below:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.923 8.707 -3.092 0.012884 *
temp -12.125 6.408 -1.892 0.091024 .
ratio -17.000 6.408 -2.653 0.026350 *
height -21.375 6.408 -3.336 0.008720 **
temp2 31.615 9.404 3.362 0.008366 **
ratio2 47.365 9.404 5.036 0.000703 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
By omitting the square of height, the temperature main effect has now be-
come marginally significant. Note that the square of temperature is statisti-
cally significant. Since we are building a response surface regression model,
we must obey the hierarchy principle. Therefore temperature will be retained
in the model.
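A sketch of how such second-order response surface fits might be produced in R; the data frame `odor.df` with the coded factors `temp`, `ratio`, `height` and response `odor` is a hypothetical name, and the exact model specification is an assumption.

##########
# Full second-order (pure quadratic) response surface model
rs.full <- lm(odor ~ temp + ratio + height +
                I(temp^2) + I(ratio^2) + I(height^2),
              data = odor.df)

# Reduced model after dropping the square of height
rs.red <- lm(odor ~ temp + ratio + height + I(temp^2) + I(ratio^2),
             data = odor.df)
summary(rs.red)
##########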
Finally, contour and surface plots can also be generated for the response
surface regression model. Figure 17.5 gives the contour plots (with odor as
the contours) for each of the three levels of height (Figure 17.6 gives color
versions of the plots). Notice how the contours are increasing as we go out to
the corner points of the design space (so it is as if we are looking down into
a cone). The surface plots of Figure 17.7 all look similar (with the exception
of the temperature scale), but notice the curvature present in these plots.
Figure 17.5: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Figure 17.6: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Figure 17.7: The surface plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Chapter 18
Biased Regression Methods and Regression Shrinkage
for j = 1, 2, \ldots, (p − 1) and
$$s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}.$$
The standardized (correlation-transformed) model is
$$\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*,$$
and ridge regression estimates $\boldsymbol{\beta}^*$ by adding a biasing constant k to the diagonal of $\mathbf{X}^{*T}\mathbf{X}^*$:
$$\tilde{\boldsymbol{\beta}} = (\mathbf{X}^{*T}\mathbf{X}^* + k\mathbf{I})^{-1}\mathbf{X}^{*T}\mathbf{Y}^*,$$
where 0 < k < 1, but usually less than 0.3. The amount of bias in this estimator grows with k, while the variance of the estimates shrinks.
We can then transform these estimates back to the original scale (sometimes these are called the ridge regression estimates) by
$$\tilde{\beta}_j^{\dagger} = \tilde{\beta}_j\,\frac{s_Y}{s_{X_j}}$$
$$\tilde{\beta}_0^{\dagger} = \bar{y} - \sum_{j=1}^{p-1}\tilde{\beta}_j^{\dagger}\,\bar{x}_j,$$
where j = 1, 2, \ldots, p − 1.
How do we choose k? Many methods exist, but there is no agreement
on which to use, mainly due to instability in the estimates asymptotically.
Two methods are primarily used: one graphical and one analytical. The first
method is called the fixed point method and uses the estimates provided
by fitting the correlation transformation via ordinary least squares. This
method suggests using
$$k = \frac{(p-1)\,\mathrm{MSE}^*}{\hat{\boldsymbol{\beta}}^{*T}\hat{\boldsymbol{\beta}}^{*}},$$
where MSE∗ is the mean square error obtained from the respective fit.
Another method is the Hoerl-Kennard iterative method. This method
calculates
$$k^{(t)} = \frac{(p-1)\,\mathrm{MSE}^*}{\tilde{\boldsymbol{\beta}}_{k^{(t-1)}}^{T}\,\tilde{\boldsymbol{\beta}}_{k^{(t-1)}}},$$
where t = 1, 2, . . .. Here, β̃ k(t−1) pertains to the ridge regression estimates
obtained when the biasing constant is k (t−1) . This process is repeated until
the difference between two successive estimates of k is negligible. The starting
value for this method (k (0) ) is chosen to be the value of k calculated using
the fixed point method.
Perhaps the most common method is a graphical method. The ridge
trace is a plot of the estimated ridge regression coefficients versus k. The
value of k is picked where the regression coefficients appear to have stabilized.
The smallest such value of k is preferred, since it introduces the least amount of bias.
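A sketch of how a ridge trace and the analytical choices of k could be obtained with the MASS package; the data frame, formula, and grid of k values are placeholders.

##########
library(MASS)

# Ridge fits over a grid of biasing constants k (called lambda by lm.ridge)
ridge.fits <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 0.1, by = 0.001))

# Ridge trace: one curve of standardized coefficient estimates per predictor
matplot(ridge.fits$lambda, t(ridge.fits$coef), type = "l",
        xlab = "k", ylab = "Coefficient estimate")

# Analytical suggestions for k (e.g., the Hoerl-Kennard-Baldwin value)
select(ridge.fits)
##########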
There are criticisms regarding ridge regression. One major criticism is
that ordinary inference procedures are not available since exact distribu-
tional properties of the ridge estimator are not known. Another criticism
is in the subjective choice of k. While we mentioned a few of the methods
here, there are numerous methods found in the literature, each with their
own limitations. On the flip-side of these arguments lie some potential ben-
efits of ridge regression. For example, it can accomplish what it sets out to
do, and that is reduce multicollinearity. Also, occasionally ridge regression
can provide an estimate of the mean response which is good for new values
that lie outside the range of our observations (called extrapolation). The
mean response found by ordinary least squares is known to not be good for
extrapolation.
This is a solution to
$$\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*.$$
Notice that we have not reduced the dimension of β̂ Z from the original cal-
culation, but we have only set certain values equal to 0. Furthermore, as in
ridge regression, we can transform back to the original scale by
$$\hat{\beta}_{PC,j}^{\dagger} = \hat{\beta}_{PC,j}\,\frac{s_Y}{s_{X_j}}$$
$$\hat{\beta}_{PC,0}^{\dagger} = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_{PC,j}^{\dagger}\,\bar{x}_j,$$
where j = 1, 2, \ldots, p − 1.
How do you choose the number of eigenvalues to omit? This can be
accomplished by looking at the cumulative percent variation explained by
each of the (p − 1) components. For the j th component, this percentage is
$$\frac{\sum_{i=1}^{j}\lambda_i}{\lambda_1 + \lambda_2 + \ldots + \lambda_{p-1}} \times 100\%.$$
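A base-R sketch of this component-selection idea for principal components regression; the predictor matrix `X`, response `y`, and the 90% cutoff are placeholders.

##########
# Principal components of the standardized predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative percent variation explained by the components
cum.pct <- cumsum(pc$sdev^2) / sum(pc$sdev^2) * 100
print(cum.pct)

# Keep enough components to reach (say) 90% and regress y on their scores
m <- which(cum.pct >= 90)[1]
pcr.fit <- lm(y ~ pc$x[, 1:m])
summary(pcr.fit)
##########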
$$z_i = \mathbf{X}^* r_i,$$
$$\hat{\boldsymbol{\beta}}_Z = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}^*,$$
$$\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{R}\hat{\boldsymbol{\beta}}_Z,$$
which is a solution to
$$\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*,$$
where j = 1, 2, . . . , p − 1.
The method described above is sometimes referred to as the SIMPLS
method. Another method commonly used is nonlinear iterative partial
least squares (NIPALS). NIPALS is more commonly used when you have
a vector of responses. While we do not discuss the differences between these
algorithms any further, we do discuss later the setting where we have a vector
of responses.
of the data. The tool commonly used is called Sliced Inverse Regression (or SIR). SIR uses the inverse regression curve E(X|Y = y), which falls into a reduced-dimension space under certain conditions. SIR uses this curve to perform a weighted principal components analysis such that one can determine an effective subset of the predictors. The reason for reducing the dimension of the predictors is the curse of dimensionality: drawing inferences from the same number of data points becomes difficult in a higher-dimensional space because the data are sparse relative to the volume of that space.
When working with the classic linear regression model
$$Y = X\beta + \epsilon$$
$$Y = f(X\beta) + \epsilon$$
3. Compute
$$\hat{m}_h(y_i) = n_h^{-1}\sum_{i=1}^{n} x_i^*\, I_h\{y_i\},$$
which are the means of the H slices.
4. Calculate the estimate for Cov(m(y)) by
$$\hat{V} = n^{-1}\sum_{h=1}^{H} n_h\, \hat{m}_h(y_i)\hat{m}_h(y_i)^T.$$
$$\hat{\boldsymbol{\beta}}_{CLS} = \hat{\boldsymbol{\beta}}_{OLS} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{A}^T[\mathbf{A}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{A}^T]^{-1}[\mathbf{A}\hat{\boldsymbol{\beta}}_{OLS} - \mathbf{a}],$$
18.6 Examples
Example 1: GNP Data
This data set of size n = 16 contains macroeconomic data taken between
the years 1947 and 1962. The economic indicators recorded were the GNP
implicit price deflator (IPD), the GNP, the number of people unemployed,
the number of people in the armed forces, the population, and the number
of people employed. We wish to see if the GNP IPD can be modeled as a
function of the other variables. The data set is given in Table 18.1.
First we run a multiple linear regression procedure to obtain the following
output:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2946.85636 5647.97658 0.522 0.6144
GNP 0.26353 0.10815 2.437 0.0376 *
Unemployed 0.03648 0.03024 1.206 0.2585
Armed.Forces 0.01116 0.01545 0.722 0.4885
Population -1.73703 0.67382 -2.578 0.0298 *
Year -1.41880 2.94460 -0.482 0.6414
Employed 0.23129 1.30394 0.177 0.8631
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Table 18.1: The macroeconomic data set for the years 1947 to 1962.
Figure 18.1: Ridge regression trace plot with the biasing constant on the x-axis and the ridge regression coefficients on the y-axis.
The ridge trace in Figure 18.1 shows that the regression coefficients shrink drastically until about k = 0.02. When using the Hoerl-Kennard method, a value of about k = 0.0068 is obtained. Other methods will certainly yield different estimates, which illustrates some of the criticism surrounding ridge regression.
The resulting estimates from this ridge regression analysis are
##########
GNP Unemployed Armed.Forces Population Year
25.3615288 3.3009416 0.7520553 -11.6992718 -6.5403380
Employed
0.7864825
##########
i Y X1 X2 X3
1 49.0 1300 7.5 0.0120
2 50.2 1300 9.0 0.0120
3 50.5 1300 11.0 0.0115
4 48.5 1300 13.5 0.0130
5 47.5 1300 17.0 0.0135
6 44.5 1300 23.0 0.0120
7 28.0 1200 5.3 0.0400
8 31.5 1200 7.5 0.0380
9 34.5 1200 11.0 0.0320
10 35.0 1200 13.5 0.0260
11 38.0 1200 17.0 0.0340
12 38.5 1200 23.0 0.0410
13 15.0 1100 5.3 0.0840
14 17.0 1100 7.5 0.0980
15 20.5 1100 11.0 0.0920
16 29.5 1100 17.0 0.0860
Figure 18.2: Pairwise scatterplots for the predictors (reactor temperature, H2 ratio, and contact time) from the acetylene data set. LOESS curves are also provided. Do there appear to be any possible linear relationships between pairs of predictors?
Chapter 19
Piecewise and Nonparametric Methods
For simplicity, we construct the piecewise linear regression model for the
case of simple linear regression and also briefly discuss how this can be ex-
tended to the multiple regression setting. First, let us establish what the
simple linear regression model with one knot value (k1 ) looks like:
$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2(X_1 - k_1)I(X_1 > k_1),$$
where $I(\cdot)$ is an indicator that equals 1 when its argument is true and 0 otherwise.
Such a regression model is fitted in the upper left-hand corner of Figure 19.1.
For more than one knot value, we can extend the above regression model
to incorporate other indicator values. Suppose we have c knot values (i.e.,
k1 , k2 , . . . , kc ) and we have n observations. Then the piecewise linear regres-
sion model is written as:
$$y = X\beta + \epsilon,$$
Furthermore, you can see how for more than one predictor you can construct
the X matrix to have columns as functions of the other predictors.
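A minimal R sketch of fitting a continuous piecewise linear model with one knot; the knot value, data frame, and variable names are placeholders.

##########
# Continuous piecewise linear fit with a single knot k1
k1 <- 1000
dat$x.knot <- pmax(dat$x - k1, 0)   # equals (x - k1) when x > k1, else 0

# The coefficient of x.knot is the change in slope past the knot
pw.fit <- lm(y ~ x + x.knot, data = dat)
summary(pw.fit)
##########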
[Figure 19.1: scatterplots of Y versus X, panels (a)–(d); panel (a) shows the piecewise linear fit with one knot.]
$$y_i = m(x_i) + \epsilon_i,$$
Here, K(·) is called the kernel function and h is called the bandwidth.
K(·) is a function often resembling a probability density function, but with
no parameters (some common kernel functions are provided in Table 19.1). h
controls the window width around x within which we perform the density estimation.
Thus, a kernel density estimator is essentially a weighting scheme (dictated
by the choice of kernel) which takes into consideration the proximity of a
point in the data set near x when given a bandwidth h. Furthermore, more
weight is given to points near x and less weight is given to points further
from x.
With the formalities established, one can perform a kernel regression of $y_i$ on $x_i$ to estimate $m_h(\cdot)$ with the Nadaraya-Watson estimator:
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)},$$
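A small base-R sketch that translates this estimator directly, using a Gaussian kernel; the vectors `x`, `y` and the bandwidth value are placeholders (base R's ksmooth() is a ready-made alternative).

##########
# Nadaraya-Watson kernel regression with a Gaussian kernel
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)    # kernel weights for the evaluation point x0
  sum(w * y) / sum(w)         # weighted average of the responses
}

grid  <- seq(min(x), max(x), length.out = 200)
m.hat <- sapply(grid, nw, x = x, y = y, h = 0.5)
plot(x, y)
lines(grid, m.hat)
##########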
Kernel       K(u)
Triangle     $(1 - |u|)\,I(|u| \le 1)$
Beta         $\frac{(1-u^2)^g}{\mathrm{Beta}(0.5,\,g+1)}\,I(|u| \le 1)$
Gaussian     $\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}u^2}$
Cosinus      $\frac{1}{2}(1 + \cos(\pi u))\,I(|u| \le 1)$
Optcosinus   $\frac{\pi}{4}\cos\!\left(\frac{\pi u}{2}\right)I(|u| \le 1)$

Table 19.1: Some common kernel functions.
the standard normal distribution, $\|K\|_2^2 = \int K^2(u)\,du$, and
$$\hat{\sigma}_h^2(x) = \frac{n^{-1}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\{y_i - \hat{m}_h(x)\}^2}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)},$$
where
$$z_{n,\alpha} = \frac{-\log\{-\tfrac{1}{2}\log(1-\alpha)\}}{(2\delta\log(n))^{1/2}} + d_n$$
and
$$d_n = (2\delta\log(n))^{1/2} + (2\delta\log(n))^{-1/2}\log\!\left(\frac{\|K'\|_2}{\sqrt{2\pi}\,\|K\|_2^2}\right).$$
Some final notes about kernel regression include:
• Choice of kernel and bandwidth are still major issues in research. There
are some general guidelines to follow and procedures that have been
developed, but are beyond the scope of this course.
• What we developed in this section is only for the case of one predictor.
If you have multiple predictors (i.e., x1,i , . . . , xp,i ), then one needs to
use a multivariate kernel density estimator at a point x = (x1 , . . . , xp )T ,
which is defined as
$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\prod_{j=1}^{p}h_j}\,K\!\left(\frac{x_{i,1}-x_1}{h_1},\ldots,\frac{x_{i,p}-x_p}{h_p}\right).$$
$$m(x_i) \approx \beta_0(x) + \beta_1(x)(x_i - x) + \beta_2(x)(x_i - x)^2 + \ldots + \beta_q(x)(x_i - x)^q, \quad |x_i - x| \le h,$$
so that
$$\beta_0(x) = m(x),\quad \beta_1(x) = m'(x),\quad \beta_2(x) = m''(x)/2,\quad \ldots,\quad \beta_q(x) = m^{(q)}(x)/q!.$$
Note that the β parameters are considered functions of x, hence the “local”
aspect of this methodology.
Local polynomial fitting minimizes
$$\sum_{i=1}^{n}\left\{y_i - \sum_{j=0}^{q}\beta_j(x)(x_i - x)^j\right\}^2 K\!\left(\frac{x_i - x}{h}\right)$$
LOESS is not a simple mathematical model, but rather an algorithm that, when given a value of X, computes an appropriate value of Y. The algorithm was designed so that the LOESS curve travels through the middle of the data and gives points closest to each X value the greatest weight in the smoothing process, thus limiting the influence of outliers.
Suppose we have a set of observations $(x_1, y_1), \ldots, (x_n, y_n)$. LOESS follows a basic algorithm as follows:
1. Select a set of values partitioning [x(1) , x(n) ]. Let x0 be an individual
value in this set.
2. For each observation, calculate the distance
di = |xi − x0 |.
Since outliers can have a large impact on least squares estimates, a robust weighted regression procedure may also be used to lessen the influence of outliers on the LOESS curve. This is done by replacing Step 3 in the algorithm above with a new set of weights. These weights are calculated by taking the q LOESS residuals $r_i^*$ and computing
$$w_i^* = w_i^{*\prime}\, B\!\left(\frac{|r_i^*|}{6M}\right).$$
Here, $w_i^{*\prime}$ is the previous weight for this observation (the first time this weight is calculated, the weights from the original LOESS procedure we outlined can be used), M is the median of the q absolute values of the residuals, and B(·) is the bisquare weight function given by
$$B(u) = \begin{cases}(1 - |u|^2)^2, & \text{if } |u| < 1;\\ 0, & \text{if } |u| \ge 1.\end{cases}$$
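A sketch of how such fits could be obtained in R with loess(); setting family = "symmetric" requests the robust refitting with bisquare weights described above. The data frame and span value are placeholders.

##########
# Ordinary and robust LOESS fits (hypothetical data frame 'dat')
lo.fit <- loess(y ~ x, data = dat, span = 0.5, degree = 1)
lo.rob <- loess(y ~ x, data = dat, span = 0.5, degree = 1,
                family = "symmetric")

ord <- order(dat$x)
plot(dat$x, dat$y)
lines(dat$x[ord], fitted(lo.rob)[ord])
##########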
$$y_i = m(x_i) + \epsilon_i$$
2. For j = 1, \ldots, maximize
$$R_{(j)}^2 = 1 - \frac{\sum_{i=1}^{n}\left\{r_i^{(j-1)} - \hat{m}_{(j)}\!\left(\hat{\alpha}_{(j)}^T x_i\right)\right\}^2}{\sum_{i=1}^{n}\left(r_i^{(j-1)}\right)^2}$$
4. Repeat steps 2 and 3 until $R_{(j)}^2$ becomes small. A small $R_{(j)}^2$ implies that $\hat{m}_{(j)}(\hat{\alpha}_{(j)}^T x_i)$ is approximately the zero function and we will not find any other useful direction.
The advantages of using PPR for estimation are that we are using univariate regressions, which are quick and easy to estimate, and that PPR is able to approximate a fairly rich class of functions as well as ignore variables providing little to no information about m(·). Some disadvantages of using PPR include having to examine a p-dimensional parameter space to estimate $\hat{\alpha}_{(j)}$, and that interpretation of a single term may be difficult.
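R's built-in ppr() function implements projection pursuit regression; a minimal sketch, with the data frame and number of ridge terms as placeholders:

##########
# Projection pursuit regression with two ridge terms
ppr.fit <- ppr(y ~ x1 + x2 + x3, data = dat, nterms = 2, max.terms = 5)
summary(ppr.fit)   # reports the estimated projection directions
plot(ppr.fit)      # plots the fitted ridge functions
##########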
Notice that the cubic smoothing spline introduced above is only capable of handling one predictor. Suppose now that we have p predictors $X_1, \ldots, X_p$. We wish to consider the model
$$y_i = \phi(x_{i,1}, \ldots, x_{i,p}) + \epsilon_i = \phi(x_i) + \epsilon_i,$$
where A(ω) is called the smoothing matrix. Then the GCV is defined as
$$V(\omega) = \frac{\|(I_{n\times n} - A(\omega))\,y\|^2 / n}{[\mathrm{tr}(I_{n\times n} - A(\omega))/n]^2}$$
For instance, if B = 5000 bootstrap samples are drawn and the regression coefficients for a simple linear regression are estimated from each, then the bootstrap will yield $(\beta_{0,1}^*, \beta_{1,1}^*), (\beta_{0,2}^*, \beta_{1,2}^*), \ldots, (\beta_{0,5000}^*, \beta_{1,5000}^*)$ as your sample.
Now suppose that you want the standard errors and confidence intervals
for the regression coefficients. The standard deviation of the B estimates
provided by the bootstrapping scheme is the bootstrap estimate of the stan-
dard error for the respective regression coefficient. Furthermore, a bootstrap
confidence interval is found by sorting the B estimates of a regression co-
efficient and selecting the appropriate percentiles from the sorted list. For
example, a 95% bootstrap confidence interval would be given by the 2.5th
and 97.5th percentiles from the sorted list. Other statistics may be computed
in a similar manner.
One assumption which bootstrapping relies heavily on is that your sam-
ple approximates the population fairly well. Thus, bootstrapping does not
usually work well for small samples as they are likely not representative of
the underlying population. Bootstrapping methods should be relegated to
medium sample sizes or larger (what constitutes a medium sample size is
somewhat subjective).
Now we can turn our attention to the two bootstrapping techniques avail-
able in the regression setting. Assume for both methods that our sample
consists of the pairs (x1 , y1 ), . . . , (xn , yn ). Extending either method to the
case of multiple regression is analogous.
We can first bootstrap the observations. In this setting, the bootstrap
samples are selected from the original pairs of data. So the pairing of a
response with its measured predictor is maintained. This method is appro-
priate for data in which both the predictor and response were selected at
random (i.e., the predictor levels were not predetermined).
We can also bootstrap the residuals. The bootstrap samples in this setting
are selected from what are called the Davison-Hinkley modified residu-
als, given by
$$e_i^* = \frac{e_i}{\sqrt{1 - h_{i,i}}} - \frac{1}{n}\sum_{j=1}^{n}\frac{e_j}{\sqrt{1 - h_{j,j}}},$$
where the ei ’s are the original regression residuals. We do not simply use
the ei ’s because these lead to biased results. In each bootstrap sample, the
randomly sampled modified residuals are added to the original fitted values
forming new values of y. Thus, the original structure of the predictors will
remain the same while only the response will be changed. This method is
appropriate for designed experiments where the levels of the predictor are
predetermined. Also, since the residuals are sampled and added back at
random, we must assume the variance of the residuals is constant. If not,
this method should not be used.
Finally, a 100 × (1 − α)% bootstrap confidence interval for the regression coefficient $\beta_i$ is given by
$$\left(\beta^*_{i,\lfloor \frac{\alpha}{2}\times B\rfloor},\; \beta^*_{i,\lceil (1-\frac{\alpha}{2})\times B\rceil}\right),$$
where the $\beta^*_{i,(b)}$ are the sorted bootstrap estimates.
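A sketch of bootstrapping the residuals for a simple linear regression using the Davison-Hinkley modified residuals; the data frame and B are placeholders.

##########
# Bootstrap of the modified residuals (hypothetical data frame 'dat')
fit  <- lm(y ~ x, data = dat)
r    <- resid(fit) / sqrt(1 - hatvalues(fit))
emod <- r - mean(r)                         # Davison-Hinkley modified residuals

B <- 5000
boot.coefs <- replicate(B, {
  y.star <- fitted(fit) + sample(emod, replace = TRUE)   # new responses
  coef(lm(y.star ~ dat$x))
})

apply(boot.coefs, 1, sd)                                  # bootstrap std. errors
apply(boot.coefs, 1, quantile, probs = c(0.025, 0.975))   # 95% percentile CIs
##########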
1. Draw a sample of size n, $(x_1, y_1), \ldots, (x_n, y_n)$, and divide the sample into s independent groups, each of size d.
2. Omit the first set of d observations from the sample and estimate $\beta_0$ and $\beta_1$ from the (n − d) remaining observations (call these estimates $\hat{\beta}_0^{(J_1)}$ and $\hat{\beta}_1^{(J_1)}$, respectively). The remaining set of (n − d) observations is called the delete-d jackknife sample.
3. Repeat the previous step, omitting each of the remaining groups in turn, to obtain s delete-d jackknife samples.
4. Obtain the (joint) probability distribution $F(\beta_0^{(J)}, \beta_1^{(J)})$ of the delete-d jackknife estimates. This may be done empirically or by fitting an appropriate distribution.
where $\hat{\beta}_j$ is the estimate obtained when using the full sample of size n. The jackknife variance for each regression coefficient is
$$\widehat{\mathrm{var}}_J(\hat{\beta}_j) = \frac{(n-1)}{n}\left(\hat{\beta}_j^{(J)} - \hat{\beta}_j\right)^2,$$
which implies that the jackknife standard error is
$$\widehat{\mathrm{s.e.}}_J(\hat{\beta}_j) = \sqrt{\widehat{\mathrm{var}}_J(\hat{\beta}_j)}.$$
While for moderately sized data the jackknife requires less computation,
there are some drawbacks to using the jackknife. Since the jackknife is us-
ing fewer samples, it is only using limited information about β̂. In fact,
the jackknife can be viewed as an approximation to the bootstrap (it is a
linear approximation to the bootstrap in that the two are roughly equal for
linear estimators). Moreover, the jackknife can perform quite poorly if the
estimator of interest is not sufficiently “smooth” (intuitively, smooth can be
thought of as small changes to the data result in small changes to the calcu-
lated statistic), which can especially occur when your sample is too small.
19.5 Examples
Example 1: Packaging Data Set
This data set of size n = 15 contains measurements from a packaging plant where the manager wants to model the unit cost (y) of shipping lots of a fragile product as a linear function of lot size (x). Table 19.2 gives the data used for this analysis. Because of economies of scale, the manager believes that the cost per unit will decrease at a faster rate for lot sizes of more than 1000.
Based on the description of this data, we wish to fit a (continuous) piece-
wise regression with one knot value at k1 = 1000. Figure 19.2 gives a scatter-
plot of the raw data with a vertical line at the lot size of 1000. This appears
to be a good fit.
We can also obtain summary statistics regarding the fit
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0240268 0.1766955 22.774 3.05e-11 ***
lot.size -0.0020897 0.0002052 -10.183 2.94e-07 ***
lot.size.I -0.0013937 0.0003644 -3.825 0.00242 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Figure 19.2: A scatterplot of the packaging data set with a piecewise linear
regression fitted to the data.
2. How do you think more data would affect the smoothness of the fits?
3. If we drive the span to 0, what type of regression line would you expect
to see?
4. If we drive the span to 1, what type of regression line would you expect
to see?
Figure 19.3(b) shows two kernel regression curves with two different band-
widths. A Gaussian kernel is used. Some things to think about when fitting
the data (as with the LOESS fit) are:
2. How do you think more data would affect the smoothness of the fits?
3. What type of regression line would you expect to see as we change the
bandwidth?
When performing local fitting (as with kernel regression), the last two points
above are issues where there are still no clear solutions.
Figure 19.3: (a) A scatterplot of the quality data set and two LOESS fits
with different spans. (b) A scatterplot of the quality data set and two kernel
regression fits with different bandwidths.
Next, let us return to the orthogonal regression fit of this data. Recall
that the slope term for the orthogonal regression fit was 1.4835. Using a
nonparametric bootstrap (with B = 5000 bootstraps), we can obtain the
following bootstrap confidence intervals for the orthogonal slope parameter:
Chapter 20
Regression Models with Censored Data
• Interval censoring: This occurs when a study has discrete time points
and an observation reaches a terminal event between two of the time
points. In other words, for discrete time increments 0 = t1 < t2 <
. . . < tr < ∞, we have Y1 < T < Y2 such that for j = 1, . . . , r − 1,
$$Y_1 = \begin{cases} t_j, & t_j < T < t_{j+1};\\ 0, & \text{otherwise}\end{cases}$$
and
$$Y_2 = \begin{cases} t_{j+1}, & t_j < T < t_{j+1};\\ \infty, & \text{otherwise.}\end{cases}$$
These are only the basics when it comes to survival (reliability) analysis.
However, they provide enough of a foundation for our interests. We are
interested in when a set of predictors (or covariates) are also measured
with the observed time.
where $\alpha_i = (t - X_i^T\beta)/\sigma$ and
$$\lambda(\alpha_i) = \frac{\phi(\alpha_i)}{1 - \Phi(\alpha_i)},$$
such that φ(·) and Φ(·) are the probability density function and cumulative
distribution function of a standard normal random variable (i.e., N (0, 1)),
respectively. Moreover, the quantity λ(αi ) is called the inverse Mills ratio,
which reappears later in our discussion about the truncated regression model.
If we let i1 be the index of all of the uncensored values and i2 be the index
of all of the left-censored values, then we can define a log-likelihood function
for the estimation of the regression parameters (see Appendix C for further
details on likelihood functions):
$$\ell(\beta, \sigma) = -\frac{1}{2}\sum_{i_1}\left[\log(2\pi) + \log(\sigma^2) + (y_i - X_i^T\beta)^2/\sigma^2\right] + \sum_{i_2}\log\!\left(1 - \Phi(X_i^T\beta/\sigma)\right).$$
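A sketch of how such a left-censored (Tobit-type) Gaussian fit could be obtained with the survival package; the censoring threshold and variable names are placeholders.

##########
library(survival)

# Left-censored Gaussian (Tobit-type) fit; a is the censoring threshold and
# the event indicator is 1 when the value is actually observed.
a   <- 0
fit <- survreg(Surv(y, y > a, type = "left") ~ x1 + x2,
               data = dat, dist = "gaussian")
summary(fit)
##########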
• Manufacturer, metal used for the drive mechanisms, and running tem-
perature of the machines.
Let $X^*$ be the matrix of covariates as in the standard multiple regression model, but without the first column consisting of 1's (so it is an n × (p − 1) matrix). Then we model
$$T^* = \beta_0 + X^{*T}\beta^* + \epsilon,$$
where $T^* = \log(T)$, so that $T = e^{T^*}$. So the covariate acts multiplicatively on the survival time T.
The distribution of $\epsilon$ will allow us to determine the distribution of $T^*$. Each possible probability distribution has a different h(t). Furthermore, in a survival regression setting, we assume the hazard rate at time t for an individual has the form:
$$h(t\,|\,X^*) = h_0(t)\,k(X^{*T}\beta^*) = h_0(t)\,e^{X^{*T}\beta^*}.$$
In the above, h0 (t) is called the baseline hazard and is the value of the
hazard function when X∗ = 0 or when β ∗ = 0. Note in the expression for
T ∗ that we separated out the intercept term β0 as it becomes part of the
baseline hazard. Also, k(·) in the equation for h(t|X∗ ) is a specified link
function, which for our purposes will be e(·) .
Next we discuss some of the possible (and common) distributions assumed for $\epsilon$. We do not write out the density formulas here, but they can be found in
most statistical texts. The parameters for your distribution help control three
primary aspects of the density curve: location, scale, and shape. You will
want to consider the properties your data appear to exhibit (or historically
have exhibited) when determining which of the following to use:
Note that the above is not an exhaustive list, but provides some of the more
commonly used distributions in statistical texts and software. Also, there is
an abuse of notation in that duplication of certain characters (e.g., µ, σ, etc.)
does not imply a mathematical relationship between all of the distributions
where that character appears.
Estimation of the parameters can be accomplished in two primary ways. One way is to construct a probability plot of the chosen distribution with your data and then apply least squares regression to this plot. Another, perhaps more appropriate, approach is to use maximum likelihood estimation, as it incorporates the censored observations directly into the likelihood.
Exponentiating both sides yields a ratio of the actual hazard rate and baseline
hazard rate, which is called the relative risk:
$$\frac{h(t)}{h_0(t)} = e^{X^{*T}\beta^*} = \prod_{i=1}^{p-1} e^{\beta_i x_i}.$$
Thus, the regression coefficients have the interpretation as the relative risk
when the value of a covariate is increased by 1 unit. The estimates of the
regression coefficients are interpreted as follows:
• The ratio of the estimated risk functions for two different sets of covari-
ates (i.e., two groups) can be used to examine the likelihood of Group
1’s survival (failure) time to Group 2’s survival (failure) time.
Remember, for this model the intercept term has been absorbed by the base-
line hazard.
The model we developed above is the Cox Proportional Hazards re-
gression model and does not include t on the right-hand side. Thus, the
relative risk is constant for all values of t. Estimation for this regression model
is usually done by maximum likelihood and Newton-Raphson is usually the
algorithm used. Usually, the baseline hazard is found nonparametrically, so the estimation procedure for the entire model is said to be semiparametric.
Additionally, if there are failure time ties in the data, then the likelihood gets
more complex and an approximation to the likelihood is usually used (such
as the Breslow Approximation or the Efron Approximation).
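A sketch of fitting such a Cox proportional hazards model with the survival package; the variable names are placeholders, and the Efron approximation shown is the package default for ties.

##########
library(survival)

# time = observed survival time, status = 1 if the event occurred, 0 if censored
cox.fit <- coxph(Surv(time, status) ~ x1 + x2, data = dat, ties = "efron")
summary(cox.fit)   # the exp(coef) column gives the estimated relative risks
##########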
Cox-Snell Residuals
In the previous regression models we studied, residuals were defined as a
difference between observed and fitted values. For survival regression, in
order to check the overall fit of a model, the Cox-Snell residual for the ith
observation in a data set is used and defined as:
$$r_{C_i} = \hat{H}_0(t_i)\, e^{X^{*T}\hat{\beta}^*}.$$
In the above, $\hat{\beta}^*$ is the maximum likelihood estimate of the regression coefficient vector. $\hat{H}_0(t_i)$ is a maximum likelihood estimate of the baseline cumulative hazard function $H_0(t_i)$, defined as:
$$H_0(t) = \int_0^t h_0(x)\,dx.$$
Notice that rCi > 0 for all i. The way we check for a goodness-of-fit with the
Cox-Snell residuals is to estimate the cumulative hazard rate of the residuals
(call this ĤrC (trCi )) from whatever distribution you are assuming, and then
plot ĤrC (trCi ) versus rCi . A good fit would be suggested if they form roughly
a straight line (like we looked for in probability plots).
Martingale Residuals
Define a censoring indicator for the ith observation as
$$\delta_i = \begin{cases} 0, & \text{if observation } i \text{ is censored};\\ 1, & \text{if observation } i \text{ is uncensored.}\end{cases}$$
In order to identify the best functional form for a covariate given the assumed
functional form of the remaining covariates, we use the Martingale residual
for the ith observation, which is defined as:
M̂i = δi − rCi .
The $\hat{M}_i$ values fall in the interval (−∞, 1] and are always negative
for censored values. The M̂i values are plotted against the xj,i , where j
represents the index of the covariate for which we are trying to identify
the best functional form. Plotting a smooth-fitted curve over this data set
will indicate what sort of function (if any) should be applied to xj,i . Note
that the martingale residuals are not symmetrically distributed about 0, but
asymptotically they have mean 0.
Deviance Residuals
Outlier detection in a survival regression model can be done using the de-
viance residual for the ith observation:
$$D_i = \mathrm{sgn}(\hat{M}_i)\sqrt{-2\left(\ell_i(\hat{\theta}) - \ell_{S_i}(\theta_i)\right)}.$$
For Di , `i (θ̂) is the ith log likelihood evaluated at θ̂, which is the maximum
likelihood estimate of the model’s parameter vector θ. `Si (θi ) is the log
likelihood of the saturated model evaluated at the maximum likelihood θ.
A saturated model is one where n parameters (i.e., θ1 , . . . , θn ) fit the n
observations perfectly.
The $D_i$ values should behave like a standard normal sample. A normal probability plot of the $D_i$ values and a plot of $D_i$ versus the fitted $\ln(t)_i$ values will help to determine if any values are fairly far from the bulk of the data. It should be noted that this only applies to cases where light to moderate censoring occurs.
Partial Deviance
Finally, we can also consider hierarchical (nested) models. We start by defin-
ing the model deviance:
$$\Delta = \sum_{i=1}^{n} D_i^2.$$
$$\Lambda = \Delta_R - \Delta_F = -2\left(\ell(\hat{\theta}_R) - \ell(\hat{\theta}_F)\right) = -2\log\frac{L(\hat{\theta}_R)}{L(\hat{\theta}_F)},$$
where $\ell(\hat{\theta}_R)$ and $\ell(\hat{\theta}_F)$ (with corresponding likelihoods $L(\hat{\theta}_R)$ and $L(\hat{\theta}_F)$) are the log likelihood functions evaluated at the maximum likelihood estimates of the reduced and full models, respectively. Luckily, this is a likelihood ratio statistic and has the corresponding asymptotic χ²
distribution. A large value of Λ (large with respect to the corresponding χ2
distribution) indicates the additional covariates improve the overall fit of the
model. A small value of Λ means they add nothing significant to the model
and you can keep the original set of covariates. Notice that this procedure
is similar to the extra sum of squares procedure developed in the previous
course.
In a truncated sample, observations falling outside certain threshold values are omitted entirely, so not even the values of the predictors are known. For example, suppose we had wages and years of schooling for a sample of employees. Some persons for this study are excluded from the sample because their earned wages fall below the minimum wage. So the data would be missing for these individuals.
Truncated regression models are often confused with the censored regres-
sion models that we introduced earlier. In censored regression models, only
the value of the dependent variable is clustered at a lower and/or upper
threshold value, while values of the independent variable(s) are still known.
In truncated regression models, entire observations are systematically omit-
ted from the sample based on the lower and/or upper threshold values. Re-
gardless, if we know that the data has been truncated, we can adjust our
estimation technique to account for the bias introduced by omitting values
from the sample. This will allow for more accurate inferences about the en-
tire population. However, if we are solely interested in the population that
does not fall outside the threshold value(s), then we can rely on standard
techniques that we have already introduced, namely ordinary least squares.
Let us formulate the general framework for truncated distributions. Sup-
pose that X is a random variable with a probability density function fX
and associated cumulative distribution function FX (the discrete setting is
defined analogously). Consider the two-sided truncation a < X < b. Then
the truncated distribution is given by
$$f_X(x\,|\,a < X < b) = \frac{g_X(x)}{F_X(b) - F_X(a)},$$
where
$$g_X(x) = \begin{cases} f_X(x), & a < x < b;\\ 0, & \text{otherwise.}\end{cases}$$
$g_X(x)$ is then defined accordingly for whichever distribution you are working with.
Consider the canonical multiple linear regression model
$$Y_i = X_i^T\beta + \epsilon_i,$$
where the $\epsilon_i$ are iid $N(0, \sigma^2)$, so that
$$Y_i\,|\,X_i \sim N(X_i^T\beta, \sigma^2).$$
When truncating the response, the distribution, and consequently the mean and variance of the truncated distribution, must be adjusted accordingly. Consider the three possible truncation settings of a < Y < b (two-sided truncation), a < Y_i (bottom-truncation), and Y_i < b (top-truncation). Let $\alpha_i = (a - X_i^T\beta)/\sigma$, $\gamma_i = (b - X_i^T\beta)/\sigma$, and $\psi_i = (y_i - X_i^T\beta)/\sigma$, such that $y_i$ is the realization of the random variable $Y_i$. Moreover, recall that $\lambda(z)$ is the inverse Mills ratio applied to the value of z.
Then using established results for the truncated normal distribution, the three different truncated probability density functions are
$$f_{Y|X}(y_i\,|\,\Theta, X_i^T, \beta, \sigma) = \begin{cases} \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{1 - \Phi(\alpha_i)}, & \Theta = \{a < Y_i\} \text{ and } a < y_i;\\[2ex] \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \Theta = \{a < Y_i < b\} \text{ and } a < y_i < b;\\[2ex] \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{\Phi(\gamma_i)}, & \Theta = \{Y_i < b\} \text{ and } y_i < b.\end{cases}$$
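As a sketch, the bottom-truncated case can be estimated by maximum likelihood directly from the density above using base R's optim(); the truncation point, data frame, and starting values below are placeholders (the truncreg package offers a ready-made alternative).

##########
# Negative log-likelihood for a regression with the response truncated below at a
negll <- function(par, y, X, a) {
  beta  <- par[-length(par)]
  sigma <- exp(par[length(par)])            # parameterize sigma > 0
  mu    <- drop(X %*% beta)
  -sum(dnorm(y, mu, sigma, log = TRUE) -
         pnorm(a, mu, sigma, lower.tail = FALSE, log.p = TRUE))
}

X    <- cbind(1, dat$x)
init <- c(coef(lm(y ~ x, data = dat)), log(sd(dat$y)))
opt  <- optim(init, negll, y = dat$y, X = X, a = 50, method = "BFGS")
opt$par    # intercept, slope, and log(sigma)
##########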
20.7 Examples
Example 1: Motor Dataset
This data set of size n = 16 contains observations from a temperature-
accelerated life test for electric motors. The motorettes were tested at four
different temperature levels and when testing terminated, the failure times
were recorded. The data can be found in Table 20.1.
Table 20.1: The motor data set measurements with censoring occurring if a
0 appears in the Censor column.
This data set is actually a very common data set analyzed in survival
analysis texts. We will proceed to fit it with a Weibull survival regression
model. The results from this analysis are
##########
Value Std. Error z p
(Intercept) 17.0671 0.93588 18.24 2.65e-74
count 0.3180 0.15812 2.01 4.43e-02
temp -0.0536 0.00591 -9.07 1.22e-19
Log(scale) -1.2646 0.24485 -5.17 2.40e-07
Scale= 0.282
Weibull distribution
Loglik(model)= -95.6 Loglik(intercept only)= -110.5
Chisq= 29.92 on 2 degrees of freedom, p= 3.2e-07
Number of Newton-Raphson Iterations: 9
n= 16
##########
Figure 20.1: (a) Plot of the deviance residuals. (b) NPP plot for the deviance
residuals.
or censored (as in Figure 20.2(b)). The dark blue line on the left is the truncated regression line that is estimated using ordinary least squares. So the interpretation of this line will only apply to those data that were not truncated, which is what the researcher is interested in. The estimated model is given below:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.8473 1.1847 42.921 < 2e-16 ***
x 1.6884 0.1871 9.025 7.84e-14 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Suppose now that the researcher is interested in all of the data and, say,
there is some problem with recovering those participants that were in the
truncated portion of the sample. Then the truncated regression line can be
estimated via the method of maximum likelihood estimation, which is the
light blue line in Figure 20.2(a). This line can (and will likely) go beyond
the level of truncation since the estimation method is accounting for the
truncation. The estimated model is given below:
##########
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 47.69938 1.89772 25.1351 < 2.2e-16 ***
x 2.09223 0.26979 7.7551 8.882e-15 ***
sigma 4.81855 0.46211 10.4273 < 2.2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Log-Likelihood: -231.21 on 3 Df
##########
This will be helpful for the researcher to say something about the broader population of individuals tested, not just those who received a math score of 50 or higher. Moreover, notice that both methods yield highly significant slope and intercept terms for this data, as would be expected by observing the strong linear trend in this data.
Figure 20.2(b) shows the estimate obtained when using a survival regres-
sion fit when assuming normal errors. Suppose the data had inadvertently
been censored at y = 50. So all of the red open circles now correspond to
a solid red circle in Figure 20.2(b). Since the data is now treated as left-
censored, we are actually fitting a Tobit regression model. The Tobit fit is
given by the green line and the results are given below:
##########
Value Std. Error z p
(Intercept) 52.21 1.0110 51.6 0.0e+00
x 1.50 0.1648 9.1 8.9e-20
Log(scale) 1.46 0.0766 19.0 7.8e-81
Scale= 4.3
Gaussian distribution
Loglik(model)= -241.6 Loglik(intercept only)= -267.1
Chisq= 51.03 on 1 degrees of freedom, p= 9.1e-13
Number of Newton-Raphson Iterations: 5
n= 100
##########
Moreover, the dashed red line in both figures is the ordinary least squares fit
(assuming all of the data values are known and used in the estimation) and
is simply provided for comparative purposes. The estimates for this fit are
given below:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.2611 0.9484 49.83 <2e-16 ***
x 2.1611 0.1639 13.19 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
As you can see, the structure of your data and underlying assumptions can change your estimates, namely because you are attempting to estimate different models. The regression lines in Figure 20.2 are a good example of how different assumptions can alter the final estimates that you report.
Figure 20.2: (a) A plot of the logical reasoning data. The red circles have
been truncated as they fall below 50. The maximum likelihood fit for the
truncated regression (solid light blue line) and the ordinary least squares fit
for the truncated data set (solid dark blue line) are shown. The ordinary
least squares line (which includes the truncated values for the estimation) is
shown for reference. (b) The logical reasoning data with a Tobit regression
fit provided (solid green line). The data has been censored at 50 (i.e., the
solid red dots are included in the data). Again, the ordinary least squares
line has been provided for reference.
x y x y x y x y
0.00 46.00 2.53 49.95 5.05 64.31 7.58 50.91
0.10 56.38 2.63 51.58 5.15 68.22 7.68 65.51
0.20 45.59 2.73 59.50 5.25 58.39 7.78 61.32
0.30 53.66 2.83 50.84 5.35 58.55 7.88 71.37
0.40 40.05 2.93 55.65 5.45 60.40 7.98 76.97
0.51 46.62 3.03 51.55 5.56 57.10 8.08 56.72
0.61 44.56 3.13 49.16 5.66 58.64 8.18 67.90
0.71 47.20 3.23 58.59 5.76 58.93 8.28 65.30
0.81 57.06 3.33 51.90 5.86 61.30 8.38 61.62
0.91 49.18 3.43 62.95 5.96 60.75 8.48 68.68
1.01 51.06 3.54 57.74 6.06 58.67 8.59 69.43
1.11 51.75 3.64 54.37 6.16 60.67 8.69 64.82
1.21 46.73 3.74 58.21 6.26 59.46 8.79 63.81
1.31 42.04 3.84 55.44 6.36 65.49 8.89 59.27
1.41 48.83 3.94 58.62 6.46 60.96 8.99 62.23
1.52 51.81 4.04 53.63 6.57 57.36 9.09 64.78
1.62 57.35 4.14 43.46 6.67 59.83 9.19 64.88
1.72 49.91 4.24 57.42 6.77 57.40 9.29 72.30
1.82 49.82 4.34 60.64 6.87 62.96 9.39 65.18
1.92 61.53 4.44 50.99 6.97 67.02 9.49 78.35
2.02 47.40 4.55 50.42 7.07 65.93 9.60 64.62
2.12 54.78 4.65 54.68 7.17 63.55 9.70 76.85
2.22 48.94 4.75 54.40 7.27 61.99 9.80 68.57
2.32 55.13 4.85 60.21 7.37 64.48 9.90 61.29
2.42 43.57 4.95 58.70 7.47 62.61 10.00 71.46
Table 20.2: The test scores from n = 100 participants for a logical reasoning
section (x) and a mathematics section (y).
Chapter 21
Nonlinear Regression
All of the models we have discussed thus far have been linear in the parame-
ters (i.e., linear in the beta terms). For example, polynomial regression was
used to model curvature in our data by using higher-ordered values of the
predictors. However, the final regression model was just a linear combination
of higher-ordered predictors.
Now we are interested in studying the nonlinear regression model:
$$Y = f(X, \beta) + \epsilon,$$
However, there are some nonlinear models which are actually called intrinsically linear because they can be made linear in the parameters by a simple transformation. For example:
$$Y = \frac{\beta_0 X}{\beta_1 + X}$$
can be rewritten as
$$\frac{1}{Y} = \frac{1}{\beta_0} + \frac{\beta_1}{\beta_0}\,\frac{1}{X} = \theta_0 + \theta_1\,\frac{1}{X},$$
which is linear in the transformed parameters $\theta_0 = 1/\beta_0$ and $\theta_1 = \beta_1/\beta_0$.
Returning to the general model, suppose
$$Y_i = f(X_i, \beta) + \epsilon_i,$$
where the $\epsilon_i$ are iid normal with mean 0 and constant variance $\sigma^2$. For this setting, we can rely on some of the least squares theory we have developed over the course. For other nonnormal error terms, different techniques need to be employed.
First, let
$$Q = \sum_{i=1}^{n}\left(y_i - f(X_i, \beta)\right)^2.$$
In order to find
$$\hat{\beta} = \arg\min_{\beta} Q,$$
we need the partial derivatives $\partial Q/\partial\beta_k$ for k = 0, 1, \ldots, p − 1.
Then, we set each of the above partial derivatives equal to 0 and the parameters $\beta_k$ are each replaced by $\hat{\beta}_k$. This yields:
$$\sum_{i=1}^{n} y_i \left.\frac{\partial f(X_i, \beta)}{\partial \beta_k}\right|_{\beta=\hat{\beta}} - \sum_{i=1}^{n} f(X_i, \hat{\beta})\left.\frac{\partial f(X_i, \beta)}{\partial \beta_k}\right|_{\beta=\hat{\beta}} = 0,$$
for k = 0, 1, . . . , p − 1.
The solutions to the critical values of the above partial derivatives for
nonlinear regression are nonlinear in the parameter estimates β̂ k and are
often difficult to solve, even in the simplest cases. Hence, iterative numerical
methods are often employed. Even more difficulty arises in that multiple
solutions may be possible!
• Let
$$J(\beta) = \begin{pmatrix} \frac{\partial \epsilon_1}{\partial \beta_1} & \ldots & \frac{\partial \epsilon_1}{\partial \beta_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial \epsilon_n}{\partial \beta_1} & \ldots & \frac{\partial \epsilon_n}{\partial \beta_k} \end{pmatrix}.$$
Nominal Logistic Regression: Used when there are three or more cat-
egories with no natural ordering to the levels. Examples of nominal
responses could include departments at a business (e.g., marketing,
sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN),
and color (black, red, blue, orange).
Ordinal Logistic Regression: Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how you rate the effectiveness of a college course on a scale of 1-5, levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).
The problems with logistic regression include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the estimated probabilities must lie between 0 and 1).
• With the logistic model, estimates of π from equations (21.1) will always be between 0 and 1. The reasons are:
• With one X variable, the theoretical model for π has an elongated “S”
shape (or sigmoidal shape) with asymptotes at 0 and 1, although in
sample estimates we may not see this “S” shape if the range of the X
variable is limited.
• First is
$$\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1} X_{p-1}}, \qquad (21.2)$$
• Second is
$$\log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1} X_{p-1}, \qquad (21.3)$$
which states that the logarithm of the odds is a linear function of the X variables (and is often called the log odds).
In order to discuss goodness-of-fit measures and residual diagnostics for
binary logistic regression, it is necessary to at least define the likelihood (see
Appendix C for a further discussion). For a sample of size n, the likelihood
for a binary logistic regression is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\pi_i^{y_i}(1-\pi_i)^{1-y_i} = \prod_{i=1}^{n}\left(\frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}}\right)^{y_i}\left(\frac{1}{1+e^{X_i^T\beta}}\right)^{1-y_i}.$$
Odds Ratio
The odds ratio (which we will write as θ) determines the relationship between
a predictor and response and is available only when the logit link is used.
The odds ratio can be any nonnegative number. An odds ratio of 1 serves
as the baseline for comparison and indicates there is no association between
the response and predictor. If the odds ratio is greater than 1, then the odds
of success are higher for the reference level of the factor (or for higher levels
of a continuous predictor). If the odds ratio is less than 1, then the odds of
success are less for the reference level of the factor (or for higher levels of
a continuous predictor). Values farther from 1 represent stronger degrees of
association. For binary logistic regression, the odds of success are:
$$\frac{\pi}{1-\pi} = e^{X^T\beta}.$$
This exponential relationship provides an interpretation for β. The odds
increase multiplicatively by eβj for every one-unit increase in Xj . More for-
mally, the odds ratio between two sets of predictors (say X(1) and X(2) ) is
given by
$$\theta = \frac{\left.\left(\pi/(1-\pi)\right)\right|_{X=X_{(1)}}}{\left.\left(\pi/(1-\pi)\right)\right|_{X=X_{(2)}}}.$$
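A sketch of fitting a binary logistic regression with glm() and reading off the estimated odds ratios; the data frame and predictor names are placeholders.

##########
# Binary logistic regression; the logit link is the default for binomial
logit.fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
summary(logit.fit)                 # Wald z-tests for the coefficients

exp(coef(logit.fit))               # estimated odds ratios per one-unit increase
exp(confint.default(logit.fit))    # Wald confidence intervals on the odds-ratio scale
##########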
Wald Test
The Wald test is the test of significance for regression coefficients in logistic
regression (recall that we use t-tests in linear regression). For maximum
likelihood estimates, the ratio
$$Z = \frac{\hat{\beta}_i}{\mathrm{s.e.}(\hat{\beta}_i)}$$
can be used to test H0 : βi = 0. The standard normal curve is used to
determine the p-value of the test. Furthermore, confidence intervals can be
constructed as
β̂i ± z1−α/2 s.e.(β̂i ).
Raw Residual
The raw residual is the difference between the actual response and the
estimated probability from the model. The formula for the raw residual is
ri = yi − π̂i .
Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals
by dividing by the standard deviation. The formula for the Pearson residuals
is
$$p_i = \frac{r_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}.$$
Deviance Residuals
Deviance residuals are also popular because the sum of squares of these
residuals is the deviance statistic. The formula for the deviance residual is
$$d_i = \pm\sqrt{2\left[y_i\log\!\left(\frac{y_i}{\hat{\pi}_i}\right) + (1-y_i)\log\!\left(\frac{1-y_i}{1-\hat{\pi}_i}\right)\right]}.$$
Hat Values
The hat matrix serves a similar purpose as in the case of linear regression -
to measure the influence of each observation on the overall fit of the model -
but the interpretation is not as clear due to its more complicated form. The
hat values are given by
$$h_{i,i} = \hat{\pi}_i(1-\hat{\pi}_i)\,x_i^T(X^TWX)^{-1}x_i,$$
where W is a diagonal matrix with the values $\hat{\pi}_i(1-\hat{\pi}_i)$ on its diagonal.
Studentized Residuals
We can also report Studentized versions of some of the earlier residuals. The
Studentized Pearson residuals are given by
$$sp_i = \frac{p_i}{\sqrt{1-h_{i,i}}}$$
C and C̄
C and C̄ are extensions of Cook's distance for logistic regression. C̄ measures the overall change in fitted logits due to deleting the ith observation for all points excluding the one deleted, while C includes the deleted point. They are defined by:
$$C_i = \frac{p_i^2 h_{i,i}}{(1-h_{i,i})^2}$$
and
$$\bar{C}_i = \frac{p_i^2 h_{i,i}}{(1-h_{i,i})}.$$
Goodness-of-Fit Tests
Overall performance of the fitted model can be measured by two different
chi-square tests. There is the Pearson chi-square statistic
n
X
P = p2i
i=1
One additional test is Brown’s test, which has a test statistic to judge
the fit of the logistic model to the data. The formula for the general alter-
native with two degrees of freedom is:
$$T = s^T C^{-1} s.$$
The formula for the symmetric alternative with 1 degree of freedom is:
$$\frac{(s_1+s_2)^2}{\mathrm{Var}(s_1+s_2)}.$$
To interpret the test, if the p-value is less than your accepted significance
level, then reject the null hypothesis that the model fits the data adequately.
R2
The calculation of R² used in linear regression does not extend directly to logistic regression. The version of R² used in logistic regression is defined as
$$R^2 = \frac{\ell(\hat{\beta}) - \ell(\hat{\beta}_0)}{\ell_S(\beta) - \ell(\hat{\beta}_0)},$$
where $\ell(\hat{\beta}_0)$ is the log likelihood of the model when only the intercept is included and $\ell_S(\beta)$ is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R² does go from 0 to 1, with 1 being a perfect fit.
$$\pi_j = \begin{cases} \dfrac{e^{X^T\beta_j}}{1 + \sum_{j=2}^{k} e^{X^T\beta_j}}, & j = 2, \ldots, k;\\[2ex] \dfrac{1}{1 + \sum_{j=2}^{k} e^{X^T\beta_j}}, & j = 1,\end{cases} \qquad (21.4)$$
where again πj denotes a probability and not the irrational number. Notice
that k − 1 of the groups have their own set of β values. Furthermore, since
P k
j=1 πj = 1, we set the β values for group 1 to be 0 (this is what we call the
reference group). Notice that when k = 2, we are back to binary logistic
regression.
πj is the probability that an observation is in one of k categories. The
likelihood for the nominal logistic regression model is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}(1-\pi_{i,j})^{1-y_{i,j}},$$
where the subscript (i, j) means the ith observation belongs to the j th group.
This yields the log likelihood:
$$\ell(\beta) = \sum_{i=1}^{n}\sum_{j=1}^{k} y_{i,j}\log(\pi_{i,j}),$$
where the subscript (i, j) means the ith observation belongs to the j th group.
$$P(X = x\,|\,\lambda) = \frac{e^{-\lambda}\lambda^x}{x!},$$
for x = 0, 1, 2, \ldots. Notice that the Poisson distribution is characterized by the single parameter λ, which is the mean rate of occurrence for the event being measured. For the Poisson distribution, it is assumed that large counts (with respect to the value of λ) are rare.
Poisson regression is similar to logistic regression in that the dependent variable (Y) is not a continuous measurement. Specifically, Y is an observed count that follows the Poisson distribution, but the rate λ is now determined by the covariates, typically through the log link $\lambda = e^{X^T\beta}$.
Goodness-of-Fit
Overall performance of the fitted model can be measured by two different
chi-square tests. There is the Pearson statistic
P = Σ_{i=1}^{n} (y_i − exp{X_i^T β̂})² / exp{X_i^T β̂}
and the deviance statistic
G = 2 Σ_{i=1}^{n} [ y_i log( y_i / exp{X_i^T β̂} ) − (y_i − exp{X_i^T β̂}) ].
Deviance
Recall the measure of deviance introduced in the study of survival regressions
and logistic regression. The measure of deviance for the Poisson regression
setting is given by
D(y, β̂) = 2[ℓ_S(β) − ℓ(β̂)],
where ℓ_S(β) is the log likelihood of the saturated model (i.e., where a model
is fit perfectly to the data). This measure of deviance (which differs from the
deviance statistic defined earlier) is a generalization of the sum of squares
from linear regression. The deviance also has an approximate chi-square
distribution.
Pseudo R2
The value of R2 used in linear regression also does not extend to Poisson
regression. One commonly used measure is the pseudo R2 , defined as
R2 = [ℓ(β̂) − ℓ(β̂_0)] / [ℓ_S(β) − ℓ(β̂_0)],
where ℓ(β̂_0) is the log likelihood of the model when only the intercept is
included. The pseudo R2 goes from 0 to 1, with 1 being a perfect fit.
Raw Residual
The raw residual is the difference between the actual response and the
estimated value from the model. Remember that the variance is equal to the
mean for a Poisson random variable. Therefore, we expect that the variances
of the residuals are unequal. This can lead to difficulties in the interpretation
of the raw residuals, yet it is still used. The formula for the raw residual is
r_i = y_i − exp{X_i^T β̂}.
Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals
by dividing by the standard deviation. The formula for the Pearson residuals
is
p_i = r_i / √( φ̂ exp{X_i^T β̂} ),
where
φ̂ = [1/(n − p)] Σ_{i=1}^{n} (y_i − exp{X_i^T β̂})² / exp{X_i^T β̂}.
Deviance Residuals
Deviance residuals are also popular because the sum of squares of these
residuals is the deviance statistic. The formula for the deviance residual is
d_i = sgn(y_i − exp{X_i^T β̂}) √( 2[ y_i log( y_i / exp{X_i^T β̂} ) − (y_i − exp{X_i^T β̂}) ] ).
Hat Values
The hat matrix serves the same purpose as in the case of linear regression -
to measure the influence of each observation on the overall fit of the model.
The hat values, hi,i , are the diagonal entries of the Hat matrix
H = W1/2 X(XT WX)−1 XT W1/2 ,
where W is an n × n diagonal matrix with the values of exp{X_i^T β̂} on the
diagonal. As before, a hat value is considered large if h_{i,i} > 2p/n.
Studentized Residuals
Finally, we can also report Studentized versions of some of the earlier resid-
uals. The Studentized Pearson residuals are given by
sp_i = p_i / √(1 − h_{i,i})
and
sd_i = d_i / √(1 − h_{i,i}).
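As a hedged sketch (not part of the original example), the Poisson regression residuals and hat values above can be obtained in R; `dat` and the predictor names are hypothetical placeholders.

fit <- glm(y ~ x1 + x2, family = poisson, data = dat)
mu <- fitted(fit)                                        # exp{x_i' beta-hat}
r  <- dat$y - mu                                         # raw residuals
phi.hat <- sum((dat$y - mu)^2 / mu) / df.residual(fit)   # dispersion estimate
p  <- r / sqrt(phi.hat * mu)                             # Pearson residuals (with dispersion)
d  <- residuals(fit, type = "deviance")                  # deviance residuals
h  <- hatvalues(fit)                                     # hat values
sp <- p / sqrt(1 - h); sd.res <- d / sqrt(1 - h)         # Studentized versions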
g(µ) = log(µ) = X^T β ⇒ µ = e^{X^T β},
which can also be used in logistic regression. This link function is also
sometimes called the gompit link.
g(µ) = µ^λ = X^T β ⇒ µ = (X^T β)^{1/λ},
where σ 2 is a scale parameter. There are also tests using likelihood ratio sta-
tistics for model development to determine if any predictors may be dropped
from the model.
21.6 Examples
Example 1: Nonlinear Regression Example
A simple model for population growth towards an asymptote is the logistic
model
y_i = β_1 / (1 + e^{β_2 + β_3 x_i}) + ε_i,
where yi is the population size at time xi , β1 is the asymptote towards which
the population grows, β2 reflects the size of the population at time x =
0 (relative to its asymptotic size), and β3 controls the growth rate of the
population.
We fit this model to Census population data for the United States (in
millions) ranging from 1790 through 1990 (see Table 21.1). The data are
graphed in Figure 21.1(a) and the line represents the fit of the logistic pop-
ulation growth model.
To fit the logistic model to the U. S. Census data, we need starting values
for the parameters. It is often important in nonlinear least squares estimation
to choose reasonable starting values, which generally requires some insight
into the structure of the model. We know that β1 represents asymptotic
population. The data in Figure 21.1(a) show that in 1990 the U. S. population
stood at about 250 million and did not appear to be close to an asymptote;
Table 21.1: The U. S. Census population data (in millions) from 1790 through 1990.
year population
1790 3.929
1800 5.308
1810 7.240
1820 9.638
1830 12.866
1840 17.069
1850 23.192
1860 31.443
1870 39.818
1880 50.156
1890 62.948
1900 75.995
1910 91.972
1920 105.711
1930 122.775
1940 131.669
1950 150.697
1960 179.323
1970 203.302
1980 226.542
1990 248.710
so as not to extrapolate too far beyond the data, let us set the starting value
of β1 to 350. It is convenient to scale time so that x1 = 0 in 1790, and so
that the unit of time is 10 years. Then substituting β1 = 350 and x = 0 into
the model, using the value y1 = 3.929 from the data, and assuming that the
error is 0, we have
3.929 = 350 / (1 + e^{β_2 + β_3(0)}).
Solving this equation for β_2 gives β_2 = log(350/3.929 − 1) ≈ 4.5; a similar rough
calculation at a later census year (or simply a small negative guess) provides a
starting value for β_3. So now we have starting values for the nonlinear least
squares algorithm that we use. Below is the output from running a Gauss-Newton
algorithm for optimization. As you can see, the starting values resulted in
convergence with values not too far from our guesses.
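A hedged sketch of how this fit could be reproduced with R's nls() function follows; the starting value for β_3 below is an illustrative guess rather than a value taken from the text.

population <- c(3.929, 5.308, 7.240, 9.638, 12.866, 17.069, 23.192, 31.443,
                39.818, 50.156, 62.948, 75.995, 91.972, 105.711, 122.775,
                131.669, 150.697, 179.323, 203.302, 226.542, 248.710)
time <- 0:20   # decades since 1790
fit <- nls(population ~ beta1 / (1 + exp(beta2 + beta3 * time)),
           start = list(beta1 = 350, beta2 = 4.5, beta3 = -0.3))
summary(fit)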
##########
Formula: population ~ beta1/(1 + exp(beta2 + beta3 * time))
Parameters:
Estimate Std. Error t value Pr(>|t|)
beta1 389.16551 30.81197 12.63 2.20e-10 ***
beta2 3.99035 0.07032 56.74 < 2e-16 ***
beta3 -0.22662 0.01086 -20.87 4.60e-14 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Figure 21.1: (a) Plot of the Census data with the logistic functional fit. (b)
Plot of the residuals versus the year.
Figure 21.1(b) is a plot of the residuals versus the year. As you can see,
the logistic functional form that we chose captures the gross characteristics
of the data, but some of the finer structure is not as well characterized.
Since there are indications of some cyclical behavior, a model incorporating
correlated errors or, perhaps, trigonometric terms could be investigated.
The following gives the estimated logistic regression equation and associ-
ated significance tests. The reference group of remission is 1 for this data.
##########
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 64.25808 74.96480 0.857 0.391
cell 30.83006 52.13520 0.591 0.554
smear 24.68632 61.52601 0.401 0.688
infil -24.97447 65.28088 -0.383 0.702
li 4.36045 2.65798 1.641 0.101
blast -0.01153 2.26634 -0.005 0.996
temp -100.17340 77.75289 -1.288 0.198
As you can see, the labeling index of the bone marrow leukemia cells (li) appears
to be the closest to a significant predictor of remission occurring. After looking at
various subsets of the data, it is found that a significant model is one which
only includes the labeling index as a predictor.
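A hedged sketch of the reduced fit in R follows; `remiss` and the data frame name are hypothetical labels for the 0/1 remission indicator and the leukemia data set.

fit <- glm(remiss ~ li, family = binomial, data = leukemia)
summary(fit)                              # should give output of the form shown below
plot(residuals(fit, type = "deviance"))   # deviance residuals (cf. Figure 21.2(a))
plot(residuals(fit, type = "pearson"))    # Pearson residuals (cf. Figure 21.2(b))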
##########
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Figure 21.2: (a) Plot of the deviance residuals. (b) Plot of the Pearson
residuals.
Figure 21.2 also gives plots of the deviance residuals and the Pearson
residuals. These plots do not suggest any problems with the fit.
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Figure 21.4: (a) Plot of the deviance residuals. (b) Plot of the Pearson
residuals.
Table 21.2: The leukemia data set. Descriptions of the variables are given in
the text.
i xi yi i xi yi
1 2 0 16 16 7
2 15 6 17 13 6
3 19 4 18 6 2
4 14 1 19 16 5
5 16 5 20 19 5
6 15 2 21 24 6
7 9 2 22 9 2
8 17 10 23 12 5
9 10 3 24 7 1
10 23 10 25 9 3
11 14 2 26 7 3
12 14 6 27 15 3
13 9 5 28 21 4
14 5 2 29 20 6
15 17 2 30 20 9
Multivariate Multiple
Regression
Up until now, we have only been concerned with univariate responses (i.e.,
the case where the response Y is simply a single value for each observation).
However, sometimes you may have multiple responses measured for each ob-
servation, whether it be different characteristics or perhaps measurements
taken over time. When our regression setting must accommodate multiple
responses for a single observation, the technique is called multivariate regres-
sion.
Y_i = (1, X_i^T) B + ε_i,
where
B = ( β_1 β_2 . . . β_m ), with kth column β_k = (β_{0,k}, β_{1,k}, . . . , β_{p−1,k})^T,
and
ε_i = (ε_{i,1}, . . . , ε_{i,m}).
Y = XB + ε.
which is the vector of errors for the jth trial of all n observations. We assume
that E(ε_{(j)}) = 0 and Cov(ε_{(i)}, ε_{(k)}) = σ_{i,k} I_{n×n} for each i, k = 1, . . . , m. Notice
that the errors within a given trial have variance-covariance matrix Σ =
{σ_{i,k}}, but errors from different trials are uncorrelated.
The least squares estimate for B is simply given by:
B̂ = (XT X)−1 XT Y.
The fitted values are then Ŷ = XB̂.
Hypothesis Testing
Suppose we are interested in testing the hypothesis that our multivariate
responses do not depend on the predictors Xi,q+1 , . . . , Xi,p−1 . We can par-
tition B to consist of two matrices: one with the regression coefficients of
the predictors we assume will remain in the model and one with the regres-
sion coefficients we wish to test. Similarly, we can partition X in a similar
manner. Formally, the test is
H0 : β (2) = 0,
where B is partitioned (row-wise) into β^(1) and β^(2), i.e.,
B = [ β^(1) ; β^(2) ],
and
X = [ X_1  X_2 ].
Here X_2 is an n × (p − q − 1) matrix of predictors corresponding to the null
hypothesis and X_1 is an n × q matrix of predictors we assume will remain
in the model. Furthermore, β^(2) and β^(1) are (p − q − 1) × m and q × m
matrices, respectively, for these predictor matrices.
Under the null hypothesis, we can calculate
β̂^(1) = (X_1^T X_1)^{-1} X_1^T Y
and
Σ̂1 = (Y − X1 β̂ (1) )T (Y − X1 β̂ (1) )/n.
These values (which are maximum likelihood estimates under the null hy-
pothesis) can be used to calculate one of four commonly used multivariate
test statistics:
Wilks' Lambda = |nΣ̂| / |nΣ̂_1|
Pillai's Trace = tr[ (Σ̂_1 − Σ̂) Σ̂_1^{-1} ]
Confidence Regions
One problem is to predict the mean responses corresponding to fixed values
T
xh of the predictors. Using various distributional results concerning B̂ xh
and Σ̂, it can be shown that the 100 × (1 − α)% simultaneous confidence
intervals for E(Yi |X = xh ) = xT
h β̂ i are
x_h^T β̂_i ± √( [m(n − p − 2)/(n − p − 1 − m)] F_{m, n−p−1−m; 1−α} )
        × √( x_h^T (X^T X)^{-1} x_h [n/(n − p − 2)] σ̂_{i,i} ),
for i = 1, . . . , m. Here, β̂_i is the ith column of B̂ and σ̂_{i,i} is the ith diagonal
element of Σ̂. Also, notice that the simultaneous confidence intervals are
constructed for each of the m entries of the response vector, which is why they
are considered “simultaneous”. Furthermore, the collection of these simultaneous
intervals yields what we call a 100 × (1 − α)% confidence region for B̂^T x_h.
Prediction Regions
Another problem is to predict new responses Yh = BT xh + εh . Again,
skipping over a discussion on various distributional assumptions, it can be
shown that the 100 × (1 − α)% simultaneous prediction intervals for the
individual responses Yh,i are
x_h^T β̂_i ± √( [m(n − p − 2)/(n − p − 1 − m)] F_{m, n−p−1−m; 1−α} )
        × √( (1 + x_h^T (X^T X)^{-1} x_h) [n/(n − p − 2)] σ̂_{i,i} ),
for i = 1, . . . , m. The quantities here are the same as those in the simultane-
ous confidence intervals. Furthermore, the collection of these simultaneous
prediction intervals is called a 100 × (1 − α)% prediction region for y_h.
MANOVA
The multivariate analysis of variance (MANOVA) table is similar to
its univariate counterpart. The sum of squares values in a MANOVA are
no longer scalar quantities, but rather matrices. Hence, the entries in the
MANOVA table are called sum of squares and cross-products (SSCPs).
These quantities are described in a little more detail below:
• The sum of squares and cross-products for total is SSCPTO =
Σ_{i=1}^{n} (Y_i − Ȳ)(Y_i − Ȳ)^T, which is the sum of squared deviations from
the overall mean vector of the Y_i's. SSCPTO is a measure of the overall
variation in the Y vectors. The corresponding total degrees of freedom
are n − 1.
• The sum of squares and cross-products for the errors is SSCPE =
Σ_{i=1}^{n} (Y_i − Ŷ_i)(Y_i − Ŷ_i)^T, which is the sum of squared observed errors
(residuals) for the observed data vectors. SSCPE is a measure of the vari-
ation in Y that is not explained by the multivariate regression. The
corresponding error degrees of freedom are n − p.
Source      df      SSCP
Regression  p − 1   Σ_{i=1}^{n} (Ŷ_i − Ȳ)(Ŷ_i − Ȳ)^T
Error       n − p   Σ_{i=1}^{n} (Y_i − Ŷ_i)(Y_i − Ŷ_i)^T
Total       n − 1   Σ_{i=1}^{n} (Y_i − Ȳ)(Y_i − Ȳ)^T
Table 22.1: MANOVA table for the multivariate multiple linear regression
model.
Notice in the MANOVA table that we do not define any mean square
values or an F -statistic. Rather, a test of the significance of the multivari-
ate multiple regression model is carried out using a Wilks’ lambda quantity
similar to
Λ* = | Σ_{i=1}^{n} (Y_i − Ŷ_i)(Y_i − Ŷ_i)^T | / | Σ_{i=1}^{n} (Y_i − Ȳ)(Y_i − Ȳ)^T |,
Y = XB + ε,
Y = XAC + WD + ε.
In order to get estimates for the reduced rank regression model, first note
that E(ε_{(j)}) = 0 and Var(ε_{(j)}) = I_{m×m} ⊗ Σ. For simplicity in the following,
let Z_0 = Y, Z_1 = X, and Z_2 = W. Next, we define the moment matrices
M_{i,j} = Z_i^T Z_j / n for i, j = 0, 1, 2 and S_{i,j} = M_{i,j} − M_{i,2} M_{2,2}^{-1} M_{2,j}, i, j = 0, 1.
Then, the parameter estimates for the reduced rank regression model are as
follows:
Â = (ν̂_1, . . . , ν̂_t)Φ
Ĉ = S_{0,1} Â (Â^T S_{1,1} Â)^{-1}
D̂ = M_{0,2} M_{2,2}^{-1} − Ĉ Â^T M_{1,2} M_{2,2}^{-1},
where (ν̂_1, . . . , ν̂_t) are the eigenvectors corresponding to the t largest eigen-
values λ̂_1, . . . , λ̂_t of |λ S_{1,1} − S_{1,0} S_{0,0}^{-1} S_{0,1}| = 0 and where Φ is an arbitrary t × t
matrix with full rank.
22.4 Example
Example: Amitriptyline Data
This example analyzes conjectured side effects of amitriptyline - a drug some
physicians prescribe as an antidepressant. Data were gathered on n = 17
patients admitted to a hospital with an overdose of amitriptyline. The two
response variables are Y1 = total TCAD plasma level and Y2 = amount of
amitriptyline present in TCAD plasma level. The five predictors measured
are X1 = gender (0 for male and 1 for female), X2 = amount of the drug
taken at the time of overdose, X3 = PR wave measurement, X4 = diastolic
blood pressure, and X5 = QRS wave measurement. Table 22.2 gives the data
set and we wish to fit a multivariate multiple linear regression model.
Y1 Y2 X1 X2 X3 X4 X5
3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129
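A hedged sketch of this fit in R follows (the data frame name `ami` is a placeholder); lm() accepts a matrix response, which produces one set of coefficients per response variable.

fit <- lm(cbind(Y1, Y2) ~ X1 + X2 + X3 + X4 + X5, data = ami)
coef(fit)                      # coefficient matrix, one column per response
summary(fit)                   # separate summaries for Y1 and Y2
crossprod(residuals(fit))      # error sum of squares and cross-products (SSCPE)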
##########
Coefficients:
Y1 Y2
(Intercept) -2879.4782 -2728.7085
X1 675.6508 763.0298
X2 0.2849 0.3064
X3 10.2721 8.8962
X4 7.2512 7.2056
X5 7.5982 4.9871
##########
Then we can obtain individual ANOVA tables for each response and see that
the multiple regression model for each response is statistically significant.
##########
Response Y1 :
Df Sum Sq Mean Sq F value Pr(>F)
Regression 5 6835932 1367186 17.286 6.983e-05 ***
Residuals 11 870008 79092
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Response Y2 :
Df Sum Sq Mean Sq F value Pr(>F)
Regression 5 6669669 1333934 15.598 0.0001132 ***
Residuals 11 940709 85519
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##########
The following also gives the SSCP matrices for this fit:
##########
$SSCPR
Y1 Y2
Y1 6835932 6709091
Y2 6709091 6669669
$SSCPE
Y1 Y2
Y1 870008.3 765676.5
Y2 765676.5 940708.9
$SSCPTO
Y1 Y2
Y1 7705940 7474767
Y2 7474767 7610378
##########
We can also see which predictors are statistically significant for each response:
##########
Response Y1 :
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.879e+03 8.933e+02 -3.224 0.008108 **
X1 6.757e+02 1.621e+02 4.169 0.001565 **
X2 2.849e-01 6.091e-02 4.677 0.000675 ***
X3 1.027e+01 4.255e+00 2.414 0.034358 *
X4 7.251e+00 3.225e+00 2.248 0.046026 *
X5 7.598e+00 3.849e+00 1.974 0.074006 .
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Response Y2 :
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.729e+03 9.288e+02 -2.938 0.013502 *
X1 7.630e+02 1.685e+02 4.528 0.000861 ***
X2 3.064e-01 6.334e-02 4.837 0.000521 ***
X3 8.896e+00 4.424e+00 2.011 0.069515 .
X4 7.206e+00 3.354e+00 2.149 0.054782 .
X5 4.987e+00 4.002e+00 1.246 0.238622
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Figure 22.1: Plots of the Studentized residuals versus fitted values for the
response (a) total TCAD plasma level and the response (b) amount of
amitriptyline present in TCAD plasma level.
Data Mining
The field of Statistics is constantly being presented with larger and more
complex data sets than ever before. The challenge for the Statistician is to
be able to make sense of all of this data, extract important patterns, and
find meaningful trends. We refer to the general tools and the approaches for
dealing with these challenges in massive data sets as data mining.1
Data mining problems typically involve an outcome measurement which
we wish to predict based on a set of feature measurements. The set of
these observed measurements is called the training data. From these train-
ing data, we attempt to build a learner, which is a model used to predict the
outcome for new subjects. These learning problems are (roughly) categorized
as either supervised or unsupervised. A supervised learning problem is
one where the goal is to predict the value of an outcome measure based on a
number of input measures, such as classification with labeled samples from
the training data. An unsupervised learning problem is one where there is
no outcome measure and the goal is to describe the associations and patterns
among a set of input measures, which involves clustering unlabeled training
data by partitioning a set of features into a number of statistical classes. The
regression problems that are the focus of this text are (generally) supervised
learning problems.
Data mining is an extensive field in and of itself. In fact, many of the
methods utilized in this field are regression-based. For example, smoothing
splines, shrinkage methods, and multivariate regression methods are all often
found in data mining. The purpose of this chapter will not be to revisit these
¹Data mining is also referred to as statistical learning or machine learning.
In fact, a slight modification to the LARS algorithm can calculate all possible
LASSO estimates for a given problem. Moreover, a different modification
to LARS efficiently implements forward stagewise regression. In fact, the
acronym for LARS includes an “S” at the end to reflect its connection to
LASSO and forward stagewise regression.
Earlier in the text we also introduced the bootstrap as a way to get boot-
strap confidence intervals for the regression parameters. However, the notion
of the bootstrap can also be extended to fitting a regression model. Suppose
that we have p − 1 feature measurements and one outcome variable. Let
Z = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} be our training data that we wish to fit
a model to such that we obtain the prediction fˆ(x) at each input x. Boot-
strap aggregation or bagging averages this prediction over a collection of
bootstrap samples, thus reducing its variance. For each bootstrap sample
Z∗b , b = 1, 2, . . . , B, we fit our model, which yields the prediction fˆb∗ (x). The
bagging estimate is then defined by
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x).
Denote the empirical distribution function by P̂, which puts equal proba-
bility 1/n on each of the data points (xi , yi ). The “true” bagging estimate
is defined by EP̂ fˆ∗ (x), where Z∗ = {(x∗1 , y1∗ ), (x∗2 , y2∗ ), . . . , (x∗n , yn∗ )} and each
(x∗i , yi∗ ) ∼ P̂ . Note that the bagging estimate given above is a Monte Carlo
estimate of the “true” bagging estimate, which it approaches as B → ∞. The
bagging approach can be used in other model selection approaches through-
out Statistics and data mining.
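As a hedged illustration of the bagging estimate just described, the following R sketch averages a regression fit over bootstrap samples; the data frame `dat`, the cubic polynomial model, and B = 200 are all illustrative assumptions.

set.seed(1)
B <- 200
boot.preds <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)        # bootstrap sample Z*_b
  fit <- lm(y ~ poly(x, 3), data = dat[idx, ])    # fit the model to Z*_b
  predict(fit, newdata = dat)                     # f-hat*_b(x) at each input x
})
f.bag <- rowMeans(boot.preds)                     # bagging estimate f-hat_bag(x)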
The setting with nonlinear curves and where clusters cannot be completely
separated is illustrated in Figure 23.2. Without loss of generality, our dis-
cussion will mainly be focused on the one attribute and one feature setting.
Moreover, we will be utilizing support vectors in order to build a regression
relationship that fits our data adequately.
A little more terminology is necessary before we move into the regression
discussion. A loss function represents the loss in utility associated with
an estimate being “wrong” (i.e., different from either a desired or a true
value) as a function of a measure of the degree of “wrongness” (generally
the difference between the estimated value and the true or desired value).
When discussing SVM modeling in the regression setting, the loss function
will need to incorporate a distance measure as well.
Figure 23.2: A plot of data where a support vector machine has been used
for classification. The data were generated where we know that the circles
belong to group 1 and the triangles belong to group 2. The white contours
show where the margin is; however, there are clearly some values that have
been misclassified since the two clusters are not well-separated. The points
that are solid were used as the training data.

As a quick illustration of some common loss functions, look at Figure 23.3.
Figure 23.3(a) is a quadratic loss function, which is what we use in classical
ordinary least squares. Figure 23.3(b) is a Laplacian loss function, which
is less sensitive to outliers than the quadratic loss function. Figure 23.3(c)
is Huber's loss function, which is a robust loss function that has optimal
properties when the underlying distribution of the data is unknown. Finally,
Figure 23.3(d) is called the ε-insensitive loss function, which enables a sparse
set of support vectors to be obtained.
In Support Vector Regressions (or SVRs), the input is first mapped
onto an N -dimensional feature space using some fixed (nonlinear) mapping,
and then a linear model is constructed in this feature space. Using mathe-
matical notation, the linear model (in the feature space) is given by
f(x, ω) = Σ_{j=1}^{N} ω_j g_j(x) + b,
Figure 23.3: Plots of the (a) quadratic loss, (b) Laplace loss, (c) Huber's loss,
and (d) ε-insensitive loss functions.
then the bias term is dropped. Note that b is not considered stochastic in
this model and is not akin to the error terms we have studied in previous
models.
The optimal regression function is given by the minimum of the functional
Φ(ω, ξ) = (1/2)‖ω‖² + C Σ_{i=1}^{n} (ξ_i^− + ξ_i^+),
subject to the constraints
y_i − f(x_i, ω) ≤ ε + ξ_i^+,
f(x_i, ω) − y_i ≤ ε + ξ_i^−,
ξ_i^−, ξ_i^+ ≥ 0, i = 1, . . . , n,
where yi is defined through the loss function we are using. The four loss
functions we show in Figure 23.3 are as follows:
Quadratic Loss:
L_2(f(x) − y) = (f(x) − y)²
Laplace Loss:
L_1(f(x) − y) = |f(x) − y|
Huber's Loss:
L_H(f(x) − y) = (1/2)(f(x) − y)², for |f(x) − y| < δ;
L_H(f(x) − y) = δ|f(x) − y| − δ²/2, otherwise.
ε-Insensitive Loss:
L_ε(f(x) − y) = 0, for |f(x) − y| < ε;
L_ε(f(x) − y) = |f(x) − y| − ε, otherwise.
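For concreteness, the four loss functions can be written as simple R functions of the residual r = f(x) − y; the default δ and ε values below are arbitrary illustrative choices.

quad.loss    <- function(r) r^2                         # quadratic loss
laplace.loss <- function(r) abs(r)                      # Laplace loss
huber.loss   <- function(r, delta = 1)                  # Huber's loss
  ifelse(abs(r) < delta, 0.5 * r^2, delta * abs(r) - delta^2 / 2)
eps.loss     <- function(r, eps = 0.5)                  # epsilon-insensitive loss
  pmax(abs(r) - eps, 0)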
depending on which loss function is used and the investigator should become
familiar with the loss function being employed. Regardless, the optimization
approach will require the use of numerical methods.
It is also desirable to strike a balance between complexity and the error
that is present with the fitted model. Test error (also known as gener-
alization error) is the expected prediction error over an independent test
sample and is given by
Err = E[ L(Y, f̂(X)) ],
where X and Y are drawn randomly from their joint distribution. This
expectation is an average of everything that is random in this set-up, includ-
ing the randomness in the training sample that produced the estimate fˆ(·).
Training error is the average loss over the training sample and is given by
err = (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i)).
We would like to know the test error of our estimated model fˆ(·). As the
model increases in complexity, it is able to capture more complicated un-
derlying structures in the data, which thus decreases bias. But then the
estimation error increases, which thus increases variance. This is known as
the bias-variance tradeoff. In between there is an optimal model complex-
ity that gives minimum test error.
AdaBoost.R2
Input the labeled target data set T of size n, the maximum number of
iterations B, and a base learning algorithm called Learner. Unless otherwise
specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n.
For t = 1, . . . , B:
1. Call Learner with the training set T and the distribution wt , and get
a hypothesis ht : X → R.
2. Calculate the adjusted error e_i^t for each instance. Let D_t = max_i |y_i −
h_t(x_i)|, so that e_i^t = |y_i − h_t(x_i)|/D_t.
4. Let γ_t = ε_t/(1 − ε_t).
5. Update the weight vector as w_i^{t+1} = w_i^t γ_t^{1−e_i^t} / Z_t, such that Z_t is a
normalizing constant.
ExpBoost.R2
Input the labeled target data set T of size n, the maximum number of
iterations B, and a base learning algorithm called Learner. Unless otherwise
specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n.
Moreover, each source data set gets assigned to one expert from the set of
experts H B = {h1 , . . . , hB }.
For t = 1, . . . , B:
1. Call Learner with the training set T and the distribution wt , and get
a hypothesis ht : X → R.
2. Calculate the adjusted error e_i^t for each instance. Let D_t = max_i |y_i −
h_t(x_i)|, so that e_i^t = |y_i − h_t(x_i)|/D_t.
5. Let γ_t = ε_t/(1 − ε_t).
6. Update the weight vector as w_i^{t+1} = w_i^t γ_t^{1−e_i^t} / Z_t, such that Z_t is a
normalizing constant.
We proceed to grow the tree by finding the best binary partition in terms
of the ĉ_m values. This is generally computationally infeasible, which leads to
the use of a greedy search algorithm. Typically, the tree is grown until a small
node size (such as 5) is reached, and then a method for pruning the
tree is implemented.
Multivariate adaptive regression splines (MARS) is another non-
parametric method that can be viewed as a modification of CART and is well-
suited for high-dimensional problems. MARS uses expansions in piecewise
linear basis functions of the form (x−t)+ and (t−x)+ such that the “+” sub-
script simply means we take the positive part (e.g., (x−t)+ = (x−t)I(x > t)).
These two functions together are called a reflected pair.
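As a small hedged illustration, a reflected pair can be coded directly in R (a full MARS fit could be obtained with, e.g., the earth package, though that is not shown in the text):

h.plus  <- function(x, t) pmax(x - t, 0)   # (x - t)_+
h.minus <- function(x, t) pmax(t - x, 0)   # (t - x)_+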
In MARS, each function is piecewise linear with a knot at t. The idea is
to form a reflected pair for each predictor X_j with knots at each observed
value x_{i,j} of that predictor. Therefore, the collection of basis functions for
j = 1, . . . , p is
f(X) = β_0 + Σ_{m=1}^{M} β_m h_m(X),
where
β = (β_0, β_1, . . . , β_{m−1})^T and S_i = (S_{i,0}, S_{i,1}, . . . , S_{i,m−1})^T.
S_{i,0} equals 1 and, for j = 1, . . . , m − 1, the jth derived predictor value for the
ith observation, S_{i,j}, is a nonlinear function f_j of a linear combination of the
original predictors:
S_{i,j} = f_j(X_i^T θ_j),
where
θ_j = (θ_{j,0}, θ_{j,1}, . . . , θ_{j,p−1})^T and X_i = (X_{i,0}, X_{i,1}, . . . , X_{i,p−1})^T.
model as:
y_i = f_Y(S_i^T β) + ε_i
    = f_Y( β_0 + Σ_{j=1}^{m−1} β_j f_j(X_i^T θ_j) ) + ε_i.
which has a tree structure with r levels (i.e., r levels where probabilistic splits
occur). The λ(·) functions provide the probabilities for the splitting and, in
addition to being dependent on the predictors, they also have their own
set of parameters (the different τ values) requiring estimation (these mixing
23.6 Examples
Example 1: Simulated Neural Network Data
In this very simple toy example, we have provided two features (i.e., input
neurons X1 and X2 ) and one response measurement (i.e., output neuron Y )
which are given in Table 23.1. The model fit is one where X1 and X2 do not
interact on Y . A single hidden-layer and a double hidden-layer neural net
model are each fit to this data to highlight the difference in the fits. Below
is the output (for each neural net) which shows the results from training
the model. A total of 5 training samples were used and a threshold value
of 0.01 was used as a stopping criterion. The stopping criterion pertains to
the partial derivatives of the error function and once we fall beneath that
threshold value, then the algorithm stops for that training sample.
X1 X2 Y
0 0 0
1 0 1
0 1 1
1 1 0
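A hedged sketch of how such fits might be produced with the neuralnet package follows; the hidden-layer sizes are illustrative guesses, while rep = 5 and threshold = 0.01 correspond to the 5 training repetitions and the stopping threshold described above.

library(neuralnet)
xor.dat <- data.frame(X1 = c(0, 1, 0, 1), X2 = c(0, 0, 1, 1), Y = c(0, 1, 1, 0))
set.seed(1)
fit1 <- neuralnet(Y ~ X1 + X2, data = xor.dat, hidden = 1,
                  rep = 5, threshold = 0.01)      # single hidden layer
fit2 <- neuralnet(Y ~ X1 + X2, data = xor.dat, hidden = c(2, 2),
                  rep = 5, threshold = 0.01)      # double hidden layer
plot(fit1); plot(fit2)                            # diagrams like Figure 23.4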
Figure 23.4: (a) The fitted single hidden-layer neural net model to the toy
data. (b) The fitted double hidden-layer neural net model to the toy data.
In the above output, the first group of 5 repetitions pertain to the single
hidden-layer neural net. For those 5 repetitions, the third training sample
yielded the smallest error (about 0.3481). The second group of 5 repetitions
Figure 23.5: (a) Data from a simulated motorcycle accident where the time
until impact (in milliseconds) is plotted versus the recorded head acceleration
(in g). (b) The data with different values of ε used for the support vector
regression obtained with an ε-insensitive loss function. Note how the smaller
the ε, the more features you pick up in the fit, but the complexity of the model
also increases.
pertain to the double hidden-layer neural net. For those 5 repetitions, the
fourth training sample yielded the smallest error (about 0.0002). The increase
in complexity of the neural net has yielded a smaller training error. The fitted
neural net models are depicted in Figure 23.4.
want to try and strike a good balance regarding the model complexity. For
the training error, we get values of 0.177, 0.168, and 0.250 for the three
levels of ε. Since our objective is to minimize the training error, the value
of ε = 0.10 (which has a training error of 0.168) is chosen. This corresponds to the
green line in Figure 23.5(b).
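A hedged sketch of such support vector regression fits follows, assuming the simulated motorcycle data are the mcycle data from the MASS package (the variable names times and accel match) and using the svm() function from the e1071 package.

library(MASS); library(e1071)
eps.values <- c(0.01, 0.10, 0.70)
fits <- lapply(eps.values, function(eps)
  svm(accel ~ times, data = mcycle, type = "eps-regression", epsilon = eps))
plot(mcycle$times, mcycle$accel, xlab = "times", ylab = "accel")
for (f in fits) lines(mcycle$times, predict(f, mcycle))   # one curve per epsilon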
Advanced Topics
This chapter presents topics for which theory beyond the scope of this course
would need to be developed in order to treat them fully. The topics are not arranged
in any particular order, but rather are just a sample of some of the more
advanced regression procedures that are available. Not all computer software
has the capabilities to perform analysis on the models presented here.
where
zj,i = I{x1,i = j}.
In other words, we are using the leave-one-out method for the levels of x_1.
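As a hedged aside, an additive fit of this general form could be obtained with the mgcv package; the data frame and variable names below are hypothetical placeholders.

library(mgcv)
fit <- gam(y ~ x1 + s(x2) + s(x3) + s(x4), data = dat)  # x1 a factor; smooths for x2-x4
plot(fit, pages = 1)                                     # estimated partial/smooth effects
summary(fit)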
we have the same setting as in a partial linear model, but a link function
relates the sum of parametric and nonparametric components to the
response.
Generalized Partial Linear Partial Additive Models: In the model
E(Y | U, V) = g( U^T β + Σ_{j=1}^{s} m_j(V_j) ),
Figure 24.2: Scatterplot of the infant data with a trajectory (in this case, a
quadratic response curve) fitted to each infant.
Figure 24.3: Plots for each group of infants where each group has a different
number of measurements.
• Does trauma affect brain stem activation in a way that inhibits mem-
ory?
Figure 24.4: Diagram showing the basic flow of a mediation regression model.
amongst the variables already in our model, are called moderator variables
and are often tested as an interaction effect. An XM interaction coefficient that is
significantly different from 0 in the second equation above suggests that the α_2 coefficient
differs across different levels of X. These different coefficient levels may
reflect mediation as a manipulation, thus altering the relationship between
M and Y . The moderator variables may be either a manipulated factor in
an experimental setting (e.g., dosage of medication) or a naturally occurring
variable (e.g., gender). By examining moderator effects, one can investigate
whether the experiment differentially affects subgroups of individuals. Three
primary models involving moderator variables are typically studied:
Moderated mediation: The simplest of the three, this model has a vari-
able which mediates the effects of an independent variable on a depen-
dent variable, and the mediated effect depends on the level of another
variable (i.e., the moderator). Thus, the mediational mechanism differs
for subgroups of the study. This model is more complex from an inter-
pretative viewpoint when the moderator is continuous. Basically, you
have either X → M and/or M → Y dependent on levels of another
variable (call it Z).
where σ_i² is the variance of the effect size in study i. Between-study
variance σ_η² is estimated using common estimation procedures for ran-
dom effects models (such as restricted maximum likelihood (REML)
estimators).
P(A|B) = P(B|A) P(A) / P(B).
β̂ = (XT X)−1 XT y
is constructed from the frequentist’s view (along with the maximum like-
lihood estimate σ̂ 2 of σ 2 ) in that we assume there are enough measurements
of the predictors to say something meaningful about the response. In the
Bayesian view, we assume we have only a small sample of the possible
measurements and we seek to correct our estimate by “borrowing” informa-
tion from a larger set of similar observations.
P(X ≤ x) = τ.
where µ(·) is some parametric function and ρ_τ(·) is called the linear check
function, defined as ρ_τ(u) = u(τ − I(u < 0)).
Figure 24.5: Various quantile regression fits for the food expenditures data
set.
τ = 0.95 regression quantile), while those with the lowest food expenditures
will likely have smaller regression coefficients (such as the τ = 0.05 regression
quantile). The estimates for each of these quantile regressions are as follows:
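These fits can be produced with the rq() function in the quantreg package; the sketch below assumes the food expenditure data are the engel data shipped with that package (variables income and foodexp), which is a hedged assumption since the output below uses the generic predictor name x.

library(quantreg)
data(engel)
taus <- c(0.05, 0.10, 0.25, 0.75, 0.90, 0.95)
fit <- rq(foodexp ~ income, tau = taus, data = engel)   # one fit per quantile
coef(fit)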
##########
Coefficients:
tau= 0.05 tau= 0.10 tau= 0.25
(Intercept) 124.8800408 110.1415742 95.4835396
x 0.3433611 0.4017658 0.4741032
tau= 0.75 tau= 0.90 tau= 0.95
(Intercept) 62.3965855 67.3508721 64.1039632
x 0.6440141 0.6862995 0.7090685
Isotonic Regression
line which is plotted is called the convex minorant. Each predictor value
at which this convex minorant touches the cumulative sums is a value of
the predictor at which the slope changes in the isotonic
regression plot.
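A hedged sketch using R's built-in isoreg() function follows; x0 and y denote the (hypothetical) predictor and response vectors plotted above.

fit <- isoreg(x0, y)          # isotonic (monotone increasing) least squares fit
plot(fit)                     # step-function fit, as in the upper panel
cs <- cumsum(fit$y)           # cumulative sums of the responses (ordered by x)
plot(fit$x, cs)               # the greatest convex minorant of these points
                              #   touches them where the fitted slope changes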
The weights in the geographic weighting matrix W(g) are chosen such that
those observations near the point in space where the parameter estimates
are desired have more influence on the result than those observations further
away. This model is essentially a local regression model like the one discussed
in the section on LOESS. While the choice of a geographic (or spatially)
weighted matrix is a blend of art and science, one commonly used weight is
the Gaussian weight function, where the diagonal entries of the n × n matrix
W(g) are:
w_i(g) = exp{−d_i/h},
where di is the Euclidean distance between observation i and location g, while
h is the bandwidth.
The resulting parameter estimates or standard errors for the spatial het-
erogeneity model may be mapped in order to examine local variations in the
parameter estimates. Hypothesis tests are also possible regarding this model.
Spatial regression models also accommodate spatial dependency in two
major ways: through a spatial lag dependency (where the spatial correlation
occurs in the dependent variable) or a spatial error dependency (where the
spatial correlation occurs through the error term). A spatial lag model is
a spatial regression model which models the response as a function of not
only the predictors, but also values of the response observed at other (likely
neighboring) locations:
y_i = f(y_{j(i)}; θ) + X_i^T β + ε_i,
where u is a vector of random error terms. Other spatial processes exist, such
as a conditional autoregressive process and a spatial moving average
process, both of which resemble similar time series processes.
Estimation of these spatial regression models can be accomplished through
various techniques, but they differ depending on if you have a spatial lag de-
pendency or a spatial error dependency. Such estimation methods include
maximum likelihood estimation, the use of instrumental variables, and semi-
parametric methods.
There are also tests for the spatial autocorrelation coefficient, of which the
most notable uses Moran's I statistic. Moran's I statistic is calculated
as
I = ( e^T W(g) e / S_0 ) / ( e^T e / n ),
where e is a vector of ordinary least squares residuals, W(g) is a geographic
weighting matrix, and S_0 = Σ_{i=1}^{n} Σ_{j=1}^{n} w_{i,j} is a normalizing factor. Then,
Moran's I test can be based on a normal approximation using a standardized
value of the I statistic, with
E(I) = tr(MW) / (n − p)
and
Var(I) = { tr(MWMW^T) + tr(MWMW) + [tr(MW)]² } / [ (n − p)(n − p + 2) ],
where M = I_{n×n} − X(X^T X)^{-1} X^T.
As an example, let us consider 1978 house prices in Boston, to which we will
try to fit a spatial regression model with spatial error dependency. There
are 20 variables measured for 506 locations. Certain transformations on the
predictors have already been performed, following the original investigators' analysis. In
particular, only 13 of the predictors are of interest. First a test on the spatial
autocorrelation coefficient is performed:
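A hedged sketch of how this analysis might be run in R follows, using the spdep and spatialreg packages; boston.c is the Boston tract data (available in the spData package) and boston.listw is assumed to be a spatial weights (listw) object built from the tract neighbours, e.g., via nb2listw().

library(spdep); library(spatialreg)
f <- log(MEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) + AGE +
     log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT)
ols <- lm(f, data = boston.c)
lm.morantest(ols, listw = boston.listw)                           # Moran's I for OLS residuals
err.fit <- errorsarlm(f, data = boston.c, listw = boston.listw)   # spatial error model
summary(err.fit)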
##########
Global Moran’s I for regression residuals
data:
model: lm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS +
I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX +
PTRATIO + B + log(LSTAT), data = boston.c)
weights: boston.listw
Residuals:
Min 1Q Median 3Q Max
-0.6476342 -0.0676007 0.0011091 0.0776939 0.6491629
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.85706025 0.16083867 23.9809 < 2.2e-16
CRIM -0.00545832 0.00097262 -5.6120 2.000e-08
ZN 0.00049195 0.00051835 0.9491 0.3425907
INDUS 0.00019244 0.00282240 0.0682 0.9456389
CHAS1 -0.03303428 0.02836929 -1.1644 0.2442466
I(NOX^2) -0.23369337 0.16219194 -1.4408 0.1496286
I(RM^2) 0.00800078 0.00106472 7.5145 5.707e-14
AGE -0.00090974 0.00050116 -1.8153 0.0694827
log(DIS) -0.10889420 0.04783714 -2.2764 0.0228249
log(RAD) 0.07025730 0.02108181 3.3326 0.0008604
TAX -0.00049870 0.00012072 -4.1311 3.611e-05
PTRATIO -0.01907770 0.00564160 -3.3816 0.0007206
y_i = β_0 + β_1 x_i + ε_i (mod 2π),
with the error assumed to follow a von Mises distribution with circular mean 0 and
concentration parameter κ. The von Mises distribution is the circular analog
of the univariate normal distribution, but has a more “complex” form. The
von Mises distribution with circular mean µ and concentration parameter κ
is defined on the range x ∈ [0, 2π), with probability density function
f(x) = e^{κ cos(x−µ)} / (2π I_0(κ))
and cumulative distribution function
F(x) = [ 1/(2π I_0(κ)) ] { x I_0(κ) + 2 Σ_{j=1}^{∞} I_j(κ) sin(j(x − µ))/j }.
In the above, I_p(·) is called a modified Bessel function of the first kind
of order p. The Bessel function is the contour integral
I_p(z) = (1/(2πi)) ∮ e^{(z/2)(t−1/t)} t^{−(p+1)} dt,
where the contour encloses the origin and is traversed in a counterclockwise
direction in the complex plane, with i = √(−1). Maximum likelihood
estimates can be obtained for the circular regression models (with minor
differences in the details when dealing with a circular predictor or linear
predictor). Needless to say, such formulas do not lend themselves well to
closed-form solutions. Thus we turn to numerical methods, which go beyond
the scope of this course.
As an example, suppose we have a data set of size n = 100 where Y
is a circular response and X is a continuous predictor (so a circular-linear
regression model will be built). The error terms are assumed to follow a von
Mises distribution with circular mean 0 and concentration parameter κ (for
this generated data, κ = 1.9). The error terms used in the generation of this
data can be plotted on a circular histogram as given in Figure 24.10.
Estimates for the circular-linear regression fit are given below:
##########
Circular-Linear Regression
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Figure 24.7: (a) Plot of the von Mises error terms used in the generation of
the sample data. (b) Plot of the continuous predictor (X) versus the circular
response (Y ) along with the circular-linear regression fit.
Log-Likelihood: 55.89
Notice that the maximum likelihood estimates of µ and κ are 0.4535 and
1.954, respectively. Both estimates are close to the values used for gener-
ation of the error terms. Furthermore, the values in parentheses next to
these estimates are the standard errors for the estimates - both of which are
relatively small.
A rough way of looking at the data and estimated circular-linear regres-
sion equation is given in Figure 24.10. This is difficult to display since we are
Figure 24.8: (a) Plot of spark-ignition engine fuel data with equivalence
ratio as the response and the measure of nitrogen oxide emissions. (b) Plot
of the same data with EM algorithm estimates from a 2-component mixture
of regressions fit.
There are many issues one should be cognizant of when building a mixture
model. In particular, maximum likelihood estimation can be quite complex
since the likelihood does not yield closed-form solutions and there are iden-
tifiability issues (however, the use of a Newton-Raphson or EM algorithm
usually provides a good solution). One alternative is to use a Bayesian ap-
proach with Markov Chain Monte Carlo (MCMC) methods, but this too has
its own set of complexities. While we do not explore these issues, we do see
how a mixture model can occur in the regression setting.
A mixture of linear regressions model can be used when it appears
that there is more than one regression line that could fit this data due to
some underlying characteristic (i.e., a latent variable). Suppose we have n
observations which belongs to one of k groups. If we knew to which group
an observation belonged (i.e., its label), then we could write down explicitly
the linear regression model given that observation i belongs to group j:
y_i = X_i^T β_j + ε_{i,j},
such that ε_{i,j} is normally distributed with mean 0 and variance σ_j². Notice
how the regression coefficients and variance terms are different for each group.
However, now assume that the labels are unobserved. In this case, we can
only assign a probability that observation i came from group j. Specifically,
the density function for the mixture of linear regression model is:
f(y_i) = Σ_{j=1}^{k} λ_j (2πσ_j²)^{−1/2} exp{ −(y_i − X_i^T β_j)² / (2σ_j²) },
such that Σ_{j=1}^{k} λ_j = 1. Estimation is done by using the likelihood (or rather
the log likelihood) function based on the above density. For maximum likelihood,
one typically uses an EM algorithm.
As an example, consider the data set which gives the equivalence ratios
and peak nitrogen oxide emissions in a study using pure ethanol as a spark-
ignition engine fuel. A plot of the equivalence ratios versus the measure
of nitrogen oxide is given in Figure 24.11. Suppose one wanted to predict
the equivalence ratio from the amount of nitrogen oxide emissions. As you
can see, there appear to be groups of data where separate regressions appear
appropriate (one with a positive trend and one with a negative trend). Figure
24.11 gives the same plot, but with estimates from an EM algorithm overlaid.
EM algorithm estimates for these data are β̂_1 = (0.565, 0.085)^T, β̂_2 = (1.247,
−0.083)^T, σ̂_1² = 0.00188, and σ̂_2² = 0.00058.
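A hedged sketch of such an EM fit using the mixtools package follows; NO and equiv are hypothetical names for the nitrogen oxide and equivalence ratio variables.

library(mixtools)
set.seed(1)
em.fit <- regmixEM(y = equiv, x = NO, k = 2)   # 2-component mixture of linear regressions
em.fit$beta      # intercept and slope for each component
em.fit$sigma     # component error standard deviations
em.fit$lambda    # estimated mixing proportions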
It should be noted that mixtures of regressions appear in many areas.
For example, in economics it is called switching regimes. In the social