
Part IV

Advanced Regression Models

Chapter 17

Polynomial Regression

In this chapter, we provide models to account for curvature in a data set.


This curvature may be an overall trend of the underlying population, or it
may be a structure confined to a specific region of the predictor space. We
will explore two common methods in this chapter.

17.1 Polynomial Regression


In our earlier discussions on multiple linear regression, we have outlined ways
to check assumptions of linearity by looking for curvature in various plots.

• For instance, we look at the plot of residuals versus the fitted values.

• We also look at a scatterplot of the response value versus each predictor.

Sometimes, a plot of the response versus a predictor may also show some
curvature in that relationship. Such plots may suggest there is a nonlin-
ear relationship. If we believe there is a nonlinear relationship between the
response and predictor(s), then one way to account for it is through a poly-
nomial regression model:

Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_h X^h + \epsilon,   (17.1)

where h is called the degree of the polynomial. For lower degrees, the
relationship has a specific name (i.e., h = 2 is called quadratic, h = 3 is
called cubic, h = 4 is called quartic, and so on). As for a bit of semantics,
it was noted at the beginning of the previous course how nonlinear regression


(which we discuss later) refers to the nonlinear behavior of the coefficients,


which are linear in polynomial regression. Thus, polynomial regression is still
considered linear regression!
In order to estimate equation (17.1), we would only need the response
variable (Y ) and the predictor variable (X). However, polynomial regression
models may have other predictor variables in them as well, which could lead
to interaction terms. So as you can see, equation (17.1) is a relatively simple
model, but you can imagine how the model can grow depending on your
situation!
For the most part, we implement the same analysis procedures as done
in multiple linear regression. To see how this fits into the multiple linear
regression framework, let us consider a very simple data set of size n = 50
that I generated (see Table 17.1). The data was generated from the quadratic
model
y_i = 5 + 12x_i - 3x_i^2 + \epsilon_i,   (17.2)

where the \epsilon_i's are assumed to be normally distributed with mean 0 and variance 2. A scatterplot of the data along with the fitted simple linear regression
line is given in Figure 17.1(a). As you can see, a linear regression line is not
a reasonable fit to the data.
Residual plots of this linear regression analysis are also provided in Figure
17.1. Notice in the residual plot how there is obvious curvature rather than
the uniform random scatter we have seen before. The histogram appears
heavily left-skewed and does not show the ideal bell shape for normality.
Furthermore, the normal probability plot (NPP) deviates from a straight line
and curves down at the extreme percentiles. These plots alone suggest that
something is wrong with the model being used and, in particular, point toward
a higher-order model.
The matrices for the second-degree polynomial model are:
\[
\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{50} \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{50} & x_{50}^2 \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{50} \end{pmatrix},
\]

where the entries in Y and X would consist of the raw data. So as you can
see, we are in a setting where the analysis techniques used in multiple linear
regression (e.g., OLS) are applicable here.
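To make this concrete, here is a minimal R sketch (not the author's code) of how data like those in Table 17.1 could be simulated from model (17.2) and how the linear and quadratic fits could be obtained with lm(); the seed and the assumed range of x are arbitrary.

##########
set.seed(501)                          # arbitrary seed, for reproducibility only
n <- 50
x <- runif(n, min = 5, max = 15)       # assumed predictor range, roughly as in Table 17.1
y <- 5 + 12 * x - 3 * x^2 + rnorm(n, mean = 0, sd = sqrt(2))   # model (17.2)

fit.lin  <- lm(y ~ x)                  # simple linear fit (clearly inadequate here)
fit.quad <- lm(y ~ x + I(x^2))         # quadratic fit; I() protects the squared term
summary(fit.quad)

plot(x, y)                             # scatterplot in the spirit of Figure 17.1(a)
abline(fit.lin)
##########

The design matrix that lm() builds for the quadratic fit has exactly the columns of the X matrix displayed above (an intercept column, x, and x squared).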

Figure 17.1: (a) Scatterplot of the quadratic data with the OLS line. (b)
Residual plot for the OLS fit. (c) Histogram of the residuals. (d) NPP for
the Studentized residuals.

i xi yi i xi yi i xi yi
1 6.6 -45.4 21 8.4 -106.5 41 8 -95.8
2 10.1 -176.6 22 7.2 -63 42 8.9 -126.2
3 8.9 -127.1 23 13.2 -362.2 43 10.1 -179.5
4 6 -31.1 24 7.1 -61 44 11.5 -252.6
5 13.3 -366.6 25 10.4 -194 45 12.9 -338.5
6 6.9 -53.3 26 10.8 -216.4 46 8.1 -97.3
7 9 -131.1 27 11.9 -278.1 47 14.9 -480.5
8 12.6 -320.9 28 9.7 -162.7 48 13.7 -393.6
9 10.6 -204.8 29 5.4 -21.3 49 7.8 -87.6
10 10.3 -189.2 30 12.1 -284.8 50 8.5 -105.4
11 14.1 -421.2 31 12.1 -287.5
12 8.6 -113.1 32 12.1 -290.8
13 14.9 -482.3 33 9.2 -137.4
14 6.5 -42.9 34 6.7 -47.7
15 9.3 -144.8 35 12.1 -292.3
16 5.2 -14.2 36 13.2 -356.4
17 10.7 -211.3 37 11 -228.5
18 7.5 -75.4 38 13.1 -354.4
19 14.9 -482.7 39 9.2 -137.2
20 12.2 -295.6 40 13.2 -361.6

Table 17.1: The simulated 2-degree polynomial data set with n = 50 values.

Some general guidelines to keep in mind when estimating a polynomial


regression model are:
• The fitted model is more reliable when it is built on a larger sample
size n.
• Do not extrapolate beyond the limits of your observed values.
• Consider how large the size of the predictor(s) will be when incorpo-
rating higher degree terms as this may cause overflow.
• Do not go strictly by low p-values to incorporate a higher degree term,
but rather just use these to support your model only if the plot looks
reasonable. This is sort of a situation where you need to determine
“practical significance” versus “statistical significance”.


• In general, you should obey the hierarchy principle, which says that
if your model includes X h and X h is shown to be a statistically signif-
icant predictor of Y , then your model should also include each X j for
all j < h, whether or not the coefficients for these lower-order terms
are significant.

17.2 Response Surface Regression


A response surface model (RSM) is a method for determining a surface
predictive model based on one or more variables. In the context of RSMs, the
variables are often called factors, so to keep consistent with the correspond-
ing methodology, we will utilize that term for this section. RSM methods are
usually discussed in a Design of Experiments course, but there is a relevant
regression component. Specifically, response surface regression is fitting a
polynomial regression with a certain structure of the predictors.
Many industrial experiments are conducted to discover which values of
given factor variables optimize a response. If each factor is measured at
three or more values, then a quadratic response surface can be estimated by
ordinary least squares regression. The predicted optimal value can be found
from the estimated surface if the surface is shaped like a hill or valley. If
the estimated surface is more complicated or if the optimum is far from the
region of the experiment, then the shape of the surface can be analyzed to
indicate the directions in which future experiments should be performed.
In polynomial regression, the predictors are often continuous with a large
number of different values. In response surface regression, the factors (of
which there are k) typically represent a quantitative measure where their
factor levels (of which there are p) are equally spaced and established at
the design stage of the experiment. This is what we call a $p^k$ factorial
design because the analysis will involve all of the $p^k$ different treatment
combinations. Our goal is to find a polynomial approximation that works
well in a specified region of the predictor space. As an example, we may
be performing an experiment with k = 2 factors where one of the factors
is a certain chemical concentration in a mixture. The factor levels for the
chemical concentration are 10%, 20%, and 30% (so p = 3). The factors are
then coded in the following way:

\[
X^*_{i,j} = \frac{X_{i,j} - [\max_i(X_{i,j}) + \min_i(X_{i,j})]/2}{[\max_i(X_{i,j}) - \min_i(X_{i,j})]/2},
\]


where i = 1, . . . , n indexes the sample and j = 1, . . . , k indexes the factor.


For our example (assuming we label the chemical concentration factor as
“1”) we would have
\[
X^*_{i,1} = \begin{cases}
\dfrac{10 - [30+10]/2}{[30-10]/2} = -1, & \text{if } X_{i,1} = 10\%; \\[1ex]
\dfrac{20 - [30+10]/2}{[30-10]/2} = 0,  & \text{if } X_{i,1} = 20\%; \\[1ex]
\dfrac{30 - [30+10]/2}{[30-10]/2} = +1, & \text{if } X_{i,1} = 30\%.
\end{cases}
\]
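This coding is easy to carry out directly; the following R sketch (an illustration, not part of the original text) codes the three chemical-concentration levels.

##########
conc  <- c(10, 20, 30)                          # factor levels, in percent
coded <- (conc - (max(conc) + min(conc)) / 2) /
         ((max(conc) - min(conc)) / 2)
coded                                           # returns -1, 0, +1
##########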

Some aspects which differentiate a response surface regression model from


the general context of a polynomial regression model include:

• In a response surface regression model, p is usually 2 or 3, and the
number of factor levels is usually the same for each factor. More complex
models can be developed outside of these constraints, but such a discussion
is better dealt with in a Design of Experiments course.

• The factors are treated as categorical variables. Therefore, the X ma-


trix will have a noticeable pattern based on the way the experiment
was designed. Furthermore, the X matrix is often called the design
matrix in response surface regression.

• The number of factor levels must be at least as large as the number of


factors (p ≥ k).

• If examining a response surface with interaction terms, then the model


must obey the hierarchy principle (this is not required of general poly-
nomial models, although it is usually recommended).

• The number of factor levels must be greater than the order of the model
(i.e., p > h).

• The number of observations (n) must be greater than the number of


terms in the model (including all higher-order terms and interactions).

– It is desirable to have a larger n. A rule of thumb is to have at


least 5 observations per term in the model.



Figure 17.2: (a) The points of a square portion of a design with factor levels
coded at ±1. This is how a 22 factorial design is coded. (b) Illustration of
the axial (or star) points of a design at (+a,0), (-a,0), (0,-a), and (0,+a). (c)
A diagram which shows the combination of the previous two diagrams with
the design center at (0,0). This final diagram is how a composite design is
coded.

• Typically response surface regression models only have two-way inter-


actions while polynomial regression models can (in theory) have k-way
interactions.

• The response surface regression models we outlined are for a factorial
design. Figure 17.2 shows how a factorial design can be diagrammed as
a square using factorial points. More elaborate designs can be constructed,
such as a central composite design, which takes into consideration
axial (or star) points (also illustrated in Figure 17.2). Figure 17.2 pertains to
a design with two factors while Figure 17.3 pertains to a design with three
factors; a small coding sketch follows this list.
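As a rough illustration of how such coded design points can be constructed in R (the axial distance a = sqrt(2) is an assumed but common choice for two factors):

##########
factorial.pts <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))  # 2^2 factorial points, panel (a)
a <- sqrt(2)                                                # assumed axial distance
axial.pts <- data.frame(x1 = c(-a, a, 0, 0),
                        x2 = c(0, 0, -a, a))                # axial (star) points, panel (b)
center.pt <- data.frame(x1 = 0, x2 = 0)                     # design center
ccd.pts   <- rbind(factorial.pts, axial.pts, center.pt)     # central composite design, panel (c)
ccd.pts
##########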

We mentioned that response surface regression follows the hierarchy prin-


ciple. However, some texts and software do report ANOVA tables which do
not quite follow the hierarchy principle. While fundamentally there is noth-
ing wrong with these tables, it really boils down to a matter of terminology.
If the hierarchy principle is not in place, then technically you are just per-
forming a polynomial regression.
Table 17.2 gives a list of all possible terms when assuming an hth -order
response surface model with k factors. For any interaction that appears in



Figure 17.3: (a) The points of a cube portion of a design with factor levels
coded at the corners of the cube. This is how a 23 factorial design is coded.
(b) Illustration of the axial (or star) points of this design. (c) A diagram
which shows the combination of the previous two diagrams with the design
center at (0,0). This final diagram is how a composite design is coded.

the model (e.g., $X_i^{h_1} X_j^{h_2}$ such that $h_2 \le h_1$), the hierarchy principle
says that at least the main factor effects for powers $1, \ldots, h_1$ must appear in the
model, that all $h_1$-order interactions with the factor powers $1, \ldots, h_2$ must
appear in the model, and that all interactions of order less than $h_1$ must appear
in the model. Luckily, response surface regression models (and polynomial
models for that matter) rarely go beyond $h = 3$.
For the next step, an ANOVA table is usually constructed to assess the
significance of the model. Since the factor levels are all essentially treated as
categorical variables, the designed experiment will usually result in replicates
for certain factor level combinations. This is unlike multiple regression where
the predictors are usually assumed to be continuous and no predictor level
combinations are assumed to be replicated. Thus, a formal lack of fit test
is also usually incorporated. Furthermore, the SSR is also broken down
into the components making up the full model, so you can formally test the
contribution of those components to the fit of your model.
An example of a response surface regression ANOVA is given in Table
17.3. Since it is not possible to compactly show a generic ANOVA table nor
to compactly express the formulas, this example is for a quadratic model
with linear interaction terms. The formulas will be similar to their respec-
tive quantities defined earlier. For this example, assume that there are k


Effect                     Relevant Terms
Main Factor                $X_i, X_i^2, X_i^3, \ldots, X_i^h$ for all $i$
Linear Interaction         $X_i X_j$ for all $i < j$
Quadratic Interaction      $X_i^2 X_j$ for $i \neq j$, and $X_i^2 X_j^2$ for all $i < j$
Cubic Interaction          $X_i^3 X_j$, $X_i^3 X_j^2$ for $i \neq j$, and $X_i^3 X_j^3$ for all $i < j$
$\vdots$                   $\vdots$
$h$th-order Interaction    $X_i^h X_j, X_i^h X_j^2, X_i^h X_j^3, \ldots, X_i^h X_j^{h-1}$ for $i \neq j$, and $X_i^h X_j^h$ for all $i < j$

Table 17.2: A table showing all of the terms that could be included in a
response surface regression model. In the above, the indices for the factor
are given by i = 1, . . . , k and j = 1, . . . , k.

factors, n observations, m unique levels of the factor level combinations, and


q total regression parameters are needed for the full model. In Table 17.3,
the following partial sums of squares are used to compose the SSR value:
• The sum of squares due to the linear component is
\[
\mathrm{SSLIN} = \mathrm{SSR}(X_1, X_2, \ldots, X_k).
\]

• The sum of squares due to the quadratic component is
\[
\mathrm{SSQUAD} = \mathrm{SSR}(X_1^2, X_2^2, \ldots, X_k^2 \mid X_1, X_2, \ldots, X_k).
\]

• The sum of squares due to the linear interaction component is
\[
\mathrm{SSINT} = \mathrm{SSR}(X_1X_2, \ldots, X_1X_k, X_2X_3, \ldots, X_{k-1}X_k \mid X_1, X_2, \ldots, X_k, X_1^2, X_2^2, \ldots, X_k^2).
\]

Other analysis techniques are commonly employed in response surface


regression. For example, canonical analysis (which is a multivariate analy-
sis tool) uses the eigenvalues and eigenvectors in the matrix of second-order
parameters to characterize the shape of the response surface (e.g., is the sur-
face flat or have some noticeable shape like a hill or a valley). There is also


Source df SS MS F
Regression q−1 SSR MSR MSR/MSE
Linear k SSLIN MSLIN MSLIN/MSE
Quadratic k SSQUAD MSQUAD MSQUAD/MSE
Interaction q − 2k − 1 SSINT MSINT MSINT/MSE
Error n−q SSE MSE
Lack of Fit m−q SSLOF MSLOF MSLOF/MSPE
Pure Error n−m SSPE MSPE
Total n−1 SSTO
Table 17.3: ANOVA table for a response surface regression model with linear,
quadratic, and linear interaction terms.

ridge analysis, which computes the estimated ridge of optimum response


for increasing radii from the center of the original design. Since the context
of these techniques is better suited for a Design of Experiments course, we
will not develop their details here.

17.3 Examples
Example 1: Yield Data Set
This data set of size n = 15 contains measurements of yield from an exper-
iment done at five different temperature levels. The variables are y = yield
and x = temperature in degrees Fahrenheit. Table 17.4 gives the data used
for this analysis. Figure 17.4 gives a scatterplot of the raw data and then
another scatterplot with lines pertaining to a linear fit and a quadratic fit
overlaid. The trend of these data is clearly better suited to a quadratic
fit.
Here we have the linear fit results:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.306306 0.469075 4.917 0.000282 ***
temp 0.006757 0.005873 1.151 0.270641
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


i Temperature Yield
1 50 3.3
2 50 2.8
3 50 2.9
4 70 2.3
5 70 2.6
6 70 2.1
7 80 2.5
8 80 2.9
9 80 2.4
10 90 3.0
11 90 3.1
12 90 2.8
13 100 3.3
14 100 3.5
15 100 3.0

Table 17.4: The yield measurements data set pertaining to n = 15 observations.

Residual standard error: 0.3913 on 13 degrees of freedom


Multiple R-Squared: 0.09242, Adjusted R-squared: 0.0226
F-statistic: 1.324 on 1 and 13 DF, p-value: 0.2706
##########

Here we have the quadratic fit results:

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9604811 1.2589183 6.323 3.81e-05 ***
temp -0.1537113 0.0349408 -4.399 0.000867 ***
temp2 0.0010756 0.0002329 4.618 0.000592 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.2444 on 12 degrees of freedom



Figure 17.4: The yield data set with (a) a linear fit and (b) a quadratic fit.

Multiple R-Squared: 0.6732, Adjusted R-squared: 0.6187


F-statistic: 12.36 on 2 and 12 DF, p-value: 0.001218
##########
We see that both temperature and temperature squared are significant predictors
in the quadratic model (with p-values of 0.0009 and 0.0006, respectively)
and that the fit is much better than the linear fit. From this output,
we see the estimated regression equation is $\hat{y}_i = 7.96050 - 0.15371x_i + 0.00108x_i^2$.
Furthermore, the ANOVA table below shows that the model we
fit is statistically significant at the 0.05 significance level with a p-value of
0.0012. Thus, our model should include a quadratic term.
##########
Analysis of Variance Table

Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
Regression 2 1.47656 0.73828 12.36 0.001218 **
Residuals 12 0.71677 0.05973
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##########
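For reference, output of this form can be produced with calls along the following lines (a sketch only; the data frame name yield.data and the variable names are assumptions):

##########
yield.lin  <- lm(yield ~ temp, data = yield.data)             # linear fit
yield.quad <- lm(yield ~ temp + I(temp^2), data = yield.data) # quadratic fit
summary(yield.lin)
summary(yield.quad)
anova(yield.quad)   # note: R's default table lists the terms sequentially rather
                    # than pooling them into a single "Regression" line as above
##########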


Example 2: Odor Data Set


An experiment is designed to relate three variables (temperature, ratio, and
height) to a measure of odor in a chemical process. Each variable has three
levels, but the design was not constructed as a full factorial design (i.e., it is
not a 33 design). Nonetheless, we can still analyze the data using a response
surface regression routine. The data obtained was already coded and can be
found in Table 17.5.

Odor Temperature Ratio Height


66 -1 -1 0
58 -1 0 -1
65 0 -1 -1
-31 0 0 0
39 1 -1 0
17 1 0 -1
7 0 1 -1
-35 0 0 0
43 -1 1 0
-5 -1 0 1
43 0 -1 1
-26 0 0 0
49 1 1 0
-40 1 0 1
-22 0 1 1

Table 17.5: The odor data set measurements with the factor levels already
coded.

First we will fit a response surface regression model consisting of all of the
first-order and second-order terms. The summary of this fit is given below:

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.667 10.840 -2.829 0.0222 *
temp -12.125 6.638 -1.827 0.1052
ratio -17.000 6.638 -2.561 0.0336 *


height -21.375 6.638 -3.220 0.0122 *


temp2 32.083 9.771 3.284 0.0111 *
ratio2 47.833 9.771 4.896 0.0012 **
height2 6.083 9.771 0.623 0.5509
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 18.77 on 8 degrees of freedom


Multiple R-Squared: 0.8683, Adjusted R-squared: 0.7695
F-statistic: 8.789 on 6 and 8 DF, p-value: 0.003616
##########

As you can see, the square of height is the least statistically significant, so
we will drop that term and rerun the analysis. The summary of this new fit
is given below:

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.923 8.707 -3.092 0.012884 *
temp -12.125 6.408 -1.892 0.091024 .
ratio -17.000 6.408 -2.653 0.026350 *
height -21.375 6.408 -3.336 0.008720 **
temp2 31.615 9.404 3.362 0.008366 **
ratio2 47.365 9.404 5.036 0.000703 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 18.12 on 9 degrees of freedom


Multiple R-Squared: 0.8619, Adjusted R-squared: 0.7852
F-statistic: 11.23 on 5 and 9 DF, p-value: 0.001169
##########

By omitting the square of height, the temperature main effect has now be-
come marginally significant. Note that the square of temperature is statisti-
cally significant. Since we are building a response surface regression model,
we must obey the hierarchy principle. Therefore temperature will be retained
in the model.
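A sketch of the corresponding R calls (assuming a data frame named odor.data holding the coded variables of Table 17.5):

##########
rsm.full <- lm(odor ~ temp + ratio + height +
                 I(temp^2) + I(ratio^2) + I(height^2), data = odor.data)
summary(rsm.full)
rsm.red <- update(rsm.full, . ~ . - I(height^2))   # drop the square of height
summary(rsm.red)                                   # temp is kept, per the hierarchy principle
##########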


Finally, contour and surface plots can also be generated for the response
surface regression model. Figure 17.5 gives the contour plots (with odor as
the contours) for each of the three levels of height (Figure 17.6 gives color
versions of the plots). Notice how the contours are increasing as we go out to
the corner points of the design space (so it is as if we are looking down into
a cone). The surface plots of Figure 17.7 all look similar (with the exception
of the temperature scale), but notice the curvature present in these plots.



Figure 17.5: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.



Figure 17.6: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.



Figure 17.7: The surface plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.



Chapter 18

Biased Regression Methods


and Regression Shrinkage

Recall earlier that we dealt with multicollinearity (i.e., a near-linear relation-


ship amongst some of the predictors) by centering the variables in order to
reduce the variance inflation factors (which reduces the linear dependency).
When multicollinearity occurs, the ordinary least squares estimates are still
unbiased, but the variances are very large. However, we can add a degree
of bias to the estimation process, thus reducing the variance (and standard
errors). This concept is known as the “bias-variance tradeoff” due to the
functional relationship between the two values. We proceed to discuss some
popular methods for producing biased regression estimates when faced with
a high degree of multicollinearity.
The assumptions made for these methods are mostly the same as in the
multiple linear regression model. Namely, we assume linearity, constant vari-
ance, and independence. Any apparent violation of these assumptions must
be dealt with first. However, these methods do not yield statistical intervals
due to uncertainty in the distributional assumption, so normality of the data
is not assumed.
One additional note is that the procedures in this section are often referred
to as “shrinkage methods”. They are called shrinkage methods because, as
we will see, the regression estimates we obtain cover a smaller range than
those from ordinary least squares.


18.1 Ridge Regression


Perhaps the most popular (albeit controversial) and widely studied biased
regression technique to deal with multicollinearity is ridge regression. Be-
fore we get into the computational side of ridge regression, let us recall from
the last course how to perform a correlation transformation (and the corre-
sponding notation) which is performed by standardizing the variables.
The standardized X matrix is given as:
\[
\mathbf{X}^* = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{X_{1,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{1,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{1,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\frac{X_{2,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{2,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{2,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{X_{n,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{n,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{n,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}}
\end{pmatrix},
\]
which is an $n \times (p-1)$ matrix, and the standardized Y vector is given as:
\[
\mathbf{Y}^* = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{Y_1-\bar{Y}}{s_Y} \\ \frac{Y_2-\bar{Y}}{s_Y} \\ \vdots \\ \frac{Y_n-\bar{Y}}{s_Y}
\end{pmatrix},
\]
which is still an $n$-dimensional vector. Here,
\[
s_{X_j} = \sqrt{\frac{\sum_{i=1}^{n}(X_{i,j}-\bar{X}_j)^2}{n-1}}
\]
for $j = 1, 2, \ldots, (p-1)$ and
\[
s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}.
\]

Remember that we have removed the column of 1’s in forming X∗ , effectively


reducing the column dimension of the original X matrix by 1. Because of
this, we no longer can estimate an intercept term (b0 ), which may be an
important part of the analysis. When using the standardized variables, the
regression model of interest becomes:

\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*,


where $\boldsymbol{\beta}^*$ is now a $(p-1)$-dimensional vector of standardized regression coefficients
and $\boldsymbol{\epsilon}^*$ is an $n$-dimensional vector of errors pertaining to this standardized
model. Thus, the ordinary least squares estimates are
\[
\hat{\boldsymbol{\beta}}^* = (\mathbf{X}^{*T}\mathbf{X}^*)^{-1}\mathbf{X}^{*T}\mathbf{Y}^*
= \mathbf{r}_{XX}^{-1}\mathbf{r}_{XY},
\]
where $\mathbf{r}_{XX}$ is the $(p-1)\times(p-1)$ correlation matrix of the predictors and
$\mathbf{r}_{XY}$ is the $(p-1)$-dimensional vector of correlation coefficients between the
predictors and the response. Thus $\hat{\boldsymbol{\beta}}^*$ is a function of correlations, and hence
we have performed a correlation transformation.
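As a quick sketch (assuming a numeric predictor matrix X, with no column of 1's, and a response vector y), the correlation transformation and the identity above can be checked in R:

##########
n      <- nrow(X)
X.star <- scale(X) / sqrt(n - 1)     # standardized predictors
y.star <- scale(y) / sqrt(n - 1)     # standardized response
b.star <- solve(cor(X), cor(X, y))   # r_XX^{-1} r_XY
## lm(y.star ~ X.star - 1) returns the same coefficients (no intercept is fit)
##########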
Notice further that
\[
E[\hat{\boldsymbol{\beta}}^*] = \boldsymbol{\beta}^*
\]
and
\[
V[\hat{\boldsymbol{\beta}}^*] = \sigma^2\mathbf{r}_{XX}^{-1} = \mathbf{r}_{XX}^{-1}.
\]

For the variance-covariance matrix, σ 2 = 1 because we have standardized all


of the variables.
Ridge regression adds a small value k (called a biasing constant) to
the diagonal elements of the correlation matrix. (Recall that a correlation
matrix has 1’s down the diagonal, so it can sort of be thought of as a “ridge”.)
Mathematically, we have

\[
\tilde{\boldsymbol{\beta}} = (\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XY},
\]

where 0 < k < 1, but usually less than 0.3. The amount of bias in this
estimator is given by

\[
E[\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}^*] = [(\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XX} - \mathbf{I}_{(p-1)\times(p-1)}]\boldsymbol{\beta}^*,
\]

and the variance-covariance matrix is given by

\[
V[\tilde{\boldsymbol{\beta}}] = (\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XX}(\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}.
\]

Remember that β̃ is calculated on the standardized variables (sometimes


called the “standardized” ridge regression estimates). We can transform


back to the original scale (sometimes these are called the ridge regression
estimates) by
\[
\tilde{\beta}_j^{\dagger} = \tilde{\beta}_j\left(\frac{s_Y}{s_{X_j}}\right), \qquad
\tilde{\beta}_0^{\dagger} = \bar{y} - \sum_{j=1}^{p-1}\tilde{\beta}_j^{\dagger}\bar{x}_j,
\]

where j = 1, 2, . . . , p − 1.
How do we choose k? Many methods exist, but there is no agreement
on which to use, mainly due to instability in the estimates asymptotically.
Two methods are primarily used: one graphical and one analytical. The first
method is called the fixed point method and uses the estimates provided
by fitting the correlation transformation via ordinary least squares. This
method suggests using
\[
k = \frac{(p-1)\,\mathrm{MSE}^*}{\hat{\boldsymbol{\beta}}^{*T}\hat{\boldsymbol{\beta}}^*},
\]
where MSE∗ is the mean square error obtained from the respective fit.
Another method is the Hoerl-Kennard iterative method. This method
calculates
\[
k^{(t)} = \frac{(p-1)\,\mathrm{MSE}^*}{\tilde{\boldsymbol{\beta}}_{k^{(t-1)}}^{T}\tilde{\boldsymbol{\beta}}_{k^{(t-1)}}},
\]
where t = 1, 2, . . .. Here, β̃ k(t−1) pertains to the ridge regression estimates
obtained when the biasing constant is k (t−1) . This process is repeated until
the difference between two successive estimates of k is negligible. The starting
value for this method (k (0) ) is chosen to be the value of k calculated using
the fixed point method.
Perhaps the most common method is a graphical method. The ridge
trace is a plot of the estimated ridge regression coefficients versus k. The
value of k is picked where the regression coefficients appear to have stabilized.
The smallest value of k is chosen as it introduces the smallest amount of bias.
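A brief sketch of a ridge trace in R using MASS::lm.ridge (the data frame name, the formula, and the grid of k values are illustrative assumptions):

##########
library(MASS)
ridge.fits <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 0.1, by = 0.001))
plot(ridge.fits)     # ridge trace: standardized coefficients versus the biasing constant
select(ridge.fits)   # prints automatic suggestions (e.g., the HKB estimate) for k
##########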
There are criticisms regarding ridge regression. One major criticism is
that ordinary inference procedures are not available since exact distribu-
tional properties of the ridge estimator are not known. Another criticism
is in the subjective choice of k. While we mentioned a few of the methods
here, there are numerous methods found in the literature, each with their


own limitations. On the flip-side of these arguments lie some potential ben-
efits of ridge regression. For example, it can accomplish what it sets out to
do, and that is reduce multicollinearity. Also, occasionally ridge regression
can provide an estimate of the mean response which is good for new values
that lie outside the range of our observations (called extrapolation). The
mean response found by ordinary least squares is known to not be good for
extrapolation.

18.2 Principal Components Regression


The method of principal components regression transforms the predictor
variables to their principal components. Principal components of X∗T X∗ are
extracted using the singular value decomposition (SVD) method, which
says there exist matrices $\mathbf{U}_{n\times(p-1)}$ and $\mathbf{P}_{(p-1)\times(p-1)}$ with orthonormal columns (i.e.,
$\mathbf{U}^T\mathbf{U} = \mathbf{P}^T\mathbf{P} = \mathbf{I}_{(p-1)\times(p-1)}$) such that
\[
\mathbf{X}^* = \mathbf{U}\mathbf{D}\mathbf{P}^T.
\]
P is called the (factor) loadings matrix while the (principal compo-
nent) scores matrix is defined as
Z = UD,
such that
ZT Z = Λ.
Here, Λ is a (p − 1) × (p − 1) diagonal matrix consisting of the nonzero
eigenvalues of X∗T X∗ on the diagonal (for simplicity, we assume that the
eigenvalues are in decreasing order down the diagonal: λ1 ≥ λ2 ≥ . . . ≥
λp−1 > 0). Notice that Z = X∗ P, which implies that each entry of the Z
matrix is a linear combination of the entries of the corresponding column of
the X∗ matrix. This is because the goal of principal components is to only
keep those linear combinations which help explain a larger amount of the
variation (as determined by using the eigenvalues described below).
Next, we regress $\mathbf{Y}^*$ on $\mathbf{Z}$. The model is
\[
\mathbf{Y}^* = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\epsilon}^*,
\]
which has the least squares solution
\[
\hat{\boldsymbol{\beta}}_Z = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}^*.
\]


Severe multicollinearity is identified by very small eigenvalues. Multicollinear-


ity is corrected by omitting those components which have small eigenvalues.
Since the ith entry of β̂ Z corresponds to the ith component, simply set those
entries of β̂ Z to 0 which have correspondingly small eigenvalues. For exam-
ple, suppose you have 10 predictors (and hence 10 principal components).
You find that the last three eigenvalues are relatively small and decide to
omit these three components. Therefore, you set the last three entries of β̂ Z
equal to 0.
With this value of $\hat{\boldsymbol{\beta}}_Z$, we can transform back to get the coefficients on the
$\mathbf{X}^*$ scale by
\[
\hat{\boldsymbol{\beta}}_{PC} = \mathbf{P}\hat{\boldsymbol{\beta}}_Z.
\]
This is a solution to
\[
\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*.
\]

Notice that we have not reduced the dimension of β̂ Z from the original cal-
culation, but we have only set certain values equal to 0. Furthermore, as in
ridge regression, we can transform back to the original scale by
\[
\hat{\beta}_{PC,j}^{\dagger} = \hat{\beta}_{PC,j}\left(\frac{s_Y}{s_{X_j}}\right), \qquad
\hat{\beta}_{PC,0}^{\dagger} = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_{PC,j}^{\dagger}\bar{x}_j,
\]

where j = 1, 2, . . . , p − 1.
How do you choose the number of eigenvalues to omit? This can be
accomplished by looking at the cumulative percent variation explained by
each of the $(p-1)$ components. For the $j$th component, this percentage is
\[
\frac{\sum_{i=1}^{j}\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_{p-1}} \times 100,
\]

where j = 1, 2, . . . , p − 1 (remember, the eigenvalues are in decreasing order).


A common rule of thumb is that once you reach a component that explains
roughly 80% − 90% of the variation, then you can omit the remaining com-
ponents.
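A hand-rolled sketch of these steps in R, continuing with the X.star and y.star objects from the earlier sketch and using a 90% cutoff purely for illustration:

##########
sv     <- svd(X.star)                       # X* = U D P^T
Z      <- sv$u %*% diag(sv$d)               # scores Z = UD
bZ     <- solve(crossprod(Z), crossprod(Z, y.star))   # regress Y* on Z
lambda <- sv$d^2                            # eigenvalues of X*'X* (in decreasing order)
cumpct <- cumsum(lambda) / sum(lambda)      # cumulative proportion of variation
keep   <- which(cumpct >= 0.9)[1]           # first component reaching the cutoff
if (keep < length(bZ)) bZ[(keep + 1):length(bZ)] <- 0  # omit the remaining components
b.pc   <- sv$v %*% bZ                       # back to the X* scale: beta_PC = P beta_Z
##########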


18.3 Partial Least Squares


We next look at a procedure that is very similar to principal components
regression. Here, we will attempt to construct the Z matrix from the last
section in a different manner such that we are still interested in models of
the form
\[
\mathbf{Y}^* = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\epsilon}^*.
\]
Notice that in principal components regression the construction of the
linear combinations in $\mathbf{Z}$ does not rely whatsoever on the response $\mathbf{Y}^*$. Yet
we use the estimate $\hat{\boldsymbol{\beta}}_Z$ (from regressing $\mathbf{Y}^*$ on $\mathbf{Z}$) to help us build our final
estimate. The method of partial least squares allows us to choose the
linear combinations in $\mathbf{Z}$ so that they predict $\mathbf{Y}^*$ as well as possible. We
proceed to describe a common way to estimate with partial least squares.
First, define
\[
\mathbf{S}\mathbf{S}^T = \mathbf{X}^{*T}\mathbf{Y}^*\mathbf{Y}^{*T}\mathbf{X}^*.
\]
We construct the score vectors (i.e., the columns of $\mathbf{Z}$) as
\[
\mathbf{z}_i = \mathbf{X}^*\mathbf{r}_i,
\]
for $i = 1, \ldots, p-1$. The challenge becomes finding the $\mathbf{r}_i$ vectors. $\mathbf{r}_1$ is just
the first eigenvector of $\mathbf{S}\mathbf{S}^T$, and $\mathbf{r}_i$ for $i = 2, \ldots, p-1$ maximizes
\[
\mathbf{r}_i^T\mathbf{S}\mathbf{S}^T\mathbf{r}_i,
\]
subject to the constraint
\[
\mathbf{r}_{i-1}^T\mathbf{X}^{*T}\mathbf{X}^*\mathbf{r}_i = \mathbf{z}_{i-1}^T\mathbf{z}_i = 0.
\]

Next, we regress Y∗ on Z, which has the least squares solution

β̂ Z = (ZT Z)−1 ZT Y∗ .

As in principal components regression, we can transform back to get the


coefficients on the X∗ scale by

\[
\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{R}\hat{\boldsymbol{\beta}}_Z,
\]
which is a solution to
\[
\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*.
\]


In the above, R is the matrix where the ith column is ri . Furthermore, as in


both ridge regression and principal components regression, we can transform
back to the original scale by
\[
\hat{\beta}_{PLS,j}^{\dagger} = \hat{\beta}_{PLS,j}\left(\frac{s_Y}{s_{X_j}}\right), \qquad
\hat{\beta}_{PLS,0}^{\dagger} = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_{PLS,j}^{\dagger}\bar{x}_j,
\]

where j = 1, 2, . . . , p − 1.
The method described above is sometimes referred to as the SIMPLS
method. Another method commonly used is nonlinear iterative partial
least squares (NIPALS). NIPALS is more commonly used when you have
a vector of responses. While we do not discuss the differences between these
algorithms any further, we do discuss later the setting where we have a vector
of responses.
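For completeness, a sketch using the pls package's SIMPLS implementation (the data frame name, formula, and the choice of two components are illustrative assumptions):

##########
library(pls)
pls.fit <- plsr(y ~ ., data = dat, method = "simpls", validation = "CV")
summary(pls.fit)            # variance explained for each number of components
coef(pls.fit, ncomp = 2)    # coefficients using, say, two components
##########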

18.4 Inverse Regression


In simple linear regression, we introduced calibration intervals which are a
type of statistical interval for a predictor value given a value for the re-
sponse. An inverse regression technique is essentially what is performed to
find the calibration intervals (i.e., regress the predictor on the response), but
calibration intervals do not extend easily to the multiple regression setting.
However, we can still extend the notion of inverse regression when dealing
with p − 1 predictors.
Let $\mathbf{X}_i$ be a $p$-dimensional vector (with first entry equal to 1 for an intercept,
so that we actually have $p-1$ predictors) such that
\[
\mathbf{X} = \begin{pmatrix} \mathbf{X}_1^T \\ \vdots \\ \mathbf{X}_n^T \end{pmatrix}.
\]

However, assume that p is actually quite large with respect to n. Inverse
regression can then be used as a tool for dimension reduction (i.e., reducing
p), which reveals to us the most important aspects (or directions)


of the data.1 The tool commonly used is called Sliced Inverse Regres-
sion (or SIR). SIR uses the inverse regression curve E(X|Y = y), which
falls into a reduced dimension space under certain conditions. SIR uses this
curve to perform a weighted principal components analysis such that one can
determine an effective subset of the predictors. The reason for reducing the
dimensions of the predictors is because of the curse of dimensionality,
which means that drawing inferences on the same number of data points in a
higher dimensional space becomes difficult due to the sparsity of the data in
the volume of the higher dimensional space compared to the volume of the
lower dimensional space.2
When working with the classic linear regression model

Y = Xβ + 

or a more general regression model

Y = f (Xβ) + 

for some real-valued function f , we know that the distribution of Y |X de-


pends on X only through the p-dimensional variable β = (β0 , β1 , . . . , βp−1 )T .
Dimension reduction claims that the distribution of Y |X depends on X only
through the k-dimensional variable β ∗ = (β1∗ , . . . , βk∗ )T such that k < p. This
new vector β ∗ is called the effective dimension reduction direction (or
EDR-direction).
The inverse regression curve is computed by looking for E(X|Y = y),
which is a curve in Rp , but consisting of p one-dimensional regressions (as
opposed to one p-dimensional surface in standard regression). The center of
the inverse regression curve is located at E(E(X|Y = y)) = E(X). Therefore,
the centered inverse regression curve is

m(y) = E(X|Y = y) − E(X),


¹ When we have p large with respect to n, we use the terminology dimension reduction.
However, when we are more concerned about which predictors are significant or which
functional form is appropriate for our regression model (and the size of p is not too much
of an issue), then we use the model selection terminology.
² As an example, consider 100 points on the unit interval [0,1], then imagine 100 points
on the unit square [0, 1]×[0, 1], then imagine 100 points on the unit cube [0, 1]×[0, 1]×[0, 1],
and so on. As the dimension increases, the sparsity of the data makes it more difficult to
make any relevant inferences about the data.


which is a p-dimensional curve in Rp . Next, the “slice” part of SIR comes


from estimating m(y) by dividing the range of Y into H non-overlapping
intervals (or slices), which are then used to compute the sample means, m̂h ,
of each slice. These sample means are a crude estimate of m(y).
With the basics of the inverse regression model in place, we can introduce
an algorithm often used to estimate the EDR-direction vector for SIR:
1. Let $\boldsymbol{\Sigma}_X$ be the variance-covariance matrix of $\mathbf{X}$. Using the standardized
X matrix (i.e., the matrix defined earlier in this chapter as $\mathbf{X}^*$), we can
rewrite the classic regression model as
\[
\mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\eta} + \boldsymbol{\epsilon}^*
\]
or the more general regression model as
\[
\mathbf{Y}^* = f(\mathbf{X}^*\boldsymbol{\eta}) + \boldsymbol{\epsilon}^*,
\]
where $\boldsymbol{\eta} = \boldsymbol{\Sigma}_X^{1/2}\boldsymbol{\beta}$.

2. Divide the range of $y_1, \ldots, y_n$ into $H$ non-overlapping slices (using the
index $h = 1, \ldots, H$). Let $n_h$ be the number of observations within each
slice and $I_h\{\cdot\}$ be the indicator function for this slice, so that
\[
n_h = \sum_{i=1}^{n} I_h\{y_i\}.
\]

3. Compute
\[
\hat{m}_h = n_h^{-1}\sum_{i=1}^{n} \mathbf{x}_i^* I_h\{y_i\},
\]
which are the means of the $H$ slices.

4. Calculate the estimate of $\mathrm{Cov}(m(y))$ by
\[
\hat{\mathbf{V}} = n^{-1}\sum_{h=1}^{H} n_h \hat{m}_h \hat{m}_h^T.
\]

5. Identify the $k$ largest eigenvalues $\hat{\lambda}_i$ and eigenvectors $\mathbf{r}_i$ of $\hat{\mathbf{V}}$. Construct
the score vectors $\mathbf{z}_i = \mathbf{X}^*\mathbf{r}_i$ as in partial least squares, which are the
columns of $\mathbf{Z}$. Then
\[
\hat{\boldsymbol{\eta}} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}^*
\]
is the standardized EDR-direction vector.


6. Transform the standardized EDR-direction vector back to the original
scale by
\[
\hat{\boldsymbol{\beta}}^* = \boldsymbol{\Sigma}_X^{-1/2}\hat{\boldsymbol{\eta}}.
\]
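The slicing steps above are simple enough to sketch directly in base R; the function below is only an illustration of the algorithm (the arguments X.star, y, the number of slices H, and the number of directions k follow the description above), not a packaged SIR routine.

##########
sir.sketch <- function(X.star, y, H = 5, k = 1) {
  n      <- nrow(X.star)
  breaks <- quantile(y, probs = seq(0, 1, length.out = H + 1))
  slice  <- cut(y, breaks = breaks, include.lowest = TRUE)        # H non-overlapping slices
  m.hat  <- apply(X.star, 2, function(col) tapply(col, slice, mean))  # H x (p-1) slice means
  n.h    <- as.vector(table(slice))                               # observations per slice
  V.hat  <- t(m.hat) %*% (n.h * m.hat) / n                        # estimate of Cov(m(y))
  r      <- eigen(V.hat)$vectors[, 1:k, drop = FALSE]             # k leading eigenvectors
  Z      <- X.star %*% r                                          # score vectors
  solve(crossprod(Z), crossprod(Z, y))                            # standardized EDR direction
}
##########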

18.5 Regression Shrinkage and Connections


with Variable Selection
Suppose we now wish to find the least squares estimate of the model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$,
but subject to a set of equality constraints $\mathbf{A}\boldsymbol{\beta} = \mathbf{a}$. It can be shown
(by using Lagrange multipliers) that
\[
\hat{\boldsymbol{\beta}}_{CLS} = \hat{\boldsymbol{\beta}}_{OLS} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{A}^T[\mathbf{A}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{A}^T]^{-1}[\mathbf{A}\hat{\boldsymbol{\beta}}_{OLS} - \mathbf{a}],
\]

which is called the constrained least squares estimator. This is helpful


when you wish to restrict β from being estimated in various areas of Rp .
However, you can also have more complicated constraints (e.g., inequality
constraints, quadratic constraints, etc.) in which case more sophisticated
optimization techniques need to be utilized. The constraints are imposed to
restrict the range of β and so any corresponding estimate can be thought
of as a shrinkage estimate as they are covering a smaller range than the
ordinary least squares estimates. Ridge regression is a method providing
shrinkage estimators although they are biased. Oftentimes we hope to shrink
our estimates to 0 by imposing certain constraints, but this may not always
be possible.
A common regression shrinkage procedure is the least absolute shrinkage
and selection operator, or LASSO. LASSO is also concerned with finding
the least squares estimate of $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, but subject to
the inequality constraint $\sum_{j=1}^{p}|\beta_j| \le t$, which is called an $L_1$-penalty
since we are looking at an $L_1$-norm.³ Here, $t \ge 0$ is a tuning parameter
which the user sets to control the amount of shrinkage. If we let $\hat{\boldsymbol{\beta}}$ be the
ordinary least squares estimate and let $t_0 = \sum_{j=1}^{p}|\hat{\beta}_j|$, then values of $t < t_0$
will cause shrinkage of the solution towards 0 and some coefficients may be
exactly 0. Because of this, LASSO also accomplishes model (or subset)
selection, as we can omit those predictors from the model whose coefficients
become exactly 0.
³ Ordinary least squares minimizes the criterion $\sum_{i=1}^{n} e_i^2$, which is based on an $L_2$-norm.


It is important to restate the purposes of LASSO. Not only does it shrink


the regression estimates, but it also provides a way to accomplish subset se-
lection. Furthermore, ridge regression also serves this dual purpose, although
we introduced ridge regression as a way to deal with multicollinearity and
not as a first-line effort for shrinkage. The way subset selection is performed
using ridge regression is by imposing the inequality constraint $\sum_{j=1}^{p}\beta_j^2 \le t$,
which is an $L_2$-penalty. Many competitors to LASSO are available in the
which is an L2 −penalty. Many competitors to LASSO are available in the
literature (such as regularized least absolute deviation and Dantzig selec-
tors), but LASSO is one of the more commonly used methods. It should be
noted that there are numerous efficient algorithms available for estimating
with these procedures, but due to the level of detail necessary, we will not
explore these techniques.
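That said, a LASSO path is easy to obtain in practice; a sketch using the glmnet package (x a numeric predictor matrix and y the response, both assumed):

##########
library(glmnet)
lasso.path <- glmnet(x, y, alpha = 1)   # alpha = 1 requests the L1 (LASSO) penalty
plot(lasso.path)                        # coefficient paths as the penalty is relaxed
cv.fit <- cv.glmnet(x, y, alpha = 1)    # cross-validation to choose the tuning parameter
coef(cv.fit, s = "lambda.min")          # some coefficients come back exactly zero
##########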

18.6 Examples
Example 1: GNP Data
This data set of size n = 16 contains macroeconomic data taken between
the years 1947 and 1962. The economic indicators recorded were the GNP
implicit price deflator (IPD), the GNP, the number of people unemployed,
the number of people in the armed forces, the population, and the number
of people employed. We wish to see if the GNP IPD can be modeled as a
function of the other variables. The data set is given in Table 18.1.
First we run a multiple linear regression procedure to obtain the following
output:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2946.85636 5647.97658 0.522 0.6144
GNP 0.26353 0.10815 2.437 0.0376 *
Unemployed 0.03648 0.03024 1.206 0.2585
Armed.Forces 0.01116 0.01545 0.722 0.4885
Population -1.73703 0.67382 -2.578 0.0298 *
Year -1.41880 2.94460 -0.482 0.6414
Employed 0.23129 1.30394 0.177 0.8631
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Year   GNP IPD   GNP   Unemployed   Armed Forces   Population   Employed


1947 83.0 234.289 235.6 159.0 107.608 60.323
1948 88.5 259.426 232.5 145.6 108.632 61.122
1949 88.2 258.054 368.2 161.6 109.773 60.171
1950 89.5 284.599 335.1 165.0 110.929 61.187
1951 96.2 328.975 209.9 309.9 112.075 63.221
1952 98.1 346.999 193.2 359.4 113.270 63.639
1953 99.0 365.385 187.0 354.7 115.094 64.989
1954 100.0 363.112 357.8 335.0 116.219 63.761
1955 101.2 397.469 290.4 304.8 117.388 66.019
1956 104.6 419.180 282.2 285.7 118.734 67.857
1957 108.4 442.769 293.6 279.8 120.445 68.169
1958 110.8 444.546 468.1 263.7 121.950 66.513
1959 112.6 482.704 381.3 255.2 123.366 68.655
1960 114.2 502.601 393.1 251.4 125.368 69.564
1961 115.7 518.173 480.6 257.2 127.852 69.331
1962 116.9 554.894 400.7 282.7 130.081 70.551

Table 18.1: The macroeconomic data set for the years 1947 to 1962.

Residual standard error: 1.195 on 9 degrees of freedom


Multiple R-Squared: 0.9926, Adjusted R-squared: 0.9877
F-statistic: 202.5 on 6 and 9 DF, p-value: 4.426e-09
##########
As you can see, not many predictors appear statistically significant at the
0.05 significance level. We also have a fairly high R2 (over 99%). However,
by looking at the variance inflation factors, multicollinearity is obviously an
issue:
##########
GNP Unemployed Armed.Forces Population Year
1214.57215 83.95865 12.15639 230.91221 2065.73394
Employed
220.41968
##########
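Output of this form could be produced along the following lines (a sketch; the data frame name gnp.data and the response name GNP.IPD are assumptions, and vif() is from the car package):

##########
gnp.fit <- lm(GNP.IPD ~ GNP + Unemployed + Armed.Forces + Population + Year + Employed,
              data = gnp.data)
summary(gnp.fit)
car::vif(gnp.fit)   # variance inflation factors; large values flag multicollinearity
##########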
In performing a ridge regression, we first obtain a trace plot of possi-
ble ridge coefficients (Figure 18.1). As you can see, the estimates of the



Figure 18.1: Ridge regression trace plot, with the biasing constant on the
x-axis and the estimated ridge regression coefficients on the y-axis.

regression coefficients shrink drastically until about 0.02. When using the
Hoerl-Kennard method, a value of about k = 0.0068 is obtained. Other
methods will certainly yield different estimates which illustrates some of the
criticism surrounding ridge regression.
The resulting estimates from this ridge regression analysis are

##########
GNP Unemployed Armed.Forces Population Year
25.3615288 3.3009416 0.7520553 -11.6992718 -6.5403380
Employed
0.7864825
##########

The estimates have obviously shrunk closer to 0 compared to the original


estimates.

Example 2: Acetylene Data


This data set of size n = 16 contains observations of the percentage of con-
version of n-heptane to acetylene and three predictor variables. The response


variable is y = conversion of n-heptane to acetylene (%), x1 = reactor tem-


perature (degrees Celsius), x2 = ratio of H2 to n-heptane (Mole ratio), and
x3 = contact time (in seconds). The data set is given in Table 18.2.

i Y X1 X2 X3
1 49.0 1300 7.5 0.0120
2 50.2 1300 9.0 0.0120
3 50.5 1300 11.0 0.0115
4 48.5 1300 13.5 0.0130
5 47.5 1300 17.0 0.0135
6 44.5 1300 23.0 0.0120
7 28.0 1200 5.3 0.0400
8 31.5 1200 7.5 0.0380
9 34.5 1200 11.0 0.0320
10 35.0 1200 13.5 0.0260
11 38.0 1200 17.0 0.0340
12 38.5 1200 23.0 0.0410
13 15.0 1100 5.3 0.0840
14 17.0 1100 7.5 0.0980
15 20.5 1100 11.0 0.0920
16 29.5 1100 17.0 0.0860

Table 18.2: The acetylene data set where Y =conversion of n-heptane to


acetylene (%), X1 =reactor temperature (degrees Celsius), X2 =ratio of H2 to
n-heptane (mole ratio), X3 =contact time (in seconds).

First we run a multiple linear regression procedure to obtain the following


output:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -121.26962 55.43571 -2.188 0.0492 *
reactor.temp 0.12685 0.04218 3.007 0.0109 *
H2.ratio 0.34816 0.17702 1.967 0.0728 .
cont.time -19.02170 107.92824 -0.176 0.8630
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Residual standard error: 3.767 on 12 degrees of freedom


Multiple R-Squared: 0.9198, Adjusted R-squared: 0.8998
F-statistic: 45.88 on 3 and 12 DF, p-value: 7.522e-07
##########
As you can see, reactor temperature is statistically significant at the 0.05
significance level and the H2 ratio is marginally significant, while contact
time clearly is not. We also have a fairly high R2 (around 90%).
However, by looking at the pairwise scatterplots for the predictors in Figure
18.2, there appears to be a distinctive linear relationship between contact
time and reactor temperature. This is further verified by looking at the
variance inflation factors:
##########
reactor.temp H2.ratio cont.time
12.225045 1.061838 12.324964
##########
We will proceed with a principal components regression analysis. First
we perform the SVD of X∗ in order to get the Z matrix. Then, regressing
Y∗ on Z yields
##########
Coefficients:
Z1 Z2 Z3
-0.66277 0.03952 -0.57268
##########
The above is simply β̂ Z .
From the SVD of $\mathbf{X}^*$, the factor loadings matrix is found to be
\[
\mathbf{P} = \begin{pmatrix}
-0.6742704 & 0.2183362 & 0.70547061 \\
-0.2956893 & -0.9551955 & 0.01301144 \\
0.6767033 & -0.1998269 & 0.70861973
\end{pmatrix}.
\]
So transforming back yields
\[
\hat{\boldsymbol{\beta}}_{PC} = \mathbf{P}\hat{\boldsymbol{\beta}}_Z
= \begin{pmatrix} 0.05150079 \\ 0.15076806 \\ -0.86220916 \end{pmatrix}.
\]


Figure 18.2: Pairwise scatterplots for the predictors from the acetylene data
set. LOESS curves are also provided. Does there appear to be any possible
linear relationships between pairs of predictors?



Chapter 19

Piecewise and Nonparametric


Methods

This chapter focuses on regression models where we start to deviate from


the functional form discussed thus far. The first topic discusses a model
where different regressions are fit depending on which area of the predictor
space we are in. The second topic discusses nonparametric models which,
as the name suggests, are free of distributional assumptions
and subsequently do not have regression coefficients readily available for
estimation. This is best accomplished by using a smoother, which is a
tool for summarizing the trend of the response as a function of one or more
predictors. The resulting estimate of the trend is less variable than the
response itself.

19.1 Piecewise Linear Regression


A model that proposes a different linear relationship for different intervals (or
regions) of the predictor is called a piecewise linear regression model.
The predictor values at which the slope changes are called knots, which
we will discuss throughout this chapter. Such models are helpful when you
expect the linear trend of your data to change once you hit some threshold.
Usually the knot values are already predetermined due to previous studies
or standards that are in place. However, there are methods for estimating
the knot values (sometimes called changepoints in the context of piecewise
linear regression), but we will not explore such methods.


For simplicity, we construct the piecewise linear regression model for the
case of simple linear regression and also briefly discuss how this can be ex-
tended to the multiple regression setting. First, let us establish what the
simple linear regression model with one knot value (k1 ) looks like:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_1 - k_1)I\{X_1 > k_1\} + \epsilon,
\]
where $I\{\cdot\}$ is the indicator function such that
\[
I\{X_1 > k_1\} = \begin{cases} 1, & \text{if } X_1 > k_1; \\ 0, & \text{otherwise.} \end{cases}
\]

So, when X1 ≤ k1 , the simple linear regression line is

E(Y ) = β0 + β1 X1

and when X1 > k1 , the simple linear regression line is

E(Y ) = (β0 − β2 k1 ) + (β1 + β2 )X1 .

Such a regression model is fitted in the upper left-hand corner of Figure 19.1.
For more than one knot value, we can extend the above regression model
to incorporate other indicator values. Suppose we have c knot values (i.e.,
k1 , k2 , . . . , kc ) and we have n observations. Then the piecewise linear regres-
sion model is written as:

\[
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 (x_{i,1} - k_1)I\{x_{i,1} > k_1\} + \cdots + \beta_{c+1} (x_{i,1} - k_c)I\{x_{i,1} > k_c\} + \epsilon_i.
\]
As you can see, this can be written more compactly as:
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},
\]
where $\boldsymbol{\beta}$ is a $(c+2)$-dimensional vector and
\[
\mathbf{X} = \begin{pmatrix}
1 & x_{1,1} & (x_{1,1} - k_1)I\{x_{1,1} > k_1\} & \cdots & (x_{1,1} - k_c)I\{x_{1,1} > k_c\} \\
1 & x_{2,1} & (x_{2,1} - k_1)I\{x_{2,1} > k_1\} & \cdots & (x_{2,1} - k_c)I\{x_{2,1} > k_c\} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & (x_{n,1} - k_1)I\{x_{n,1} > k_1\} & \cdots & (x_{n,1} - k_c)I\{x_{n,1} > k_c\}
\end{pmatrix}.
\]

Furthermore, you can see how for more than one predictor you can construct
the X matrix to have columns as functions of the other predictors.
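A sketch of how such a fit can be obtained in R with lm() for a single predictor and one pre-specified knot (the knot value k1 here is hypothetical):

##########
k1  <- 0.5                                       # hypothetical, pre-specified knot
fit <- lm(y ~ x + I((x - k1) * (x > k1)))        # continuous piecewise linear fit
## for a discontinuity at the knot (discussed below), the indicator itself
## can be added as an extra regressor:
fit.jump <- lm(y ~ x + I((x - k1) * (x > k1)) + I(as.numeric(x > k1)))
##########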



Figure 19.1: Plots illustrating continuous and discontinuous piecewise linear


regressions with 1 and 2 knots.


Sometimes you may also have a discontinuity that needs to be reflected


at the knots (see the right-hand side plots of Figure 19.1). This is easily
reflected in the piecewise linear model we constructed above by adding one
more term to the model. For each kj where there is a discontinuity, you
add the corresponding indicator variable $I\{X_1 > k_j\}$ as a regressor.
Thus, the X matrix would have the column vector
\[
\begin{pmatrix}
I\{x_{1,1} > k_j\} \\ I\{x_{2,1} > k_j\} \\ \vdots \\ I\{x_{n,1} > k_j\}
\end{pmatrix}
\]

appended to it for each kj where there is a discontinuity. Extending discon-


tinuities to the case of more than one predictor is analogous.

19.2 Local Regression Methods


Nonparametric regression attempts to find a functional relationship be-
tween yi and xi (only one predictor):

\[
y_i = m(x_i) + \epsilon_i,
\]
where $m(\cdot)$ is the regression function to estimate and $E(\epsilon_i) = 0$. It is not
necessary to assume constant variance and, in fact, one typically assumes
that $\mathrm{Var}(\epsilon_i) = \sigma^2(x_i)$, where $\sigma^2(\cdot)$ is a continuous, bounded function.
Local regression is a method commonly used to model this nonpara-
metric regression relationship. Specifically, local regression makes no global
assumptions about the function m(·). Global assumptions are made in stan-
dard linear regression as we assume that the regression curve we estimate
(which is characterized by the regression coefficient vector β) properly mod-
els all of our data. However, local regression assumes that m(·) can be
well-approximated locally by a member from a simple class of parametric
functions (e.g., a constant, straight-line, quadratic curve, etc.) What drives
local regression is Taylor’s theorem from Calculus, which says that any con-
tinuous function (which we assume that m(·) is) can be approximated with
a polynomial.
In this section, we discuss some of the common local regression methods
for estimating regressions nonparametrically.


19.2.1 Kernel Regression


One way of estimating $m(\cdot)$ is to use density estimation, which approximates
the probability density function $f(\cdot)$ of a random variable $X$. Assuming
we have $n$ independent observations $x_1, \ldots, x_n$ from the random variable
$X$, the kernel density estimator $\hat{f}_h(x)$ for estimating the density at $x$ (i.e.,
$f(x)$) is defined as
\[
\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right).
\]

Here, K(·) is called the kernel function and h is called the bandwidth.
K(·) is a function often resembling a probability density function, but with
no parameters (some common kernel functions are provided in Table 19.1). h
controls the window width around x within which we perform the density estimation.
Thus, a kernel density estimator is essentially a weighting scheme (dictated
by the choice of kernel) which takes into consideration the proximity of a
point in the data set near x when given a bandwidth h. Furthermore, more
weight is given to points near x and less weight is given to points further
from x.
With the formalities established, one can perform a kernel regression of
$y_i$ on $x_i$ to estimate $m_h(\cdot)$ with the Nadaraya-Watson estimator:
\[
\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)},
\]

where m has been subscripted to note its dependency on the bandwidth. As


you can see, this kernel regression estimator is just a weighted sum of the
observed responses.
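As a rough illustration (hypothetical data, with the bandwidth chosen by eye rather than by a formal procedure), the Nadaraya-Watson estimator with a Gaussian kernel can be computed directly in R:

##########
set.seed(2)
x <- runif(100, 0, 1)
y <- sin(2*pi*x) + rnorm(100, sd = 0.3)

nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0)/h)   # kernel weights K((x_i - x)/h)
  sum(w*y)/sum(w)          # weighted average of the observed responses
}

x.grid <- seq(0, 1, length.out = 200)
m.hat <- sapply(x.grid, nw, x = x, y = y, h = 0.05)

plot(x, y)
lines(x.grid, m.hat)
##########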
It is also possible to construct approximate confidence intervals and con-
fidence bands using the Nadaraya-Watson estimator, but under some restric-
tive assumptions. An approximate 100×(1−α)% confidence interval is given
by
$$\hat{m}_h(x) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_h^2(x)\,\|K\|_2^2}{nh\hat{f}_h(x)}},$$

where $h = cn^{-1/5}$ for some constant $c > 0$, $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of

the standard normal distribution, $\|K\|_2^2 = \sum_{i=1}^{n} K^2\!\left(\frac{x_i - x}{h}\right)$, and
$$\hat{\sigma}_h^2(x) = \frac{\frac{1}{n}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\{y_i - \hat{m}_h(x)\}^2}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)}.$$

Kernel       K(u)
Triangle     $(1 - |u|)\,I(|u| \le 1)$
Beta         $\frac{(1-u^2)^g}{\mathrm{Beta}(0.5,\,g+1)}\,I(|u| \le 1)$
Gaussian     $\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2}$
Cosinus      $\frac{1}{2}(1 + \cos(\pi u))\,I(|u| \le 1)$
Optcosinus   $\frac{\pi}{4}\cos\!\left(\frac{\pi u}{2}\right) I(|u| \le 1)$

Table 19.1: A table of common kernel functions. In the above, I(|u| ≤ 1)
is the indicator function yielding 1 if |u| ≤ 1 and 0 otherwise. For the beta
kernel, the value g ≥ 0 is specified by the user and is a shape parameter.
Common values of g are 0, 1, 2, and 3, which are called the uniform,
Epanechnikov, biweight, and triweight kernels, respectively.

Next, let $h = n^{-\delta}$ for $\delta \in \left(\frac{1}{5}, \frac{1}{2}\right)$. Then, under certain regularity conditions,
an approximate $100\times(1-\alpha)\%$ confidence band is given by
$$\hat{m}_h(x) \pm z_{n,\alpha}\sqrt{\frac{\hat{\sigma}_h^2(x)\,\|K\|_2^2}{nh\hat{f}_h(x)}},$$

where
$$z_{n,\alpha} = \frac{-\log\{-\frac{1}{2}\log(1-\alpha)\}}{(2\delta\log(n))^{1/2}} + d_n$$
and
$$d_n = (2\delta\log(n))^{1/2} + (2\delta\log(n))^{-1/2}\log\left(\sqrt{\frac{\|K'\|_2^2}{2\pi\|K\|_2^2}}\right).$$
Some final notes about kernel regression include:

• Choice of kernel and bandwidth are still major issues in research. There
are some general guidelines to follow and procedures that have been
developed, but are beyond the scope of this course.

• What we developed in this section is only for the case of one predictor.
If you have multiple predictors (i.e., x1,i , . . . , xp,i ), then one needs to
use a multivariate kernel density estimator at a point x = (x1 , . . . , xp )T ,
which is defined as
$$\hat{f}_{\mathbf{h}}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\prod_{j=1}^{p}h_j} K\!\left(\frac{x_{i,1} - x_1}{h_1}, \ldots, \frac{x_{i,p} - x_p}{h_p}\right).$$

Multivariate kernels require more advanced methods and are difficult


to use as data sets with more predictors will often suffer from the curse
of dimensionality.

19.2.2 Local Polynomial Regression and LOESS


Local polynomial modeling is similar to kernel regression estimation, but the
fitted values are now produced by a locally weighted regression rather than
by a locally weighted average. The theoretical basis for this approach is to
do a Taylor series expansion of $m(\cdot)$ around a value $x$:
$$m(x_i) \approx m(x) + m'(x)(x_i - x) + \frac{m''(x)(x_i - x)^2}{2} + \ldots + \frac{m^{(q)}(x)(x_i - x)^q}{q!},$$
for $x_i$ in a neighborhood of $x$. It is then parameterized in a way such that
$$m(x_i) \approx \beta_0(x) + \beta_1(x)(x_i - x) + \beta_2(x)(x_i - x)^2 + \ldots + \beta_q(x)(x_i - x)^q, \quad |x_i - x| \le h,$$

so that
$$\beta_0(x) = m(x), \quad \beta_1(x) = m'(x), \quad \beta_2(x) = m''(x)/2, \quad \ldots, \quad \beta_q(x) = m^{(q)}(x)/q!.$$
Note that the β parameters are considered functions of x, hence the “local”
aspect of this methodology.
Local polynomial fitting minimizes
$$\sum_{i=1}^{n}\left\{y_i - \sum_{j=0}^{q}\beta_j(x)(x_i - x)^j\right\}^2 K\!\left(\frac{x_i - x}{h}\right)$$

with respect to the $\beta_j(x)$ terms. Then, letting
$$\mathbf{X} = \begin{pmatrix} 1 & (x_1 - x) & \cdots & (x_1 - x)^q \\ 1 & (x_2 - x) & \cdots & (x_2 - x)^q \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n - x) & \cdots & (x_n - x)^q \end{pmatrix}, \quad \boldsymbol{\beta}(x) = \begin{pmatrix} \beta_0(x) \\ \beta_1(x) \\ \vdots \\ \beta_q(x) \end{pmatrix},$$
and $\mathbf{W} = \mathrm{diag}\left\{K\!\left(\frac{x_1 - x}{h}\right), \ldots, K\!\left(\frac{x_n - x}{h}\right)\right\}$, the local least squares estimate can be written as
$$\hat{\boldsymbol{\beta}}(x) = \operatorname*{arg\,min}_{\boldsymbol{\beta}(x)}\,(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x))^{\rm T}\mathbf{W}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x)) = (\mathbf{X}^{\rm T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{W}\mathbf{Y}.$$

Thus we can estimate the $\nu$th derivative of $m(x)$ by
$$\hat{m}^{(\nu)}(x) = \nu!\,\hat{\beta}_\nu(x).$$
Finally, for any x, we can perform inference on the βj (x) (or the m(ν) (x))
terms in a manner similar to weighted least squares.
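A minimal sketch of this local least squares estimate, for a local linear fit (q = 1) with a Gaussian kernel and hypothetical data, is:

##########
set.seed(3)
x <- runif(100, 0, 1)
y <- sin(2*pi*x) + rnorm(100, sd = 0.3)

local.poly <- function(x0, x, y, h, q = 1) {
  w <- dnorm((x - x0)/h)         # kernel weights
  X <- outer(x - x0, 0:q, "^")   # columns 1, (x - x0), ..., (x - x0)^q
  fit <- lm.wfit(X, y, w)        # weighted least squares at x0
  fit$coefficients[1]            # beta0(x0) estimates m(x0)
}

x.grid <- seq(0, 1, length.out = 200)
m.hat <- sapply(x.grid, local.poly, x = x, y = y, h = 0.1)
##########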
The method of LOESS (which stands for Locally Estimated Scatterplot
Smoother) is commonly used for local polynomial fitting. (There is also another
version of LOESS called LOWESS, which stands for Locally WEighted Scatterplot
Smoother. The main difference is the weighting that is introduced during the
smoothing process.) However, LOESS

is not a simple mathematical model, but rather an algorithm that, when given
a value of X, computes an appropriate value of Y. The algorithm was de-
signed so that the LOESS curve travels through the middle of the data and
gives points closest to each X value the greatest weight in the smoothing
process, thus limiting the influence of outliers.
Suppose we have a set of observations $(x_1, y_1), \ldots, (x_n, y_n)$. LOESS fol-
lows a basic algorithm as follows:
1. Select a set of values partitioning [x(1) , x(n) ]. Let x0 be an individual
value in this set.
2. For each observation, calculate the distance

di = |xi − x0 |.

Let q be the number of observations in the neighborhood of x0 . The


neighborhood is formally defined as the q smallest values of di where
q = ⌈γn⌉. γ is the proportion of points to be selected and is called the
span (usually chosen to be about 0.40), and ⌈·⌉ means to take the next
largest integer if the calculated value is not already an integer.
3. Perform a weighted regression of the yi ’s on the xi ’s using only the
points in the neighborhood. The weights are given by
 
$$w_i(x_0) = T\!\left(\frac{|x_i - x_0|}{d_q}\right),$$
where $T(\cdot)$ is the tricube weight function given by
$$T(u) = \begin{cases} (1 - |u|^3)^3, & \text{if } |u| < 1; \\ 0, & \text{if } |u| \ge 1, \end{cases}$$
and $d_q$ is the largest distance in the neighborhood of observations
close to $x_0$. The weighted regression for $x_0$ is defined by the estimated
regression coefficients
$$\hat{\boldsymbol{\beta}}_{\rm LOESS} = \operatorname*{arg\,min}_{\boldsymbol{\beta}}\sum_{i=1}^{n} w_i(x_0)\left[y_i - \left(\beta_0 + \beta_1(x_i - x_0) + \beta_2(x_i - x_0)^2 + \ldots + \beta_h(x_i - x_0)^h\right)\right]^2.$$

For LOESS, usually h = 2 is sufficient.

4. Calculate the fitted values as
$$\hat{y}_{i,\rm LOESS}(x_0) = \hat{\beta}_{0,\rm LOESS} + \hat{\beta}_{1,\rm LOESS}(x_0) + [\hat{\beta}_{2,\rm LOESS}(x_0)]^2 + \ldots + [\hat{\beta}_{h,\rm LOESS}(x_0)]^h.$$

5. Iterate the above procedure for another value of x0 .

Since outliers can have a large impact on least squares estimates, a ro-
bust weighted regression procedure may also be used to lessen the influence
of outliers on the LOESS curve. This is done by replacing Step 3 in the
algorithm above with a new set of weights. These weights are calculated by
taking the q LOESS residuals

$$r_i^* = y_i - \hat{y}_{i,\rm LOESS}(x_i)$$
and calculating new weights given by
$$w_i^* = w_i\,B\!\left(\frac{|r_i^*|}{6M}\right).$$
Here, $w_i$ is the previous weight for this observation (the first time through,
it is the weight from the original LOESS procedure we outlined), M is the
median of the q absolute values of the residuals, and B(·) is the bisquare
weight function given by
$$B(u) = \begin{cases} (1 - |u|^2)^2, & \text{if } |u| < 1; \\ 0, & \text{if } |u| \ge 1. \end{cases}$$

This robust procedure can be iterated up to 5 times for a given x0 .
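In R, the base function loess() implements this type of fit; a minimal sketch (hypothetical data) where span plays the role of γ, degree = 2 corresponds to h = 2, and family = "symmetric" invokes the robust (bisquare) iterations:

##########
set.seed(4)
x <- runif(80, 0, 1)
y <- sin(2*pi*x) + rnorm(80, sd = 0.3)

fit <- loess(y ~ x, span = 0.4, degree = 2, family = "symmetric")

x.grid <- seq(0, 1, length.out = 200)
y.hat <- predict(fit, newdata = data.frame(x = x.grid))

plot(x, y)
lines(x.grid, y.hat)
##########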


Some other notes about local regression methods include:

• Various forms of local regression exist in the literature. The main


thing to note is that these are approximation methods with much of
the theory being driven by Taylor’s theorem from Calculus.

• Kernel regression is actually a special case of local regression.

• As with kernel regression, there is also an extension of local regression


regarding multiple predictors. It requires use of a multivariate version
of Taylor’s theorem around the p-dimensional point x0 . The model can

include all main effects, pairwise combinations, and k-wise combinations
of the predictors up to the order of h. Weights can then be defined,
such as
$$w_i(\mathbf{x}_0) = T\!\left(\frac{\|\mathbf{x}_i - \mathbf{x}_0\|}{\gamma}\right),$$
where again γ is the span. The values of xi can also be scaled so that
the smoothness occurs the same way in all directions. However, note
that this estimation is often difficult due to the curse of dimensionality.

19.2.3 Projection Pursuit Regression


Besides procedures like LOESS, there is also an exploratory method called
projection pursuit regression (or PPR) which attempts to reveal possible
nonlinear and interesting structures in

$$y_i = m(\mathbf{x}_i) + \epsilon_i$$
by looking at univariate regressions instead of complicated multiple regres-
sions, thereby avoiding the curse of dimensionality. A pure nonparametric
approach can lead to strong oversmoothing, since the sparseness of a
high-dimensional space requires a very large neighborhood of observations
in order to do a local averaging reliably. To estimate the response function m(·) from the data, the
following PPR algorithm is typically used:
1. Set $r_i^{(0)} = y_i$.

2. For $j = 1, \ldots$, maximize
$$R_{(j)}^2 = 1 - \frac{\sum_{i=1}^{n}\left\{r_i^{(j-1)} - \hat{m}_{(j)}\left(\hat{\boldsymbol{\alpha}}_{(j)}^{\rm T}\mathbf{x}_i\right)\right\}^2}{\sum_{i=1}^{n}\left(r_i^{(j-1)}\right)^2}$$
by varying over the orthogonal parameters $\hat{\boldsymbol{\alpha}} \in \mathbb{R}^p$ (i.e., $\|\hat{\boldsymbol{\alpha}}\| = 1$) and
a univariate regression function $\hat{m}_{(j)}(\cdot)$.

3. Compute new residuals
$$r_i^{(j)} = r_i^{(j-1)} - \hat{m}_{(j)}\left(\hat{\boldsymbol{\alpha}}_{(j)}^{\rm T}\mathbf{x}_i\right).$$

4. Repeat steps 2 and 3 until $R_{(j)}^2$ becomes small. A small $R_{(j)}^2$ implies
that $\hat{m}_{(j)}(\hat{\boldsymbol{\alpha}}_{(j)}^{\rm T}\mathbf{x}_i)$ is approximately the zero function and we will not
find any other useful direction.

The advantages of using PPR for estimation are that the univariate
regressions are quick and easy to estimate, and that PPR is able to
approximate a fairly rich class of functions while ignoring variables that
provide little to no information about m(·). Some disadvantages of using PPR
include having to examine a p-dimensional parameter space in order to estimate $\hat{\boldsymbol{\alpha}}_{(j)}$,
and that the interpretation of a single term may be difficult.
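A minimal sketch using the base R function ppr() (hypothetical data with a single-index structure) is:

##########
set.seed(5)
n <- 200
x1 <- runif(n); x2 <- runif(n)
y <- sin(3*(0.7*x1 + 0.3*x2)) + rnorm(n, sd = 0.2)

fit <- ppr(y ~ x1 + x2, nterms = 1, max.terms = 3)
summary(fit)   # reports the estimated projection direction(s) and goodness of fit
##########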

19.3 Smoothing Splines


A smoothing spline is a piecewise polynomial function where the polynomial
pieces fit together at knots. Smoothing splines are continuous on the whole
interval the function is defined on, including at the knots. Mathematically,
a smoothing spline minimizes
$$\sum_{i=1}^{n}(y_i - \eta(x_i))^2 + \omega\int_a^b [\eta''(t)]^2\,dt$$

among all twice continuously differentiable functions η(·) where ω > 0 is a


smoothing parameter and a ≤ x(1) ≤ . . . ≤ x(n) ≤ b (where, recall,
x(1) and x(n) are the minimum and maximum x values, respectively). The
knots are usually chosen as the unique values of the predictors in the data
set, but may also be a subset of them.
In the function above, the first term measures the closeness to the data
while the second term penalizes curvature in the function. In fact, it can be
shown that there exists an explicit, unique minimizer, and that minimizer is
a cubic spline with knots at each of the unique values of the xi .
The smoothing parameter does just what its name suggests: it smooths
the curve. Typically, 0 < ω ≤ 1, but this need not be the case. When ω > 1,
then ω/(1 + ω) is said to be the tuning parameter. Regardless, when the
smoothing parameter is near 1, a smoother curve is produced. Smaller values
of the smoothing parameter (values near 0) often produce rougher curves as
the curve is interpolating nearer to the observed data points (i.e., the curves
are essentially being drawn right to the location of the data points).
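A minimal sketch using the base R function smooth.spline() (hypothetical data) illustrates the effect of the smoothing parameter; here spar is a monotone transformation of the smoothing parameter, and omitting it lets the function choose the amount of smoothing by (generalized) cross-validation:

##########
set.seed(6)
x <- runif(100, 0, 1)
y <- sin(2*pi*x) + rnorm(100, sd = 0.3)

fit.smooth <- smooth.spline(x, y, spar = 0.8)   # smoother curve
fit.rough  <- smooth.spline(x, y, spar = 0.3)   # rougher, more interpolating curve

plot(x, y)
lines(predict(fit.smooth, seq(0, 1, length.out = 200)))
lines(predict(fit.rough, seq(0, 1, length.out = 200)), lty = 2)
##########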

Notice that the cubic smoothing spline introduced above is only capable of
handling one predictor. Suppose now that we have p predictors X1 , . . . , Xp (we forego assuming an intercept for simplicity in this discussion).
We wish to consider the model

$$y_i = \phi(x_{i,1}, \ldots, x_{i,p}) + \epsilon_i = \phi(\mathbf{x}_i) + \epsilon_i,$$

for i = 1, . . . , n, where φ(·) belongs to the space of functions whose partial


derivatives of order m exist and are in L2 (Ωp ) such that Ωp is the domain of
the p-dimensional random variable X (basically, this is saying that the first m derivatives of φ(·) exist when evaluated at our values of x_{i,1}, . . . , x_{i,p} for all i). In general, m and p must satisfy the
constraint 2m − p > 0.
For a fixed ω (i.e., the smoothing parameter), we estimate φ by minimizing
$$\frac{1}{n}\sum_{i=1}^{n}[y_i - \phi(\mathbf{x}_i)]^2 + \omega J_m(\phi),$$

which results in what is called a thin-plate smoothing spline. While there


are several ways to define Jm (φ), a common way to define it for a thin-plate
smoothing spline is by
$$J_m(\phi) = \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty}\sum_{\Gamma}\frac{m!}{t_1!\cdots t_p!}\left(\frac{\partial^m \phi}{\partial x_1^{t_1}\cdots\partial x_p^{t_p}}\right)^2 dx_1\cdots dx_p,$$

where Γ is the set of all permutations of $(t_1, \ldots, t_p)$ such that $\sum_{j=1}^{p} t_j = m$.
Numerous algorithms exist for estimation and have demonstrated fairly
stable numerical results. However, one must gently balance fitting the data
closely with avoiding characterizing the fit with excess variation. Fairly gen-
eral procedures also exist for constructing confidence intervals and estimating
the smoothing parameter. The following subsection briefly describes the least
squares method usually driving these algorithms.

19.3.1 Penalized Least Squares


Penalized least squares estimates are a way to balance fitting the data
closely while avoiding overfitting due to excess variation. A penalized least

squares fit is a surface which minimizes a penalized least squares function


over the class of all such surfaces meeting certain regularity conditions.
Let us assume we are in the case defined earlier for a thin-plate smoothing
spline, but now there is also a parametric component. The model we will
consider is
$$y_i = \phi(\mathbf{z}_i) + \mathbf{x}_i^{\rm T}\boldsymbol{\beta} + \epsilon_i,$$
where zi is a q-dimensional vector of covariates while xi is a p-dimensional
vector of covariates whose relationship with yi is characterized through β.
So notice that we have a parametric component to this function and a non-
parametric component. Such a model is said to be semiparametric in nature
and such models are discussed in the last chapter.
The ordinary least squares estimate for our model estimates φ(zi ) and β
by minimizing
n
1X
(yi − φ(zi ) − xT 2
i β) .
n i=1
However, the functional space of φ(zi ) is so large that a function can always
be found which interpolates the points perfectly, but this will simply reflect
all random variation in the data. Penalized least squares attempts to fit the
data well while providing a degree of smoothness to the fit. Penalized least
squares minimizes
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - \phi(\mathbf{z}_i) - \mathbf{x}_i^{\rm T}\boldsymbol{\beta})^2 + \omega J_m(\phi),$$
where Jm (φ) is the penalty on the roughness of the function φ(·). Again, the
squared term of this function measures the goodness-of-fit, while the second
term measures the smoothness associated with φ(·). A larger ω penalizes
rougher fits, while a smaller ω emphasizes the goodness-of-fit.
A final estimate of φ for the penalized least squares method can be written as
$$\hat{\phi}(\mathbf{z}_i) = \alpha + \mathbf{z}_i^{\rm T}\boldsymbol{\theta} + \sum_{k=1}^{n}\delta_k B_k(\mathbf{x}_i),$$
where the Bk ’s are basis functions dependent on the location of the xi ’s and
α, θ, and δ are coefficients to be estimated. For a fixed ω, (α, θ, δ) can be
estimated. The smoothing parameter ω can be chosen by minimizing the
generalized cross-validation (or GCV) function. Write
ŷ = A(ω)y,

where A(ω) is called the smoothing matrix. Then the GCV is defined as

$$V(\omega) = \frac{\|(\mathbf{I}_{n\times n} - \mathbf{A}(\omega))\mathbf{y}\|^2/n}{[\mathrm{tr}(\mathbf{I}_{n\times n} - \mathbf{A}(\omega))/n]^2}$$

and ω̂ = arg minω V (ω).

19.4 Nonparametric Resampling Techniques for β̂
In this section, we discuss two commonly used resampling techniques, which
are used for estimating characteristics of the sampling distribution of β̂.
While we discuss these techniques for the regression parameter β, it should be
noted that they can be generalized and applied to any parameter of interest.
They can also be used for constructing nonparametric “confidence” intervals
for the parameter(s) of interest.

19.4.1 The Bootstrap


Bootstrapping is a method where you resample from your data (often with
replacement) in order to approximate the distribution of the data at hand.
While conceptually bootstrapping procedures are very appealing (and they
have been shown to possess certain asymptotic properties), they are compu-
tationally intensive. In the nonparametric regression routines we presented,
standard regression assumptions were not made. In these nonstandard sit-
uations, bootstrapping provides a viable alternative for providing standard
errors and confidence intervals for the regression coefficients and predicted
values. When in the regression setting, there are two types of bootstrapping
methods that may be employed. Before we differentiate these methods, we
first discuss bootstrapping in a little more detail.
In bootstrapping, you assume that your sample is actually the pop-
ulation of interest. You draw B samples (B is usually well over 1000)
of size n from your original sample with replacement. With replacement
means that each observation you draw for your sample is always selected
from the entire set of values in your original sample. For each bootstrap
sample, the regression results are computed and stored. For example, if
B = 5000 and we are trying to estimate the sampling distribution of the

regression coefficients for a simple linear regression, then the bootstrap will
yield $(\beta_{0,1}^*, \beta_{1,1}^*), (\beta_{0,2}^*, \beta_{1,2}^*), \ldots, (\beta_{0,5000}^*, \beta_{1,5000}^*)$ as your sample.
Now suppose that you want the standard errors and confidence intervals
for the regression coefficients. The standard deviation of the B estimates
provided by the bootstrapping scheme is the bootstrap estimate of the stan-
dard error for the respective regression coefficient. Furthermore, a bootstrap
confidence interval is found by sorting the B estimates of a regression co-
efficient and selecting the appropriate percentiles from the sorted list. For
example, a 95% bootstrap confidence interval would be given by the 2.5th
and 97.5th percentiles from the sorted list. Other statistics may be computed
in a similar manner.
One assumption which bootstrapping relies heavily on is that your sam-
ple approximates the population fairly well. Thus, bootstrapping does not
usually work well for small samples as they are likely not representative of
the underlying population. Bootstrapping methods should be relegated to
medium sample sizes or larger (what constitutes a medium sample size is
somewhat subjective).
Now we can turn our attention to the two bootstrapping techniques avail-
able in the regression setting. Assume for both methods that our sample
consists of the pairs (x1 , y1 ), . . . , (xn , yn ). Extending either method to the
case of multiple regression is analogous.
We can first bootstrap the observations. In this setting, the bootstrap
samples are selected from the original pairs of data. So the pairing of a
response with its measured predictor is maintained. This method is appro-
priate for data in which both the predictor and response were selected at
random (i.e., the predictor levels were not predetermined).
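A minimal sketch of bootstrapping the observations for a simple linear regression, using base R and hypothetical data, is:

##########
set.seed(7)
n <- 50
x <- runif(n); y <- 1 + 2*x + rnorm(n, sd = 0.5)

B <- 5000
boot.coef <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  idx <- sample(1:n, size = n, replace = TRUE)   # resample the (x, y) pairs
  boot.coef[b, ] <- coef(lm(y[idx] ~ x[idx]))
}

apply(boot.coef, 2, sd)                     # bootstrap standard errors
quantile(boot.coef[, 2], c(0.025, 0.975))   # 95% percentile interval for the slope
##########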
We can also bootstrap the residuals. The bootstrap samples in this setting
are selected from what are called the Davison-Hinkley modified residu-
als, given by
$$e_i^* = \frac{e_i}{\sqrt{1 - h_{i,i}}} - \frac{1}{n}\sum_{j=1}^{n}\frac{e_j}{\sqrt{1 - h_{j,j}}},$$
where the ei ’s are the original regression residuals. We do not simply use
the ei ’s because these lead to biased results. In each bootstrap sample, the
randomly sampled modified residuals are added to the original fitted values
forming new values of y. Thus, the original structure of the predictors will
remain the same while only the response will be changed. This method is
appropriate for designed experiments where the levels of the predictor are

predetermined. Also, since the residuals are sampled and added back at
random, we must assume the variance of the residuals is constant. If not,
this method should not be used.
Finally, a $100\times(1-\alpha)\%$ bootstrap confidence interval for the regression
coefficient $\beta_i$ is given by the corresponding percentiles of the B sorted bootstrap estimates,
$$\left(\beta_{i,\lfloor\frac{\alpha}{2}\times B\rfloor}^{*},\ \beta_{i,\lceil(1-\frac{\alpha}{2})\times B\rceil}^{*}\right),$$
which is then used to calculate a $100\times(1-\alpha)\%$ bootstrap confidence interval
for $\mathrm{E}(Y|X = x_h)$, which is given by
$$\left(\beta_{0,\lfloor\frac{\alpha}{2}\times B\rfloor}^{*} + \beta_{1,\lfloor\frac{\alpha}{2}\times B\rfloor}^{*}x_h,\ \beta_{0,\lceil(1-\frac{\alpha}{2})\times B\rceil}^{*} + \beta_{1,\lceil(1-\frac{\alpha}{2})\times B\rceil}^{*}x_h\right).$$

19.4.2 The Jackknife


Jackknifing, which is similar to bootstrapping, is used in statistical infer-
ence to estimate the bias and standard error (variance) of a statistic when a
random sample of observations is used to calculate it. The basic idea behind
the jackknife variance estimator lies in systematically recomputing the esti-
mator of interest by leaving out one or more observations at a time from the
original sample. From this new set of replicates of the statistic, an estimate
for the bias and variance of the statistic can be calculated, which can then
be used to calculate jackknife confidence intervals.
Below we outline the steps for jackknifing in the simple linear regression
setting for simplicity, but the multiple regression setting is analogous:

1. Draw a sample of size n (x1 , y1 ), . . . , (xn , yn ) and divide the sample into
s independent groups, each of size d.

2. Omit the first set of d observations from the sample and estimate $\beta_0$
and $\beta_1$ from the $(n-d)$ remaining observations (call these estimates
$\hat{\beta}_0^{(J_1)}$ and $\hat{\beta}_1^{(J_1)}$, respectively). The remaining set of $(n-d)$ observations
is called the delete-d jackknife sample.

3. Omit each of the remaining sets of $2, \ldots, s$ groups in turn and esti-
mate the respective regression coefficients. These are $\hat{\beta}_0^{(J_2)}, \ldots, \hat{\beta}_0^{(J_s)}$
and $\hat{\beta}_1^{(J_2)}, \ldots, \hat{\beta}_1^{(J_s)}$. Note that this results in $s = n/d$ delete-d jackknife
samples.

4. Obtain the (joint) probability distribution $F(\beta_0^{(J)}, \beta_1^{(J)})$ of the delete-d jack-
knife estimates. This may be done empirically or through investigation
of an appropriate distribution.

5. Calculate the jackknife regression coefficient estimate, which is the
mean of the $F(\beta_0^{(J)}, \beta_1^{(J)})$ distribution, as
$$\hat{\beta}_j^{(J)} = \frac{1}{s}\sum_{k=1}^{s}\hat{\beta}_j^{(J_k)},$$
for $j = 0, 1$. Thus, the delete-d jackknife (simple) linear regression
equation is
$$\hat{y}_i = \hat{\beta}_0^{(J)} + \hat{\beta}_1^{(J)}x_i.$$

The jackknife bias for each regression coefficient is
$$\widehat{\mathrm{bias}}_J(\hat{\beta}_j) = (n-1)\left(\hat{\beta}_j^{(J)} - \hat{\beta}_j\right),$$
where $\hat{\beta}_j$ is the estimate obtained when using the full sample of size n. The
jackknife variance for each regression coefficient is
$$\widehat{\mathrm{var}}_J(\hat{\beta}_j) = \frac{n-1}{n}\sum_{k=1}^{s}\left(\hat{\beta}_j^{(J_k)} - \hat{\beta}_j^{(J)}\right)^2,$$
which implies that the jackknife standard error is
$$\widehat{\mathrm{s.e.}}_J(\hat{\beta}_j) = \sqrt{\widehat{\mathrm{var}}_J(\hat{\beta}_j)}.$$
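A minimal sketch of the delete-1 case (d = 1, so s = n) for the slope of a simple linear regression, using base R and hypothetical data, is:

##########
set.seed(8)
n <- 40
x <- runif(n); y <- 1 + 2*x + rnorm(n, sd = 0.5)

beta1.hat  <- coef(lm(y ~ x))[2]                       # full-sample estimate
beta1.jack <- sapply(1:n, function(k) coef(lm(y[-k] ~ x[-k]))[2])

beta1.J <- mean(beta1.jack)                            # jackknife estimate
bias.J  <- (n - 1)*(beta1.J - beta1.hat)               # jackknife bias
se.J    <- sqrt((n - 1)/n*sum((beta1.jack - beta1.J)^2))  # jackknife standard error
##########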

Finally, if normality is appropriate, then a $100\times(1-\alpha)\%$ jackknife con-
fidence interval for the regression coefficient $\beta_j$ is given by
$$\hat{\beta}_j^{(J)} \pm t^*_{n-2;1-\alpha/2}\times\widehat{\mathrm{s.e.}}_J(\hat{\beta}_j).$$
Otherwise, we can construct a fully nonparametric jackknife confidence in-
terval in a similar manner as the bootstrap version. Namely,
$$\left(\hat{\beta}_{j,\lfloor\frac{\alpha}{2}\times s\rfloor}^{(J)},\ \hat{\beta}_{j,\lceil(1-\frac{\alpha}{2})\times s\rceil}^{(J)}\right),$$

which can then be used to calculate a $100\times(1-\alpha)\%$ jackknife confidence
interval for $\mathrm{E}(Y|X = x_h)$, which is given by
$$\left(\hat{\beta}_{0,\lfloor\frac{\alpha}{2}\times s\rfloor}^{(J)} + \hat{\beta}_{1,\lfloor\frac{\alpha}{2}\times s\rfloor}^{(J)}x_h,\ \hat{\beta}_{0,\lceil(1-\frac{\alpha}{2})\times s\rceil}^{(J)} + \hat{\beta}_{1,\lceil(1-\frac{\alpha}{2})\times s\rceil}^{(J)}x_h\right).$$

While for moderately sized data the jackknife requires less computation,
there are some drawbacks to using the jackknife. Since the jackknife is us-
ing fewer samples, it is only using limited information about β̂. In fact,
the jackknife can be viewed as an approximation to the bootstrap (it is a
linear approximation to the bootstrap in that the two are roughly equal for
linear estimators). Moreover, the jackknife can perform quite poorly if the
estimator of interest is not sufficiently “smooth” (intuitively, smooth can be
thought of as small changes to the data result in small changes to the calcu-
lated statistic), which can especially occur when your sample is too small.

19.5 Examples
Example 1: Packaging Data Set
This data set of size n = 15 contains measurements of yield from a packaging
plant where the manager wants to model the unit cost (y) of shipping lots
of a fragile product as a linear function of lot size (x). Table 19.2 gives the
data used for this analysis. Because of economies of scale, the manager
believes that the cost per unit will decrease at a fast rate for lot sizes of more
than 1000.
Based on the description of this data, we wish to fit a (continuous) piece-
wise regression with one knot value at k1 = 1000. Figure 19.2 gives a scatter-
plot of the raw data with a vertical line at the lot size of 1000. This appears
to be a good fit.
We can also obtain summary statistics regarding the fit

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0240268 0.1766955 22.774 3.05e-11 ***
lot.size -0.0020897 0.0002052 -10.183 2.94e-07 ***
lot.size.I -0.0013937 0.0003644 -3.825 0.00242 **
---

i Unit Cost Lot Size


1 1.29 1150
2 2.20 840
3 2.26 900
4 2.38 800
5 1.77 1070
6 1.25 1220
7 1.87 980
8 0.71 1300
9 2.90 520
10 2.63 670
11 0.55 1420
12 2.31 850
13 1.90 1000
14 2.15 910
15 1.20 1230

Table 19.2: The packaging data set pertaining to n = 15 observations.

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.09501 on 12 degrees of freedom


Multiple R-Squared: 0.9838, Adjusted R-squared: 0.9811
F-statistic: 363.4 on 2 and 12 DF, p-value: 1.835e-11
##########
As can be seen, the two predictors are statistically significant for this piece-
wise linear regression model.
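A fit like the one summarized above could be obtained with a call of roughly the following form (a sketch: the variable lot.size.I is assumed to be the truncated term (lot.size − 1000)I(lot.size > 1000), and the data frame construction is ours, using the values in Table 19.2):

##########
packaging <- data.frame(
  cost = c(1.29, 2.20, 2.26, 2.38, 1.77, 1.25, 1.87, 0.71,
           2.90, 2.63, 0.55, 2.31, 1.90, 2.15, 1.20),
  lot.size = c(1150, 840, 900, 800, 1070, 1220, 980, 1300,
               520, 670, 1420, 850, 1000, 910, 1230)
)
packaging$lot.size.I <- (packaging$lot.size - 1000)*(packaging$lot.size > 1000)

fit <- lm(cost ~ lot.size + lot.size.I, data = packaging)
summary(fit)
##########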

Example 2: Quality Measurements Dataset (continued )


Recall that we fit a quadratic polynomial to the quality measurement data
set. Let us also fit a nonparametric regression curve to this data and calculate
bootstrap confidence intervals for the slope parameters. Figure 19.3(a) shows
two LOESS curves with two different spans.
Here are some general things to think about when fitting data with
LOESS:
1. Which fit appears to be better to you?

[Figure 19.2: A scatterplot of the packaging data set with a piecewise linear regression fitted to the data.]

2. How do you think more data would affect the smoothness of the fits?

3. If we drive the span to 0, what type of regression line would you expect
to see?

4. If we drive the span to 1, what type of regression line would you expect
to see?

Figure 19.3(b) shows two kernel regression curves with two different band-
widths. A Gaussian kernel is used. Some things to think about when fitting
the data (as with the LOESS fit) are:

1. Which fit appears to be better to you?

2. How do you think more data would affect the smoothness of the fits?

3. What type of regression line would you expect to see as we change the
bandwidth?

4. How does the choice of kernel affect the fit?

When performing local fitting (as with kernel regression), the last two points
above are issues where there are still no clear solutions.

[Figure 19.3: (a) A scatterplot of the quality data set and two LOESS fits with different spans (0.4 and 0.9). (b) A scatterplot of the quality data set and two kernel regression fits with different bandwidths (5 and 15).]

Next, let us return to the orthogonal regression fit of this data. Recall
that the slope term for the orthogonal regression fit was 1.4835. Using a
nonparametric bootstrap (with B = 5000 bootstraps), we can obtain the
following bootstrap confidence intervals for the orthogonal slope parameter:

• 90% bootstrap confidence interval: (0.9677, 2.9408).

• 95% bootstrap confidence interval: (0.8796, 3.6184).

• 99% bootstrap confidence interval: (0.6473, 6.5323).

Remember that if you were to perform another bootstrap with B = 5000,


then the estimated intervals given above will be slightly different due to the
randomness of the resampling process!


Chapter 20

Regression Models with Censored Data

Suppose we wish to estimate the parameters of a distribution where only a


portion of the data is known. When the remainder of the data has a measure-
ment that exceeds (or falls below) some threshold and only that threshold
value is recorded for that observation, then the data are said to be censored.
When the data exceeds (or falls below) some threshold, but the data is omit-
ted from the database, then the data are said to be truncated. This chapter
deals primarily with the analysis of censored data by first introducing the
area of reliability (survival) analysis and then presenting some of the basic
tools and models from this area as a segue into a regression setting. We also
devote a section to discussing truncated regression models.

20.1 Overview of Reliability and Survival Analysis
It is helpful to formally define the area of analysis which is heavily concerned
with estimating models with censored data. Survival analysis concerns
the analysis of data from biological events associated with the study of ani-
mals and humans. Reliability analysis concerns the analysis of data from
events associated with the study of engineering applications. We will utilize
terminology from both areas for the sake of completeness.
Survival (reliability) analysis studies the distribution of lifetimes (failure
times). The study will consist of the elapsed time between an initiating event

and a terminal event. For example:


• Study the time of individuals in a cancer study. The initiating time
could be the diagnosis of cancer or the start of treatment. The terminal
event could be death or cure of the disease.
• Study the lifetime of various machine motors. The initiating time could
be the date the machine was first brought online. The terminal event
could be complete machine failure or the first time it must be brought
off-line for maintenance.
The data are a combination of complete and censored values, which means
a terminal event has occurred or not occurred, respectively. Formally, let Y
be the observed time from the study, T denote the actual event time (which
is sometimes referred to as the latent variable), and t denote some known
threshold value where the values are censored. Observations in a study can
be censored in the following manners:
• Right censoring: This occurs when an observation has dropped out,
been removed from a study, or did not reach a terminal event prior to
termination of the study. In other words, Y ≤ T such that

$$Y = \begin{cases} T, & T < t; \\ t, & T \ge t. \end{cases}$$

• Left censoring: This occurs when an observation reaches a terminal


event before the first time point in the study. In other words, Y ≥ T
such that

$$Y = \begin{cases} T, & T > t; \\ t, & T \le t. \end{cases}$$

• Interval censoring: This occurs when a study has discrete time points
and an observation reaches a terminal event between two of the time
points. In other words, for discrete time increments 0 = t1 < t2 <
. . . < tr < ∞, we have Y1 < T < Y2 such that for j = 1, . . . , r − 1,

$$Y_1 = \begin{cases} t_j, & t_j < T < t_{j+1}; \\ 0, & \text{otherwise} \end{cases}$$
and
$$Y_2 = \begin{cases} t_{j+1}, & t_j < T < t_{j+1}; \\ \infty, & \text{otherwise.} \end{cases}$$

• Double censoring: This is when all of the above censored observa-


tions can occur in a study.
Moreover, there are two criteria that define the type of censoring in a study. If
the experimenter controls the type of censoring, then we have non-random
censoring, of which there are two types:
• Type I or time-truncated censoring, which occurs if an observa-
tion is still alive (in operation) when a test is terminated after a pre-
determined length of time.
• Type II or failure-truncated censoring, which occurs if an obser-
vation is still alive (in operation) when a test is terminated after a
pre-determined number of failures is reached.
Suppose T has probability density function f (t) with cumulative distri-
bution function F (t). Since we are interested in survival times (lifetimes),
the support of T is (0, +∞). There are 3 functions usually of interest in a
survival (reliability) analysis:
• The survival function S(t) (or reliability function R(t)) is given
by:
$$S(t) = R(t) = \int_t^{+\infty} f(x)\,dx = 1 - F(t).$$
This is the probability that an individual survives (or something is
reliable) beyond time t and is usually the first quantity studied.
• The hazard rate h(t) (or conditional failure rate) is the probability
that an observation at time t will experience a terminal event in the
next instant. It is given by:
$$h(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{R(t)}.$$
S(t) R(t)
The empirical hazard (conditional failure) rate function is useful in
identifying which probability distribution to use if it is not already
specified.
• The cumulative hazard function H(t) (or cumulative conditional
failure function) is given by:
$$H(t) = \int_0^t h(x)\,dx.$$

These are only the basics when it comes to survival (reliability) analysis.
However, they provide enough of a foundation for our interests. We are
interested in when a set of predictors (or covariates) are also measured
with the observed time.
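As a small numerical illustration (arbitrary Weibull shape and scale values), these three functions are easily related in R:

##########
t <- seq(0.1, 10, by = 0.1)
shape <- 1.5; scale <- 4

S <- pweibull(t, shape, scale, lower.tail = FALSE)   # survival function S(t)
h <- dweibull(t, shape, scale)/S                     # hazard rate h(t) = f(t)/S(t)
H <- -log(S)                                         # cumulative hazard H(t)

plot(t, h, type = "l")   # increasing hazard here since shape > 1
##########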

20.2 Censored Regression Model


Censored regression models (also called the Tobit model) simply at-
tempt to model the unknown variable T (which is assumed left-censored) as
a linear combination of the covariates $X_1, \ldots, X_{p-1}$. For a sample of size n,
we have
$$T_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1}x_{i,p-1} + \epsilon_i,$$
where $\epsilon_i \stackrel{\rm iid}{\sim} N(0, \sigma^2)$. Based on this model, it can be shown for the
observed variable Y that
$$\mathrm{E}[Y_i|Y_i > t] = \mathbf{X}_i^{\rm T}\boldsymbol{\beta} + \sigma\lambda(\alpha_i),$$
where $\alpha_i = (t - \mathbf{X}_i^{\rm T}\boldsymbol{\beta})/\sigma$ and
$$\lambda(\alpha_i) = \frac{\phi(\alpha_i)}{1 - \Phi(\alpha_i)}$$
such that φ(·) and Φ(·) are the probability density function and cumulative
distribution function of a standard normal random variable (i.e., N (0, 1)),
respectively. Moreover, the quantity λ(αi ) is called the inverse Mills ratio,
which reappears later in our discussion about the truncated regression model.
If we let i1 be the index of all of the uncensored values and i2 be the index
of all of the left-censored values, then we can define a log-likelihood function
for the estimation of the regression parameters (see Appendix C for further
details on likelihood functions):
$$\ell(\boldsymbol{\beta}, \sigma) = -\frac{1}{2}\sum_{i_1}\left[\log(2\pi) + \log(\sigma^2) + (y_i - \mathbf{X}_i^{\rm T}\boldsymbol{\beta})^2/\sigma^2\right] + \sum_{i_2}\log\left(1 - \Phi(\mathbf{X}_i^{\rm T}\boldsymbol{\beta}/\sigma)\right).$$

Optimization of the above equation yields estimates for β and σ.
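One common way to fit this model in R is through the survival package's survreg() with a Gaussian distribution and a left-censoring indicator; a minimal sketch with hypothetical data censored at t = 0 is:

##########
library(survival)

set.seed(11)
n <- 200
x <- runif(n)
t.star <- -0.5 + 2*x + rnorm(n)   # latent variable T
y <- pmax(t.star, 0)              # observed response, left-censored at 0

## the second argument is 1 for uncensored observations, 0 for censored ones
fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")
summary(fit)
##########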


Now it should be noted that this is a very special case of a broader class of
survival (reliability) regression models. However, it is commonly used so that
is why it is usually treated separately than the broader class of regression
models that are discussed in the next section.

20.3 Survival (Reliability) Regression


Suppose we have n observations where we measure p − 1 covariates with the
observed time. In our examples from the introduction, some covariates you
may also measure include:
• Gender, age, weight, and previous ailments of the cancer patients.

• Manufacturer, metal used for the drive mechanisms, and running tem-
perature of the machines.
Let X∗ be the matrix of covariates as in the standard multiple regression
model, but without the first column consisting of 1’s (so it is an n × (p − 1)
matrix). Then we model

$$T^* = \beta_0 + \mathbf{X}^{*\rm T}\boldsymbol{\beta}^* + \epsilon,$$

where $\boldsymbol{\beta}^*$ is a $(p-1)$-dimensional vector, $T^* = \ln T$, and $\epsilon$ has a certain
distribution (which we will discuss shortly). Then
$$T = \exp(T^*) = e^{\beta_0 + \mathbf{X}^{*\rm T}\boldsymbol{\beta}^*}e^{\epsilon} = e^{\beta_0 + \mathbf{X}^{*\rm T}\boldsymbol{\beta}^*}\tilde{T},$$
where $\tilde{T} = e^{\epsilon}$. So, the covariate acts multiplicatively on the survival time
T.
The distribution of $\epsilon$ will allow us to determine the distribution of $T^*$.
Each possible probability distribution has a different h(t). Furthermore, in
a survival regression setting, we assume the hazard rate at time t for an
individual has the form:
$$h(t|\mathbf{X}^*) = h_0(t)\,k(\mathbf{X}^{*\rm T}\boldsymbol{\beta}^*) = h_0(t)\,e^{\mathbf{X}^{*\rm T}\boldsymbol{\beta}^*}.$$

In the above, h0 (t) is called the baseline hazard and is the value of the
hazard function when X∗ = 0 or when β ∗ = 0. Note in the expression for
T ∗ that we separated out the intercept term β0 as it becomes part of the
baseline hazard. Also, k(·) in the equation for h(t|X∗ ) is a specified link
function, which for our purposes will be e(·) .
Next we discuss some of the possible (and common) distributions assumed
for . We do not write out the density formulas here, but they can be found in
most statistical texts. The parameters for your distribution help control three

primary aspects of the density curve: location, scale, and shape. You will
want to consider the properties your data appear to exhibit (or historically
have exhibited) when determining which of the following to use:

• The normal distribution with location parameter (mean) µ and scale


parameter (variance) σ 2 . As we have seen, this is one of the more
commonly used distributions in statistics, but is infrequently used for
lifetime distributions as it allows negative values while lifetimes are al-
ways positive. One possibility is to consider a truncated normal or a
log transformation (which we discuss next).

• The lognormal distribution with location parameter δ, scale parameter


µ, and shape parameter σ 2 . δ gives the minimum value of the random
variable T , and the scale and shape parameters of the lognormal distri-
bution are the location and scale parameters of the normal distribution,
respectively. Note that if T has a lognormal distribution, then ln(T )
has a normal distribution.

• The Weibull distribution with location parameter δ, scale parameter


β, and shape parameter α. The Weibull distribution is probably most
commonly used for time to failure data since it is fairly flexible to work
with. δ again gives the minimum value of the random variable T and
is often set to 0 so that the support of T is positive. Assuming δ = 0
provides the more commonly used two-parameter Weibull distribution.

• The Gumbel distribution (or extreme-value distribution) with location


parameter µ and scale parameter σ 2 is sometimes used, but more often
it is presented due to its relationship to the Weibull distribution. If T
has a Weibull distribution, then ln(T ) has a Gumbel distribution.

• The exponential distribution with location parameter δ and scale pa-


rameter σ (or sometimes called rate 1/σ). δ again gives the minimum
value of the random variable T and is often set to 0 so that the support
of T is positive. Setting δ = 0 results in what is usually referred to as
the exponential distribution. The exponential distribution is a model
for lifetimes with a constant failure rate. If T has an exponential dis-
tribution with δ = 0, then ln(T ) has a standard Gumbel distribution
(i.e., the scale of the Gumbel distribution is 1).

• The logistic distribution with location parameter µ and scale parameter


σ. This distribution is very similar to the normal distribution, but is
used in cases where there are “heavier tails” (i.e., higher probability of
the data occurring out in the tails of the distribution).

• The log-logistic distribution with location parameter δ, scale parameter


λ, and shape parameter α. δ again gives the minimum value of the
random variable T and is often set to 0 so that the support of T is
positive. Setting δ = 0 results in what is usually referred to as the log-
logistic distribution. If T has a log-logistic distribution with δ = 0, then
ln(T ) has a logistic distribution with location parameter µ = −1/ ln(λ)
and scale parameter σ = 1/α.

• The gamma distribution with location parameter δ, scale parameter β,


and shape parameter α. The gamma distribution is a competitor to the
Weibull distribution, but is more mathematically complicated and thus
avoided where the Weibull appears to provide a good fit. The gamma
distribution also arises because the sum of independent exponential
random variables is gamma distributed. δ again gives the minimum
value of the random variable T and is often set to 0 so that the support
of T is positive. Setting δ = 0 results in what is usually referred to as
the gamma distribution.

• The beta distribution has two shape parameters, α and β, as well as


two location parameters, A and B, which denote the minimum and
maximum of the data. If the beta distribution is used for lifetime
data, then it appears when fitting data which are assumed to have an
absolute minimum and absolute maximum. Thus, A and B are almost
always assumed known.

Note that the above is not an exhaustive list, but provides some of the more
commonly used distributions in statistical texts and software. Also, there is
an abuse of notation in that duplication of certain characters (e.g., µ, σ, etc.)
does not imply a mathematical relationship between all of the distributions
where that character appears.
Estimation of the parameters can be accomplished in two primary ways.
One way is to construct a probability plot of the chosen distribution with your
data and then apply least squares regression to this plot. Another, perhaps
more appropriate, approach is to use maximum likelihood estimation as it

can be shown to be optimum in most situations and provides estimates of


standard errors, and thus confidence limits. Maximum likelihood estimation
is commonly accomplished by using a Newton-Raphson algorithm.

20.4 Cox Proportional Hazards Regression


Recall from the last section that we set $T^* = \ln(T)$, where the hazard function
is $h(t|\mathbf{X}^*) = h_0(t)e^{\mathbf{X}^{*\rm T}\boldsymbol{\beta}^*}$. The Cox formulation of this relationship gives:
$$\ln(h(t)) = \ln(h_0(t)) + \mathbf{X}^{*\rm T}\boldsymbol{\beta}^*,$$

which yields the following form of the linear regression model:


 
$$\ln\!\left(\frac{h(t)}{h_0(t)}\right) = \mathbf{X}^{*\rm T}\boldsymbol{\beta}^*.$$

Exponentiating both sides yields a ratio of the actual hazard rate and baseline
hazard rate, which is called the relative risk:

$$\frac{h(t)}{h_0(t)} = e^{\mathbf{X}^{*\rm T}\boldsymbol{\beta}^*} = \prod_{i=1}^{p-1} e^{\beta_i x_i}.$$

Thus, the regression coefficients have the interpretation as the relative risk
when the value of a covariate is increased by 1 unit. The estimates of the
regression coefficients are interpreted as follows:

• A positive coefficient means there is an increase in the risk, which


decreases the expected survival (failure) time.

• A negative coefficient means there is a decrease in the risk, which in-


creases the expected survival (failure) time.

• The ratio of the estimated risk functions for two different sets of covari-
ates (i.e., two groups) can be used to examine the likelihood of Group
1’s survival (failure) time relative to Group 2’s survival (failure) time.

Remember, for this model the intercept term has been absorbed by the base-
line hazard.
The model we developed above is the Cox Proportional Hazards re-
gression model and does not include t on the right-hand side. Thus, the
relative risk is constant for all values of t. Estimation for this regression model
is usually done by maximum likelihood and Newton-Raphson is usually the
algorithm used. Usually, the baseline hazard is found nonparametrically, so
the estimation procedure for the entire model is said to be semiparametric.
Additionally, if there are failure time ties in the data, then the likelihood gets
more complex and an approximation to the likelihood is usually used (such
as the Breslow Approximation or the Efron Approximation).
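A minimal sketch of fitting a Cox proportional hazards model with the survival package's coxph() (hypothetical right-censored data; the Efron approximation for ties is requested explicitly) is:

##########
library(survival)

set.seed(12)
n <- 150
age <- rnorm(n, 60, 10)
group <- rbinom(n, 1, 0.5)
time <- rexp(n, rate = exp(-5 + 0.03*age + 0.5*group))
status <- rbinom(n, 1, 0.8)   # 1 = event observed, 0 = right-censored

fit <- coxph(Surv(time, status) ~ age + group, ties = "efron")
summary(fit)   # the exp(coef) column gives the estimated relative risks
##########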

20.5 Diagnostic Procedures


Depending on the survival regression model being used, the diagnostic mea-
sures presented here may have a slightly different formulation. We do present
somewhat of a general form for these measures, but the emphasis is on the
purpose of each measure. It should also be noted that one can perform for-
mal hypothesis testing and construct statistical intervals based on various
estimates.

Cox-Snell Residuals
In the previous regression models we studied, residuals were defined as a
difference between observed and fitted values. For survival regression, in
order to check the overall fit of a model, the Cox-Snell residual for the ith
observation in a data set is used and defined as:
$$r_{C_i} = \hat{H}_0(t_i)\,e^{\mathbf{X}_i^{*\rm T}\hat{\boldsymbol{\beta}}^*}.$$

In the above, β̂ is the maximum likelihood estimate of the regression co-
efficient vector. Ĥ0 (ti ) is a maximum likelihood estimate of the baseline
cumulative hazard function H0 (ti ), defined as:
$$H_0(t) = \int_0^t h_0(x)\,dx.$$

Notice that rCi > 0 for all i. The way we check for a goodness-of-fit with the
Cox-Snell residuals is to estimate the cumulative hazard rate of the residuals

(call this ĤrC (trCi )) from whatever distribution you are assuming, and then
plot ĤrC (trCi ) versus rCi . A good fit would be suggested if they form roughly
a straight line (like we looked for in probability plots).

Martingale Residuals
Define a censoring indicator for the ith observation as

$$\delta_i = \begin{cases} 0, & \text{if observation } i \text{ is censored}; \\ 1, & \text{if observation } i \text{ is uncensored.} \end{cases}$$

In order to identify the best functional form for a covariate given the assumed
functional form of the remaining covariates, we use the Martingale residual
for the ith observation, which is defined as:

M̂i = δi − rCi .

The M̂i values fall in the interval (−∞, 1] and are always negative
for censored values. The M̂i values are plotted against the xj,i , where j
represents the index of the covariate for which we are trying to identify
the best functional form. Plotting a smooth-fitted curve over this data set
will indicate what sort of function (if any) should be applied to xj,i . Note
that the martingale residuals are not symmetrically distributed about 0, but
asymptotically they have mean 0.

Deviance Residuals
Outlier detection in a survival regression model can be done using the de-
viance residual for the ith observation:
$$D_i = \mathrm{sgn}(\hat{M}_i)\sqrt{-2\left(\ell_i(\hat{\boldsymbol{\theta}}) - \ell_{S_i}(\boldsymbol{\theta}_i)\right)}.$$

For Di , `i (θ̂) is the ith log likelihood evaluated at θ̂, which is the maximum
likelihood estimate of the model’s parameter vector θ. `Si (θi ) is the log
likelihood of the saturated model evaluated at the maximum likelihood θ.
A saturated model is one where n parameters (i.e., θ1 , . . . , θn ) fit the n
observations perfectly.
The Di values should behave like a standard normal sample. A normal
probability plot of the Di values and a plot of Di versus the fitted ln(t)i
values will help to determine if any values are fairly far from the bulk of

the data. It should be noted that this only applies to cases where light to
moderate censoring occurs.
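As a rough illustration, martingale and deviance residuals can be extracted from a fitted Cox model with residuals(); the sketch below continues the hypothetical coxph() fit sketched at the end of Section 20.4:

##########
library(survival)

## assuming `fit` is the coxph object from the earlier sketch
mart.res <- residuals(fit, type = "martingale")
dev.res  <- residuals(fit, type = "deviance")

plot(age, mart.res); lines(lowess(age, mart.res))   # functional-form check for a covariate
qqnorm(dev.res); qqline(dev.res)                    # outlier / normality check
##########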

Partial Deviance
Finally, we can also consider hierarchical (nested) models. We start by defin-
ing the model deviance:
$$\Delta = \sum_{i=1}^{n} D_i^2.$$

Suppose we are interested in seeing if adding additional covariates to our


model significantly improves the fit from our original model. Suppose we
calculate the model deviances under each model. Denote these model de-
viances as ∆R and ∆F for the reduced model (our original model) and the
full model (our model with all covariates included), respectively. Then, a
measure of the fit can be done using the partial deviance:

$$\Lambda = \Delta_R - \Delta_F = -2\left(\ell(\hat{\boldsymbol{\theta}}_R) - \ell(\hat{\boldsymbol{\theta}}_F)\right) = -2\log\!\left(\frac{L(\hat{\boldsymbol{\theta}}_R)}{L(\hat{\boldsymbol{\theta}}_F)}\right),$$
where $\ell(\hat{\boldsymbol{\theta}}_R)$ and $\ell(\hat{\boldsymbol{\theta}}_F)$ are the log likelihood functions (with corresponding
likelihoods $L(\hat{\boldsymbol{\theta}}_R)$ and $L(\hat{\boldsymbol{\theta}}_F)$) evaluated at the maximum
likelihood estimates of the reduced and full models, respectively. Luck-
ily, this is a likelihood ratio statistic and has the corresponding asymptotic χ2
distribution. A large value of Λ (large with respect to the corresponding χ2
distribution) indicates the additional covariates improve the overall fit of the
model. A small value of Λ means they add nothing significant to the model
and you can keep the original set of covariates. Notice that this procedure
is similar to the extra sum of squares procedure developed in the previous
course.

20.6 Truncated Regression Models


Truncated regression models are used in cases where observations with
values for the response variable that are below and/or above certain thresh-
olds are systematically excluded from the sample. Therefore, entire obser-
vations are missing so that neither the dependent nor independent variables

are known. For example, suppose we had wages and years of schooling for
a sample of employees. Some persons for this study are excluded from the
sample because their earned wages fall below the minimum wage. So the
data would be missing for these individuals.
Truncated regression models are often confused with the censored regres-
sion models that we introduced earlier. In censored regression models, only
the value of the dependent variable is clustered at a lower and/or upper
threshold value, while values of the independent variable(s) are still known.
In truncated regression models, entire observations are systematically omit-
ted from the sample based on the lower and/or upper threshold values. Re-
gardless, if we know that the data has been truncated, we can adjust our
estimation technique to account for the bias introduced by omitting values
from the sample. This will allow for more accurate inferences about the en-
tire population. However, if we are solely interested in the population that
does not fall outside the threshold value(s), then we can rely on standard
techniques that we have already introduced, namely ordinary least squares.
Let us formulate the general framework for truncated distributions. Sup-
pose that X is a random variable with a probability density function fX
and associated cumulative distribution function FX (the discrete setting is
defined analogously). Consider the two-sided truncation a < X < b. Then
the truncated distribution is given by
$$f_X(x|a < X < b) = \frac{g_X(x)}{F_X(b) - F_X(a)},$$
where
$$g_X(x) = \begin{cases} f_X(x), & a < x < b; \\ 0, & \text{otherwise.} \end{cases}$$

Similarly, one-sided truncated distributions can be defined by assuming a or


b are set at the respective, natural bound of the support for the distribution
of X (i.e., FX (a) = 0 or FX (b) = 1, respectively). So a bottom-truncated (or
left-truncated) distribution is given by
$$f_X(x|a < X) = \frac{g_X(x)}{1 - F_X(a)},$$
while a top-truncated (or right-truncated) distribution is given by
$$f_X(x|X < b) = \frac{g_X(x)}{F_X(b)}.$$

gX (x) is then defined accordingly for whichever distribution with which you
are working.
Consider the canonical multiple linear regression model

$$Y_i = \mathbf{X}_i^{\rm T}\boldsymbol{\beta} + \epsilon_i,$$
where $\epsilon_i \stackrel{\rm iid}{\sim} N(0, \sigma^2)$. If no truncation (or censoring) is assumed with the
data, then normal distribution theory yields
$$Y_i|\mathbf{X}_i \sim N(\mathbf{X}_i^{\rm T}\boldsymbol{\beta}, \sigma^2).$$

When truncating the response, the distribution, and consequently the mean
and variance of the truncated distribution, must be adjusted accordingly.
Consider the three possible truncation settings of a < Y < b (two-sided
truncation), a < Yi (bottom-truncation), and Yi < b (top-truncation). Let
$\alpha_i = (a - \mathbf{X}_i^{\rm T}\boldsymbol{\beta})/\sigma$, $\gamma_i = (b - \mathbf{X}_i^{\rm T}\boldsymbol{\beta})/\sigma$, and $\psi_i = (y_i - \mathbf{X}_i^{\rm T}\boldsymbol{\beta})/\sigma$, such that $y_i$
is the realization of the random variable Yi . Moreover, recall that λ(z) is the
inverse Mills ratio applied to the value of z and let

δ(z) = λ(z)[(Φ(z))−1 − 1].

Then using established results for the truncated normal distribution, the
three different truncated probability density functions are
$$f_{Y|X}(y_i|\Theta, \mathbf{X}_i^{\rm T}, \boldsymbol{\beta}, \sigma) = \begin{cases} \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{1 - \Phi(\alpha_i)}, & \Theta = \{a < Y_i\} \text{ and } a < y_i; \\[2ex] \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \Theta = \{a < Y_i < b\} \text{ and } a < y_i < b; \\[2ex] \dfrac{\frac{1}{\sigma}\phi(\psi_i)}{\Phi(\gamma_i)}, & \Theta = \{Y_i < b\} \text{ and } y_i < b, \end{cases}$$

while the respective truncated cumulative distribution functions are
$$F_{Y|X}(y_i|\Theta, \mathbf{X}_i^{\rm T}, \boldsymbol{\beta}, \sigma) = \begin{cases} \dfrac{\Phi(\psi_i) - \Phi(\alpha_i)}{1 - \Phi(\alpha_i)}, & \Theta = \{a < Y_i\} \text{ and } a < y_i; \\[2ex] \dfrac{\Phi(\psi_i) - \Phi(\alpha_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \Theta = \{a < Y_i < b\} \text{ and } a < y_i < b; \\[2ex] \dfrac{\Phi(\psi_i)}{\Phi(\gamma_i)}, & \Theta = \{Y_i < b\} \text{ and } y_i < b. \end{cases}$$

Furthermore, the means of the three different truncated distributions are
$$\mathrm{E}[Y_i|\Theta, \mathbf{X}_i^{\rm T}] = \begin{cases} \mathbf{X}_i^{\rm T}\boldsymbol{\beta} + \sigma\lambda(\alpha_i), & \Theta = \{a < Y_i\}; \\[1.5ex] \mathbf{X}_i^{\rm T}\boldsymbol{\beta} + \sigma\left(\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}\right), & \Theta = \{a < Y_i < b\}; \\[1.5ex] \mathbf{X}_i^{\rm T}\boldsymbol{\beta} - \sigma\delta(\gamma_i), & \Theta = \{Y_i < b\}, \end{cases}$$
while the corresponding variances are
$$\mathrm{Var}[Y_i|\Theta, \mathbf{X}_i^{\rm T}] = \begin{cases} \sigma^2\{1 - \lambda(\alpha_i)[\lambda(\alpha_i) - \alpha_i]\}, & \Theta = \{a < Y_i\}; \\[1.5ex] \sigma^2\left[1 + \dfrac{\alpha_i\phi(\alpha_i) - \gamma_i\phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)} - \left(\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}\right)^2\right], & \Theta = \{a < Y_i < b\}; \\[1.5ex] \sigma^2\{1 - \delta(\gamma_i)[\delta(\gamma_i) + \gamma_i]\}, & \Theta = \{Y_i < b\}. \end{cases}$$
Using the distributions defined above, the likelihood function can be found
and maximum likelihood procedures can be employed. Note that the like-
lihood functions will not have a closed-form solution and thus numerical
techniques must be employed to find the estimates of β and σ.
It is also important to underscore the type of estimation method used in
a truncated regression setting. The maximum likelihood estimation method
that we just described will be used when you are interested in a regression
equation that characterizes the entire population, including the observations
that were truncated. If you are interested in characterizing just the subpop-
ulation of observations that were not truncated, then ordinary least squares
can be used. In the context of the example provided at the beginning of
this section, if we regressed wages on years of schooling and were only inter-
ested in the employees who made above the minimum wage, then ordinary
least squares can be used for estimation. However, if we were interested in
all of the employees, including those who happened to be excluded due to
not meeting the minimum wage threshold, then maximum likelihood can be
employed.
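A minimal sketch of this comparison (hypothetical data truncated from below at y = 50) is given below; the truncreg package is one available implementation of the maximum likelihood approach, while lm() characterizes only the non-truncated subpopulation:

##########
library(truncreg)

set.seed(14)
n <- 500
x <- runif(n, 0, 10)
y <- 48 + 2*x + rnorm(n, sd = 5)
keep <- y >= 50                  # observations below 50 are lost entirely

fit.ols   <- lm(y[keep] ~ x[keep])                            # subpopulation only
fit.trunc <- truncreg(y ~ x, data = data.frame(x = x[keep], y = y[keep]),
                      point = 50, direction = "left")         # whole population
summary(fit.trunc)
##########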

20.7 Examples
Example 1: Motor Dataset
This data set of size n = 16 contains observations from a temperature-

accelerated life test for electric motors. The motorettes were tested at four
different temperature levels and when testing terminated, the failure times
were recorded. The data can be found in Table 20.1.

Hours Censor Count Temperature


8064 0 10 150
1764 1 1 170
2772 1 1 170
3444 1 1 170
3542 1 1 170
3780 1 1 170
4860 1 1 170
5196 1 1 170
5448 0 3 170
408 1 2 190
1344 1 2 190
1440 1 1 190
1680 0 5 190
408 1 2 220
504 1 3 220
528 0 5 220

Table 20.1: The motor data set measurements with censoring occurring if a
0 appears in the Censor column.

This data set is actually a very common data set analyzed in survival
analysis texts. We will proceed to fit it with a Weibull survival regression
model. The results from this analysis are
##########
Value Std. Error z p
(Intercept) 17.0671 0.93588 18.24 2.65e-74
count 0.3180 0.15812 2.01 4.43e-02
temp -0.0536 0.00591 -9.07 1.22e-19
Log(scale) -1.2646 0.24485 -5.17 2.40e-07

Scale= 0.282

Weibull distribution
Loglik(model)= -95.6 Loglik(intercept only)= -110.5
Chisq= 29.92 on 2 degrees of freedom, p= 3.2e-07
Number of Newton-Raphson Iterations: 9
n= 16
##########
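Output like the above could be produced by a call of roughly the following form (a sketch: we assume the data of Table 20.1 are stored in a data frame called motor with columns hours, censor, count, and temp):

##########
library(survival)
fit <- survreg(Surv(hours, censor) ~ count + temp, data = motor,
               dist = "weibull")
summary(fit)
##########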

As we can see, the two covariates are statistically significant. Furthermore,


the scale which is estimated (at 0.282) is the scale pertaining to the distribu-
tion being fit for this model (i.e., Weibull). It too is found to be statistically
significant.
While we appear to have a decent fitting model, we will turn to looking at
the deviance residuals. Figure 20.1(a) gives a plot of the deviance residuals
versus ln(time). As you can see, there does appear to be one value with a
deviance residual of almost -3. This value may be cause for concern. Also,
Figure 20.1(b) gives the NPP plot for these residuals and they do appear to
fit along a straight line, with the trend being somewhat impacted by that
residual in question. One could attempt to remove this point and rerun the
analysis, but the overall fit seems to be good and there are no indications in
the study that this was an incorrectly recorded point, so we will leave it in
the analysis.

Example 2: Logical Reasoning Dataset


Suppose that an educational researcher administered a (hypothetical) test
meant to relate one’s logical reasoning with their mathematical skills. n =
100 participants were chosen for this study and the (simulated) data is pro-
vided in Table 20.2. The test consists of a logical reasoning section (where the
participants received a score between 0 and 10) and a mathematical problem
solving section (where the participants receive a score between 0 and 80).
The scores from the mathematics section (y) were regressed on the scores
from the logical reasoning section (x). The researcher was interested in only
those individuals who received a score of 50 or better on the mathematics
section as they would be used for the next portion of the study, so the data
was truncated at y = 50.

Figure 20.1: (a) Plot of the deviance residuals versus log hours. (b) NPP plot
for the deviance residuals.

Figure 20.2 shows the data with different regression fits depending on the
assumptions that are made. The solid black circles are all of the participants
with a score of 50 or better on the mathematics section, while the open red
circles indicate those values that are either truncated (as in Figure 20.2(a))
or censored (as in Figure 20.2(b)). The dark blue line on the left is the
truncated regression line that is estimated using ordinary least squares, so
the interpretation of this line applies only to those data that were not
truncated, which is what the researcher is interested in. The estimated model
is given below:
##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.8473 1.1847 42.921 < 2e-16 ***
x 1.6884 0.1871 9.025 7.84e-14 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 4.439 on 80 degrees of freedom


Multiple R-squared: 0.5045, Adjusted R-squared: 0.4983
F-statistic: 81.46 on 1 and 80 DF, p-value: 7.835e-14
##########
Notice that the estimated regression line never drops below the level of trun-
cation (i.e., y = 50) within the domain of the x variable.


Suppose now that the researcher is interested in all of the data and, say,
there is some problem with recovering those participants that were in the
truncated portion of the sample. Then the truncated regression line can be
estimated via the method of maximum likelihood estimation, which is the
light blue line in Figure 20.2(a). This line can (and will likely) go beyond
the level of truncation since the estimation method is accounting for the
truncation. The estimated model is given below:

##########
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 47.69938 1.89772 25.1351 < 2.2e-16 ***
x 2.09223 0.26979 7.7551 8.882e-15 ***
sigma 4.81855 0.46211 10.4273 < 2.2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Log-Likelihood: -231.21 on 3 Df
##########

This will be helpful for the researcher to say something about the broader
population of tested individuals, beyond just those who received a math score
of 50 or higher. Moreover, notice that both methods yield highly significant
slope and intercept terms, as would be expected given the strong linear trend
in the data.
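A minimal sketch of how these two truncated fits might be obtained in R is
given below. The data frame scores with columns x and y is hypothetical, and
the truncreg package is one of several ways to obtain the maximum likelihood
fit that accounts for the truncation.

##########
library(truncreg)
## OLS fit that only characterizes the non-truncated subpopulation (y >= 50)
ols.trunc <- lm(y ~ x, data = subset(scores, y >= 50))
## maximum likelihood fit that accounts for the truncation at y = 50
mle.trunc <- truncreg(y ~ x, data = subset(scores, y >= 50),
                      point = 50, direction = "left")
summary(ols.trunc)
summary(mle.trunc)
##########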
Figure 20.2(b) shows the estimate obtained from a survival regression fit
assuming normal errors. Suppose the data had inadvertently been censored at
y = 50, so that all of the red open circles now correspond to a solid red
circle in Figure 20.2(b). Since the data is now treated as left-censored, we
are actually fitting a Tobit regression model. The Tobit fit is given by the
green line and the results are given below:

##########
Value Std. Error z p
(Intercept) 52.21 1.0110 51.6 0.0e+00
x 1.50 0.1648 9.1 8.9e-20
Log(scale) 1.46 0.0766 19.0 7.8e-81

Scale= 4.3

Gaussian distribution
Loglik(model)= -241.6 Loglik(intercept only)= -267.1
Chisq= 51.03 on 1 degrees of freedom, p= 9.1e-13
Number of Newton-Raphson Iterations: 5
n= 100
##########
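A hedged sketch of how the Tobit fit above might be obtained with the survival
package is shown below; again, the data frame scores is hypothetical, and
values below 50 are first replaced by the censoring point.

##########
library(survival)
scores$ycens <- pmax(scores$y, 50)        # left-censor the response at 50
tobit.fit <- survreg(Surv(ycens, ycens > 50, type = "left") ~ x,
                     data = scores, dist = "gaussian")
summary(tobit.fit)
##########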

Moreover, the dashed red line in both figures is the ordinary least squares fit
(assuming all of the data values are known and used in the estimation) and
is simply provided for comparative purposes. The estimates for this fit are
given below:

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.2611 0.9484 49.83 <2e-16 ***
x 2.1611 0.1639 13.19 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 4.778 on 98 degrees of freedom


Multiple R-squared: 0.6396, Adjusted R-squared: 0.6359
F-statistic: 173.9 on 1 and 98 DF, p-value: < 2.2e-16
##########

As you can see, the structure of your data and the underlying assumptions can
change your estimates, precisely because you are attempting to estimate
different models. The regression lines in Figure 20.2 are a good example of
how different assumptions can alter the final estimates that you report.



Figure 20.2: (a) A plot of the logical reasoning data. The red circles have
been truncated as they fall below 50. The maximum likelihood fit for the
truncated regression (solid light blue line) and the ordinary least squares fit
for the truncated data set (solid dark blue line) are shown. The ordinary
least squares line (which includes the truncated values for the estimation) is
shown for reference. (b) The logical reasoning data with a Tobit regression
fit provided (solid green line). The data has been censored at 50 (i.e., the
solid red dots are included in the data). Again, the ordinary least squares
line has been provided for reference.


x y x y x y x y
0.00 46.00 2.53 49.95 5.05 64.31 7.58 50.91
0.10 56.38 2.63 51.58 5.15 68.22 7.68 65.51
0.20 45.59 2.73 59.50 5.25 58.39 7.78 61.32
0.30 53.66 2.83 50.84 5.35 58.55 7.88 71.37
0.40 40.05 2.93 55.65 5.45 60.40 7.98 76.97
0.51 46.62 3.03 51.55 5.56 57.10 8.08 56.72
0.61 44.56 3.13 49.16 5.66 58.64 8.18 67.90
0.71 47.20 3.23 58.59 5.76 58.93 8.28 65.30
0.81 57.06 3.33 51.90 5.86 61.30 8.38 61.62
0.91 49.18 3.43 62.95 5.96 60.75 8.48 68.68
1.01 51.06 3.54 57.74 6.06 58.67 8.59 69.43
1.11 51.75 3.64 54.37 6.16 60.67 8.69 64.82
1.21 46.73 3.74 58.21 6.26 59.46 8.79 63.81
1.31 42.04 3.84 55.44 6.36 65.49 8.89 59.27
1.41 48.83 3.94 58.62 6.46 60.96 8.99 62.23
1.52 51.81 4.04 53.63 6.57 57.36 9.09 64.78
1.62 57.35 4.14 43.46 6.67 59.83 9.19 64.88
1.72 49.91 4.24 57.42 6.77 57.40 9.29 72.30
1.82 49.82 4.34 60.64 6.87 62.96 9.39 65.18
1.92 61.53 4.44 50.99 6.97 67.02 9.49 78.35
2.02 47.40 4.55 50.42 7.07 65.93 9.60 64.62
2.12 54.78 4.65 54.68 7.17 63.55 9.70 76.85
2.22 48.94 4.75 54.40 7.27 61.99 9.80 68.57
2.32 55.13 4.85 60.21 7.37 64.48 9.90 61.29
2.42 43.57 4.95 58.70 7.47 62.61 10.00 71.46

Table 20.2: The test scores from n = 100 participants for a logical reasoning
section (x) and a mathematics section (y).



Chapter 21

Nonlinear Regression

All of the models we have discussed thus far have been linear in the parame-
ters (i.e., linear in the beta terms). For example, polynomial regression was
used to model curvature in our data by using higher-ordered values of the
predictors. However, the final regression model was just a linear combination
of higher-ordered predictors.
Now we are interested in studying the nonlinear regression model:

Y = f(X, β) + ε,

where X is a vector of p predictors, β is a vector of k parameters, f(·) is
some known regression function, and ε is an error term whose distribution may
or may not be normal. Notice that we no longer necessarily have the dimension
of the parameter vector simply one greater than the number of predictors.
Some examples of nonlinear regression models are:

yi = e^{β0 + β1 xi} / (1 + e^{β0 + β1 xi}) + εi
yi = (β0 + β1 xi) / (1 + β2 e^{β3 xi}) + εi
yi = β0 + (0.4 − β0) e^{−β1 (xi − 5)} + εi.

However, there are some nonlinear models which are actually called
intrinsically linear because they can be made linear in the parameters by a
simple transformation. For example:

Y = β0 X / (β1 + X)

can be rewritten as

1/Y = 1/β0 + (β1/β0) X = θ0 + θ1 X,

which is linear in the transformed parameters θ0 = 1/β0 and θ1 = β1/β0. In
such cases, transforming a model to its linear form often provides better
inference procedures and confidence intervals, but one must be cognizant of
the effects that the transformation has on the distribution of the errors.
We will discuss some of the basics of fitting and inference with nonlinear
regression models. There is a great deal of theory, practice, and computing
associated with nonlinear regression and we will only get to scratch the sur-
face of this topic. We will then turn to a few specific regression models and
discuss generalized linear models.

21.1 Nonlinear Least Squares


We initially consider the setting

Yi = f(Xi, β) + εi,

where the εi are iid normal with mean 0 and constant variance σ². For this
setting, we can rely on some of the least squares theory we have developed
over the course. For other nonnormal error terms, different techniques need
to be employed.
First, let

Q = Σ_{i=1}^{n} (yi − f(Xi, β))².

In order to find

β̂ = argmin_β Q,

we first find each of the partial derivatives of Q with respect to βj:

∂Q/∂βj = −2 Σ_{i=1}^{n} [yi − f(Xi, β)] (∂f(Xi, β)/∂βj).


Then, we set each of the above partial derivatives equal to 0 and the
parameters βk are each replaced by β̂k. This yields:

Σ_{i=1}^{n} yi (∂f(Xi, β)/∂βk)|_{β=β̂} − Σ_{i=1}^{n} f(Xi, β̂) (∂f(Xi, β)/∂βk)|_{β=β̂} = 0,

for k = 0, 1, . . . , p − 1.
The solutions to these equations are nonlinear in the parameter estimates β̂k
and are often difficult to obtain, even in the simplest cases. Hence, iterative
numerical methods are often employed. Even more difficulty arises in that
multiple solutions may be possible!

21.1.1 A Few Algorithms


We will discuss a few incarnations of methods used in nonlinear least squares
estimation. It should be noted that this is NOT an exhaustive list of al-
gorithms, but rather an introduction to some of the more commonly imple-
mented algorithms.
First let us introduce some notation used in these algorithms:
• Since these numerical algorithms are iterative, let β̂^(t) be the estimated
value of β at time t. When t = 0, this symbolizes a user-specified starting
value for the algorithm.

• Let ε = (ε1, . . . , εn)^T = (y1 − f(X1, β), . . . , yn − f(Xn, β))^T be the
n-dimensional vector of error terms, and e is again the residual vector.

• Let

∇Q(β) = ∂Q(β)/∂β^T = (∂‖ε‖²/∂β1, . . . , ∂‖ε‖²/∂βk)^T

be the gradient of the sum of squared errors, where Q(β) = ‖ε‖² =
Σ_{i=1}^{n} εi² is the sum of squared errors.


• Let J(β) be the n × k Jacobian matrix whose (i, j)th entry is ∂εi/∂βj, for
i = 1, . . . , n and j = 1, . . . , k.


• Let

H(β) = ∂²Q(β)/(∂β^T ∂β)

be the k × k Hessian matrix, whose (i, j)th entry is ∂²‖ε‖²/(∂βi ∂βj) (i.e.,
the matrix of mixed partial derivatives and second-order derivatives).
• In the following algorithms, we will use the notation established above for
∇Q(β̂^(t)) = ∇Q(β)|_{β=β̂^(t)}, H(β̂^(t)) = H(β)|_{β=β̂^(t)}, and
J(β̂^(t)) = J(β)|_{β=β̂^(t)}.

The classical method based on the gradient approach is Newton’s method, which
starts at β̂^(0) and iteratively calculates

β̂^(t+1) = β̂^(t) − [H(β̂^(t))]^{−1} ∇Q(β̂^(t))

until a convergence criterion is achieved. The difficulty with this approach
is that inversion of the Hessian matrix can be computationally difficult. In
particular, the Hessian is not always positive definite unless the algorithm
is initialized with a good starting value, which may be difficult to find.
A modification to Newton’s method is the Gauss-Newton algorithm,
which, unlike Newton’s method, can only be used to minimize a sum of
squares function. The advantage with using the Gauss-Newton algorithm
is that it no longer requires calculation of the Hessian matrix, but rather
approximates it using the Jacobian. The gradient and approximation to the
Hessian matrix can be written as

∇Q(β) = 2 J(β)^T ε   and   H(β) ≈ 2 J(β)^T J(β).


Thus, the iterative approximation based on the Gauss-Newton method yields

β̂^(t+1) = β̂^(t) − δ(β̂^(t))
        = β̂^(t) − [J(β̂^(t))^T J(β̂^(t))]^{−1} J(β̂^(t))^T e,

where we have defined δ(β̂^(t)) to be everything that is subtracted from β̂^(t).
Convergence is not always guaranteed with the Gauss-Newton algorithm. Since
the steps for this method may be too large (thus leading to divergence), one
can incorporate a partial step by using

β̂^(t+1) = β̂^(t) − α δ(β̂^(t))

such that 0 < α < 1. However, if α is close to 0, an alternative method is
the Levenberg-Marquardt method, which calculates

δ(β̂^(t)) = (J(β̂^(t))^T J(β̂^(t)) + λD)^{−1} J(β̂^(t))^T e,

where D is a positive diagonal matrix (often taken as the identity matrix)
and λ is the so-called Marquardt parameter. The above is optimized over λ,
which limits the length of the step taken at each iteration and improves the
conditioning of an ill-conditioned Hessian matrix.
For these algorithms, you will want to try the easiest one to calculate
for a given nonlinear problem. Ideally, you would like to be able to use
the algorithms in the order they were presented. Newton’s method will give
you an accurate estimate if the Hessian is not ill-conditioned. The Gauss-
Newton will give you a good approximation to the solution Newton’s method
should have arrived at, but convergence is not always guaranteed. Finally,
the Levenberg-Marquardt method can take care of computational difficulties
arising with the other methods, but searching for λ can be tedious.
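As a concrete illustration, the following sketch codes a single Gauss-Newton
update for a simple exponential-decay model. The model, the function name, and
the variable names are purely illustrative; in practice one would typically
rely on a routine such as nls(), which implements this algorithm.

##########
## one Gauss-Newton update for the illustrative model f(x, beta) = beta1 * exp(-beta2 * x)
gauss.newton.step <- function(beta, x, y) {
  f <- beta[1] * exp(-beta[2] * x)              # f(x_i, beta)
  e <- y - f                                    # residual vector
  ## Jacobian of the errors e_i = y_i - f(x_i, beta) with respect to beta
  J <- cbind(-exp(-beta[2] * x),                # d e_i / d beta1
             beta[1] * x * exp(-beta[2] * x))   # d e_i / d beta2
  delta <- solve(t(J) %*% J, t(J) %*% e)        # [J'J]^{-1} J'e
  beta - as.vector(delta)                       # beta^(t+1) = beta^(t) - delta
}
##########

Iterating this function from a starting value until the change in the
estimates is negligible mimics the Gauss-Newton iterations described above.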

21.2 Exponential Regression


One simple nonlinear model is the exponential regression model

yi = β0 + β1 e^{β2 xi,1 + . . . + βp+1 xi,p} + εi,

where the εi are iid normal with mean 0 and constant variance σ². Notice that
if β0 = 0, then the above is intrinsically linear by taking the natural
logarithm of both sides.


Exponential regression is probably one of the simplest nonlinear regression
models. An example where an exponential regression is often utilized is when
relating the concentration of a substance (the response) to elapsed time (the
predictor).
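A minimal sketch of fitting such a model with nls() is given below; the
simulated concentration-versus-time data and all variable names are
illustrative.

##########
set.seed(1)
time <- seq(0, 10, length.out = 40)
conc <- 2 + 8 * exp(-0.5 * time) + rnorm(40, sd = 0.3)   # simulated decay data
exp.fit <- nls(conc ~ b0 + b1 * exp(b2 * time),
               start = list(b0 = 1, b1 = 5, b2 = -0.3))
summary(exp.fit)
##########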

21.3 Logistic Regression


Logistic regression models a relationship between predictor variables and
a categorical response variable. For example, you could use logistic regression
to model the relationship between various measurements of a manufactured
specimen (such as dimensions and chemical composition) to predict if a crack
greater than 10 mils will occur (a binary variable: either yes or no). Logistic
regression helps us estimate a probability of falling into a certain level of
the categorical response given a set of predictors. You can choose from
three types of logistic regression, depending on the nature of your categorical
response variable:

Binary Logistic Regression: Used when the response is binary (i.e., it


has two possible outcomes). The cracking example given above would
utilize binary logistic regression. Other examples of binary responses
could include passing or failing a test, responding yes or no on a survey,
and having high or low blood pressure.

Nominal Logistic Regression: Used when there are three or more cat-
egories with no natural ordering to the levels. Examples of nominal
responses could include departments at a business (e.g., marketing,
sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN),
and color (black, red, blue, orange).

Ordinal Logistic Regression: Used when there are three or more cate-
gories with a natural ordering to the levels, but the ranking of the
levels do not necessarily mean the intervals between them are equal.
Examples of ordinal responses could be how you rate the effectiveness
of a college course on a scale of 1-5, levels of flavors for hot wings, and
medical condition (e.g., good, stable, serious, critical).

The problems with logistic regression include nonnormal error terms,
nonconstant error variance, and constraints on the response function (i.e.,
the response is bounded between 0 and 1). We will investigate ways of dealing
with these in the logistic regression setting.

21.3.1 Binary Logistic Regression


The multiple binary logistic regression model is the following:

π = e^{β0 + β1 X1 + . . . + βp−1 Xp−1} / (1 + e^{β0 + β1 X1 + . . . + βp−1 Xp−1})
  = e^{X^T β} / (1 + e^{X^T β}),                                        (21.1)
where here π denotes a probability and not the irrational number.

• π is the probability that an observation is in a specified category of the


binary Y variable.

• Notice that the model describes the probability of an event happening


as a function of X variables. For instance, it might provide estimates
of the probability that an older person has heart disease.

• With the logistic model, estimates of π from equation (21.1) will always be
between 0 and 1. The reasons are:

– The numerator e^{β0 + β1 X1 + . . . + βp−1 Xp−1} must be positive, because
it is a power of a positive value (e).
– The denominator of the model is (1 + numerator), so the answer will always
be less than 1.

• With one X variable, the theoretical model for π has an elongated “S”
shape (or sigmoidal shape) with asymptotes at 0 and 1, although in
sample estimates we may not see this “S” shape if the range of the X
variable is limited.

There are algebraically equivalent ways to write the logistic regression
model in equation (21.1):

• First is

π/(1 − π) = e^{β0 + β1 X1 + . . . + βp−1 Xp−1},                          (21.2)


which is an equation that describes the odds of being in the current
category of interest. By definition, the odds for an event is P/(1 − P),
where P is the probability of the event. For example, if you are at the
racetrack and there is an 80% chance that a certain horse will win the race,
then the odds are 0.80/(1 − 0.80) = 4, or 4:1.

• Second is
 
π
log = β0 + β1 X1 + . . . + βp−1 Xp−1 , (21.3)
1−π
which states that the logarithm of the odds is a linear function of the
X variables (and is often called the log odds).
In order to discuss goodness-of-fit measures and residual diagnostics for
binary logistic regression, it is necessary to at least define the likelihood (see
Appendix C for a further discussion). For a sample of size n, the likelihood
for a binary logistic regression is given by:
L(β; y, X) = Π_{i=1}^{n} πi^{yi} (1 − πi)^{1−yi}
           = Π_{i=1}^{n} [e^{Xi^T β} / (1 + e^{Xi^T β})]^{yi} [1 / (1 + e^{Xi^T β})]^{1−yi}.

This yields the log likelihood:

ℓ(β) = Σ_{i=1}^{n} [yi Xi^T β − log(1 + e^{Xi^T β})].

Maximizing the likelihood (or log likelihood) has no closed-form solution, so
a technique like iteratively reweighted least squares is used to find an
estimate of the regression coefficients, β̂. Once this value of β̂ has been
obtained, we may proceed to define various goodness-of-fit measures and
residuals. The residuals we present serve the same purpose as in linear
regression: when plotted versus the response, they will help identify suspect
data points. It should also be noted that the following is by no means an
exhaustive list of diagnostic procedures, but rather some of the more common
methods which are used.
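Before defining those measures, here is a hedged sketch of how such a model
and its residual diagnostics might be obtained with glm(); the data frame dat
and its columns (a 0/1 response y and predictors x1, x2) are hypothetical.

##########
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
summary(fit)                                  # Wald tests for the coefficients
pi.hat <- fitted(fit)                         # estimated probabilities
r.raw  <- residuals(fit, type = "response")   # raw residuals y - pi.hat
r.pear <- residuals(fit, type = "pearson")    # Pearson residuals
r.dev  <- residuals(fit, type = "deviance")   # deviance residuals
h      <- hatvalues(fit)                      # hat values
##########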


Odds Ratio
The odds ratio (which we will write as θ) determines the relationship between
a predictor and response and is available only when the logit link is used.
The odds ratio can be any nonnegative number. An odds ratio of 1 serves
as the baseline for comparison and indicates there is no association between
the response and predictor. If the odds ratio is greater than 1, then the odds
of success are higher for the reference level of the factor (or for higher levels
of a continuous predictor). If the odds ratio is less than 1, then the odds of
success are less for the reference level of the factor (or for higher levels of
a continuous predictor). Values farther from 1 represent stronger degrees of
association. For binary logistic regression, the odds of success are:
π/(1 − π) = e^{X^T β}.

This exponential relationship provides an interpretation for β. The odds
increase multiplicatively by e^{βj} for every one-unit increase in Xj. More
formally, the odds ratio between two sets of predictors (say X(1) and X(2)) is
given by

θ = (π/(1 − π))|_{X=X(1)} / (π/(1 − π))|_{X=X(2)}.

Wald Test
The Wald test is the test of significance for regression coefficients in logistic
regression (recall that we use t-tests in linear regression). For maximum
likelihood estimates, the ratio

Z = β̂i / s.e.(β̂i)
can be used to test H0 : βi = 0. The standard normal curve is used to
determine the p-value of the test. Furthermore, confidence intervals can be
constructed as
β̂i ± z1−α/2 s.e.(β̂i ).

Raw Residual
The raw residual is the difference between the actual response and the
estimated probability from the model. The formula for the raw residual is
ri = yi − π̂i .


Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals
by dividing by the standard deviation. The formula for the Pearson residuals
is
pi = ri / √(π̂i (1 − π̂i)).

Deviance Residuals
Deviance residuals are also popular because the sum of squares of these
residuals is the deviance statistic. The formula for the deviance residual is
di = ± √( 2 [ yi log(yi/π̂i) + (1 − yi) log((1 − yi)/(1 − π̂i)) ] ).

Hat Values
The hat matrix serves a similar purpose as in the case of linear regression -
to measure the influence of each observation on the overall fit of the model -
but the interpretation is not as clear due to its more complicated form. The
hat values are given by

hi,i = π̂i (1 − π̂i) xi^T (X^T W X)^{−1} xi,

where W is an n × n diagonal matrix with the values of π̂i (1 − π̂i) for
i = 1, . . . , n on the diagonal. As before, a hat value is considered large
if hi,i > 2p/n.

Studentized Residuals
We can also report Studentized versions of some of the earlier residuals. The
Studentized Pearson residuals are given by
spi = pi / √(1 − hi,i)

and the Studentized deviance residuals are given by

sdi = di / √(1 − hi,i).

C and C̄
C and C̄ are extensions of Cook’s distance for logistic regression. C̄ measures
the overall change in fitted logits due to deleting the ith observation for
all points excluding the one deleted, while C includes the deleted point. They
are defined by:

Ci = pi² hi,i / (1 − hi,i)²

and

C̄i = pi² hi,i / (1 − hi,i).

Goodness-of-Fit Tests
Overall performance of the fitted model can be measured by two different
chi-square tests. There is the Pearson chi-square statistic
P = Σ_{i=1}^{n} pi²

and the deviance statistic

G = Σ_{i=1}^{n} di².

Both of these statistics are approximately chi-square distributed with n − p
degrees of freedom. When a test is rejected, there is a statistically
significant lack of fit. Otherwise, there is no evidence of lack of fit.
These goodness-of-fit tests are analogous to the F -test in the analysis of
variance table for ordinary regression. The null hypothesis is
H0 : β1 = β2 = . . . = βk−1 = 0.
A significant p-value means that at least one of the X variables is a predictor
of the probabilities of interest.
In general, one can also use the likelihood ratio test for testing the
null hypothesis that any subset of the β’s is equal to 0. Suppose we test that
r < p of the β’s are equal to 0. Then the likelihood ratio test statistic is
given by:

Λ* = −2(ℓ(β̂^(0)) − ℓ(β̂)),

where ℓ(β̂^(0)) is the log likelihood of the model specified by the null
hypothesis, evaluated at the maximum likelihood estimate of that reduced
model. This test statistic has a χ² distribution with p − r degrees of
freedom.


One additional test is Brown’s test, which has a test statistic to judge
the fit of the logistic model to the data. The formula for the general alter-
native with two degrees of freedom is:

T = s^T C^{−1} s,

where s^T = (s1, s2) and C is the covariance matrix of s. The formulas for s1
and s2 are:

s1 = Σ_{i=1}^{n} (yi − π̂i)(1 + log(π̂i)/(1 − π̂i))
s2 = Σ_{i=1}^{n} (yi − π̂i)(1 + log(1 − π̂i)/π̂i).

The formula for the symmetric alternative with 1 degree of freedom is:

(s1 + s2)² / Var(s1 + s2).
To interpret the test, if the p-value is less than your accepted significance
level, then reject the null hypothesis that the model fits the data adequately.

DFDEV and DFCHI


DFDEV and DFCHI are statistics that measure the change in deviance
and in Pearson’s chi-square, respectively, that occurs when an observation
is deleted from the data set. Large values of these statistics indicate ob-
servations that have not been fitted well. The formulas for these statistics
are
DFDEVi = di² + C̄i

and

DFCHIi = C̄i / hi,i.

R2
The calculation of R2 used in linear regression does not extend directly to
logistic regression. The version of R2 used in logistic regression is defined
as

R2 = (ℓ(β̂) − ℓ(β̂0)) / (ℓS(β) − ℓ(β̂0)),

where ℓ(β̂0) is the log likelihood of the model when only the intercept is
included and ℓS(β) is the log likelihood of the saturated model (i.e., where a
model is fit perfectly to the data). This R2 does go from 0 to 1, with 1 being
a perfect fit.

21.3.2 Nominal Logistic Regression


In binomial logistic regression, we only had two possible outcomes. For nom-
inal logistic regression, we will consider the possibility of having k possible
outcomes. When k > 2, such responses are known as polytomous.1 The
multiple nominal logistic regression model (sometimes called the multino-
mial logistic regression model) is given by the following:

πj = e^{X^T βj} / (1 + Σ_{l=2}^{k} e^{X^T βl})   for j = 2, . . . , k, and
π1 = 1 / (1 + Σ_{l=2}^{k} e^{X^T βl})            for j = 1,             (21.4)

where again πj denotes a probability and not the irrational number. Notice
that k − 1 of the groups have their own set of β values. Furthermore, since
Σ_{j=1}^{k} πj = 1, we set the β values for group 1 to be 0 (this is what we
call the reference group). Notice that when k = 2, we are back to binary
logistic regression.
πj is the probability that an observation is in one of k categories. The
likelihood for the nominal logistic regression model is given by:

L(β; y, X) = Π_{i=1}^{n} Π_{j=1}^{k} πi,j^{yi,j} (1 − πi,j)^{1−yi,j},

where the subscript (i, j) means the ith observation belongs to the jth group.
This yields the log likelihood:

ℓ(β) = Σ_{i=1}^{n} Σ_{j=1}^{k} [yi,j log(πi,j) + (1 − yi,j) log(1 − πi,j)].

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a


technique like iteratively reweighted least squares is used to find an estimate
of the regression coefficients, β̂.
1
The word polychotomous is sometimes used, but note that this is not actually a word!


An odds ratio (θ) of 1 serves as the baseline for comparison. If θ = 1,


then there is no association between the response and predictor. If θ > 1,
then the odds of success are higher for the reference level of the factor (or for
higher levels of a continuous predictor). If θ < 1, then the odds of success are
less for the reference level of the factor (or for higher levels of a continuous
predictor). Values farther from 1 represent stronger degrees of association.
For nominal logistic regression, the odds ratio (comparing two different
levels of the predictors, say X(1) and X(2)) is:

θ = (πj/π1)|_{X=X(1)} / (πj/π1)|_{X=X(2)}.

Many of the procedures discussed in binary logistic regression can be


extended to nominal logistic regression with the appropriate modifications.
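As a hedged sketch, one common way to fit such a model in R is
nnet::multinom(); the data frame dat, the unordered factor response dept, and
the predictor x are hypothetical.

##########
library(nnet)
nom.fit <- multinom(dept ~ x, data = dat)   # first factor level acts as the reference group
summary(nom.fit)
exp(coef(nom.fit))                          # odds ratios relative to the reference level
##########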

21.3.3 Ordinal Logistic Regression


For ordinal logistic regression, we again consider k possible outcomes as in
nominal logistic regression, except that the order matters. The multiple
ordinal logistic regression model is the following:
Σ_{j=1}^{k*} πj = e^{β0,k* + X^T β} / (1 + e^{β0,k* + X^T β}),          (21.5)

such that k* ≤ k, π1 ≤ π2 ≤ . . . ≤ πk, and again πj denotes a probability.
Notice that this model is a cumulative sum of probabilities which involves
just changing the intercept of the linear regression portion (so β is now
(p − 1)-dimensional and X is n × (p − 1) such that the first column of this
matrix is not a column of 1’s). Also, it still holds that Σ_{j=1}^{k} πj = 1.
πj is still the probability that an observation is in one of k categories, but
we are constrained by the model written in equation (21.5). The likelihood
for the ordinal logistic regression model is given by:
L(β; y, X) = Π_{i=1}^{n} Π_{j=1}^{k} πi,j^{yi,j} (1 − πi,j)^{1−yi,j},

where the subscript (i, j) means the ith observation belongs to the j th group.


This yields the log likelihood:

ℓ(β) = Σ_{i=1}^{n} Σ_{j=1}^{k} [yi,j log(πi,j) + (1 − yi,j) log(1 − πi,j)].

Notice that this is identical to the nominal logistic regression likelihood.


Thus, maximization again has no closed-form solution, so we defer to a pro-
cedure like iteratively reweighted least squares.
For ordinal logistic regression, a proportional odds model is used to de-
termine the odds ratio. Again, an odds ratio (θ) of 1 serves as the baseline
for comparison between the two predictor levels, say X(1) and X(2) . Only one
parameter and one odds ratio is calculated for each predictor. Suppose we
are interested in calculating the odds of X(1) to X(2) . If θ = 1, then there is
no association between the response and these two predictors. If θ > 1, then
the odds of success are higher for the predictor X(1) . If θ < 1, then the odds
of success are less for the predictor X(1) . Values farther from 1 represent
stronger degrees of association. For ordinal logistic regression, the odds ratio
utilizes cumulative probabilities and their complements and is given by:
Pk∗ Pk∗
j=1 πj |X=X(1) /(1 − j=1 πj )|X=X(1)
θ= Pk∗ Pk∗ .
j=1 πj |X=X(2) /(1 j=1 πj )|X=X(2)
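A hedged sketch of fitting a proportional odds model with MASS::polr() is
shown below; the data frame dat, the ordered response grade, and the predictor
x are hypothetical.

##########
library(MASS)
dat$grade <- factor(dat$grade, ordered = TRUE)   # response must be an ordered factor
ord.fit <- polr(grade ~ x, data = dat, method = "logistic")
summary(ord.fit)
exp(coef(ord.fit))                               # proportional odds ratio for x
##########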

21.4 Poisson Regression


The Poisson distribution for a random variable X has the following proba-
bility mass function for a given value X = x:

P(X = x | λ) = e^{−λ} λ^x / x!,

for x = 0, 1, 2, . . .. Notice that the Poisson distribution is characterized
by the single parameter λ, which is the mean rate of occurrence for the event
being measured. For the Poisson distribution, it is assumed that large counts
(with respect to the value of λ) are rare.
Poisson regression is similar to logistic regression in that the dependent
variable (Y ) is a categorical response. Specifically, Y is an observed count
that follows the Poisson distribution, but the rate λ is now determined by


a set of p predictors X = (X1 , . . . , Xp )T . The expression relating these


quantities is
λ = exp{XT β}.
Thus, the fundamental Poisson regression model for observation i is given by

P(Yi = yi | Xi, β) = e^{−exp{Xi^T β}} (exp{Xi^T β})^{yi} / yi!.
That is, for a given set of predictors, the categorical outcome follows a Poisson
distribution with rate exp{XT β}.
In order to discuss goodness-of-fit measures and residual diagnostics for
Poisson regression, it is necessary to at least define the likelihood. For a
sample of size n, the likelihood for a Poisson regression is given by:
L(β; y, X) = Π_{i=1}^{n} e^{−exp{Xi^T β}} (exp{Xi^T β})^{yi} / yi!.
This yields the log likelihood:

ℓ(β) = Σ_{i=1}^{n} yi Xi^T β − Σ_{i=1}^{n} exp{Xi^T β} − Σ_{i=1}^{n} log(yi!).

Maximizing the likelihood (or log likelihood) has no closed-form solution, so
a technique like iteratively reweighted least squares is used to find an
estimate of the regression coefficients, β̂. Once this value of β̂ has been
obtained, we may proceed to define various goodness-of-fit measures and
residuals. The residuals we present serve the same purpose as in linear
regression: when plotted versus the response, they will help identify suspect
data points.
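As with binary logistic regression, a hedged sketch of obtaining such a fit
and its diagnostics with glm() follows; the data frame dat with a count
response y and a predictor x is hypothetical.

##########
fit <- glm(y ~ x, family = poisson(link = "log"), data = dat)
summary(fit)
r.pear <- residuals(fit, type = "pearson")    # Pearson residuals
r.dev  <- residuals(fit, type = "deviance")   # deviance residuals
h      <- hatvalues(fit)                      # hat values
phi    <- sum(r.pear^2) / df.residual(fit)    # rough dispersion estimate (overdispersion check)
##########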

Goodness-of-Fit
Overall performance of the fitted model can be measured by two different
chi-square tests. There is the Pearson statistic

P = Σ_{i=1}^{n} (yi − exp{Xi^T β̂})² / exp{Xi^T β̂}

and the deviance statistic

G = 2 Σ_{i=1}^{n} [ yi log(yi / exp{Xi^T β̂}) − (yi − exp{Xi^T β̂}) ].


Both of these statistics are approximately chi-square distributed with n − p


degrees of freedom. When a test is rejected, there is a statistically significant
lack of fit. Otherwise, there is no evidence of lack of fit.
Overdispersion means that the actual covariance matrix for the ob-
served data exceeds that for the specified model for Y |X. For a Poisson
distribution, the mean and the variance are equal. In practice, the data
almost never reflects this fact. So we have overdispersion in the Poisson re-
gression model since the variance is oftentimes greater than the mean. In
addition to testing goodness-of-fit, the Pearson statistic can also be used as
a test of overdispersion. Note that overdispersion can also be measured in
the logistic regression models that were discussed earlier.

Deviance
Recall the measure of deviance introduced in the study of survival regressions
and logistic regression. The measure of deviance for the Poisson regression
setting is given by
D(y, β̂) = 2(ℓS(β) − ℓ(β̂)),
where `S (β) is the log likelihood of the saturated model (i.e., where a model
is fit perfectly to the data). This measure of deviance (which differs from the
deviance statistic defined earlier) is a generalization of the sum of squares
from linear regression. The deviance also has an approximate chi-square
distribution.

Pseudo R2
The value of R2 used in linear regression also does not extend to Poisson
regression. One commonly used measure is the pseudo R2 , defined as

R2 = (ℓ(β̂) − ℓ(β̂0)) / (ℓS(β) − ℓ(β̂0)),

where `(βˆ0 ) is the log likelihood of the model when only the intercept is
included. The pseudo R2 goes from 0 to 1 with 1 being a perfect fit.

Raw Residual
The raw residual is the difference between the actual response and the
estimated value from the model. Remember that the variance is equal to the
mean for a Poisson random variable. Therefore, we expect that the variances


of the residuals are unequal. This can lead to difficulties in the interpretation
of the raw residuals, yet it is still used. The formula for the raw residual is
ri = yi − exp{Xi^T β̂}.

Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals
by dividing by the standard deviation. The formula for the Pearson residuals
is

pi = ri / √( φ̂ exp{Xi^T β̂} ),

where

φ̂ = (1/(n − p)) Σ_{i=1}^{n} (yi − exp{Xi^T β̂})² / exp{Xi^T β̂}.

φ̂ is a dispersion parameter to help control overdispersion.

Deviance Residuals
Deviance residuals are also popular because the sum of squares of these
residuals is the deviance statistic. The formula for the deviance residual is
di = sgn(yi − exp{Xi^T β̂}) √( 2 [ yi log(yi / exp{Xi^T β̂}) − (yi − exp{Xi^T β̂}) ] ).

Hat Values
The hat matrix serves the same purpose as in the case of linear regression -
to measure the influence of each observation on the overall fit of the model.
The hat values, hi,i, are the diagonal entries of the hat matrix

H = W^{1/2} X (X^T W X)^{−1} X^T W^{1/2},

where W is an n × n diagonal matrix with the values of exp{Xi^T β̂} on the
diagonal. As before, a hat value is considered large if hi,i > 2p/n.

Studentized Residuals
Finally, we can also report Studentized versions of some of the earlier resid-
uals. The Studentized Pearson residuals are given by
spi = pi / √(1 − hi,i)

and the Studentized deviance residuals are given by

sdi = di / √(1 − hi,i).

21.5 Generalized Linear Models


All of the regression models we have considered (both linear and nonlinear)
actually belong to a family of models called generalized linear models.
Generalized linear models provides a generalization of ordinary least squares
regression that relates the random term (the response Y ) to the system-
atic term (the linear predictor XT β) via a link function (denoted by g(·)).
Specifically, we have the relation

E(Y ) = µ = g −1 (XT β),

so g(µ) = XT β. Some common link functions are:

• The identity link:


g(µ) = µ = XT β,
which is used in traditional linear regression.

• The logit link:


 
g(µ) = log(µ/(1 − µ)) = X^T β
⇒ µ = e^{X^T β} / (1 + e^{X^T β}),
which is used in logistic regression.

• The log link:

g(µ) = log(µ) = X^T β
⇒ µ = e^{X^T β},

which is used in Poisson regression.


• The probit link:

g(µ) = Φ−1 (µ) = XT β


⇒ µ = Φ(XT β),

where Φ(·) is the cumulative distribution function of the standard nor-


mal distribution. This link function is also sometimes called the nor-
mit link. This also can be used in logistic regression.

• The complementary log-log link:

g(µ) = log(−log(1 − µ)) = X^T β
⇒ µ = 1 − exp{−e^{X^T β}},

which can also be used in logistic regression. This link function is also
sometimes called the gompit link.

• The power link:

g(µ) = µλ = XT β
⇒ µ = (XT β)1/λ ,

where λ 6= 0. This is used in other regressions which we do not explore


(such as gamma regression and inverse Gaussian regression).
Also, the variance is typically a function of the mean and is often written as

Var(Y ) = V (µ) = V (g −1 (XT β)).
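The following hedged sketch shows how several of the links above might be
requested through the family argument of glm(); the data frame dat and its
columns (a 0/1 response y01, a count response counts, a continuous response y,
and a predictor x) are hypothetical.

##########
fit.logit   <- glm(y01 ~ x,    family = binomial(link = "logit"),    data = dat)
fit.probit  <- glm(y01 ~ x,    family = binomial(link = "probit"),   data = dat)
fit.cloglog <- glm(y01 ~ x,    family = binomial(link = "cloglog"),  data = dat)
fit.poisson <- glm(counts ~ x, family = poisson(link = "log"),       data = dat)
fit.normal  <- glm(y ~ x,      family = gaussian(link = "identity"), data = dat)
##########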

The random variable Y is assumed to belong to an exponential family


distribution where the density can be expressed in the form
 
q(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) },
where a(·), b(·), and c(·) are specified functions, θ is a parameter related
to the mean of the distribution, and φ is called the dispersion parame-
ter. Many probability distributions belong to the exponential family. For
example, the normal distribution is used for traditional linear regression, the
binomial distribution is used for logistic regression, and the Poisson distrib-
ution is used for Poisson regression. Other exponential family distributions


lead to gamma regression, inverse Gaussian (normal) regression, and negative


binomial regression, just to name a few.
The unknown parameters, β, are typically estimated with maximum like-
lihood techniques (in particular, using iteratively reweighted least squares),
Bayesian methods (which we will touch on in the advanced topics section),
or quasi-likelihood methods. The quasi-likelihood is a function which pos-
sesses similar properties to the log-likelihood function and is most often used
with count or binary data. Specifically, for a realization y of the random
variable Y , it is defined as
Q(µ; y) = ∫_y^µ (y − t) / (σ² V(t)) dt,

where σ 2 is a scale parameter. There are also tests using likelihood ratio sta-
tistics for model development to determine if any predictors may be dropped
from the model.

21.6 Examples
Example 1: Nonlinear Regression Example
A simple model for population growth towards an asymptote is the logistic
model
yi = β1 / (1 + e^{β2 + β3 xi}) + εi,
where yi is the population size at time xi , β1 is the asymptote towards which
the population grows, β2 reflects the size of the population at time x =
0 (relative to its asymptotic size), and β3 controls the growth rate of the
population.
We fit this model to Census population data for the United States (in
millions) ranging from 1790 through 1990 (see Table 21.1). The data are
graphed in Figure 21.1(a) and the line represents the fit of the logistic pop-
ulation growth model.
To fit the logistic model to the U. S. Census data, we need starting values
for the parameters. It is often important in nonlinear least squares estimation
to choose reasonable starting values, which generally requires some insight
into the structure of the model. We know that β1 represents asymptotic
population. The data in Figure 21.1(a) show that in 1990 the U. S. population
stood at about 250 million and did not appear to be close to an asymptote;


year population
1790 3.929
1800 5.308
1810 7.240
1820 9.638
1830 12.866
1840 17.069
1850 23.192
1860 31.443
1870 39.818
1880 50.156
1890 62.948
1900 75.995
1910 91.972
1920 105.711
1930 122.775
1940 131.669
1950 150.697
1960 179.323
1970 203.302
1980 226.542
1990 248.710

Table 21.1: The U.S. Census data.

so as not to extrapolate too far beyond the data, let us set the starting value
of β1 to 350. It is convenient to scale time so that x1 = 0 in 1790, and so
that the unit of time is 10 years. Then substituting β1 = 350 and x = 0 into
the model, using the value y1 = 3.929 from the data, and assuming that the
error is 0, we have

3.929 = 350 / (1 + e^{β2 + β3 (0)}).


Solving for β2 gives us a plausible start value for this parameter:

e^{β2} = 350/3.929 − 1
β2 = log(350/3.929 − 1) ≈ 4.5.

Finally, returning to the data, at time x = 1 (i.e., at the second Census


performed in 1800), the population was y2 = 5.308. Using this value, along
with the previously determined start values for β1 and β2 , and again setting
the error to 0, we have
5.308 = 350 / (1 + e^{4.5 + β3 (1)}).

Solving for β3 we get

e^{4.5 + β3} = 350/5.308 − 1
β3 = log(350/5.308 − 1) − 4.5 ≈ −0.3.

So now we have starting values for the nonlinear least squares algorithm
that we use. Below is the output from running a Gauss-Newton algorithm
for optimization. As you can see, the starting values resulted in convergence
with values not too far from our guess.
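A hedged sketch of the nls() call that could produce output of the form shown
below is given here; the data frame census, with columns population and time
(where time = (year − 1790)/10), is an assumption made for illustration.

##########
census.fit <- nls(population ~ beta1 / (1 + exp(beta2 + beta3 * time)),
                  data = census,
                  start = list(beta1 = 350, beta2 = 4.5, beta3 = -0.3))
summary(census.fit)
##########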

##########
Formula: population ~ beta1/(1 + exp(beta2 + beta3 * time))

Parameters:
Estimate Std. Error t value Pr(>|t|)
beta1 389.16551 30.81197 12.63 2.20e-10 ***
beta2 3.99035 0.07032 56.74 < 2e-16 ***
beta3 -0.22662 0.01086 -20.87 4.60e-14 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 4.45 on 18 degrees of freedom

Number of iterations to convergence: 6
Achieved convergence tolerance: 1.492e-06
##########

Figure 21.1: (a) Plot of the Census data with the logistic functional fit. (b)
Plot of the residuals versus the year.

Figure 21.1(b) is a plot of the residuals versus the year. As you can see,
the logistic functional form that we chose did catch the gross characteristics
of this data, but some of the nuances appear to not be as well characterized.
Since there are indications of some cyclical behavior, a model incorporating
correlated errors or, perhaps, trigonometric functions could be investigated.

Example 2: Binary Logistic Regression Example


We will first perform a binary logistic regression analysis. The data set we
will use is data published on n = 27 leukemia patients. The data (found in
Table 21.2) has a response variable of whether leukemia remission occurred
(REMISS), which is given by a 1. The independent variables are cellularity
of the marrow clot section (CELL), smear differential percentage of blasts
(SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL),
percentage labeling index of the bone marrow leukemia cells (LI), absolute
number of blasts in the peripheral blood (BLAST), and the highest temper-
ature prior to start of treatment (TEMP).


The following gives the estimated logistic regression equation and associ-
ated significance tests. The reference group of remission is 1 for this data.

##########
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 64.25808 74.96480 0.857 0.391
cell 30.83006 52.13520 0.591 0.554
smear 24.68632 61.52601 0.401 0.688
infil -24.97447 65.28088 -0.383 0.702
li 4.36045 2.65798 1.641 0.101
blast -0.01153 2.26634 -0.005 0.996
temp -100.17340 77.75289 -1.288 0.198

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.372 on 26 degrees of freedom


Residual deviance: 21.594 on 20 degrees of freedom
AIC: 35.594

Number of Fisher Scoring iterations: 8


##########

As you can see, the labeling index of the bone marrow leukemia cells appears
to be the closest to a significant predictor of remission occurring. After
looking at various subsets of the data, it is found that a significant model
is one which only includes the labeling index as a predictor.
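A hedged sketch of this reduced fit, together with the odds ratio and
Wald-based confidence interval reported below, is given here; the data frame
leuk with columns remiss and li is assumed for illustration.

##########
fit.li <- glm(remiss ~ li, family = binomial, data = leuk)
summary(fit.li)
exp(coef(fit.li)["li"])                # odds ratio for LI
exp(confint.default(fit.li)["li", ])   # Wald-based 95% confidence interval
##########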

##########
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.372 on 26 degrees of freedom

Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4

Odds Ratio: 18.125
95% Confidence Interval: 1.770 185.562
##########

Figure 21.2: (a) Plot of the deviance residuals. (b) Plot of the Pearson
residuals.

Notice that the odds ratio for LI is 18.12. It is calculated as e^{2.897}. The
95% confidence interval is calculated as e^{2.897 ± z0.975 × 1.187}, where
z0.975 = 1.960 is the 97.5th percentile of the standard normal distribution.
The interpretation of the odds ratio is that for every increase of 1 unit in
LI, the estimated odds of remission are multiplied by 18.12. However, since LI
appears to fall between 0 and 2, it may make more sense to say that for every
0.1 unit increase in LI, the estimated odds of remission are multiplied by
e^{2.897×0.1} = 1.337. So, assume that we have CELL=1.0 and TEMP=0.97. Then

• At LI=0.8, the estimated odds of remission is exp{−3.777 + 2.897 × 0.8} =
0.232.

STAT 501 D. S. Young


CHAPTER 21. NONLINEAR REGRESSION 337

Simulated Poisson Data

10
● ●

● ● ● ● ●
6
Y

● ● ● ●

● ●
4

● ● ● ●

● ● ● ● ● ●
2

● ●


0

5 10 15 20

Figure 21.3: Scatterplot of the simulated Poisson data set.

• At LI=0.9, the estimated odds of leukemia reoccurring is exp{−3.777+


2.897 ∗ 0.9} = 0.310.
0.232
• The odds ratio is θ = 0.310 , which is the ratio of the odds of death when
LI=0.8 compared to the odds when L1=0.9. Notice that 0.232×1.337 =
ˆ
0.310, which demonstrates the multiplicative effect by e−β2 on the odds
ratio.

Figure 21.2 also gives plots of the deviance residuals and the Pearson
residuals. These plots seem to be okay.

Example 3: Poisson Regression Example


Table 21.3 consists of a simulated data set of size n = 30 such that the
response (Y ) follows a Poisson distribution with rate λ = exp{0.50 + 0.07X}.
A plot of the response versus the predictor is given in Figure 21.3.
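As a hedged sketch, data of this form could be simulated and fit as follows;
the seed and the choice of predictor values are illustrative, while the rate
matches the model stated above.

##########
set.seed(123)
x <- sample(2:24, 30, replace = TRUE)
y <- rpois(30, lambda = exp(0.50 + 0.07 * x))   # Poisson counts with the stated rate
sim.fit <- glm(y ~ x, family = poisson)
summary(sim.fit)
##########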
The following gives the analysis of the Poisson regression data:

##########
Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.007217   0.989060   0.007    0.994
x           0.306982   0.066799   4.596 8.37e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for gaussian family taken to be 3.977365)

    Null deviance: 195.37 on 29 degrees of freedom
Residual deviance: 111.37 on 28 degrees of freedom
AIC: 130.49

Number of Fisher Scoring iterations: 2
##########

Figure 21.4: (a) Plot of the deviance residuals. (b) Plot of the Pearson
residuals.

As you can see, the predictor is highly significant.


Finally, Figure 21.4 also provides plots of the deviance residuals and Pear-
son residuals versus the fitted values. These plots appear to be good for a
Poisson fit. Further diagnostic plots can also be produced and model selec-
tion techniques can be employed when faced with multiple predictors.


REMISS CELL SMEAR INFIL LI BLAST TEMP


1 0.80 0.83 0.66 1.90 1.10 1.00
1 0.90 0.36 0.32 1.40 0.74 0.99
0 0.80 0.88 0.70 0.80 0.18 0.98
0 1.00 0.87 0.87 0.70 1.05 0.99
1 0.90 0.75 0.68 1.30 0.52 0.98
0 1.00 0.65 0.65 0.60 0.52 0.98
1 0.95 0.97 0.92 1.00 1.23 0.99
0 0.95 0.87 0.83 1.90 1.35 1.02
0 1.00 0.45 0.45 0.80 0.32 1.00
0 0.95 0.36 0.34 0.50 0.00 1.04
0 0.85 0.39 0.33 0.70 0.28 0.99
0 0.70 0.76 0.53 1.20 0.15 0.98
0 0.80 0.46 0.37 0.40 0.38 1.01
0 0.20 0.39 0.08 0.80 0.11 0.99
0 1.00 0.90 0.90 1.10 1.04 0.99
1 1.00 0.84 0.84 1.90 2.06 1.02
0 0.65 0.42 0.27 0.50 0.11 1.01
0 1.00 0.75 0.75 1.00 1.32 1.00
0 0.50 0.44 0.22 0.60 0.11 0.99
1 1.00 0.63 0.63 1.10 1.07 0.99
0 1.00 0.33 0.33 0.40 0.18 1.01
0 0.90 0.93 0.84 0.60 1.59 1.02
1 1.00 0.58 0.58 1.00 0.53 1.00
0 0.95 0.32 0.30 1.60 0.89 0.99
1 1.00 0.60 0.60 1.70 0.96 0.99
1 1.00 0.69 0.69 0.90 0.40 0.99
0 1.00 0.73 0.73 0.70 0.40 0.99

Table 21.2: The leukemia data set. Descriptions of the variables are given in
the text.


i xi yi i xi yi
1 2 0 16 16 7
2 15 6 17 13 6
3 19 4 18 6 2
4 14 1 19 16 5
5 16 5 20 19 5
6 15 2 21 24 6
7 9 2 22 9 2
8 17 10 23 12 5
9 10 3 24 7 1
10 23 10 25 9 3
11 14 2 26 7 3
12 14 6 27 15 3
13 9 5 28 21 4
14 5 2 29 20 6
15 17 2 30 20 9

Table 21.3: Simulated data for the Poisson regression example.



Chapter 22

Multivariate Multiple
Regression

Up until now, we have only been concerned with univariate responses (i.e.,
the case where the response Y is simply a single value for each observation).
However, sometimes you may have multiple responses measured for each ob-
servation, whether it be different characteristics or perhaps measurements
taken over time. When our regression setting must accommodate multiple
responses for a single observation, the technique is called multivariate regres-
sion.

22.1 The Model


A multivariate multiple regression model is a multivariate linear model
that describes how a vector of responses (or y-variables) relates to a set of
predictors (or x-variables). For example, you may have a newly machined
component which is divided into four sections (or sites). Various experimental
predictors may be the temperature and amount of stress induced on the
component. The responses may be the average length of the cracks that
develop at each of the four sites.
The general structure of a multivariate multiple regression model is as
follows:

• A set of p − 1 predictors, or independent variables, are measured for


each of the i = 1, . . . , n observations:


 
Xi = (Xi,1, . . . , Xi,p−1)^T.

• A set of m responses, or dependent variables, are measured for each of the
i = 1, . . . , n observations:

Yi = (Yi,1, . . . , Yi,m)^T.

• Each of the j = 1, . . . , m responses has its own regression model:

Yi,j = β0,j + β1,j Xi,1 + β2,j Xi,2 + . . . + βp−1,j Xi,p−1 + εi,j.

Vectorizing the above model for a single observation yields:

Yi^T = (1  Xi^T) B + εi^T,

where

B = ( β1  β2  · · ·  βm ) =
    [ β0,1     β0,2     . . .  β0,m
      β1,1     β1,2     . . .  β1,m
      ...      ...             ...
      βp−1,1   βp−1,2   . . .  βp−1,m ]

and εi = (εi,1, . . . , εi,m)^T. Notice that εi is the vector of errors for the
ith observation.


• Finally, we may explicitly write down the multivariate multiple regression
model:

Y_{n×m} = [ Y1^T ; . . . ; Yn^T ] = X_{n×p} B_{p×m} + ε_{n×m},

where the ith row of X_{n×p} is (1  Xi^T) and the ith row of ε_{n×m} is εi^T.

Or more compactly, without the dimensional subscripts, we will write:

Y = XB + ε.

22.2 Estimation and Statistical Regions


Least Squares
Extending least squares theory from the multiple regression setting to the
multivariate multiple regression setting is fairly intuitive. The biggest hurdle
is dealing with the matrix calculations (which statistical packages perform for
you anyhow). We can also formulate similar assumptions for the multivariate
model.
Let

ε(j) = (ε1,j, . . . , εn,j)^T,

which is the vector of errors for the jth response across all n observations.
We assume that E(ε(j)) = 0 and Cov(ε(i), ε(k)) = σi,k I_{n×n} for each
i, k = 1, . . . , m. Notice that the m errors for a single observation have
variance-covariance matrix Σ = {σi,k}, but errors from different observations
are uncorrelated.

B̂ = (XT X)−1 XT Y.


Using B̂, we can calculate the predicted values as:

Ŷ = XB̂

and the residuals as:


ε̂ = Y − Ŷ.
Furthermore, an estimate of Σ (which is the maximum likelihood estimate of
Σ) is given by:
Σ̂ = (1/n) ε̂^T ε̂.
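A hedged sketch of these computations in R is given below; lm() accepts a
matrix response built with cbind(), and the data frame dat with responses y1,
y2 and predictors x1, x2 is hypothetical.

##########
mfit <- lm(cbind(y1, y2) ~ x1 + x2, data = dat)
coef(mfit)                            # the p x m coefficient matrix B-hat
resid(mfit)                           # the n x m residual matrix
crossprod(resid(mfit)) / nrow(dat)    # maximum likelihood estimate of Sigma
##########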

Hypothesis Testing
Suppose we are interested in testing the hypothesis that our multivariate
responses do not depend on the predictors Xi,q+1 , . . . , Xi,p−1 . We can par-
tition B to consist of two matrices: one with the regression coefficients of
the predictors we assume will remain in the model and one with the regres-
sion coefficients we wish to test. Similarly, we can partition X in a similar
manner. Formally, the test is

H0 : β (2) = 0,

where

B = ( β(1)
      β(2) )

and

X = ( X1  X2 ).
Here X2 is an n × (p − q − 1) matrix of predictors corresponding to the null
hypothesis and X1 is an n × (q) matrix of predictors we assume will remain
in the model. Furthermore, β (2) and β (1) are (p − q − 1) × m and q × m
matrices, respectively, for these predictor matrices.
Under the null hypothesis, we can calculate
β̂(1) = (X1^T X1)^{−1} X1^T Y

and
Σ̂1 = (Y − X1 β̂ (1) )T (Y − X1 β̂ (1) )/n.


These values (which are maximum likelihood estimates under the null
hypothesis) can be used to calculate one of four commonly used multivariate
test statistics:

Wilks’ Lambda = |nΣ̂| / |nΣ̂1|
Pillai’s Trace = tr[(Σ̂1 − Σ̂)Σ̂1^{−1}]
Hotelling-Lawley Trace = tr[(Σ̂1 − Σ̂)Σ̂^{−1}]
Roy’s Greatest Root = λ1 / (1 + λ1).
In the above, λ1 is the largest nonzero eigenvalue of (Σ̂1 − Σ̂)Σ̂−1 . Also,
the value |Σ| is the determinant of the variance-covariance matrix Σ and is
called the generalized variance which assigns a single numerical value to
express the overall variation of this multivariate problem. All of the above
test statistics have approximate F −distributions with degrees of freedom
which are more complicated to calculate than what we have seen. Most
statistical packages will report at least one of the above if not all four. For
large sample sizes, the associated p-values will likely be similar, but various
situations (such as many large eigenvalues of (Σ̂1 − Σ̂)Σ̂−1 or a relatively small
sample size) will lead to a discrepancy between the results. In this case, it is
usually accepted to report the Wilks’ lambda value as this is the likelihood
ratio test.
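A hedged sketch of carrying out such a test by comparing full and reduced
multivariate fits is shown below, continuing the hypothetical example from the
least squares discussion; the test argument can also be set to "Pillai",
"Hotelling-Lawley", or "Roy".

##########
mfit.full    <- lm(cbind(y1, y2) ~ x1 + x2, data = dat)
mfit.reduced <- lm(cbind(y1, y2) ~ x1, data = dat)
anova(mfit.reduced, mfit.full, test = "Wilks")
##########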

Confidence Regions
One problem is to predict the mean responses corresponding to fixed values xh
of the predictors. Using various distributional results concerning B̂^T xh and
Σ̂, it can be shown that the 100 × (1 − α)% simultaneous confidence intervals
for E(Yi | X = xh) = xh^T β̂i are

xh^T β̂i ± √[ (m(n − p − 2)/(n − p − 1 − m)) F_{m, n−p−1−m; 1−α} ]
          × √[ xh^T (X^T X)^{−1} xh (n/(n − p − 2)) σ̂i,i ],

for i = 1, . . . , m. Here, β̂ i is the ith column of B̂ and σ̂i,i is the ith diagonal
element of Σ̂. Also, notice that the simultaneous confidence intervals are


constructed for each of the m entries of the response vector, which is why
they are considered “simultaneous”. Furthermore, the collection of these
simultaneous intervals yields what we call a 100 × (1 − α)% confidence region
for B̂^T xh.

Prediction Regions
Another problem is to predict new responses Yh = BT xh + εh. Again,
skipping over a discussion of the various distributional assumptions, it can be
shown that the 100 × (1 − α)% simultaneous prediction intervals for the
individual responses Yh,i are

xhT β̂i ± √{ [m(n − p − 2)/(n − p − 1 − m)] F_{m, n−p−1−m; 1−α} } × √{ (1 + xhT (XT X)⁻¹ xh) [n/(n − p − 2)] σ̂i,i },

for i = 1, . . . , m. The quantities here are the same as those in the simultaneous
confidence intervals. Furthermore, the collection of these simultaneous
prediction intervals is called a 100 × (1 − α)% prediction region for yh.
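A minimal sketch that translates the prediction interval formula above into R; the inputs X, B.hat, Sigma.hat, n, p, and m are assumed to come from a multivariate fit like the earlier sketch, and x.h is a new predictor vector whose first entry is 1.

##########
pred.interval <- function(x.h, i, X, B.hat, Sigma.hat, n, p, m, alpha = 0.05) {
  # Simultaneous 100(1 - alpha)% prediction interval for response i at x.h,
  # directly following the formula in the text.
  center <- sum(x.h * B.hat[, i])
  Fcrit  <- qf(1 - alpha, m, n - p - 1 - m)
  half   <- sqrt(m * (n - p - 2) / (n - p - 1 - m) * Fcrit) *
            sqrt((1 + t(x.h) %*% solve(t(X) %*% X) %*% x.h) *
                 (n / (n - p - 2)) * Sigma.hat[i, i])
  c(lower = center - half, upper = center + half)
}
##########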

MANOVA
The multivariate analysis of variance (MANOVA) table is similar to
its univariate counterpart. The sum of squares values in a MANOVA are
no longer scalar quantities, but rather matrices. Hence, the entries in the
MANOVA table are called sum of squares and cross-products (SSCPs).
These quantities are described in a little more detail below:

• The sum of squares and cross-products for total is SSCPTO =
∑_{i=1}^n (Yi − Ȳ)(Yi − Ȳ)T, which is the sum of squared deviations from
the overall mean vector of the Yi's. SSCPTO is a measure of the overall
variation in the Y vectors. The corresponding total degrees of freedom
are n − 1.

• The sum of squares and cross-products for the errors is SSCPE =
∑_{i=1}^n (Yi − Ŷi)(Yi − Ŷi)T, which is the sum of squared observed errors
(residuals) for the observed data vectors. SSCPE is a measure of the variation
in Y that is not explained by the multivariate regression. The
corresponding error degrees of freedom are n − p.

• The sum of squares and cross-products due to the regression is
SSCPR = SSCPTO − SSCPE, and it is a measure of the total variation
in Y that can be explained by the regression with the predictors. The
corresponding model degrees of freedom are p − 1.

Formally, a MANOVA table is given in Table 22.1.

Source       df      SSCP
Regression   p − 1   ∑_{i=1}^n (Ŷi − Ȳ)(Ŷi − Ȳ)T
Error        n − p   ∑_{i=1}^n (Yi − Ŷi)(Yi − Ŷi)T
Total        n − 1   ∑_{i=1}^n (Yi − Ȳ)(Yi − Ȳ)T

Table 22.1: MANOVA table for the multivariate multiple linear regression
model.

Notice in the MANOVA table that we do not define any mean square
values or an F-statistic. Rather, a test of the significance of the multivariate
multiple regression model is carried out using a Wilks' lambda quantity
similar to

Λ* = |∑_{i=1}^n (Yi − Ŷi)(Yi − Ŷi)T| / |∑_{i=1}^n (Yi − Ȳ)(Yi − Ȳ)T| = |SSCPE| / |SSCPTO|,

a suitable transformation of which (a multiple of −ln Λ*) approximately follows a
χ2 distribution. However, depending on the number of variables and the number
of trials, modified versions of this test statistic must be used, which will affect
the degrees of freedom for the corresponding χ2 distribution.
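For concreteness, a short sketch of how the SSCP matrices and this Wilks' lambda quantity could be computed in R, assuming the response matrix Y and fitted values Y.hat from the earlier sketch:

##########
Ybar   <- matrix(colMeans(Y), nrow(Y), ncol(Y), byrow = TRUE)  # overall mean vectors
SSCPTO <- crossprod(Y - Ybar)             # total SSCP
SSCPE  <- crossprod(Y - Y.hat)            # error SSCP
SSCPR  <- SSCPTO - SSCPE                  # regression SSCP
Lambda.star <- det(SSCPE) / det(SSCPTO)   # Wilks' lambda for the overall regression
##########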

22.3 Reduced Rank Regression


Reduced rank regression is a way of constraining the multivariate linear
regression model so that the regression coefficient matrix has less than full
rank. The objective in reduced rank regression is to minimize the sum of
squared residuals subject to a reduced rank condition. Without the rank
condition, the estimation problem is an ordinary least squares problem.
Reduced-rank regression is important in that it contains as special cases
the classical statistical techniques of principal component analysis, canonical
variate and correlation analysis, linear discriminant analysis, exploratory
factor analysis, multiple correspondence analysis, and other linear methods
of analyzing multivariate data. It is also heavily utilized in neural network
modeling and econometrics.
Recall that the multivariate regression model is

Y = XB + ε,

where Y is an n × m matrix, X is an n × p matrix, and B is a p × m matrix
of regression parameters. A reduced rank regression occurs when we have
the rank constraint

rank(B) = t ≤ min(p, m),

with equality yielding the traditional least squares setting. When the rank
condition above holds, there exist two non-unique full-rank matrices
A (p × t) and C (t × m) such that

B = AC.
Moreover, there may be an additional set of predictors, say W, such that W
is a n × q matrix. Letting D denote a q × m matrix of regression parameters,
we can then write the reduced rank regression model as follows:

Y = XAC + WD + ε.

In order to get estimates for the reduced rank regression model, first note
that E(ε(j)) = 0 and Var(ε(j)) = Im×m ⊗ Σ. For simplicity in the following,
let Z0 = Y, Z1 = X, and Z2 = W. Next, we define the moment matrices
Mi,j = ZiT Zj/m for i, j = 0, 1, 2 and Si,j = Mi,j − Mi,2 M2,2⁻¹ M2,j for i, j = 0, 1.
Then, the parameter estimates for the reduced rank regression model are as
follows:

Â = (ν̂1, . . . , ν̂t)Φ
ĈT = S0,1 Â (ÂT S1,1 Â)⁻¹
D̂ = M0,2 M2,2⁻¹ − ĈT ÂT M1,2 M2,2⁻¹,

where (ν̂1, . . . , ν̂t) are the eigenvectors corresponding to the t largest eigenvalues
λ̂1, . . . , λ̂t of |λS1,1 − S1,0 S0,0⁻¹ S0,1| = 0 and where Φ is an arbitrary t × t
matrix with full rank.
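The following R sketch is not the estimator above verbatim; it illustrates the common special case with no additional predictor block W and identity weighting, in which the reduced rank fit projects the ordinary least squares coefficient matrix onto the leading eigenvectors of ŶTŶ. All names and the simulated data are illustrative.

##########
set.seed(2)
n <- 100; p <- 4; m <- 3; t.rank <- 1
X <- matrix(rnorm(n * p), n, p)
B.true <- outer(rnorm(p), rnorm(m))              # a rank-1 coefficient matrix
Y <- X %*% B.true + matrix(rnorm(n * m), n, m)
B.ols <- solve(crossprod(X), crossprod(X, Y))    # ordinary least squares, p x m
Y.ols <- X %*% B.ols
V <- eigen(crossprod(Y.ols))$vectors[, 1:t.rank, drop = FALSE]  # leading eigenvectors
B.rr <- B.ols %*% V %*% t(V)                     # rank-constrained coefficient estimate
##########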


22.4 Example
Example: Amitriptyline Data
This example analyzes conjectured side effects of amitriptyline - a drug some
physicians prescribe as an antidepressant. Data were gathered on n = 17
patients admitted to a hospital with an overdose of amitriptyline. The two
response variables are Y1 = total TCAD plasma level and Y2 = amount of
amitriptyline present in TCAD plasma level. The five predictors measured
are X1 = gender (0 for male and 1 for female), X2 = amount of the drug
taken at the time of overdose, X3 = PR wave measurement, X4 = diastolic
blood pressure, and X5 = QRS wave measurement. Table 22.2 gives the data
set and we wish to fit a multivariate multiple linear regression model.

Y1 Y2 X1 X2 X3 X4 X5
3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129

Table 22.2: The amitriptyline data set.
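A hedged sketch of how such a fit might be obtained in R, assuming the data of Table 22.2 are stored in a data frame called amitriptyline with the column names shown:

##########
fit <- lm(cbind(Y1, Y2) ~ X1 + X2 + X3 + X4 + X5, data = amitriptyline)
coef(fit)                          # the matrix of regression estimates
summary(fit)                       # per-response coefficient tables
SSCPE  <- crossprod(residuals(fit))                       # error SSCP
SSCPTO <- crossprod(scale(amitriptyline[, c("Y1", "Y2")],
                          center = TRUE, scale = FALSE))  # total SSCP
SSCPR  <- SSCPTO - SSCPE                                  # regression SSCP
##########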

First we obtain the regression estimates for each response:

##########
Coefficients:

Y1 Y2
(Intercept) -2879.4782 -2728.7085
X1 675.6508 763.0298
X2 0.2849 0.3064
X3 10.2721 8.8962
X4 7.2512 7.2056
X5 7.5982 4.9871
##########

Then we can obtain individual ANOVA tables for each response and see that
the multiple regression model for each response is statistically significant.

##########
Response Y1 :
Df Sum Sq Mean Sq F value Pr(>F)
Regression 5 6835932 1367186 17.286 6.983e-05 ***
Residuals 11 870008 79092
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Response Y2 :
Df Sum Sq Mean Sq F value Pr(>F)
Regression 5 6669669 1333934 15.598 0.0001132 ***
Residuals 11 940709 85519
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##########

The following also gives the SSCP matrices for this fit:

##########
$SSCPR
Y1 Y2
Y1 6835932 6709091
Y2 6709091 6669669

$SSCPE
Y1 Y2
Y1 870008.3 765676.5
Y2 765676.5 940708.9

$SSCPTO
Y1 Y2
Y1 7705940 7474767
Y2 7474767 7610378
##########
We can also see which predictors are statistically significant for each response:
##########
Response Y1 :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.879e+03 8.933e+02 -3.224 0.008108 **
X1 6.757e+02 1.621e+02 4.169 0.001565 **
X2 2.849e-01 6.091e-02 4.677 0.000675 ***
X3 1.027e+01 4.255e+00 2.414 0.034358 *
X4 7.251e+00 3.225e+00 2.248 0.046026 *
X5 7.598e+00 3.849e+00 1.974 0.074006 .
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 281.2 on 11 degrees of freedom


Multiple R-Squared: 0.8871, Adjusted R-squared: 0.8358
F-statistic: 17.29 on 5 and 11 DF, p-value: 6.983e-05

Response Y2 :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.729e+03 9.288e+02 -2.938 0.013502 *
X1 7.630e+02 1.685e+02 4.528 0.000861 ***
X2 3.064e-01 6.334e-02 4.837 0.000521 ***
X3 8.896e+00 4.424e+00 2.011 0.069515 .
X4 7.206e+00 3.354e+00 2.149 0.054782 .
X5 4.987e+00 4.002e+00 1.246 0.238622

Figure 22.1: Plots of the Studentized residuals versus fitted values for the
response (a) total TCAD plasma level and the response (b) amount of
amitriptyline present in TCAD plasma level.

---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 292.4 on 11 degrees of freedom


Multiple R-Squared: 0.8764, Adjusted R-squared: 0.8202
F-statistic: 15.6 on 5 and 11 DF, p-value: 0.0001132
##########

We can proceed to drop certain predictors from the model in an attempt
to improve the fit, as well as view residual plots to assess the regression
assumptions. Figure 22.1 gives the Studentized residual plots for each of the
responses. Notice that the plots have a fairly random pattern, but there is
one value that is high with respect to the fitted values. We could formally test (e.g.,
with Levene's test) whether this affects the constant variance assumption,
and we could also study pairwise scatterplots for any potential multicollinearity in
this model.

Chapter 23

Data Mining

The field of Statistics is constantly being presented with larger and more
complex data sets than ever before. The challenge for the Statistician is to
be able to make sense of all of this data, extract important patterns, and
find meaningful trends. We refer to the general tools and the approaches for
dealing with these challenges in massive data sets as data mining.1
Data mining problems typically involve an outcome measurement which
we wish to predict based on a set of feature measurements. The set of
these observed measurements is called the training data. From these train-
ing data, we attempt to build a learner, which is a model used to predict the
outcome for new subjects. These learning problems are (roughly) categorized
as either supervised or unsupervised. A supervised learning problem is
one where the goal is to predict the value of an outcome measure based on a
number of input measures, such as classification with labeled samples from
the training data. An unsupervised learning problem is one where there is
no outcome measure and the goal is to describe the associations and patterns
among a set of input measures, which involves clustering unlabeled training
data by partitioning a set of features into a number of statistical classes. The
regression problems that are the focus of this text are (generally) supervised
learning problems.
Data mining is an extensive field in and of itself. In fact, many of the
methods utilized in this field are regression-based. For example, smoothing
splines, shrinkage methods, and multivariate regression methods are all often
found in data mining. The purpose of this chapter will not be to revisit these
methods, but rather to add to our toolbox additional regression methods
that happen to be utilized more often in data mining problems.

¹Data mining is also referred to as statistical learning or machine learning.

23.1 Some Notes on Variable and Model Selection
When faced with high-dimensional data, it is often desired to perform some
variable selection procedure. Methods discussed earlier, such as best subsets,
forward selection, and backwards elimination can be used; however, these
can be very computationally expensive to implement. Shrinkage methods
like LASSO can be implemented, but these too can be expensive.
Another alternative used in variable selection and commonly discussed in
the context of data mining is least angle regression or LARS. LARS is
a stagewise procedure that uses a simple mathematical formula to accelerate
the computations relative to the other variable selection procedures we have
discussed. Only p steps are required for the full set of solutions, where p is
the number of predictors. The LARS procedure starts with all coefficients
equal to zero, and then finds the predictor most correlated with the response,
say Xj1 . We take the largest step possible in the direction of this predictor
until some other predictor, say Xj2 , has as much correlation with the cur-
rent residual. LARS then proceeds in a direction equiangular between the
two predictors until a third variable, say Xj3 earns its way into the “most
correlated” set. LARS then proceeds equiangularly between Xj1 , Xj2 , and
Xj3 (along the “least angle direction”) until a fourth variable enters. This
continues until all p predictors have entered the model and then the ana-
lyst studies these p models to determine which yields an appropriate level of
parsimony.
A related methodology to LARS is forward stagewise regression.
Forward stagewise regression starts by taking the residuals between the re-
sponse values and their mean (i.e., all of the regression slopes are set to
0). Call this vector r. Then, find the predictor most correlated with r, say
Xj1. Update the regression coefficient βj1 by setting βj1 = β*j1 + δj1, where
δj1 = ε × corr(r, Xj1) for some ε > 0 and β*j1 is the old value of βj1. Finally,
update r by setting it equal to r* − δj1 Xj1, where r* is the old value of r.
Repeat this process until no predictor has any correlation with r.
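A minimal R sketch of forward stagewise regression as just described, assuming standardized predictors in a matrix X and a numeric response y (both made-up names). A call such as beta.hat <- forward.stagewise(scale(X), y) would then trace out the coefficient path one small step at a time.

##########
forward.stagewise <- function(X, y, eps = 0.01, max.iter = 5000, tol = 1e-4) {
  beta <- rep(0, ncol(X))
  r <- y - mean(y)                       # start from the centered response
  for (iter in 1:max.iter) {
    cors <- drop(cor(X, r))              # correlation of each predictor with r
    j <- which.max(abs(cors))            # most correlated predictor
    if (abs(cors[j]) < tol) break        # stop when no predictor is correlated with r
    delta <- eps * cors[j]
    beta[j] <- beta[j] + delta           # take a small step on coefficient j
    r <- r - delta * X[, j]              # update the residual vector
  }
  beta
}
##########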
LARS and forward stagewise regression are very computationally efficient.

In fact, a slight modification to the LARS algorithm can calculate all possible
LASSO estimates for a given problem. Moreover, a different modification
to LARS efficiently implements forward stagewise regression. In fact, the
acronym for LARS includes an “S” at the end to reflect its connection to
LASSO and forward stagewise regression.
Earlier in the text we also introduced the bootstrap as a way to get boot-
strap confidence intervals for the regression parameters. However, the notion
of the bootstrap can also be extended to fitting a regression model. Suppose
that we have p − 1 feature measurements and one outcome variable. Let
Z = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} be our training data that we wish to fit
a model to such that we obtain the prediction fˆ(x) at each input x. Boot-
strap aggregation or bagging averages this prediction over a collection of
bootstrap samples, thus reducing its variance. For each bootstrap sample
Z∗b , b = 1, 2, . . . , B, we fit our model, which yields the prediction fˆb∗ (x). The
bagging estimate is then defined by
f̂bag(x) = (1/B) ∑_{b=1}^B f̂*b(x).

Denote the empirical distribution function by P̂, which puts equal proba-
bility 1/n on each of the data points (xi , yi ). The “true” bagging estimate
is defined by EP̂ fˆ∗ (x), where Z∗ = {(x∗1 , y1∗ ), (x∗2 , y2∗ ), . . . , (x∗n , yn∗ )} and each
(x∗i , yi∗ ) ∼ P̂ . Note that the bagging estimate given above is a Monte Carlo
estimate of the “true” bagging estimate, which it approaches as B → ∞. The
bagging approach can be used in other model selection approaches through-
out Statistics and data mining.
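A small bagging sketch in R, using lm() as a stand-in base learner (any fitting function could be substituted); the formula, data, and newdata arguments are generic placeholders.

##########
bagging.predict <- function(formula, data, newdata, B = 100) {
  preds <- sapply(seq_len(B), function(b) {
    idx <- sample(nrow(data), replace = TRUE)        # bootstrap sample Z*_b
    fit <- lm(formula, data = data[idx, , drop = FALSE])
    predict(fit, newdata = newdata)                  # prediction f-hat*_b(x)
  })
  preds <- matrix(preds, ncol = B)                   # one column per bootstrap fit
  rowMeans(preds)                                    # bagged prediction f-hat_bag(x)
}
##########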

23.2 Classification and Support Vector Regression
Classification is the problem of identifying the subpopulation to which new
observations belong, where the subpopulation labels of the new observations are unknown,
on the basis of a training set of data containing observations whose
subpopulation is known. The classification problem is often contrasted with
clustering, where the problem is to analyze a data set and determine how
(or if) the data set can be divided into groups. In data mining, classification

is a supervised learning problem while clustering is an unsupervised learning problem.
In this chapter, we will focus on a special classification technique which
has many regression applications in data mining. Support Vector Ma-
chines (or SVMs) perform classification by constructing an N -dimensional
hyperplane that optimally separates the data into two categories. SVM mod-
els are closely related to neural networks, which we discuss later in this
chapter. The predictor variables are called attributes, and a transformed
attribute that is used to define the hyperplane is the feature. The task of
choosing the most suitable representation is known as feature selection. A
set of features that describes one case (i.e., a row of predictor values) is called
a vector. So the goal of SVM modeling is to find the optimal hyperplane
that separates clusters of vectors in such a way that cases with one category
of the target variable are on one side of the plane and cases with the other
category are on the other side of the plane. The vectors near the hyperplane
are the support vectors.
Suppose we wish to perform classification with the data shown in Figure
23.1(a) and our data has a categorical target variable with two categories.
Also assume that the attributes have continuous values. Figure 23.1(b) pro-
vides a snapshot of how we perform SVM modeling. The SVM analysis
attempts to find a 1-dimensional hyperplane (i.e., a line) that separates the
cases based on their target categories. There are an infinite number of possi-
ble lines and we show only one in Figure 23.1(b). The question is which line
is optimal and how do we define that line.
The dashed lines drawn parallel to the separating line mark the distance
between the dividing line and the closest vectors to the line. The distance
between the dashed lines is called the margin. The vectors (i.e., points)
that constrain the width of the margin are the support vectors. An SVM
analysis finds the line (or, in general, hyperplane) that is oriented so that the
margin between the support vectors is maximized. Unfortunately, the data
we deal with is not generally as simple as that in Figure 23.1. The challenge
will be to develop an SVM model that accommodates such characteristics as:
1. more than two attributes;

2. separation of the points with nonlinear curves;

3. handling of cases where the clusters cannot be completely separated; and


Figure 23.1: (a) A plot of the data where classification is to be performed.
(b) The data where a support vector machine has been used. The points
near the parallel dashed lines are the support vectors. The region between
the parallel dashed lines is called the margin, which is the region we want to
optimize.

4. handling classification with more than two categories.

The setting with nonlinear curves and where clusters cannot be completely
separated is illustrated in Figure 23.2. Without loss of generality, our discussion
will mainly be focused on the one-attribute and one-feature setting.
Moreover, we will be utilizing support vectors in order to build a regression
relationship that fits our data adequately.
A little more terminology is necessary before we move into the regression
discussion. A loss function represents the loss in utility associated with
an estimate being “wrong” (i.e., different from either a desired or a true
value) as a function of a measure of the degree of “wrongness” (generally
the difference between the estimated value and the true or desired value).
When discussing SVM modeling in the regression setting, the loss function
will need to incorporate a distance measure as well.
As a quick illustration of some common loss functions, look at Figure 23.3.
Figure 23.3(a) is a quadratic loss function, which is what we use in classical
ordinary least squares. Figure 23.3(b) is a Laplacian loss function, which

Figure 23.2: A plot of data where a support vector machine has been used
for classification. The data was generated where we know that the circles
belong to group 1 and the triangles belong to group 2. The white contours
show where the margin is; however, there are clearly some values that have
been misclassified since the two clusters are not well-separated. The points
that are solid were used as the training data.

is less sensitive to outliers than the quadratic loss function. Figure 23.3(c)
is Huber’s loss function, which is a robust loss function that has optimal
properties when the underlying distribution of the data is unknown. Finally,
Figure 23.3(d) is called the ε-insensitive loss function, which enables a sparse
set of support vectors to be obtained.
In Support Vector Regressions (or SVRs), the input is first mapped
onto an N -dimensional feature space using some fixed (nonlinear) mapping,
and then a linear model is constructed in this feature space. Using mathe-
matical notation, the linear model (in the feature space) is given by
f(x, ω) = ∑_{j=1}^N ωj gj(x) + b,

where gj (·), j = 1, . . . , N denotes a set of nonlinear transformations, and b


is a bias term. If the data is assumed to be of zero mean (as it usually is),


Figure 23.3: Plots of the (a) quadratic loss, (b) Laplace loss, (c) Huber’s loss,
and (d) -insensitive loss functions.


then the bias term is dropped. Note that b is not considered stochastic in
this model and is not akin to the error terms we have studied in previous
models.
The optimal regression function is given by the minimum of the functional
Φ(ω, ξ) = (1/2)‖ω‖² + C ∑_{i=1}^n (ξi⁻ + ξi⁺),

where C is a pre-specified constant, and ξ⁻ and ξ⁺ are slack variables representing
upper and lower constraints (respectively) on the output of the system. In
other words, we have the following constraints:

yi − f(xi, ω) ≤ ε + ξi⁺
f(xi, ω) − yi ≤ ε + ξi⁻
ξi⁻, ξi⁺ ≥ 0, i = 1, . . . , n,
where yi is defined through the loss function we are using. The four loss
functions we show in Figure 23.3 are as follows:

Quadratic Loss:
L2(f(x) − y) = (f(x) − y)²

Laplace Loss:
L1(f(x) − y) = |f(x) − y|

Huber's Loss²:
LH(f(x) − y) = (1/2)(f(x) − y)², for |f(x) − y| < δ;
               δ|f(x) − y| − δ²/2, otherwise.

ε-Insensitive Loss:
Lε(f(x) − y) = 0, for |f(x) − y| < ε;
               |f(x) − y| − ε, otherwise.
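Written as R functions of the residual u = f(x) − y, the four loss functions are simply the following; the default values of δ and ε below are arbitrary choices.

##########
loss.quadratic  <- function(u) u^2                                  # quadratic loss
loss.laplace    <- function(u) abs(u)                               # Laplace loss
loss.huber      <- function(u, delta = 1) {                         # Huber's loss
  ifelse(abs(u) < delta, 0.5 * u^2, delta * abs(u) - delta^2 / 2)
}
loss.eps.insens <- function(u, eps = 0.1) pmax(abs(u) - eps, 0)     # epsilon-insensitive loss
##########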

Depending on which loss function is chosen, an appropriate optimization
problem can be specified, which can involve kernel methods. Moreover,
specification of the kernel type as well as values like C, ε, and δ all control
the complexity of the model in different ways. There are many subtleties
²The quantity δ is a specified threshold constant.


depending on which loss function is used and the investigator should become
familiar with the loss function being employed. Regardless, the optimization
approach will require the use of numerical methods.
It is also desirable to strike a balance between complexity and the error
that is present with the fitted model. Test error (also known as gener-
alization error) is the expected prediction error over an independent test
sample and is given by

Err = E[L(Y, fˆ(X))],

where X and Y are drawn randomly from their joint distribution. This
expectation is an average of everything that is random in this set-up, includ-
ing the randomness in the training sample that produced the estimate fˆ(·).
Training error is the average loss over the training sample and is given by
err = (1/n) ∑_{i=1}^n L(yi, f̂(xi)).

We would like to know the test error of our estimated model fˆ(·). As the
model increases in complexity, it is able to capture more complicated un-
derlying structures in the data, which thus decreases bias. But then the
estimation error increases, which thus increases variance. This is known as
the bias-variance tradeoff. In between there is an optimal model complex-
ity that gives minimum test error.

23.3 Boosting and Regression Transfer


Transfer learning is the notion that it is easier to learn a new concept (such
as how to play racquetball) if you are already familiar with a similar concept
(such as knowing how to play tennis). In the context of supervised learn-
ing, inductive transfer learning is often framed as the problem of learning
a concept of interest, called the target concept, given data from multiple
sources: a typically small amount of target data that reflects the target con-
cept, and a larger amount of source data that reflects one or more different,
but possibly related, source concepts.
While most algorithms addressing this notion are in classification settings,
some of the common algorithms can be extended to the regression setting to
help us build our models. The approach we discuss is called boosting or


boosted regression. Boosted regression is highly flexible in that it allows


the researcher to specify the feature measurements without specifying their
functional relationship to the outcome measurement. Because of this flexibil-
ity, a boosted model will tend to fit better than a linear model and therefore
inferences made based on the boosted model may have more credibility.
Our goal is to learn a model of a concept ctarget mapping feature vec-
tors from the feature space containing X to the response space Y . We
are given a set of training instances Ttarget = {(xi , yi )}, with xi ∈ X and
yi ∈ Y for i = 1, . . . , n that reflect ctarget. In addition, we are given source data sets
Tsource^1, . . . , Tsource^B reflecting B different, but possibly related, concepts
also mapping X to Y . In order to learn the most accurate possible model of
ctarget , we must decide how to use both the target and source data sets. If
Ttarget is sufficiently large, we can likely learn a good model using only this
data. However, if Ttarget is small and one or more of the source concepts is
similar to ctarget , then we may be able to use the source data to improve our
model.
Regression transfer algorithms fit into two basic categories: those that
make use of models trained on the source data, and those that use the source
data directly as training data. The two algorithms presented here fit into
each of these categories and are inspired by two boosting-based algorithms
for classification transfer: ExpBoost and AdaBoost. The regression analogues
that we present are called ExpBoost.R2 and AdaBoost.R2. Boosting is an
ensemble method in which a sequence of models (or hypotheses) h1 , . . . , hm ,
each mapping from X to Y , are iteratively fit to some transformation of a
data set using a base learner. The outputs of these models are then combined
into a final hypothesis, which we denote as h∞ . We can now formalize the
two regression transfer algorithms.

AdaBoost.R2

Input the labeled target data set T of size n, the maximum number of
iterations B, and a base learning algorithm called Learner. Unless otherwise
specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n.
For t = 1, . . . , B:

1. Call Learner with the training set T and the distribution wt, and get
a hypothesis ht : X → R.

2. Calculate the adjusted error eti for each instance. Let Dt = maxi |yi −
ht(xi)|, so that eti = |yi − ht(xi)|/Dt.

3. Calculate the adjusted error of ht, which is εt = ∑_{i=1}^n eti wit. If εt ≥ 0.5,
then stop and set B = t − 1.

4. Let γt = εt/(1 − εt).

5. Update the weight vector as wit+1 = wit γt^(1−eti)/Zt, where Zt is a
normalizing constant.

Output the hypothesis h∞, which is the weighted median of ht(x) for t = 1, . . . , B,
using ln(1/γt) as the weight for hypothesis ht.

The method used in AdaBoost.R2 is to express each error in relation to
the largest error D = maxi |ei| in such a way that each adjusted error e′i
is in the range [0, 1]. In particular, one of three possible loss functions is
used: e′i = ei/D (linear), e′i = ei²/D² (quadratic), or e′i = 1 − exp(−ei/D)
(exponential). The degree to which instance xi is reweighted in iteration t
thus depends on how large the error of ht is on xi relative to the error on the
worst instance.
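A compact R sketch of AdaBoost.R2 with a weighted lm() as the base learner; the base learner, the linear adjusted error, and all object names are illustrative choices rather than part of the original algorithm description, and at least one boosting iteration is assumed to succeed.

##########
adaboost.r2 <- function(formula, data, newdata, B = 20) {
  n <- nrow(data)
  w <- rep(1 / n, n)                                   # initial weight vector
  fits <- list(); lg <- numeric(0)
  y <- model.response(model.frame(formula, data))
  for (t in 1:B) {
    fit <- lm(formula, data = data, weights = w)       # step 1: call the learner
    e <- abs(y - fitted(fit)); e <- e / max(e)         # step 2: adjusted errors (linear loss)
    eps.t <- sum(e * w)                                # step 3: adjusted error of h_t
    if (eps.t >= 0.5) break
    gamma.t <- eps.t / (1 - eps.t)                     # step 4
    w <- w * gamma.t^(1 - e); w <- w / sum(w)          # step 5: reweight and normalize
    fits[[length(fits) + 1]] <- fit
    lg <- c(lg, log(1 / gamma.t))                      # hypothesis weight ln(1/gamma_t)
  }
  preds <- sapply(fits, predict, newdata = newdata)    # predictions from each h_t
  preds <- matrix(preds, ncol = length(fits))
  # weighted median of the h_t(x), using ln(1/gamma_t) as the weights
  apply(preds, 1, function(p) {
    o <- order(p); cw <- cumsum(lg[o]) / sum(lg)
    p[o][which(cw >= 0.5)[1]]
  })
}
##########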

ExpBoost.R2

Input the labeled target data set T of size n, the maximum number of
iterations B, and a base learning algorithm called Learner. Unless otherwise
specified, set the initial weight vector w1 such that wi1 = 1/n for i = 1, . . . , n.
Moreover, each source data set gets assigned to one expert from the set of
experts H^B = {h1, . . . , hB}.
For t = 1, . . . , B:

1. Call Learner with the training set T and the distribution wt, and get
a hypothesis ht : X → R.

2. Calculate the adjusted error eti for each instance. Let Dt = maxi |yi −
ht(xi)|, so that eti = |yi − ht(xi)|/Dt.

3. Calculate the adjusted error of ht, which is εt = ∑_{i=1}^n eti wit. If εt ≥ 0.5,
then stop and set B = t − 1.

4. Calculate the weighted errors of each expert in H^B under the current
weighting scheme. If any expert in H^B has a lower weighted error than
ht, then replace ht with this "best" expert.

5. Let γt = εt/(1 − εt).

6. Update the weight vector as wit+1 = wit γt^(1−eti)/Zt, where Zt is a
normalizing constant.

Output the hypothesis h∞, which is the weighted median of ht(x) for t = 1, . . . , B,
using ln(1/γt) as the weight for hypothesis ht.

As can be seen, ExpBoost.R2 is similar to the AdaBoost.R2 algorithm, but with a few minor differences.

23.4 CART and MARS


Classification and regression trees (CART) is a nonparametric tree-
based method which partitions the predictor space into a set of rectangles
and then fits a simple model (like a constant) in each one. While they seem
conceptually simple, they are actually quite powerful.
Suppose we have one response (yi) and p predictors (xi,1, . . . , xi,p) for
i = 1, . . . , n. First we partition the predictor space into M regions (say,
R1, . . . , RM) and model the response as a constant cm in each region:

f(x) = ∑_{m=1}^M cm I(x ∈ Rm).

Then, minimizing the sum of squares ∑_{i=1}^n (yi − f(xi))² yields

ĉm = ∑_{i=1}^n yi I(xi ∈ Rm) / ∑_{i=1}^n I(xi ∈ Rm),

that is, the average of the yi falling in region Rm.

We proceed to grow the tree by finding the best binary partition in terms
of the ĉm values. This is generally computationally infeasible, which leads to
the use of a greedy search algorithm. Typically, the tree is grown until a small
minimum node size (such as 5 observations per node) is reached, and then a method
for pruning the tree is implemented.
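A regression tree of this kind can be grown and pruned with the rpart package; a hedged sketch, assuming a data frame dat with response y:

##########
library(rpart)
tree <- rpart(y ~ ., data = dat, method = "anova",        # regression tree
              control = rpart.control(minsplit = 10, cp = 0.01))
printcp(tree)                            # cross-validated error for each subtree size
tree.pruned <- prune(tree, cp = 0.02)    # prune back using a chosen complexity value
##########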
Multivariate adaptive regression splines (MARS) is another non-
parametric method that can be viewed as a modification of CART and is well-
suited for high-dimensional problems. MARS uses expansions in piecewise
linear basis functions of the form (x−t)+ and (t−x)+ such that the “+” sub-
script simply means we take the positive part (e.g., (x−t)+ = (x−t)I(x > t)).
These two functions together are called a reflected pair.
In MARS, each function is piecewise linear with a knot at t. The idea is
to form a reflected pair for each predictor Xj with knots at each observed
value xi,j of that predictor. Therefore, the collection of basis functions for
j = 1, . . . , p is

C = {(Xj − t)+ , (t − Xj )+ }t∈{x1,j ,...,xn,j } .

MARS proceeds like a forward stepwise regression model selection procedure,


but instead of selecting the predictors to use, we use functions from the set
C and their products. Thus, the model has the form

f(X) = β0 + ∑_{m=1}^M βm hm(X),

where each hm(X) is a function in C or a product of two or more such functions.
You can also think of MARS as “selecting” a weighted sum of basis func-
tions from the set of (a large number of) basis functions that span all values
of each predictor (i.e., that set would consist of one basis function and knot
value t for each distinct value of each predictor variable). The MARS algo-
rithm then searches over the space of all inputs and predictor values (knot
locations t) as well as interactions between variables. During this search,
an increasingly larger number of basis functions are added to the model (se-
lected from the set of possible basis functions) to maximize an overall least
squares goodness-of-fit criterion. As a result of these operations, MARS au-
tomatically determines the most important independent variables as well as
the most significant interactions among them.
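A MARS fit of this form is available through the earth package; a short sketch under the same assumed data frame dat:

##########
library(earth)
mars.fit <- earth(y ~ ., data = dat, degree = 2)   # allow two-way interactions
summary(mars.fit)     # selected basis functions (reflected pairs) and coefficients
evimp(mars.fit)       # estimated variable importance
##########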


23.5 Neural Networks


With the exponential growth in available data and advancement in comput-
ing power, researchers in statistics, artificial intelligence, and data mining
have been faced with the challenge to develop simple, flexible, powerful pro-
cedures for modeling large data sets. One such model is the neural network
approach, which attempts to model the response as a nonlinear function of
various linear combinations of the predictors. Neural networks were first used
as models for the human brain.
The most commonly used neural network model is the single-hidden-
layer, feedforward neural network (sometimes called the single-layer
perceptron). In this neural network model, the ith response yi is modeled as
a nonlinear function fY of m derived predictor values, Si,0 , Si,1 , . . . , Si,m−1 :

yi = fY(β0 Si,0 + β1 Si,1 + . . . + βm−1 Si,m−1) + εi = fY(SiT β) + εi,

where β = (β0, β1, . . . , βm−1)T and Si = (Si,0, Si,1, . . . , Si,m−1)T.

Si,0 equals 1 and for j = 1, . . . , m − 1, the j th derived predictor value for the
ith observation, Si,j , is a nonlinear function fj of a linear combination of the
original predictors:
Si,j = fj(XiT θj),

where θj = (θj,0, θj,1, . . . , θj,p−1)T and Xi = (Xi,0, Xi,1, . . . , Xi,p−1)T

and Xi,0 = 1. The functions fY , f1 , . . . , fm−1 are called activation func-


tions. Finally, we can combine all of the above to form the neural network


model as:

yi = fY(SiT β) + εi = fY(β0 + ∑_{j=1}^{m−1} βj fj(XiT θj)) + εi.

There are various numerical optimization algorithms for fitting neural


networks (e.g., quasi-Newton methods and conjugate-gradient algorithms).
One important thing to note is that parameter estimation in neural networks
often utilizes penalized least squares to control the level of overfitting. The
penalized least squares criterion is given by:
n
X m−1
X
Q= [yi − fY (β0 + βj fj (XT 2
i θ j ))] + pλ (β, θ 1 , . . . , θ m−1 ),
i=1 j=1

where the penalty term is given by:


m−1 m−1 p−1 
X XX
pλ (β, θ 1 , . . . , θ m−1 ) = λ βi2 + 2
θi,j .
i=0 i=1 j=1

Finally, there is also a modeling technique which is similar to the tree-


based methods discussed earlier. The hierarchical mixture-of-experts
model (HME model) is a parametric tree-based method which recursively
splits the function of interest at each node. However, the splits are done
probabilistically and the probabilities are functions of the predictors. The
model is written as
f(yi) = ∑_{j1=1}^{k1} λj1(xi, τ) ∑_{j2=1}^{k2} λj2(xi, τj1) · · · ∑_{jr=1}^{kr} λjr(xi, τj1,j2,...,jr−1) g(yi; xi, θj1,j2,...,jr−1),

which has a tree structure with r levels (i.e., r levels where probabilistic splits
occur). The λ(·) functions provide the probabilities for the splitting and, in
addition to being dependent on the predictors, they also have their own
set of parameters (the different τ values) requiring estimation (these mixing


proportions are modeled using logistic regressions). Finally, θ simply denotes the
parameter vectors for the regressions modeled at each terminal node of the
tree constructed using the HME structure.
The HME model is similar to CART, however, unlike CART it does not
provide a “hard” split at each node (i.e., either a node splits or it does not).
The HME model incorporates these predictor-dependent mixing proportions
which provide a “soft” probabilistic split at each node. The HME model
can also be thought of as being in the middle of a continuum where at one
end we have CART (which provides hard splits) and at the other end is
mixtures of regressions (which is closely related to the HME model, but the
mixing proportions which provide the soft probabilistic splits are no longer
predictor-dependent). We will discuss mixtures of regressions at the end of
this chapter.

23.6 Examples
Example 1: Simulated Neural Network Data
In this very simple toy example, we have provided two features (i.e., input
neurons X1 and X2 ) and one response measurement (i.e., output neuron Y )
which are given in Table 23.1. The model fit is one where X1 and X2 do not
interact on Y . A single hidden-layer and a double hidden-layer neural net
model are each fit to this data to highlight the difference in the fits. Below
is the output (for each neural net) which shows the results from training
the model. A total of 5 training samples were used and a threshold value
of 0.01 was used as a stopping criterion. The stopping criterion pertains to
the partial derivatives of the error function and once we fall beneath that
threshold value, then the algorithm stops for that training sample.

X1 X2 Y
0 0 0
1 0 1
0 1 1
1 1 0

Table 23.1: The simulated neural network data.
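A sketch of how fits like these might be produced with the neuralnet package; the exact hidden-layer sizes below are assumptions, since only the number of hidden layers is stated in the text.

##########
library(neuralnet)
toy <- data.frame(X1 = c(0, 1, 0, 1), X2 = c(0, 0, 1, 1), Y = c(0, 1, 1, 0))
set.seed(10)
nn1 <- neuralnet(Y ~ X1 + X2, data = toy, hidden = 1,        # single hidden layer (assumed size)
                 rep = 5, threshold = 0.01)
nn2 <- neuralnet(Y ~ X1 + X2, data = toy, hidden = c(2, 2),  # two hidden layers (assumed sizes)
                 rep = 5, threshold = 0.01)
nn1$result.matrix        # error, reached threshold, and steps for each repetition
plot(nn2, rep = "best")  # diagram of the best fitted network
##########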

##########
5 repetitions were calculated.

            Error   Reached Threshold   Steps
3    0.3480998376      0.007288519715      41
1    0.5000706639      0.009004839727      14
2    0.5000949028      0.009028409036      26
4    0.5001216674      0.008221843135      35
5    0.5007429970      0.007923316336      10

5 repetitions were calculated.

            Error   Reached Threshold   Steps
4   0.0002241701811     0.009294280160      61
2   0.0004741186530     0.008171862296     193
5   0.2516368073472     0.006640846189      88
3   0.3556429122848     0.007036160421      46
1   0.5015928330534     0.009549108455      25
##########


Figure 23.4: (a) The fitted single hidden-layer neural net model to the toy
data. (b) The fitted double hidden-layer neural net model to the toy data.

In the above output, the first group of 5 repetitions pertain to the single
hidden-layer neural net. For those 5 repetitions, the third training sample
yielded the smallest error (about 0.3481). The second group of 5 repetitions


Figure 23.5: (a) Data from a simulated motorcycle accident where the time
until impact (in milliseconds) is plotted versus the recorded head acceleration
(in g). (b) The data with different values of  used for the support vector
regression obtained with an −insensitive loss function. Note how the smaller
the , the more features you pick up in the fit, but the complexity of the model
also increases.

pertain to the double hidden-layer neural net. For those 5 repetitions, the
fourth training sample yielded the smallest error (about 0.0002). The increase
in complexity of the neural net has yielded a smaller training error. The fitted
neural net models are depicted in Figure 23.4.

Example 2: Motorcycle Accident Data


This data set is from a simulated accident involving different motorcycles.
The time in milliseconds until impact and the g-force measurement of ac-
celeration are recorded. The data are provided in Table 23.2 and plotted in
Figure 23.5(a). Given the obvious nonlinear trend that is present with this
data, we will attempt to fit a support vector regression to this data.
A support vector regression using an ε-insensitive loss function is fitted to
this data. Fits with ε ∈ {0.01, 0.10, 0.70} are shown in Figure 23.5(b). As ε
decreases, different characteristics of the data are emphasized, but the level
of complexity of the model is increased. As noted earlier, we want to try and
strike a good balance regarding the model complexity. For the training error,
we get values of 0.177, 0.168, and 0.250 for the three levels of ε. Since our
objective is to minimize the training error, the value ε = 0.10 (which has a
training error of 0.168) is chosen. This corresponds to the green line in
Figure 23.5(b).
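A sketch of how the three support vector regressions might be fit with the e1071 package, assuming the data of Table 23.2 are in a data frame motorcycle with columns times and accel:

##########
library(e1071)
fits <- lapply(c(0.01, 0.10, 0.70), function(eps) {
  svm(accel ~ times, data = motorcycle, type = "eps-regression",
      kernel = "radial", epsilon = eps)                 # one fit per epsilon value
})
grid <- data.frame(times = seq(min(motorcycle$times), max(motorcycle$times),
                               length.out = 200))       # grid for plotting the curves
preds <- sapply(fits, predict, newdata = grid)          # fitted curves for the three fits
##########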


Obs. Times Accel. Obs. Times Accel. Obs. Times Accel.


1 2.4 0.0 33 18.6 -112.5 64 32.0 54.9
2 2.6 -1.3 34 19.2 -123.1 65 32.8 46.9
3 3.2 -2.7 35 19.4 -85.6 66 33.4 16.0
4 3.6 0.0 36 19.6 -127.2 67 33.8 45.6
5 4.0 -2.7 37 20.2 -123.1 68 34.4 1.3
6 6.2 -2.7 38 20.4 -117.9 69 34.8 75.0
7 6.6 -2.7 39 21.2 -134.0 70 35.2 -16.0
8 6.8 -1.3 40 21.4 -101.9 71 35.4 69.6
9 7.8 -2.7 41 21.8 -108.4 72 35.6 34.8
10 8.2 -2.7 42 22.0 -123.1 73 36.2 -37.5
11 8.8 -1.3 43 23.2 -123.1 74 38.0 46.9
12 9.6 -2.7 44 23.4 -128.5 75 39.2 5.4
13 10.0 -2.7 45 24.0 -112.5 76 39.4 -1.3
14 10.2 -5.4 46 24.2 -95.1 77 40.0 -21.5
15 10.6 -2.7 47 24.6 -53.5 78 40.4 -13.3
16 11.0 -5.4 48 25.0 -64.4 79 41.6 30.8
17 11.4 0.0 49 25.4 -72.3 80 42.4 29.4
18 13.2 -2.7 50 25.6 -26.8 81 42.8 0.0
19 13.6 -2.7 51 26.0 -5.4 82 43.0 14.7
20 13.8 0.0 52 26.2 -107.1 83 44.0 -1.3
21 14.6 -13.3 53 26.4 -65.6 84 44.4 0.0
22 14.8 -2.7 54 27.0 -16.0 85 45.0 10.7
23 15.4 -22.8 55 27.2 -45.6 86 46.6 10.7
24 15.6 -40.2 56 27.6 4.0 87 47.8 -26.8
25 15.8 -21.5 57 28.2 12.0 88 48.8 -13.3
26 16.0 -42.9 58 28.4 -21.5 89 50.6 0.0
27 16.2 -21.5 59 28.6 46.9 90 52.0 10.7
28 16.4 -5.4 60 29.4 -17.4 91 53.2 -14.7
29 16.6 -59.0 61 30.2 36.2 92 55.0 -2.7
30 16.8 -71.0 62 31.0 75.0 93 55.4 -2.7
31 17.6 -37.5 63 31.2 8.1 94 57.6 10.7
32 17.8 -99.1

Table 23.2: The motorcycle data.

Chapter 24

Advanced Topics

This chapter presents topics whose full development requires theory beyond
the scope of this course. The topics are not arranged in any particular order,
but rather are just a sample of some of the more advanced regression procedures
that are available. Not all computer software has the capability to perform
analyses of the models presented here.

24.1 Semiparametric Regression


Semiparametric regression is concerned with flexible modeling of nonlin-
ear functional relationships in regression analysis by building a model con-
sisting of both parametric and nonparametric components. We have already
visited a semiparametric model with the Cox proportional hazards model. In
this model, there is the baseline hazard, which is nonparametric, and then
the hazards ratio, which is parametric.
Suppose we have n = 200 observations where y is the response, x1 is
a predictor taking on only values of 1, 2, 3 or 4, and x2 , x3 and x4 are
predictors taking on values between 0 and 1. A semiparametric regression
model of interest for this setting is

yi = β0 + β1 z2,i + β2 z3,i + β3 z4,i + m(x2,i , x3,i , x4,i ),

where
zj,i = I{x1,i = j}.
In other words, we are using indicator variables for the levels of x1, leaving one level out as the baseline.


The results of fitting a semiparametric regression model are given in Figure
24.1. There are noticeable functional forms for x2 and x3; however, the
estimated function for x4 appears to be almost 0. In fact, this is exactly how
the data was generated. The data were generated according to:

yi = −5.15487 + e^{2x1,i} + 0.2 x2,i^{11} (10(1 − x2,i))^6 + 10 (10x2,i)^3 (1 − x2,i)^{10} + ei,

where the ei were generated according to a normal distribution with mean 0


and variance 4. Notice how the x4,i term was not used in the data generation,
which is reflected in the plot and in the significance of the smoothing term
from the output below:
##########
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9660 0.2939 13.494 < 2e-16 ***
x12 1.8851 0.4176 4.514 1.12e-05 ***
x13 3.8264 0.4192 9.128 < 2e-16 ***
x14 6.1100 0.4181 14.615 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Approximate significance of smooth terms:


edf Est.rank F p-value
s(x2) 1.729 4.000 25.301 <2e-16 ***
s(x3) 7.069 9.000 45.839 <2e-16 ***
s(x4) 1.000 1.000 0.057 0.811
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

R-sq.(adj) = 0.78 Deviance explained = 79.4%


GCV score = 4.5786 Scale est. = 4.2628 n = 200
##########
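A hedged sketch of a fit that produces output of this form, using the mgcv package and assuming a data frame dat containing y, the factor x1, and the continuous predictors x2, x3, and x4:

##########
library(mgcv)
fit <- gam(y ~ x1 + s(x2) + s(x3) + s(x4), data = dat)  # parametric factor plus smooth terms
summary(fit)           # parametric coefficients and smooth-term significance
plot(fit, pages = 1)   # estimated smooth functions (compare Figure 24.1)
##########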
There are actually many general forms of semiparametric regression mod-
els. We will list a few of them. In the following outline, X = (X1 , . . . , Xp )T
pertains to the predictors and may be partitioned such that X = (UT , VT )T
where U = (U1 , . . . , Ur )T , V = (V1 , . . . , Vs )T , and r + s = p. Also, m(·)
is a nonparametric function and g(·) is a link function as established in the
discussion on generalized linear models.

Figure 24.1: Semiparametric regression fits of the generated data: the estimated smooth terms s(x2, 1.73), s(x3, 7.07), and s(x4, 1) plotted against x2, x3, and x4, together with the partial effect of the factor x1.


Additive Models: In the model

E(Y | X) = β0 + ∑_{j=1}^p mj(Xj),

we have a fixed intercept term and wish to estimate p nonparametric
functions, one for each of the predictors.
Partial Linear Models: In the model

E(Y | U, V) = UT β + m(V),

we have the sum of a purely parametric part and a purely nonparametric
part, which involves parametric estimation routines and nonparametric
estimation routines, respectively. This is the type of model used
in the generation of the example given above.
Generalized Additive Models: In the model

E(Y | X) = g(β0 + ∑_{j=1}^p mj(Xj)),

we have the same setting as in an additive model, but a link function
relates the sum of functions to the response variable. This is the model
fitted to the example above.
Generalized Partial Linear Models: In the model

E(Y |U, V) = g(UT β + m(V)),

we have the same setting as in a partial linear model, but a link function
relates the sum of parametric and nonparametric components to the
response.
Generalized Partial Linear Partial Additive Models: In the model

E(Y | U, V) = g(UT β + ∑_{j=1}^s mj(Vj)),

we have the sum of a parametric component and the sum of s individual
nonparametric functions, but there is also a link function that relates
this sum to the response.


Another method (which is often a semiparametric regression model due to


its exploratory nature) is the projection pursuit regression method discussed
earlier. “Projection Pursuit” stands for a class of exploratory projection
techniques. This class contains statistical methods designed for analyzing
high-dimensional data using low-dimensional projections. The aim of pro-
jection pursuit regression is to reveal possible nonlinear relationships between
a response and a very large number of predictors, with the ultimate goal of
finding interesting structures hidden within the high-dimensional data.
To conclude this section, let us outline the general context of the three
classes of regression models:
Parametric Models: These models are fully determined up to a parame-
ter vector. If the underlying assumptions are correct, then the fitted
model can easily be interpreted and estimated accurately. If the as-
sumptions are violated, then fitted parametric estimates may provide
inconsistencies and misleading interpretations.

Nonparametric Models: These models provide flexible models and avoid


the restrictive parametric form. However, they may be difficult to in-
terpret and yield inaccurate estimates for a large number of regressors.

Semiparametric Models: These models combine parametric and nonpara-


metric components. They allow easy interpretation of the parametric
component while providing the flexibility of the nonparametric compo-
nent.

24.2 Random Effects Regression and Multilevel Regression
The next model we consider is not unlike growth curve models. Suppose
we have responses measured on each subject repeatedly. However, we no
longer assume that the same number of responses are measured for each
subject (such data is often called longitudinal data or trajectory data).
In addition, the regression parameters are now subject-specific parameters.
The regression parameters are considered random effects and are assumed
to follow their own distribution. Earlier, we only discussed the sampling
distribution of the regression parameters and the regression parameters were
assumed fixed (i.e., they were assumed to be fixed effects).



Figure 24.2: Scatterplot of the infant data with a trajectory (in this case, a
quadratic response curve) fitted to each infant.

As an example, consider a sample of 40 infants used in the study of a


habituation task. Suppose the infants were broken into four groups and
studied by four different psychologists. A similar habituation task is given to
the four groups, but the number of times it is performed in each group differs.
Furthermore, it is suspected that each infant will have its own trajectory
when a response curve is constructed. All of the data are presented in Figure
24.2 with a quadratic response curve fitted to each infant. When broken
down further, notice in Figure 24.3 how each group has a set of infants
with a different number of responses. Furthermore, notice how a different
trajectory was fit to each infant. Each of these trajectories has its own set
of regression parameter estimates.
Let us formulate the linear model for this setting. Suppose we have
i = 1, . . . , N subjects and each subject has a response vector yi which consists
of ni measurements (notice that n is subscripted by i to signify the varying
number of measurements nested within each subject - if all subjects have
the same number of measurements, then ni ≡ n). The random effects
regression model is given by:
yi = Xi βi + εi,



Figure 24.3: Plots for each group of infants where each group has a different
number of measurements.


where the Xi are known ni × p design matrices, the βi are regression parameters
for subject i, and the εi are ni × 1 vectors of random within-subject residuals
distributed independently as Nni(0, σ2 Ini×ni). Furthermore, the βi are
assumed to be multivariate normally distributed with mean vector µβ and
variance-covariance matrix Σβ. Given these assumptions, it can be shown
that the yi are marginally distributed as independent normals with mean
Xi µβ and variance-covariance matrix Xi Σβ XiT + σ2 Ini×ni.
Another regression model, not unrelated to random effects regression
models, involves imposing another model structure on the regression coef-
ficients. These are called multilevel (hierarchical) regression models.
For example, suppose the random effects regression model above is a sim-
ple linear case (i.e., β = (β0 β1 )T ). We may assume that the regression
coefficients in the random effects regression model above have the
following structure:
β i = α0 + α1 ui + δi .
In this regression relationship, we would also have observed the ui ’s and we
assume that δi ∼iid N (0, τ 2 ) for all i. Then, we estimate α0 and α1 directly
from the data. Note the hierarchical structure of this model, hence the name.
Estimation of the random effects regression model and the multilevel
regression model requires more sophisticated methods. Some common es-
timation methods include use of empirical or hierarchical Bayes estimates,
iteratively reweighted maximum marginal likelihood methods, and EM algo-
rithms. Various statistical intervals can also be constructed for these models.
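In practice, models of this type can be fit with mixed-model software; a hedged sketch using the lme4 package, with a made-up data frame infants containing columns response, time, and an infant identifier id:

##########
library(lme4)
fit <- lmer(response ~ time + I(time^2) + (time + I(time^2) | id),
            data = infants)   # subject-specific quadratic trajectories
summary(fit)                  # fixed effects (mu_beta) and variance components (Sigma_beta)
coef(fit)$id                  # estimated trajectory coefficients for each infant
##########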

24.3 Functional Linear Regression Analysis


Functional data consists of observations which can be treated as functions
rather than numeric vectors. One example is fluorescence curves used in pho-
tosynthesis research where the curve reflects the biological processes which
occur during the plant’s initial exposure to sunlight. Longitudinal data can
be considered a type of functional data such as taking repeated measurements
over time on the same subject (e.g., blood pressure or cholesterol readings).
Functional regression models are of the form
yi(t) = β(t)φ(xi) + εi(t),
where yi(t), β(t), and εi(t) represent the functional response, average curve,
and the error process, respectively. φ(xi) is a multiplicative effect modifying


the average curve according to the predictors. So each of the i = 1, . . . , n
trajectories (or functions) is observed at points t1, . . . , tk in time, where k is
large. In other words, we are trying to fit a regression surface for a collection
of functions (i.e., we actually observe trajectories and not individual data
points).
Functional regression models do sound similar to random effects regres-
sion models, but differ in a few ways. In random effects regression models, we
assume that each observation’s set of regression coefficients are random vari-
ables from some distribution. However, functional regression models do not
make distributions on the regression coefficients and are treated as separate
functions which are characterized by the densely sampled set of points over
t1 , . . . , tk . Also, random effects regression models easily accommodate tra-
jectories of varying dimensions, whereas this is not reflected in a functional
regression model.
Estimation of β(t) is beyond the scope of this discussion as it requires
knowledge of Fourier series and more advanced multivariate techniques. Fur-
thermore, estimation of β(t) is intrinsically an infinite-dimensional problem.
However, estimates found in the literature have been shown to possess desir-
able properties of an estimator. While there are also hypothesis tests avail-
able concerning these models, difficulties still exist with using these models
for prediction.
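
Although proper estimation of β(t) relies on basis expansions, a crude pointwise sketch in R conveys the idea: at each grid point tk we simply regress the observed curve values on φ(xi ). The simulated data, the choice φ(x) = x, and all object names below are hypothetical, and no smoothing over t is attempted.

##########
# Crude pointwise sketch for y_i(t) = beta(t) * phi(x_i) + e_i(t).
set.seed(501)
n <- 40                                  # number of observed curves
k <- 100                                 # grid points per curve
tgrid <- seq(0, 1, length.out = k)
x <- runif(n, 1, 3)                      # scalar predictor
phi <- function(x) x                     # hypothetical multiplicative effect
beta_true <- sin(2 * pi * tgrid)         # true coefficient function

# y is an n x k matrix; row i holds the trajectory y_i(t_1), ..., y_i(t_k).
y <- outer(phi(x), beta_true) + matrix(rnorm(n * k, sd = 0.2), n, k)

# Pointwise least squares at each t_k (no intercept, matching the model above).
beta_hat <- apply(y, 2, function(ycol) coef(lm(ycol ~ phi(x) - 1)))

plot(tgrid, beta_hat, type = "l", xlab = "t", ylab = "estimated beta(t)")
lines(tgrid, beta_true, lty = 2)         # true curve for comparison
##########

In practice one would smooth these pointwise estimates or expand β(t) in a Fourier or spline basis, which is exactly where the infinite-dimensional difficulties mentioned above arise.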

24.4 Mediation Regression


Consider the following research questions found in psychology:

• Will changing social norms about science improve children’s achievement in scientific disciplines?

• Can changes in cognitive attributions reduce depression?

• Does trauma affect brain stem activation in a way that inhibits mem-
ory?

Such questions suggest a chain of relations where a predictor variable affects another variable, which then affects the response variable. A mediation regression model attempts to identify and explicate the mechanism that underlies an observed relationship between an independent variable and a dependent variable, via the inclusion of a third explanatory variable called a mediator variable.
Instead of modeling a direct, causal relationship between the independent and dependent variables, a mediation model hypothesizes that the independent variable causes the mediator variable which, in turn, causes the dependent variable. Mediation models are generally utilized in the area of psychometrics, while other scientific disciplines (including statistics) have criticized the methodology. One such criticism is that sometimes the roles of the mediator variable and the dependent variable can be switched and yield a model which explains the data equally well, thus causing identifiability issues. The model we present has just one independent variable, one dependent variable, and one mediator variable. Models including more of any of these variables are possible to construct.
The following three regression models are used in our discussion:
1. Y = β0 + β1 X + 
2. Y = α0 + α1 X + α2 M + δ
3. M = θ0 + θ1 X + γ.
The first model is the simple linear regression model we are familiar with. This is the relationship between X and Y that we typically wish to study (in causal analysis, this is written as X → Y ). The second and third models show how we incorporate the mediator variable into this framework so that X causes the mediator M and M causes Y (i.e., X → M → Y ). So α1 is the coefficient relating X to Y adjusted for M , α2 is the coefficient relating M to Y adjusted for X, θ1 is the coefficient relating X to M , and ε, δ, and γ are error terms for the three relationships. Figure 24.4 gives a diagram showing these relationships, sans the error terms.
The mediated effect in the above models can be calculated in two ways: either as θ̂1 α̂2 or as β̂1 − α̂1 (the total effect minus the direct effect). There are various methods for estimating these coefficients, including those based on ordinary least squares and maximum likelihood theory. To test for significance, the chosen quantity (i.e., either θ̂1 α̂2 or β̂1 − α̂1 ) is divided by its standard error and the ratio is compared to a standard normal distribution. Thus, confidence intervals for the mediated effect are readily available by using the 100×(1−α/2)th -percentile of the standard normal distribution.
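
As an illustration of the calculations just described, the sketch below fits the three regressions by ordinary least squares in R and applies the normal (Sobel-type) approximation to the product-of-coefficients mediated effect. The data frame dat and its columns Y, X, and M are hypothetical.

##########
# Sketch of a simple mediation analysis with ordinary least squares.
fit1 <- lm(Y ~ X, data = dat)       # model 1: total effect beta_1
fit2 <- lm(Y ~ X + M, data = dat)   # model 2: alpha_1 (direct) and alpha_2
fit3 <- lm(M ~ X, data = dat)       # model 3: theta_1

theta1 <- coef(fit3)["X"]
alpha2 <- coef(fit2)["M"]
se_theta1 <- summary(fit3)$coefficients["X", "Std. Error"]
se_alpha2 <- summary(fit2)$coefficients["M", "Std. Error"]

# Mediated effect (product of coefficients) and its approximate standard error.
med <- theta1 * alpha2
se_med <- sqrt(theta1^2 * se_alpha2^2 + alpha2^2 * se_theta1^2)

z <- med / se_med
p <- 2 * pnorm(-abs(z))                          # normal approximation
ci <- med + c(-1, 1) * qnorm(0.975) * se_med     # 95% interval for the mediated effect

# The difference-in-coefficients estimate should be similar:
coef(fit1)["X"] - coef(fit2)["X"]
##########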
[Figure 24.4: Diagram showing the basic flow of a mediation regression model. The independent variable X affects the mediator variable M through θ1 , M affects the dependent variable Y through α2 , and X has a direct path to Y through α1 .]

Finally, the strength and form of mediated effects may depend on yet another variable. These variables, which affect the hypothesized relationship amongst the variables already in our model, are called moderator variables and are often tested as an interaction effect. An XM interaction in the second equation above that is significantly different from 0 suggests that the α2 coefficient differs across levels of X. These different coefficient levels may reflect mediation as a manipulation, thus altering the relationship between M and Y . The moderator variables may be either a manipulated factor in an experimental setting (e.g., dosage of medication) or a naturally occurring variable (e.g., gender). By examining moderator effects, one can investigate whether the experiment differentially affects subgroups of individuals. Three primary models involving moderator variables are typically studied:

Moderated mediation: The simplest of the three, this model has a vari-
able which mediates the effects of an independent variable on a depen-
dent variable, and the mediated effect depends on the level of another
variable (i.e., the moderator). Thus, the mediational mechanism differs
for subgroups of the study. This model is more complex from an inter-
pretative viewpoint when the moderator is continuous. Basically, you
have either X → M and/or M → Y dependent on levels of another
variable (call it Z).


Mediated moderation: This occurs when a mediator is intermediate in


the causal sequence from an interaction effect to a dependent variable.
The purpose of this model is to determine the mediating variables that
explain the interaction effect.

Mediated baseline by treatment moderation: This model is a special


case of the mediated moderation model. The basic interpretation of
the mediated effect in this model is that the mediated effect depends
on the baseline level of the mediator. This scenario is common in pre-
vention and treatment research, where the effects of an intervention are
often stronger for participants who are at higher risk on the mediating
variable at the time they enter the program.

24.5 Meta-Regression Models


In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses. In its simplest form, this is normally done by identifying a common measure of effect size, which is a descriptive statistic that quantifies the estimated magnitude of a relationship between variables without making any inherent assumption about whether such a relationship in the sample reflects a true relationship for the population. In a meta-analysis, a weighted average of the study-specific effect sizes might be used as the output, where the weighting might be related to the sample sizes of the individual studies. Typically, there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to estimate the true “effect size” more powerfully than can be done in a single study under a given set of assumptions and conditions.
Meta-regressions are similar in essence to classic regressions, in which
a response variable is predicted according to the values of one or more predic-
tor variables. In meta-regression, the response variable is the effect estimate
(for example, a mean difference, a risk difference, a log odds ratio or a log
risk ratio). The predictor variables are characteristics of studies that might
influence the size of intervention effect. These are often called potential
effect modifiers or covariates. Meta-regressions usually differ from simple
regressions in two ways. First, larger studies have more influence on the re-
lationship than smaller studies, since studies are weighted by the precision
of their respective effect estimate. Second, it is wise to allow for the residual heterogeneity among intervention effects not modeled by the predictor variables. This gives rise to the random-effects meta-regression, which we discuss below.
The regression coefficient obtained from a meta-regression analysis will
describe how the response variable (the intervention effect) changes with a
unit increase in the predictor variable (the potential effect modifier). The
statistical significance of the regression coefficient is a test of whether there
is a linear relationship between intervention effect and the predictor variable.
If the intervention effect is a ratio measure, the log-transformed value of the
intervention effect should always be used in the regression model, and the
exponential of the regression coefficient will give an estimate of the relative
change in intervention effect with a unit increase in the predictor variable.
Generally, three types of meta-regression models are commonplace in the
literature:
Simple meta-regression: This model can be specified as:
yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + ε,

where yi is the effect size in study i and β0 (i.e., the “intercept”) is the estimated overall effect size. The variables xi,j , for j = 1, . . . , (p − 1), specify different characteristics of the study and ε specifies the between-study variation. Note that this model does not allow specification of within-study variation.
Fixed-effect meta-regression: This model assumes that the true effect size δθ is distributed as N (θ, σθ2 ), where σθ2 is the within-study variance of the effect size. A fixed-effect1 meta-regression model thus allows for within-study variability, but no between-study variability, because all studies have the identical expected fixed effect size δθ ; i.e., ε = 0. This model can be specified as:

yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + ηi ,

where ση2i is the variance of the effect size in study i. Fixed-effect meta-regressions ignore between-study variation. As a result, parameter estimates are biased if between-study variation cannot be ignored.
Furthermore, generalizations to the population are not possible.
1 Note that for the “fixed-effect” model, no plural is used as only ONE true effect across all studies is assumed.


Random effects meta-regression: This model rests on the assumption that θ in N (θ, σθ2 ) is a random variable following a hyper-distribution N (µθ , ςθ2 ). The model can be specified as:

yi = β0 + β1 xi,1 + β2 xi,2 + . . . + βp−1 xi,p−1 + η + εi ,

where εi has variance σi2 , the variance of the effect size in study i, and the between-study variance ση2 is estimated using common estimation procedures for random effects models (such as restricted maximum likelihood (REML) estimators). A sketch of fitting such a model in software follows this list.
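
As the sketch referenced above, one way such a random effects meta-regression might be fit in R is with the metafor package, which takes the study effect sizes, their within-study variances, and study-level covariates. The data frame studies and its columns are hypothetical.

##########
# Sketch of a random effects meta-regression using metafor (assumed installed).
# 'studies' is a hypothetical data frame with one row per study:
#   yi   - estimated effect size (e.g., a log odds ratio)
#   vi   - within-study variance of that estimate
#   year, dose - study-level characteristics (potential effect modifiers)
library(metafor)

fit <- rma(yi = yi, vi = vi, mods = ~ year + dose,
           data = studies, method = "REML")  # REML for the between-study variance

summary(fit)                              # moderator coefficients and tau^2
predict(fit, newmods = cbind(2005, 10))   # predicted effect at chosen moderator values
##########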

24.6 Bayesian Regression


Bayesian inference is concerned with updating our model (which is based
on previous beliefs) as a result of receiving incoming data. Bayesian inference
is based on Bayes’ Theorem, which says for two events, A and B,

P(A|B) = P(B|A)P(A) / P(B).

We update our model by treating the parameter(s) of interest as a random


variable and defining a distribution for the parameters based on previous
beliefs (this distribution is called a prior distribution). This is multiplied
by the likelihood function of our model and then divided by the marginal
density function (which is the joint density function with the parameter
integrated out). The result is called the posterior distribution. Luckily,
the marginal density function is just a normalizing constant and does not
usually have to be calculated in practice.
For multiple linear regression, the ordinary least squares estimate

β̂ = (XT X)−1 XT y

is constructed from the frequentist’s view (along with the maximum like-
lihood estimate σ̂ 2 of σ 2 ) in that we assume there are enough measurements
of the predictors to say something meaningful about the response. In the
Bayesian view, we assume we have only a small sample of the possible
measurements and we seek to correct our estimate by “borrowing” informa-
tion from a larger set of similar observations.


The (conditional) likelihood is given as:

ℓ(y|X, β, σ 2 ) = (2πσ 2 )−n/2 exp{−(1/(2σ 2 ))ky − Xβk2 }.

We seek a conjugate prior (a prior which yields a joint density that is of the same functional form as the likelihood). Since the likelihood is quadratic in β, we re-write the likelihood so it is normal in (β − β̂). Write

ky − Xβk2 = ky − Xβ̂k2 + (β − β̂)T (XT X)(β − β̂).

Now rewrite the likelihood as

ℓ(y|X, β, σ 2 ) ∝ (σ 2 )−v/2 exp{−vs2 /(2σ 2 )} × (σ 2 )−(n−v)/2 exp{−(1/(2σ 2 ))kX(β − β̂)k2 },

where vs2 = ky − Xβ̂k2 and v = n − p with p as the number of parameters to estimate. This suggests a form for the priors:
π(β, σ 2 ) = π(σ 2 )π(β|σ 2 ).
The prior distributions are characterized by hyperparameters, which
are parameter values (often data-dependent) which the researcher specifies.
The prior for σ 2 (π(σ 2 )) is an inverse gamma distribution with shape hy-
perparameter α and scale hyperparameter γ. The prior for β (π(β|σ 2 )) is a
multivariate normal distribution with location and dispersion hyperparame-
ters β̄ and Σ. This yields the joint posterior distribution:
f (β, σ 2 |y, X) ∝ ℓ(y|X, β, σ 2 )π(β|σ 2 )π(σ 2 )
∝ σ −n−α exp{−(1/(2σ 2 ))(s̃ + (β − β̃)T (Σ−1 + XT X)(β − β̃))},

where

β̃ = (Σ−1 + XT X)−1 (Σ−1 β̄ + XT Xβ̂)
s̃ = 2γ + σ̂ 2 (n − p) + (β̄ − β̃)T Σ−1 β̄ + (β̂ − β̃)T XT Xβ̂.

Finally, it can be shown that the distribution of β|X, y is a multivariate-t distribution with n + α − p − 1 degrees of freedom such that:

E(β|X, y) = β̃
Cov(β|X, y) = s̃(Σ−1 + XT X)−1 /(n + α − p − 3).


Furthermore, the distribution of σ 2 |X, y is an inverse gamma distribution


with shape parameter n + α − p and scale parameter 0.5σ̂ 2 (n + α − p).
One can also construct statistical intervals based on draws simulated from a Bayesian posterior distribution. A 100 × (1 − α)% credible interval is constructed by taking the middle 100 × (1 − α)% of the values simulated from the parameter’s posterior distribution. The interpretation of these intervals is that there is a 100 × (1 − α)% chance that the true population parameter lies in the constructed 100 × (1 − α)% credible interval (which is how many people initially try to interpret confidence intervals).
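
To illustrate, one way of carrying out a Bayesian linear regression in R is with the MCMCpack package, which simulates from the posterior under a normal prior on β and an inverse gamma prior on σ 2 . The data frame dat, its variables, and the prior settings shown are all hypothetical examples rather than recommendations.

##########
# Sketch of Bayesian linear regression via MCMC using MCMCpack (assumed installed).
library(MCMCpack)

post <- MCMCregress(y ~ x1 + x2, data = dat,
                    b0 = 0,                  # prior mean for the coefficients
                    B0 = 0.001,              # prior precision (small = diffuse)
                    c0 = 0.001, d0 = 0.001,  # inverse gamma hyperparameters for sigma^2
                    burnin = 1000, mcmc = 10000)

summary(post)   # posterior means, standard deviations, and quantiles

# 95% credible intervals from the middle 95% of the posterior draws:
apply(post, 2, quantile, probs = c(0.025, 0.975))
##########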

24.7 Quantile Regression


The τ th quantile of a random variable X is the value of x such that

P(X ≤ x) = τ.

For example, if τ = 1/2, then the corresponding value of x would be the median. This concept of quantiles can also be extended to the regression setting.
In a quantile regression, we have a data set of size n with response y and predictors x = (x1 , . . . , xp−1 )T , and we seek the coefficient vector that minimizes the criterion

β̂ τ = arg min_β Σ_{i=1}^{n} ρτ (yi − µ(xTi β)),

where µ(·) is some parametric function and ρτ (·) is called the linear check function and is defined as

ρτ (x) = τ x − xI{x < 0}.

For linear regression, µ(xTi β) = xTi β.
We actually encountered quantile regression earlier. Least absolute devi-
ations regression is just the case of quantile regression where τ = 1/2.
Figure 24.5 gives a plot relating food expenditure to a family’s monthly
household income. Overlaid on the plot is a dashed red line which gives the
ordinary least squares fit. The solid blue line is the least absolute deviation fit
(i.e., τ = 0.50). The gray lines (from bottom to top) are the quantile regres-
sion fits for τ = 0.05, 0.10, 0.25, 0.75, 0.90, and 0.95, respectively. Essentially,
what this says is that if we looked at those households with the highest food
expenditures, they will likely have larger regression coefficients (such as the τ = 0.95 regression quantile) while those with the lowest food expenditures will likely have smaller regression coefficients (such as the τ = 0.05 regression quantile).

[Figure 24.5: Quantile regression fits for the food expenditure data, plotting food expenditure against household income; the dashed red line is the mean (LSE) fit and the solid blue line is the median (LAE) fit.]

The estimates for each of these quantile regressions are as follows:

##########
Coefficients:
tau= 0.05 tau= 0.10 tau= 0.25
(Intercept) 124.8800408 110.1415742 95.4835396
x 0.3433611 0.4017658 0.4741032
tau= 0.75 tau= 0.90 tau= 0.95
(Intercept) 62.3965855 67.3508721 64.1039632
x 0.6440141 0.6862995 0.7090685

Degrees of freedom: 235 total; 233 residual


##########

Estimation for quantile regression can be done through linear programming or other optimization procedures. Furthermore, statistical intervals can also be computed.
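
The output above is consistent with the rq() function in the quantreg package applied to the Engel food expenditure data (235 households). Assuming that setup, fits like those shown might be obtained with the following sketch.

##########
# Sketch of quantile regression fits with the quantreg package (assumed installed).
library(quantreg)
data(engel)   # foodexp (food expenditure) and income for 235 households

taus <- c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)
fit <- rq(foodexp ~ income, tau = taus, data = engel)  # one fit per value of tau
coef(fit)    # matrix of intercepts and slopes, one column per quantile

# Ordinary least squares (mean) fit for comparison with the tau = 0.5 fit:
lm(foodexp ~ income, data = engel)
##########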


24.8 Monotone Regression


Suppose we have a set of data (x1 , y1 ), . . . , (xn , yn ). For ease of notation, let
us assume there is already an ordering on the predictor variable. Specifically,
we assume that x1 ≤ . . . ≤ xn . Monotonic regression is a technique where
we attempt to find a weighted least squares fit of the responses y1 , . . . , yn to
a set of scalars a1 , . . . , an with corresponding weights w1 , . . . , wn , subject to
monotonicity constraints giving a simple or partial ordering of the responses.
In other words, the fitted responses are supposed to increase (or decrease) as the predictor increases, and the regression line we fit is piecewise constant (which resembles a step function). The weighted least squares problem for monotonic regression is given by the following quadratic program:
arg min_a Σ_{i=1}^{n} wi (yi − ai )2

and is subject to one of two possible constraints. If the direction of the trend is to be monotonically increasing, then the process is called isotonic regression and the constraint is ai ≥ aj for all i > j where this ordering is true. If the direction of the trend is to be monotonically decreasing, then the process is called antitonic regression and the constraint is ai ≤ aj for all i > j where this ordering is true. More generally, one can also perform monotonic regression under Lp for p > 0:
arg min_a Σ_{i=1}^{n} wi |yi − ai |p ,

with the appropriate constraints imposed for isotonic or antitonic regression.


Monotonic regression does have its place in statistical inference. For example, astronomy data sets may contain gamma-ray burst flux measurements taken over time. On the log scale, one can identify an area of “flaring”, which is an area where the flux measurements are observed to increase. Such an area could be fit using an isotonic regression.
An example of an isotonic regression fitted to a made-up data set is given in Figure 24.6. The top plot gives the actual isotonic regression fit; the horizontal lines represent the values of the scalars minimizing the weighted least squares problem given earlier. The bottom plot shows the cumulative sums of the responses plotted against the predictors. The piecewise regression line which is plotted is called the convex minorant. Each predictor value where this convex minorant intersects the cumulative sum is the same value of the predictor where the slope changes in the isotonic regression plot.

[Figure 24.6: An example of an isotonic regression fit. The top panel (“Isotonic Regression”) plots the data and the fitted step function; the bottom panel (“Cumulative Data and Convex Minorant”) plots the cumulative sums of the responses with the convex minorant overlaid.]
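
Base R’s isoreg() function computes the (unweighted) isotonic least squares fit via the pool adjacent violators algorithm. The sketch below uses made-up data; whether the fit in Figure 24.6 was produced this way is not stated in the text.

##########
# Sketch of an isotonic regression fit with base R's isoreg() (unweighted case).
set.seed(501)
x0 <- 1:10
y  <- x0 / 2 + rnorm(10, sd = 1)   # made-up, roughly increasing data

fit <- isoreg(x0, y)   # pool adjacent violators; monotone non-decreasing fit
fit$yf                 # fitted step-function values (the a_i in the criterion)
plot(fit)              # data with the piecewise-constant isotonic fit overlaid

# An antitonic (decreasing) fit can be obtained by negating the response:
anti <- -isoreg(x0, -y)$yf
##########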

24.9 Spatial Regression


Suppose an econometrician is trying to quantify the price of a house. In
doing so, he will surely need to incorporate neighborhood effects (e.g., how
much is the house across the street valued at as well as the one next door?)
However, the house prices in an adjacent neighborhood may also have an
impact on the price, but house prices in the adjacent county will likely not.
The framework for such modeling is likely to incorporate some sort of spatial
effect as houses nearest to the home of interest are likely to have a greater
impact on the price while homes further away will have a smaller or negligible
impact.
Spatial regression deals with the specification, estimation, and diag-
nostic analysis of regression models which incorporate spatial effects. Two broad classes of spatial effects are often distinguished: spatial heterogeneity and spatial dependency. We will provide a brief overview of both types of effects, but it should be noted that we will only skim the surface of what is a very rich area.
A spatial regression model reflecting spatial heterogeneity is written locally as

Y = Xβ(g) + ε,

where g indicates that the regression coefficients are to be estimated locally at the coordinates specified by g and ε is an error term distributed with mean 0 and variance σ 2 . This model is called geographically weighted
regression or GWR. The estimation of β(g) is found using a weighting
scheme such that

β̂(g) = (XT W(g)X)−1 XT W(g)Y.

The weights in the geographic weighting matrix W(g) are chosen such that
those observations near the point in space where the parameter estimates
are desired have more influence on the result than those observations further
away. This model is essentially a local regression model like the one discussed
in the section on LOESS. While the choice of a geographic (or spatially) weighted matrix is a blend of art and science, one commonly used weight is the Gaussian weight function, where the diagonal entries of the n × n matrix W(g) are:

wi (g) = exp{−di2 /(2h2 )},

where di is the Euclidean distance between observation i and location g, while h is the bandwidth.
The resulting parameter estimates or standard errors for the spatial het-
erogeneity model may be mapped in order to examine local variations in the
parameter estimates. Hypothesis tests are also possible regarding this model.
Spatial regression models also accommodate spatial dependency in two
major ways: through a spatial lag dependency (where the spatial correlation
occurs in the dependent variable) or a spatial error dependency (where the
spatial correlation occurs through the error term). A spatial lag model is
a spatial regression model which models the response as a function of not
only the predictors, but also values of the response observed at other (likely
neighboring) locations:

yi = f (yj(i) ; θ) + XTi β + εi ,
where j(i) is an index including all of the neighboring locations j of i such that i ≠ j. The function f can be very general, but typically is simplified by using a spatially weighted matrix (as introduced earlier).
Assuming a spatially weighted matrix W(g) which has row-standardized spatial weights (i.e., Σ_{j=1}^{n} wi,j = 1), we obtain a mixed regressive spatial autoregressive model:

yi = ρ Σ_{j=1}^{n} wi,j yj + XTi β + εi ,

where ρ is the spatial autoregressive coefficient. In matrix notation, we have

Y = ρW(g)Y + Xβ + ε.

The proper solution to the equation for all observations requires (after some matrix algebra)

Y = (In×n − ρW)−1 Xβ + (In×n − ρW)−1 ε

to be solved simultaneously for β and ρ.
The inclusion of a spatial lag is similar to a time series model, although
with a fundamental difference. Unlike time dependency, a spatial depen-
dency is multidirectional, implying feedback effects and simultaneity. More
precisely, if i and j are neighboring locations, then yj enters on the right-
hand side in the equation for yi , but yi also enters on the right-hand side in
the equation for yj .
In a spatial error model, the spatial autocorrelation does not enter as an additional variable in the model, but rather enters only through its effects on the covariance structure of the random disturbance term. In other words, Var(ε) = Σ such that the off-diagonals of Σ are not 0. One common way to model the error structure is through direct representation, which is similar to the weighting scheme used in GWR. In this setting, the off-diagonals of Σ are given by σi,j = σ 2 g(di,j , φ), where again di,j is the Euclidean distance between locations i and j and φ is a vector of parameters which may include a bandwidth parameter.
Another way to model the error structure is through a spatial process, such as specifying the error terms to have a spatial autoregressive structure as in the spatial lag model from earlier:

ε = λW(g)ε + u,
where u is a vector of random error terms. Other spatial processes exist, such as a conditional autoregressive process and a spatial moving average process, both of which resemble their time series counterparts.
Estimation of these spatial regression models can be accomplished through
various techniques, but they differ depending on if you have a spatial lag de-
pendency or a spatial error dependency. Such estimation methods include
maximum likelihood estimation, the use of instrumental variables, and semi-
parametric methods.
There are also tests for the spatial autocorrelation coefficient, of which the most notable uses Moran’s I statistic. Moran’s I statistic is calculated as

I = (eT W(g)e/S0 ) / (eT e/n),

where e is a vector of ordinary least squares residuals, W(g) is a geographic weighting matrix, and S0 = Σ_{i=1}^{n} Σ_{j=1}^{n} wi,j is a normalizing factor. Then,
Moran’s I test can be based on a normal approximation using a standardized value of the I statistic such that

E(I) = tr(MW)/(n − p)
and
Var(I) = (tr(MWMWT ) + tr(MWMW) + [tr(MW)]2 ) / ((n − p)(n − p + 2)),
where M = In×n − X(XT X)−1 XT .
As an example, let us consider 1978 house prices in Boston, to which we will try to fit a spatial regression model with spatial error dependency. There are 20 variables measured for 506 locations. Certain transformations of the predictors have already been performed, following the original investigators’ analysis, and only 13 of the predictors are of interest. First, a test on the spatial autocorrelation coefficient is performed:

##########
Global Moran’s I for regression residuals

data:
model: lm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS +
I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX +
PTRATIO + B + log(LSTAT), data = boston.c)

weights: boston.listw

Moran I statistic standard deviate = 14.5085, p-value < 2.2e-16


alternative hypothesis: two.sided
sample estimates:
Observed Moran’s I Expectation Variance
0.4364296993 -0.0168870829 0.0009762383
##########
As can be seen, the p-value is very small, indicating significant spatial autocorrelation in the regression residuals.
Next, we attempt to fit a spatial regression model with spatial error de-
pendency including those variables that the investigator specified:
##########
Call:errorsarlm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS
+ I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX
+ PTRATIO + B + log(LSTAT), data = boston.c,
listw = boston.listw)

Residuals:
Min 1Q Median 3Q Max
-0.6476342 -0.0676007 0.0011091 0.0776939 0.6491629

Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.85706025 0.16083867 23.9809 < 2.2e-16
CRIM -0.00545832 0.00097262 -5.6120 2.000e-08
ZN 0.00049195 0.00051835 0.9491 0.3425907
INDUS 0.00019244 0.00282240 0.0682 0.9456389
CHAS1 -0.03303428 0.02836929 -1.1644 0.2442466
I(NOX^2) -0.23369337 0.16219194 -1.4408 0.1496286
I(RM^2) 0.00800078 0.00106472 7.5145 5.707e-14
AGE -0.00090974 0.00050116 -1.8153 0.0694827
log(DIS) -0.10889420 0.04783714 -2.2764 0.0228249
log(RAD) 0.07025730 0.02108181 3.3326 0.0008604
TAX -0.00049870 0.00012072 -4.1311 3.611e-05
PTRATIO -0.01907770 0.00564160 -3.3816 0.0007206
B 0.00057442 0.00011101 5.1744 2.286e-07
log(LSTAT) -0.27212781 0.02323159 -11.7137 < 2.2e-16

Lambda: 0.70175 LR test value: 211.88 p-value: < 2.22e-16


Asymptotic standard error: 0.032698
z-value: 21.461 p-value: < 2.22e-16
Wald statistic: 460.59 p-value: < 2.22e-16

Log likelihood: 255.8946 for error model


ML residual variance (sigma squared): 0.018098, (sigma: 0.13453)
Number of observations: 506
Number of parameters estimated: 16
AIC: -479.79, (AIC for lm: -269.91)
##########
As can be seen, there are some predictors that do not appear to be significant.
Model selection procedures can be employed or other transformations can be
tried in order to improve the fit of this model.
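
The output above is consistent with the lm.morantest() and errorsarlm() functions historically provided by the spdep package (the model-fitting functions now live in spatialreg), with the Boston data as packaged in spData. Assuming that setup, the analysis might be reproduced roughly as in the sketch below; package and data locations are assumptions.

##########
# Sketch of the Boston spatial error analysis (package/data locations assumed).
library(spdep)        # spatial weights and Moran tests
library(spatialreg)   # errorsarlm(), formerly in spdep
library(spData)       # boston.c data frame and boston.soi neighbour list
data(boston)

boston.listw <- nb2listw(boston.soi)   # row-standardized spatial weights

ols <- lm(log(MEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) + AGE +
            log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT),
          data = boston.c)

lm.morantest(ols, boston.listw)   # Moran's I test on the OLS residuals

err.fit <- errorsarlm(formula(ols), data = boston.c, listw = boston.listw)
summary(err.fit)                  # lambda is the spatial error coefficient
##########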

24.10 Circular Regression


A circular random variable is one which takes values on the circumference of a circle (i.e., the angle is in the range of (0, 2π) radians or (0◦ , 360◦ )). A
circular-circular regression is used to determine the relationship between
a circular predictor variable X and a circular response variable Y . Circular
data occurs when there is periodicity to the phenomena at hand or where
there are naturally angular measurements. An example could be determining
the relationship between wind direction measurements (the response) on an
aircraft and wind direction measurements taken by radar (the predictor).
Another related model is one where only the response is a circular variable
while the predictor is linear. This is called a circular-linear regression.
Both types of circular regression models can be given by

yi = β0 + β1 xi + εi (mod 2π).

The expression εi (mod 2π) is read as εi modulus 2π and is a way of expressing the remainder when εi is divided by 2π (for example, 11 (mod 7) = 4 because 11 divided by 7 leaves a remainder of 4). In this model, εi is a circular random error assumed to follow a von Mises distribution with circular mean 0 and concentration parameter κ. The von Mises distribution is the circular analog of the univariate normal distribution, but has a more “complex” form. The
von Mises distribution with circular mean µ and concentration parameter κ
is defined on the range x ∈ [0, 2π), with probability density function

f (x) = eκ cos(x−µ) / (2πI0 (κ))

and cumulative distribution function

F (x) = (1/(2πI0 (κ))) [ xI0 (κ) + 2 Σ_{j=1}^{∞} Ij (κ) sin(j(x − µ))/j ].

In the above, Ip (·) is called a modified Bessel function of the first kind of order p. The Bessel function is the contour integral

Ip (z) = (1/(2πi)) ∮ e(z/2)(t−1/t) t−(p+1) dt,

where the contour encloses the origin and traverses in a counterclockwise direction in the complex plane, such that i = √−1. Maximum likelihood
estimates can be obtained for the circular regression models (with minor differences in the details when dealing with a circular predictor or a linear predictor). Needless to say, such formulas do not lend themselves well to closed-form solutions. Thus we turn to numerical methods, which go beyond the scope of this course.
As an example, suppose we have a data set of size n = 100 where Y is a circular response and X is a continuous predictor (so a circular-linear regression model will be built). The error terms are assumed to follow a von Mises distribution with circular mean 0 and concentration parameter κ (for this generated data, κ = 1.9). The error terms used in the generation of this data can be plotted on a circular histogram as given in Figure 24.7(a). Estimates for the circular-linear regression fit are given below:
##########
Circular-Linear Regression

Coefficients:
Estimate Std. Error t value Pr(>|t|)

[1,]   6.7875     1.1271   6.022  8.61e-10 ***
[2,]   0.9618     0.2223   4.326  7.58e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Log-Likelihood: 55.89

Summary: (mu in radians)
mu: 0.4535 ( 0.08698 )   kappa: 1.954 ( 0.2421 )
p-values are approximated using normal distribution
##########

[Figure 24.7: (a) Circular plot of the von Mises error terms used in the generation of the sample data. (b) Plot of the continuous predictor (X) versus the circular response (Y ) along with the circular-linear regression fit.]

Notice that the maximum likelihood estimates of µ and κ are 0.4535 and
1.954, respectively. Both estimates are close to the values used for gener-
ation of the error terms. Furthermore, the values in parentheses next to
these estimates are the standard errors for the estimates - both of which are
relatively small.
A rough way of looking at the data and estimated circular-linear regression equation is given in Figure 24.7(b). This is difficult to display since we are trying to look at a circular response versus a continuous predictor. Packages specific to circular regression modeling provide better graphical alternatives.
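
The output above resembles that of the lm.circular() function in the circular package. Assuming that package, a circular-linear fit to simulated data might look like the sketch below; note that the package parameterizes the mean direction through a link function, so its estimates need not coincide exactly with the generating values of the textbook model.

##########
# Sketch of a circular-linear regression using the circular package (assumed installed).
library(circular)

set.seed(501)
n <- 100
x <- runif(n, -2, 2)                                            # continuous predictor
eps <- as.numeric(rvonmises(n, mu = circular(0), kappa = 1.9))  # von Mises errors
y <- circular((6.8 + 1.0 * x + eps) %% (2 * pi))                # circular response (mod 2*pi)

# type = "c-l" requests a circular response with a linear predictor;
# init supplies starting values for the regression coefficient(s).
fit <- lm.circular(y = y, x = x, init = 1, type = "c-l")
fit$coefficients   # estimated regression coefficient(s)
fit$mu             # estimated circular mean of the errors
fit$kappa          # estimated concentration parameter
##########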

24.11 Mixtures of Regressions


Consider a large data set consisting of the heights of males and females.
When looking at the distribution of this data, the data for the males will (on
average) be higher than that of the females. A histogram of this data would
clearly show two distinct bumps or modes. Knowing the gender labels of each
subject would allow one to account for that subgroup in the analysis being
used. However, what happens if the gender label of each subject were lost? In other words, we don’t know which observation belongs to which gender. The setting where data appear to be from multiple subgroups, but there is no label providing such identification, is the focus of the area called mixture modeling.

[Figure 24.8: (a) Plot of spark-ignition engine fuel data with equivalence ratio as the response and the measure of nitrogen oxide emissions (NO) as the predictor. (b) Plot of the same data with EM algorithm estimates from a 2-component mixture of regressions fit.]

There are many issues one should be cognizant of when building a mixture
model. In particular, maximum likelihood estimation can be quite complex
since the likelihood does not yield closed-form solutions and there are iden-
tifiability issues (however, the use of a Newton-Raphson or EM algorithm
usually provides a good solution). One alternative is to use a Bayesian ap-
proach with Markov Chain Monte Carlo (MCMC) methods, but this too has
its own set of complexities. While we do not explore these issues, we do see
how a mixture model can occur in the regression setting.
A mixture of linear regressions model can be used when it appears that there is more than one regression line that could fit the data due to some underlying characteristic (i.e., a latent variable). Suppose we have n observations, each of which belongs to one of k groups. If we knew to which group an observation belonged (i.e., its label), then we could write down explicitly the linear regression model given that observation i belongs to group j:
yi = XTi β j + εij ,

such that εij is normally distributed with mean 0 and variance σj2 . Notice how the regression coefficients and variance terms are different for each group. However, now assume that the labels are unobserved. In this case, we can only assign a probability that observation i came from group j. Specifically, the density function for the mixture of linear regressions model is:
f (yi ) = Σ_{j=1}^{k} λj (2πσj2 )−1/2 exp{−(yi − XTi β j )2 /(2σj2 )},

such that Σ_{j=1}^{k} λj = 1. Estimation is done by using the likelihood (or rather the log-likelihood) function based on the above density. For maximum likelihood,
one typically uses an EM algorithm.
As an example, consider the data set which gives the equivalence ratios and peak nitrogen oxide emissions in a study using pure ethanol as a spark-ignition engine fuel. A plot of the equivalence ratios versus the measure of nitrogen oxide is given in Figure 24.8(a). Suppose one wanted to predict the equivalence ratio from the amount of nitrogen oxide emissions. As you can see, there appear to be groups of data where separate regressions appear appropriate (one with a positive trend and one with a negative trend). Figure 24.8(b) gives the same plot, but with estimates from an EM algorithm overlaid. EM algorithm estimates for this data are β̂ 1 = (0.565 0.085)T , β̂ 2 = (1.247 − 0.083)T , σ̂12 = 0.00188, and σ̂22 = 0.00058.
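
The estimates quoted above are in line with what an EM fit from the mixtools package produces. Assuming the NO data set that ships with mixtools (the ethanol equivalence ratio and nitrogen oxide measurements), a two-component mixture of regressions might be fit as in the sketch below.

##########
# Sketch of a 2-component mixture of linear regressions fit by the EM algorithm
# using the mixtools package (assumed installed) and its NOdata example data.
library(mixtools)
data(NOdata)     # columns NO (nitrogen oxide) and Equivalence (equivalence ratio)

set.seed(501)    # EM results can depend on the random starting values
fit <- regmixEM(y = NOdata$Equivalence, x = NOdata$NO, k = 2)

fit$beta              # regression coefficients, one column per component
fit$sigma             # component error standard deviations (not variances)
fit$lambda            # estimated mixing proportions
head(fit$posterior)   # posterior probabilities of component membership
##########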
It should be noted that mixtures of regressions appear in many areas.
For example, in economics it is called switching regimes. In the social sciences it is called latent class regression. As we saw earlier, the neural network terminology calls this model (without the hierarchical structure) the mixture-of-experts problem.
