
The Pragmatist's Guide to Statistics: logistic regression

2 Logistic Regression Diagnostics[1]

[1] Notes prepared by Neil H. Willits, Statistical Laboratory, Division of Statistics, University of
California, Davis. Single copies may be reproduced for personal use at no charge. Other uses require
consent of the author.

There are two things that I want to accomplish in this section. First, while we talked briefly
last time about how to assess the fit of a logistic regression model, that discussion was far from
complete. The major packages that carry out logistic regression calculations provide different types
of diagnostic information, and I want to make the case that both types are very useful, and that any
thorough analysis should probably include both.
Second, while I presented a couple of logistic regression models in the last section, I didn't say
anything about how to request this type of analysis. Since I want to discuss the output that comes
from two different packages (SAS and BMDP), I also need to discuss the control language that's
needed for each of those packages. I'll go into the control language, but first I want to discuss some
of the methods that can be used to assess the fit and robustness of a logistic regression model.
2.1 Introduction to Diagnostics
I started the previous section (A Backward Introduction to Logistic Regression) by looking at a
couple of examples of both linear and nonlinear regression. The reason I did this is that there
are both similarities and dissimilarities between logistic regression and least squares regression, and
it's important to keep in mind where the models are similar and where they aren't. The analogy
carries over to what I'm going to discuss in this section.
The calculation of a (linear) least squares model is computationally pretty routine, in that you
can do this type of calculation on pretty much any data set, and you'll come up with a least squares
predictive equation (or model) for the response (dependent variable) in terms of the predictors
(independent variables). The appropriateness or validity of the predictive equation that comes out
of this calculation doesn't follow as routinely, however. To get a clue as to whether the regression
equation you just calculated is a good one, you have to look more closely at your calculated model,
and in particular at the residuals from the analysis. This procedure of looking more closely at a
regression model is often called "performing regression diagnostics." Since the regression diagnostics
for a logistic regression are going to involve some of the same ideas as for a least squares regression,
it's worth starting by looking at a couple of linear regression models.
2.2 Least Squares Regression Diagnostics: a Couple of Examples
The only reason I'm looking at more than one example here is that there are some aspects of this
type of analysis that are easier to see when there's only one predictor, and others that are easier
to see when there are several predictors. I'll look at the simplest example first.
This example is a bit silly. I looked at (most of) the years from 1966 to 1994 and looked at how
the earnings of the movie that had the greatest box office ticket sales (in dollars) changed over the
years. The data (the log of the ticket sales against the year) are plotted in Figure 2.1, along with
the least squares regression line.

[Figure 2.1: Movie earnings by year -- log($) of the top-grossing movie plotted against year, with the least squares regression line.]

In a linear model like this, you can run into problems with the assumptions of the model, or
with the data itself. The assumptions of a linear model are:

- the residuals from the model are normally distributed,
- the residuals have a constant residual variance,
- the residuals are independent, and
- the residuals have a mean of zero.

This last assumption is equivalent to assuming that the linear form of the model that you've
specified is in fact correct, so that the response really is linear in the predictors. An old-fashioned
but quite common method for testing whether there are trends in the residuals is to run a Durbin-
Watson test. It looks at whether positive residuals tend to be clustered together, as you might
expect if the underlying relationship wasn't specified correctly. For this data set, the value of the
Durbin-Watson statistic was 1.666, whereas for this sample size, you would reject only for values
below 1.32 (cf. Draper and Smith, Applied Regression Analysis, 2nd edition, p. 164).
You can examine the normality assumption by constructing a normal scores (or Q-Q) plot of
the residuals. On such a plot, the points should form roughly a straight line if the residuals are
normal. To make this idea more concrete (more quantitative), you can run a hypothesis test on
the residuals, such as a Wilk-Shapiro test, a Kolmogorov-Smirnov test, or a χ² goodness of fit test.
Of the three, the χ² and Kolmogorov-Smirnov tests are the most general, though this isn't
necessarily a good thing. For moderate sample sizes, the Wilk-Shapiro test will often be more
sensitive than the other two for detecting the sorts of non-normality that will cause problems for
the analysis. Of course, when the sample size is very small, none of these tests is able to detect
any but the most obvious violations of the normality assumption. For this data set, a Wilk-Shapiro
test was run, and it wasn't even close to being significant (p = .8982).
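To make these checks concrete, here is a minimal sketch of the same two calculations done with a
general-purpose tool (Python with numpy, scipy, and statsmodels), using simulated stand-in data
rather than the movie earnings themselves:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(0)
year = np.arange(1966, 1995)                        # stand-in predictor
log_sales = 7.0 + 0.04 * (year - 1966) + rng.normal(0, 0.2, year.size)

fit = sm.OLS(log_sales, sm.add_constant(year)).fit()
resid = fit.resid

print("Durbin-Watson:", durbin_watson(resid))       # values near 2 suggest no clustering
w, p = stats.shapiro(resid)                          # Wilk-Shapiro test of normality
print("Wilk-Shapiro: W = %.3f, p = %.4f" % (w, p))

A Durbin-Watson value near 2 is consistent with uncorrelated residuals; values well below 2 are the
kind of clustering that the test is designed to flag.
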
As for the remaining assumptions, there are also both graphical and quantitative methods for
looking at the constant variance (homoscedasticity) assumption, though in a regression setting,
this can be hampered by the lack of true replication in the data. A Levene test run on this
data was insignificant (p = .8097), indicating that there are at least no obvious problems with this
assumption. Finally, there's usually not much that you can do about the independence assumption,
other than to come up with a hand-waving argument that different data points can't affect each
other.
These methods of examining the residuals are general enough that they can be used for multiple
regression problems as well (i.e., when there are several predictors in the model).
As I mentioned above, individual observations can also cause problems, either because the
response is abnormal relative to the rest of the data, or else because the predictors are abnormal
relative to the rest of the data. To look at the responses, you can calculate standardized residuals,
which indicate how many standard deviations a given observation lies above or below the regression
line. This is a little more complicated than looking for the largest residuals, since the regression
line itself is less variable for values of the predictors that are in the body of the data (near the
mean value of the predictors) than it is at the extremes of the range of the predictors. In addition,
a standardized residual can be calculated either relative to the regression line calculated from the
entire data set, or relative to a regression line that was calculated after removing the data point
that's under consideration. An examination of the standardized residuals showed that there were
two that were larger than 2 in absolute value. (Hawaii was the leading movie in 1966, earning
$15.5 million, which was low even by the standards of the day. Its standardized residual was
-2.136. Star Wars was the leading movie in 1977, and its gross receipts ($193.5 million at the
time) were surprisingly large relative to the rest of the data. Its standardized residual was 2.145.)
Neither of these standardized residuals is all that extreme, since the probability of getting a Z-value
greater than 2.14 is still greater than .01 for any one of the residuals.
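Both flavors of standardized residual are easy to extract from a general-purpose package. The
sketch below (again on simulated stand-in data) pulls the internally studentized residuals, which are
scaled against the full-data fit, and the externally studentized residuals, which are scaled against a
fit that leaves the point in question out:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.arange(1966, 1995)
y = 7.0 + 0.04 * (x - 1966) + rng.normal(0, 0.2, x.size)

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()

internal = infl.resid_studentized_internal    # scaled using the full-data fit
external = infl.resid_studentized_external    # scaled using a fit that omits the point

worst = int(np.argmax(np.abs(internal)))
print("largest |standardized residual| is case %d: internal %.2f, external %.2f"
      % (worst, internal[worst], external[worst]))
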
Finally, an observation can cause problems because its predictors are far away from the bulk of
the data. Abnormal predictors aren't bad per se, but the model can't afford to ignore points like
that completely, and consequently they will have a lot of influence in determining the estimated
parameters. There are two reasons for wanting to identify influential points. First, you may not
feel comfortable about extrapolating the model all the way out to those extreme situations, perhaps
because you're not completely certain that the model is linear everywhere. Second, for such an
extreme point, it's often difficult to tell whether the response is strange, since the model will go out
of its way to fit those points well, regardless of the value of the response. By identifying influential
points, you can give them extra scrutiny to determine critically whether they belong in the model
or not. With only one predictor, such a strange observation could only have an abnormally high
or abnormally low predictor, and it's pretty easy to pick points like this out on the graph. As we'll
see, with more predictors in the model, you may need more help with this. Since the independent
variable in this data set was for the most part consecutive years from 1966 to 1994, there aren't
any extreme points like this. If I had changed the data set to include the highest grossing movie
of 1944 (Sergeant York), for example, then it might be viewed as an extreme point. There are a
number of methods for picking out points like this. One is the leverage, h_j, which is defined as

    h_j = x_j (X'X)^(-1) x_j',

where x_j is the (row) vector of predictors for the j-th observation, and X is the matrix of predictors
for all observations. Other closely-related measures of the influence of an observation are Cook's distance,
predicted residuals, or a method devised by Andrews and Pregibon.[2] These methods differ primarily
in whether the measure of influence is standardized, and in whether the observation whose influence
is being assessed is included in the linear model calculations. For a better sense of these influence
diagnostics, I'm going to look at a second example.
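Before turning to that example, here is a small numpy sketch (a toy illustration of my own, not the
movie or baseball data) of the leverage formula above together with Cook's distance:

import numpy as np

def leverages(X):
    """Diagonal of the hat matrix X (X'X)^(-1) X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def cooks_distance(X, y):
    """Cook's distance for each observation of a least squares fit."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    h = leverages(X)
    s2 = resid @ resid / (n - p)                   # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1 - h) ** 2

# toy data: a straight line with one inflated response in the middle of the range
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x
y[4] += 3.0
print(np.round(leverages(X), 3))
print(np.round(cooks_distance(X, y), 3))

In the toy data, the inflated response in the middle of the range shows up with a large Cook's
distance even though its leverage is unremarkable.
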
This example deals with the relationship in baseball between runs scored, runs given up, and a
team's success (how many games it wins). The data are for the San Francisco Giants in the years
1958 through 1993. In general, you would expect a team to do better, the more runs it scores and
the fewer it allows. There have been a number of people who've studied this question in a
mathematical setting, including Earnshaw Cook[3] and Bill James.[4] A plot of the total runs scored
by the Giants and by their opponents each year is given in Figure 2.2.

[Figure 2.2: Runs scored for and against the San Francisco Giants -- total runs scored plotted against total runs allowed, one point per season.]

One of the first things you notice is that there's a point way down in the lower left hand corner
of the graph. This point is for the year 1981, which was the year that the first long baseball strike
occurred. The form of the model will depend to a large extent on how an extreme data point like
that is fit.

[2] D.F. Andrews and D. Pregibon, "Finding the Outliers that Matter," Journal of the Royal Statistical Society,
Series B, vol. 40 (1978), pp. 85-93.
[3] In his book Percentage Baseball, published in 1964 by M.I.T. Press. His approach was based on linear regression
methods.
[4] Bill James published a series of books entitled Bill James' Baseball Abstract, published from around 1981 to
1989. They dealt in a somewhat mathematical way with a variety of questions that arise in baseball folklore. In my
opinion, his greatest contribution is to plant the notion that many generally accepted "truths" that have become part
of baseball folklore are things that can be investigated quantitatively. His approach to the relationship between runs
and wins was largely ad hoc. His Pythagorean Theorem states that winning percentage should be related to runs via
the equation

    p = W_+^2 / (W_+^2 + W_-^2),

where W_+ and W_- denote the runs scored and the runs allowed.

For example, if you simply fit a linear model for the number of wins as a function of runs
scored (X_+) and runs given up (X_-), the least squares equation is

    W = 49.82 + 0.127 X_+ - 0.082 X_-.

I'm going to look at three possible models here to see how the leverage and residuals can change
depending on how the model is formulated. The first of these, Figure 2.3, predicts the number of
wins from the numbers of runs the Giants and their opponents scored.

[Figure 2.3: Model predicting wins from runs for/against -- standardized residuals and leverages plotted by year.]

Notice that 1981 is seen as both a point with high leverage (more than twice the leverage of any
other point), and it also has the largest residual in absolute value (Z = -3.87). That's because of
the way the data were presented: the model can't tell the difference between a season with fewer
games and one in which the Giants just didn't win much.
If we instead express the dependent variable as the Giants' win percentage, then 1981's data point
is still influential (since leverage is defined in terms of the predictors only and not the response),
but the residual isn't bad at all (Z = 0.22). The worst residual in this model was for the year
1972, when the Giants outscored their opponents but wound up winning under 45% of their games.
The leverage and standardized residuals for this model are presented in Figure 2.4. Note that the
leverages are identical to the ones from the previous model.

[Figure 2.4: Model predicting winning percentage from runs for/against -- standardized residuals and leverages plotted by year.]

Finally, if we express the dependent variable as the Giants' win percentage and the independent
variables as the runs per game scored either by or against the Giants, then 1981 isn't particularly
strange, either in terms of its response or in terms of its predictors. In this model, the largest residual
was again for the 1972 season (Z = -2.91), and the largest leverage was for the 1970 season, in
which the Giants led the league both in runs scored and in runs allowed. The standardized
residuals and leverages for this model are plotted in Figure 2.5.

[Figure 2.5: Model predicting winning percentage from runs per game for/against -- standardized residuals and leverages plotted by year.]

To recap the most important aspects of these linear regression models: it's important to look at
the overall fit of the model, using a criterion like the Durbin-Watson statistic or even the model's
R² value, but it's also important to look at the impact of individual points on the results of the
regression, based on some influence diagnostics. What's more, whether a point is an outlier,
or whether it's excessively influential, will depend strongly on how the model is formulated.
Similar ideas can be applied to logistic regressions, as we'll see in the next section.
2.3 Logistic Regression Diagnostics
In a logistic regression, there's considerably less that can go wrong with the model, but by the
same token, there's less information to tell you when things are going wrong. If we look back at
the assumptions underlying a linear model (regression or analysis of variance), many of them have
limited relevance for a logistic regression:

- The residuals from the model are normally distributed. Of course, the residuals from a logistic
regression aren't normally distributed, since the response is either 0 or 1. (That is, the
distribution of the responses is Bernoulli with parameter p, or at worst binomial(n, p).) The
residual thus will be either 1 - p (with probability p) or -p (with probability 1 - p). What's
important, though, is that the distribution of the residuals is determined by the probability
p of a response, unlike the least squares situation, in which the distribution of the residuals
doesn't depend on the mean μ at all, but rather on the additional parameter σ².

- The residuals have a constant residual variance. This isn't a concern either, since the mean
(p) determines the variance (p(1 - p)).

- The residuals are independent. If you recall, this assumption was the one with the hand-
waving justification, and for a logistic regression there's still typically no statistical way to
argue that independence is a reasonable assumption. You have to argue that one individual's
response isn't affected by how the other individuals responded.

- The residuals have a mean of zero. This is the assumption that was addressed by a Durbin-
Watson statistic or by a χ² test for goodness of fit in the least squares case. As we saw
last time, there are similar statistics available for a logistic regression, including the Hosmer-
Lemeshow statistic and the C. C. Brown statistic. (A rough sketch of the idea behind the
Hosmer-Lemeshow statistic follows this list.)
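The sketch below is a simplified version of my own of the Hosmer-Lemeshow idea, not the BMDP or
SAS implementation (which handle ties and the choice of groups somewhat differently, as we'll see
later): bin the cases by predicted probability and compare observed and expected counts with a
chi-squared statistic.

import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, n_groups=10):
    order = np.argsort(p_hat)
    groups = np.array_split(order, n_groups)        # bins of (roughly) equal size
    chi2 = 0.0
    for g in groups:
        obs, exp, n = y[g].sum(), p_hat[g].sum(), len(g)
        chi2 += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    df = n_groups - 2
    return chi2, df, stats.chi2.sf(chi2, df)

# toy check: responses generated from the same probabilities used as predictions
rng = np.random.default_rng(2)
p_hat = rng.uniform(0.05, 0.95, 300)
y = rng.binomial(1, p_hat)
print(hosmer_lemeshow(y, p_hat))

With responses generated from the very probabilities used as predictions, the statistic should be
unremarkable; a large value (small p-value) means the observed and expected counts disagree
somewhere along the range of predicted probabilities.
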
The other type of residual analysis we did in our regression examples looked at individual
residuals to determine whether either the predictors or the responses for those cases were aberrant.
Similar statistics are available for a logistic regression; the ones that are easiest to come by are
based on the standardized residuals and on the leverage statistic, which is why I emphasized those
in the least squares examples. I'm going to do a couple of examples of logistic regressions, in
which those diagnostics provide useful information. After the examples, I'll talk a little about the
control language that's used to run a logistic regression, and to get the diagnostic information.
Unfortunately, no one package gives all the diagnostics you'd like (at least in the form you'd like),
as we'll see.
2.4 Logistic Regression Diagnostics: Patient Survival Example
The first example consists of data on close to 300 emergency room patients who were admitted
for a variety of causes and who ultimately either survived to be discharged from the hospital or
else they didn't. There are a number of indices which have been proposed as triage tools (as well
as to evaluate the performance (success) of a hospital) that are based on a patient's condition
and produce an estimate of how likely the patient is to survive. Some of these indices, such as
ISS (Injury Severity Score), are pretty ad hoc, while others, such as APACHE II, are based on
logistic regression analyses. The problem with many of these indices is that the predictors on
which they're based often include variables that aren't observable on a timely basis. That is, they
may involve laboratory values that aren't available at the time that decisions have to be made
about a patient's treatment. For this reason, there is a need for somewhat cruder indices that are
based on information that can be collected very easily, as soon as a patient enters the hospital. The
data in this example is for such an index that we've discussed before, called APSM1 here, which
is based on some basic vital signs (heart rate, blood pressure and respiration), as well as a fairly
standard measure of a patient's responsiveness, called the Glasgow Coma Scale (GCS). The index
is integer-valued, and ranges from 0 to 16, with the higher numbers corresponding to the patients
whose medical condition is most critical.

[Figure 2.6: Mortality in Emergency Room Patients -- observed outcomes and the predicted probability of death plotted against APSM1.]
The graph above (Figure 2.6) shows a plot of the raw data, along with the predicted probability
of mortality, based on a logistic regression in which APSM1 was the only predictor. The graph
can be a little misleading, since there are many instances in which patients have identical APSM1
values. Thus there are a bunch of points that get plotted on top of each other. What you can see
from this graph is that there's a lot of overlap in APSM1 score between the patients who survived
and those who died, but it's also noteworthy that nobody with a score above 11 survived, whereas
there were quite a few patients who died and who had scores that high.
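For readers who want to see the shape of such a fit outside of SAS or BMDP, here is a hedged sketch
in Python. The data are simulated stand-ins with the same general structure (an integer score from
0 to 16 and a binary outcome), not the actual patient records, and the coefficients used to generate
them are invented:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
apsm1 = rng.integers(0, 17, 300).astype(float)      # integer score from 0 to 16
true_p = 1 / (1 + np.exp(-(-3.3 + 0.42 * apsm1)))   # invented mortality probabilities
died = rng.binomial(1, true_p)

X = sm.add_constant(apsm1)
fit = sm.Logit(died, X).fit(disp=0)
print(fit.params)            # intercept and slope on the logit scale
print(np.exp(fit.params))    # the same thing on the odds-ratio scale
print(fit.predict(X)[:5])    # predicted probabilities of death for the first few cases

The exponentiated slope plays the same role as the EXP(COEF) column in the BMDP output
below: it's the multiplicative change in the odds of the response for a one-unit increase in the
predictor.
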
In the excerpt from the (BMDP) output which follows, you can see that APSM1 is quite
significant as a predictor of mortality (χ² = 58.71, p < .0001). You can also see that the model fits
reasonably, if not spectacularly, well, since both the Hosmer-Lemeshow and Brown goodness of fit
statistics are insignificant (Hosmer: χ² = 11.180 on 6 d.f., p = .083; Brown: χ² = 1.617 on 2 d.f.,
p = .445). As is often the case, the likelihood ratio (2*O*LN(O/E)) chi-squared statistic isn't terribly
enlightening, since it's based on a contingency table that has 58 degrees of freedom. This is a lot,
given that there were under 300 observations in the data set. In a sense, we were lucky here, since
although APSM1 ranges from 0 to 16, it takes only integer values. If it had been continuously
distributed, then the chi-squared table on which the likelihood ratio chi-squared statistic is based
would have had one cell for each unique combination of predictors, possibly as many as there were
observations.
STEP NUMBER 1 apsm1 IS ENTERED
---------------
LOG LIKELIHOOD = -101.772
IMPROVEMENT CHI-SQUARE ( 2*(LN(MLR) ) = 92.533 D.F.= 1 P-VALUE= 0.000
GOODNESS OF FIT CHI-SQ (2*O*LN(O/E)) = 59.965 D.F.= 58 P-VALUE= 0.404
GOODNESS OF FIT CHI-SQ (HOSMER-LEMESHOW)= 11.180 D.F.= 6 P-VALUE= 0.083
GOODNESS OF FIT CHI-SQ ( C.C.BROWN ) = 1.617 D.F.= 2 P-VALUE= 0.445
STANDARD 95% C.I. OF EXP(COEF)
TERM COEFFICIENT ERROR COEF/SE EXP(COEF) LOWER-BND UPPER-BND

apsm1 -0.4217 0.543E-01 -7.77 0.656 0.589 0.730


CONSTANT 3.329 0.346 9.61 27.9 14.1 55.2

STATISTICS TO ENTER OR REMOVE TERMS


-----------------------------------
APPROX. APPROX.
TERM F TO D.F. D.F. F TO D.F. D.F.
ENTER REMOVE P-VALUE
dx 1.26 1 277 0.2628
response 0.95 2 276 0.3897
apsm1 58.71 1 278 0.0000
CONSTANT 89.80 1 278 0.0000
CONSTANT IS IN MAY NOT BE REMOVED.

Based on this portion of the output, you can't get all that good an idea of how the model is
performing. The fact that the p-value for the Hosmer goodness of fit statistic is less than 0.10 is a
little ominous, but it's not clear why this is happening. To fill in the picture, you need to look at
some additional diagnostic information, of which the leverage and standardized residuals are two
examples. Both BMDP and SAS will calculate this information, but since I prefer the way SAS
presents it, I'll extract a little of this information from the analogous SAS output.
Regression Diagnostics
Covariates Pearson Residual
Case (1 unit = 0.66)
Number APSM1 Value -8 -4 0 2 4 6 8
1 6.0000 . | | |
2 2.0000 . | | |
3 0 -0.1893 | * |
4 4.0000 -0.4400 | *| |
5 2.0000 -0.2886 | * |
6 0 -0.1893 | * |
7 0 -0.1893 | * |
8 2.0000 -0.2886 | * |
9 5.0000 -0.5432 | *| |
10 5.0000 -0.5432 | *| |
Deviance Residual Hat Matrix Diagonal
Case (1 unit = 0.32) (1 unit = 0)
Number Value -8 -4 0 2 4 6 8 Value 0 2 4 6 8 12 16
1 . | | | . | |
2 . | | | . | |
3 -0.2653 | *| | 0.00400 | * |
4 -0.5949 | * | | 0.00523 | * |
5 -0.4000 | *| | 0.00479 | * |
6 -0.2653 | *| | 0.00400 | * |
7 -0.2653 | *| | 0.00400 | * |
8 -0.4000 | *| | 0.00479 | * |
9 -0.7191 | * | | 0.00579 | * |
10 -0.7191 | * | | 0.00579 | * |
INTERCPT Dfbeta APSM1 Dfbeta


Case (1 unit = 0.04) (1 unit = 0.04)
Number Value -8 -4 0 2 4 6 8 Value -8 -4 0 2 4 6 8
1 . | | | . | | |
2 . | | | . | | |
3 -0.0120 | * | 0.0103 | * |
4 -0.0262 | *| | 0.0129 | * |
5 -0.0196 | * | 0.0145 | * |
6 -0.0120 | * | 0.0103 | * |
7 -0.0120 | * | 0.0103 | * |
8 -0.0196 | * | 0.0145 | * |
9 -0.0262 | *| | 0.00567 | * |
10 -0.0262 | *| | 0.00567 | * |

C CBAR
Case (1 unit = 0.01) (1 unit = 0.01)
Number Value 0 2 4 6 8 12 16 Value 0 2 4 6 8 12 16
1 . | | . | |
2 . | | . | |
3 0.000145 |* | 0.000144 |* |
4 0.00102 |* | 0.00102 |* |
5 0.000402 |* | 0.0004 |* |
6 0.000145 |* | 0.000144 |* |
7 0.000145 |* | 0.000144 |* |
8 0.000402 |* | 0.0004 |* |
9 0.00173 |* | 0.00172 |* |
10 0.00173 |* | 0.00172 |* |

DIFDEV DIFCHISQ
Case (1 unit = 0.43) (1 unit = 1.75)
Number Value 0 2 4 6 8 12 16 Value 0 2 4 6 8 12 16
1 . | | . | |
2 . | | . | |
3 0.0705 |* | 0.0360 |* |
4 0.3549 | * | 0.1946 |* |
5 0.1604 |* | 0.0837 |* |
6 0.0705 |* | 0.0360 |* |
7 0.0705 |* | 0.0360 |* |
8 0.1604 |* | 0.0837 |* |
9 0.5189 | * | 0.2968 |* |
10 0.5189 | * | 0.2968 |* |

[Figure 2.7: Diagnostics on the logistic regression of the emergency room data -- leverage and standardized residuals, each plotted against APSM1.]

First off, Figure 2.7 shows both the leverage of a data point and the standardized residual of a
point, both as a function of APSM1. From the plot of standardized residuals, we see
two curves stretching across the page. The upper of these corresponds to the standardized residuals
for patients who ultimately died, while the lower one corresponds to the standardized residuals for
patients who survived. By far the largest residuals occur for the patients with low APSM1 scores
who for whatever reason died. There are many possible explanations for this, but basically these
people had something wrong with them that wasn't reflected in their APSM1 score. Thus, while
you wouldn't think they would die by looking at their score, something else was going on that
meant that they were sicker than they appeared on the surface, based on their APSM1 score.
Carrying this observation a little further, one of the assumptions of a logistic model (or probit,
for that matter) is that as the predictors get extreme in one direction or the other, the probability
of a response will go to zero or one, depending on the direction. This assumption may not be
reasonable for this type of data, since it may be the case that patients have some probability of
dying, regardless of how healthy they look. The mere fact that they're in the hospital should mean
that something is wrong with them. Likewise, it may also be true that even extremely sick patients
may have some probability of surviving. Thus, while the model may fit quite well in the body of
the data (well enough to pass a goodness of fit test), the model may break down when you get to
the observations with the most extreme predictors. The same can be said of a linear regression
model, for which the regression function E(Y|X) may be essentially linear for most of the data, and
yet depart from linearity as you get to the fringes of the data. It's always much easier to tell whether
the model is correctly specified where there's a lot of data than it is at the extremes. That's why
extrapolation is much trickier than interpolation.
A final observation on the standardized residual plot is that the most extreme residuals
all correspond to patients who died when they weren't expected to. It's also possible to get large
residuals in the other direction, but it's not surprising in this case that we didn't, since many more
patients survived than didn't (78% of them). The "healthiest" patient to die had an estimated
probability of death around 3%, while the sickest patient to survive (who had an APSM1 of 11)
had an estimated probability of survival around 22%.
The plot of leverage values may be a little surprising, since in a linear regression, the leverage
increases as a quadratic function of the difference between the predictors for the observation in
question and the mean vector of predictors across all observations. In the leverage panel of Figure 2.7,
we see that the leverage increases up to an APSM1 score of around 12, and decreases beyond that
point. The reason this happens is that for an observation to have leverage, it must have predictors
that are out of the ordinary, but it must also be variable enough that surprises can occur. In a
logistic regression, the points at either extreme will have predicted probabilities close to either 0 or
1, so that their variance (p(1 - p)) will be small.
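To see that weighting at work, here is a small numpy sketch (my own illustration, using hypothetical
fitted probabilities rather than the actual APSM1 fit) of the leverage in a logistic regression, taken
as the diagonal of the weighted hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2) with weights
w_j = p_j(1 - p_j):

import numpy as np

def logistic_leverage(X, p_hat):
    w = p_hat * (1 - p_hat)                     # Bernoulli variance weights
    Xw = X * np.sqrt(w)[:, None]                # rows are sqrt(w_j) * x_j
    M = np.linalg.inv(Xw.T @ Xw)                # (X' W X)^(-1)
    return np.einsum("ij,jk,ik->i", Xw, M, Xw)  # diagonal of the weighted hat matrix

score = np.arange(0, 17, dtype=float)
X = np.column_stack([np.ones_like(score), score])
p_hat = 1 / (1 + np.exp(-(-3.3 + 0.42 * score)))    # hypothetical fitted probabilities
h = logistic_leverage(X, p_hat)
for s, p, hj in zip(score, p_hat, h):
    print("score %2.0f   p_hat %.3f   leverage %.3f" % (s, p, hj))

Even in this toy version, the leverage climbs with the score for a while and then falls off at the top
of the range, because the shrinking weight p(1 - p) eventually wins out over the growing distance
from the center of the data.
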
The excerpt from the output contains a number of statistics that I haven't discussed, though
they're still aiming to identify either points that have anomalous responses or ones that have undue
influence on the outcome of the analysis. Briefly, the additional measures are as follows:

- The Deviance Residual is similar to a Pearson residual, in that it measures the difference be-
tween the observed response (i.e., 1 or 0) and the model's predicted probability of a response.
The difference is that the Pearson residual is measured in terms of squared error, while the
Deviance residual is measured on a log-likelihood scale, defined as sqrt(-2 ln π_j) if subject j
responded, and as -sqrt(-2 ln(1 - π_j)) if subject j didn't respond, where π_j is the predicted
probability that subject j would respond. (A short sketch of both types of residual follows
this list.)

- DFBETA measures how much one of the model's parameter estimates will change if the j-th
observation is deleted from the sample. It's measured relative to the standard error for that
parameter, so it's expressed in standardized units. Since each model will have one parameter
for the intercept and one for each of k covariates in the model, the number of DFBETA
measures will be equal to the number of degrees of freedom in the model, including one for
the intercept term.

- C and CBAR are composite measures that approximate how much the estimated parameter
vector will change if the j-th observation is deleted. This is measured relative to the estimated
covariance matrix of the parameters, so again this is a standardized measure. C's approximation
is based on the asymptotics of the Pearson χ² measure of fit, while CBAR is based on the
asymptotics of the Deviance (log likelihood) measure of fit.

- DIFDEV and DIFCHISQ measure how much the deviance (log likelihood χ²) or Pearson χ²
statistics would change if the j-th observation was deleted. Large values correspond to
observations that aren't being predicted very well by the model.
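Both kinds of residual are simple enough to compute directly for a single binary observation; in my
own notation (not package code):

import numpy as np

def pearson_residual(y, p_hat):
    return (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))

def deviance_residual(y, p_hat):
    loglik = y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
    return np.sign(y - p_hat) * np.sqrt(-2 * loglik)

for y in (0, 1):
    for p in (0.03, 0.5, 0.97):
        print("y=%d  p_hat=%.2f  Pearson %6.2f  deviance %6.2f"
              % (y, p, pearson_residual(y, p), deviance_residual(y, p)))
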
I'm going to look at one more example, which differs primarily in that there are going to be
more predictors included in the model, some of which are categorical.
2.5 Logistic Regression Diagnostics: Resistant Infection Example
It's not at all uncommon for hospitalized patients to develop bacterial infections, either in con-
junction with or peripheral to their disease. There are many antibiotic agents that can be used
to treat these infections, so an infected patient will often receive a series of antibiotic treatments
to try and control their infection. This can get nasty if the bacterial strain involved is one that's
resistant to antibiotic treatment. There's some reason to suspect that if a patient has an initial
infection that's sensitive to treatment with antibiotics, then subsequent infections may be more
likely to be resistant, depending on the type of antibiotics that were used to treat the initial infec-
tion. The data set in this example looks at a number of patients who had an initial infection that
was sensitive and who subsequently developed a second infection. (It was known that the second
infection wasn't just a reappearance of the initial infection, since the data set contains only cases
in which a different bacterial strain was involved in the second infection.) Logistic regression was
used to look at whether the antibiotics that were used in treating the initial infection influenced
whether the second infection was resistant or sensitive to antibiotics. I've taken a few liberties with
the original analysis, sticking in a covariate that wasn't important, and redefining a variable that
originally indicated simply whether or not an aminoglycoside was used so that it indicated which
one was used. I made these changes so that I'd have at least one covariate in the model and at
least one categorical effect that had more than two levels.
SAS and BMDP wound up fitting very similar, though not identical, models. Some relevant
portions of the BMDP output, including the last step in the stepwise procedure, are given below.

NUMBER OF CASES READ. . . . . . . . . . . . . . 131


CASES WITH USE SET TO ZERO . . . . . . . . . 20
REMAINING NUMBER OF CASES . . . . . . . . 111
TOTAL NUMBER OF RESPONSES USED IN THE ANALYSIS 111.
resist . . . . . . 25.
sens . . . . . . 86.

NUMBER OF DISTINCT COVARIATE PATTERNS . . . . . 110

DESCRIPTIVE STATISTICS OF INDEPENDENT VARIABLES


-----------------------------------------------

VARIABLE STANDARD
NO. N A M E MINIMUM MAXIMUM MEAN DEVIATION SKEWNESS KURTOSIS
3 age 3.0000 77.0000 43.6486 19.5093 0.1232 -1.0894
5 apache 0.0000 30.0000 9.8468 6.9219 0.5329 -0.4116

VARIABLE GROUP DESIGN VARIABLES


NO. N A M E INDEX FREQ ( 1) ( 2) ( 3)

19 esc 0 85 0
1 26 1

20 cef 0 56 0
1 55 1

37 aglyco 1 82 0 0 0
2 21 1 0 0
3 7 0 1 0
4 1 0 0 1

STEP NUMBER 3 cef IS ENTERED


---------------

LOG LIKELIHOOD = -28.016


IMPROVEMENT CHI-SQUARE ( 2*(LN(MLR) ) = 8.115 D.F.= 1 P-VALUE= 0.004
GOODNESS OF FIT CHI-SQ (2*O*LN(O/E)) = 56.032 D.F.= 104 P-VALUE= 1.000
GOODNESS OF FIT CHI-SQ (HOSMER-LEMESHOW)= 2.267 D.F.= 7 P-VALUE= 0.944
GOODNESS OF FIT CHI-SQ ( C.C.BROWN ) = 0.638 D.F.= 2 P-VALUE= 0.727

STANDARD 95% C.I. OF EXP(COEF)


TERM COEFFICIENT ERROR COEF/SE EXP(COEF) LOWER-BND UPPER-BND

esc 4.956 1.12 4.41 142. 15.3 0.132E+04


cef 2.568 1.13 2.27 13.0 1.38 123.
aglyco (1) 0.3244 0.923 0.351 1.38 0.222 8.63
(2) 5.077 1.54 3.29 160. 7.55 0.341E+04
(3) 15.84 0.158E+04 0.100E-01 0.755E+07 0.000 0.000
CONSTANT -5.063 1.19 -4.24 0.633E-02 0.594E-03 0.673E-01

COVARIANCE MATRIX OF COEFFICIENTS


---------------------------------
esc cef aglyc(1) aglyc(2) aglyc(3)
esc 1.26326
cef 0.91480 1.27972
aglyc(1) 0.15384 -0.12004 0.85266
aglyc(2) 1.16447 1.13312 0.21248 2.37490
aglyc(3) -0.08058 0.23182 0.06073 0.24133 2.494885E+6
CONSTANT -1.18267 -1.14662 -0.21457 -1.40580 -0.24032
CONSTANT
CONSTANT 1.42299

STATISTICS TO ENTER OR REMOVE TERMS


-----------------------------------
APPROX. APPROX.
TERM CHI-SQ. D.F. CHI-SQ. D.F. LOG
ENTER REMOVE P-VALUE LIKELIHOOD
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.00000E+00,PCONV= 0.39588E-01
age 0.68 1 0.4100 -27.6766
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.00000E+00,PCONV= 0.52551E-01
apache 0.09 1 0.7626 -27.9704
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.33780E-05,PCONV= 0.82094E-01
esc 47.86 1 0.0000 -51.9473
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.54711E-05,PCONV= 0.10497E+00
cef 8.11 1 0.0044 -32.0733
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.00000E+00,PCONV= 0.40399E-01
C*D 0.63 1 0.4257 -27.6987
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.00000E+00,PCONV= 0.14759E-02
aglyco 14.38 3 0.0024 -35.2053
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.27618E-06,PCONV= 0.11607E+00
C*E 0.78 3 0.8534 -27.6242
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.72209E-05,PCONV= 0.10931E+00
D*E 2.15 3 0.5425 -26.9427
CONVERGENCE TO ESTIMATE THE SIGNIFICANCE OF THE TERM BELOW WAS NOT REACHED IN
10 ITERATIONS. LAST LCONV= 0.28503E-05,PCONV= 0.11735E+00
CONSTANT 67.10 1 0.0000 -61.5646


CONSTANT IS IN MAY NOT BE REMOVED.

BMDP chose 3 predictors for inclusion in the model: whether cefazolin was used, whether an
extended spectrum cephalosporin (ESC) was used, and which (or whether an) aminoglycoside was
used. Other terms were listed as possible covariates, including age, Apache score, and interac-
tions between ESC's and either cefazolin or aminoglycosides. (The interest in the interactions was
whether the use of some of the more routine antibiotics made the problems with ESC's worse.)
The overall tests of the model's fit came out reasonably well (p = .944 for the Hosmer statistic
and p = .727 for the Brown statistic), so on that level, there's no reason to question the model. As
I've mentioned previously, BMDP will let you calculate leverage and residual diagnostics and save
them in a BMDP output data set, but since this makes it a little cumbersome to work with them,
I'm going to take this information out of the SAS output again. Since there are a few more points
I want to make about the BMDP output, I'll come back to this.
An extra piece of output that I requested that BMDP print is the estimated covariance matrix
for the parameters of the model. The reason that this might be interesting is that for a significant
categorical predictor, you probably also want to know which of the categories differ significantly.
For example, for the aminoglycoside predictor, we might want to know whether patients who receive
gentamicin are at greater risk of getting a resistant infection than patients who receive tobramicin,
or patients who receive none of the aminoglycosides. I didn't copy the portion of the output that
tells you which of the categories correspond to which treatment (it's in the control language in
Section 2.6). The four categories I defined (in order) were "no aminoglycoside," "gentamicin,"
"tobramicin," and "amikacin." According to the definition of the design variables (shown in the
design-variable listing in the output above), a patient who was in the "no aminoglycoside" group
would have a value of zero for each of the design variables, a patient who was in the "gentamicin"
group would have a one for the first design variable and zeroes for the others, and so forth. From
this, you can see that the difference in the logit between a gentamicin patient and one who didn't
receive any aminoglycosides would be the difference between a term equal to
1·C_1 + 0·C_2 + 0·C_3 = C_1 in the logit, and one that's equal to 0·C_1 + 0·C_2 + 0·C_3 = 0,
respectively, where C_i is the coefficient of the i-th design variable. The difference between these two
expressions is just C_1. Thus, to compare these two groups, all you have to do is look at the estimated
coefficient C_1, divided by its estimated standard error. This will give a Z statistic which has an
approximate normal distribution with mean zero and variance 1. If it's larger than approximately 2
in absolute value, then patients who received gentamicin would have a significantly different
response, relative to patients who didn't get any aminoglycoside.
It's more complicated to compare gentamicin with tobramicin, since the corresponding terms in
the logit would be 1·C_1 + 0·C_2 + 0·C_3 and 0·C_1 + 1·C_2 + 0·C_3, respectively.
The difference between the two is C_1 - C_2, and while it's easy to calculate this difference, it's a little
harder to calculate an estimate of the standard error of this difference. If the difference in question
was C_i - C_j, then you could calculate an approximate Z statistic of the form

    Z = (C_i - C_j) / sqrt(s_ii + s_jj - 2 s_ij),

where s_ij is the (i,j)-th element of the covariance matrix of the coefficients. (It's also possible to
get at this quantity based on the means, standard errors, and correlation matrix among the
coefficients, but this is a more cumbersome calculation.) You can define more complicated contrasts
among the estimated coefficients and construct Z statistics based on them, but I won't go into how
to do that here.
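The arithmetic is easy to script. As an illustration, the sketch below plugs in values rounded from
the aglyco coefficients and covariance matrix in the BMDP output above and computes the
gentamicin-versus-tobramicin comparison; the function itself is generic:

import numpy as np

# coefficients and covariances rounded from the aglyco design variables in the
# BMDP output above (the third entry is the degenerate amikacin variable)
coef = np.array([0.32, 5.08, 15.84])
cov  = np.array([[0.85, 0.21, 0.06],
                 [0.21, 2.37, 0.24],
                 [0.06, 0.24, 2.49e6]])

def contrast_z(coef, cov, i, j):
    """Z statistic for the difference C_i - C_j."""
    se = np.sqrt(cov[i, i] + cov[j, j] - 2 * cov[i, j])
    return (coef[i] - coef[j]) / se

# gentamicin (design variable 1) versus tobramicin (design variable 2)
print(contrast_z(coef, cov, 0, 1))

Keep in mind that the variance of the third design variable is essentially infinite here (the amikacin
problem discussed below), so any contrast involving it would be meaningless.
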
I should comment that BMDP uses a number of different ways of defining the design variables,
and the form that's used by default can differ from one version of the program to another, so you
need to check the way the design variables have been defined. For the record, you can ask BMDP
to use a particular parameterization by specifying the DVAR option as either MARG (marginal), PART
(partial, which is the one that was used here), or ORTH (orthogonal).
One thing that appears all over the place in the BMDP output is the set of warning messages
about the model failing to converge. The reason for all of these messages is that for one of the
design variables (the last dummy variable for the aminoglycoside effect), the parameter was
essentially infinite (the estimated adjusted odds ratio for this dummy variable was over a million).
This happened because there was only one subject that received amikacin (the last of the
aminoglycosides), and that patient developed a resistant infection, so there was no observed
variability in the response to that antibiotic. In such cases, there's no way to estimate the parameter
(which truly is infinite) to within the tolerance factor that's used to define convergence, so it's
actually no surprise that the model didn't converge. It's noteworthy that SAS and BMDP react
differently to this situation. Both programs recognize that the algorithm has failed to converge, but
BMDP gives you the chance to proceed with the calculations anyway, whereas SAS simply prints
out an error message and quits. For this (and one other) reason, the model that SAS wound up
selecting was slightly different.
The model that SAS fit is summarized in the following excerpt from its output:
Step 3. Variable CEF entered:

The LOGISTIC Procedure

Criteria for Assessing Model Fit

Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 120.424 65.546 .
SC 123.134 76.384 .
-2 LOG L 118.424 57.546 60.878 with 3 DF (p=0.0001)
Score . . 58.428 with 3 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

Parameter Standard Wald Pr > Standardized Odds


Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 -4.8898 1.1492 18.1038 0.0001 . 0.008
ESC 1 4.9584 1.1045 20.1530 0.0001 1.163040 142.372
CEF 1 2.5045 1.1107 5.0846 0.0241 0.693507 12.238
TOB 1 4.9041 1.5080 10.5753 0.0011 0.660200 134.838

Association of Predicted Probabilities and Observed Responses


Concordant = 87.3% Somers' D = 0.847
Discordant = 2.7% Gamma = 0.941


Tied = 10.0% Tau-a = 0.298
(2150 pairs) c = 0.923
Residual Chi-Square = 2.2201 with 4 DF (p=0.6953)

Analysis of Variables Not in the Model

Score Pr >
Variable Chi-Square Chi-Square

AGE 0.2542 0.6141


APACHE 0.0008 0.9778
AGE 0.2542 0.6141
APACHE 0.0008 0.9778
GEN 0.5308 0.4663
AMI 1.0069 0.3156

NOTE: No (additional) variables met the 0.05 significance level for entry into
the model.

As in the BMDP analysis, three predictors got chosen for the model, two of which were the same
as before (ESC's and cefazolin). The third predictor was one of the aminoglycosides, but since SAS
didn't know anything about the relationships among the potential predictors (which ones should
enter and/or leave the model simultaneously), it chose only one of the three dummy variables for
the aminoglycoside factor. This could be viewed as either a strength or as a weakness, but I think
that at the very least, it's unfortunate that you never get a 3 degree of freedom test of whether the
aminoglycosides (as a group) make a significant difference. One implication of this is that the form
of SAS's model will often depend on how you decide to parameterize the categorical effects (which
set of dummy variables you use), whereas BMDP's model won't. This failure to recognize which of
the predictors need to be linked with each other is particularly unfortunate if you're trying to keep
track of interaction terms (which in a SAS model are simply represented by products between the
main effect dummy variables). It's quite possible (as well as likely) that SAS will choose only some
of the dummy variables for a main effect, and a different subset of the dummy variables for the
corresponding interactions. A model like this can be extremely hard to interpret. For this reason,
I find that the SAS stepwise algorithm in PROC LOGISTIC is useful primarily when all effects
are covariates and there aren't any interactions being considered. There are some other ways of
running logistic models in SAS, which is one of the things that I'll discuss next time.
The following is an excerpt from the diagnostics on this model. I'm showing only a few of the
cases and only a few of the diagnostic variables, primarily to conserve space.
Covariates Pearson Residual
Case (1 unit = 0.45)
Number ESC CEF TOB Value -8 -4 0 2 4 6 8
41 0 1.0000 0 -0.3034 | *| |
42 1.0000 0 0 -1.0349 | * | |
43 0 0 0 -0.0867 | * |
44 0 0 0 -0.0867 | * |
45 0 0 1.0000 -1.0072 | * | |
46 0 0 0 -0.0867 | * |

47 1.0000 1.0000 0 -3.6205 |* | |
48 0 0 0 -0.0867 | * |
49 0 1.0000 0 -0.3034 | *| |
50 1.0000 1.0000 0 0.2762 | |* |

Deviance Residual Hat Matrix Diagonal

Case (1 unit = 0.29) (1 unit = 0.02)


Number Value -8 -4 0 2 4 6 8 Value 0 2 4 6 8 12 16
41 -0.4197 | *| | 0.0215 | * |
42 -1.2067 | * | | 0.0728 | * |
43 -0.1224 | * | 0.00979 | * |
44 -0.1224 | * | 0.00979 | * |
45 -1.1835 | * | | 0.2466 | *|
46 -0.1224 | * | 0.00979 | * |
47 -2.3007 |* | | 0.0726 | * |
48 -0.1224 | * | 0.00979 | * |
49 -0.4197 | *| | 0.0215 | * |
50 0.3835 | |* | 0.0726 | * |

From this, you can see that of these ten cases, the one with the worst standardized residual
was case 47, which was a subject who had received two of the "bad" antibiotics, and yet whose
reinfection was sensitive to antibiotic therapy. The large residual reflects the fact that the predicted
probability of sensitivity for that subject's infection was only 7.1%. By contrast, subject 45 had a
residual that was quite reasonable (around -1), but its leverage (hat matrix diagonal) was much
larger than for the other listed points. This was because this was the only subject in the study who
received tobramicin, but not the other two antibiotics that were chosen for the model.
I should comment that for both BMDP and SAS, the leverage diagnostics pertain only to the
variables that were selected for the model. This is probably a reasonable choice, since otherwise
you would have to be concerned with the influence of a case under a lot of different circumstances,
depending on which predictors were in the model. However, this doesn't help you to diagnose
whether one of the variables that wasn't chosen failed to be chosen because of what happened for
a few influential cases. The only way to get this kind of information is to calculate the influence
diagnostics for some additional models. (Those models would have to be chosen explicitly, rather
than using the stepwise algorithms, since of course the stepwise procedure would start by eliminating
the extra variables that you had just put into the model.)
It's possible to get a Hosmer goodness of fit statistic out of PROC LOGISTIC (version 6.07 or
later), but the statistic you get isn't exactly the same as the one from BMDP. The next piece of
output is what SAS prints out for its Hosmer statistic.

Hosmer and Lemeshow Goodness-of-Fit Test


RESTNT2 = 2 RESTNT2 = 0
-------------------- --------------------
Group Total Observed Expected Observed Expected
1 11 0 0.08 11 10.92
2 11 0 0.08 11 10.92
3 11 0 0.08 11 10.92
4 11 0 0.62 11 10.38
5 11 0 0.93 11 10.07
6 11 1 0.93 10 10.07
7 11 1 0.93 10 10.07
8 11 5 3.90 6 7.10
9 11 7 6.10 4 4.90
10 12 11 11.35 1 0.65
Goodness-of-fit Statistic = 2.9019 with 8 DF (p=0.9404)

Last time, I described the Hosmer statistic as a goodness of fit χ² statistic, based on a ten-
cell table, in which the first cell contained the 10% of the cases that had the lowest predicted
Pr{response}, the second cell consisted of the next lowest 10% of the cases, and so forth. The
problem with this is that if there are a bunch of cases with very low or very high estimated
probabilities, then the expected frequencies in some of these cells can be lower than you probably
ought to use in a χ² table. SAS's Hosmer statistic went ahead and used all 10 cells anyway. BMDP's
Hosmer statistic used a little more restraint and based the statistic on an 8-cell table (hence it had
6 degrees of freedom). Because of this, I prefer the way that BMDP does the Hosmer calculation.
2.6 Control language for Logistic Regression, using SAS and BMDP
As I've mentioned in my lecture notes on Repeated Measures Analysis of Variance,[5] SAS and
BMDP control language are similar in structure, in that there are two main parts to the control
language, the first of which tells you how to read the data correctly, and the second of which tells
you what to do with the data once you've read it. In BMDP, the "read the data" sections are the
PROBLEM, INPUT, VARIABLE, GROUP, and TRANSFORM paragraphs, while in SAS, all of this is contained
in the DATA step. For a logistic regression in BMDP, the description of what to do with the data
is in the REGRESS and PRINT paragraphs, while in SAS, it's what follows PROC LOGISTIC. I'm going
to list out the two programs that I used to create the example on resistant infections (Section 2.5)
in their entirety, and then discuss the purpose of the various subcommands.

[5] The Pragmatist's Guide to Statistics: repeated measures, which is available on the UCD Division of Statistics
gopher in a series of three LaTeX files.

The control language I used for the BMDP analysis of this data was:

/problem title is 'stu cohen ESC data revisited, infants excluded'.
/input variables are 35.
file is 'c:\home\sc4x.dat'.
reclen is 132.
format is free.
/variable names are firstcnt, los, age, sex, apache, apache2, restnt1,
date1, losfps, restnt2, losfpr, restnt3, noabx, ags,
oneceph, thrcph, exceph, noexceph, esc, cef, van, pip,
gen, tob, ami, tmp, imi, surgery, drains, intub,
foley, lines, feedtube, radiol, gilab, los1, aglyco.
add = 2.
missing is 35*-1.
use = age, apache, restnt2, esc, cef, aglyco.
/group codes(restnt2) are 2, 0.
names(restnt2) are resist, sens.
codes(aglyco) are 0, 1, 2, 3.
names(aglyco) are none, gen, tob, ami.
/transform if (firstcnt ne 1) then use = 0.
if (age le 1) then use = 0.
los1 = max(losfps, losfpr).
aglyco = 0.
if (gen = 1) then aglyco = 1.
if (tob = 1) then aglyco = 2.
if (ami = 1) then aglyco = 3.
/regress dependent is restnt2.
categorical are esc, cef, aglyco.
interval are age, apache.
method is mlr.
model is age, apache, esc, cef, esc*cef, aglyco, esc*aglyco,
cef*aglyco.
start = out, out, out, out, out, out, out, out.
move = 2, 2, 2, 2, 2, 2, 2, 2.
/print cova.
/end
The most important aspects of the "data" section of the program are the USE statement in the
VARIABLE paragraph, and the GROUP paragraph. By default, BMDP will use all of the variables in
the model, treating any variables that aren't listed as covariates (interval variables) as categorical
predictors. To limit the number of variables that are potential predictors in the model, you list
them, along with the dependent variable, in the USE statement. This is particularly important,
since the analysis will be able to use only cases for which all of the potential predictors have been
measured. It's a bad thing if cases get deleted from the analysis because of missing values for
variables that you weren't going to use anyway.
The importance of the GROUP paragraph is primarily that it can be used to define which logistic
model is being fit. If a dependent variable has responses of 0 and 1, then you could do a logistic
regression to predict either Pr{Y = 0} or Pr{Y = 1}. The difference between the two wouldn't
appear in terms of what predictors were significant; in that sense the models would be identical.
However, each of the model coefficients for one of these models would have the opposite sign from
the corresponding coefficient from the other model. Unless you're sure which of the two models
is being fit, you can misinterpret the results badly. By including a GROUP paragraph and defining
groups for the dependent variable (restnt2 in this case), you can force BMDP to fit a model that
predicts the first of these responses.
The REGRESS paragraph has a number of features, and it's best to go over them one at a time.

- The dependent statement defines the dependent variable. It should be a dichotomous vari-
able, and it should reflect the response for a single individual. If the data have been entered
with a set of predictors, followed by the number of subjects who responded positively or neg-
atively, then there's an alternate way of specifying the dependent variable, by listing variables
that contain the number of successes (SCOUNT), the number of failures (FCOUNT), or the total
number of cases (COUNT). You need only two of these three variables, since you can always
calculate the third one from the other two.

- interval and categorical are used to define which of the variables are covariates and which
are categorical predictors. If you forget to mention some of the predictors here, BMDP will
assume they're categorical.

- The method statement tells BMDP whether to use the fast and sloppy algorithm (ACE) or the
slower, maximum likelihood method (MLR). The default is to use the fast method.

- The model statement tells BMDP which predictors to use in the model. If you're not inter-
ested in any interactions, this statement can be omitted and BMDP will use a model that
(potentially) includes all of the main effects and no interactions.

- The start and move statements tell BMDP which terms should be included in the model at
the start of its calculations, and how often they can be moved into or out of the model. The
default choices here are a little complicated, since they depend on whether a model statement
was used or not. If you didn't use a model statement, then the default is to start with no
predictors (except the intercept) in the model and move each of them up to twice. If you did
use a model statement, the default is to include all the terms in the model at the start and
not move them at all. This is a strange default, since it defines a stepwise algorithm whose
default is never to step. To override this, you need the start and move statements.

- The print cova statement was used to request that BMDP print out the covariance matrix
for the estimated coefficients. As I've pointed out, this can be very useful if you want to find
which categories of a categorical predictor are responsible for its significance.

- Finally, if I had included a save paragraph, the program would have saved a BMDP file
that would contain lots of nifty things, including the influence and residual diagnostics that
I discussed.
The control language that I used to fit the analogous model in SAS was:
options ls=80;
data stucohen;
array data{35} firstcnt los age sex apache apache2 restnt1
date1 losfps restnt2 losfpr restnt3 noabx ags
oneceph thrcph exceph noexceph esc cef van pip
gen tob ami tmp imi surgery drains intub
foley lines feedtube radiol gilab;
infile 'c:\home\sc4x.dat' lrecl=132;
input firstcnt los age sex apache apache2 restnt1 date1 losfps restnt2
losfpr restnt3 noabx ags oneceph thrcph exceph noexceph esc
cef van pip gen tob ami tmp imi surgery drains intub
foley lines feedtube radiol gilab;
if firstcnt ne 1 then delete;
if age le 1 then delete;
do i = 1 to 35;
if data{i} = -1 then data{i} = .;
end;
proc logistic descending;
model restnt2 = age apache age apache esc cef gen tob ami/details
influence covb selection=stepwise;
title 'ESC logistic regression example';
proc logistic descending;
model restnt2 = age apache age apache esc cef gen tob ami/
selection=stepwise lackfit;
title 'ESC logistic regression example';
There's not a lot to be said about the DATA step. The reason for the array statement and the
do loop within the data step was to adjust to the fact that in the BMDP file, missing values were
coded as -1's, whereas in SAS I wanted to use its internal missing value codes.
There are two PROC LOGISTIC paragraphs listed, mostly because I included the LACKFIT option
on only one and I wanted to abbreviate the output for that part of the analysis. The model
statement describes pretty much everything. The dependent variable is to the left of the equals
sign, the predictors (covariates) are listed to the right of the equals sign. Notice that three of
the predictors are GEN, TOB and AMI, which were three of the four categories that de ned the
aminoglycoside predictor in the BMDP analysis. Thus, they're being used as dummy variables in
this analysis.
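If you're starting from a single categorical variable rather than from ready-made indicators, the
dummy variables have to be built in a DATA step before PROC LOGISTIC ever sees them. Here's a
minimal sketch of how that might look; the data set names and the four-level variable amino (with
its 0-3 coding) are hypothetical stand-ins, not the actual variables from the infection data.

data recoded;
  set rawdata;                    /* hypothetical input data set                      */
  /* amino: hypothetical 4-level aminoglycoside code (0=none, 1=gen, 2=tob, 3=ami);   */
  /* each logical comparison below returns 1 (true) or 0 (false)                      */
  gen = (amino = 1);              /* 1 if gentamicin, 0 otherwise                     */
  tob = (amino = 2);              /* 1 if tobramycin, 0 otherwise                     */
  ami = (amino = 3);              /* 1 if amikacin, 0 otherwise                       */
run;

Note that a missing value of amino would be scored as 0 by these comparisons, so missing codes
would need to be handled separately, as was done with the array and do loop above.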
Everything to the right of the slash is an option that I listed. There are many such options that
are possible. I used the INFLUENCE option to print out all of the residual and leverage diagnostics.
COVB tells SAS to print out a covariance matrix for the estimated parameters, rather than just
the correlation matrix. The SELECTION option tells SAS whether to use no stepping (NONE), or a
forward, backward or full stepwise algorithm. The LACKFIT option in the second PROC paragraph
requests that the Hosmer goodness of fit statistic be calculated and printed out. This option is
available only in Version 6.07 or later. Finally, the DESCENDING option on the PROC statement
tells SAS whether to fit a model that predicts the probability of success or one that predicts the
probability of failure. By default, SAS will order the responses and fit a model that predicts the
probability of the smaller level (or levels). (You can also decide which one to predict based on
the order of the responses in the data base, by using the ORDER=DATA option, but this is a little
risky.) Since I wanted to predict the probability of a resistant infection, which was the larger (2,
as opposed to 0) of the two responses, I had to use the DESCENDING option to reverse the default
ordering. The DESCENDING option is available in Version 6.07 or later. To do an equivalent switch
in earlier versions of SAS, the data would have to be sorted first in descending order according to
the response, and then you would have to use the ORDER=DATA option. A nontrivial disadvantage
to this approach is that when the diagnostic information is printed, the cases would be numbered
in their new (sorted) order, rather than in the original order in the data base. This can make it
tough to figure out which of the observations are causing problems for the analysis.
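For reference, the pre-6.07 workaround would look roughly like the sketch below (the data set
name is carried over from the example above, and the predictor list is abbreviated); keep in mind
the case renumbering problem just mentioned.

/* sort so the level to be predicted (2) comes first in the data set ... */
proc sort data=stucohen;
  by descending restnt2;
run;
/* ... and then let ORDER=DATA pick up that ordering */
proc logistic data=stucohen order=data;
  model restnt2 = age apache esc cef gen tob ami;
run;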
2.7 Differences between BMDPLR and PROC LOGISTIC
I've covered most of these items in going through the examples, but I think it's worth summarizing
them here, for easy reference if nothing else. Each of these packages does a pretty good job of
the analysis, but it's also true that each package provides some information that the other either
doesn't provide or at least doesn't provide in a terribly useful form. It may seem at times that I have
a preference for one package over the other. In fact, I often do, but I don't always prefer the same
package. It's reasonable to say that you can do a better job of this type of analysis using both
packages in conjunction with each other than you can using either package on its own. With those
comments out of the way, I'll go on to my list of some of the more notable differences between the
two.
Handling of categorical predictors. As I mentioned above, PROC LOGISTIC really isn't set
up to handle categorical predictors very well (about as well as a typical regression package is
set up to handle ANOVA problems). In order to include a categorical predictor, you have to create
a series of dummy variables and list all of them in the MODEL statement. At that point, SAS
will pick and choose among the predictors (assuming you're doing a stepwise analysis), rather
than choosing all of them or none of them. Moreover, there are no checks to ensure that
interaction terms are included only in conjunction with the corresponding main effects (or
lower order interactions). BMDP does handle categorical predictors in a reasonable way.
 Categorical predictors with many levels. The drawback in BMDP's handling of categorical
variables is that it has an upper limit (equal to ten) on how many categories a categorical
predictor can have. If your predictor has more than ten levels, then there's no straightforward
way to use it in the model, short of defining a series of dummy variables. SAS can do this
too, so on this score, the two packages are about the same.
Goodness of fit statistics. Prior to Version 6.07, SAS gives you a number of criteria that can
be used to choose among competing models, but it doesn't give you any indication (other
than the deviance test) of whether any of the models fits the data. Starting in Version 6.07,
it calculates a Hosmer goodness of fit statistic, though not quite in the form you might like.
BMDP gives a Hosmer statistic along with a Brown goodness of fit statistic. These
complement each other, since they look at different aspects of lack of fit. Both packages give
a goodness of fit test that's based on the likelihood ratio, or deviance, chi squared statistic.
This statistic is of very little use whenever there are (continuous) covariates in the model, and
its presence in the output may give you the false impression that you've done an adequate
check of the model's fit.
Influence diagnostics. Both BMDP and SAS calculate a number of regression diagnostics
(SAS offers a somewhat greater variety of statistics) that measure whether individual points
are abnormal either in their response, in their predictors, or in a combination of the two.
BMDP's diagnostics are output to a BMDP output file, from which it's cumbersome to
extract information. SAS does a better job of presenting these results in an easily usable
form.
 Mathematical algorithms. Both BMDPLR and PROC LOGISTIC have two available algo-
rithms for doing these calculations, one of which is quicker (less computationally intense)
and somewhat approximate, and the other of which is more involved. The use of the more
approximate method can lead to what may seem to be contradictions, in which a variable
outside the model is added to the model based on its (significant) predictive power, but then
is immediately removed for being insignificant. BMDP's slow algorithm is somewhat slower
than SAS's, but it's also a true maximum likelihood solution, rather than an approximation
to one. The fast algorithms may be useful in screening variables, but it's a good idea to do
the final calculations on a model using the slower of the two algorithms.
 Handling of models that fail to converge. Working with real data (i.e., reasonable sample
sizes), situations can and will arise in which one or more of the actual model parameters is
infinite. Both SAS and BMDP are trying to make the model converge to a finite solution,
so they'll recognize this as a problem and tell you that the model has failed to converge.
BMDP's reaction is to tell you that this problem has arisen and ask you if you want to go
ahead anyway. (The correct answer is quite often "yes.") SAS simply prints out an error
message and terminates the procedure. Since it's genuinely impossible to make a model like
this converge (to a finite solution), BMDP's handling of this problem is more reasonable. In
SAS, the only ways of handling the problem are to weaken the convergence criterion (a bad
idea) or to remove the offending data from the analysis.
General data manipulations. One of SAS's strengths relative to BMDP is the data handling
capability of the DATA step and the programming flexibility offered by its macro facility.
This can be a nontrivial advantage if you're interested in fitting a variety of models using
the same data. Another advantage SAS has is that it handles the input and manipulation
of character data more easily than BMDP. For the infection data, I first had to recode the
data into numerical form, and then redefine the character labels I wanted once the data were
in BMDP. This is a nuisance that could have been avoided in SAS.
Polychotomous responses. BMDP has a separate program (BMDPPR) that handles similar
models for situations in which the response has more than two levels. (I'll spend some
time discussing these models next time.) SAS has combined the two programs into one,
so that if you specify a dependent variable that has 3 or more response levels, it will
fit a polychotomous logistic regression. (More often than not, this is called a polytomous
logistic regression; I don't view "polytomous" as an actual English word, so I'm going to resist
that terminology.) If this isn't what you intended, then you have to define a dichotomous response
before you run PROC LOGISTIC, as in the sketch below.
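Here's a minimal sketch of that last step, assuming a hypothetical three-level response called
severity (coded 0, 1, 2) in which anything above 0 is to count as a positive response; the data set
name and the abbreviated predictor list are also just illustrative.

data binary;
  set mydata;                          /* hypothetical input data set             */
  if severity = . then anyresp = .;    /* keep missing responses missing          */
  else anyresp = (severity > 0);       /* 1 for levels 1 or 2, 0 for level 0      */
run;
proc logistic descending data=binary;
  model anyresp = age apache;          /* predictor list abbreviated              */
run;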
2.8 Some comments about R2
I mentioned in the first session that in a logistic regression, there was nothing analogous to the R2
value that you get from a linear regression. To some extent this is an oversimplification, although
it's certainly true that there's not a lot of consensus about what the appropriate statistic is to use,
and the statistics that have been proposed aren't being used all that widely in practice.
There are basically two types of statistics that have been proposed. The first came out of some
work by McFadden7 and it's closely related to the likelihood ratio goodness of fit statistic (the one
I didn't much like). In linear regression, R2 is the proportion of the variability about the constant-only
model that's explained by the model in question (equivalently, one minus the proportion left
unexplained). Another way of thinking of this is that it locates the model in question on a continuum
whose two extremes are defined as an R2 of zero (the constant response model) and an R2 of one (for
a model with a residual sum of squares equal to zero). For a logistic regression, you can define
something similar, along the lines of
\[ R^2_L = \frac{L_M - L_0}{L_S - L_0}, \]
where L0, LS, and LM are the log likelihoods of the constant model, a saturated model, and
the model in question, respectively. There's no problem with the constant model and the model
in question, but the saturated model could be taken either as a model that contains all of the
available covariates, or else one that predicts a different value for each of the different covariate
combinations. Neither of these definitions is terribly satisfying, since they both depend on which
covariates you're considering for the model. If you added some new covariates, this R2 value would
change, and that doesn't seem quite right.
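Neither package prints this R2L for you, but it's easy enough to compute by hand from the log
likelihoods (or the -2 Log L values) that PROC LOGISTIC reports for the intercept-only fit and for
the model in question, together with whichever saturated fit you've settled on. A sketch, using
placeholder numbers rather than anything from the infection data:

data mcfadden;
  /* -2 Log L values read off the PROC LOGISTIC output; placeholders only     */
  m2logl_0 = 250.4;    /* intercept-only model                                */
  m2logl_m = 198.7;    /* model in question                                   */
  m2logl_s = 150.0;    /* whatever you've taken as the saturated model        */
  /* the factor of -2 cancels, so the ratio can be formed directly            */
  r2_l = (m2logl_0 - m2logl_m) / (m2logl_0 - m2logl_s);
  put 'McFadden-type R2: ' r2_l;
run;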
The second type of statistic was originally proposed by Morrison8 and it's based on the predicted
probabilities of a positive response that come out of the model. If the dependent variable Y is
defined to be either 0 or 1, and for individual i the model's estimate of Pr{Y = 1} is p̂i, then
Morrison suggested defining his R2 measure as
\[ R^2 = \frac{\sum_i (Y_i - \hat{p}_i)^2}{\sum_i (Y_i - \bar{Y})^2}, \]
where Ȳ is the arithmetic average of the Y's. The main problem with this measure is that some
of the Y's are more variable than others (since Var Y = p(1 − p)), and yet this formula gives them
equal weight. Amemiya9 has suggested a generalization of this measure that takes this into account.
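If you want to compute the Morrison ratio above for yourself, one way is to save the fitted
probabilities with PROC LOGISTIC's OUTPUT statement and then form the two sums of squares.
A rough sketch (the abbreviated predictor list and the output data set names are just illustrative):

proc logistic descending data=stucohen;
  model restnt2 = age apache esc cef;  /* predictor list abbreviated          */
  output out=preds p=phat;             /* save the fitted probabilities       */
run;
data preds;
  set preds;
  y = (restnt2 = 2);                   /* 0/1 version of the response         */
run;                                   /* (2 = resistant, 0 = not)            */
proc sql;
  select mean(y) into :ybar from preds;             /* average of the 0/1 Y's */
  select sum((y - phat)**2) / sum((y - &ybar)**2)
         as morrison_r2
    from preds;
quit;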
I wish I could say that I knew exactly what was meant by R2 in the context of a logistic
regression, but unfortunately, each of these measures can be referred to as "R2," as a "pseudo R2,"
or as a "quasi R2."
2.9 Next time...
... we'll talk about some other types of analysis that you might be considering as alternatives to
logistic regression. My goal in discussing them is to help clarify logistic regression's place relative
to the other methods, and to give a sense of how the alternative methods differ and of why one
method might be preferable over another under a given set of circumstances. I'm not
7 D. McFadden, "Conditional logit analysis of qualitative choice behavior," in Frontiers in Econometrics, P.
Zarembka, ed., MIT Press, 1974.
8 D. Morrison, "Upper bounds for correlations between binary outcomes and probabilistic predictions," Journal of
the American Statistical Association, v. 67, pp. 68-70, 1972.
9 T. Amemiya, "Qualitative Response Models: a survey," Journal of Economic Literature, v. 19, pp. 1483-1536,
1981.
going to discuss the other methods in all that much detail; not so much that you'd become expert
in how to carry them out, but at least enough that you'll recognize them and see how they differ
from a logistic analysis. There are five methods that I'm particularly interested in discussing, namely:
 probit analysis,
 loglinear models,
 polychotomous logistic regression,
 discriminant analysis, and
 cluster analysis.
All of these are at least superficially similar to logistic regression in that on some level there are
groups of things involved, but for most of these methods, the similarities don't go a heck of a lot
further than that.