Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
www.elsevier.com/locate/aap
Abstract
Logistic regression was applied to accident-related data collected from traffic police records in order to examine the
contribution of several variables to accident severity. A total of 560 subjects involved in serious accidents were sampled. Accident
severity (the dependent variable) in this study is a dichotomous variable with two categories, fatal and non-fatal. Therefore, each
of the subjects sampled was classified as being in either a fatal or non-fatal accident. Because of the binary nature of this
dependent variable, a logistic regression approach was found suitable. Of nine independent variables obtained from police accident
reports, two were found most significantly associated with accident severity, namely, location and cause of accident. A statistical
interpretation is given of the model-developed estimates in terms of the odds ratio concept. The findings show that logistic
regression as used in this research is a promising tool in providing meaningful interpretations that can be used for future safety
improvements in Riyadh. 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Logistic regression; Accident severity
1. Introduction
Accident severity is of special concern to researchers in traffic safety since this research is aimed
not only at prevention of accidents but also at reduction of their severity. One way to accomplish the latter is to identify the most probable factors that affect
accident severity. This study aims at examining not
all factors, but some believed to have a higher potential for serious injury or death, such as accident location, type, and time; collision type; and age and
nationality of the driver at fault, his license status,
and vehicle type. Other factors were not examined
because of substantial limitations in the data obtained
from accident reports. Logistic regression was used in
this study to estimate the effect of the statistically
significant factors on accident severity. Logistic regression and other related categorical-data regression
methods have often been used to assess risk factors
for various diseases. However, logistic regression has
been used as well in transportation studies. A brief
2. Literature review
Regression methods have become an integral component of any data analysis concerned with the relationship between a response variable and one or more
explanatory variables. The most common regression
method is conventional regression analysis (CRA), either linear or nonlinear, when the response variable is
continuous (iid). However, when the outcome (the response variable) is discrete, CRA is not appropriate.
Among several reasons, the following two are the
most significant:
1. The response variable in CRA must be continuous,
and
2. The response variable in CRA can take nonnegative
values.
These two primary assumptions are not satisfied when
the response variable is categorical.
0001-4575/02/$ - see front matter 2002 Elsevier Science Ltd. All rights reserved.
PII: S 0 0 0 1 - 4 5 7 5 ( 0 1 ) 0 0 0 7 3 - 2
730
Jovanis and Chang (1986) found a number of problems with the use of linear regression in their study
applying Poisson regression as a means to predict accidents. For example, they discovered that as vehiclekilometers traveled increases, so does the variance of
the accident frequency. Thus, this analysis violates the
homoscedasticity assumption of linear regression.
In a well-summarized review of models predicting
accident frequency, Milton and Mannering (1997) state:
The use of linear regression models is inappropriate
for making probabilistic statements about the occurrences of vehicle accidents on the road. They showed
that the negative binomial regression is a powerful
predictive tool and one that should be increasingly
applied in future accident frequency studies.
Kim et al. (1996) developed a logistic model and used
it to explain the likelihood of motorists being at fault in
collisions with cyclists. Covariates that increase the
likelihood of motorist fault include motorist age, cyclist
age (squared), cyclist alcohol use, cyclists making turning actions, and rural locations.
Kim et al. (1994) attempted to explain the relationship between types of crashes and injuries sustained in
motor vehicle accidents. By using techniques of categorical data analysis and comprehensive data on
crashes in Hawaii during 1990, a model was built to
relate the type of crash (e.g. rollover, head-on,
sideswipe, rear-end, etc.) to a KABCO injury scale.
They also developed an odds multiplier that enabled
comparison according to crash type of the odds of
particular levels of injury relative to noninjury. The
effects of seatbelt use on injury level were also examined, and interactions among belt use, crash type, and
injury level were considered. They discussed how loglinear analysis, logit modeling, and estimation of odds
multipliers may contribute to traffic safety research.
Kim et al. (1995) built a structural model relating
driver characteristics and behavior to type of crash and
injury severity. They explained that the structural
model helps to clarify the role of driver characteristics
and behavior in the causal sequence leading to more
severe injuries. They estimated the effects of various
factors in terms of odds multipliers that is, how
much does each factor increase or decrease the odds of
more severe crash types and injuries.
Nassar et al. (1997) developed an integrated accident
risk model (ARM) for policy decisions using risk factors affecting both accident occurrences on road sections and severity of injury to occupants involved in the
accidents. Using negative binomial regression and a
sequential binary logit formulation, they developed
models that are practical and easy to use. Mercier et al.
(1997) used logistic regression to determine whether
either age or gender (or both) was a factor influencing
severity of injuries suffered in head-on automobile collisions on rural highways.
e i0 + i1x
1+ e i0 + i1x
(1)
y(x)
= i0 + i1x
1 y(x)
(2)
l(i) = 5 n(xi )
i=1
(3)
731
(3.1)
i=1
(4)
i=1
n
(5)
i=1
i=1
i=1
% yi = % y (xi )
That is, the sum of the observed values of y is equal to
the sum of the expected (predicted) values. This property is especially useful in assessing the fit of the model
(Hosmer and Lemeshow, 1989).
After the coefficients are estimated, the significance
of the variables in the model is assessed. If yi denotes
the observed value and yi denotes the predicted value
for the ith individual under the model, the statistic used
in the linear regression is
n
SSE = % (yi yi )2
i=1
n
= % (yi yi )2 % (yi yi )2
i=1
i=1
732
D= 2 ln
(6)
Using Eqs. (3.1) and (6), the following test statistic can
be obtained:
D= 2 % yi ln
i=1
n
y i
1 y i
+(1 yi ) ln
yi
1 yi
(7)
where y i =y (xi ).
The statistic D in Eq. (7), for the purpose of this
study, is called the deviance, and it plays an essential
role in some approaches to the assessment of goodness
of fit. The deviance for logistic regression plays the
same role that the residual sum of squares plays in
linear regression (i.e. it is identically equal to SSE).
For the purpose of assessing the significance of an
independent variable, the value of D should be compared with and without the independent variable in the
model. The change in D due to inclusion of the independent variable in the model is obtained as follows:
G=D(for the model without the variable)
D(for the model with the variable)
This statistic plays the same role in logistic regression as
does the numerator of the partial F-test in linear regression. Because the likelihood of the saturated model is
common to both values of D being the difference to
compute G, this likelihood ratio can be expressed as
G = 2 ln
(8)
i. 1
SE. (i. 1)
(9)
4. Model description
The dependent variable in this research, ACCIDENT, is of the dichotomous type and stands for
accident severity. It should be mentioned that the defin-
1
1+ e g(x)
5. Data description
The data set used in this study was derived from a
sample of 560 subjects involved in serious accidents
reported in traffic police records in Riyadh, the capital
of Saudi Arabia. Only accidents occurring on urban
roads in Riyadh were examined. Unfortunately, police
reports at accident sites do not describe injuries in
much detail because of the lack of police qualifications
and training as well as facilities needed to perform
complex examinations. Also, medical reports are hard
to obtain because police accident data and medical data
are not kept together (Al-Ghamdi, 1996). Consequently, it was impossible for this study to obtain
details on the degree of severity of the accidents. All
that can be learned from the police records is that the
accident is a property damage only (PDO) accident,
injury accident (no injury classification is available), or
fatal accident. The subjects were selected in a systematic
random process from all accident records filed for the
period from August 1997 to November 1998. The data
search was done manually because of the lack of com-
Table 1
Description of the study variables
Number
Description
Codes/values
Abbreviation
Accident
0=Non-fatal
1= Fatal
ACCIDENT
Location
1= Intersection
2= Non-intersection
LOC
Accident type
1=With vehicle(s)
2= Fixed-object
3= Over-turn
4= Pedestrian
ATYP
Collision type
1= Right-angle
2= Sideswipe
3=Rear-end
4=Front
5=Unknown
CTYP
Accident time
1 =Day
2=Night
TIME
Accident cause
1= Speed
2=Run red light
3= Follow too close
4= Wrong way
5= Failure to yield
6= Other
CAUS
Driver age at
fault
Years
AGE
Nationality
1 =Saudi
2= Non-Saudi
NAT
Vehicle type
1 = Small passenger
car
2= Large passenger
car
3= Pick-up truck
4= Taxi
5= Other
VEH
1 = Yes (valid)
2= Expired
3= No (no license)
LIC
License status
10
733
With vehicle(s)
Fixed-object
Overturn
Pedestrian
Design variable
D1
D2
D3
0
1
0
0
0
0
1
0
0
0
0
1
734
Ho: pi = 0
Ha : pi " 0
where pi is the proportion of class i (level i ) within the
designated design variable.
For example, the design variables for ATYP were
reduced from three (four levels) to two (three levels)
after it was shown that the proportion of fixed-object
and overturn accidents was not statistically significant
at the 5% level using the foregoing hypothesis. Table 3
summarizes the hypothesis testing results for all categorical variables in the study, and Table 4 shows the
number of design variables after reduction.
The study variables were now ready to use in the
model development stage, as discussed in the next
section.
735
Table 3
Hypothesis testing for proportions
Description
P-value
Upper
Distribution by location
Intersection
Opening*
Circle*
Exit*
Road section
159
38
5
20
357
579
579
579
579
579
0.275
0.066
0.009
0.035
0.617
0.2
0
0
0
0.6
0.3
0.1
0
0
0.7
235
39
28
303
605
605
605
605
0.388
0.064
0.046
0.501
0.4
0
0
0.5
0.4
0.1
0.1
0.5
119
173
40
225
48
605
605
605
605
605
0.197
0.286
0.066
0.372
0.079
0.2
0.3
0
0.3
0.1
0.2
0.3
0.1
0.4
0.1
Distribution by time
Morning
Evening
Night
Unknown*
122
414
55
14
605
605
605
605
0.202
0.684
0.091
0.023
0.2
0.7
0.1
0
0.2
0.7
0.1
0
226
51
18
96
80
134
605
605
605
605
605
605
0.374
0.084
0.03
0.159
0.132
0.221
0.3
0.1
0
0.1
0.1
0.2
0.4
0.1
0
0.2
0.2
0.2
By 6ehicle type
PC
Big PC*
Pick-up
Taxi*
Other*
335
37
139
26
23
560
560
560
560
560
0.598
0.066
0.248
0.046
0.041
0.6
0
0.2
0
0
0.6
0.1
0.3
0.1
0.1
By Licensing status
Yes
Expired*
No
370
37
153
560
560
560
0.661
0.066
0.273
0.6
0
0.2
0.7
0.1
0.3
736
Table 4
Number of design variables after reduction
Categorical
variable
Locationa
Accident
typea
Collision
typea
Accident
time
Accident
causea
Nationality
Vehicle typea
License
statusa
a
Before reduction
After reduction
Levels
Design
variables
Levels
Design
variables
5
4
4
3
2
3
1
2
2
5
3
1
4
2
2
2
2
1
1
1
Understanding and quantifying the relationship between driver characteristics, particularly age and accident risk, has long been a high priority of
accident-related research. In addition, other studies (Hilakivi et al., 1989; Mercier et al., 1997) have shown that
young drivers as well as older drivers are more at risk
of being involved in serious accidents. Numerous research studies have attempted to examine the complex
relationship between driver characteristics and accident
risk. Drivers risk-taking behavior is often defined in
terms of several variables, one of which is age. Manner-
737
Table 5
Estimated coefficients, estimated standard errors, and Wald statistic for the model variables
Variable
Estimated coefficient
P-value
LOC(2)
ATYP(2)
ATYP(3)
CTYP(2)
CTYP(3)
CTYP(4)
TIME(2)
CAUS(2)
CAUS(3)
CAUS(4)
CAUS(5)
AGE
NAT(2)
VEH(2)
LIC(2)
0.905
0.02367
0.3032
0.3218
0.02615
0.1194
0.01806
0.3071
0.1447
0.9131
0.7727
0.01883
0.3469
0.2871
0.4257
0.352
0.369
0.449
0.3904
0.306
0.4545
0.3917
0.5849
0.314
0.4635
0.3217
0.01232
0.2739
0.2406
0.2514
2.57
0.06
0.68
0.82
0.09
0.26
0.05
0.53
0.46
1.97
2.4
1.53
1.27
1.19
1.69
0.005*
0.48
0.75
0.79
0.54
0.40
0.52
0.30
0.32
0.024*
0.01*
0.06
0.10
0.79
0.045*
g (x)
= 2.029+ 0.9697LOC(2) 0.3558CAUS(2)
+ 0.2130CAUS(3) 0.8971CAUS(4)
0.6705CAUS(5)
Hence the logistic regression model developed in this
study is
8. Logit model
According to the previous analysis, the logit model
with the significant variables is as follows:
Once the model has been fit, the process of assessment of the model begins. Several tests, including
Pearson 2 and deviance, the Wald statistic, and the
HosmerLemeshow tests, can be used to determine
how effective the model is in describing the response variable, or its goodness of fit. These tests resulted
in a 2 criterion to make the decision on the model
fit. A very good source for the theory of such tests is,
for example, Hosmer and Lemeshow (1989). The validity of the model in this study was first checked by
examining the statistical level of significance for its
coefficients using deviance and the Wald statistic, as
discussed earlier.
Graphical assessment of the fit to the logistic
model developed in the study also shows that the
model appears to fit the data reasonably, as shown
in Figs. 4 and 5. Fig. 4 shows the plot of Pearson
residuals, in which no trend can be detected. Fig. 5 shows
Hi-Leverage points (outliers) in which very small points
appear to be outliers [less than 4% of the data set;
compare PRES with 1.96 (z-value at the 5% level of
significance)]. That is, 95% of the points in this plot lie
between 0.5 and 1.9.
738
9. Model interpretation
Interpretation of any fitted model requires the ability
to draw practical inferences from the estimated
coefficients. The estimated coefficients for the
independent variables represent the slope or rate of
change of the dependent variable per unit of change in
the independent variable. Thus, interpretation involves
two issues: determining the functional relationship
between the dependent variable and the independent
variable (i.e. the link function; McCullagh and Nelder,
1982) and appropriately defining the unit change for the
independent variable. In the logistic regression model,
the link function is the logit transformation (Eq. (2)).
The slope coefficient in this model represents the
change in the logit for a change of one unit in the
independent variable x. Proper interpretation of the
coefficient in a logistic regression model depends on
being able to place a meaning on the difference between
two logits. The exponent of this difference gives the
odds ratio, which is defined as the ratio of the odds that
the independent variable will be present to the odds
that it will not be present. Thus, the relationship
between the logistic regression coefficient and the odds
ratio provides the foundation for interpretation of all
logistic regression results. It should be noted that odds
greater than 1 in this study increase the likelihood that
the accident will be fatal. Illustrations follow of the
interpretation of the model developed in this study.
Table 6
A summary of P-values after dropping variables from saturated model
Variable dropped from the saturated model
Change in deviance
P-valuea
Saturated mode
LIC
VEH
NAT
CAUS
TIME
CTYP
ATYP
AGE
LOC
2.83b
1.04
3.36
10.73
0.05
1.38
0.36
+0.003
+16.45
1
1
1
4
1
3
2
1
1
0.093
0.31
0.067
0.03
0.82
0.71
0.55
0.452
0.000a
a
b
Based on 2 for the log-likelihood ratio test. For example: the P-value for LIC variable is obtained such as P( 22]3.095)= 0.213.
The +ve sign due to backward strategy. If the Forward strategy is chosen this sign would be ve.
739
Table 7
The results of testing interactions and a quadratic effect of age
Variable dropped from the saturated model
Change in deviance
P-value
Full modela
LOCCAUS
AGE 2
1.241
0.51
1
1
0.265
0.52
740
Table 9
Odds of being in a fatal accident at non-intersection to that in a
non-intersection accidenta
Non-intersection accident
Speed
RRL
WW
FTY
Non-intersection accident
Speed
RRL
WW
FTY
1
0.70
1.24
0.41
1.43
1
1.77
0.58
0.81
0.57
1
0.33
2.45
1.72
3.03
1
a
Example: The odds of being in a fatal accident at a non-intersection location due to WW is 1.24 times higher than that due to speed
at same type of location.
Table 8
Odds of being in a fatal accident at intersection to that in a
non-intersection accidenta
Table 10
Odds of being in a fatal accident at non-intersection to that at an
intersection accidenta
Non-intersection accident
Non-intersection Accident
10. Conclusions
Speed
RRL
WW
FTY
Intersection accident
Speed
RRL
WW
FTY
2.64
1.85
3.26
1.08
3.76
2.64
4.66
1.54
2.13
1.50
2.64
0.87
6.47
4.5
8.0
2.64
a
Example: The odds of being in a fatal accident at a non-intersection location due to WW is 3.26 times higher than that due to speed
at an intersection-related accident.
Speed
RRL
WW
FTY
Intersection accident
Speed
RRL
WW
FTY
0.38
0.27
0.47
0.16
0.54
0.38
0.67
0.22
0.31
0.21
0.38
0.13
0.93
0.65
1.15
0.38
a
Example: The odds of being in a fatal accident at a non-intersection location due to WW is far less than that due to speed at an
intersection (0.47 which is less than 1).
Fig. 6. Odds ratio of being involved in a fatal accident at a non-intersection location to that of an intersection relative to cause.
References
Agresti, A., 1984. Analysis of Ordinal Categorical Data. Wiley, New
York.
Al-Ghamdi, A.S., 1996. Road accidents in Saudi Arabia: Comparative and analytical study. Presented at the 75th Annual Meeting
of the Transportation Research Board, Washington, DC.
Feinberg, S., 1980. The Analysis of Cross-Classified Categorical
Data. MIT Press, Cambridge, MA.
GLIM, 1987. Generalised Linear Interactive Modeling Manual. Release 3.77, second ed. Royal Statistical Society, UK.
Hilakivi, I., et al., 1989. A sixteen-factor personality test for predicting automobile driving accidents of young drivers. Accident Analysis and Prevention 21 (5), 413 418.
Hosmer, D.W., Lemeshow, S., 1989. Applied Logistic Regression.
Wiley, New York.
James, J.L., Kim, K.E., 1996. Restraint use by children involved in
crashes in Hawaii, 1986 1991. In: Transportation Research
Record 1560, TRB, National Research Council, Washington, DC,
741
pp. 8 11.
Jovanis, P.P., Chang, H., 1986. Modeling the relationship of accidents to miles traveled. In: Transportation Research Record 1068,
TRB, National Research Council, Washington, DC, pp. 42 51.
Kim, K., Lawrence, N., Richardson, J., Li, L., 1994. Analyzing the
relationship between crash types and injury severity in motor
vehicle collisions in Hawaii. In: Transportation Research Record
1467, TRB, National Research Council, Washington, DC, pp.
9 13.
Kim, K., Lawrence, N., Richardson, J., Li, L., 1995. Personal and
behavioral predictors of automobile crash and injury severity.
Accident Analysis and Prevention 27 (4), 469 481.
Kim, K., Lawrence, N., Richardson, J., Li, L., 1996. Modeling fault
among bicyclists and drivers involved in collisions in Hawaii
1986 1991. In: Transportation Research Record 1538, TRB, National Research Council, Washington, DC, pp. 75 80.
Mannering, F.L., 1992. Male/female driver characteristics and accident risk: some new evidence. Presented at the 71st Annual
Meeting of the Transportation Research Board, Washington, DC,
1992.
McCullagh and Nelder, 1982. Generalized Linear Models. Cambridge
University Press, Cambridge, UK, 1982.
Mercier, C.R., Shelley, M.C., Rimkus, J., Mercier, J.M., 1997. Age
and gender as predictors of injury severity in head-on highway
vehicular collisions. In: Transportation Research Record 1581,
TRB, National Research Council, Washington, DC.
Milton, J., Mannering, F., 1997. Relationship among highway geometric, traffic-related elements, and motor-vehicle accident frequencies. Presented at the 76th Annual Meeting of the
Transportation Research Board, Washington, DC.
Nassar, S.A., Saccomanno, F.F., Shortreed, J.H., 1997. Integrated
Risk Model (ARM) of Ontario. Presented at the 76th Annual
Meeting of the Transportation Research Board, Washington, DC.
Official Statistics, 1997. Annual Statistics for the Period 1971 1997.
Ministry of Interior, Riyadh, Saudi Arabia.