
Statistical Issues in Traffic Accident Modeling

Ziad Sawalha

Research Associate
Department of Civil Engineering
University of British Columbia
Vancouver, B.C., Canada V6T 1Z4

Tarek Sayed

Associate Professor
Department of Civil Engineering
The University of British Columbia
2324 Main Mall
Vancouver, BC, Canada V6T 1Z4
Tel: (604) 822-4379
Fax: (604) 822-6901
E-mail: tsayed@civil.ubc.ca

ABSTRACT

Accident prediction models are invaluable tools that have many applications in road safety analysis. However, there are certain statistical issues related to accident modeling that either deserve further attention or have not been dealt with adequately in the road safety literature. This paper discusses and illustrates how to deal with two statistical issues related to modeling accidents using Poisson and negative binomial regression. The first issue is that of model building, or deciding which explanatory variables to include in an accident prediction model. The study differentiates between applications for which it is advisable to avoid model over-fitting and other applications for which it is desirable to fit the model to the data as closely as possible. It then suggests procedures for developing parsimonious models, i.e. models that are not over-fitted, and best-fit models. The second issue discussed in the paper is that of outlier analysis. The study suggests a procedure for the identification and exclusion of extremely influential outliers from the development of Poisson and negative binomial regression models. The procedures suggested for model building and conducting outlier analysis are more straightforward to apply in the case of Poisson regression models because of an added complexity presented by the shape parameter of the negative binomial distribution. The paper therefore presents flowcharts detailing the application of the procedures when modeling is carried out using negative binomial regression. The described procedures are then applied in the development of negative binomial accident prediction models for the urban arterials of the cities of Vancouver and Richmond, located in the province of British Columbia, Canada.

1. INTRODUCTION

Statistical modeling is used to develop accident prediction models (APMs) relating accident occurrence on various road facilities to the traffic and geometric characteristics of these facilities. These models have several applications, such as estimation of the safety potential of road entities, identification and ranking of hazardous or accident-prone locations, evaluation of the effectiveness of safety improvement measures, and safety planning.

Currently, generalized linear regression modeling (GLM) is used almost exclusively for the development of APMs, since several researchers (e.g. Jovanis and Chang, 1986; Miaou and Lum, 1993) have demonstrated that certain standard conditions under which conventional linear regression modeling is appropriate (Normal error structure and constant error variance) are violated by traffic accident data. Most safety researchers now adopt either a Poisson or a negative binomial error structure in the development of APMs. Several GLM statistical software packages are available for the development of these models. This software allows the modeling of data that follow a wide range of probability distributions belonging to the exponential family (among which are the Poisson and the negative binomial distributions).

The road safety literature is rich with APMs developed by Poisson or negative binomial regression. However, there are certain statistical issues related to accident modeling that either deserve further attention or have not been dealt with adequately in the road safety literature. This paper discusses and illustrates how to deal with two statistical issues related to accident modeling, namely the issue of selecting the explanatory variables of an APM and the issue of outlier analysis. The procedures given for dealing with these issues are then applied in the development of APMs for the urban arterials of the cities of Vancouver and Richmond, located in the province of British Columbia, Canada.

2. DATA DESCRIPTION

A total of 58 arterials in the cities of Vancouver and Richmond were investigated for the purpose of developing APMs relating the safety of urban arterial sections to their traffic and geometric characteristics. Geometric data were collected directly in the field. The approach to geometric data collection consisted of dividing each arterial into sections between consecutive signalized intersections and gathering field information for each section separately. The geometric data collected consisted of: (a) section length, (b) between-signal number of lanes, (c) number of unsignalized intersections, (d) number of driveways, (e) number of bus stops, (f) number of pedestrian crosswalks, (g) type of median (no median, flush median, or raised-curb median), (h) type of land use (residential, business, or other), and (i) percentage of section length along which parking is allowed.

The data on accident frequencies and traffic volumes along the arterials were obtained from the cities of Vancouver and Richmond and covered the period from 1994 to 1996. Accidents that occurred at signalized intersections were excluded from the accident data used to develop the models. Summary statistics describing the data are provided in Table 1.

3. MODEL DEVELOPMENT

The development of the APMs for the urban arterial sections of Vancouver and Richmond was carried out using the GLIM4 statistical software package (Numerical Algorithms Group, 1994). The following elements were necessary for model development: an appropriate model form, an appropriate error structure, a procedure for selecting the model explanatory variables, a procedure for outlier analysis, and methods for assessing model goodness of fit. The procedures for selecting the model explanatory variables and conducting outlier analysis are the statistical issues targeted by this paper and are discussed in the following sections.

3.1 Model Form

The mathematical form used for any APM should satisfy two conditions. First, it must yield logical results. This means that (a) it must not lead to the prediction of a negative number of accidents, and (b) it must ensure a prediction of zero accident frequency for zero values of the exposure variables, which, for road sections, are section length and annual average daily traffic (AADT).

The second condition that must be satisfied by the model form is that, in order to use generalized linear regression in the modeling procedure, there must exist a known link function that can linearize this form for the purpose of coefficient estimation. These conditions are satisfied by a model form that consists of the product of powers of the exposure measures multiplied by an exponential incorporating the remaining explanatory variables. Such a model form can be linearized by the logarithm link function. Expressed mathematically, the model form that was used is:

$\hat{E}(Y) = a_0 \times L^{a_1} \times V^{a_2} \times \exp\left( \sum_j b_j x_j \right)$   (1)

where $\hat{E}(Y)$ = predicted accident frequency, $L$ = section length, $V$ = section AADT, $x_j$ = any variable additional to $L$ and $V$, and $a_0, a_1, a_2, b_j$ = the model parameters.
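As a concrete illustration, the following minimal sketch (Python, with purely hypothetical parameter and data values) evaluates equation (1) and notes the linearization performed by the logarithm link:

```python
import numpy as np

def predicted_accidents(L, V, x, a0, a1, a2, b):
    """Evaluate equation (1): E(Y) = a0 * L^a1 * V^a2 * exp(sum_j b_j * x_j)."""
    return a0 * L**a1 * V**a2 * np.exp(np.dot(b, x))

# The log link linearizes the form for coefficient estimation:
#   ln E(Y) = ln a0 + a1*ln L + a2*ln V + sum_j b_j * x_j
L, V = 1.2, 15000.0             # hypothetical section length (km) and AADT (vpd)
x = np.array([4.0, 2.0])        # hypothetical additional explanatory variables
a0, a1, a2 = 0.03, 0.80, 0.65   # hypothetical parameter values
b = np.array([0.09, 0.08])
print(predicted_accidents(L, V, x, a0, a1, a2, b))
```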

3.2 Error Structure

As mentioned earlier, the GLM approach to modeling traffic accident occurrence assumes an error structure that is Poisson or negative binomial. Let $Y$ be the random variable that represents the accident frequency at a given location during a specific time period, and let $y$ be a certain realization of $Y$. The mean of $Y$, denoted by $\Lambda$, is itself a random variable (Kulmala, 1995). For $\Lambda = \lambda$, $Y$ is Poisson distributed with parameter $\lambda$:

$P(Y = y \mid \Lambda = \lambda) = \dfrac{\lambda^y e^{-\lambda}}{y!}; \quad E(Y \mid \Lambda = \lambda) = \lambda; \quad Var(Y \mid \Lambda = \lambda) = \lambda$   (2a,b,c)

It is the usual practice to assume that the distribution of $\Lambda$ can be described by a gamma probability density function. Hauer (1997) examined many accident data sets, and the empirical evidence he obtained supported the gamma assumption for the distribution of $\Lambda$. If $\Lambda$ is described by a gamma distribution with shape parameter $\kappa$ and scale parameter $\kappa/\mu$, then its density function is:

$f_\Lambda(\lambda) = \dfrac{(\kappa/\mu)^\kappa\, \lambda^{\kappa-1}\, e^{-(\kappa/\mu)\lambda}}{\Gamma(\kappa)}; \quad E(\Lambda) = \mu; \quad Var(\Lambda) = \dfrac{\mu^2}{\kappa}$   (3a,b,c)

and the distribution of $Y$ around $E(\Lambda) = \mu$ is negative binomial (Hinde and Demetrio, 1998; Hauer et al., 1988). Therefore, unconditionally:

$P(Y = y) = \dfrac{\Gamma(\kappa + y)}{\Gamma(\kappa)\, y!} \left( \dfrac{\kappa}{\kappa + \mu} \right)^{\kappa} \left( \dfrac{\mu}{\kappa + \mu} \right)^{y}; \quad E(Y) = \mu; \quad Var(Y) = \mu + \dfrac{\mu^2}{\kappa}$   (4a,b,c)

As shown by equations (4b,c), the variance of the accident frequency is generally larger than its expected value, reflecting the fact that accident data are generally over-dispersed. The only exception is when $\kappa \to \infty$, in which case the distribution of $\Lambda$ is concentrated at a point and the negative binomial distribution becomes identical to the Poisson distribution.
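The gamma-Poisson mixture leading to equations (4a,b,c) is easy to check numerically. The following simulation sketch (Python/NumPy; the parameter values are arbitrary) draws $\Lambda$ from the gamma distribution of equation (3a) and then $Y$ from a Poisson distribution with that mean; the sample moments reproduce equations (4b,c):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, kappa, n = 5.0, 3.0, 1_000_000   # arbitrary illustrative values

# Lambda ~ Gamma(shape = kappa, rate = kappa/mu); NumPy's `scale` is 1/rate.
lam = rng.gamma(shape=kappa, scale=mu / kappa, size=n)
y = rng.poisson(lam)                 # Y | Lambda = lam is Poisson(lam)

print(y.mean())   # close to mu = 5.0                       (eq. 4b)
print(y.var())    # close to mu + mu**2 / kappa = 13.33     (eq. 4c)
```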

The decision on whether to use a Poisson or negative binomial error structure was based on the following methodology. First, the model parameters are estimated based on a Poisson error structure. Then, the dispersion parameter $\sigma_d$ is calculated as follows:

$\sigma_d = \dfrac{\text{Pearson } \chi^2}{n - p}$   (5)

where $n$ is the number of observations, $p$ is the number of model parameters, and Pearson $\chi^2$ is defined as:

$\text{Pearson } \chi^2 = \sum_{i=1}^{n} \dfrac{\left[ y_i - \hat{E}(Y_i) \right]^2}{Var(Y_i)}$   (6)

where $y_i$ is the observed number of accidents on section $i$, $\hat{E}(Y_i)$ is the predicted accident frequency for section $i$ as obtained from the APM, and $Var(Y_i)$ is the variance of the accident frequency for section $i$. The dispersion parameter, $\sigma_d$, is noted by McCullagh and Nelder (1989) to be a useful statistic for assessing the amount of variation in the observed data. If $\sigma_d$ turns out to be significantly greater than 1.0, then the data have greater dispersion than is explained by the Poisson distribution, and a negative binomial regression model is fitted to the data.
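A sketch of this decision rule, assuming the counts y and a design matrix X (with a constant column and the log-transformed exposure terms) have already been assembled; statsmodels is used here purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Fit the Poisson model first.
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

mu_hat = poisson_fit.mu                              # fitted E(Y_i)
pearson_chi2 = np.sum((y - mu_hat) ** 2 / mu_hat)    # eq. (6); Var(Y_i) = mu_i for Poisson
sigma_d = pearson_chi2 / (len(y) - X.shape[1])       # eq. (5)

# A sigma_d substantially above 1.0 indicates over-dispersion:
# refit with a negative binomial error structure.
print(sigma_d)
```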

3.3 Assessment of Model Goodness of Fit

Several statistical measures can be used to assess the goodness of fit of GLM models. The two statistical measures used are those cited by McCullagh and Nelder (1989) for assessing model goodness of fit. These are (a) the Pearson $\chi^2$ statistic, defined in equation (6), and (b) the scaled deviance. The scaled deviance is the likelihood ratio test statistic measuring twice the difference between the maximized log-likelihoods of the studied model and the full or saturated model. The full model has as many parameters as there are observations, so that the model fits the data perfectly. Therefore, the full model, which possesses the maximum log-likelihood achievable under the given data, provides a baseline for assessing the goodness of fit of an intermediate model with $p$ parameters.

McCullagh and Nelder (1989) have shown that if the error structure is Poisson distributed, then the scaled deviance is:

$SD = 2 \sum_{i=1}^{n} y_i \ln\left( \dfrac{y_i}{\hat{E}(Y_i)} \right)$   (7)

while if the error structure follows the negative binomial distribution, the scaled deviance is:

$SD = 2 \sum_{i=1}^{n} \left[ y_i \ln\left( \dfrac{y_i}{\hat{E}(Y_i)} \right) - (y_i + \kappa) \ln\left( \dfrac{y_i + \kappa}{\hat{E}(Y_i) + \kappa} \right) \right]$   (8)

Both the scaled deviance and the Pearson $\chi^2$ have $\chi^2$ distributions for Normal theory linear models, but they are asymptotically $\chi^2$ distributed with $n - p$ degrees of freedom for other distributions of the exponential family.
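For reference, equations (7) and (8) transcribe directly into code. This sketch follows the paper's expressions as written, with the usual convention that a term with $y_i = 0$ contributes zero to the sum:

```python
import numpy as np

def scaled_deviance_poisson(y, mu):
    """Equation (7); terms with y_i = 0 contribute 0 by convention."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)              # avoid log(0); masked out below
    terms = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    return 2.0 * terms.sum()

def scaled_deviance_negbin(y, mu, kappa):
    """Equation (8)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)
    t1 = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    t2 = (y + kappa) * np.log((y + kappa) / (mu + kappa))
    return 2.0 * (t1 - t2).sum()
```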

4. SELECTION OF MODEL EXPLANATORY VARIABLES

There seems to be a belief among many safety researchers that the more variables in an APM, the better the model. Some researchers have even reported models containing variables with highly insignificant parameters, based on the belief that such variables would still improve model prediction. Such variables are of little value for explaining the variability of the specific accident data used in generating the model, and of even less value for predicting accident frequencies at new locations not used in the model development. Explanatory variables that have statistically significant model parameters, on the other hand, contribute to the explanation of the variability of the accident data, and their inclusion in the model therefore improves its fit to these data. Nevertheless, improvement of a model's fit to the accident data is not sufficient justification for retaining a variable in the model. This paper presents a detailed analysis of how to select which explanatory variables to include in an APM.

The procedure that is suggested in this paper for selecting the explanatory variables to include in an APM depends on the locations whose safety is to be studied by the model. Safety researchers and road safety authorities are not interested only in the safety study of the limited number of locations used for model building. They want to be able to study the safety of other locations in the same region that have traffic and geometric characteristics similar to those of the locations used to build a model. Various reasons could prevent many locations from being included in the development of a model. For example, accident data might not be available for some of these locations at the time of model development. Another reason might be that some of these locations simply did not exist at that time. Generality is therefore a characteristic required of a model that will be used for the safety study of new locations not included in its development.

Model generality requires that a model be developed in accordance with the principle of parsimony, which calls for explaining as much of the variability of the data as possible using the fewest explanatory variables. The idea behind the principle of parsimony is to avoid over-fitting. It is tempting to include many variables in a model in an effort to make it fit the observed data as closely as desired. In fact, if a number of statistically significant variables equal to the number of observations can be found, a perfect fit to the data can be achieved by including these variables in the model. However, the question that must be asked is whether a model having a perfect or extremely close fit to a particular set of observations can produce reliable predictions when applied to a different set of locations. The answer is that such a model is unstable and may perform poorly when applied to a new sample drawn from the same population. For the safety study of new locations, more is not better, and it may be worse. The procedure that is proposed in this paper for building parsimonious models is explained in the next section.

On the other hand, if an APM is to be used for studying the safety of the particular set of locations used to develop it, then a more accurate study results from using a model that fits the accident data as closely as possible. This best-fit model is achieved by including all the available statistically significant explanatory variables. The procedure for building best-fit models is also explained in the next section.

4.1 Procedures for Building Parsimonious and Best-Fit Models

The procedure that is suggested in this paper for developing parsimonious APMs is a forward procedure by which the variables are added to a model one by one. The decision on whether a variable should be retained in the model is based on two criteria. The first criterion is whether the t-ratio of its estimated parameter (equal to the parameter estimate divided by its standard error, and equivalent to the Wald statistic) is significant at the 95 percent confidence level (or any other level selected by the model developer). The second criterion is whether the addition of the variable to the model causes a significant drop in the scaled deviance at the 95 percent confidence level.

The second criterion represents an analysis of deviance procedure for comparing two nested models. This procedure is equivalent to carrying out a likelihood ratio test to determine whether the model containing the additional variable significantly increases the likelihood of the observed sample of accident data. As mentioned earlier, the scaled deviance is asymptotically $\chi^2$ distributed with $n - p$ degrees of freedom; therefore, owing to the reproductive property of the $\chi^2$ distribution, this second criterion is met if the addition of the variable causes a drop in scaled deviance exceeding $\chi^2_{0.05,1}$ (equal to 3.84).
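The 3.84 threshold is simply the 95th percentile of the $\chi^2$ distribution with one degree of freedom, e.g.:

```python
from scipy.stats import chi2

print(chi2.ppf(0.95, df=1))   # 3.8415, the critical drop in scaled deviance
```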

However, it is important to note at this point that the analysis of deviance procedure for developing parsimonious models must be conducted differently for Poisson and negative binomial regression models. In the case of Poisson regression models, the difference in scaled deviance between the model with $p + 1$ variables and the model with $p$ variables is equal to the likelihood ratio test statistic, which compares the maximized likelihood functions of the sample of accident data under the two models. This statistic is defined as follows:

$LRTS = 2 \ln\left( \dfrac{L_{p+1}}{L_p} \right)$   (9)

where
$LRTS$ = the likelihood ratio test statistic,
$L_{p+1}$ = the maximum likelihood of the accident data under the model with $p + 1$ variables, and
$L_p$ = the maximum likelihood of the accident data under the model with $p$ variables.

The likelihood of the accident data under a certain model is equal to the joint probability of the accident observations. It is therefore a function of the model parameters and the error structure assumed by the model. $LRTS$ is a non-negative statistic, and the minimum value it can have is zero, in which case the additional variable contributes nothing to increasing the likelihood of the sample of accident data. $LRTS$ is used to test the null hypothesis $H_0: \beta_{p+1} = 0$, where $\beta_{p+1}$ is the parameter of the additional variable. A statistically significant $LRTS$ leads to the rejection of the null hypothesis and the adoption of the model with $p + 1$ variables. A statistically insignificant $LRTS$ means failing to reject the null hypothesis and concluding that the model with $p$ variables is perfectly satisfactory.

Determining the statistical significance of $LRTS$ requires knowledge of its sampling distribution. Aitkin et al. (1989) state that if two nested models with degrees of freedom $df_1$ and $df_2$ are correct, the sampling distribution of their $LRTS$ is an asymptotic $\chi^2$ distribution with degrees of freedom equal to $(df_1 - df_2)$. Therefore, for the $LRTS$ in equation (9) to be statistically significant at the 95 percent confidence level, it has to exceed $\chi^2_{0.05,1} = 3.84$. The fact that, in the case of Poisson regression models, the difference in scaled deviance between the model with $p + 1$ variables and the model with $p$ variables is equal to their $LRTS$ means that this difference can be used directly to assess the usefulness of the additional variable. A drop in scaled deviance exceeding 3.84 can be taken as the basis for choosing the model with $p + 1$ variables.

In the case of negative binomial regression, models with different variables have different $\kappa$ values. The scaled deviances of two negative binomial regression models with different values of $\kappa$ cannot be directly compared (Hinde, 1996). The reason is that the difference in scaled deviance between the two models is not equal to their likelihood ratio test statistic. In this case, conducting a meaningful analysis of deviance procedure for developing parsimonious models requires that the value of $\kappa$ for the model with $p$ variables be imposed on the model with $p + 1$ variables. The drop in scaled deviance is then compared with $\chi^2_{0.05,1}$ in order to assess the usefulness of the additional variable. The procedure for developing parsimonious negative binomial regression models is illustrated in the next section, with the end result being a parsimonious APM for the urban arterial sections of Vancouver and Richmond.
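The following sketch shows what one such fixed-$\kappa$ comparison could look like in Python with statsmodels, assuming the counts y and the two nested design matrices X_p and X_p1 (the latter containing one additional candidate variable) already exist; the names are illustrative only. Note that statsmodels parameterizes the negative binomial variance as $\mu + \alpha\mu^2$, so $\alpha = 1/\kappa$:

```python
import statsmodels.api as sm
from scipy.stats import chi2
from statsmodels.discrete.discrete_model import NegativeBinomial

# 1. Fit the reference (p-variable) model, estimating the dispersion by
#    maximum likelihood; the estimated alpha is the last fitted parameter.
ref = NegativeBinomial(y, X_p).fit(disp=0)
alpha_ref = ref.params[-1]            # alpha = 1/kappa of the reference model

# 2. Refit both nested models as GLMs with the dispersion fixed at the
#    reference value, so their scaled deviances are directly comparable.
family = sm.families.NegativeBinomial(alpha=alpha_ref)
dev_p = sm.GLM(y, X_p, family=family).fit().deviance
dev_p1 = sm.GLM(y, X_p1, family=family).fit().deviance

# 3. Retain the candidate variable only if the drop in deviance is significant.
if dev_p - dev_p1 > chi2.ppf(0.95, df=1):
    print("retain the additional variable")
```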

Now that the procedure for developing parsimonious APMs has been outlined, it is necessary to justify it. In other words, it is necessary to state why the t-test alone is not sufficient for developing parsimonious Poisson and negative binomial regression models. It should be stated at the outset that the t-test and the likelihood ratio test, or analysis of deviance, are completely equivalent for two nested Normal linear regression models differing by one variable. They can both be used to test the null hypothesis $H_0: \beta_{p+1} = 0$. This is not the case for the rest of the GLM family of models, for the following reasons:

1. The use of the t-distribution when parameter estimates are compared with their standard errors is exact for the classical Normal linear regression model, but is otherwise justified only by asymptotic theory (Numerical Algorithms Group, 1994). No general results are known about the adequacy of this approximation for all the other models covered by GLM, so the t-test must be regarded as only a general guide to the accuracy of the estimates.

2. Aitkin et al. (1989) state that an asymptotic $\chi^2$ distribution for the $LRTS$ of two nested models is not used in the case of Normal linear regression models, since exact results are available. The exact distribution for the $LRTS$ of two nested Normal linear regression models containing $p + g$ and $p$ explanatory variables is the $F_{g,\,n-p-g}$ distribution, with $g$ degrees of freedom in the numerator and $n - p - g$ degrees of freedom in the denominator. Therefore, when $g = 1$, the distribution of $LRTS$ is $F_{1,\,n-p-1}$, which is equal to $t^2_{n-p-1}$. This means that, in the case of Normal linear regression models, the t-test and the likelihood ratio test are completely equivalent. For regression models in which the response variable is not Normally distributed, the value of $t^2_{n-p-1}$ may be substantially different from the $LRTS$ of two nested models differing by one variable.

The procedure used in this paper for developing APMs that fit the accident data as closely as possible is relatively straightforward. It is also a forward procedure by which variables are added to a model one by one. However, in this case, the only condition for retaining a variable in the model is that the t-ratio of its estimated coefficient be significant at the 95 percent confidence level.

4.2 Illustration of the Procedure for Developing Parsimonious Negative Binomial APMs

The development of parsimonious negative binomial APMs is best illustrated by the flowchart in Figure 1. This flowchart shows the detailed sequence of steps necessary to develop such models according to the procedure outlined in the previous section. The letter symbols used in the flowchart are defined as follows:

i = an integer variable used as an index for the explanatory variables.
j = an integer variable used to refer to the model number.
κ(j) = shape parameter of the negative binomial error structure under Reference Model(j).
N = an integer variable representing the number of explanatory variables, other than the exposure variables, that are under investigation at any phase of the process.
X(i) = explanatory variable i.
D(i) = drop in scaled deviance resulting from the addition of variable X(i) to a model.
S = a real variable used to store the value of the greatest drop in scaled deviance during a cycle of variable additions to Reference Model(j).

The flowchart in Figure 1 shows that several models are run, with and without fixing the value of $\kappa$, at various phases of the model development process. The default method used by GLIM 4 to determine the vector of model parameters, $\boldsymbol{\beta}$, and the shape parameter, $\kappa$, is the method of maximum likelihood. This method provides maximum likelihood estimates of $\boldsymbol{\beta}$ and $\kappa$: the values that maximize the log-likelihood function $l(\boldsymbol{\beta}, \kappa)$ given by:

$l(\boldsymbol{\beta}, \kappa) = \sum_{i=1}^{n} \left[ \ln \Gamma(y_i + \kappa) - \ln \Gamma(\kappa) - \ln(y_i!) - \kappa \ln\left(1 + \dfrac{\hat{y}_i}{\kappa}\right) - y_i \ln\left(1 + \dfrac{\kappa}{\hat{y}_i}\right) \right]$   (10)

The log-likelihood function depends on $\boldsymbol{\beta}$ through the terms $\hat{y}_i = \hat{E}(Y_i)$, which represent the model predictions or fitted values. The model form most commonly used in accident modeling was given in equation (1), and it specifies the fitted values as $\hat{y}_i = \exp(x_i' \boldsymbol{\beta})$, where $x_i$ is the vector of explanatory variables corresponding to the $i$th observation. The method of maximum likelihood uses the Newton-Raphson technique to obtain $\boldsymbol{\beta}$ and $\kappa$ as the solution to the equations $\partial l / \partial \boldsymbol{\beta} = 0$ and $\partial l / \partial \kappa = 0$. GLIM 4 also allows the user to pre-specify the value of $\kappa$. The value of $\boldsymbol{\beta}$ that maximizes the log-likelihood function is then calculated using the Newton-Raphson technique with Fisher scoring. Fisher scoring allows the calculation of $\boldsymbol{\beta}$ through an iteratively re-weighted least squares (IRLS) procedure.
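Equation (10) transcribes directly into code. The sketch below (using SciPy's log-gamma function; `negbin_loglik` is an illustrative name) can be used, for example, to profile the log-likelihood over a grid of $\kappa$ values for a given set of fitted values:

```python
import numpy as np
from scipy.special import gammaln

def negbin_loglik(y, y_hat, kappa):
    """Equation (10): negative binomial log-likelihood of the observed
    counts y given fitted values y_hat and shape parameter kappa.
    Note that ln(y!) = gammaln(y + 1) and log1p(x) = ln(1 + x)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum(
        gammaln(y + kappa) - gammaln(kappa) - gammaln(y + 1.0)
        - kappa * np.log1p(y_hat / kappa)
        - y * np.log1p(kappa / y_hat)
    )
```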

The starting point in the model development process is a basic model containing the exposure variables only. The reason for this starting point is that any accident model should contain at least the exposure variables, since no accidents occur without exposure. The basic model acts as the first reference model. The term "reference model" is used here to denote a model that serves as the base for generating a new model containing one additional variable. Subsequently, secondary models are developed by adding the rest of the variables individually to the basic model with $\kappa$ fixed to the value obtained under that model. Since $\kappa$ is fixed at this stage in the process, it is possible to compare the scaled deviance of the model resulting from the addition of a certain variable to the scaled deviance of the basic model. No further consideration is given to any variable whose parameter has an insignificant t-ratio or that causes an insignificant drop in the scaled deviance obtained under the basic model. The drop in scaled deviance caused by each of the other variables is noted and recorded.

The next stage in the process consists of developing a new reference model containing the exposure variables and the explanatory variable that caused the maximum drop in scaled deviance during the previous stage; this model will have a new $\kappa$ value. The remaining variables that were not discarded in the previous stage are now individually added to the second reference model with $\kappa$ fixed to the new value, and a procedure similar to the one in the previous stage is used to obtain the next reference model. The model building process continues in this manner, using the method of maximum likelihood to obtain the reference models and the fixed-$\kappa$ method to obtain the secondary models. The process terminates when all the explanatory variables have been examined, and the final reference model is the parsimonious model sought. A compact code sketch of this loop is given below.
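The sketch below is one possible transcription of the Figure 1 loop into Python with statsmodels (illustrative names throughout; `X_base` holds the constant and log-exposure columns, and `candidates` maps variable names to data columns). It simplifies some of the flowchart's bookkeeping but applies the same two retention criteria, with $\kappa$ re-estimated only for reference models:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from statsmodels.discrete.discrete_model import NegativeBinomial

def forward_select(y, X_base, candidates, level=0.95):
    """Forward selection for a parsimonious negative binomial APM."""
    crit = chi2.ppf(level, df=1)
    X_ref, remaining = X_base, dict(candidates)
    while remaining:
        # Reference model: kappa (alpha = 1/kappa) estimated by ML.
        ref = NegativeBinomial(y, X_ref).fit(disp=0)
        family = sm.families.NegativeBinomial(alpha=ref.params[-1])
        dev_ref = sm.GLM(y, X_ref, family=family).fit().deviance
        drops = {}
        for name, col in remaining.items():
            # Secondary model: same fixed kappa, one additional variable.
            fit = sm.GLM(y, np.column_stack([X_ref, col]), family=family).fit()
            drop = dev_ref - fit.deviance
            # Keep only variables passing both criteria (t-ratio, deviance drop).
            if abs(fit.tvalues[-1]) > 1.96 and drop > crit:
                drops[name] = drop
        remaining = {k: remaining[k] for k in drops}   # discard the rest
        if not drops:
            break
        best = max(drops, key=drops.get)               # largest deviance drop
        X_ref = np.column_stack([X_ref, remaining.pop(best)])
    return X_ref
```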

4.3 Parsimonious and Best-Fit Models Developed

The procedures described for developing parsimonious and best-fit accident prediction models were used to develop such models for the urban arterial sections of Vancouver and Richmond. These models are listed in Table 2; they are models for predicting total accident frequency on a section over a period of 3 years. The symbols used in the table are: (a) L = section length in km; (b) V = annual average daily traffic in vehicles per day; (c) UNSD = unsignalized intersection density in intersections per km; (d) DRID = driveway density in driveways per km; (e) BUSD = bus stop density in bus stops per km; (f) CROD = crosswalk density in crosswalks per km; (g) NL = between-signal number of lanes; (h) IRES = indicator variable for residential land use (equal to 1 if land use is residential; 0 otherwise); (i) IBUS = indicator variable for business land use (equal to 1 if land use is business; 0 otherwise); and (j) IUND = indicator variable for undivided cross section (equal to 1 if the section has no median; 0 otherwise).

Both models have values of scaled deviance and Pearson $\chi^2$ that lie below the critical $\chi^2$ value at the 95% confidence level, indicating that the models have an acceptable fit to the data. The t-ratios of the parameter estimates of both models are also significant at the 95% confidence level. It should be noted that the parsimonious model contains fewer variables than the best-fit model, which was developed using only the significant t-ratio criterion for retaining variables in a model.

5. OUTLIER ANALYSIS

Most data sets contain a few unusual or extreme observations called outliers. Outliers occur in a set of data either because they are genuinely different from the rest of the data or because errors took place during data collection and recording. Errors can be detected and corrected, but how should outliers that are genuinely different from the rest of the data be dealt with? This paper proposes excluding outliers from model development only if they have an extreme influence on the model equation. Such exclusion relieves the model equation of sensitivity to a very small percentage of the data and establishes more confidence in its validity. However, it should be emphasized that excluding influential outliers from the development of an accident model is not synonymous with neglecting these outliers or removing them from the database. Rather, they should be investigated to determine what makes them different and whether any information can be extracted from them.

Kulmala (1995) used the leverage statistic to identify outliers that should be excluded from model development. In the Normal linear regression case, the leverage of data point $i$ is a measure of the distance of the $p$-dimensional point $x_i$ from the centroid of the other variate points in the $p$-dimensional design space, which is the vector space spanned by the column vectors of an $n \times p$ design or model matrix $X$. The design space will hereafter be referred to as the X-space. High-leverage data points are therefore outliers in the X-space, and they tend to pull the regression hyper-surface towards them because they exist alone in their region of the X-space.

The data point leverage values are the diagonal elements of the $n \times n$ hat matrix, which is the matrix that multiplies the vector of observations to yield the vector of predicted values. For Normal linear regression models, the hat matrix is $H = X(X'X)^{-1}X'$, and the $i$th diagonal element of $H$ is $h_{ii} = x_i'(X'X)^{-1}x_i$. The value of $h_{ii}$ lies between zero and one. In the case of Normal linear regression models, a value of one means that the response value predicted by the model is equal to the observed value (Jorgensen, 1993). Thus, a large value of $h_{ii}$ is an indication that the corresponding observation may be influential in determining the position of the fitted model.

The concept of leverage may be extended to other models of the GLM family. For these other models, the hat matrix is based on the IRLS procedure for model fitting. Therefore, $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$, where $W$ is the $n \times n$ diagonal matrix of weights:

$w_{ii} = \dfrac{1}{Var(y_i)} \left( \dfrac{d\mu_i}{d\eta_i} \right)^2$   (11)

where $\mu_i$ = the mean of $y_i$, and $\eta_i = x_i'\boldsymbol{\beta}$ is the linear predictor associated with $y_i$. In GLM, $\mu_i$ and $\eta_i$ are related by a link function, which is the log link for Poisson and negative binomial regression models. Therefore, for these models, $\mu_i = \exp(\eta_i)$ and $d\mu_i/d\eta_i = \exp(x_i'\boldsymbol{\beta})$. The $i$th diagonal element of $H$ in the case of non-Normal GLM models is $h_{ii} = w_{ii}\, x_i'(X'WX)^{-1}x_i$.
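A direct NumPy sketch of this computation (for the log link, $d\mu_i/d\eta_i = \mu_i$, so $w_{ii} = \mu_i^2 / Var(y_i)$; `glm_hat_diagonal` is an illustrative name):

```python
import numpy as np

def glm_hat_diagonal(X, mu, var_y):
    """Leverages h_ii = w_ii * x_i' (X'WX)^{-1} x_i for a log-link GLM,
    with w_ii = (dmu_i/deta_i)^2 / Var(y_i) = mu_i^2 / Var(y_i)  (eq. 11).
    For Poisson, var_y = mu; for negative binomial, var_y = mu + mu**2/kappa."""
    w = mu**2 / var_y
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    # h_ii = w_ii * sum_{j,k} X[i,j] * XtWX_inv[j,k] * X[i,k]
    return w * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)
```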

It should be noted that, in the case of non-Normal GLM models, a data point $i$ at the extreme of the X-range will not necessarily have high leverage if its weight $w_{ii}$ is very small (McCullagh and Nelder, 1989). The sum of the leverage values $h_{ii}$, namely trace($H$), is equal to the number of model parameters $p$. The average leverage value is therefore equal to $p/n$, where $n$ is the number of data points. Many sources (e.g. Hoaglin and Welsch, 1978) consider a high leverage to be one that exceeds $2p/n$, and data points with leverage in excess of this value should be subject to further consideration.

High-leverage data points have the potential to be influential by virtue of their location in the X-space. However, high leverage alone is not sufficient to render a point influential and warrant excluding it from model development. Influence requires that a point be extreme in its value of the response variable in addition to being an outlier in the X-space. An appropriate measure of influence is Cook's distance, which is defined as follows:

$c_i = \dfrac{h_{ii}}{p(1 - h_{ii})} \left( r_i^{PS'} \right)^2$   (12)

where $r_i^{PS'}$ is the studentized residual of data point $i$. In the case of Poisson and negative binomial regression models, the studentized residual is given by:

$r_i^{PS'} = \dfrac{y_i - \hat{y}_i}{\sqrt{(1 - h_{ii})\,Var(y_i)}}$   (13)

Thus, Cook's distance is made up of a component that reflects how well or how poorly the model fits the $i$th observation $y_i$, namely $\hat{y}_i$, and a component that measures how far the data point is from the rest of the points in the X-space, namely $h_{ii}$.
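Continuing the previous sketch, equations (12) and (13) become (hypothetical helper; `hat` is the vector of leverages $h_{ii}$, e.g. from `glm_hat_diagonal` above):

```python
import numpy as np

def cooks_distance(y, mu, var_y, hat, p):
    """Equations (12)-(13): Cook's distance from studentized Pearson residuals."""
    r_ps = (y - mu) / np.sqrt((1.0 - hat) * var_y)   # eq. (13)
    return hat / (p * (1.0 - hat)) * r_ps**2         # eq. (12)
```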

Data points having a high Cook's distance are influential points that play a significant role in determining the model parameter estimates. Situations in which a relatively small percentage of the data has a significant impact on the model may not be acceptable to model users. Generally, users are more content to assume that a regression equation is valid if it is not overly sensitive to a few observations. Unfortunately, there is no rule for how high Cook's distance has to be to make a data point extremely influential. Therefore, a procedure is needed by which to decide, with a high degree of confidence, whether a data point is extremely influential. This paper proposes the following procedure for identifying extremely influential outliers:

1. The data are sorted in descending order according to the $c_i$ values.
2. Starting with the data point having the largest $c_i$ value, data points are removed one by one, and the drop in scaled deviance is observed after the removal of each point.
3. Points causing a significant drop in scaled deviance are extremely influential outliers.

The same analysis of deviance procedure used to develop parsimonious APMs can be used in the identification of extremely influential outliers. As mentioned before, the scaled deviances of two negative binomial regression models with different values of $\kappa$ cannot be directly compared. Performing outlier analysis for a negative binomial regression model therefore requires that the value of $\kappa$ for the model with $n$ data points be imposed upon the model with $n - 1$ data points; the difference in scaled deviance can then be compared to $\chi^2_{\alpha,1}$ in order to assess whether the removed data point is an extremely influential outlier. The following section illustrates the procedure proposed by this paper for performing outlier analysis for negative binomial regression models.
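A minimal sketch of this loop, in the same style as the earlier selection sketch (illustrative names; `cook_d` is the vector of Cook's distances under the initial reference model, and alpha = 1/$\kappa$):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from statsmodels.discrete.discrete_model import NegativeBinomial

def remove_influential_outliers(y, X, cook_d, level=0.95):
    """Drop points in order of decreasing Cook's distance while the
    drop in scaled deviance (with kappa fixed) remains significant."""
    order = np.argsort(cook_d)[::-1]       # decreasing Cook's distance
    y, X = y[order], X[order]
    crit = chi2.ppf(level, df=1)
    while len(y) > X.shape[1] + 1:
        ref = NegativeBinomial(y, X).fit(disp=0)          # kappa by ML
        family = sm.families.NegativeBinomial(alpha=ref.params[-1])
        dev_all = sm.GLM(y, X, family=family).fit().deviance
        dev_less = sm.GLM(y[1:], X[1:], family=family).fit().deviance
        if dev_all - dev_less <= crit:
            break                          # first non-influential point: stop
        y, X = y[1:], X[1:]                # remove the outlier and refit
    return y, X
```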

5.1 Illustration of the Procedure for the Identification and Removal of Influential Negative Binomial Model Outliers

The flowchart in Figure 2 gives a complete description of the procedure proposed by this paper for the identification and removal of extremely influential negative binomial model outliers. The letter symbols used in the flowchart are defined as follows:

n = the total number of data points available.
i = an integer variable used as an index for the data points.
j = an integer variable used to refer to the model number.
κ(j) = shape parameter of the negative binomial distribution under Reference Model(j).
D(i) = drop in scaled deviance resulting from the removal of point i.

The term "reference model" is used here to denote a model that serves as the base for generating a new model using one less data point (the identified influential outlier).

The procedure outlined in Figure 2 was employed to identify and remove the extremely influential outliers of the parsimonious model in Table 2. It resulted in the removal of seven data points, and the final parsimonious APM for the urban arterials of Vancouver and Richmond is listed in Table 3.

6. SUMMARY

This study presents a detailed discussion of two statistical issues related to accident modeling using Poisson and negative binomial regression. These are the issue of model building, i.e. selecting from the available safety database which explanatory variables to include in an accident prediction model, and the issue of the identification and removal of extremely influential model outliers. The study suggests procedures for model building and conducting outlier analysis and provides justification for the validity of these procedures. The procedures are straightforward to apply in connection with Poisson regression models. However, in the case of negative binomial regression models, there is an added complexity arising from the fact that the scaled deviances of two negative binomial regression models with different values of the shape parameter, $\kappa$, cannot be directly compared. The study therefore presents flowcharts detailing the application of the procedures in connection with negative binomial regression models. The procedures are then applied in the development of accident prediction models for the urban arterials of the cities of Vancouver and Richmond in the province of British Columbia, Canada.

ACKNOWLEDGEMENTS

Financial support for this study was provided by the Natural Sciences and Engineering Research Council of Canada and the Insurance Corporation of British Columbia.

REFERENCES

Aitkin, M., Anderson, D., Francis, B., Hinde, J., 1989. Statistical Modelling in GLIM. Oxford
University Press, New York.

Hauer, E., Ng, J.C.N., Lovell, J., 1988. Estimation of safety at signalized intersections.
Transportation Research Record 1185, 48-61.

Hauer, E., 1997. Observational before-after studies in road safety: estimating the effect of
highway and traffic engineering measures on road safety. Pergamon Press, Elsevier Science Ltd.,
Oxford, U.K.

Hinde, J., 1996. Macros for fitting overdispersion models. GLIM Newsletter 26, 10-26.

Hinde, J., Demetrio, C.G.B., 1998. Overdispersion: models and estimation. Computational
Statistics & Data Analysis 27, 151-170.

Hoaglin, D.C., Welsch, R.E., 1978. The hat matrix in regression and ANOVA. The American Statistician 32 (1), 17-22.

Jorgensen, B., 1993. The Theory of Linear Models. Chapman and Hall, New York.

Jovanis, P.P., Chang, H.L., 1986. Modeling the relationship of accidents to miles traveled.
Transportation Research Record 1068, 42-51.

Kulmala, R., 1995. Safety at rural three and four-arm junctions: Development and application of accident prediction models. VTT Publications, Espoo, Finland.

McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman and Hall, New York.

Miaou, S., Lum, H., 1993. Modeling vehicle accident and highway geometric design relationships. Accident Analysis & Prevention 25 (6), 689-709.

Numerical Algorithms Group, 1994. The GLIM System Release 4 Manual. The Royal Statistical Society, Oxford, UK.

[Figure 1: Flowchart demonstrating the process of developing a parsimonious negative binomial regression model]

[Figure 2: Flowchart demonstrating the procedure for identification and removal of negative binomial regression model outliers]

Table 1 Safety Database Summary

Number of Street Sections by Median Type       Vancouver    Richmond    Total
  Raised-Curb Median                                  64          17       81
  Two-Way Left-Turn Lane (TWLTL)                       0          11       11
  Undivided Cross Section                            225          74      299
  Total                                              289         102      391

Number of Street Sections by Land Use          Vancouver    Richmond    Total
  Residential                                        193          84      277
  Business                                            92          10      102
  Other                                                4           8       12
  Total                                              289         102      391

Total Length by Median Type                    Vancouver    Richmond     Total
  Raised-Curb Median                           48.857 km   14.943 km    63.800 km
  Two-Way Left-Turn Lane (TWLTL)                    0 km    9.543 km     9.543 km
  Undivided Cross Section                     181.419 km   65.504 km   246.923 km
  Total                                       230.276 km   89.990 km   320.266 km

Accident Data (1994-1996)                      Vancouver    Richmond    Total
  Fatalities                                          26          17       43
  Injuries                                          4478        1080     5558
  Property Damage Only (PDO) Accidents             12816        1722    14538
  Total                                            17320        2819    20139
  PDO Accidents as Percentage of Total             74.0%       61.1%    72.2%

Range of Database Geometric and Traffic Data                     Vancouver      Richmond
  Number of through traffic lanes                                      2-6           2-4
  Section length, meters                                          113-3608      405-2510
  Number of driveways per km (two-way total)                          0-58          0-95
  Number of unsignalized intersections per km (two-way total)         0-21           0-9
  Number of pedestrian crosswalks per km (two-way total)              0-12           0-9
  Number of bus stops per km (two-way total)                          0-25          0-12
  Average daily traffic, vpd                                    4236-62931    4232-33862

Table 2 Parsimonious and Best-Fit Accident Prediction Models

Parsimonious Model:
Acc per 3 yrs = $0.02907 \times L^{0.8015} \times V^{0.6571} \times e^{0.09323\,UNSD + 0.07663\,CROD + 0.09115\,NL - 0.2350\,IRES}$

Scaled Deviance = 414.31 (df = 384); Shape Parameter $\kappa$ = 3.139; Pearson $\chi^2$ = 413.7; $\chi^2_{0.05,384}$ = 430.7

Variable       Constant   L        V        UNSD      CROD      NL        IRES
Coefficient    0.02907    0.8015   0.6571   0.09323   0.07663   0.09115   -0.2350
t-ratio        5.320      12.265   8.846    9.815     5.443     2.743     -3.188

Best-Fit Model:
Acc per 3 yrs = $0.01841 \times L^{0.8005} \times V^{0.6664} \times e^{0.08561\,UNSD + 0.07846\,CROD + 0.08838\,NL + 0.01509\,BUSD + 0.1414\,IUND + 0.01470\,DRID \cdot IBUS}$

Scaled Deviance = 414.34 (df = 382); Shape Parameter $\kappa$ = 3.210; Pearson $\chi^2$ = 424.4; $\chi^2_{0.05,382}$ = 428.6

Variable       Constant   L        V        UNSD      CROD      NL        BUSD      IUND     DRID·IBUS
Coefficient    0.01841    0.8005   0.6664   0.08561   0.07846   0.08838   0.01509   0.1414   0.01470
t-ratio        5.854      12.350   8.832    8.788     5.653     2.723     2.041     1.970    3.763

Table 3 Parsimonious Accident Prediction Model After Outlier Analysis

Model:
Acc per 3 yrs = $0.03608 \times L^{0.7798} \times V^{0.6337} \times e^{0.09445\,UNSD + 0.07944\,CROD + 0.07633\,NL - 0.2174\,IRES}$

Scaled Deviance = 408.73 (df = 377); Shape Parameter $\kappa$ = 3.547; Pearson $\chi^2$ = 370.9; $\chi^2_{0.05,377}$ = 423.3

Variable       Constant   L        V        UNSD      CROD      NL        IRES
Coefficient    0.03608    0.7798   0.6337   0.09445   0.07944   0.07633   -0.2174
t-ratio        5.245      12.493   8.957    10.471    5.903     2.401     -3.050

