
Travel Behaviour and Society 14 (2019) 1–10


Applying a random forest method approach to model travel mode choice behavior

Long Cheng a,b, Xuewu Chen a,*, Jonas De Vos b, Xinjun Lai c, Frank Witlox b,d,e

a Jiangsu Key Laboratory of Urban ITS, Southeast University, Si Pai Lou #2, Nanjing 210096, China
b Department of Geography, Ghent University, Krijgslaan 281 S8, Ghent 9000, Belgium
c School of Electro-Mechanical Engineering, Guangdong University of Technology, No. 100 Waihuan Xi Road, Guangzhou 510006, China
d Department of Geography, University of Tartu, Vanemuise 46, 51014 Tartu, Estonia
e College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, 29 Yudao Street, Nanjing 210016, China

ARTICLE INFO

Keywords:
Travel mode choice
Prediction performance
Variable importance
Random forest
Nanjing (China)

ABSTRACT

The analysis of travel mode choice is important in transportation planning and policy-making in order to understand and forecast travel demands. Research in the field of machine learning has been exploring the use of random forest as a framework within which many traffic and transport problems can be investigated. The random forest (RF) is a powerful method for constructing an ensemble of random decision trees. It de-correlates the decision trees in the ensemble via randomization that leads to an improvement of forecasting and reduces the variance when averaged over the trees. However, the usefulness of RF for travel mode choice behavior remains largely unexplored. This paper proposes a robust random forest method to analyze travel mode choices for examining the prediction capability and model interpretability. Using the travel diary data from Nanjing, China in 2013, enriched with variables on the built environment, the effects of different model parameters on the prediction performance are investigated. The comparison results show that the random forest method performs significantly better in travel mode choice prediction for higher accuracy and less computation cost. In addition, the proposed method estimates the relative importance of explanatory variables and how they relate to mode choices. This is fundamental for a better understanding and effective modeling of people’s travel behavior.

1. Introduction

In order to develop a socially desirable and environmentally sustainable transport system in line with travelers’ demands, transportation planners must improve their understanding of the hierarchy of individual and contextual variables that drive people’s travel mode choice. Understanding mode choice is important since it affects how efficiently people can travel and how much urban space is devoted to transportation functions, as well as the range of alternatives available to travelers (De Dios Ortúzar and Willumsen, 1999).

1.1. Determinants of travel mode choice

A large body of literature shows that travel mode choice is affected by a variety of factors including socio-demographics, built environment and attitudes (Cervero, 2002; Van Acker and Witlox, 2011; Ermagun et al., 2015; De Vos et al., 2016). Behavioral heterogeneity in travel mode choices is observed, varying by age, gender, income, driving license availability, education level, car ownership, and household structure. Li et al. (2012), exploring travel mode choices in the UK, confirmed that the share of car use decreases at higher ages. With respect to gender, Cheng et al. (2017) found that women rely more on public transit than men. Bhat and Srinivasan (2005) additionally reported that travelers with higher income are more likely to travel by car. Similarly, Bhat and Lockwood (2004) observed that people with a high income as well as those who have a driving license drive more frequently. With regard to education level, Plaut (2005) and van den Berg et al. (2011) both revealed that highly educated people conduct more trips (in particular leisure trips) by public transit. Car ownership is an important determinant of car trips (Ding et al., 2017). Finally, people living in larger households are less likely to use non-motorized modes than those living in smaller households (Ryley, 2006).

In addition, a number of key attributes of the built environment have been identified to exert pronounced influences on mode choice, such as building density, land-use mixture, dedicated infrastructure for

* Corresponding author.
E-mail addresses: chenxuewu@seu.edu.cn (X. Chen), jonas.devos@ugent.be (J. De Vos), xinjun.lai@gdut.edu.cn (X. Lai), frank.witlox@ugent.be (F. Witlox).

https://doi.org/10.1016/j.tbs.2018.09.002
Received 23 May 2018; Received in revised form 8 August 2018; Accepted 13 September 2018
Available online 21 September 2018
2214-367X/ © 2018 Hong Kong Society for Transportation Studies. Published by Elsevier Ltd. All rights reserved.
L. Cheng et al. Travel Behaviour and Society 14 (2019) 1–10

pedestrians and cyclists, distance to various facilities, and transportation provisions (Cervero, 2002; Schwanen and Mokhtarian, 2005; Ding et al., 2017). Generally, high density and mixed land use encourage people to walk or cycle, due to relatively short distances and walking/cycling infrastructure, or to use the available public transit facilities. It should be noted that dedicated neighborhood design towards walking and cycling advances the use of active modes. Furthermore, enhanced accessibility tends to have positive effects on walking (Cao et al., 2006). In order to improve accessibility, we need to decrease the travel distance to public facilities as well as increase transport network connectivity.

Recently, the influences exerted by attitudes toward less tangible attributes such as comfort, convenience and travel satisfaction have gained considerable attention (Scheiner and Holz-Rau, 2007; De Vos et al., 2016). Studies indicate that attitudes may be better predictors of mode choice than the traditionally used objective measures. With a sample of Swedish commuters, Johansson et al. (2006) found that attitudes toward flexibility and comfort and a pro-environmental inclination influence the individual’s choice of mode. Heinen et al. (2011) analyzed the influence of commuters’ attitudes toward the benefits of cycling (e.g., convenience, low cost, health benefits) on the mode choice decision for commutes to work. Findings showed that attitudinal factors provide an additional explanation for commuters’ cycling choice.

1.2. Modeling approach of travel mode choice

Travel mode choice has traditionally been modeled within a statistical regression framework, e.g., linear regression, Poisson regression, multinomial logit, and nested logit models (Cervero, 2002; Bhat and Srinivasan, 2005; Cheng et al., 2016). However, these models have their own model assumptions and require pre-defined underlying relationships between the dependent and explanatory variables. For example, the multinomial logit model assumes that the choice probabilities of each pair of alternatives are independent of the presence or characteristics of all other alternatives. Violations of these assumptions produce inconsistent parameter estimates and biased predictions. Another critical problem of statistical regression models is that the relative influences of explanatory variables on travel mode choices are not evaluated. Understanding the relative importance of explanatory variables could significantly help travel mode choice prediction and therefore contribute to the improvement of travel demand forecasting. Although a significance test or sensitivity analysis can be conducted for conventional statistical regression models, just one variable is evaluated at a time under the assumption that other variables remain unchanged. As a result, important interactions among variables might be ignored (Ding et al., 2016).

In contrast to statistical models, methods from the field of machine learning are promising alternatives for modeling travel mode choices. Instead of making strict assumptions, machine learning methods learn to represent complex relationships in a data-driven manner. The usefulness of machine learning methods for predicting travel mode choices has been demonstrated in transportation research, including decision tree (Lindner et al., 2017), neural network (Golshani et al., 2018), and support vector machine (Zhang and Xie, 2008; Semanjski et al., 2017). The common practice of these machine learning methods is to identify the single best performing model and utilize its estimated parameters to predict outputs under different scenarios. However, it is arguable that the development and application of a single model is not necessarily the best approach considering the various sources of error/uncertainty in the analysis of travel mode choices. The input data might contain errors, the sample might be biased, the model itself might be stochastic, and the scenarios used for predicting might not be consistent with the actual evolution of transportation systems (Rasouli and Timmermans, 2012, 2014).

In order to deal with this problem, this study explores the potential of a so-called ensemble method to predict travel mode choices. In machine learning, ensemble methods use multiple learning algorithms, obtaining better predictive performance compared to any of the constituent learning algorithms alone (Ding et al., 2016). Of all ensemble methods, the random forest (RF) method, developed by Breiman (2001), is popular and shows very good capability in solving prediction and classification problems (Zaklouta and Stanciulescu, 2012; Zhang and Haghani, 2015). Instead of fitting a single “best” tree model, the RF strategically combines multiple simple decision trees to optimize predictive performance. In terms of travel mode choices, the application of the random forest method as a multitude of decision trees means that we allow for differences in travel decision heuristics. Different decision trees in the ensemble may pick up different sources of uncertainty and variability in the data. Thus, from a purely technical viewpoint, the accuracies of model estimation and prediction would be expected to improve. Drawing on insights and techniques from both statistical and machine learning methods, the random forest method can identify and interpret relevant variables and interactions. The interpretability of this method enables us to better understand model results, and is important for analyzing the relationships between mode choice and its contributing factors.

The random forest method has been widely and successfully applied in different research fields. In the Appendix section of this paper, we provide a broad summary of recent studies that use the RF method to solve transportation prediction and classification problems. They are generally classified into four categories: travel choice behavior, traffic incident prediction, traffic time/flow prediction, and pattern recognition. Elhenawy et al. (2014), Rasouli and Timmermans (2014), and Ermagun et al. (2015) employed RF to predict travelers’ behavior, such as driving behavior at the onset of a yellow indication at signalized intersections and travel mode choice. The method has been shown to be able to deal with mixed types of data and to be effective in multi-category classification problems. Brown (2016) indicated that the predictive accuracy of rail accident severity improved through the use of RF, and that influential variables could be identified so as to gain a better understanding of the contributors. Rebollo and Balakrishnan (2014), Zhang and Haghani (2015), and Semanjski (2015) applied this method to predict travel time. Their proposed methods can fit complex nonlinear relationships while requiring little data preprocessing. Hou et al. (2015) developed different models to forecast long-term and short-term traffic flows. The experiments suggested that RF has a considerable advantage over other machine learning methods. A number of studies using RF are found in pattern recognition research, including traffic sign recognition, driving posture recognition, vehicle type recognition, trip purpose recognition, travel mode recognition, and drowsy behavior detection (Zaklouta and Stanciulescu, 2012; Zhao et al., 2012; Zhang, 2013; Montini et al., 2014; Shafique and Hato, 2015; Jahangiri and Rakha, 2015; Yang et al., 2016; Kamkar and Safabakhsh, 2016; Wang et al., 2016; Gong et al., 2018).

1.3. Objectives of this research

The review of these studies reveals that RF is a promising data mining approach for its ability to consider different types of variables and determine variable importance with no prior specification of model structures. However, there are limited studies on the application of RF in travel mode choice analysis. To the best of our knowledge, only two studies, conducted by Rasouli and Timmermans (2014) and Ermagun et al. (2015), used the random forest method to predict travel mode choice. However, they adopted the default RF specification, which might not produce the best prediction results. In our analysis, we calibrate the model parameters to achieve better performance.

The contribution of the present study to the literature is twofold. First, it adds to the existing literature by providing recent developments of random forest method on travel mode choice analysis,


using the 2013 Nanjing Travel Survey. Second, it offers a more robust RF model specification after parameter calibration for better predictive performance. In particular, the model performance is compared with other approaches within the travel mode choice prediction context.

The remainder of this paper is organized as follows. Section 2 provides a brief description of data collection in the study area (i.e., Nanjing in China), followed by the methodology and model specifications in Sections 3 and 4. Model results are presented in Section 5, while the main conclusions together with recommendations for future studies are given in Section 6.

Table 1
Travel mode choice distribution.

| Mode choice category | Frequency | Percentage |
| --- | --- | --- |
| Walk | 1376 | 18.9% |
| Bicycle | 1058 | 14.5% |
| E-motorcycle | 1876 | 25.8% |
| Public Transport (PT) | 1267 | 17.4% |
| Automobile | 1699 | 23.4% |
2. Data

The data collection consisted of two phases. In the first phase, household surveys were conducted to obtain residents’ travel information and socio-demographics. In the second phase, information on the study area’s built environment was obtained to examine its influence on travel mode choice.

A household travel diary survey was conducted in Nanjing (China) by the Nanjing Institute of City and Transport Planning on a typical weekday (Wednesday, October 30th, 2013). Nanjing, the capital of China’s eastern Jiangsu Province, has an urban area of 4733 km² and a total population of approximately 5.5 million. In 2013, there were two metro lines and 487 regular bus routes. Residents’ travel frequency is 2.71 trips per day and the average trip time is 30 min. A map of Nanjing and respondents’ household locations is shown in Fig. 1. The survey included two parts: (a) household and individual characteristics, and (b) travel information of all trips made within the 24 h of the previous day. A total of 6000 questionnaires were assigned randomly to residents in traffic analysis zones (TAZs) in accordance with their population. Taking a whole household as a unit, face-to-face, structured interviews were adopted to record all activities, involving travel details for all individuals above six years old in the household. In the end, 7276 trips made by 2991 individuals from 1435 households were used for analysis after data cleaning.

Traveler’s mode choice is categorized as follows: walk, bicycle, E-motorcycle, public transport (PT), and automobile, with the distribution of mode share shown in Table 1. The E-motorcycle has the largest share, at 25.8%. It is a flexible and convenient mode and has become a popular choice for daily travel in many cities in China. The explanatory variables of travel mode choice, including household and individual socio-demographics and the built environment, are presented in Table 2.

The built environment data were collected from the Nanjing Urban Planning Bureau in 2013. In order to identify the land use conditions and transportation features around each household location, we used the “buffer” function in the ArcGIS software (version 10.2) to obtain the built environment variables, which are described in Table 2. Land use entropy is utilized to represent the mixed land utilization within a 1-kilometer neighborhood. It can be calculated as

$$\mathrm{LandEntropy} = -\sum_{i} P_i \ln(P_i),$$

where $P_i$ is the proportion of the $i$th land category. There are five categories in our analysis: (i) residential, (ii) commerce and business, (iii) public services, (iv) education, and (v) entertainment. Transportation features refer to road network and transit network attributes. Road network density is measured by dividing the total length of arterial, collector and local streets by the buffer area size. Transit network attributes include the distance to the nearest metro station, the number of metro stations within one kilometer, bus network density within 500 m, the distance to the nearest bus stop, and the number of bus stops within 500 m. Travel information is also included as an explanatory variable. Travel time is derived from the centroids of each pair of traffic analysis zones. Travel purpose is divided into four categories: work, school, shopping and recreation.

We randomly divided the sample data into two subsets: a training dataset and a testing dataset. 80% of the total sample is used as training data to determine the optimal model parameters. The remaining 20% is used as testing data to evaluate the predictive accuracy. In addition, the continuous explanatory variables are standardized to have zero mean and unit standard deviation before the model specification process.

3. Methodology

This section discusses the random forest (RF) method used for modeling travel mode choice. It includes an overview of a single decision tree method followed by random forest procedures and how the variable importance is measured.
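Before turning to the individual methods, two of the data preparation steps described in Section 2 can be illustrated with a minimal sketch: the land use entropy $-\sum_i P_i \ln(P_i)$ and the standardization of a continuous variable. The function names, the example areas, and the use of the population standard deviation are assumptions made for illustration only; the paper does not describe its implementation.

```python
import math


def land_use_entropy(areas):
    """Return -sum(P_i * ln(P_i)), where P_i is each category's share of total area."""
    total = sum(areas)
    if total <= 0:
        raise ValueError("total land area must be positive")
    # Categories with zero area contribute nothing (0 * ln 0 is taken as 0).
    shares = [a / total for a in areas if a > 0]
    return -sum(p * math.log(p) for p in shares)


def standardize(values):
    """Rescale values to zero mean and unit standard deviation.

    Assumption: the population standard deviation (divide by n) is used.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / std for v in values]


# Hypothetical areas for the five land categories within one buffer.
print(land_use_entropy([40.0, 25.0, 15.0, 10.0, 10.0]))
# Five equal shares give the maximum possible value, ln(5).
print(land_use_entropy([1, 1, 1, 1, 1]))
# Hypothetical travel times in minutes.
print(standardize([10.0, 20.0, 30.0, 60.0]))
```

With five categories the entropy ranges from 0 (a single land use) to ln 5 (perfectly even mix), which gives some context for the LandEntropy mean reported in Table 2.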

Fig. 1. Map of Nanjing and the respondents’ household locations.

3.1. Single decision tree

A single decision tree model partitions the feature space into a set of mutually exclusive regions. Suppose there are $K$ observations, each consisting of $p$ inputs and one response variable, i.e., $(y_i, x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, p$. In terms of travel mode choice prediction, $y_i$ is the travel mode for trip $i$, and $(x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip})$ are the variables relevant to the predicted travel mode, such as socio-demographics, built environment and travel information in this study.

The classification tree recursively partitions trips into categories based on the input explanatory variables. Each divided sub-region thus contains fewer and fewer observations. The process continues until some stopping criterion is reached. Finally, the feature space is divided into $Q$ regions $\{R_1, R_2, \ldots, R_Q\}$. For each splitting variable, the best splitting point can be determined by scanning all possible values. By scanning all input variables, the best pair of splitting variable and splitting point can be found. A single decision tree is the base model of the random forest method.
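The exhaustive scan over splitting variables and splitting points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the split criterion used here (the size-weighted Gini impurity of the two child nodes, as in standard CART) is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Scan every variable and every candidate threshold.

    Returns (weighted child impurity, variable index, threshold) for the
    split with the lowest weighted impurity, or None if no split exists.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


# Hypothetical trips: [travel time in minutes, land use entropy].
rows = [[5, 1.0], [8, 0.9], [35, 1.1], [40, 0.8]]
labels = ["walk", "walk", "car", "car"]
print(best_split(rows, labels))  # -> (0.0, 0, 21.5)
```

In this toy example, splitting on the first variable at 21.5 minutes separates the two modes perfectly, so both child nodes have zero impurity. A full tree would apply the same scan recursively to each child region until a stopping criterion is met.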


Table 2
Explanatory variables for travel mode choice.

| Category | Variable | Coding description | Mean | SD^a |
| --- | --- | --- | --- | --- |
| Household attributes | Size | The number of persons in the household | 2.41 | 0.67 |
| | Income | Annual income in 1000 Yuan^c, discrete: [0, 50], (50, 100], (100, 150], (150, 200], (200, ∞) | NA^b | NA |
| | Child | Having children under 6 years old or not, dummy | 0.13 | 0.33 |
| | Car | Having cars or not, dummy | 0.58 | 0.49 |
| | Ecycle | Having E-motorcycles or not, dummy | 0.76 | 0.43 |
| | Bicycle | Having bicycles or not, dummy | 0.70 | 0.46 |
| Individual attributes | Gender | Male or female (1 = male, 0 = female) | 0.52 | 0.50 |
| | Age | Discrete, 5 types (20s, 30s, 40s, 50s, 60s) | NA | NA |
| | Education | Discrete, 3 types (middle school, high school, college) | NA | NA |
| | Transitcard | Having transit IC card or not, dummy | 0.81 | 0.39 |
| | License | Having driving license or not, dummy | 0.49 | 0.50 |
| Built environment | LandEntropy | Mixed land utilization (1 km neighborhood) | 0.96 | 0.26 |
| | RoadDens | Road network density (km/km², 1 km neighborhood) | 8.15 | 3.18 |
| | MetroDist | The distance to the nearest metro station (km) | 1.78 | 2.07 |
| | MetroNo | The number of metro stations (1 km neighborhood) | 0.81 | 1.02 |
| | BusDens | Bus network density (km/km², 500 m neighborhood) | 3.31 | 1.28 |
| | BusDist | The distance to the nearest bus stop (km) | 0.18 | 0.11 |
| | BusNo | The number of bus stops (500 m neighborhood) | 5.15 | 2.71 |
| Travel information | TravelTime | The derived travel time between TAZs (min) | 30.16 | 16.87 |
| | TripPurpose | Discrete, 4 types (work, school, shopping, recreation) | NA | NA |

^a SD: standard deviation.
^b NA: not applicable.
^c 1 Yuan = US$0.16 in 2013.

3.2. Random forest method

The random forest method combines bootstrapping and random feature selection. In bootstrapping, each single tree is grown on a different training sample, which is randomly drawn from the training dataset with replacement. Through sampling with replacement, some observations might appear more than once in the bootstrap sample, while others are left out; the latter are known as out-of-bag (OOB) observations (Breiman, 2001). To reduce the possible correlation between base trees, random feature selection is incorporated into the bootstrapping procedure. Instead of using all the explanatory variables, only a random subset of variables is considered at each splitting node.
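The two sources of randomness described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the sample size of 1000 are hypothetical, while the 20 explanatory variables and the subset size of 4 match the values used in this study.

```python
import random


def bootstrap_sample(n_obs, rng):
    """Draw n_obs indices with replacement.

    Returns (in_bag, oob): the drawn indices, which may repeat, and the
    sorted indices that were never drawn (the out-of-bag observations).
    """
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    oob = sorted(set(range(n_obs)) - set(in_bag))
    return in_bag, oob


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate variables for one splitting node."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_sample(1000, rng)
print(len(set(in_bag)), "distinct observations in the bootstrap sample")
print(len(oob), "out-of-bag observations")
print("candidate variables at this node:", random_feature_subset(20, 4, rng))
```

For a large sample, roughly 36.8% of the observations (a fraction of about 1/e) are expected to end up out-of-bag for any given tree, which is why the OOB observations can serve as a built-in validation set.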
RF generally employs two levels of randomness to ensure the diversity of single trees: (a) a different training data set with the same sample size, and (b) a different set of explanatory variables to split each node. As only a subset of samples and explanatory variables is used for base tree estimation, the prediction errors caused by biased samples and noisy data could be reduced. In this sense, RF has a strong technical component that picks up the variability in the data. RF prediction performance is mainly influenced by three aspects: (a) correlation between base trees: the correlation needs to be reduced; (b) the performance of each tree: the performance of each base tree needs to be strengthened; and (c) the total number of trees: the number needs to be large under computational efficiency constraints. More specifically, there are three parameters to be calibrated for model specification: the total number of trees $n$ (forest size), the number of splitting variables $m$, and the maximum tree depth $d$.

The illustration of the RF working process is shown in Fig. 2. Different observations are bootstrapped for each tree. Also, the candidate splitting variables in each tree are chosen by a random selection from the full set of explanatory variables. In each single tree, splitting is continued until the tree reaches the maximum depth. After estimating base tree models, the majority voting strategy is employed to compute the result. The final predicted outcome is the one that is predicted most across the ensemble.

Fig. 2. Illustration of the random forest method.

3.3. Relative importance of explanatory variables

The random forest method utilizes MeanDecreaseGini as a measure of variable importance based on the Gini impurity index used for the calculation of splits during training (Atkinson, 1970). Every time a split of a node is made on variable $m$, the Gini impurity criterion for the two descendent nodes is less than that of the parent node. In this study, we use the Gini impurity index to identify significant explanatory variables that contribute to travel mode choice. The decrease of the Gini impurity index at an internal tree node is calculated for the variable used to make the split. Then, the importance measure for a particular variable is obtained as the average decrease of the Gini impurity index over all trees in the forest. For a candidate splitting variable $X_i$ with possible categories $L_1, \ldots, L_J$, the Gini impurity index for this variable is calculated as:

$$G(X_i) = \sum_{j=1}^{J} P(X_i = L_j)\,\bigl(1 - P(X_i = L_j)\bigr) = 1 - \sum_{j=1}^{J} P(X_i = L_j)^2 \qquad (1)$$

where $G(X_i)$ denotes the Gini impurity index for the variable $X_i$, and $P(X_i = L_j)$ represents the estimated probability of the category $X_i = L_j$. Once Gini impurity indices are calculated for each candidate splitting variable, the split is conducted on the variable with the highest value.

4. Model specification

To optimize the model, it is critical to know the effects of different


parameters on the model’s performance. Based on this information, we can set the optimal parameters to achieve higher prediction accuracy. This section shows how the performance varies with different choices of parameters (forest size $n$, the number of splitting variables $m$, and the maximum tree depth $d$) by using the training data. All the experiments were carried out on an Intel Xeon E5620 2.4 GHz processor with 8 GB of RAM.

During the model specification process, it is not required to run a cross-validation procedure to measure the RF performance. The prediction performance is quantified by the out-of-bag (OOB) sample. The prediction error rate is applied to measure the model performance under different parameters. It is calculated by dividing the number of wrong predictions for all modes by the total quantity of data in the OOB dataset, as shown in Eqs. (2) and (3). For simplicity, this specification process was conducted in two steps: (a) determining the optimal forest size $n$; and (b) determining the optimal $m$ and $d$ values. First, we constructed forests with a size ranging from 5 to 500 trees in increments of 5 trees. The comparison results are shown in Fig. 3, where we observed that building more than 200 trees did not result in considerable additional performance, yet increased the runtime considerably. Thus, we settled on a forest size of 200 as a reasonable trade-off between execution time and prediction performance.

$$\mathrm{Error}_{\mathrm{OOB}} = \left(1 - \frac{1}{N} \sum_{i \in \mathrm{OOB}} \delta_i\right) \times 100\% \qquad (2)$$

$$\delta_i = \begin{cases} 1 & \text{predict correctly} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $\mathrm{Error}_{\mathrm{OOB}}$ is the prediction error rate using the OOB sample; $\delta_i$ is the correctness indicator variable; $N$ is the number of observations in the OOB sample.

Fig. 3. RF performance with the number of trees.

The suggested value for the number of splitting variables $m$ is $\log_2 p$ (Breiman, 2001), which is set to 4 in this study ($p = 20$ is the number of explanatory variables in the full set). In order to identify a good value for $m$, we considered $m = 2, 4, 6, 8$. For each $m$ value, we experimented with $d$ values ranging from 20 to 2000. For each combination of $m$ and $d$ considered, we used a forest size of 200. Experimental results are depicted in Fig. 4. They reveal that (a) 4 splitting variables exhibit the best performance and (b) error rates practically flatten out after a tree depth of 1300 regardless of the $m$ value. On the other hand, increasing the number of splitting variables might strengthen the correlations between single trees, thus reducing the diversity of base models. Therefore, in our implementation, we settled on $m = 4$ and $d = 1300$.

Now we take a look at the default values of these parameters in our case: $n' = 500$, $m' = 4$, and $d' > 5000$. Compared to this, the calibrated model with parameters $n = 200$, $m = 4$, and $d = 1300$, which is more robust, achieves better prediction performance while being computationally efficient at the same time.

5. Model results

5.1. Model interpretation

Using the travel diary data, we also explored the influences of explanatory variables on travel mode choice. The variable importance is calculated based on the calibrated model in Section 4. A higher value of relative importance indicates a stronger influence of the explanatory variable in predicting travel mode choice. It should be noted that, in contrast with the typical approach used in significance tests or sensitivity analysis (which alters the value of one explanatory variable at a time and then estimates the change of output), the RF measures the influence of each variable simultaneously, accounting for the possible interaction between variables.

To achieve stable estimates of variable importance, the model was trained using the whole training sample 100 times. The variable importance for travel mode choice is shown in Table 3. Each explanatory variable clearly has a different impact. Travel time plays the most important role in predicting travel mode choice. The built environment generally contributes more than household and individual attributes. Land use (LandEntropy) exerts more significant influences than transportation provisions. This indicates that the land use pattern is strongly related to how we travel, and is a key element shaping travel demands. As a result, changing land use patterns might be a significant way to shift travelers’ mode choice. Age is the most significant variable within individual attributes, while car ownership (Car) is the most significant variable within household attributes. The results show that behavioral heterogeneity is more apparent between different age groups, or between car-owner households and non-car-owner households. Therefore, formulating different transportation policies targeting each distinct segment should be a priority. For household attributes, car ownership is more important than household income in affecting mode choice. Transport policy thus might place priority on the benefits of non-car-owner households, e.g., by subsidizing public transport use. Similarly, for individual attributes a proposed policy might first consider the responses of different age cohorts, such as the differences between older people and young adults. Having children under 6 years old (Child) contributes the least to mode choice behavior. This is presumably because in most Chinese households, the retired grandparents, rather than the parents, are responsible for taking care of children.

5.2. Model comparison

In order to determine the methods that most accurately predict travel mode choice, a comparison is made between the random forest method, support vector machine (SVM), adaptive boosting (AdaBoost), and multinomial logit (MNL). These methods are selected for their superior predictive capabilities in existing studies (shown in the Appendix section). We also want to make a comparison with the commonly used MNL model to see the differences between machine learning methods and statistical regression models. The same training data and testing data are used to train these models and evaluate their performances.

The SVM model is developed using statistical learning theory and the structural risk minimization principle (Zhang and Xie, 2008). To introduce the classification function of the SVM model, a simple binary response separable classification problem is illustrated in Fig. 5. An SVM constructs a hyperplane in a high-dimensional space to achieve the largest separation between different classes. The hyperplane could be linear or non-linear depending on whether the input data are linear or nonlinear. In the nonlinear case, SVM uses a kernel function $\phi(x_i)$ to map the input data into a high-dimensional feature space, whereby the input data can be linearly separated. In our study, the most commonly


Fig. 4. RF performance with the number of variables and maximum depth.
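The calibration over the number of splitting variables m and the maximum depth d can be sketched as a small grid search. This is a minimal illustration, not the authors' code: `fit_and_score` is a hypothetical callable that would train a forest and return its error rate, `toy_error` is a made-up stand-in for it, and the tolerance rule (take the smallest depth whose error is within `tol` of the best) is an assumption used here to mimic the "error flattens out" reading of Fig. 4.

```python
from itertools import product


def calibrate(fit_and_score, m_values, d_values, n_trees=200, tol=0.005):
    """Pick m with the lowest error, then the smallest depth d whose error
    is within `tol` of the best error observed for that m.

    fit_and_score(n_trees, m, d) -> error rate (hypothetical, caller-supplied).
    Returns (best_m, best_d, errors), where errors maps (m, d) to the error.
    """
    errors = {(m, d): fit_and_score(n_trees, m, d)
              for m, d in product(m_values, d_values)}
    best_m, _ = min(errors, key=errors.get)
    floor = min(errors[(best_m, d)] for d in d_values)
    best_d = min(d for d in d_values if errors[(best_m, d)] <= floor + tol)
    return best_m, best_d, errors


def toy_error(n_trees, m, d):
    """Made-up error surface for illustration only; NOT the paper's results."""
    return 0.15 + 0.01 * abs(m - 4) + 10.0 / d


best_m, best_d, errors = calibrate(
    toy_error, m_values=[2, 4, 6, 8], d_values=[20, 100, 500, 1300, 2000]
)
print(best_m, best_d)
```

In practice `toy_error` would be replaced by a function that fits a forest of `n_trees` trees on the training data and scores it on held-out data; stopping at the smallest adequate depth keeps runtime down without a visible loss in accuracy.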

Table 3
Variable importance of the mode choice.

| | Variable | Rank | Relative importance |
|---|---|---|---|
| Travel information | TravelTime | 1 | 10.1% |
| | TripPurpose | 5 | 8.2% |
| Built environment | LandEntropy | 2 | 9.6% |
| | RoadDens | 3 | 9.2% |
| | MetroDist | 4 | 8.8% |
| | MetroNo | 17 | 1.8% |
| | BusDens | 7 | 7.8% |
| | BusDist | 6 | 8.0% |
| | BusNo | 9 | 4.8% |
| Household attributes | Size | 16 | 2.0% |
| | Income | 12 | 3.4% |
| | Child | 20 | 0.8% |
| | Car | 11 | 3.7% |
| | Bicycle | 18 | 1.5% |
| | Ecycle | 14 | 2.5% |
| Individual attributes | Gender | 13 | 2.9% |
| | Transitcard | 19 | 1.2% |
| | Age | 8 | 7.5% |
| | License | 10 | 3.9% |
| | Education | 15 | 2.3% |

used radial basis function (RBF) is utilized as the kernel function for travel mode choice prediction. The RBF is defined as $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, where $\gamma$ is the kernel parameter, and $x_i$, $x_j$ are input data. The objective of SVM is to find an optimal hyperplane where the margin between the linear decision boundaries (depicted as dashed lines in Fig. 5) can be maximized.

Adaptive boosting uses the training data to develop weak classifiers repeatedly (Shafique and Hato, 2015). Beginning with equal weights for all the observations, the weights for misclassified observations will be increased after each round. Therefore, each newly created classifier places emphasis on the observations that have been misclassified by the previous classifier, as shown in Fig. 6. In our example, the initial weights are $w_{i1} = 1$ for all observations and a set of 1000 classifiers was specified.

The MNL model, a statistical regression analysis, gives the choice probabilities of each alternative as a function of the systematic portion of the utility of all the alternatives. The model in this study is estimated based on the maximum likelihood method. Model fit statistics are reported as follows: log-likelihood at convergence = −7244.2; goodness of fit $\rho^2$ = 0.2866.

For performance comparisons, we use three indicators, accuracy, mean absolute percentage error (MAPE) and runtime, to show the prediction capability of each method. The accuracy and MAPE are calculated based on the confusion matrices. Runtime is the execution time consumed when implementing these approaches in R software (version 3.3.2).

$$\text{Accuracy} = \frac{\sum_k CP_k}{N} \times 100\% \tag{4}$$

$$\text{MAPE} = \left( \sum_k \frac{N_k - CP_k}{N_k} \right) \Big/ K \times 100\% \tag{5}$$

Fig. 5. Representation of the SVM model.
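The RBF kernel used by the SVM, K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), can be evaluated directly. The sketch below is only an illustration of that formula; the function name, the example vectors and the value of γ are assumptions and do not come from the study.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for equal-length vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


# Illustrative inputs only (not from the Nanjing data).
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical points
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)   # squared distance 25
print(same, far)
```

The kernel equals 1 for identical observations and decays toward 0 as they move apart; a larger γ makes that decay faster, so each observation influences a smaller neighbourhood.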


Fig. 6. Representation of the adaptive boosting.
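The reweighting idea behind adaptive boosting can be sketched for a single round. The paper does not state which AdaBoost variant was used, so the update below is an assumption: a common AdaBoost.M1-style rule in which misclassified observations are multiplied by (1 − err)/err and the weights are then normalised. The function name and the toy inputs are hypothetical.

```python
import math


def reweight(weights, misclassified):
    """One boosting round: raise the weights of misclassified observations.

    weights: current positive observation weights.
    misclassified: booleans, True where the current weak classifier was wrong.
    Returns (new_weights, alpha); new_weights are normalised to sum to 1.
    """
    total = sum(weights)
    err = sum(w for w, bad in zip(weights, misclassified) if bad) / total
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = math.log((1.0 - err) / err)  # weight given to this classifier
    raised = [w * math.exp(alpha) if bad else w
              for w, bad in zip(weights, misclassified)]
    norm = sum(raised)
    return [w / norm for w in raised], alpha


# Five observations with equal initial weights; the third one is misclassified.
weights = [1.0] * 5
new_weights, alpha = reweight(weights, [False, False, True, False, False])
print(new_weights)
```

In this toy round the single misclassified observation ends up carrying half of the total weight, which is why observations with inconsistent labels can come to dominate later rounds.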

In Eqs. (4) and (5), $CP_k$ is the number of correctly predicted observations for the $k$th mode; $N_k$ is the number of observations for the $k$th mode in the testing data; and $N$ is the total number of observations in the testing data for all modes, $N = \sum_k N_k$.
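As a worked illustration of Eqs. (4) and (5), the two indicators can be computed from per-mode counts. The counts below are hypothetical and are not taken from the paper's confusion matrices; the sketch also assumes that K in Eq. (5) denotes the number of modes, which the text does not state explicitly.

```python
def accuracy(correct, totals):
    """Eq. (4): share of all test observations predicted correctly, in percent."""
    return sum(correct) / sum(totals) * 100.0


def mape(correct, totals):
    """Eq. (5): mean over modes of (N_k - CP_k) / N_k, in percent.

    K is taken to be the number of modes (an assumption).
    """
    k = len(totals)
    return sum((n - cp) / n for cp, n in zip(correct, totals)) / k * 100.0


# Hypothetical counts for three modes (illustration only).
totals = [50, 30, 20]    # N_k: observations per mode in the testing data
correct = [40, 24, 18]   # CP_k: correctly predicted observations per mode
acc = accuracy(correct, totals)
err = mape(correct, totals)
print(acc, err)
```

With these counts, 82 of 100 observations are correct (82% accuracy), while the per-mode error shares 20%, 20% and 10% average to about 16.7%. Accuracy weights every observation equally, whereas MAPE weights every mode equally, so a method that neglects a small mode is penalised more under MAPE.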
Table 4 shows the overall prediction accuracies, while Fig. 7 gives the comparison of MAPE and runtime. RF and SVM perform best, with high accuracy and low MAPE. The next best performer is the MNL model, followed by AdaBoost. A careful examination of the testing dataset reveals that many noisy data exist. More specifically, 135 (out of a total of 1455) observations share the same explanatory variables but show different choice results. AdaBoost would increase the weights of noisy observations and thus yield inaccurate results. In contrast, RF is less sensitive to noisy data as resampling is not based on weighting. Furthermore, it is also computationally lighter than methods based on boosting. Compared to the MNL model, RF is able to handle complex interactions among explanatory variables and leads to superior performance. Looking at the more detailed predicted market shares of each category, it can also be seen that RF and SVM perform best in all categories, with similar accuracy for the two models. However, SVM takes a much longer runtime (almost triple that of RF) even though it has good prediction ability. Therefore, we conclude that RF is a promising method to improve prediction accuracy while remaining computationally efficient.

Fig. 7. Comparison of MAPE of travel mode choice and model runtime.

Table 4
Comparison of the prediction ability.

| | | Actual share | RF | AdaBoost | SVM | MNL |
|---|---|---|---|---|---|---|
| Predicted accuracy | Overall | | 85.36% | 62.68% | 83.44% | 63.02% |
| | Walk | / | 80.29% | 45.52% | 78.85% | 46.59% |
| | Bicycle | / | 85.64% | 48.51% | 83.66% | 54.95% |
| | E-motorcycle | / | 87.19% | 72.70% | 84.40% | 65.74% |
| | PT | / | 85.71% | 48.50% | 84.96% | 52.63% |
| | Automobile | / | 87.11% | 85.10% | 84.81% | 85.96% |
| Market share (predicted share by method) | Walk | 18.9% | 19.38% | 11.82% | 19.38% | 14.85% |
| | Bicycle | 14.5% | 14.85% | 10.03% | 14.64% | 12.58% |
| | E-motorcycle | 25.8% | 24.33% | 30.58% | 24.26% | 26.94% |
| | PT | 17.4% | 17.94% | 11.27% | 17.87% | 14.91% |
| | Automobile | 23.4% | 23.51% | 36.29% | 23.85% | 30.72% |

In order to shed light on the credibility of the RF model results, we additionally report the significance of the explanatory variables estimated from the MNL model. The results are shown in Table 5. It should be noted that the variable importance presents a similar pattern in the estimation results of the RF model and the MNL model. Travel time plays a significant role in predicting travel mode choice at the 99% confidence level. Also, land use mixture contributes a great deal. As for the least important influencing factors, “Having transit IC card” and “Having children under 6 years old” have been identified, which is revealed by both the RF model and the MNL model.

6. Conclusions

The random forest method is more effective in addressing the variability in data and reducing prediction variance compared to single decision trees through majority voting. It enhances diversity by randomly selecting different training samples and different variables at each splitting node for every single tree within the ensemble. To the best of our knowledge, little attention has been given to RF applications for travel mode choice prediction in the transportation field. Based on the travel data and built environment information collected in Nanjing, China, this paper proposes a robust random forest method to analyze and predict travel mode choices.

An important issue for the application of RF is related to model specification or parameter optimization. As discussed in the model specification section, the RF performance is significantly influenced by its parameters, including the number of trees in the forest, the number of splitting variables, and the maximum tree depth. Therefore, it is necessary to set the optimal parameters when developing the RF model. Computational time is another essential aspect when the number of trees or maximum tree depth increases. The trade-off between model accuracy and computational cost should be considered.

Among the four prediction methods that have been analyzed in this paper, it was found that RF and SVM are the best models in prediction


Table 5
Comparison of the variable importance.

| Variable | RF rank | RF importance | MNL: Bicycle | MNL: E-motorcycle | MNL: PT | MNL: Automobile |
|---|---|---|---|---|---|---|
| TravelTime | 1 | 10.1% | −0.508*** (single coefficient reported) | | | |
| LandEntropy | 2 | 9.6% | −0.103*** | −0.195** | −0.126*** | −0.295*** |
| RoadDens | 3 | 9.2% | −0.179** | −0.068** | −0.340*** | 0.136** |
| MetroDist | 4 | 8.8% | 0.053*** | 0.155** | −0.105*** | 0.221* |
| TripPurpose | 5 | 8.2% | | | | |
| School | | | 0.975* | 0.631** | 0.233** | 0.619** |
| Shopping | | | 0.348*** | 0.215** | 0.343* | 0.557** |
| Recreation | | | −0.594*** | −0.566 | 0.126* | −0.439** |
| BusDist | 6 | 8.0% | 0.132 | 0.113*** | −0.242** | 0.027* |
| BusDens | 7 | 7.8% | −0.007 | −0.152* | 0.218*** | −0.139* |
| Age | 8 | 7.5% | −0.039 | 0.220** | 0.142*** | 0.146* |
| BusNo | 9 | 4.8% | −0.233 | −0.066** | 0.187** | −0.010 |
| License (No = ref.) | 10 | 3.9% | −0.272 | 0.158* | 0.079 | 0.786*** |
| Car (No = ref.) | 11 | 3.7% | −0.377 | −0.373* | −0.519** | 1.507*** |
| Gender (Female = ref.) | 13 | 2.9% | 0.347** | −0.106 | −0.193* | 0.098** |
| Ecycle (No = ref.) | 14 | 2.5% | −0.224 | 1.624*** | −0.265* | −0.067 |
| Education | 15 | 2.3% | 0.295* | 0.491 | 0.084* | 0.601* |
| Size | 16 | 2.0% | −0.244* | −0.368 | −0.141 | 0.021** |
| MetroNo | 17 | 1.8% | 0.195 | −0.207 | 0.192** | −0.102 |
| Bicycle (No = ref.) | 18 | 1.5% | 1.135** | −0.315 | −0.390 | −0.370 |
| Transitcard (No = ref.) | 19 | 1.2% | 0.232 | 0.309 | 0.897 | 0.063 |
| Child (No = ref.) | 20 | 0.8% | 0.197 | 0.163 | 0.227 | 0.173 |

Notes: In the MNL model, walking is set as the reference category. * p < 0.10, ** p < 0.05, *** p < 0.01.

accuracy. However, RF is more computationally efficient than SVM because it takes less execution time to train the model. In addition, unlike other machine learning methods that work like a “black box”, the RF method can determine the relative importance of explanatory variables. This is critical for providing important insights into formulating effective and appropriate transport policies. Among the explanatory variables, the built environment generally contributes more than household and individual attributes. Land use mixture, car ownership and age are the most significant variables within built environment attributes, household attributes and individual attributes, respectively. This indicates that land use strategy is more related to how we travel and is key to shifting travel demands. Also, when evaluating the effects of a proposed transport policy, special attention should be paid to the responses of car-owner and non-car-owner households and different age cohorts.

Generally, the RF method has clear advantages in travel mode choice prediction. In particular, recent advances in technology enable collecting different sets of travel data from WiFi scanners, smartphones, GPS devices and so on. As more travel information becomes accessible, it is important to develop a robust model that is able to make good use of big data so as to capture the complex relationships between different datasets from different sources. The capability of RF in handling different types of variables and modeling complex non-linear relationships makes it a promising method for travel behavior analysis in the era of big data.

It should be noted that the random forest method in this study is a kind of machine learning algorithm. Despite the many benefits, the behavioral interpretation of the machine learning method is still difficult due to the complexity and number of parameters in this system. However, machine learning research seems to be developing rapidly at the intersection of information science and statistical analysis in order to discover patterns in complex datasets. Moreover, the emphasis on applications and theoretical investigations in today’s massive data-driven world has made improvements of analytical techniques with machine learning very relevant, although statistical and probability theory, together with econometric analysis, will remain the cornerstone of travel behavior analysis.

Further studies can model the mode choice behavior for different activities and examine the correlation between mode choice and activity participation with the use of the RF method. On the other hand, the RF concept and logit model framework can be combined to develop an efficient hybrid model so that we can obtain better model results for interpreting complex human activity-travel behavior, such as travel choice probability and demand elasticity.

Acknowledgments

This research is sponsored by the Research Grants Council of the Hong Kong Special Administrative Region (PolyU 152095/17E), the Research Committee of The Hong Kong Polytechnic University (Project No. 4-ZZFY), and the National Natural Science Foundation of China (71801041, 51338003, 71601052 and 71771049). The authors appreciate the Nanjing Institute of City and Transport Planning for providing the data used in this study, and helpful suggestions from Prof. William H.K. Lam and Prof. Sheng-Guo Wang on this research. We also would like to thank the anonymous referees whose comments on the earlier version led to significant improvements.

Appendix

Summary of the recent studies applying random forest method.


| Category | Studies | Dependent variables | Explanatory variables | Methods compared | Best methods |
|---|---|---|---|---|---|
| Travel behavior choice | Elhenawy et al. (2014) | Behavior on yellow light indication | Time/speed, aggressiveness predictor | RF, K-nearest neighbors (KNN), adaptive boosting (AdaBoost), logistic | RF, AdaBoost |
| | Rasouli and Timmermans (2014) | Travel mode choice | Socio-demographics | /a | / |
| | Ermagun et al. (2015) | Travel mode choice | Socioeconomics, trip information | RF, Nested logit | RF |
| | This paper | Travel mode choice | Socio-demographics, trip information, built environment | RF, MNL, AdaBoost, SVM | RF |
| Traffic incident prediction | Brown (2016) | Rail accidents severity | Topics in the report narratives | RF, gradient boosting (GB) | RF |
| Travel time or flow prediction | Rebollo and Balakrishnan (2014) | Departure delay | Temporal and spatial information | / | / |
| | Zhang and Haghani (2015) | Travel time | Recent travel time, growth rate, temporal information | RF, ARIMA, GB | GB |
| | Semanjski (2015) | Travel time | Spatiotemporal data, infrastructural data, meteorological data | RF, KNN, SVM, boosting trees | RF, KNN |
| | Hou et al. (2015) | Traffic flow | Temporal information, number of lanes, work zone characteristics, speed limit, direction | RF, regression tree, multilayer feedforward neural network | RF |
| Pattern recognition | Zaklouta and Stanciulescu (2012) | Traffic sign | Histogram-of-oriented-gradient descriptors, distance transforms of images | RF, K-D trees, SVM | RF |
| | Zhao et al. (2012) | Driving posture | Images of driving postures | RF, linear perceptron classifier, KNN, multilayer perceptron (MLP) | RF |
| | Zhang (2013) | Vehicle type | Vehicle images | RF, KNN, MLP, SVM | MLP |
| | Montini et al. (2014) | Trip purpose | Socio-demographics, activity information, location features | / | / |
| | Shafique and Hato (2015) | Travel mode | Accelerations, directions | RF, SVM, AdaBoost, decision tree (DT) | RF |
| | Jahangiri and Rakha (2015) | Travel mode | Speed, acceleration | RF, DT, KNN, SVM | RF, SVM |
| | Yang et al. (2016) | Travel mode | GPS trajectories, GIS information | RF, artificial neural network (ANN), SVM, Bayesian network | SVM |
| | Kamkar and Safabakhsh (2016) | Vehicle type | Vehicle images | / | / |
| | Wang et al. (2016) | Drowsy behavior | Steering angle, acceleration | RF, ANN | RF |
| | Gong et al. (2018) | Trip purpose and travel mode | Longitudinal GPS data | / | / |

a “/” indicates not applicable.

References

Atkinson, A.B., 1970. On the measurement of inequality. J. Econ. Theory 2 (3), 244–263.
Bhat, C., Lockwood, A., 2004. On distinguishing between physically active and physically passive episodes and between travel and activity episodes: an analysis of weekend recreational participation in the San Francisco Bay area. Transp. Res. Part A: Policy Pract. 38 (8), 573–592.
Bhat, C., Srinivasan, S., 2005. A multidimensional mixed ordered-response model for analyzing weekend activity participation. Transp. Res. Part B: Methodol. 39 (3), 255–278.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Brown, D.E., 2016. Text mining: the contributors to rail accidents. IEEE Trans. Intell. Transp. Syst. 17 (2), 346–355.
Cao, X., Handy, S.L., Mokhtarian, P.L., 2006. The influences of the built environment and residential self-selection on pedestrian behavior: evidence from Austin TX. Transportation 33 (1), 1–20.
Cervero, R., 2002. Built environments and mode choice: toward a normative framework. Transp. Res. Part D: Transp. Environ. 7 (4), 265–284.
Cheng, L., Chen, X., Yang, S., 2016. An exploration of the relationships between socioeconomics, land use and daily trip chain pattern among low-income residents. Transp. Plann. Technol. 39 (4), 358–369.
Cheng, L., Chen, X., Yang, S., Wu, J., Yang, M., 2017. Structural equation models to analyze activity participation, trip generation, and mode choice of low-income commuters. Transp. Lett. 1–9.
De Dios Ortúzar, J., Willumsen, L.G., 1999. Modeling transport, second ed. John Wiley & Sons Inc, New York.
De Vos, J., Mokhtarian, P.L., Schwanen, T., Van Acker, V., Witlox, F., 2016. Travel mode choice and travel satisfaction: bridging the gap between decision utility and experienced utility. Transportation 43 (5), 771–796.
Ding, C., Wang, D., Liu, C., Zhang, Y., Yang, J., 2017. Exploring the influence of built environment on travel mode choice considering the mediating effects of car ownership and travel distance. Transp. Res. Part A: Policy Pract. 100, 65–80.
Ding, C., Wu, X.K., Yu, G.Z., Wang, Y.P., 2016. A gradient boosting logit model to investigate driver’s stop-or-run behavior at signalized intersections using high-resolution traffic data. Transp. Res. Part C: Emerg. Technol. 72, 225–238.
Elhenawy, M., Rakha, H.A., El-Shawarby, I., 2014. Enhanced modeling of driver stop-or-run actions at a yellow indication: use of historical behavior and machine learning methods. Transp. Res. Rec. 2426, 24–34.
Ermagun, A., Rashidi, T.H., Lari, Z.A., 2015. Mode choice for school trips: long-term planning and impact of modal specification on policy assessments. Transp. Res. Rec. 2513, 97–105.
Golshani, N., Shabanpour, R., Mahmoudifard, S.M., Derrible, S., Mohammadian, A., 2018. Modeling travel mode and timing decisions: comparison of artificial neural networks and copula-based joint model. Travel Behav. Soc. 10, 21–32.
Gong, L., Kanamori, R., Yamamoto, T., 2018. Data selection in machine learning for identifying trip purposes and travel modes from longitudinal GPS data collection lasting for seasons. Travel Behav. Soc. 11, 131–140.
Heinen, E., Maat, K., Van Wee, B., 2011. The role of attitudes toward characteristics of bicycle commuting on the choice to cycle to work over various distances. Transp. Res. Part D: Transp. Environ. 16 (2), 102–109.
Hou, Y., Edara, P., Sun, C., 2015. Traffic flow forecasting for urban work zones. IEEE Trans. Intell. Transp. Syst. 16 (4), 1761–1770.
Jahangiri, A., Rakha, H.A., 2015. Applying machine learning techniques to transportation mode recognition using mobile phone sensor data. IEEE Trans. Intell. Transp. Syst. 16 (5), 2406–2417.
Johansson, M.V., Heldt, T., Johansson, P., 2006. The effects of attitudes and personality traits on mode choice. Transp. Res. Part A: Policy Pract. 40 (6), 507–525.
Kamkar, S., Safabakhsh, R., 2016. Vehicle detection, counting and classification in various conditions. IET Intel. Transp. Syst. 10 (6), 406–413.
Li, H., Raeside, R., Chen, T., McQuaid, R.W., 2012. Population ageing, gender and the transportation system. Res. Transp. Econ. 34 (1), 39–47.
Lindner, A., Pitombo, C.S., Cunha, A.L., 2017. Estimating motorized travel mode choice using classifiers: an application for high-dimensional multicollinear data. Travel Behav. Soc. 6, 100–109.
Montini, L., Rieser-Schussler, N., Horni, A., Axhausen, K.W., 2014. Trip purpose identification from GPS tracks. Transp. Res. Rec. 2405, 16–23.
Plaut, P.O., 2005. Non-motorized commuting in the US. Transp. Res. Part D: Transp. Environ. 10 (5), 347–356.
Rasouli, S., Timmermans, H.J.P., 2012. Uncertainty in travel demand forecasting models: literature review and research agenda. Transp. Lett. 4, 55–73.
Rasouli, S., Timmermans, H.J.P., 2014. Using ensembles of decision trees to predict transport mode choice decisions: effects on predictive success and uncertainty estimates. Eur. J. Transp. Infrastruct. Res. 14, 412–424.
Rebollo, J.J., Balakrishnan, H., 2014. Characterization and prediction of air traffic delays. Transp. Res. Part C: Emerg. Technol. 44, 225–238.
Ryley, T., 2006. Use of non-motorised modes and life stage in Edinburgh. J. Transp. Geogr. 14 (5), 367–375.
Scheiner, J., Holz-Rau, C., 2007. Travel mode choice: affected by objective or subjective determinants? Transportation 34 (4), 487–511.
Schwanen, T., Mokhtarian, P.L., 2005. What affects commute mode choice: neighborhood physical structure or preferences toward neighborhoods? J. Transp. Geogr. 13 (1), 83–99.
Semanjski, I., 2015. Potential of big data in forecasting travel times. Promet-Traffic Transp. 27 (6), 515–528.
Semanjski, I., Gautama, S., Ahas, R., Witlox, F., 2017. Spatial context mining approach for transport mode recognition from mobile sensed big data. Comput. Environ. Urban Syst. 66, 38–52.
Shafique, M.A., Hato, E., 2015. Use of acceleration data for transportation mode prediction. Transportation 42 (1), 163–188.
Van Acker, V., Witlox, F., 2011. Commuting trips within tours: How is commuting related to land use? Transportation 38 (3), 465–486.
Van den Berg, P., Arentze, T., Timmermans, H., 2011. Estimating social travel demand of senior citizens in the Netherlands. J. Transp. Geogr. 19 (2), 323–331.
Wang, M.S., Jeong, N.T., Kim, K.S., et al., 2016. Drowsy behavior detection based on driving information. Int. J. Automot. Technol. 17 (1), 165–173.
Yang, F., Yao, Z.X., Cheng, Y., Ran, B., Yang, D., 2016. Multimode trip information detection using personal trajectory data. J. Intell. Transp. Syst. 20 (5), 449–460.
Zaklouta, F., Stanciulescu, B., 2012. Real-time traffic-sign recognition using tree classifiers. IEEE Trans. Intell. Transp. Syst. 13 (4), 1507–1514.
Zhao, C.H., Zhang, B.L., He, J., Lian, J., 2012. Recognition of driving postures by contourlet transform and random forests. IET Intel. Transp. Syst. 6 (2), 161–168.
Zhang, Y.L., Xie, Y.C., 2008. Travel mode choice modeling with support vector machine. Transp. Res. Rec. 2076, 141–150.
Zhang, B.L., 2013. Reliable classification of vehicle types based on cascade classifier ensembles. IEEE Trans. Intell. Transp. Syst. 14 (1), 322–332.
Zhang, Y.R., Haghani, A., 2015. A gradient boosting method to improve travel time prediction. Transp. Res. Part C: Emerg. Technol. 58, 308–324.
