Optimizing Decision Tree Learning Pipelines in Distributed Environments

Nikhil Buduma (nkbuduma@mit.edu), Firas Abuzaid (fabuzaid@mit.edu), Matei Zaharia (matei@mit.edu)

Abstract
We propose Yggdrasil and iTree, new frameworks for computing distributed decision trees over
large-scale datasets across multiple nodes. While decision trees are popular and effective in many
machine learning contexts, little has been done to improve these models in distributed environments
and in the context of real-world workflows. Implemented as a module in Apache Spark, Yggdrasil uses
a novel partitioning strategy that we believe offers significant speedups over existing approaches
in the literature. Moreover, although state-of-the-art learning algorithms are designed to function in
a one-shot fashion, model building is inherently an iterative process. To accelerate decision tree learning
pipelines, iTree is a system, developed on top of the popular single-node machine learning library
scikit-learn and our distributed tree learning Apache Spark module Yggdrasil, that augments
the decision tree data structure to efficiently add and remove features, add new training data, and modify
regularization parameters. Taken together, these contributions have the potential to vastly accelerate
learning pipelines in distributed environments.

I. PROBLEM STATEMENT
Decision tree learning is one of the most practical models for inductive inference. In addition to being
highly expressive, decision trees are also capable of providing powerful insight into why the model makes
certain classifications. This combination of characteristics has made decision tree learning one of the most
valuable tools in the data scientist's arsenal.
However, as dataset sizes increase, the state of the art in decision tree learning faces a number of
critical bottlenecks. Classically, decision tree algorithms have been designed to train efficiently on
single-node systems in a one-shot fashion. In other words, decision trees learn in a timely manner only when
all data lives in main memory (and perhaps to some extent, on local disk), and every time a data scientist
wishes to modify an existing model, she must restart from scratch. However, data in real-world systems
is now commonly spread out across several servers that may span multiple datacenters. Moreover, data
science is inherently an iterative process, where new models are crafted in response to the performance
of previous model versions. This means that the data scientist is severely handicapped in the context of
modern data ecosystems and workflows.
While some work has been done to tackle these challenges, we believe there is significant room for
improvement. Our goal, therefore, is two-fold. First, we are looking to improve on existing systems that
train decision trees in a distributed environment. One of the major mechanisms for improving the performance
of distributed learning algorithms is to optimize the data partitioning strategy (i.e., how to split the
dataset across the hard drives and memory stores of separate machines). Specifically, we are exploring the
advantages of column-based vs. row-based data partitioning strategies on the iteration times for decision
tree learning. In a later section, we discuss these strategies and why we believe column-based partitioning
reduces communication overhead.
Second, we explore mechanisms to incrementalize decision tree building so that modifications to
previously trained models do not require recreating the tree from scratch. For example, in addition to
the traditional model.fit(X, Y) and model.predict(X) API methods for using decision trees,
we introduce a model.increment(add_fset, rm_fset) method, which creates a new model
based on a previously trained one with the specified features added to and removed from consideration.
In a later section, we describe our approach to augmenting the internal representation of the classical
decision tree to efficiently perform this operation. We also explore modifying regularization parameters
and adding additional training data, though we omit the details of these strategies from this proposal
for the sake of simplicity.
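To make the intended interface concrete, the sketch below shows the shape of the API we have in mind. The ITreeRegressor name, the constructor arguments, and the feature_subset parameter are placeholders of our own for illustration; this is not an existing scikit-learn estimator.

# Sketch of the proposed iTree interface; names and arguments are
# illustrative placeholders, not a released API.
class ITreeRegressor:
    """Decision tree regressor whose fitted model can be updated in place."""

    def __init__(self, max_depth=8, min_samples_leaf=10, feature_subset=None):
        # Hyperparameter names mirror scikit-learn conventions; feature_subset
        # (an assumption of this sketch) restricts the attributes considered.
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.feature_subset = feature_subset

    def fit(self, X, Y):
        """One-shot training, analogous to scikit-learn's fit()."""
        raise NotImplementedError

    def predict(self, X):
        """Route each row of X to a leaf and return that leaf's prediction."""
        raise NotImplementedError

    def increment(self, add_fset, rm_fset):
        """Derive an updated model from the fitted one: features in add_fset
        enter consideration, features in rm_fset leave it, and only subtrees
        whose best split changes are rebuilt."""
        raise NotImplementedError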
Overall, our goal is to understand the details and trade-offs involved in making these optimizations
and to marry them into a cohesive distributed decision tree learning system that significantly outperforms
the state-of-the-art.
II. OVERVIEW OF DECISION TREE MODELS
A. Notation and Setup
Let $X = \{X_1, X_2, \dots, X_N\}$ be a set of attributes with respective domains $D_1, D_2, \dots, D_N$. Let $Y$
represent the universe of legal labels over the domain $D_Y$. A legal dataset $\mathcal{D}$ is of the form
$\{(x_i, y_i) \mid x_i \in D_1 \times D_2 \times \dots \times D_N,\ y_i \in D_Y\}$, where the $i$th data vector $x_i$ has the label $y_i$. We can consider this dataset
as a sampling from an unknown, underlying distribution that we hope to discover. We desire to glean a
predictor $F : D_1 \times D_2 \times \dots \times D_N \to D_Y$ that best approximates the underlying distribution $\mathbb{D}$ from which $\mathcal{D}$ was drawn.
Such a predictor can then be used to estimate the labels of unobserved data points in $D_1 \times D_2 \times \dots \times D_N$.

To quantify how well an arbitrary predictor $F$ performs, consider a loss function $L$. We desire to
minimize the value of $\sum_{(x_i, y_i) \in \mathcal{D}_t} L(F(x_i), y_i)$ for an arbitrary new sample $\mathcal{D}_t$ drawn from $\mathbb{D}$. We can
proxy this rather effectively by minimizing $\sum_{(x_i, y_i) \in \mathcal{D}} L(F(x_i), y_i)$ over the training data, provided we take
appropriate measures to prevent overfitting.
B. Model Structure

Fig. 1. An example decision tree. Labels on the nodes are the split criteria, and the labels on the edges correspond to the sizes
of the dataset on each branch.

Figure 1 shows an example decision tree. The decision tree recursively partitions the $D_1 \times D_2 \times \dots \times D_N$
data space into non-overlapping regions. At each node we draw a decision boundary of the form $X_i < v$
where $v \in D_i$ (if the attribute is ordered) or of the form $X_i \in \{v_1, \dots, v_k\}$ where $v_1, \dots, v_k \in D_i$ (if the
attribute is unordered). Regions are defined by paths from the root node to a leaf node. Each leaf node
contains a region prediction, which is usually a constant value or a simple function. To evaluate a
prediction for an unseen data point $x_i$, we traverse the decision tree to reach the region that contains $x_i$;
the prediction given by that leaf is used as the value of $F(x_i)$.
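As a minimal illustration, the traversal can be written in a few lines of Python; the Node layout below (constant leaf predictions, ordered splits of the form $X_i < v$) is an assumption made only for this sketch.

# Minimal prediction sketch for ordered splits of the form X_i < v.
# The Node layout is an assumption made for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None        # index i of the split attribute X_i
    threshold: Optional[float] = None    # split value v
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None   # set only on leaf nodes

def predict_one(node: Node, x) -> float:
    # Walk from the root to the leaf whose region contains x.
    while node.prediction is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.prediction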

C. Learning Decision Trees


Algorithm 1 SingleNodeDecisionTree
Require: Node n, Data D ⊆ 𝒟
 1: (n → split), D_L, D_R ← FindBestSplit(D)
 2: if StoppingCriteria(D_L) then
 3:     (n → leftPrediction) ← FindPrediction(D_L)
 4: else
 5:     SingleNodeDecisionTree(n → left, D_L)
 6: end if
 7: if StoppingCriteria(D_R) then
 8:     (n → rightPrediction) ← FindPrediction(D_R)
 9: else
10:     SingleNodeDecisionTree(n → right, D_R)
11: end if

Finding the optimal decision tree given a dataset $D$ is NP-hard [1]. Consequently, most algorithms
use a greedy top-down strategy to construct the tree: at each node, we recursively attempt to find the
best split criterion (i.e., decision boundary). The algorithmic sketch for a single node is shown in Algorithm 1.
We further this description by focusing on regression trees, a special case of decision trees in which
the output label has a continuous domain [2]. This is an extremely common use case in modern machine
learning. The operations in Algorithm 1, as employed in a regression tree, are implemented as follows.
FindBestSplit(D): Finding the best split criterion for a node is the most important
step in the decision tree learning algorithm. The basic idea is to reduce the impurity of the data as it
passes through a node. Intuitively, the impurity is a measure of the dissimilarity of the $Y$ labels in a dataset,
and can be quantified by measures based on the variance or the entropy of the dataset. The general strategy
is to greedily pick the criterion that maximizes $I(D) - (I(D_L) + I(D_R))$, where $D_L$ and $D_R$ are the
datasets after partitioning $D$ based on the node's partitioning criterion.
For ordered domains, criteria are of the form $X_i < v$ where $v \in D_i$. To find the best split, we sort $D$
and consider a split point between each adjacent pair of values in the sorted list.
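The scan for a single ordered feature under variance impurity can be sketched as follows; we deliberately recompute the child variances at every candidate for clarity (an efficient implementation would maintain running sums), and the size-weighted variance reduction used here is one common instantiation of the criterion above.

# Sketch of FindBestSplit for one ordered feature with variance impurity.
# Child impurities are size-weighted; variances are recomputed at each
# candidate purely for clarity.
import numpy as np

def best_split_ordered(x: np.ndarray, y: np.ndarray):
    """Return (threshold, impurity_reduction) for a single ordered feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    parent_impurity = n * y.var()        # total squared error of the parent

    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue                     # no boundary between equal values
        left, right = y[:i], y[i:]
        child_impurity = i * left.var() + (n - i) * right.var()
        gain = parent_impurity - child_impurity
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x[i - 1] + x[i])
    return best_threshold, best_gain

For instance, best_split_ordered(X[:, 0], y) returns the midpoint threshold and the impurity reduction for the first feature; FindBestSplit then simply takes the maximum reduction over all features.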

For unordered domains, splits are of the form $X_i \in \{v_1, \dots, v_k\}$ where $v_1, \dots, v_k \in D_i$; in other words,
$\{v_1, \dots, v_k\}$ is an element of the power set of $D_i$. We do not discuss how to efficiently evaluate the most
effective split criterion for unordered domains in detail here and instead refer the curious reader to [3].
Here, it is sufficient to know that the algorithm is based on the following elegant observation: the optimal
split criterion is a contiguous subsequence in the list of values of $X_i$ after sorting by the average $Y$ value.
StoppingCriteria(D): There are two major modes of regularization in decision trees, both of
which manifest themselves as stopping criteria. A node in the decision tree will not be expanded if the
number of data points in $D$ falls below a user-provided threshold. Alternatively, one can also specify the
decision tree's maximum depth on any given path from root to leaf.
FindPrediction(D): The prediction at a leaf is the average of all the $Y$ values of the data
points in $D$.
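Combining the three operations, Algorithm 1 specializes to a regression tree roughly as in the sketch below. It reuses the Node class and best_split_ordered helper from the earlier sketches, and checks the stopping criteria at the top of each recursive call, a minor restructuring of the pseudocode above.

# Recursive regression-tree builder in the spirit of Algorithm 1, reusing
# the Node class and best_split_ordered() from the earlier sketches.
import numpy as np

def build_tree(X: np.ndarray, y: np.ndarray, depth: int = 0,
               max_depth: int = 8, min_samples: int = 10) -> Node:
    node = Node()
    # StoppingCriteria: too few points or maximum depth reached.
    if len(y) < min_samples or depth >= max_depth:
        node.prediction = float(y.mean())          # FindPrediction: mean of Y
        return node
    # FindBestSplit: scan every feature, keep the largest impurity reduction.
    feature, threshold, gain = max(
        ((f, *best_split_ordered(X[:, f], y)) for f in range(X.shape[1])),
        key=lambda t: t[2])
    if threshold is None or gain <= 0.0:           # no useful split exists
        node.prediction = float(y.mean())
        return node
    node.feature, node.threshold = feature, threshold
    mask = X[:, feature] < threshold
    node.left = build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples)
    node.right = build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)
    return node

A call such as build_tree(X_train, y_train, max_depth=5) then yields a tree that predict_one from the earlier sketch can evaluate.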
III. RELATED WORK
Any discussion of decision trees and random forests must start with [3], which remains the seminal
reference in this field. The effectiveness of random forests has been empirically demonstrated in multiple
studies, such as [4], [5], and [2].
However, surprisingly little work exists in the literature on the topic of distributed decision trees.
There has been work focused on parallelizing decision trees for GPUs [6] and for specialized hardware
architectures [7], which are somewhat limited in applicability.
For purely distributed systems with commodity hardware, the most relevant work can be found in the
PLANET framework developed by Google [8]. In PLANET, decision trees (and, by extension, random
forests) are computed using a MapReduce model of distributed computation. A key difference between
PLANET and our proposed solution is the partition strategy: PLANET partitions the dataset by row
rather than by column. Moreover, PLANET shuffles the entire dataset from node to node during the
computation to avoid bookkeeping, a key design decision that presents opportunities for optimization.
In contrast with PLANET, Caragea et al. [9] suggest that a vertical fragmentation strategy can provide
efficiency gains for decision trees. However, they present neither empirical studies nor concrete algorithms
that confirm this hypothesis.
The same holds true for incrementalization: while some work has been done to incrementalize certain
linear machine learning models, such as in the COLUMBUS system developed at Stanford [10], we are
not aware of any analogous work in decision tree learning.

IV. TECHNICAL APPROACH


We propose two frameworks for accelerating tree learning pipelines. The first is Yggdrasil, a
distribution framework that we believe is much faster than the state-of-the-art in distributed tree learning.
The second is iTree, which to our knowledge is the only incrementalization framework for decision
tree models.
A. Yggdrasil
We propose the following framework for learning distributed tree models over large-scale datasets.
Let $W = \{W_1, W_2, \dots, W_M\}$ be the set of workers in our distributed system, and let $\mathcal{M}$ be the master
node. Let $P : X_i \mapsto W_j$ be a partition function which assigns each attribute to a worker. For now, we
assume that $P$ is ideal, i.e., the set of attributes $X$ is uniformly partitioned across the workers $W$. Given
the partition function $P$, the algorithm Yggdrasil (Algorithm 2) computes a decision tree of maximum
depth $d$ over the workers $W$ and the master $\mathcal{M}$.
Algorithm 2 Yggdrasil
1: for i = 1, . . . , d do
2:     for each node k in level i do
3:         ComputeBestSplits(W_j, k, i) for all W_j ∈ W
4:         BroadcastSplits(W_j, ℳ) for all W_j ∈ W
5:     end for
6: end for
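For intuition, a simple (and in general non-ideal) choice of P is to assign columns to workers round-robin, as in the sketch below; the real Yggdrasil layout is governed by Spark's partitioning, so this only illustrates the bookkeeping involved.

# Round-robin column partitioner: a stand-in for the partition function P.
# The actual layout in Yggdrasil is decided by Spark partitioning; this
# only illustrates the attribute-to-worker mapping.
def partition_columns(num_features: int, num_workers: int) -> dict:
    """Map each attribute index j to the worker that owns column X_j."""
    return {j: j % num_workers for j in range(num_features)}

# Example: partition_columns(10, 3) assigns columns 0, 3, 6, 9 to worker 0,
# columns 1, 4, 7 to worker 1, and columns 2, 5, 8 to worker 2.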

Figure 2 demonstrates the vertical fragmentation for our algorithm and the communication pattern between the workers and the master. The sub-routine ComputeBestSplits is analogous to FindBestSplit
from Section II-C. It differs, however, in the following way: the worker $W_j$ computes the best split
for a particular node $k$ in level $i$ of the tree. In Yggdrasil, nodes in the tree are represented
logically: a worker may hold an entire column of the dataset, but only a subset of that
column should be used to find the best split for a particular node. Thus, ComputeBestSplits
takes $k$ and $i$ as arguments so that a worker can compute the split over the correct subset of
$X_i$. In BroadcastSplits, each worker $W_j$ sends a bit vector to the master to communicate the split
information; the bit vector represents the split destination of each data instance for that particular
feature (a sketch of this encoding follows Figure 2).
With this framework, we can readily update our decision tree with a new feature $X_{N+1}$ by assigning
that feature to a new partition.

Fig. 2. In Yggdrasil, we partition the dataset by columns; note that a worker may be assigned multiple partitions. After
each iteration of the algorithm, each worker sends a bit vector to the master that encodes the split information for that feature.
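The per-node message a worker returns for one of its features could look like the following sketch, assuming data instances are indexed consistently across workers; numpy's packbits is used here only to show that the message is a compact bitmap, not to describe the actual serialization in Spark.

# Sketch of the bit vector a worker sends for one logical tree node:
# bit i is 1 if instance i goes to the left child under the chosen split.
# The actual wire format in Spark would differ; this only shows the idea.
import numpy as np

def encode_split(column: np.ndarray, threshold: float) -> bytes:
    goes_left = column < threshold            # per-instance destination
    return np.packbits(goes_left).tobytes()   # 1 bit per instance on the wire

def decode_split(message: bytes, num_instances: int) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    return bits[:num_instances].astype(bool)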

B. iTree
Here, we describe the single-node strategy for adding or removing features from an existing
model. We omit the dataset augmentation and regularization parameter modification algorithms for
simplicity. We expect that we will be able to extend these strategies to function on top of the distributed
framework Yggdrasil, although this is still an open question for the time being.
At a given node $n$, let the features we are considering for our split criterion be $X_1, \dots, X_N$. Let
the optimal criterion splits at the node be $P_n(X_1), \dots, P_n(X_N)$ with qualities $Q_n(X_1), \dots, Q_n(X_N)$,
respectively. Note that our quality function $Q$ is based on the impurity criterion we discussed earlier.
We modify our representation of node $n$ so that n.split is no longer a single split criterion, but a
specialized data structure that tracks the tuples $(P_n(X_i), Q_n(X_i))$ for all $i$. This data structure is optimized
so that we can efficiently add and remove tuples and query for the tuple with the maximum quality $Q_n$.
This is shown for a simple example in Figure 3.
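One possible realization of this per-node structure is sketched below; it is our own simplification, keyed by feature and scanning for the maximum on demand (a heap or an order-statistics tree would make the max query logarithmic, but the dictionary version keeps the sketch short).

# Sketch of the per-node split store used by iTree: for every candidate
# feature it tracks the best criterion P_n(X_i) and its quality Q_n(X_i).
class SplitStore:
    def __init__(self):
        self._by_feature = {}                 # feature -> (criterion, quality)

    def add(self, feature, criterion, quality):
        self._by_feature[feature] = (criterion, quality)

    def rm(self, feature):
        self._by_feature.pop(feature, None)

    def max(self):
        """Return (feature, criterion, quality) with the highest quality."""
        if not self._by_feature:
            return None
        feature = max(self._by_feature, key=lambda f: self._by_feature[f][1])
        criterion, quality = self._by_feature[feature]
        return feature, criterion, quality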
Now we walk through how we update the tree given new features to add and remove (Algorithm 3).
We recursively traverse the tree, just as we did while building it. At each node's split data structure, we
add the appropriate tuples measuring how the dataset is split by the optimal criterion on each new feature.
We also delete the tuples corresponding to the features that we wish to remove. We then check whether the
optimal split has changed, and if it has, we know that we have to re-evaluate the subtree rooted at
that node using Algorithm 1. In the worst case, if we add or remove an extremely important feature,
we essentially have to reconstruct the tree from scratch. This, however, is a reasonable worst case, as it
arises only if the original model was very poorly constructed (although we are actively investigating potential
optimizations for this situation). On the other hand, if the feature modifications are less critical to correctness,
updating the decision tree is highly efficient, as only part of the tree needs to be reconstructed.
Algorithm 3 DTNodeFeatureUpdate
Require: Node n, Updated data D_new ⊆ 𝒟_new, Feature add-set X_add, Feature remove-set X_rm
 1: oldSplit ← (n → split).max()
 2: for each feature X in X_add do
 3:     newCriterion ← Criterion(X, D_new)
 4:     newTuple ← (newCriterion, Quality(newCriterion, D_new))
 5:     (n → split).add(newTuple)
 6: end for
 7: for each feature X in X_rm do
 8:     (n → split).rm(X)
 9: end for
10: newSplit ← (n → split).max()
11: if newSplit ≠ oldSplit then
12:     D_L, D_R ← SplitData(D_new, newSplit)
13:     SingleNodeDecisionTree(n → left, D_L)
14:     SingleNodeDecisionTree(n → right, D_R)
15: end if
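In terms of the SplitStore, Node, and build_tree sketches above, the per-node update could look roughly as follows. Here best_split_ordered stands in for the Criterion and Quality machinery of Section II-C, node.split is assumed to hold a SplitStore instance, and corner cases such as a feature with no useful split are omitted.

# Sketch of Algorithm 3 on top of the earlier sketches; node.split is
# assumed to be a SplitStore, and best_split_ordered() plays the role of
# the Criterion/Quality routines from Section II-C.
def dt_node_feature_update(node, X_new, y_new, add_fset, rm_fset):
    old_split = node.split.max()
    for f in add_fset:
        criterion, quality = best_split_ordered(X_new[:, f], y_new)
        node.split.add(f, criterion, quality)
    for f in rm_fset:
        node.split.rm(f)
    new_split = node.split.max()
    if new_split is not None and new_split != old_split:
        feature, threshold, _ = new_split     # SplitData + re-expansion
        node.feature, node.threshold = feature, threshold
        mask = X_new[:, feature] < threshold
        node.left = build_tree(X_new[mask], y_new[mask])    # Algorithm 1
        node.right = build_tree(X_new[~mask], y_new[~mask])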

In addition to modifying feature sets, we are currently working to build strategies that allow us to
effectively add new training data and modify regularization parameters. We expect the latter to be
much easier, because we merely have to determine whether we need to prune or expand from each
leaf node (depending on whether the user has decided to regularize by specifying a minimum leaf size
threshold or a maximum depth threshold). On the other hand, dataset augmentation is a much more
difficult problem, because adding new data could potentially modify the optimal splits for each feature at
every node. Additional data structure augmentation will be necessary to prevent large amounts of model
recomputation. We are also still in the process of adapting this strategy to a distributed environment on
top of Yggdrasil.

Fig. 3. In iTree, we augment the tree data structure so that a node's split is a specialized data structure keeping track of
how every feature splits the data flowing through that node. Here we show an example of how the data structure functions when
only three features are under consideration.

V. EVALUATION METRICS
To evaluate Yggdrasil, we unfortunately cannot compare it directly against PLANET: the experiments
in the PLANET paper used a proprietary dataset called AdCorpus. However, we can derive
theoretical bounds for what our performance should be, at least in terms of communication costs, and
confirm these empirically by examining the bytes sent over the network during the shuffle stages of our
algorithm. For our experiments, we will use the libSVM datasets: specifically, the YearPredictionMSD
dataset for regression tasks, and the MNIST and MNIST8M datasets for classification tasks.
To evaluate iTree, we aim to demonstrate that after initial training, we can dramatically reduce model
recomputation time. To demonstrate this, we perform a feature shuffle to randomly select 75% of the
features to train our model. We then iteratively add features one at a time, and both retrain our model using
iTree and train a model with the updated feature set from scratch. We hope to show comparable training
times between the two mechanisms on the first iteration. For subsequent iterations, we hope to show that iTree
is significantly faster than training from scratch. We will also show that both methods have negligible
differences in accuracy.
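A sketch of the measurement loop we have in mind appears below, built on the hypothetical ITreeRegressor interface from Section I; it times the incremental update against full retraining as each held-out feature is reintroduced.

# Sketch of the iTree evaluation protocol: train on a random 75% of the
# features, then add the held-out features back one at a time, timing the
# incremental update against retraining from scratch. ITreeRegressor is
# the hypothetical interface sketched in Section I.
import time
import numpy as np

def evaluate_incremental(X, y, seed=0):
    rng = np.random.default_rng(seed)
    features = rng.permutation(X.shape[1]).tolist()
    cutoff = int(0.75 * len(features))
    active, held_out = features[:cutoff], features[cutoff:]

    model = ITreeRegressor(feature_subset=active)
    model.fit(X, y)

    for f in held_out:
        t0 = time.perf_counter()
        model.increment(add_fset=[f], rm_fset=[])     # incremental update
        t_incremental = time.perf_counter() - t0

        active.append(f)
        t0 = time.perf_counter()
        baseline = ITreeRegressor(feature_subset=active)
        baseline.fit(X, y)                            # retrain from scratch
        t_scratch = time.perf_counter() - t0
        print(f"feature {f}: iTree {t_incremental:.2f}s, scratch {t_scratch:.2f}s")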


VI. CONCLUSION
We have proposed a new distributed framework for computing decision trees across multiple
nodes and in the context of modern data science pipelines. Ultimately, our goal is two-fold. First, we
aim to better understand the fundamental trade-offs that different partition strategies yield in a distributed
environment, particularly in terms of communication costs. Second, we hope to incrementalize the decision
tree learning algorithm to accelerate the process of modifying existing models. We hope that we can
contribute this work back to the academic and open-source communities as part of the scikit-learn and
Apache Spark projects. We also hope to submit this work for publication to the International Conference
on Machine Learning.
VII. TIMELINE
Table I shows the timeline for the project. We expect the majority of the work to be complete before
the end of the semester, except for potentially marrying the distribution and incrementalization strategies,
which may continue into IAP and the Spring term.
TABLE I
PROJECT TIMELINE

October     Complete decision tree distribution implementation
November    Complete single-node incrementalization and evaluate distribution strategy
December    Evaluate single-node incrementalization and complete distributed incrementalization
January     Evaluate distributed incrementalization strategy

REFERENCES
[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[2] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[3] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.
[4] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 161–168.
[5] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[6] Y. Liao, A. Rubinsteyn, R. Power, and J. Li, "Learning random forests on the GPU," Department of Computer Science, New York University, 2013.
[7] J. P. Bradford and J. A. Fortes, "Characterization and parallelization of decision-tree induction," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 322–349, 2001.
[8] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo, "PLANET: Massively parallel learning of tree ensembles with MapReduce," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1426–1437, 2009.

[9] D. Caragea, A. Silvescu, and V. Honavar, "A framework for learning from distributed data using sufficient statistics and its application to learning decision trees," International Journal of Hybrid Intelligent Systems, vol. 1, no. 1-2, p. 80, 2004.
[10] C. Zhang, A. Kumar, and C. Ré, "Materialization optimizations for feature selection workloads," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 265–276.
