Introduction
MART
Decision Trees
Boosting
Multiple Additive Regression Trees
LambdaMART
RankNet
LambdaRank
LambdaMART Algorithm
References
[Figure: the indexing process - Text Acquisition, Text Transformation, and Index Creation turn Web pages into the Index.]
[Figure: the query process - User interaction, Ranking against the Index, Evaluation, and Log data.]
see http://kontext.fraunhofer.de/haenelt/kurs/InfoRet/
Learning to Rank
Neural Nets
RankNet (Burges et al., 2006)
Tree Ensembles
Random Forests (Breiman and Schapire, 2001)
Boosted Decision Trees
Multiple Additive Regression Trees (Friedman, 1999)
LambdaMART (Burges, 2010)
Used by AltaVista, Yahoo!, Bing, Yandex, ...
All top teams of the Yahoo! Learning to Rank Challenge (2010) used
combinations of Tree Ensembles!
set-2: 34,815 feature vectors, 596 features, 1,266 queries
Decision Trees
Characteristics of a tree:
Graph based model
Consists of a root, nodes, and leaves
Advantages:
Simple to understand and interpret
White box model
Can be combined with other techniques
Decision trees are base learners for machine learning, e.g. classification or
regression trees.
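As a concrete illustration, such a regression tree can be fit in a few lines. Below is a minimal sketch using scikit-learn; the toy data and depth limit are made up for illustration and are not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y is roughly piecewise constant in x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.2, 1.0, 1.1, 4.9, 5.2, 5.0, 9.1, 9.0])

# A shallow tree: each leaf predicts the mean target of its region.
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

# Piecewise-constant predictions for new points.
print(tree.predict(np.array([[2.5], [4.5], [7.5]])))
```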
Here we examine the right side first: find a split variable and a split value
that minimize the prediction error, i.e. x_2 and v_2.
Now to the left side: again, find a split variable and a split value that
minimize the prediction error, i.e. x_1 and v_3.
Once again, find a split variable and a split value that minimize the
prediction error, here x_2 and v_4. The tree now fits the training data perfectly. Problem? It overfits.
The regression tree model predicts a constant value per region:

f(x) = \sum_{k=1}^{K} \gamma_k \, 1(x \in R_k)    (1)

\{R_k, \gamma_k\}_1^K describes the model parameters, 1(\cdot) is the characteristic
function (1 if its argument is true, 0 otherwise), and \gamma_k = \mathrm{mean}(y_i \mid x_i \in R_k).

Optimal parameters \{\hat{R}_k, \hat{\gamma}_k\}_1^K are found by minimizing the empirical risk:

\{\hat{R}_k, \hat{\gamma}_k\}_1^K = \arg\min \sum_{k=1}^{K} \sum_{x_i \in R_k} L(y_i, \gamma_k)    (2)

The combinatorial optimization problem (2) is usually split into two parts:
(i) finding R_k and (ii) finding \gamma_k given R_k.
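Part (i) is usually solved greedily: try every candidate split variable and split value, score each split by the squared error that remains when both sides predict their mean (part (ii) for squared loss), and keep the best. A minimal sketch of one such split search; the function and the toy data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the split (feature j, value v) minimizing the
    summed squared error when each side predicts its mean (gamma)."""
    best = None
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            left, right = y[X[:, j] <= v], y[X[:, j] > v]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, v, left.mean(), right.mean())
    return best  # (error, split variable, split value, gamma_left, gamma_right)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.2, 5.0, 5.2])
print(best_split(X, y))  # splits between x = 2 and x = 3
```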
Boosting
Idea
Combine multiple weak learners to build a strong learner.
A weak learner is a learner with an error rate slightly better than random
guessing. A strong learner is a learner with high accuracy.
Approach:
Apply a weak learner to iteratively modified data
Generate a sequence of learners
For classification tasks: use majority vote
For regression tasks: combine the learners' predictions in a weighted sum
Function Estimation
Find a function F^*(x) that maps x to y, such that the expected value of some
loss function L(y, F(x)) is minimized:

F^*(x) = \arg\min_{F(x)} \mathbb{E}_{y,x}\left[L(y, F(x))\right]

Boosting approximates F^*(x) by an additive expansion

F(x) = \sum_{m=1}^{M} \beta_m \, h(x; \mathbf{a}_m)
Finding Parameters
Expansion coefficients \{\beta_m\}_0^M and the function parameters \{\mathbf{a}_m\}_0^M are
iteratively fit to the training data:

(\beta_m, \mathbf{a}_m) = \arg\min_{\beta, \mathbf{a}} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; \mathbf{a})\big)    (3)

and

F_m(x) = F_{m-1}(x) + \beta_m \, h(x; \mathbf{a}_m)    (4)
Gradient Boosting
Gradient boosting approximately solves (3) for differentiable loss functions.
First, the weak learner h(x; \mathbf{a}) is fit by least squares

\mathbf{a}_m = \arg\min_{\mathbf{a}, \rho} \sum_{i=1}^{N} \big[\tilde{y}_{im} - \rho\, h(x_i; \mathbf{a})\big]^2    (5)

to the pseudo-residuals

\tilde{y}_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}    (6)

Then the step size \rho_m is found by line search:

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; \mathbf{a}_m)\big)    (7)
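For common loss functions the pseudo-residuals (6) have simple closed forms: with squared loss they are the ordinary residuals, with absolute loss their signs. A short numpy sketch; the loss choices are illustrative, not prescribed by the slides.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0, 0.5])       # targets y_i
F_prev = np.array([2.0, 0.0, 2.5, 0.5])   # current predictions F_{m-1}(x_i)

# Squared loss L = 0.5 * (y - F)^2  ->  pseudo-residual is y - F.
residuals_l2 = y - F_prev

# Absolute loss L = |y - F|  ->  pseudo-residual is sign(y - F).
residuals_l1 = np.sign(y - F_prev)

print(residuals_l2)   # [ 1.  -1.  -0.5  0. ]
print(residuals_l1)   # [ 1. -1. -1.  0. ]
```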
With regression trees as weak learners, the m-th expansion term is itself a sum
over the tree's leaf regions:

h(x; \{R_{km}\}_1^K) = \sum_{k=1}^{K} \bar{y}_{km} \, 1(x \in R_{km})    (8)

where \bar{y}_{km} is the mean pseudo-residual in region R_{km}. Instead of a single
coefficient \rho_m, an optimal value \gamma_{km} is then chosen for each leaf region:

\gamma_{km} = \arg\min_{\gamma} \sum_{x_i \in R_{km}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)    (9)

F_m(x) = F_{m-1}(x) + \sum_{k=1}^{K} \gamma_{km} \, 1(x \in R_{km})    (10)
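Once the regions R_km are fixed, (9) is a one-dimensional problem per leaf: for squared loss the optimal γ_km is the mean residual in the leaf, for absolute loss the median. A small sketch with made-up leaf assignments and residuals:

```python
import numpy as np

# Leaf index of each training point under the newly fitted tree, and its residual.
leaf = np.array([0, 0, 1, 1, 1])
residual = np.array([0.4, 0.6, -1.0, -2.0, -1.5])

# gamma_km for squared loss: the mean residual per leaf (eq. 9 in closed form).
gamma_l2 = {int(k): float(residual[leaf == k].mean()) for k in np.unique(leaf)}
# gamma_km for absolute loss: the median residual per leaf.
gamma_l1 = {int(k): float(np.median(residual[leaf == k])) for k in np.unique(leaf)}

print(gamma_l2)   # {0: 0.5, 1: -1.5}
print(gamma_l1)   # {0: 0.5, 1: -1.5}
```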
First, learn the simplest possible predictor: a constant value that minimizes
the error over all training data.
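For squared error this optimal constant is simply the mean of the targets; a quick numerical check on made-up values:

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 5.0])       # toy targets
gammas = np.linspace(0.0, 6.0, 601)       # candidate constant predictions
loss = ((y[None, :] - gammas[:, None]) ** 2).sum(axis=1)

print(gammas[loss.argmin()])              # ~3.0, the loss-minimizing constant
print(y.mean())                           # 3.0, i.e. the mean of the targets
```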
[Plot: the squared error f(x) of a constant predictor as a function of the predicted value, together with its derivative; the minimum lies at x ≈ 2.471.]
Split root node based on least squares criterion to build a tree predicting
the pseudo-residuals.
This is continued iteratively: in each stage, the algorithm builds a new tree
on the pseudo-residuals computed from the predictions of the current tree ensemble.
The complete MART procedure (one K-leaf regression tree per stage):
1:  F_0(x) = arg min_γ Σ_{i=1}^{N} L(y_i, γ)   // Initial constant model
2:  for m = 1, ..., M do
3:      for i = 1, ..., N do
4:          ỹ_im = −[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x)=F_{m−1}(x)}   // Pseudo-residuals
5:      end for
6:      {R_km}_{k=1}^{K}   // Fit a regression tree to targets ỹ_im
7:      for k = 1, ..., K do
8:          γ_km = arg min_γ Σ_{x_i ∈ R_km} L(y_i, F_{m−1}(x_i) + γ)
9:      end for
10:     F_m(x) = F_{m−1}(x) + Σ_{k=1}^{K} γ_km 1(x ∈ R_km)
11: end for
12: Return F_M(x)
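For squared loss the whole procedure fits in a few lines when scikit-learn regression trees are used as base learners. The sketch below adds a shrinkage (learning rate) factor, which the listing above does not show; data, tree size, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mart(X, y, M=100, K=8, lr=0.1):
    """Gradient tree boosting (MART) for squared loss.
    Returns the initial constant F_0 and the list of fitted trees."""
    F = np.full(len(y), y.mean())          # step 1: optimal constant predictor
    trees = []
    for _ in range(M):                     # steps 2-11
        residuals = y - F                  # pseudo-residuals for squared loss (eq. 6)
        tree = DecisionTreeRegressor(max_leaf_nodes=K).fit(X, residuals)
        F = F + lr * tree.predict(X)       # step 10, scaled by the learning rate
        trees.append(tree)
    return y.mean(), trees

def predict_mart(F0, trees, X, lr=0.1):
    return F0 + lr * sum(t.predict(X) for t in trees)

# Toy usage on synthetic data.
X = np.random.RandomState(0).rand(200, 3)
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
F0, trees = fit_mart(X, y)
print(np.mean((predict_mart(F0, trees, X) - y) ** 2))   # small training error
```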
RankNet Model
The model is any differentiable function of the model parameters, typically a neural net.
RankNet maps a feature vector x to a score s = f(x; w).
The learned probability that URL U_i should be ranked higher than U_j is modelled via a
sigmoid of the score difference:

P_{ij} \equiv P(U_i \triangleright U_j) = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}
RankNet Algorithm
RankNet λ's
The crucial part is the update:
\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k} = \lambda_{ij}\left(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\right)

where \lambda_{ij} \equiv \partial C / \partial s_i = -\partial C / \partial s_j. Summing over all pairs in the pair set I
that contain document i gives a single \lambda per document:

\lambda_i = \sum_{j:\{i,j\} \in I} \lambda_{ij} - \sum_{k:\{k,i\} \in I} \lambda_{ki}
RankNet Example
(a) is the perfect ranking, (b) is a ranking with 10 pairwise errors, (c) is a ranking with
8 pairwise errors. Each blue arrow represents the λ_i for each query-document vector x_i.
From: Burges (2010), From RankNet to LambdaRank to LambdaMART: An Overview.
LambdaRank Example
C (si sj )
| NDCG |
=
si
1 + e (si sj )
LambdaMART Algorithm
Algorithm 3 LambdaMART.
1:  for i = 0, ..., N do
2:      F_0(x_i) = BaseModel(x_i)   // Set to 0 for empty BaseModel
3:  end for
4:  for m = 1, ..., M do
5:      for i = 0, ..., N do
6:          y_i = λ_i   // Calculate λ-gradient
7:          w_i = ∂y_i / ∂F_{m−1}(x_i)   // Calculate derivative of gradient for x_i
8:      end for
9:      {R_km}_{k=1}^{K}   // Create K-leaf tree on {x_i, y_i}
10:     γ_km = Σ_{x_i ∈ R_km} y_i / Σ_{x_i ∈ R_km} w_i   // Assign leaf values based on a Newton step
11:     F_m(x_i) = F_{m−1}(x_i) + η Σ_{k} γ_km 1(x_i ∈ R_km)   // Update model with learning rate η
12: end for
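Putting the pieces together, here is a hedged sketch of one LambdaMART boosting round for a single query. It takes the delta_ndcg helper from the LambdaRank sketch above as an argument (the demo below passes a crude stand-in to stay self-contained). The λ_i and w_i accumulations follow steps 6-7, with signs chosen so that more relevant documents are pushed up; the Newton-style leaf values follow step 10; σ, η, and the tree size K are illustrative choices, not values from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

sigma, eta, K = 1.0, 0.1, 8

def lambdas_and_weights(scores, rels, delta_ndcg):
    """Accumulate y_i = lambda_i and w_i over all pairs (i, j) with
    rels[i] > rels[j] for one query (steps 5-8)."""
    n = len(scores)
    lam, w = np.zeros(n), np.zeros(n)
    for i in range(n):
        for j in range(n):
            if rels[i] <= rels[j]:
                continue
            rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
            d = delta_ndcg(rels, i, j)
            lam[i] += sigma * rho * d                   # push the more relevant document up
            lam[j] -= sigma * rho * d                   # and the less relevant one down
            w[i] += sigma ** 2 * rho * (1.0 - rho) * d  # second-order terms for the Newton step
            w[j] += sigma ** 2 * rho * (1.0 - rho) * d
    return lam, w

def boosting_round(X, scores, rels, delta_ndcg):
    """One LambdaMART step: fit a K-leaf tree to the lambdas, then take a
    Newton step per leaf, gamma_km = sum(y_i) / sum(w_i) (step 10)."""
    y, w = lambdas_and_weights(scores, rels, delta_ndcg)
    tree = DecisionTreeRegressor(max_leaf_nodes=K).fit(X, y)
    leaf = tree.apply(X)                                # region R_km of each document
    gamma = {k: y[leaf == k].sum() / max(w[leaf == k].sum(), 1e-12)
             for k in np.unique(leaf)}
    return scores + eta * np.array([gamma[k] for k in leaf])   # step 11

# Toy usage: six documents of one query, scored from F_0 = 0 (empty BaseModel).
# The last argument is a crude stand-in for delta_ndcg, kept inline for a runnable demo.
X = np.random.RandomState(0).rand(6, 4)
rels = np.array([3.0, 2.0, 2.0, 1.0, 0.0, 0.0])
scores = boosting_round(X, np.zeros(6), rels, lambda r, i, j: abs(r[i] - r[j]) / 10.0)
print(scores)
```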
References I
Breiman, Leo and E. Schapire (2001). Random Forests. In Machine Learning, pp. 5–32.
Burges, Christopher J. C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview.
Burges, Christopher J. C., R. Ragno and Q. V. Le (2006). Learning to Rank with Nonsmooth Cost Functions. In Schölkopf, Bernhard, J. Platt and T. Hoffman, eds.: NIPS, pp. 193–200. MIT Press.
Chapelle, Olivier and Y. Chang (2011). Yahoo! Learning to Rank Challenge Overview. Journal of Machine Learning Research - Proceedings Track, 14:1–24.
Croft, W. B., D. Metzler and T. Strohman (2010). Search Engines: Information Retrieval in Practice. Pearson, London, England.
References II
Friedman, Jerome H. (1999). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29(5):1189–1232.
Ganjisaffar, Yasser (2011). Tree Ensembles for Learning to Rank. PhD thesis, University of California, Irvine.
Hastie, Trevor, R. Tibshirani and J. Friedman (2002). The Elements of Statistical Learning. Springer, New York.
Liu, Tie-Yan (2010). Learning to Rank for Information Retrieval. Springer-Verlag New York Inc.
Manning, Christopher D., P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.
Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Springer New York Inc., New York, NY, USA.
Wu, Qiang, C. J. C. Burges, K. M. Svore and J. Gao (2008). Ranking, Boosting, and Model Adaptation.