
Ranking with Boosted Decision Trees

Seminar Information Retrieval


Lecturer: Dr. Karin Haenelt
Schigehiko Schamoni
Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

January 16, 2012


Introduction
    Web Scale Information Retrieval
    Ranking in IR

Algorithms for Ranking
    MART
        Decision Trees
        Boosting
        Multiple Additive Regression Trees
    LambdaMART
        RankNet
        LambdaRank
        LambdaMART Algorithm

Using Multiple Rankers

References



Example: Web Search



Web Search Features

Technology features of modern web search engines:


Estimation of hit counts
Can index many pages
Very fast
Automatic spelling correction
Preview of data
Sophisticated ranking of results (the topic of this talk!)
...



What is the Size of the Web?

From http://www.worldwidewebsize.com/, accessed 08.1.2012

Special algorithms are needed to handle this amount of information.



Web Scale Information Retrieval

The retrieval pipeline must reduce the number of pages significantly!


Details of a Web Search Engine: Indexing

(Figure: web pages, text acquisition, text transformation, index creation, document data store, index.)

Components of the Indexing part of a search engine (Croft et al., 2010).


Details of a Web Search Engine: Querying

(Figure: index, ranking, evaluation, user interaction, log data, document data store.)

Components of the Querying part of a search engine (Croft et al., 2010).

The most important element in the whole querying process is ranking.


Model Types for Information Retrieval

Classification of model types for Information Retrieval:


1. Set-theoretic models, e.g.
   - Boolean models
   - extended Boolean models
2. Algebraic models, e.g.
   - vector space model
   - latent semantic indexing
3. Probabilistic models, e.g.
   - probabilistic relevance (BM25)
   - language models


Relevance and Ranking


IR models generate different values describing the relationship between a
search query and the target document, e.g. similarity.
This value expresses the relevance of a document w.r.t. the query and
induces a ranking of retrieval results.
Some important measures we heard of in this seminar:
  - (Normalized) term frequency
  - (Normalized) term weight
  - Inverse document frequency
  - Cosine similarity (vector model)
  - Retrieval status value (probabilistic model)

see http://kontext.fraunhofer.de/haenelt/kurs/InfoRet/


Learning to Rank

From: Liu (2010), Learning to Rank for Information Retrieval.

Basic Idea of Machine Learning:


Hypothesis $F$ transforms an input object $x$ into an output object $\hat{y} = F(x)$.
$L(\hat{y}, y)$ is the loss, i.e. the difference between the predicted $\hat{y}$ and the target $y$.
Learning process: find the hypothesis minimizing $L$ by tuning $F$.
Learning a ranking function with machine learning techniques:
Learning to Rank (LTR)

Features for Learning

To learn a ranking function, each query-document pair is represented by a
vector of features from three categories:

1. Features modelling the web document d (static features):
   - inbound links, PageRank, document length, etc.
2. Features modelling the query-document relationship (dynamic features):
   - frequency of search terms in the document, cosine similarity, etc.
3. Features modelling the user query q:
   - number of words in the query, query classification, etc.

In supervised training, the ranking function is learned from vectors of
known ranking levels.
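To make the three categories concrete, here is a minimal sketch of turning one
query-document pair into a feature vector. The helper name extract_features and
all feature values are invented for illustration; they are not taken from any
particular search engine.

# Hypothetical feature extraction for one query-document pair (illustrative only).
def extract_features(query_terms, doc_terms, inbound_links, pagerank):
    tf = sum(doc_terms.count(t) for t in query_terms)   # dynamic: raw term frequency
    return [
        float(inbound_links),                            # static: document feature
        pagerank,                                        # static: document feature
        float(len(doc_terms)),                           # static: document length
        tf / max(len(doc_terms), 1),                     # dynamic: normalized tf
        float(len(query_terms)),                         # query feature
    ]

x = extract_features(["boosted", "trees"],
                     "gradient boosted decision trees for ranking".split(),
                     inbound_links=42, pagerank=0.73)
print(x)   # one feature vector; a relevance label would be attached for training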


Example: Features for AltaVista (2002)


A0 - A4      anchor text score per term
W0 - W4      term weights
L0 - L4      first occurrence location (encodes hostname and title match)
SP           spam index: logistic regression of 85 spam filter variables
             (against relevance scores)
F0 - F4      term occurrence frequency within document
DCLN         document length (tokens)
ER           Eigenrank
HB           Extra-host unique inlink count
ERHB         ER*HB
A0W0 etc.    A0*W0
QA           Site factor - logistic regression of 5 site link and url count ratios
SPN          Proximity
FF           family friendly rating
UD           url depth

From: J. Pedersen (2008), The Machine Learned Ranking Story



Algorithms for Ranking


Support Vector Machines (Vapnik, 1995)
  - Very good classifier
  - Can be adapted to ranking and multiclass problems

Neural Nets
  - RankNet (Burges et al., 2006)

Tree Ensembles
  - Random Forests (Breiman and Schapire, 2001)
  - Boosted Decision Trees
      - Multiple Additive Regression Trees (Friedman, 1999)
      - LambdaMART (Burges, 2010)
      - Used by AltaVista, Yahoo!, Bing, Yandex, ...

All top teams of the Yahoo! Learning to Rank Challenge (2010) used
combinations of Tree Ensembles!


Yahoo! Learning to Rank Challenge


Yahoo! Webscope dataset (Chapelle and Chang, 2011):
36,251 queries, 883k documents, 700 features, 5 ranking levels

set-1:
  - 473,134 feature vectors
  - 519 features
  - 19,944 queries

set-2:
  - 34,815 feature vectors
  - 596 features
  - 1,266 queries

The winner used a combination of 12 models:
  - 8 Tree Ensembles (LambdaMART)
  - 2 Tree Ensembles (Additive Regression Trees)
  - 2 Neural Nets



Decision Trees

Characteristics of a tree:
  - Graph-based model
  - Consists of a root, nodes, and leaves

Advantages:
  - Simple to understand and interpret
  - White-box model
  - Can be combined with other techniques

Decision trees serve as base learners for machine learning, e.g. as
classification or regression trees.


Learning a Regression Tree (I)

Consider a 2-dimensional space consisting of data points of the indicated
values. We start with an empty root node (blue).

Learning a Regression Tree (II)

The algorithm searches for a split variable and a split point, $x_1$ and $v_1$,
that predict values minimizing the prediction error, e.g. $\sum_i (y_i - f(x_i))^2$.

Learning a Regression Tree (III)

Here we examine the right side first: find a split variable and a split value
that minimize the prediction error, i.e. $x_2$ and $v_2$.

Learning a Regression Tree (IV)

Now to the left side: again, find a split variable and a split value that
minimize the prediction error, i.e. $x_1$ and $v_3$.


Learning a Regression Tree (V)

Once again, find a split variable and a split value that minimize the
prediction error, here $x_2$ and $v_4$. The tree perfectly fits the data! Problem?

Formal Definition of a Decision Tree


A decision tree partitions the parameter space into disjoint regions $R_k$,
$k \in \{1, \dots, K\}$, where $K$ is the number of leaves. Formally, the
regression model (1) predicts a value using a constant $\gamma_k$ for each
region $R_k$:

    T(x; \Theta) = \sum_{k=1}^{K} \gamma_k \, 1(x \in R_k)                                    (1)

$\Theta = \{R_k, \gamma_k\}_1^K$ describes the model parameters, $1(\cdot)$ is the
characteristic function (1 if the argument is true, 0 otherwise), and
$\gamma_k = \mathrm{mean}(y_i \mid x_i \in R_k)$.
Optimal parameters $\hat{\Theta}$ are found by minimizing the empirical risk:

    \hat{\Theta} = \arg\min_{\Theta} \sum_{k=1}^{K} \sum_{x_i \in R_k} L(y_i, \gamma_k)       (2)

The combinatorial optimization problem (2) is usually split into two parts:
(i) finding $R_k$ and (ii) finding $\gamma_k$ given $R_k$.
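As an illustration of part (i), finding the regions, here is a minimal sketch of
the greedy split search used on the previous slides: it scores every candidate
split variable and split point by the resulting squared error and keeps the best
one. The helper name best_split and the toy data are illustrative, not from the
slides.

# Greedy search for one split (variable index and value) minimizing squared error.
import numpy as np

def best_split(X, y):
    """X: (n, d) feature matrix, y: (n,) targets. Returns (feature, value, error)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                      # candidate split variables
        for v in np.unique(X[:, j]):                 # candidate split points
            left, right = y[X[:, j] <= v], y[X[:, j] > v]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts its mean, i.e. gamma_k = mean(y_i | x_i in R_k)
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, v, err)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1.0, 1.2, 3.0, 3.1])
print(best_split(X, y))   # splits on feature 0 at value 2.0 with squared error 0.025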

Boosting

Idea
Combine multiple weak learners to build a strong learner.
A weak learner is a learner with an error rate slightly better than random
guessing. A strong learner is a learner with high accuracy.
Approach:
Apply a weak learner to iteratively modified data
Generate a sequence of learners
For classification tasks: use majority vote
For regression tasks: combine predictions as a weighted sum


Function Estimation

Find a function $F^*(x)$ that maps $x$ to $y$, s.t. the expected value of some
loss function $L(y, F(x))$ is minimized:

    F^*(x) = \arg\min_{F(x)} E_{y,x}[L(y, F(x))]

Boosting approximates $F^*(x)$ by an additive expansion

    F(x) = \sum_{m=1}^{M} \beta_m h(x; a_m)

where $h(x; a)$ are simple functions of $x$ with parameters $a = \{a_1, a_2, \dots, a_n\}$
defining the function $h$, and $\beta_m$ are expansion coefficients.


Finding Parameters

Expansion coefficients $\{\beta_m\}_0^M$ and the function parameters $\{a_m\}_0^M$
are iteratively fit to the training data:

1. Set $F_0(x)$ to an initial guess
2. For each $m = 1, 2, \dots, M$:

    (\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \beta h(x_i; a))    (3)

   and

    F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)                                                       (4)


Gradient Boosting
Gradient boosting approximately solves (3) for differentiable loss functions:

1. Fit the function $h(x; a)$ by least squares

    a_m = \arg\min_{a} \sum_{i=1}^{N} [\tilde{y}_{im} - h(x_i; a)]^2                          (5)

   to the pseudo-residuals

    \tilde{y}_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}    (6)

2. Given $h(x; a_m)$, the $\beta_m$ are

    \beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m))        (7)

Gradient boosting simplifies the problem to least squares (5).
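A short worked case (my addition, not on the slide): for the squared loss the
pseudo-residuals of (6) are just the ordinary residuals.

% With L(y, F) = \tfrac{1}{2}(y - F)^2, equation (6) gives
\tilde{y}_{im} = -\left[ \frac{\partial \, \tfrac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}
              = y_i - F_{m-1}(x_i)
% i.e. each stage fits h(x; a_m) to whatever the current model still gets wrong.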



Gradient Tree Boosting


Gradient tree boosting applies this approach on functions h(x; a)
representing K -terminal node regression trees.
h(x; {Rkm }K
1 )=

K
X

ykm 1(x Rkm )

(8)

k=1

With ykm = meanxi Rkm (


yim ) the tree (8) predicts a constant value ykm in
region Rkm . Equation (7) becomes a prediction of a km for each Rkm :
km = arg min

L(yi , Fm1 (xi ) + )

(9)

xi Rkm

The approximation for F in stage m is then:


Fm (x) = Fm1 (x) + km 1(xi Rkm )

(10)

The parameter controls the learning rate of the procedure.



Learning Boosted Regression Trees (I)

First, learn the simplest predictor: a single constant value that minimizes the
error over all training data.

Calculating Optimal Leaf Value for F0


Recall the expansion coefficient:
$\gamma_{km} = \arg\min_{\gamma} \sum_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma)$

Quadratic loss for the leaf (red curve in the figure):

    f(x) = 5(1-x)^2 + 4(2-x)^2 + 3(3-x)^2 + 5(4-x)^2

$f(x)$ is quadratic and convex, so the optimum is where $f'(x) = 0$ (green line):

    f'(x) = 5(-2+2x) + 4(-4+2x) + 3(-6+2x) + 5(-8+2x)
          = -84 + 34x = 34(x - 2.471)
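Spelling out the result (my addition, assuming the coefficients 5, 4, 3, 5 count
the data points with target values 1, 2, 3, 4 in this leaf): setting $f'(x) = 0$
recovers the weighted mean of the leaf's targets, which is exactly the
$\gamma_k = \mathrm{mean}(y_i \mid x_i \in R_k)$ from the tree definition.

x^{*} = \frac{5 \cdot 1 + 4 \cdot 2 + 3 \cdot 3 + 5 \cdot 4}{5 + 4 + 3 + 5}
      = \frac{42}{17} \approx 2.471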


Learning Boosted Regression Trees (II)

Split the root node based on the least-squares criterion to build a tree
predicting the pseudo-residuals.

Learning Boosted Regression Trees (III)

In the next stage, another tree is created to fit the pseudo-residuals
remaining after the first tree.

Learning Boosted Regression Trees (IV)

This is iteratively continued: in each stage, the algorithm builds a new tree
based on the pseudo-residuals of the current tree ensemble.

Multiple Additive Regression Trees (MART)


Algorithm 1 Multiple Additive Regression Trees.
 1: Initialize $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
 2: for m = 1, ..., M do
 3:   for i = 1, ..., N do
 4:     $\tilde{y}_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$
 5:   end for
 6:   $\{R_{km}\}_{k=1}^{K}$   // Fit a regression tree to the targets $\tilde{y}_{im}$
 7:   for k = 1, ..., K_m do
 8:     $\gamma_{km} = \arg\min_{\gamma} \sum_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma)$
 9:   end for
10:   $F_m(x) = F_{m-1}(x) + \sum_{k=1}^{K_m} \gamma_{km} \, 1(x \in R_{km})$
11: end for
12: Return $F_M(x)$
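The following is a minimal sketch of Algorithm 1 in Python, assuming the squared
loss $L(y, F) = \tfrac{1}{2}(y - F)^2$: the pseudo-residuals of line 4 are then
simply $y_i - F_{m-1}(x_i)$, and the leaf values of line 8 coincide with the
tree's own least-squares leaf means, so no separate line-8 step is needed. It
uses scikit-learn's DecisionTreeRegressor as the tree learner; names such as
fit_mart and the shrinkage parameter eta are illustrative, not part of the
original algorithm.

# Minimal MART / gradient tree boosting sketch for squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mart(X, y, M=100, K=8, eta=0.1):
    F0 = y.mean()                              # line 1: best constant under squared loss
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                         # line 2
        residuals = y - F                      # line 4: pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=K)
        tree.fit(X, residuals)                 # line 6: K-leaf tree on the residuals
        F = F + eta * tree.predict(X)          # line 10, with shrinkage eta
        trees.append(tree)
    return F0, eta, trees

def predict_mart(model, X):
    F0, eta, trees = model
    return F0 + eta * sum(t.predict(X) for t in trees)   # line 12: F_M(x)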



RankNet Model
Differentiable function of the model parameters, typically a neural net.
RankNet maps a feature vector $x$ to a value $f(x; w)$.
Learned probabilities that URL $U_i$ should be ranked higher than $U_j$ are
modelled via a sigmoid function:

    P_{ij} \equiv P(U_i \triangleright U_j) \equiv \frac{1}{1 + e^{-(s_i - s_j)}}

with $s_i = f(x_i)$, $s_j = f(x_j)$.

The cost function is the cross entropy:

    C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij})

$P_{ij}$ is the model probability, $\bar{P}_{ij}$ is the known probability from training.
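A small numerical sketch of these two formulas (my own illustration; the helper
name pair_cost and the toy scores are not from the slides). The target
probability $\bar{P}_{ij}$ is set to 1 here, i.e. the training data says $U_i$
should be ranked above $U_j$.

# RankNet pairwise probability and cross-entropy cost for one document pair.
import numpy as np

def pair_cost(s_i, s_j, P_bar):
    """s_i, s_j: model scores f(x_i; w), f(x_j; w); P_bar: known target probability."""
    P = 1.0 / (1.0 + np.exp(-(s_i - s_j)))          # model probability P_ij
    return -P_bar * np.log(P) - (1.0 - P_bar) * np.log(1.0 - P)

print(pair_cost(2.0, 0.5, P_bar=1.0))   # ~0.20: model already ranks U_i above U_j
print(pair_cost(0.5, 2.0, P_bar=1.0))   # ~1.70: pair is ranked the wrong way round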


RankNet Algorithm

Algorithm 2 RankNet Training.
1: Initialize $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
2: for each query $q \in Q$ do
3:   for each pair of URLs $U_i$, $U_j$ with different label do
4:     $s_i = f(x_i)$, $s_j = f(x_j)$
5:     Estimate cost $C$
6:     Update model weights $w_k \leftarrow w_k - \eta \frac{\partial C}{\partial w_k}$
7:   end for
8: end for
9: Return $w$


RankNet λs

The crucial part is the update:

    \frac{\partial C}{\partial w_k}
        = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k}
        + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k}
        = \lambda_{ij}\left(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\right)

$\lambda_{ij}$ describes the desired change of scores for the pair $U_i$ and $U_j$.

The sum over all $\lambda_{ij}$'s and $\lambda_{ji}$'s of a given query-document
vector $x_i$ w.r.t. all other differently labelled documents is

    \lambda_i = \sum_{j:\{i,j\} \in I} \lambda_{ij} - \sum_{k:\{k,i\} \in I} \lambda_{ki}

$\lambda_i$ is (kind of) a gradient of the pairwise loss of vector $x_i$.
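As an illustration of how the per-document $\lambda_i$ is assembled from pairwise
terms, here is a small sketch for one query (my own illustration: it plugs in the
RankNet pairwise gradient $-1/(1 + e^{s_i - s_j})$ for pairs whose first document
carries the higher label; function and variable names are not from the slides).

# Accumulate the per-document lambda_i of one query from the pairwise lambda_ij,
# following lambda_i = sum_{j:{i,j} in I} lambda_ij - sum_{k:{k,i} in I} lambda_ki.
import numpy as np

def lambdas(scores, labels):
    lam = np.zeros(len(scores))
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:                   # pair {i, j} in I: U_i more relevant than U_j
                l_ij = -1.0 / (1.0 + np.exp(scores[i] - scores[j]))   # dC/ds_i for this pair
                lam[i] += l_ij
                lam[j] -= l_ij                          # the same pair contributes -lambda_ij to U_j
    return lam

# Document 0 is the most relevant but has the lowest score, so it collects the
# largest (negative) lambda; the gradient step then pushes its score upwards.
print(lambdas(np.array([0.5, 2.0, 1.0]), np.array([2, 1, 0])))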


RankNet Example

(a) is the perfect ranking, (b) is a ranking with 10 pairwise errors, (c) is a ranking with
8 pairwise errors. Each blue arrow represents the $\lambda_i$ for the corresponding query-document vector $x_i$.
From: Burges (2010), From RankNet to LambdaRank to LambdaMART: An Overview.

LambdaRank Example

Problem: RankNet is based on the pairwise error, while modern IR measures emphasize
the top ranking positions. The red arrows show better $\lambda$'s for such measures.
From: Burges (2010), From RankNet to LambdaRank to LambdaMART: An Overview.

From RankNet to LambdaRank to LambdaMART


From RankNet to LambdaRank:
  - Multiply the $\lambda$'s by $|\Delta Z|$, i.e. the difference in an IR measure
    when $U_i$ and $U_j$ are swapped.
  - E.g. $|\Delta NDCG|$ is the change in NDCG when swapping $U_i$ and $U_j$:

    \lambda_{ij} = \frac{\partial C(s_i - s_j)}{\partial s_i}
                 = \frac{-|\Delta NDCG|}{1 + e^{(s_i - s_j)}}

From LambdaRank to LambdaMART:
  - LambdaRank models gradients
  - MART works on gradients
  - Combine both to get LambdaMART:
    MART with specified gradients and Newton step
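As an illustration of the $|\Delta NDCG|$ factor, the following sketch computes
the NDCG change caused by swapping two documents in a ranked list (my own helper
names; the DCG gain $(2^{rel} - 1)/\log_2(r + 1)$ is one standard choice rather
than something specified on the slide).

# |Delta NDCG| for swapping the documents at positions i and j of a ranking.
import math

def dcg(labels):
    return sum((2 ** rel - 1) / math.log2(r + 2) for r, rel in enumerate(labels))

def delta_ndcg(labels, i, j):
    ideal = dcg(sorted(labels, reverse=True))          # normalization constant
    swapped = labels[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(labels) - dcg(swapped)) / ideal

ranking = [0, 2, 1, 0]             # relevance labels in current ranked order
print(delta_ndcg(ranking, 0, 1))   # ~0.30: swapping the top positions changes NDCG a lot
print(delta_ndcg(ranking, 2, 3))   # ~0.02: swapping lower positions changes it much less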


LambdaMART Algorithm
Algorithm 3 LambdaMART.
 1: for i = 0, ..., N do
 2:   $F_0(x_i) = \mathrm{BaseModel}(x_i)$   // Set to 0 for empty BaseModel
 3: end for
 4: for m = 1, ..., M do
 5:   for i = 0, ..., N do
 6:     $y_i = \lambda_i$   // Calculate the λ-gradient
 7:     $w_i = \frac{\partial y_i}{\partial F_{m-1}(x_i)}$   // Calculate the derivative of the gradient for $x_i$
 8:   end for
 9:   $\{R_{km}\}_{k=1}^{K}$   // Create a K-leaf tree on $\{x_i, y_i\}$
10:   $\gamma_{km} = \frac{\sum_{x_i \in R_{km}} y_i}{\sum_{x_i \in R_{km}} w_i}$   // Assign leaf values (Newton step)
11:   $F_m(x_i) = F_{m-1}(x_i) + \eta \sum_{k} \gamma_{km} \, 1(x_i \in R_{km})$   // Update the model
12: end for
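A brief note on line 10 (my addition): the leaf value is a one-step Newton
approximation of the per-leaf cost minimizer, which replaces the explicit
arg min of MART's line 8.

% Line 10 as a Newton step: y_i is the first-order term (the lambda-gradient) and
% w_i = \partial y_i / \partial F_{m-1}(x_i) the second-order term, so
\gamma_{km} = \frac{\sum_{x_i \in R_{km}} y_i}{\sum_{x_i \in R_{km}} w_i}
% is one Newton iteration towards the minimum of the cost summed over the
% documents that fall into leaf R_{km}.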



Optimally combine Rankers


Linearly combine two rankers:

    (1 - \alpha) R(x_i) + \alpha R'(x_i)

Let $\alpha$ go from 0 to 1:
  - The score order changes only at the intersections
  - Enumerate all $\alpha$ for which pairs swap position
  - Calculate the desired IR measure (e.g. NDCG) at each such $\alpha$
  - Select the $\alpha$ giving the best score

The solution can be found analytically, or approximated by Boosting or a
LambdaRank approach; a sketch of the sweep follows below.

From: Wu et al. (2008), Ranking, Boosting, and Model Adaptation.
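A minimal sketch of that sweep, assuming we simply evaluate a grid of α values
on held-out data rather than enumerating the exact swap points analytically
(function names, the NDCG helper, and the toy scores are illustrative).

# Sweep alpha in (1 - alpha)*R + alpha*R' and keep the value with the best NDCG.
import numpy as np

def ndcg(scores, labels):
    order = np.argsort(-scores)                              # rank by combined score
    gains = (2.0 ** labels[order] - 1) / np.log2(np.arange(2, len(labels) + 2))
    ideal = (2.0 ** np.sort(labels)[::-1] - 1) / np.log2(np.arange(2, len(labels) + 2))
    return gains.sum() / ideal.sum()

def best_alpha(r1, r2, labels, steps=101):
    alphas = np.linspace(0.0, 1.0, steps)
    scores = [ndcg((1 - a) * r1 + a * r2, labels) for a in alphas]
    return alphas[int(np.argmax(scores))], max(scores)

r1 = np.array([0.5, 0.9, 0.1])        # scores from ranker R  (ranks doc 1 too high)
r2 = np.array([0.6, 0.2, 0.9])        # scores from ranker R' (ranks doc 2 too high)
labels = np.array([2, 1, 0])
print(best_alpha(r1, r2, labels))     # a mixture around alpha ~ 0.5 reaches NDCG 1.0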



References I
Breiman, Leo and E. Schapire (2001). Random Forests. Machine Learning, pp. 5-32.

Burges, Christopher J. C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview.

Burges, Christopher J. C., R. Ragno and Q. V. Le (2006). Learning to Rank with
Nonsmooth Cost Functions. In Schölkopf, Bernhard, J. Platt and T. Hoffman, eds.:
NIPS, pp. 193-200. MIT Press.

Chapelle, Olivier and Y. Chang (2011). Yahoo! Learning to Rank Challenge Overview.
Journal of Machine Learning Research - Proceedings Track, 14:1-24.

Croft, W. B., D. Metzler and T. Strohman (2010). Search Engines: Information
Retrieval in Practice. Pearson, London, England.

References II
Friedman, Jerome H. (1999). Greedy Function Approximation: A Gradient Boosting
Machine. Annals of Statistics, 29(5):1189-1232.

Ganjisaffar, Yasser (2011). Tree Ensembles for Learning to Rank. PhD thesis,
University of California, Irvine.

Hastie, Trevor, R. Tibshirani and J. Friedman (2002). The Elements of Statistical
Learning. Springer, New York.

Liu, Tie-Yan (2010). Learning to Rank for Information Retrieval. Springer-Verlag
New York Inc.

Manning, Christopher D., P. Raghavan and H. Schütze (2008). Introduction to
Information Retrieval. Cambridge University Press.

Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Springer
New York Inc., New York, NY, USA.

Wu, Qiang, C. J. C. Burges, K. M. Svore and J. Gao (2008). Ranking, Boosting,
and Model Adaptation.
