Sanjoy Dasgupta (UC San Diego)
John Langford (Yahoo Labs)
[Figure: unlabeled points → supervised learning → semisupervised and active learning]
Application domains: vendor catalogs, corporate collections, combinatorial chemistry [Warmuth et al 03; Freund et al 03]
Sampling bias
Start with a pool of unlabeled data.
Pick a few points at random and get their labels.
Repeat:
  Fit a classifier to the labels seen so far.
  Query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...).
Example: [Figure: data distribution with groups of probability mass 45%, 5%, 5%, 45%]
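To make the loop above concrete, here is a minimal sketch of pool-based uncertainty sampling; scikit-learn's LogisticRegression, the seed size, and the query budget are illustrative choices, not part of the original slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, oracle, seed_size=10, budget=100, rng=None):
    """Pool-based active learning with the 'query closest to the boundary' rule."""
    rng = rng or np.random.default_rng(0)
    n = len(X_pool)
    labeled = list(rng.choice(n, size=seed_size, replace=False))  # random seed set
    y = {i: oracle(X_pool[i]) for i in labeled}                   # query their labels
    for _ in range(budget):
        # assumes the random seed set already contains both classes
        clf = LogisticRegression().fit(X_pool[labeled], [y[i] for i in labeled])
        unlabeled = [i for i in range(n) if i not in y]
        if not unlabeled:
            break
        # distance to the decision boundary (smaller = more uncertain)
        margins = np.abs(clf.decision_function(X_pool[unlabeled]))
        q = unlabeled[int(np.argmin(margins))]   # most uncertain point
        y[q] = oracle(X_pool[q])                 # query its label
        labeled.append(q)
    return clf, y

On a distribution like the 45/5/5/45 example above, this loop can settle on a boundary between the small middle groups and stop exploring the rest of the data, which is the sampling-bias failure the slide illustrates.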
Outline of tutorial
[ZGL 03]
[Figure: graph-based semi-supervised learning; label values (0, .2, .4, .5, .6, .7, .8) on the graph nodes, propagated from the queried nodes to the rest of the graph]
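In the ZGL approach, each unlabeled node takes the weighted average of its neighbors' values, the harmonic function on the graph. The sketch below shows only that propagation step, assuming a small made-up adjacency matrix and two labeled endpoints; it is not the active-learning part of ZGL 03.

import numpy as np

def harmonic_labels(W, labeled_idx, labeled_vals, iters=200):
    """Propagate soft labels on a graph: each unlabeled node converges to the
    weighted average of its neighbors' values (the harmonic function of ZGL 03)."""
    n = W.shape[0]
    f = np.zeros(n)
    f[labeled_idx] = labeled_vals
    unlabeled = [i for i in range(n) if i not in set(labeled_idx)]
    for _ in range(iters):
        for i in unlabeled:
            if W[i].sum() > 0:
                f[i] = W[i] @ f / W[i].sum()   # neighbor average, weighted by edge weights
    return f

# Tiny made-up example: a path graph 0-1-2-3-4 with endpoints labeled 0 and 1.
W = np.array([[0,1,0,0,0],
              [1,0,1,0,0],
              [0,1,0,1,0],
              [0,0,1,0,1],
              [0,0,0,1,0]], dtype=float)
print(harmonic_labels(W, labeled_idx=[0, 4], labeled_vals=[0.0, 1.0]))
# -> roughly [0, .25, .5, .75, 1], fractional values like those on the slide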
[DH 08]
Basic primitive: unlabeled data → find a clustering → now what?
Rules:
For each tree node (i.e. cluster) v, maintain: (i) majority label L(v); (ii) empirical label frequencies p̂_{v,l}; and (iii) confidence interval [p^lb_{v,l}, p^ub_{v,l}].

for t = 1, 2, 3, ...:
  v = select-node(P)
  pick a random point z in subtree T_v and query its label
  update empirical counts for all nodes along the path from z to v
  choose the best pruning and labeling (P', L') of T_v
  P = (P \ {v}) ∪ P' and L(u) = L'(u) for all u in P'

select-node(P):
  random sampling: Prob[v] ∝ |T_v|
  active sampling: Prob[v] ∝ |T_v| (1 − p^lb_{v,L(v)})
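A rough sketch of the sampling loop above, simplified to a fixed flat clustering instead of tree prunings; the normal-approximation lower bound, the oracle interface, and the budget are illustrative assumptions rather than the DH 08 specification.

import math, random
from collections import Counter

def lower_bound(p_hat, n, z=1.96):
    """Crude lower confidence bound on a proportion (normal approximation)."""
    if n == 0:
        return 0.0
    return max(0.0, p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n + 1.0 / n))

def cluster_adaptive_sampling(clusters, oracle, budget):
    """clusters: dict cluster_id -> list of point indices (a fixed pruning of the tree).
    oracle(i) returns the label of point i. Returns a label guess for every queried cluster's points."""
    counts = {c: Counter() for c in clusters}          # empirical label counts per cluster
    for _ in range(budget):
        # active sampling rule: Prob[v] proportional to |T_v| * (1 - lower bound on majority fraction)
        weights = []
        for c, pts in clusters.items():
            n = sum(counts[c].values())
            p_hat = (counts[c].most_common(1)[0][1] / n) if n else 0.0
            weights.append(len(pts) * (1.0 - lower_bound(p_hat, n)))
        c = random.choices(list(clusters), weights=weights)[0]
        z = random.choice(clusters[c])                 # random point in the chosen cluster
        counts[c][oracle(z)] += 1                      # query its label, update counts
    # label every point with its cluster's majority label (never-queried clusters are left out)
    return {i: counts[c].most_common(1)[0][0]
            for c, pts in clusters.items() if counts[c] for i in pts}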
Outline of tutorial
Threshold functions on the line: h_w(x) = 1(x ≥ w)
[Figure: points on a line, labeled − to the left of the threshold w and + to the right]
Binary search: need just log(1/ε) labels, from which the rest can be inferred. Exponential improvement in label complexity!
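For this threshold class, the active learner is literally binary search over the sorted pool. A minimal sketch, assuming separable data and a label oracle:

def learn_threshold(xs, oracle):
    """Active learning of h_w(x) = 1(x >= w) on the line by binary search.
    xs: sorted unlabeled points; oracle(x): true label in {0, 1}.
    Queries O(log n) labels instead of n (separable data assumed)."""
    lo, hi = 0, len(xs) - 1
    if oracle(xs[lo]) == 1:
        return xs[lo]                # threshold is left of all points
    if oracle(xs[hi]) == 0:
        return float("inf")          # threshold is right of all points
    while hi - lo > 1:               # invariant: label(xs[lo]) = 0, label(xs[hi]) = 1
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:     # query the midpoint's label
            hi = mid
        else:
            lo = mid
    return xs[hi]                    # any w in (xs[lo], xs[hi]] is consistent

# Example: 1000 points, true threshold 0.73 -> about 10 queries instead of 1000.
xs = sorted(i / 1000 for i in range(1000))
w_hat = learn_threshold(xs, oracle=lambda x: int(x >= 0.73))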
Challenges: Nonseparable data? Other hypothesis classes?
Aggressive, separable data: Query by committee (Freund, Seung, Shamir, Tishby, 97); Splitting index (D, 05)
Mellow, separable data: Generic active learner (Cohn, Atlas, Ladner, 91)
Mellow, general (nonseparable) data: A2 algorithm (Balcan, Beygelzimer, L, 06); Disagreement coefficient (Hanneke, 07); Reduction to supervised (D, Hsu, Monteleoni, 2007); Importance-weighted approach (Beygelzimer, D, L, 2009)
Issues:
Computational tractability
Are labels being used as efficiently as possible?
[CAL 91]
Is a label needed? H_t = current candidate hypotheses.
[Figure: region of uncertainty, the part of the input space on which hypotheses in H_t disagree]
Maintaining H_t
Explicitly maintaining H_t is intractable. Do it implicitly, by reduction to supervised learning.
Explicit version:
  H_1 = hypothesis class
  For t = 1, 2, ...:
    Receive unlabeled point x_t
    If there is disagreement in H_t about x_t's label:
      query the label y_t of x_t
      H_{t+1} = {h ∈ H_t : h(x_t) = y_t}
    else:
      H_{t+1} = H_t
Implicit version: keep the labeled examples seen so far; to test for disagreement on x_t, ask a supervised learner whether some consistent hypothesis labels x_t positive and another labels it negative.
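A minimal sketch of the explicit version for a small finite hypothesis class (thresholds on a grid, an illustrative choice); in practice only the implicit, reduction-based version is tractable.

def cal(stream, oracle, hypotheses):
    """CAL / selective sampling [CAL 91], explicit version.
    stream: unlabeled points; oracle(x): true label; hypotheses: finite list of h(x) -> {0, 1}.
    Keeps the version space H_t and queries only where H_t disagrees."""
    H = list(hypotheses)                      # H_1 = full hypothesis class
    queries = 0
    for x in stream:
        predictions = {h(x) for h in H}
        if len(predictions) > 1:              # disagreement: a label is needed
            y = oracle(x)
            queries += 1
            H = [h for h in H if h(x) == y]   # H_{t+1} = hypotheses consistent with (x, y)
        # else: every h in H_t agrees, so the label is inferred for free
    return H, queries

# Illustrative use: threshold classifiers on a grid of candidate thresholds.
import random
thresholds = [i / 100 for i in range(101)]
hypotheses = [(lambda x, w=w: int(x >= w)) for w in thresholds]
stream = [random.random() for _ in range(1000)]
H, queries = cal(stream, oracle=lambda x: int(x >= 0.42), hypotheses=hypotheses)

On a stream like this, almost all queries happen early, while the version space is still large; once H narrows, most points fall outside the region of uncertainty and cost nothing.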
Label complexity [Hanneke]
Supervised learning, separable case: to achieve error ε, roughly d/ε labeled examples suffice (d = VC dimension).
CAL active learner, separable case: label complexity is roughly θ d log(1/ε), where θ is the disagreement coefficient.
Agnostic case: roughly θ (d log²(1/ε) + d ν²/ε²), where ν is the error of the best hypothesis in the class.
Disagreement coefficient [Hanneke]
Let B(h*, r) = {h : P[h(x) ≠ h*(x)] ≤ r} be the ball of radius r around the target h*, and let DIS(B(h*, r)) be the region of the input space on which hypotheses in B(h*, r) disagree. The disagreement coefficient is
  θ = sup_{r > 0} P[DIS(B(h*, r))] / r.
[Figure: a nearby hypothesis h, the target h*, and the region DIS(B(h*, r))]
Example: thresholds on the line. DIS(B(h*, r)) is an interval of probability mass at most 2r around h*. Therefore θ = 2.
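A short worked version of this calculation, as a LaTeX sketch; it assumes the target is a threshold h* = h_{w*} and measures the distance between thresholds by the probability mass between them.

% For thresholds h_w(x) = 1(x >= w), the disagreement probability between two
% thresholds is the mass of the interval between them:
%   P[h_w(x) \ne h_{w^*}(x)] = P[x lies between w and w^*].
\begin{align*}
B(h^*, r) &= \{\, h_w : \Pr[\text{$x$ between $w$ and $w^*$}] \le r \,\},\\
\Pr[\mathrm{DIS}(B(h^*, r))] &\le 2r \quad\text{(mass at most $r$ on each side of $w^*$)},\\
\theta &= \sup_{r > 0} \frac{\Pr[\mathrm{DIS}(B(h^*, r))]}{r} \le 2.
\end{align*}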
Examples [H 07, F 09]:
  Thresholds on the line: θ = 2. Label complexity O(log 1/ε).
  Linear separators through the origin in R^d, uniform data distribution: θ ≈ √d.
  θ ≤ c(h*) d for a constant c(h*) depending on the target h*. Label complexity O(c(h*) d² log 1/ε).
1. The A2 algorithm.
2. Limitations
3. IWAL
[Figure: true error rate (0 to 0.5) of each threshold, as a function of the threshold/input feature (0 to 1)]
Sampling
[Figure: labeled samples drawn at random; empirical error rate of each threshold]
Bounding
[Figure: upper and lower confidence bounds on the error rate of each threshold]
Eliminating
[Figure: thresholds whose error lower bound exceeds another's upper bound are eliminated; the corresponding part of the threshold/input-feature axis is shaded out]
Chop away part of the hypothesis space, implying that you cease to care about part of the feature space. Recurse!
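The sketch below mimics the sample/bound/eliminate/recurse loop for thresholds on the line; the Hoeffding-style confidence radius, the batch size, and the number of rounds are illustrative choices rather than the exact A2 constants.

import math, random

def a2_thresholds(points, oracle, batch=200, delta=0.05, rounds=6):
    """Sample / bound / eliminate / recurse for thresholds on the line (in the spirit of A2).
    points: unlabeled pool (floats); oracle(x): possibly noisy label in {0, 1}."""
    candidates = sorted(set(points))          # candidate thresholds = the data points
    region = list(points)                     # part of the feature space we still care about
    queries = 0
    for _ in range(rounds):
        if not region or len(candidates) <= 1:
            break
        sample = [(x, oracle(x)) for x in random.sample(region, min(batch, len(region)))]
        queries += len(sample)
        rad = math.sqrt(math.log(2 / delta) / (2 * len(sample)))   # Hoeffding confidence radius
        # empirical error of each candidate threshold on the sampled region
        err = {w: sum((x >= w) != y for x, y in sample) / len(sample) for w in candidates}
        best_ub = min(err.values()) + rad
        candidates = [w for w in candidates if err[w] - rad <= best_ub]   # eliminate
        lo, hi = min(candidates), max(candidates)
        region = [x for x in region if lo <= x < hi]   # only the disagreement region still needs labels
    return candidates, queries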
[Figure: error rate vs. threshold under the input measure D_X(x), and the region that makes Done(C, D) hold]
1. The A2 algorithm
2. Limitations
3. IWAL
Lower bound: for hypothesis classes with VC dimension d, there are distributions on which any active learner needs Ω(d ν²/ε²) samples to find an h satisfying e(h, D) < ν + ε, where ν is the error of the best hypothesis in H.
Implication: for ε = √(d/T) as in supervised learning, all label complexity bounds must have a dependence on
  d ν²/ε² = d ν² / (d/T) = ν² T.
Proof sketch:
1. VC(H) = d means there are d inputs x where the prediction can go either way.
2. Make |P(y = 1|x) − P(y = 0|x)| = 2ε/(ν + 2ε) on these points.
3. Apply the coin-flipping lower bound for determining whether heads or tails is most common.
1. The A2 algorithm.
2. Limitations
3. IWAL
IWAL (subroutine Rejection-Threshold):
S = ∅
While (unlabeled examples remain):
  1. Receive unlabeled example x.
  2. Set p = Rejection-Threshold(x, S).
  3. If U(0, 1) ≤ p, get the label y and add (x, y, 1/p) to S.
  4. Let h = Learn(S).
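A minimal sketch of this loop; the rejection-threshold rule and the learner are stubbed out with placeholders (a constant query probability and a weighted-majority "classifier"), since the real loss-weighting rule described next needs the current hypothesis set.

import random

def iwal(stream, oracle, rejection_threshold, learn, p_min=0.1):
    """Importance Weighted Active Learning [Beygelzimer, Dasgupta, Langford 09], skeleton.
    rejection_threshold(x, S) returns a query probability p; learn(S) fits a model to
    the importance-weighted sample S = [(x, y, 1/p), ...]."""
    S, h = [], None
    for x in stream:
        p = max(p_min, rejection_threshold(x, S))   # minimum probability keeps weights bounded
        if random.random() <= p:                    # flip a coin with bias p
            y = oracle(x)                           # query the label
            S.append((x, y, 1.0 / p))               # store with importance weight 1/p
        h = learn(S)
    return h

# Illustrative stand-ins: query with a fixed probability, return the weighted majority label.
h = iwal(stream=[random.random() for _ in range(100)],
         oracle=lambda x: int(x >= 0.5),
         rejection_threshold=lambda x, S: 0.3,
         learn=lambda S: max((0, 1), key=lambda c: sum(w for _, y, w in S if y == c)) if S else None)

The weights 1/p make the importance-weighted empirical loss an unbiased estimate of the true loss, which is why an ordinary supervised learner can be reused on S.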
Loss-weighting safety
Slope asymmetry of the loss:
  sup_{z, z'} [ max_{y ∈ Y} |ℓ(z, y) − ℓ(z', y)| ] / [ min_{y ∈ Y} |ℓ(z, y) − ℓ(z', y)| ]
[Figure: logistic loss log(1 + exp(−x)) over margins from −3 to 3, with its maximum and minimum derivatives marked; loss values roughly 0 to 4]
Generalized disagreement coefficient (up to constants):
  θ = sup_{r > 0} (1/r) E_{x ∼ D} [ sup_{h ∈ B(h*, r)} max_{y} |ℓ(h(x), y) − ℓ(h*(x), y)| ]
If:
1. the loss is convex, and
2. the representation is linear,
then the computation is easy.(*)
(*) Caveat: you can't quite track H_S perfectly, so you must have a minimum sampling probability for safety.
[Table: experiments. Learners: logistic regression trained by online constrained optimization, and J48 decision trees with a batch bootstrap instantiation. Datasets: MNIST, Adult, Pima, Yeast, Spambase, Waveform, Letter. Fractions of labels queried: 65%, 60%, 55%, 35%, 32%, 25%, 19%, 16%.]
[Figure: number of points queried vs. number of points seen (0 to 2000), for supervised and active learning]
Bibliography
1. Yotam Abramson and Yoav Freund, Active Learning for Visual
Object Recognition, UCSD tech report.
2. Nina Balcan, Alina Beygelzimer, John Langford, Agnostic
Active Learning. ICML 2006
3. Alina Beygelzimer, Sanjoy Dasgupta, and John Langford,
Importance Weighted Active Learning, ICML 2009.
4. David Cohn, Les Atlas and Richard Ladner. Improving
generalization with active learning, Machine Learning
15(2):201-221, 1994.
5. Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for
active learning. ICML 2008.
6. Sanjoy Dasgupta, Coarse sample complexity bounds for active
learning. NIPS 2005.
Bibliography II
7. Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. NIPS 2007.
8. Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28, 133-168, 1997.
9. Steve Hanneke. A Bound on the Label Complexity of Agnostic Active Learning. ICML 2007.
10. Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, and Christian Lemmen. Active Learning in the Drug Discovery Process. NIPS 2001.
11. Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003 workshop.