Lecture 4: Feasibility of Learning
Roadmap
When Can Machines Learn?
Lecture 3: Types of Learning
focus: binary classification or regression from a batch of supervised data with concrete features
Lecture 4: Feasibility of Learning
• Learning is Impossible?
• Probability to the Rescue
• Connection to Learning
• Connection to Real Learning
Why Can Machines Learn?
How Can Machines Learn?
How Can Machines Learn Better?
Learning is Impossible?
A Learning Puzzle
[figure: six 3×3 patterns, three labeled y_n = −1 and three labeled y_n = +1, plus a new pattern asking g(x) = ?]
let’s test your ‘human learning’ with 6 examples :)
Two Controversial Answers
truth f(x) = +1 because
• symmetry ⇔ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1
truth f(x) = −1 because
• left-top black ⇔ −1
• (middle column contains at most 1 black and right-top white) ⇔ −1
all valid reasons; your adversarial teacher can always claim that you 'didn't learn'. :(
pick g ∈ H with all g(x _{n} ) = y _{n} (like PLA), does g ≈ f ?
No Free Lunch
x       y   g   f1 f2 f3 f4 f5 f6 f7 f8
0 0 0   ◦   ◦   ◦  ◦  ◦  ◦  ◦  ◦  ◦  ◦
0 0 1   ×   ×   ×  ×  ×  ×  ×  ×  ×  ×
0 1 0   ×   ×   ×  ×  ×  ×  ×  ×  ×  ×
0 1 1   ◦   ◦   ◦  ◦  ◦  ◦  ◦  ◦  ◦  ◦
1 0 0   ×   ×   ×  ×  ×  ×  ×  ×  ×  ×
1 0 1       ?   ◦  ◦  ◦  ◦  ×  ×  ×  ×
1 1 0       ?   ◦  ◦  ×  ×  ◦  ◦  ×  ×
1 1 1       ?   ◦  ×  ◦  ×  ◦  ×  ◦  ×
(top five rows are the training examples in D; f_1, ..., f_8 enumerate all targets consistent with D)
• g ≈ f inside D: sure!
• g ≈ f outside D: No! (but that’s really what we want!)
learning from D (to infer something outside D) is doomed if any 'unknown' f can happen. :(
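The no-free-lunch count in the table can be verified mechanically; a minimal Python sketch, assuming the labels of D shown above (◦ = +1, × = −1):

```python
from itertools import product

# the 5 labeled examples from the slide: x in {0,1}^3, label +1 (circle) or -1 (cross)
D = {(0, 0, 0): +1, (0, 0, 1): -1, (0, 1, 0): -1, (0, 1, 1): +1, (1, 0, 0): -1}
unseen = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]

# every f consistent with D is pinned down by its labels on the 3 unseen points,
# so there are exactly 2^3 = 8 candidate targets f_1, ..., f_8
candidates = list(product([+1, -1], repeat=len(unseen)))

# whatever g predicts on an unseen point, exactly half of the candidates agree:
counts = {x: sum(f[i] == +1 for f in candidates) for i, x in enumerate(unseen)}
print(counts)  # every unseen point: 4 of the 8 candidates say +1
```

So no choice of g can beat any other on the unseen points, which is exactly the table's argument.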
Probability to the Rescue
Inferring Something Unknown
difficult to infer unknown target f outside D in learning; can we infer something unknown in other scenarios?
• consider a bin of many many orange and green marbles
• do we know the orange portion (probability)? No!
can you infer the orange probability?
Statistics 101: Inferring Orange Probability
bin
assume
orange probability = µ, green probability = 1 − µ,
with µ unknown
sample
N marbles sampled independently, with
orange fraction = ν, green fraction = 1 − ν,
now ν known
does in-sample ν say anything about out-of-sample µ?
Possible versus Probable
Hoeffding’s Inequality (1/2)
µ = orange probability in bin
ν = orange fraction in sample
• in a big sample (N large), ν is probably close to µ (within ε)

P[|ν − µ| > ε] ≤ 2 exp(−2ε²N)

• called Hoeffding's Inequality, for marbles, coins, polling, ...
the statement 'ν = µ' is probably approximately correct (PAC)
Hoeffding’s Inequality (2/2)
P[|ν − µ| > ε] ≤ 2 exp(−2ε²N)

• valid for all N and ε
• does not depend on µ, no need to 'know' µ
• larger sample size N or looser gap ε =⇒ higher probability for 'ν ≈ µ'

if large N, can probably infer unknown µ by known ν
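The 'large N ⇒ ν ≈ µ' claim is easy to simulate; a sketch with an assumed bin of µ = 0.35 (the specific numbers are illustrative):

```python
import math
import random

random.seed(0)
mu = 0.35      # true orange probability (unknown in practice; fixed here for the demo)
N = 1000       # marbles per sample
eps = 0.05
trials = 2000

violations = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(N)) / N  # in-sample orange fraction
    violations += abs(nu - mu) > eps

bound = 2 * math.exp(-2 * eps ** 2 * N)  # Hoeffding: P[|nu - mu| > eps] <= 2 exp(-2 eps^2 N)
print(violations / trials, "<=", bound)
```

The empirical violation rate stays below the Hoeffding bound, as the inequality guarantees for any µ.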
Fun Time
Let µ = 0.4. Use Hoeffding's Inequality
P[|ν − µ| > ε] ≤ 2 exp(−2ε²N)
to bound the probability that a sample of 10 marbles will have ν ≤ 0.1. What bound do you get?
1. 0.67
2. 0.40
3. 0.33
4. 0.05
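One way to check the options: plug ε = 0.3 into the bound, since ν ≤ 0.1 with µ = 0.4 forces |ν − µ| ≥ 0.3:

```python
import math

mu, N = 0.4, 10
eps = 0.3  # nu <= 0.1 forces |nu - mu| >= 0.3
bound = 2 * math.exp(-2 * eps ** 2 * N)
print(round(bound, 2))  # 0.33, i.e. choice 3
```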
Connection to Learning
bin:
• unknown orange prob. µ
• marble • ∈ bin
• orange • / green •
• size-N sample of i.i.d. marbles from bin

learning:
• fixed hypothesis h(x) vs. target f(x)
• x ∈ X
• h is wrong ⇔ h(x) ≠ f(x) (orange •)
• h is right ⇔ h(x) = f(x) (green •)
• check h on D = {(x_n, y_n = f(x_n))} with i.i.d. x_n

if large N & i.i.d. x_n, can probably infer
unknown [h(x) ≠ f(x)] probability
by known [h(x_n) ≠ y_n] fraction
Added Components
unknown target function f: X → Y
(ideal credit approval formula)
↓ via unknown P on X: x_1, x_2, ..., x_N
training examples D: (x_1, y_1), ..., (x_N, y_N)
(historical records in bank)
↓
learning algorithm A =⇒ final hypothesis g ≈ f
('learned' formula to be used)
hypothesis set H
(set of candidate formulas)

for any fixed h, can probably infer
unknown E_out(h) = E_{x∼P}[h(x) ≠ f(x)]
by known E_in(h) = (1/N) Σ_{n=1}^{N} [h(x_n) ≠ y_n]
The Formal Guarantee
same as the 'bin' analogy:
P[|E_in(h) − E_out(h)| > ε] ≤ 2 exp(−2ε²N)
• valid for all N and ε
• does not depend on E_out(h), no need to 'know' E_out(h) (f and P can stay unknown)
• 'E_in(h) = E_out(h)' is probably approximately correct (PAC)
if 'E_in(h) ≈ E_out(h)' and 'E_in(h) small' =⇒ E_out(h) small =⇒ h ≈ f with respect to P
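Rearranging the guarantee also answers 'how many examples suffice?' for a fixed h; a small helper sketch (the function name is illustrative, not from the lecture):

```python
import math

def sample_size(eps, delta):
    # smallest N with 2 * exp(-2 * eps^2 * N) <= delta,
    # obtained by solving the slide's Hoeffding bound for N
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# e.g. to get E_in within eps = 0.05 of E_out with probability at least 95%:
print(sample_size(0.05, 0.05))  # 738
```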
Verification of One h
for any fixed h, when data is large enough,
E_in(h) ≈ E_out(h)
Can we claim 'good learning' (g ≈ f)?
Yes!
if E_in(h) small for the fixed h and A picks that h as g =⇒ 'g = f' PAC
No!
if A forced to pick THE h as g =⇒ E_in(h) almost always not small =⇒ 'g ≠ f' PAC!
real learning:
A shall make choices within H (like PLA) rather than being forced to pick one h. :(
The 'Verification' Flow
verifying examples D: (x_1, y_1), ..., (x_N, y_N)
(historical records in bank)
final hypothesis g ≈ f
(given formula to be verified, one candidate formula)
can now use 'historical records' (data) to verify 'one candidate formula' h
Fun Time
Your friend tells you her secret rule in investing in a particular stock:
‘Whenever the stock goes down in the morning, it will go up in the afternoon; vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that you can get from the veriﬁcation?
1. You'll definitely be rich by exploiting the rule in the next 100 days.
2. You'll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.
3. You'll likely be rich by exploiting the 'best rule' from 20 more friends in the next 100 days.
4. You'd definitely have been rich if you had exploited the rule in the past 10 years.
real learning (say like PLA):
choosing among many hypotheses h_1, h_2, ..., h_M
BINGO when some h_m gets an all-green sample ••••••••••?
Connection to Real Learning
Coin Game
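The slide's figure did not survive extraction; the coin game in this part of the lecture is typically a setup like the following, showing why an all-green sample among many hypotheses proves little (the specific numbers, 150 players and 5 flips each, are assumptions):

```python
# assumed setup: 150 players each flip a fair coin 5 times;
# how likely is it that at least one player gets 5 heads?
p_five_heads = (1 / 2) ** 5                  # 1/32 for a single fair coin
p_someone = 1 - (1 - p_five_heads) ** 150    # complement of "nobody does"
print(round(p_someone, 4))                   # > 0.99
```

So with many 'hypotheses' in play, a perfect-looking record is almost guaranteed to appear by chance alone.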
BAD Sample and BAD Data
BAD Data for One h
BAD Data for Many h
Bound of BAD Data
P_D[BAD D]
= P_D[BAD D for h_1 or BAD D for h_2 or ... or BAD D for h_M]
≤ P_D[BAD D for h_1] + P_D[BAD D for h_2] + ... + P_D[BAD D for h_M]   (union bound)
≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + ... + 2 exp(−2ε²N)
= 2M exp(−2ε²N)

• finite-bin version of Hoeffding, valid for all M, N and ε
• does not depend on any E_out(h_m), no need to 'know' E_out(h_m) (f and P can stay unknown)
• 'E_in(g) = E_out(g)' is PAC, regardless of A
'most reasonable' A (like PLA/pocket):
pick the h_m with lowest E_in(h_m) as g
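The linear growth of the bound in M is worth seeing numerically; a sketch (the parameter values are arbitrary):

```python
import math

def bad_bound(M, N, eps):
    # union bound: P[BAD D for some of M hypotheses] <= 2 M exp(-2 eps^2 N)
    return 2 * M * math.exp(-2 * eps ** 2 * N)

N, eps = 1000, 0.05
for M in (1, 10, 100, 1000):
    print(M, bad_bound(M, N, eps))  # grows linearly in M; vacuous once it exceeds 1
```

For fixed N and ε the guarantee weakens as H grows, which is why M = ∞ needs the machinery of the next lectures.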
The ‘Statistical’ Learning Flow
if |H| = M finite and N large enough: for whatever g picked by A, E_out(g) ≈ E_in(g)
if A finds one g with E_in(g) ≈ 0: PAC guarantee for E_out(g) ≈ 0 =⇒ learning possible :)

training examples D: (x_1, y_1), ..., (x_N, y_N)
(historical records in bank)
hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f
('learned' formula to be used)

M = ∞ (like perceptrons)? see you in the next lectures
Fun Time
Consider 4 hypotheses.
h_1(x) = sign(x_1), h_2(x) = sign(x_2), h_3(x) = sign(−x_1), h_4(x) = sign(−x_2).
For any N and ε, which of the following statements is not true?
1. the BAD data of h_1 and the BAD data of h_2 are exactly the same
2. the BAD data of h_1 and the BAD data of h_3 are exactly the same
3. P_D[BAD for some h_k] ≤ 8 exp(−2ε²N)
4. P_D[BAD for some h_k] ≤ 4 exp(−2ε²N)
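Choice 2 can be checked by simulation: h_3 = −h_1 wherever x_1 ≠ 0, so on any dataset E_in(h_3) = 1 − E_in(h_1) (and likewise out of sample), making their deviations |E_in − E_out| identical and their BAD data the same. A small sketch (the target f here is a hypothetical stand-in):

```python
import random

random.seed(1)
sign = lambda v: 1 if v > 0 else -1
h1 = lambda x: sign(x[0])
h3 = lambda x: sign(-x[0])          # h3 = -h1 on every x with x_1 != 0
f = lambda x: sign(x[0] + x[1])     # some fixed target, only for illustration

X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
e1 = sum(h1(x) != f(x) for x in X) / len(X)   # E_in(h1)
e3 = sum(h3(x) != f(x) for x in X) / len(X)   # E_in(h3)
print(e1, e3)  # mathematically, e1 + e3 = 1 on any dataset
```

This overlap of BAD data is also why the tighter bound in choice 4 still holds for these four hypotheses.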
Summary
When Can Machines Learn?
Lecture 3: Types of Learning
Lecture 4: Feasibility of Learning
• Learning is Impossible? absolutely no free lunch outside D
• Probability to the Rescue: probably approximately correct outside D
• Connection to Learning: verification possible if E_in(h) small for fixed h
• Connection to Real Learning: learning possible if H finite and E_in(g) small
Why Can Machines Learn?
• next: what if H = ∞?
How Can Machines Learn?
How Can Machines Learn Better?