
Problem 1.9 Prove that the Normal distribution (1.3) has mean µ and variance σ^2. Hint: exploit the fact that p is symmetric around µ.
Problem 1.10 (Cauchy Distribution) Prove that for the density

p(x) = \frac{1}{\pi(1 + x^2)}   (1.33)

mean and variance are undefined. Hint: show that the integral diverges.
Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds the median.
Hint: the mean depends on the value of the high-quantile
terms, whereas the median does not.
Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory Naive Bayes the optimal decision is given by

y^*(x) := argmax_y p(y) \prod_{i=1}^{n} p([x]_i | y)   (1.34)

where y ∈ Y is the class label of the observation x.
Problem 1.13 (Bayes Optimal Decisions) Denote by y^*(x) = argmax_y p(y|x) the label associated with the largest conditional class probability. Prove that for y^*(x) the probability of choosing the wrong label y is given by

l(x) := 1 − p(y^*(x)|x).

Moreover, show that y^*(x) is the label incurring the smallest misclassification error.
Problem 1.14 (Nearest Neighbor Loss) Show that the expected loss incurred by the
nearest neighbor classifier does not exceed twice the loss of the
Bayes optimal decision.
2 Density Estimation
2.1 Limit Theorems
Assume you are a gambler and go to a casino to play a game of dice. As it happens, it is your unlucky day and among the 100 times you toss the dice, you only see '6' eleven times. For a fair dice we know that each face should occur with equal probability 1/6. Hence the expected value over 100 draws is 100/6 ≈ 17, which is considerably more than the eleven times that we observed. Before crying foul you decide that some mathematical analysis is in order.
The probability of seeing a particular sequence of m trials out of which n are a '6' is given by \left(\frac{1}{6}\right)^n \left(\frac{5}{6}\right)^{m−n}. Moreover, there are \binom{m}{n} = \frac{m!}{n!(m−n)!} different sequences of '6' and 'not 6' with proportions n and m−n respectively. Hence we may compute the probability of seeing a '6' only 11 times or less via

Pr(X ≤ 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left(\frac{1}{6}\right)^i \left(\frac{5}{6}\right)^{100−i} ≈ 7.0%   (2.1)
After looking at this figure you decide that things are probably reasonable.
And, in fact, they are consistent with the convergence behavior of a simulated dice
in Figure 2.1. In computing (2.1) we have learned something
useful: the expansion is a special case of a binomial series.

Fig. 2.1. Convergence of empirical means to expectations. From left to right: empirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100, 200, and 500 times respectively. Note that after 20 throws we still have not observed a single '6', an event which occurs with only (5/6)^{20} ≈ 2.6% probability.

The first term counts the number of configurations in which we could observe i times '6' in a sequence of 100 dice throws. The second and third terms are the probabilities of seeing one particular instance of such a sequence.
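As a quick numerical check of (2.1), the binomial tail can be evaluated directly. The following is a minimal sketch in Python using only the standard library; the helper name binomial_tail is ours and not anything defined in the text.

    from math import comb  # requires Python 3.8+ for math.comb

    def binomial_tail(m, k, p):
        """Pr(X <= k) for X ~ Binomial(m, p), i.e. the probability of
        at most k successes in m independent trials."""
        return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

    # Probability of seeing a '6' at most 11 times in 100 throws of a fair dice
    print(binomial_tail(100, 11, 1 / 6))  # about 0.07, matching (2.1)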
Note that in general we may not be as lucky, since we may have considerably less
information about the setting we are studying. For instance,
we might not know the actual probabilities for each face of the dice, which
would be a likely assumption when gambling at a casino of questionable
reputation. Often the outcomes of the system we are dealing with may be
continuous valued random variables rather than binary ones, possibly even
with unknown range. For instance, when trying to determine the average
wage through a questionnaire we need to determine how many people we
need to ask in order to obtain a certain level of confidence.
To answer such questions we need to discuss limit theorems. They tell
us by how much averages over a set of observations may deviate from the
corresponding expectations and how many observations we need to draw to
estimate a number of probabilities reliably. For completeness we will present
proofs for some of the more fundamental theorems in Section 2.1.2. They
are useful albeit non-essential for the understanding of the remainder of the
book and may be omitted.
2.1.1 Fundamental Laws
The Law of Large Numbers developed by Bernoulli in 1713 is one of the
fundamental building blocks of statistical analysis. It states that averages
over a number of observations converge to their expectations given a sufficiently
large number of observations and given certain assumptions on the
independence of these observations. It comes in two flavors: the weak and
the strong law.
Theorem 2.1 (Weak Law of Large Numbers) Denote by X_1, . . . , X_m random variables drawn from p(x) with mean µ = E[X_i] for all i. Moreover let

\bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i   (2.2)

be the empirical average over the random variables X_i. Then for any ε > 0 the following holds:

\lim_{m→∞} \Pr\left( \left| \bar{X}_m − µ \right| ≤ ε \right) = 1.   (2.3)
Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line denotes the mean 3.5. The uneven solid line denotes the actual mean \bar{X}_n as a function of the number of draws, given as a semilogarithmic plot. The crosses denote the outcomes of the dice. Note how \bar{X}_n ever more closely approaches the mean 3.5 as we obtain an increasing number of observations.
This establishes that, indeed, for large enough sample sizes, the average will
converge to the expectation. The strong law strengthens this as follows:
Theorem 2.2 (Strong Law of Large Numbers) Under the conditions of Theorem 2.1 we have

\Pr\left( \lim_{m→∞} \bar{X}_m = µ \right) = 1.
The strong law implies that almost surely (in a measure theoretic sense) \bar{X}_m converges to µ, whereas the weak law only states that for every ε > 0 the random variable \bar{X}_m will lie within the interval [µ−ε, µ+ε] with probability converging to 1. Clearly the strong law implies the weak law, since the event \lim_{m→∞} \bar{X}_m = µ has measure 1, hence \bar{X}_m is eventually captured by any ε-ball around µ.
Both laws justify that we may take sample averages, e.g. over a number
of events such as the outcomes of a dice and use the latter to estimate their
means, their probabilities (here we treat the indicator variable of the event
as a {0; 1}-valued random variable), their variances or related quantities. We
postpone a proof until Section 2.1.2, since an effective way of proving Theorem 2.1
relies on the theory of characteristic functions which we will discuss
in the next section. For the moment, we only give a pictorial illustration in
Figure 2.2.
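For readers who want to reproduce an experiment in the spirit of Figure 2.2, a small simulation suffices. The sketch below is our own illustration (it assumes numpy is available, which the text does not require) rather than the code used to produce the figure.

    import numpy as np

    rng = np.random.default_rng(seed=0)    # fixed seed for reproducibility
    m = 1000
    throws = rng.integers(1, 7, size=m)    # uniform draws from {1, ..., 6}

    # running empirical mean \bar{X}_n after n = 1, ..., m throws
    running_mean = np.cumsum(throws) / np.arange(1, m + 1)

    for n in (10, 100, 1000):
        print(n, running_mean[n - 1])      # approaches the expectation 3.5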
Once we have established that the random variable \bar{X}_m = m^{-1} \sum_{i=1}^{m} X_i converges to its mean µ, a natural second question is to establish how quickly it converges and what the properties of the limiting distribution of \bar{X}_m − µ are. Note in Figure 2.2 that the initial deviation from the mean is large, whereas as we observe more data the empirical mean approaches the true one.
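One crude way to probe this question before developing any theory is to repeat the averaging experiment many times for several sample sizes and inspect the spread of \bar{X}_m − µ. The sketch below is our own illustration (the choice of 10000 repetitions is arbitrary); it shows the typical deviation shrinking roughly like m^{−1/2}, a first hint at the behaviour of the limiting distribution asked about above.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    mu = 3.5                                # expectation of a fair dice
    repeats = 10000                         # number of independent experiments

    for m in (10, 100, 1000):
        # 'repeats' empirical means, each computed over m dice throws
        means = rng.integers(1, 7, size=(repeats, m)).mean(axis=1)
        print(m, np.std(means - mu))        # shrinks roughly by sqrt(10) per row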
