Problem 1.9 Prove that the Normal distribution (1.3) has mean $\mu$ and variance $\sigma^2$. Hint: exploit the fact that $p$ is symmetric around $\mu$.

Problem 1.10 (Cauchy Distribution) Prove that for the density
\[
p(x) = \frac{1}{\pi(1 + x^2)} \tag{1.33}
\]
mean and variance are undefined. Hint: show that the integral diverges.

Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds the median. Hint: the mean depends on the value of the high-quantile terms, whereas the median does not.

Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory Naive Bayes the optimal decision is given by
\[
y^*(x) := \operatorname*{argmax}_{y} \; p(y) \prod_{i=1}^{n} p([x]_i \mid y) \tag{1.34}
\]
where $y \in \mathcal{Y}$ is the class label of the observation $x$.

Problem 1.13 (Bayes Optimal Decisions) Denote by $y^*(x) = \operatorname*{argmax}_y p(y|x)$ the label associated with the largest conditional class probability. Prove that for $y^*(x)$ the probability of choosing the wrong label $y$ is given by $l(x) := 1 - p(y^*(x)|x)$. Moreover, show that $y^*(x)$ is the label incurring the smallest misclassification error.

Problem 1.14 (Nearest Neighbor Loss) Show that the expected loss incurred by the nearest neighbor classifier does not exceed twice the loss of the Bayes optimal decision.

2 Density Estimation

2.1 Limit Theorems

Assume you are a gambler and go to a casino to play a game of dice. As it happens, it is your unlucky day and among the 100 times you toss the dice, you only see '6' eleven times. For a fair dice we know that each face should occur with equal probability $\frac{1}{6}$. Hence the expected value over 100 draws is $\frac{100}{6} \approx 17$, which is considerably more than the eleven times that we observed. Before crying foul you decide that some mathematical analysis is in order.

The probability of seeing a particular sequence of $m$ trials out of which $n$ are a '6' is given by $\left(\frac{1}{6}\right)^n \left(\frac{5}{6}\right)^{m-n}$. Moreover, there are
\[
\binom{m}{n} = \frac{m!}{n!(m-n)!}
\]
different sequences of '6' and 'not 6' with proportions $n$ and $m-n$ respectively.
Hence we may compute the probability of seeing a '6' eleven or fewer times via
\[
\Pr(X \leq 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left(\frac{1}{6}\right)^i \left(\frac{5}{6}\right)^{100-i} \approx 7.0\% \tag{2.1}
\]
After looking at this figure you decide that things are probably reasonable. And, in fact, they are consistent with the convergence behavior of a simulated dice in Figure 2.1. In computing (2.1) we have learned something useful: the expansion is a special case of a binomial series. The first term $\binom{100}{i}$ counts the number of configurations in which we could observe $i$ times '6' in a sequence of 100 dice throws. The second and third term are the probabilities of seeing one particular instance of such a sequence.

Fig. 2.1. Convergence of empirical means to expectations. From left to right: empirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100, 200, and 500 times respectively. Note that after 20 throws we still have not observed a single '6', an event which occurs with only $\left(\frac{5}{6}\right)^{20} \approx 2.6\%$ probability.

Note that in general we may not be as lucky, since we may have considerably less information about the setting we are studying. For instance, we might not know the actual probabilities for each face of the dice, which would be a likely assumption when gambling at a casino of questionable reputation. Often the outcomes of the system we are dealing with may be continuous valued random variables rather than binary ones, possibly even with unknown range. For instance, when trying to determine the average wage through a questionnaire we need to determine how many people we need to ask in order to obtain a certain level of confidence. To answer such questions we need to discuss limit theorems.
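The tail probability in (2.1) is easy to verify numerically. The following sketch (variable names are ours, not from the text) evaluates the binomial sum directly:

```python
from math import comb

# Probability of at most 11 sixes in 100 casts of a fair dice:
# the binomial tail sum of Eq. (2.1) with success probability 1/6.
p_tail = sum(comb(100, i) * (1 / 6) ** i * (5 / 6) ** (100 - i)
             for i in range(12))
print(f"Pr(X <= 11) = {p_tail:.3f}")  # a small tail probability, consistent with (2.1)
```

Each summand multiplies the count of configurations $\binom{100}{i}$ by the probability of any one such sequence, exactly as discussed above.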
They tell us by how much averages over a set of observations may deviate from the corresponding expectations and how many observations we need to draw to estimate a number of probabilities reliably. For completeness we will present proofs for some of the more fundamental theorems in Section 2.1.2. They are useful albeit non-essential for the understanding of the remainder of the book and may be omitted.

2.1.1 Fundamental Laws

The Law of Large Numbers developed by Bernoulli in 1713 is one of the fundamental building blocks of statistical analysis. It states that averages over a number of observations converge to their expectations given a sufficiently large number of observations and given certain assumptions on the independence of these observations. It comes in two flavors: the weak and the strong law.

Theorem 2.1 (Weak Law of Large Numbers) Denote by $X_1, \ldots, X_m$ random variables drawn from $p(x)$ with mean $\mu = \mathbf{E}[X_i]$ for all $i$. Moreover let
\[
\bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i \tag{2.2}
\]
be the empirical average over the random variables $X_i$. Then for any $\epsilon > 0$ the following holds:
\[
\lim_{m \to \infty} \Pr\left(\left|\bar{X}_m - \mu\right| \leq \epsilon\right) = 1. \tag{2.3}
\]

Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line denotes the mean 3.5. The uneven solid line denotes the actual mean $\bar{X}_n$ as a function of the number of draws, given as a semilogarithmic plot. The crosses denote the outcomes of the dice. Note how $\bar{X}_n$ ever more closely approaches the mean 3.5 as we obtain an increasing number of observations.

This establishes that, indeed, for large enough sample sizes, the average will converge to the expectation. The strong law strengthens this as follows:

Theorem 2.2 (Strong Law of Large Numbers) Under the conditions of Theorem 2.1 we have
\[
\Pr\left(\lim_{m \to \infty} \bar{X}_m = \mu\right) = 1.
\]
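The probability in (2.3) can itself be estimated by simulation: repeatedly draw $m$ dice casts and record how often the empirical average lands within $\epsilon$ of $\mu = 3.5$. A minimal sketch (the function `prob_within` and its defaults are our own choices):

```python
import random

def prob_within(m, eps=0.25, trials=2000, mu=3.5):
    """Monte Carlo estimate of Pr(|X_bar_m - mu| <= eps) for m dice casts."""
    hits = 0
    for _ in range(trials):
        avg = sum(random.randint(1, 6) for _ in range(m)) / m
        hits += abs(avg - mu) <= eps
    return hits / trials

random.seed(0)
for m in (10, 100, 1000):
    print(m, prob_within(m))
```

As the weak law predicts, the estimated probability increases towards 1 as $m$ grows.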
The strong law implies that almost surely (in a measure theoretic sense) $\bar{X}_m$ converges to $\mu$, whereas the weak law only states that for every $\epsilon > 0$ the random variable $\bar{X}_m$ will eventually lie within the interval $[\mu - \epsilon, \mu + \epsilon]$ with high probability. Clearly the strong law implies the weak law, since the event $\lim_{m \to \infty} \bar{X}_m = \mu$ has probability 1, hence $\bar{X}_m$ is eventually captured by any $\epsilon$-ball around $\mu$. Both laws justify that we may take sample averages, e.g. over a number of events such as the outcomes of a dice, and use the latter to estimate their means, their probabilities (here we treat the indicator variable of the event as a $\{0, 1\}$-valued random variable), their variances, or related quantities. We postpone a proof until Section 2.1.2, since an effective way of proving Theorem 2.1 relies on the theory of characteristic functions which we will discuss in the next section. For the moment, we only give a pictorial illustration in Figure 2.2.

Once we have established that the random variable $\bar{X}_m = m^{-1} \sum_{i=1}^{m} X_i$ converges to its mean $\mu$, a natural second question is to establish how quickly it converges and what the properties of the limiting distribution of $\bar{X}_m - \mu$ are. Note in Figure 2.2 that the initial deviation from the mean is large, whereas as we observe more data the empirical mean approaches the true one.
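The behavior shown in Figure 2.2 is easy to reproduce: track the running average of simulated dice casts and watch it settle near 3.5. A minimal sketch (seed and cast count are arbitrary choices of ours):

```python
import random

random.seed(42)
total = 0.0
running = []  # running[n-1] holds the average of the first n casts
for n in range(1, 1001):
    total += random.randint(1, 6)
    running.append(total / n)

# Early averages fluctuate wildly; later ones hug the true mean 3.5.
for n in (10, 100, 1000):
    print(n, round(running[n - 1], 3))
```

Printing the running average on a logarithmic grid of $n$ mirrors the semilogarithmic plot in Figure 2.2: the early values jump around, while the later values barely move.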