Based on
Foundations of Statistical NLP
C. Manning & H. Schütze, ch. 2, MIT Press, 2002
PLAN
1. Elementary Probability Notions:
Sample Space, Event Space, and Probability Function
Conditional Probability
Bayes Theorem
Independence of Probabilistic Events
2. Random Variables:
Discrete Variables and Continuous Variables
Mean, Variance and Standard Deviation
Standard Distributions
Joint, Marginal and Conditional Distributions
Independence of Random Variables
PLAN (cont'd)
3. Limit Theorems
Laws of Large Numbers
Central Limit Theorems
4. Estimating the parameters of probabilistic models from
data
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori (MAP) Estimation
5. Elementary Information Theory
Entropy; Conditional Entropy; Joint Entropy
Mutual Information / Information Gain
Cross-Entropy
Relative Entropy / Kullback-Leibler (KL) Divergence
Properties: bounds, chain rules, (non-)symmetries,
properties pertaining to independence
Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Note: $P(A \mid B)$ is called the a posteriori probability of $A$, given $B$.

Bayes' Theorem:

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$

or

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A \mid B)\, P(B) + P(A \mid \bar{B})\, P(\bar{B})}$$

or ...
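A minimal numerical sketch of the second form of Bayes' theorem. All quantities below are made-up illustration values, not from the source:

```python
# Minimal sketch of Bayes' theorem with the law of total probability.
# All numbers are hypothetical, chosen only for illustration.

p_B = 0.01               # prior P(B), e.g. probability of a rare class
p_A_given_B = 0.95       # P(A | B): "detector fires" given the class
p_A_given_notB = 0.05    # P(A | not B): false-positive rate

# Denominator: P(A) = P(A|B) P(B) + P(A|~B) P(~B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(f"P(B | A) = {p_B_given_A:.4f}")   # approx 0.1610
```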
2. Random Variables

2.1 Basic Definitions

Let $\Omega$ be a sample space, and
$P : 2^{\Omega} \rightarrow [0, 1]$ a probability function.
A random variable of distribution $P$ is a function
$X : \Omega \rightarrow \mathbb{R}^n$.
For now, let us consider $n = 1$.
[Figure: Binomial pmf $b(r; n, p)$ and cdf $F(r)$ for $(p, n) = (0.5, 20), (0.7, 20), (0.5, 40)$.]
2.3 Continuous Random Variables

Definitions:

Let $P : 2^{\Omega} \rightarrow [0, 1]$ be a probability function, and
$X : \Omega \rightarrow \mathbb{R}$ be a random variable of distribution $P$.

If Image($X$) is an infinite, non-countable set, and
$F$, the cumulative distribution function of $X$, is continuous, then
$X$ is called a continuous random variable.
(It follows, naturally, that $P(X = x) = 0$ for all $x \in \mathbb{R}$.)

If there exists $p : \mathbb{R} \rightarrow [0, \infty)$ such that $F(x) = \int_{-\infty}^{x} p(t)\,dt$,
then $X$ is called absolutely continuous.
In such a case, $p$ is called the probability density function (pdf) of $X$.

For $B \subseteq \mathbb{R}$ for which $\int_B p(x)\,dx$ exists, $P(X^{-1}(B)) = \int_B p(x)\,dx$,
where $X^{-1}(B) \stackrel{\text{not.}}{=} \{\omega \in \Omega \mid X(\omega) \in B\}$.
In particular, $\int_{-\infty}^{+\infty} p(x)\,dx = 1$.

Expectation / mean of $X$: $E(X) \stackrel{\text{not.}}{=} E[X] = \int_{-\infty}^{+\infty} x\, p(x)\,dx$.
Exemplification:

Normal (Gaussian) distribution: $N(x; \mu, \sigma) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

mean: $\mu$, variance: $\sigma^2$

Standard Normal distribution: $N(x; 0, 1)$

Remark:
For $n, p$ such that $np(1-p) > 5$, the Binomial distributions can be
approximated by Normal distributions.
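A minimal numerical sketch of the remark above, assuming the usual approximation $B(n, p) \approx N(np, \sqrt{np(1-p)})$; the values $n = 40$, $p = 0.5$ and the points $r$ are illustration choices:

```python
import math

def binom_pmf(r, n, p):
    # Exact binomial probability b(r; n, p)
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def normal_pdf(x, mu, sigma):
    # Normal density N(x; mu, sigma)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 40, 0.5                         # np(1-p) = 10 > 5, so the approximation applies
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

for r in (15, 20, 25):
    print(r, round(binom_pmf(r, n, p), 4), round(normal_pdf(r, mu, sigma), 4))
# The two columns should be close, e.g. at r = 20: 0.1254 vs 0.1262.
```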
[Figure: Normal pdf $N_{\mu,\sigma^2}(x)$ and cdf for $(\mu, \sigma^2) = (0, 0.2), (0, 1.0), (0, 5.0), (-2, 0.5)$, plotted for $x \in [-4, 4]$.]
2.4 Basic Properties of Random Variables

Let $P : 2^{\Omega} \rightarrow [0, 1]$ be a probability function,
$X : \Omega \rightarrow \mathbb{R}^n$ be a discrete/continuous random variable of distribution $P$.

If $g : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is a function, then $g(X)$ is a random variable.

If $g(X)$ is discrete, then $E(g(X)) = \sum_x g(x)\,p(x)$.
If $g(X)$ is continuous, then $E(g(X)) = \int g(x)\,p(x)\,dx$.

$E(aX + b) = aE(X) + b$.
If $g$ is non-linear, in general $E(g(X)) \neq g(E(X))$.
$E(X + Y) = E(X) + E(Y)$.

$\mathrm{Var}(X) = E(X^2) - E^2(X)$.
$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$.
$\mathrm{Var}(X + a) = \mathrm{Var}(X)$.
$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$.
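A minimal Monte Carlo sketch that illustrates a few of these properties empirically; the sample distribution and the constants $a, b$ are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)   # samples of X
a, b = 2.5, -1.0

# E(aX + b) = a E(X) + b
print(np.mean(a * x + b), a * np.mean(x) + b)

# Var(aX) = a^2 Var(X), and Var(X + a) = Var(X)
print(np.var(a * x), a**2 * np.var(x))
print(np.var(x + a), np.var(x))

# For non-linear g, in general E(g(X)) != g(E(X)), e.g. g(x) = x^2
print(np.mean(x**2), np.mean(x)**2)    # the two differ by Var(X)
```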
2.5 Joint, Marginal and Conditional Distributions

Exemplification for the bi-variate case:

Let $\Omega$ be a sample space, $P : 2^{\Omega} \rightarrow [0, 1]$ a probability function, and
$V : \Omega \rightarrow \mathbb{R}^2$ be a random variable of distribution $P$.
One could naturally see $V$ as a pair of two random variables $X : \Omega \rightarrow \mathbb{R}$ and
$Y : \Omega \rightarrow \mathbb{R}$. (More precisely, $V(\omega) = (x, y) = (X(\omega), Y(\omega))$.)

The joint pmf/pdf of $X$ and $Y$ is defined by
$p(x, y) \stackrel{\text{not.}}{=} p_{X,Y}(x, y) = P(X = x, Y = y) = P(\{\omega \mid X(\omega) = x, Y(\omega) = y\})$.

The marginal pmf/pdf functions of $X$ and $Y$ are:
for the discrete case:
$p_X(x) = \sum_y p(x, y)$, $\quad p_Y(y) = \sum_x p(x, y)$
for the continuous case:
$p_X(x) = \int_y p(x, y)\,dy$, $\quad p_Y(y) = \int_x p(x, y)\,dx$

The conditional pmf/pdf of $X$ given $Y$ is:
$p_{X|Y}(x \mid y) = \dfrac{p_{X,Y}(x, y)}{p_Y(y)}$
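A minimal discrete sketch of these definitions, using a small made-up joint pmf table:

```python
import numpy as np

# Hypothetical joint pmf p(x, y): rows index values of X, columns values of Y.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
assert np.isclose(p_xy.sum(), 1.0)

# Marginals: p_X(x) = sum_y p(x, y), p_Y(y) = sum_x p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Conditional: p_{X|Y}(x | y) = p(x, y) / p_Y(y)  (one column per value of y)
p_x_given_y = p_xy / p_y

print(p_x, p_y)
print(p_x_given_y)   # each column sums to 1
```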
3. Limit Theorems

[ Sheldon Ross, A First Course in Probability, 5th ed., 1998 ]

Chernoff bounds:

Let $M(t) \stackrel{\text{not.}}{=} E[e^{tX}]$ (the moment generating function of $X$). Then

$P(X \geq a) \leq e^{-ta}\, M(t)$ for all $t > 0$
$P(X \leq a) \leq e^{-ta}\, M(t)$ for all $t < 0$
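A minimal sketch of how the first bound can be used in practice. It assumes a Binomial($n, p$) variable, whose mgf is $M(t) = (1 - p + p\,e^t)^n$; the values of $n$, $p$ and $a$ are illustration choices, and the bound is minimised over a simple grid of $t > 0$:

```python
import math

n, p, a = 100, 0.5, 65          # ask for P(X >= 65) with X ~ Binomial(100, 0.5)

def mgf(t):
    # Moment generating function of a Binomial(n, p) variable
    return (1 - p + p * math.exp(t)) ** n

# Chernoff bound: min over t > 0 of exp(-t a) * M(t), here via a coarse grid
bound = min(math.exp(-t * a) * mgf(t) for t in [i / 1000 for i in range(1, 2000)])

# Exact tail probability, for comparison
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

print(f"Chernoff bound: {bound:.5f}, exact tail: {exact:.5f}")
```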
Case 1:

When we have plenty of data, it is natural to use the maximum likelihood (ML) solution, i.e. the observed frequency
$\theta_i^{ML} \stackrel{\text{not.}}{=} \dfrac{n_i}{\sum_j n_j} = \dfrac{n_i}{N}$.

Note: it is easy to show that indeed $P(n \mid \theta^{ML}) > P(n \mid \theta)$ for any $\theta \neq \theta^{ML}$:

$\log \dfrac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \dfrac{\prod_i (\theta_i^{ML})^{n_i}}{\prod_i \theta_i^{n_i}} = \sum_i n_i \log \dfrac{\theta_i^{ML}}{\theta_i} = N \sum_i \theta_i^{ML} \log \dfrac{\theta_i^{ML}}{\theta_i} > 0$

The inequality follows from the fact that the relative entropy is always
positive except when the two distributions are identical.
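A minimal sketch of this in code, with hypothetical counts: it computes the ML estimate and checks that the log-likelihood ratio, i.e. $N$ times the relative entropy, is positive for another parameter vector:

```python
import numpy as np

counts = np.array([7, 2, 1])                   # hypothetical observed counts n_i
N = counts.sum()

theta_ml = counts / N                          # ML estimate: observed frequencies

# Any other probability vector theta (same support, sums to 1)
theta_other = np.array([0.4, 0.4, 0.2])

# log P(n | theta_ml) - log P(n | theta) = N * KL(theta_ml || theta) > 0
log_ratio = np.sum(counts * (np.log(theta_ml) - np.log(theta_other)))
print(theta_ml, log_ratio)                     # log_ratio is strictly positive
```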
Case 2:

When the data is scarce, it is not clear what the best estimate is.
In general, we should use prior knowledge, via Bayesian statistics.
For instance, one can use the Dirichlet distribution with parameters $\alpha$:

$P(\theta \mid n) = \dfrac{P(n \mid \theta)\, D(\theta \mid \alpha)}{P(n)}$

It can be shown (see the calculation in R. Durbin et al.'s BSA book, p. 320)
that the posterior mean estimate (PME) of the parameters is

$\theta_i^{PME} \stackrel{\text{def.}}{=} \int \theta_i\, P(\theta \mid n)\, d\theta = \dfrac{n_i + \alpha_i}{N + \sum_j \alpha_j}$

The $\alpha$'s are like pseudocounts added to the real counts. (If we think of the
$\alpha$'s as extra observations added to the real ones, this is precisely the ML
estimate!) This makes the Dirichlet regulariser very intuitive.

How to use the pseudocounts: If it is fairly obvious that a certain residue,
let's say $i$, is very common, then we should give it a very high pseudocount
$\alpha_i$; if the residue $j$ is generally rare, we should give it a low pseudocount.
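A minimal sketch comparing the ML and posterior mean estimates on sparse, made-up counts; the uniform pseudocounts are an illustration choice:

```python
import numpy as np

counts = np.array([3, 1, 0, 0])          # scarce hypothetical data: two outcomes unseen
alphas = np.array([1.0, 1.0, 1.0, 1.0])  # Dirichlet pseudocounts (uniform prior)
N = counts.sum()

theta_ml  = counts / N                                   # ML: unseen outcomes get probability 0
theta_pme = (counts + alphas) / (N + alphas.sum())       # PME: (n_i + alpha_i) / (N + sum_j alpha_j)

print(theta_ml)    # [0.75  0.25  0.    0.   ]
print(theta_pme)   # [0.5   0.25  0.125 0.125]
```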
Strategies to be used in the general case

A. The Maximum Likelihood (ML) Estimate

When we wish to infer the parameters $\theta = (\theta_i)$ for a model $M$ from a set
of data $D$, the most obvious strategy is to maximise $P(D \mid \theta, M)$ over all
possible values of $\theta$. Formally:

$\theta^{ML} = \operatorname{argmax}_{\theta}\, P(D \mid \theta, M)$

where the maximisation is over all probability vectors $\theta$, i.e. all those that sum to
one.
Specific conditional entropy: $H(Y \mid X = x) \stackrel{\text{def.}}{=} -\sum_{y \in Y} p(y \mid x) \log p(y \mid x)$.

Average conditional entropy:
$H(Y \mid X) \stackrel{\text{def.}}{=} \sum_{x \in X} p(x)\, H(Y \mid X = x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y \mid x)$.

Joint entropy:
$H(X, Y) \stackrel{\text{def.}}{=} -\sum_{x, y} p(x, y) \log p(x, y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.

Mutual information (or: information gain):
$IG(X; Y) \stackrel{\text{def.}}{=} H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
$= H(X, Y) - H(X \mid Y) - H(Y \mid X) = IG(Y; X)$.
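A minimal sketch computing these quantities (in bits) for a small made-up joint pmf:

```python
import numpy as np

# Hypothetical joint pmf p(x, y)
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    # Entropy in bits; 0 * log 0 is treated as 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_xy = H(p_xy.ravel())                   # joint entropy H(X, Y)
H_x, H_y = H(p_x), H(p_y)
H_y_given_x = H_xy - H_x                 # chain rule: H(X, Y) = H(X) + H(Y | X)
IG = H_x + H_y - H_xy                    # mutual information / information gain

print(H_x, H_y, H_xy, H_y_given_x, IG)   # IG >= 0, and IG = 0 iff X and Y are independent
```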
Exemplification: Entropy of a Bernoulli Distribution
[Figure: the entropy of a Bernoulli($p$) distribution as a function of $p \in [0, 1]$.]
Basic properties of
Entropy, Conditional Entropy, Joint Entropy and
Mutual Information / Information Gain

$0 \leq H(p_1, \ldots, p_n) \leq H\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \log n$;
$H(X) = 0$ iff $X$ is a constant random variable.

$IG(X; Y) \geq 0$;
$IG(X; Y) = 0$ iff $X$ and $Y$ are independent.

$H(X \mid Y) \leq H(X)$;
$H(X \mid Y) = H(X)$ iff $X$ and $Y$ are independent.

$H(X, Y) \leq H(X) + H(Y)$;
$H(X, Y) = H(X) + H(Y)$ iff $X$ and $Y$ are independent.

A chain rule: $H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1, \ldots, X_{n-1})$.
[Figure: Venn-diagram representation of $H(X)$, $H(Y)$ and $H(X, Y)$.]
Other definitions
Basic properties of
relative entropy and cross-entropy

$KL(p \,||\, q) \geq 0$ for all $p$ and $q$;
$KL(p \,||\, q) = 0$ iff $p$ and $q$ are identical.

If $X$ is a discrete random variable, $p$ its pmf, and $q$ another pmf,
then $CH(X, q) - H(X) \geq 0$.

KL is NOT a distance metric (because it is not symmetric)!

The quantity
$d(X, Y) \stackrel{\text{def.}}{=} H(X, Y) - IG(X; Y) = H(X) + H(Y) - 2\,IG(X; Y) = H(X \mid Y) + H(Y \mid X)$,
known as variation of information, is a distance metric.

$IG(X; Y) = KL(p_{XY} \,||\, p_X\, p_Y) = \sum_x \sum_y p(x, y) \log \dfrac{p(x, y)}{p(x)\,p(y)}$.
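A minimal sketch of these quantities for two made-up discrete distributions, also checking $IG(X; Y) = KL(p_{XY} \,||\, p_X p_Y)$ on the small joint pmf used in the earlier entropy sketch:

```python
import numpy as np

def kl(p, q):
    # Relative entropy KL(p || q), in bits; assumes q > 0 wherever p > 0
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    # CH(X, q) = -sum_x p(x) log q(x) = H(X) + KL(p || q)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.7, 0.2, 0.1])          # hypothetical pmf of X
q = np.array([0.4, 0.4, 0.2])          # another pmf over the same outcomes

print(kl(p, q), kl(q, p))              # both >= 0, and in general not equal (no symmetry)
print(cross_entropy(p, q) + np.sum(p * np.log2(p)))   # CH(X, q) - H(X) = KL(p || q) >= 0

# IG(X; Y) as KL between the joint pmf and the product of its marginals
p_xy = np.array([[0.25, 0.25], [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(kl(p_xy.ravel(), np.outer(p_x, p_y).ravel()))
```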
6. Recommended Exercises
Addenda:
Other Examples of Probability Distributions
Multinomial distribution:

generalises the binomial distribution to the case where there are $K$ independent
outcomes with probabilities $\theta_i$, $i = 1, \ldots, K$ such that $\sum_{i=1}^{K} \theta_i = 1$.
The probability of getting $n_i$ occurrences of outcome $i$ is given by

$P(n \mid \theta) = \dfrac{n!}{\prod_{i=1}^{K} (n_i!)} \prod_{i=1}^{K} \theta_i^{n_i}$,

where $n = n_1 + \ldots + n_K$, and $\theta = (\theta_1, \ldots, \theta_K)$.

Note: The particular case $n = 1$ represents the categorical distribution. This is a generalisation of the Bernoulli distribution.

Example: The outcome of a single roll of a die is described by a categorical
distribution. The probabilities of each of the 6 outcomes are $\theta_1, \ldots, \theta_6$.
For a fair die, $\theta_1 = \ldots = \theta_6 = \frac{1}{6}$, and the probability of rolling it 12 times and
getting each outcome twice is:

$\dfrac{12!}{(2!)^6} \left(\dfrac{1}{6}\right)^{12} \approx 3.4 \times 10^{-3}$
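A minimal check of the die example (and of the general multinomial formula) in code:

```python
import math

def multinomial_pmf(counts, thetas):
    # P(n | theta) = n! / prod_i(n_i!) * prod_i theta_i^{n_i}
    n = sum(counts)
    coef = math.factorial(n)
    for n_i in counts:
        coef //= math.factorial(n_i)
    prob = coef
    for n_i, th in zip(counts, thetas):
        prob *= th ** n_i
    return prob

# Fair die rolled 12 times, each outcome observed exactly twice
print(multinomial_pmf([2] * 6, [1 / 6] * 6))    # approx 3.4e-3
```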
Poisson distribution (or, Poisson law of small numbers):

$p(k; \lambda) = \dfrac{\lambda^k}{k!}\, e^{-\lambda}$, with $k \in \mathbb{N}$ and parameter $\lambda > 0$.

Mean = variance = $\lambda$.
[Figure: Poisson pmf $P(X = k)$ and cdf $P(X \leq k)$ for $\lambda = 1, 4, 10$.]
[Figure: pdf $p(x)$ and cdf $P(X \leq x)$ for parameter values $\lambda = 0.5, 1, 1.5$, plotted for $x \in [0, 5]$.]
Gamma distribution:

$p(x; k, \theta) = \dfrac{e^{-x/\theta}}{\Gamma(k)\,\theta^k}\, x^{k-1}$ for $x \geq 0$ and parameters $k > 0$ (shape) and $\theta > 0$ (scale).

Mean = $k\theta$, variance = $k\theta^2$.

The gamma function $\Gamma$ is a generalisation of the factorial function to real values. For any
positive real number $x$, $\Gamma(x + 1) = x\,\Gamma(x)$. (Thus, for integers, $\Gamma(n) = (n - 1)!$.)
[Figure: Gamma pdf $p(x)$ and cdf $P(X \leq x)$ for $(k, \theta) = (1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (5.0, 1.0), (9.0, 0.5), (7.5, 1.0), (0.5, 1.0)$.]
$\chi^2$ distribution:

$p(x; \nu) = \dfrac{1}{\Gamma(\nu/2)} \left(\dfrac{1}{2}\right)^{\nu/2} x^{\nu/2 - 1}\, e^{-\frac{1}{2}x}$ for $x \geq 0$ and $\nu$ a positive integer.

It is obtained from the Gamma distribution by taking $k = \nu/2$ and $\theta = 2$.

Mean = $\nu$, variance = $2\nu$.
[Figure: $\chi^2$ pdf $p(x)$ and cdf $P(X \leq x)$ for $k = 1, 2, 3, 4, 6, 9$ degrees of freedom.]
Laplace distribution:

$p(x; \mu, b) = \dfrac{1}{2b}\, e^{-\frac{|x - \mu|}{b}}$.

Mean = $\mu$, variance = $2b^2$.
[Figure: Laplace pdf $p(x)$ and cdf $P(X \leq x)$ for $(\mu, b) = (0, 1), (0, 2), (0, 4), (-5, 4)$.]
Student's $t$ distribution:

$p(x; \nu) = \dfrac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \dfrac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ for $x \in \mathbb{R}$ and $\nu > 0$ (the degrees of freedom parameter).

Mean = $0$ for $\nu > 1$, otherwise undefined.
Variance = $\dfrac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \leq 2$, otherwise undefined.

[Figure: the probability density function and the cumulative distribution function.]

Note [from Wiki]: The $t$-distribution is symmetric and bell-shaped, like the normal distribution, but it has
heavier tails, meaning that it is more prone to producing values that fall far from its mean.
Dirichlet distribution:

$D(\theta \mid \alpha) = \dfrac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\; \delta\!\left(\sum_{i=1}^{K} \theta_i - 1\right)$

where

$\alpha = \alpha_1, \ldots, \alpha_K$ with $\alpha_i > 0$ are the parameters,

the $\theta_i$ satisfy $0 \leq \theta_i \leq 1$ and sum to 1, this being indicated by the delta function
term $\delta\!\left(\sum_i \theta_i - 1\right)$, and

the normalising factor can be expressed in terms of the gamma function:

$Z(\alpha) = \int \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\; \delta\!\left(\sum_i \theta_i - 1\right) d\theta = \dfrac{\prod_i \Gamma(\alpha_i)}{\Gamma\!\left(\sum_i \alpha_i\right)}$

Mean of $\theta_i$: $\dfrac{\alpha_i}{\sum_j \alpha_j}$.
Remark:

Concerning the multinomial and Dirichlet distributions:
The algebraic expression for the parameters $\theta_i$ is similar in the two distributions.
However, the multinomial is a distribution over its exponents $n_i$, whereas
the Dirichlet is a distribution over the numbers $\theta_i$ that are exponentiated.
The two distributions are said to be conjugate distributions, and their
close formal relationship leads to a harmonious interplay in many estimation
problems (a short derivation is sketched below).

Similarly,
the beta distribution is the conjugate of the Bernoulli distribution, and
the gamma distribution is the conjugate of the Poisson distribution.
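A short sketch of the derivation behind the multinomial-Dirichlet conjugacy, keeping only the factors that depend on $\theta$ and using the counts $n = (n_1, \ldots, n_K)$ and pseudocounts $\alpha$ introduced earlier:

$$P(\theta \mid n, \alpha) \;\propto\; P(n \mid \theta)\, D(\theta \mid \alpha) \;\propto\; \prod_{i=1}^{K} \theta_i^{n_i} \;\prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \;=\; \prod_{i=1}^{K} \theta_i^{(n_i + \alpha_i) - 1},$$

i.e. the posterior is again a Dirichlet, namely $D(\theta \mid \alpha + n)$; its mean gives back the posterior mean estimate $\theta_i^{PME} = (n_i + \alpha_i)/(N + \sum_j \alpha_j)$.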