
0.

Basic Statistics and Probability Theory

Based on
Foundations of Statistical NLP,
C. Manning & H. Schütze, ch. 2, MIT Press, 2002

"Probability theory is nothing but common sense
reduced to calculation."
Pierre Simon, Marquis de Laplace (1749-1827)
1.

PLAN
1. Elementary Probability Notions:
Sample Space, Event Space, and Probability Function
Conditional Probability
Bayes Theorem
Independence of Probabilistic Events
2. Random Variables:
Discrete Variables and Continuous Variables
Mean, Variance and Standard Deviation
Standard Distributions
Joint, Marginal and Conditional Distributions
Independence of Random Variables
2.
PLAN (contd)
3. Limit Theorems
Laws of Large Numbers
Central Limit Theorems
4. Estimating the parameters of probabilistic models from
data
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori (MAP) Estimation
5. Elementary Information Theory
Entropy; Conditional Entropy; Joint Entropy
Mutual Information / Information Gain
Cross-Entropy
Relative Entropy / Kullback-Leibler (KL) Divergence
Properties: bounds, chain rules, (non-)symmetries,
properties pertaining to independence
3.

1. Elementary Probability Notions


sample space: Ω (either discrete or continuous)
event: A ⊆ Ω
the certain event: Ω
the impossible event: ∅
event space: F = 2^Ω (or a subspace of 2^Ω that contains Ω and is closed
under complement and countable union)
probability function/distribution: P : F → [0, 1] such that:
P(Ω) = 1
the countable additivity property:
A_1, ..., A_k, ... disjoint events ⇒ P(∪_i A_i) = Σ_i P(A_i)
Consequence: for a uniform distribution in a finite sample space:
P(A) = #favorable events / #all events
4.

Conditional Probability

P(A | B) = P(A ∩ B) / P(B)

Note: P(A | B) is called the a posteriori probability of A, given B.

The multiplication rule:
P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

The chain rule:
P(A_1 ∩ A_2 ∩ ... ∩ A_n) =
P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) ... P(A_n | A_1, A_2, ..., A_{n-1})
5.

The total probability formula:

P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B)

More generally:
if A ⊆ ∪_i B_i and B_i ∩ B_j = ∅ for i ≠ j, then
P(A) = Σ_i P(A | B_i) P(B_i)

Bayes' Theorem:

P(B | A) = P(A | B) P(B) / P(A)

or P(B | A) = P(A | B) P(B) / (P(A | B) P(B) + P(A | ¬B) P(¬B))

or ...
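For illustration (not part of the original slides), here is a minimal Python sketch of the total probability formula and Bayes' theorem on a hypothetical diagnostic-test scenario; all numbers are made up.

```python
# Hypothetical numbers: P(B) = 0.01 (prior), P(A | B) = 0.95, P(A | not-B) = 0.05,
# where B = "has the condition" and A = "test is positive".
p_b = 0.01
p_a_given_b = 0.95
p_a_given_not_b = 0.05

# total probability formula: P(A) = P(A | B) P(B) + P(A | not-B) P(not-B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 3))   # 0.161
```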
6.

Independence of Probabilistic Events

Independent events: P(A ∩ B) = P(A) P(B)
Note: when P(B) ≠ 0, the above definition is equivalent to
P(A | B) = P(A).

Conditionally independent events:
P(A ∩ B | C) = P(A | C) P(B | C), assuming, of course, that
P(C) ≠ 0.
Note: when P(B ∩ C) ≠ 0, the above definition is equivalent
to P(A | B, C) = P(A | C).
7.

2. Random Variables
2.1 Basic Definitions
Let Ω be a sample space, and
P : 2^Ω → [0, 1] a probability function.
A random variable of distribution P is a function

X : Ω → R^n

For now, let us consider n = 1.

The cumulative distribution function of X is F : R → [0, 1], defined by

F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})

8.
2.2 Discrete Random Variables
Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random
variable of distribution P.
If Image(X) is either finite or countably infinite, then
X is called a discrete random variable.
For such a variable we define the probability mass function (pmf)
p : R → [0, 1] as p(x) = p(X = x) = P({ω ∈ Ω | X(ω) = x}).
(Obviously, it follows that Σ_{x_i ∈ Image(X)} p(x_i) = 1.)

Mean, Variance, and Standard Deviation:
Expectation / mean of X:
E(X) = E[X] = Σ_x x p(x) if X is a discrete random variable.
Variance of X: Var(X) = Var[X] = E((X - E(X))²).
Standard deviation: σ = √Var(X).
Covariance of X and Y, two random variables of distribution P:
Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
9.
Exemplification:
the Binomial distribution: b(r; n, p) = C(n, r) p^r (1 - p)^(n-r), for 0 ≤ r ≤ n
mean: np, variance: np(1 - p)
the Bernoulli distribution: b(r; 1, p)
mean: p, variance: p(1 - p), entropy: -p log₂ p - (1 - p) log₂(1 - p)
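As a small illustration (mine, not from the slides), the binomial pmf can be evaluated directly from the formula above and checked against the stated mean and variance; the parameter values are arbitrary.

```python
from math import comb

def binom_pmf(r, n, p):
    # b(r; n, p) = C(n, r) p^r (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 20, 0.5
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]
print(round(sum(pmf), 6))                                    # 1.0: the pmf sums to one
print(sum(r * pr for r, pr in enumerate(pmf)))               # mean, approximately n*p = 10
print(sum((r - n * p)**2 * pr for r, pr in enumerate(pmf)))  # variance, approximately n*p*(1-p) = 5
```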
[Figure: Binomial probability mass function b(r; n, p) and cumulative distribution
function F(r), plotted for p = 0.5, n = 20; p = 0.7, n = 20; and p = 0.5, n = 40.]
10.
2.3 Continuous Random Variables
Definitions:
Let P : 2^Ω → [0, 1] be a probability function, and
X : Ω → R be a random variable of distribution P.
If Image(X) is an uncountably infinite set, and
F, the cumulative distribution function of X, is continuous, then
X is called a continuous random variable.
(It follows, naturally, that P(X = x) = 0, for all x ∈ R.)
If there exists p : R → [0, ∞) such that F(x) = ∫_{-∞}^{x} p(t) dt,
then X is called absolutely continuous.
In such a case, p is called the probability density function (pdf) of X.
For any B ⊆ R for which ∫_B p(x) dx exists, P(X^{-1}(B)) = ∫_B p(x) dx,
where X^{-1}(B) = {ω ∈ Ω | X(ω) ∈ B}.
In particular, ∫_{-∞}^{+∞} p(x) dx = 1.
Expectation / mean of X: E(X) = E[X] = ∫_{-∞}^{+∞} x p(x) dx.
11.

Exemplification:

Normal (Gaussian) distribution: N(x; μ, σ) = (1 / (σ√(2π))) e^{-(x-μ)² / (2σ²)}
mean: μ, variance: σ²
Standard Normal distribution: N(x; 0, 1)

Remark:
For n, p such that np(1 - p) > 5, the Binomial distribution can be
approximated by a Normal distribution.
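A quick numerical check of this rule of thumb (my own sketch, not from the slides): for n = 40, p = 0.5 we have np(1 - p) = 10 > 5, and the normal cdf with μ = np and σ² = np(1 - p), evaluated with a continuity correction, stays close to the exact binomial cdf.

```python
from math import comb, erf, sqrt

n, p = 40, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

def binom_cdf(r):
    # exact P(X <= r) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

def normal_cdf(x):
    # cdf of the approximating Normal(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

for r in (15, 20, 25):
    # continuity correction: P(X <= r) is approximated by the normal cdf at r + 0.5
    print(r, round(binom_cdf(r), 4), round(normal_cdf(r + 0.5), 4))
```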
12.

[Figure: Gaussian probability density function N_{μ,σ²}(x) and cumulative
distribution function Φ_{μ,σ²}(x), plotted for (μ = 0, σ² = 0.2), (μ = 0, σ² = 1.0),
(μ = 0, σ² = 5.0), and (μ = -2, σ² = 0.5).]
13.
2.4 Basic Properties of Random Variables
Let P : 2^Ω → [0, 1] be a probability function,
X : Ω → R^n be a discrete/continuous random variable of distribution P.
If g : R^n → R^m is a function, then g(X) is a random variable.
If g(X) is discrete, then E(g(X)) = Σ_x g(x) p(x).
If g(X) is continuous, then E(g(X)) = ∫ g(x) p(x) dx.
E(aX + b) = aE(X) + b.
If g is non-linear, then in general E(g(X)) ≠ g(E(X)).
E(X + Y) = E(X) + E(Y).
Var(X) = E(X²) - E²(X).
Var(aX) = a² Var(X).
Var(X + a) = Var(X).
Cov(X, Y) = E[XY] - E[X] E[Y].
14.
2.5 Joint, Marginal and Conditional Distributions
Exemplification for the bi-variate case:
Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and
V : Ω → R² be a random variable of distribution P.
One could naturally see V as a pair of two random variables X : Ω → R and
Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)
the joint pmf/pdf of X and Y is defined by
p(x, y) = p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω | X(ω) = x, Y(ω) = y}).
the marginal pmf/pdf functions of X and Y are:
for the discrete case:
p_X(x) = Σ_y p(x, y),   p_Y(y) = Σ_x p(x, y)
for the continuous case:
p_X(x) = ∫ p(x, y) dy,   p_Y(y) = ∫ p(x, y) dx
the conditional pmf/pdf of X given Y is:
p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)
15.

2.6 Independence of Random Variables


Definitions:
Let X, Y be random variables of the same type (i.e. either discrete or
continuous), and pX,Y their joint pmf/pdf.
X and Y are said to be independent if

pX,Y (x, y) = pX (x) pY (y)


for all possible values x and y of X and Y respectively.
Similarly, let X, Y and Z be random variables of the same type, and p
their joint pmf/pdf.
X and Y are conditionally independent given Z if
pX,Y |Z (x, y | z) = pX|Z (x | z) pY |Z (y | z)
for all possible values x, y and z of X, Y and Z respectively.
16.

Properties of random variables pertaining to independence

If X, Y are independent, then


Var(X + Y ) = Var(X) + Var(Y ).
If X, Y are independent, then
E(XY ) = E(X)E(Y ), i.e. Cov(X, Y ) = 0.
Cov(X, Y) = 0 does not imply that X, Y are independent.

The covariance matrix corresponding to a vector of random variables


is symmetric and positive semi-definite.
If the covariance matrix of a multi-variate Gaussian distribution is
diagonal, then the marginal distributions are independent.
17.

3. Limit Theorems
[ Sheldon Ross, A first course in probability, 5th ed., 1998 ]

The most important results in probability theory are limit theorems.
Of these, the most important are:
laws of large numbers, concerned with stating conditions under
which the average of a sequence of random variables converges (in
some sense) to the expected average;
central limit theorems, concerned with determining the conditions
under which the sum of a large number of random variables has a
probability distribution that is approximately normal.
18.
Two basic inequalities and the weak law of large numbers

Markov's inequality:
If X is a random variable that takes only non-negative values,
then for any value a > 0,
P(X ≥ a) ≤ E[X] / a

Chebyshev's inequality:
If X is a random variable with finite mean μ and variance σ²,
then for any value k > 0,
P(|X - μ| ≥ k) ≤ σ² / k²

The weak law of large numbers (Bernoulli; Khintchine):
Let X_1, X_2, ..., X_n be a sequence of independent and identically dis-
tributed random variables, each having a finite mean E[X_i] = μ.
Then, for any value ε > 0,
P( |(X_1 + ... + X_n)/n - μ| ≥ ε ) → 0 as n → ∞
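A short simulation (mine, not from the slides; numpy assumed available) illustrating the weak law and the Chebyshev bound applied to the sample mean of i.i.d. Uniform(0, 1) variables (μ = 0.5, σ² = 1/12).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var, eps = 0.5, 1.0 / 12, 0.05
runs = 1000                                      # independent repetitions per sample size

for n in (10, 100, 1000, 10000):
    means = rng.uniform(0, 1, size=(runs, n)).mean(axis=1)
    freq = np.mean(np.abs(means - mu) >= eps)    # empirical P(|mean - mu| >= eps)
    bound = min(var / (n * eps**2), 1.0)         # Chebyshev bound, since Var(mean) = var / n
    print(n, freq, round(bound, 4))              # the frequency tends to 0 and stays below the bound
```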
19.

The central limit theorem
for i.i.d. random variables
[ Pierre Simon, Marquis de Laplace; Liapunoff in 1901-1902 ]

Let X_1, X_2, ..., X_n be a sequence of independent and identically distributed
random variables, each having mean μ and variance σ².
Then the distribution of

(X_1 + ... + X_n - nμ) / (σ√n)

tends to the standard normal (Gaussian) as n → ∞.
That is, for -∞ < a < ∞,

P( (X_1 + ... + X_n - nμ) / (σ√n) ≤ a ) → (1/√(2π)) ∫_{-∞}^{a} e^{-x²/2} dx as n → ∞
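An empirical check (mine, not from the slides; numpy assumed) of the CLT: standardized sums of exponential variables (μ = 1, σ = 1) are compared against the standard normal cdf Φ.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, runs = 500, 5000
mu, sigma = 1.0, 1.0                                     # mean and std of Exponential(1)

samples = rng.exponential(scale=1.0, size=(runs, n))
z = (samples.sum(axis=1) - n * mu) / (sigma * sqrt(n))   # standardized sums

phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))             # standard normal cdf
for a in (-1.0, 0.0, 1.0):
    print(a, np.mean(z <= a), round(phi(a), 4))          # empirical value vs Phi(a)
```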
20.
The central limit theorem
for independent random variables
Let X_1, X_2, ..., X_n be a sequence of independent random variables
having respective means μ_i and variances σ_i².
If
(a) the variables X_i are uniformly bounded,
i.e. for some M ∈ R_+, P(|X_i| < M) = 1 for all i,
and
(b) Σ_{i=1}^{∞} σ_i² = ∞,
then

P( Σ_{i=1}^{n} (X_i - μ_i) / √(Σ_{i=1}^{n} σ_i²) ≤ a ) → Φ(a) as n → ∞

where Φ is the cumulative distribution function of the standard normal
(Gaussian) distribution.
21.

The strong law of large numbers

Let X_1, X_2, ..., X_n be a sequence of independent and identically
distributed random variables, each having a finite mean E[X_i] = μ.
Then, with probability 1,

(X_1 + ... + X_n)/n → μ as n → ∞

That is,
P( lim_{n→∞} (X_1 + ... + X_n)/n = μ ) = 1
22.
Other inequalities

One-sided Chebyshev inequality:
If X is a random variable with mean 0 and finite variance σ²,
then for any a > 0,
P(X ≥ a) ≤ σ² / (σ² + a²)

Corollary:
If E[X] = μ, Var(X) = σ², then for a > 0,
P(X ≥ μ + a) ≤ σ² / (σ² + a²)
P(X ≤ μ - a) ≤ σ² / (σ² + a²)

Chernoff bounds:
Let M(t) = E[e^{tX}]. Then
P(X ≥ a) ≤ e^{-ta} M(t) for all t > 0
P(X ≤ a) ≤ e^{-ta} M(t) for all t < 0
23.

4. Estimation/inference of the parameters of


probabilistic models from data
(based on [Durbin et al, Biological Sequence Analysis, 1998],
p. 311-313, 319-321)

A probabilistic model can be anything from a simple distribution
to a complex stochastic grammar with many implicit probability
distributions. Once the type of the model is chosen, the parameters
have to be inferred from data.
We will first consider the case of the categorical distribution, and
then we will present the different strategies that can be used in
general.
24.
A case study: Estimation of the parameters of
a categorical distribution from data
Assume that the observations (for example, when rolling a die about
which we don't know whether it is fair or not, or when counting the number
of times the amino acid i occurs in a column of a multiple sequence
alignment) can be expressed as counts n_i for each outcome i (i = 1, ..., K),
and we want to estimate the probabilities θ_i of the underlying distribution.

Case 1:
When we have plenty of data, it is natural to use the maximum likelihood
(ML) solution, i.e. the observed frequency θ_i^ML = n_i / Σ_j n_j = n_i / N.

Note: it is easy to show that indeed P(n | θ^ML) > P(n | θ) for any θ ≠ θ^ML:

log [ P(n | θ^ML) / P(n | θ) ] = log [ Π_i (θ_i^ML)^{n_i} / Π_i θ_i^{n_i} ]
  = Σ_i n_i log (θ_i^ML / θ_i) = N Σ_i θ_i^ML log (θ_i^ML / θ_i) > 0

The inequality follows from the fact that the relative entropy is always
positive except when the two distributions are identical.
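A minimal sketch (not from the slides) of the ML estimate for a categorical distribution, with made-up die counts, including a check that the ML parameters achieve a log-likelihood at least as high as the fair-die parameters.

```python
from math import log

# hypothetical counts n_i for the six faces of a die
counts = [12, 8, 11, 9, 10, 10]
N = sum(counts)

# maximum likelihood estimate: theta_i = n_i / N (the observed frequencies)
theta_ml = [n_i / N for n_i in counts]
print(theta_ml, sum(theta_ml))          # the estimates sum to 1

def log_likelihood(theta):
    # log P(n | theta), up to the constant multinomial coefficient
    return sum(n_i * log(t_i) for n_i, t_i in zip(counts, theta))

theta_fair = [1 / 6] * 6
print(log_likelihood(theta_ml) >= log_likelihood(theta_fair))   # True
```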
25.
Case 2:
When the data is scarce, it is not clear what the best estimate is.
In general, we should use prior knowledge, via Bayesian statistics.
For instance, one can use the Dirichlet distribution with parameters α:

P(θ | n) = P(n | θ) D(θ | α) / P(n)

It can be shown (see the calculation in the R. Durbin et al. BSA book, p. 320)
that the posterior mean estimation (PME) of the parameters is

θ_i^PME = ∫ θ_i P(θ | n) dθ = (n_i + α_i) / (N + Σ_j α_j)

The α's are like pseudocounts added to the real counts. (If we think of the
α's as extra observations added to the real ones, this is precisely the ML
estimate!) This makes the Dirichlet regulariser very intuitive.
How to use the pseudocounts: if it is fairly obvious that a certain residue,
let's say i, is very common, then we should give it a very high pseudocount
α_i; if the residue j is generally rare, we should give it a low pseudocount.
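A sketch (mine, assuming a symmetric Dirichlet prior with all α_i = 1) contrasting the ML estimate with the posterior mean estimate when counts are scarce; unseen outcomes no longer get probability zero.

```python
# scarce, hypothetical counts: the die was rolled only 5 times
counts = [3, 1, 1, 0, 0, 0]
alphas = [1.0] * 6                    # assumed pseudocounts ("add-one" Dirichlet prior)
N, A = sum(counts), sum(alphas)

theta_ml = [n / N for n in counts]                               # 0 for unseen outcomes
theta_pme = [(n + a) / (N + A) for n, a in zip(counts, alphas)]  # (n_i + alpha_i) / (N + sum(alpha))

print([round(t, 3) for t in theta_ml])    # [0.6, 0.2, 0.2, 0.0, 0.0, 0.0]
print([round(t, 3) for t in theta_pme])   # [0.364, 0.182, 0.182, 0.091, 0.091, 0.091]
```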
26.
Strategies to be used in the general case
A. The Maximum Likelihood (ML) Estimate
When we wish to infer the parameters θ = (θ_i) for a model M from a set
of data D, the most obvious strategy is to maximise P(D | θ, M) over all
possible values of θ. Formally:

θ^ML = argmax_θ P(D | θ, M)

Note: Generally speaking, when we treat P (x | y) as a function of x (and


y is fixed), we refer to it as a probability. When we treat P (x | y) as a
function of y (and x is fixed), we call it a likelihood. Note that a likelihood
is not a probability distribution or density; it is simply a function of the
variable y.
A serious drawback of maximum likelihood is that it gives poor results
when data is scarce. The solution then is to introduce more prior knowledge,
using Bayes' theorem. (In the Bayesian framework, the parameters
are themselves seen as random variables!)
27.
B. The Maximum A Posteriori Probability (MAP) Estimate

θ^MAP = argmax_θ P(θ | D, M) = argmax_θ P(D | θ, M) P(θ | M) / P(D | M)
      = argmax_θ P(D | θ, M) P(θ | M)

The prior probability P(θ | M) has to be chosen in some reasonable manner,
and this is the art of Bayesian estimation (although this freedom to choose
a prior has made Bayesian statistics controversial at times...).

C. The Posterior Mean Estimator (PME)

θ^PME = ∫ θ P(θ | D, M) dθ

where the integral is over all probability vectors θ, i.e. all those that sum to
one.

D. Yet another solution is to use the posterior probability P(θ | D, M) to
sample from it (see [Durbin et al, 1998], section 11.4) and thereby locate
regions of high probability for the model parameters.
28.
5. Elementary Information Theory
Definitions:
Let X and Y be discrete random variables.

Entropy: H(X) = Σ_x p(x) log (1/p(x)) = -Σ_x p(x) log p(x) = E_p[-log p(X)].
Convention: if p(x) = 0 then we shall consider p(x) log p(x) = 0.

Specific conditional entropy: H(Y | X = x) = -Σ_{y∈Y} p(y | x) log p(y | x).
Average conditional entropy:
H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y | x).
Joint entropy:
H(X, Y) = -Σ_{x,y} p(x, y) log p(x, y) = H(X) + H(Y | X) = H(Y) + H(X | Y).
Mutual information (or: Information gain):
IG(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
         = H(X, Y) - H(X | Y) - H(Y | X) = IG(Y; X).
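A small sketch (not from the slides) computing these quantities for a made-up joint pmf over two binary variables; logarithms are base 2, so the results are in bits.

```python
from math import log2

# hypothetical joint pmf p(x, y) for binary X and Y
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def H(dist):
    # entropy of a pmf given as a dict of probabilities
    return -sum(p * log2(p) for p in dist.values() if p > 0)

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)
H_Y_given_X = H_XY - H_X          # chain rule: H(X, Y) = H(X) + H(Y | X)
IG = H_Y - H_Y_given_X            # mutual information IG(X; Y)
print(H_X, H_Y, H_XY, H_Y_given_X, IG)
```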
29.
Exemplification: Entropy of a Bernoulli Distribution

H(p) = -p log₂ p - (1 - p) log₂(1 - p)

[Figure: H(p) plotted as a function of p ∈ [0, 1]; the curve is 0 at p = 0 and
p = 1 and reaches its maximum, 1 bit, at p = 0.5.]
30.
Basic properties of
Entropy, Conditional Entropy, Joint Entropy and
Mutual Information / Information Gain
 
0 ≤ H(p_1, ..., p_n) ≤ H(1/n, ..., 1/n) = log n;
H(X) = 0 iff X is a constant random variable.
IG(X; Y) ≥ 0;
IG(X; Y) = 0 iff X and Y are independent.
H(X | Y) ≤ H(X);
H(X | Y) = H(X) iff X and Y are independent.
H(X, Y) ≤ H(X) + H(Y);
H(X, Y) = H(X) + H(Y) iff X and Y are independent.
a chain rule: H(X_1, ..., X_n) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_1, ..., X_{n-1}).
31.

The Relationship between


Entropy, Conditional Entropy, Joint Entropy and
Mutual Information

[Figure: Venn-style diagram of the joint entropy H(X, Y), showing H(X | Y) and
H(Y | X) as the non-overlapping parts of H(X) and H(Y), and the mutual
information I(X; Y) as their overlap.]
32.
Other definitions

Let X be a discrete random variable, p its pmf, and q another pmf
(usually a model of p).
Cross-entropy:
CH(X, q) = -Σ_{x∈X} p(x) log q(x) = E_p[ log (1/q(X)) ]

Let X and Y be discrete random variables, and p and q their respective
pmfs.
Relative entropy (or, Kullback-Leibler divergence):
KL(p || q) = Σ_{x∈X} p(x) log (p(x)/q(x)) = E_p[ log (p(X)/q(X)) ]
Note: CH(X, q) = H(X) + KL(p || q).
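For illustration (mine, with two arbitrary pmfs p and q over the same three outcomes), a sketch computing the entropy, cross-entropy and KL divergence, and checking the identity CH(X, q) = H(X) + KL(p || q).

```python
from math import log2

p = [0.5, 0.25, 0.25]     # "true" pmf (assumed)
q = [0.4, 0.4, 0.2]       # model pmf (assumed)

H  = -sum(pi * log2(pi) for pi in p)                    # entropy H(X)
CH = -sum(pi * log2(qi) for pi, qi in zip(p, q))        # cross-entropy CH(X, q)
KL =  sum(pi * log2(pi / qi) for pi, qi in zip(p, q))   # relative entropy KL(p || q)

print(round(H, 4), round(CH, 4), round(KL, 4))
print(abs(CH - (H + KL)) < 1e-12)                       # True: CH(X, q) = H(X) + KL(p || q)
```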
33.

Basic properties of
relative entropy and cross-entropy

KL(p || q) ≥ 0 for all p and q;
KL(p || q) = 0 iff p and q are identical.
If X is a discrete random variable, p its pmf, and q another pmf,
then CH(X, q) - H(X) ≥ 0.
KL is NOT a distance metric (because it is not symmetric)!
The quantity
d(X, Y) = H(X, Y) - IG(X; Y) = H(X) + H(Y) - 2 IG(X; Y)
        = H(X | Y) + H(Y | X),
known as the variation of information, is a distance metric.
IG(X; Y) = KL(p_{XY} || p_X p_Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ].
34.

6. Recommended Exercises

From [Manning & Schütze, 2002], ch. 2:
Examples 1, 2, 4, 5, 7, 8, 9
Exercises 2.1, 2.3, 2.4, 2.5
From [Sheldon Ross, 1998], ch. 8:
Examples 2a, 2b, 3a, 3b, 3c, 5a, 5b
35.

Addenda:
Other Examples of Probabilistic Distributions
36.

Multinomial distribution:
generalises the binomial distribution to the case where there are K inde-
pendent outcomes with probabilities θ_i, i = 1, ..., K such that Σ_{i=1}^{K} θ_i = 1.
The probability of getting n_i occurrences of outcome i is given by

P(n | θ) = ( n! / Π_{i=1}^{K} (n_i!) ) Π_{i=1}^{K} θ_i^{n_i},

where n = n_1 + ... + n_K, and θ = (θ_1, ..., θ_K).
Note: The particular case n = 1 represents the categorical distribution.
This is a generalisation of the Bernoulli distribution.
Example: The outcome of rolling a die n times is described by a categorical
distribution. The probabilities of each of the 6 outcomes are θ_1, ..., θ_6.
For a fair die, θ_1 = ... = θ_6 = 1/6, and the probability of rolling it 12 times and
getting each outcome twice is:

( 12! / (2!)^6 ) (1/6)^12 ≈ 3.4 × 10^{-3}
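The arithmetic of this example can be reproduced with a short sketch (mine, not from the slides):

```python
from math import factorial

n_i = [2] * 6                       # each of the 6 faces observed twice
theta = [1 / 6] * 6                 # a fair die
n = sum(n_i)                        # 12 rolls in total

coeff = factorial(n)
for k in n_i:
    coeff //= factorial(k)          # multinomial coefficient n! / prod(n_i!)

prob = float(coeff)
for t, k in zip(theta, n_i):
    prob *= t**k                    # multiply in theta_i^{n_i}

print(prob)                         # approximately 3.4e-03
```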
37.
Poisson distribution (or, Poisson law of small numbers):

p(k; λ) = (λ^k / k!) e^{-λ}, with k ∈ N and parameter λ > 0.

Mean = variance = λ.

[Figure: Poisson probability mass function P(X = k) and cumulative distribution
function P(X ≤ k), plotted for λ = 1, λ = 4, and λ = 10.]
38.

Exponential distribution (a.k.a. negative exponential distribution):

p(x; λ) = λ e^{-λx} for x ≥ 0 and parameter λ > 0.
Mean = 1/λ, variance = 1/λ².

[Figure: Exponential probability density function p(x) and cumulative distribution
function P(X ≤ x), plotted for λ = 0.5, λ = 1, and λ = 1.5.]
39.
Gamma distribution:

p(x; k, θ) = ( x^{k-1} e^{-x/θ} ) / ( Γ(k) θ^k ) for x ≥ 0 and parameters k > 0 (shape) and θ > 0 (scale).
Mean = kθ, variance = kθ².
The gamma function Γ is a generalisation of the factorial function to real values. For any
positive real number x, Γ(x + 1) = xΓ(x). (Thus, for integers, Γ(n) = (n - 1)!.)

[Figure: Gamma probability density function p(x) and cumulative distribution
function P(X ≤ x), plotted for (k = 1.0, θ = 2.0), (k = 2.0, θ = 2.0), (k = 3.0, θ = 2.0),
(k = 5.0, θ = 1.0), (k = 9.0, θ = 0.5), (k = 7.5, θ = 1.0), and (k = 0.5, θ = 1.0).]
40.

χ² distribution:

p(x; ν) = ( 1 / Γ(ν/2) ) (1/2)^{ν/2} x^{ν/2 - 1} e^{-x/2} for x ≥ 0 and ν a positive integer.
It is obtained from the Gamma distribution by taking k = ν/2 and θ = 2.
Mean = ν, variance = 2ν.
[Figure: Chi-squared probability density function p(x) and cumulative distribution
function P(X ≤ x), plotted for k = 1, 2, 3, 4, 6, and 9 degrees of freedom.]
41.
Laplace distribution:

p(x; μ, b) = ( 1 / (2b) ) e^{-|x - μ| / b}.
Mean = μ, variance = 2b².
[Figure: Laplace probability density function p(x) and cumulative distribution
function P(X ≤ x), plotted for (μ = 0, b = 1), (μ = 0, b = 2), (μ = 0, b = 4),
and (μ = -5, b = 4).]
42.
Student's t distribution:

p(x; ν) = [ Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ] (1 + x²/ν)^{-(ν+1)/2} for x ∈ R and ν > 0 (the degrees-of-freedom parameter).

Mean = 0 for ν > 1, otherwise undefined.
Variance = ν/(ν - 2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined.

[Figure: the probability density function and the cumulative distribution function, not reproduced.]

Note [from Wiki]: The t-distribution is symmetric and bell-shaped, like the normal distribution, but it has
heavier tails, meaning that it is more prone to producing values that fall far from its mean.
43.

Dirichlet distribution:

D(θ | α) = ( 1 / Z(α) ) Π_{i=1}^{K} θ_i^{α_i - 1} δ( Σ_{i=1}^{K} θ_i - 1 )

where
α = α_1, ..., α_K with α_i > 0 are the parameters,
the θ_i satisfy 0 ≤ θ_i ≤ 1 and sum to 1, this being indicated by the delta function
term δ(Σ_i θ_i - 1), and
the normalising factor can be expressed in terms of the gamma function:

Z(α) = ∫ Π_{i=1}^{K} θ_i^{α_i - 1} δ(Σ_i θ_i - 1) dθ = Π_i Γ(α_i) / Γ(Σ_i α_i)

Mean of θ_i: α_i / Σ_j α_j.

For K = 2, the Dirichlet distribution reduces to the more widely known
beta distribution, and the normalising constant is the beta function.
44.

Remark:
Concerning the multinomial and Dirichlet distributions:
The algebraic expression for the parameters θ_i is similar in the two
distributions.
However, the multinomial is a distribution over its exponents n_i, whereas
the Dirichlet is a distribution over the numbers θ_i that are exponentiated.
The two distributions are said to be conjugate distributions, and their
close formal relationship leads to a harmonious interplay in many
estimation problems.
Similarly,
the beta distribution is the conjugate of the Bernoulli distribution, and
the gamma distribution is the conjugate of the Poisson distribution.
