For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.
ISBN 978-981-3228-25-2
Printed in Singapore
Preface
Imagine the old times when the word probability did not exist. Facing difficult
situations that could be described as irregular, unpredictable, random, etc.
(in what follows, we say random), people were helpless. After a long
time, they found out how to describe randomness, how to analyze it, how
to define it, and how to make use of it. What is really amazing is that all of
this has been done within rigorous mathematics, just like geometry
and algebra.
September 8, 2017 14:8 ws-book9x6 Probability and Random Number 10662-main page vi
This book presents well-known basic theorems with proofs that are not
seen in usual probability textbooks, for we want readers to learn that a
good solution is not always unique. In general, breakthroughs in science
have been made by unusual solutions. We hope readers will come to know
more than one proof for every important theorem.
Contents
Preface v
Notations and symbols . . . . . . . . . . . . . . . . . . . . . . . viii
Table of Greek letters . . . . . . . . . . . . . . . . . . . . . . . . viii
2. Random number 23
2.1 Recursive function . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Computable function . . . . . . . . . . . . . . . . 25
2.1.2 Primitive recursive function and partial recursive
function . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.3 Kleene's normal form (∗) 3 . . . . . . . . . . . . . . 29
2.1.4 Enumeration theorem . . . . . . . . . . . . . . . . 31
2.2 Kolmogorov complexity and random number . . . . . . . 33
3 The subsections marked with (∗) can be skipped.
3. Limit theorem 42
3.1 Bernoulli's theorem . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Law of large numbers . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Sequence of independent random variables . . . . 48
3.2.2 Chebyshev's inequality . . . . . . . . . . . . . . . 54
3.2.3 Cramér-Chernoff's inequality . . . . . . . . . . . . 57
3.3 De Moivre-Laplace's theorem . . . . . . . . . . . . . . . . 60
3.3.1 Binomial distribution . . . . . . . . . . . . . . . . 60
3.3.2 Heuristic observation . . . . . . . . . . . . . . . . 61
3.3.3 Taylor's formula and Stirling's formula . . . . . . 65
3.3.4 Proof of de Moivre-Laplace's theorem . . . . . . . 75
3.4 Central limit theorem . . . . . . . . . . . . . . . . . . . . 80
3.5 Mathematical statistics . . . . . . . . . . . . . . . . . . . . 86
3.5.1 Inference . . . . . . . . . . . . . . . . . . . . . . . 86
3.5.2 Test . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Appendix A 105
A.1 Symbols and terms . . . . . . . . . . . . . . . . . . . . . . 105
A.1.1 Set and function . . . . . . . . . . . . . . . . . . . 105
A.1.2 Symbols for sum and product . . . . . . . . . . . . 106
A.1.3 Inequality symbol ≫ . . . . . . . . . . . . . . . . 108
Bibliography 121
Index 123
Chapter 1
Example 1.1. Let {0, 1}³ denote the set of all {0, 1}-sequences of length 3:
{0, 1}³ := { ω = (ω₁, ω₂, ω₃) | ωᵢ ∈ {0, 1}, 1 ≤ i ≤ 3 }
= { (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0),
(1, 0, 1), (1, 1, 0), (1, 1, 1) }.
Let 𝔓({0, 1}³) be the power set of {0, 1}³, i.e., the set of all subsets of
{0, 1}³. 1 A ∈ 𝔓({0, 1}³) is equivalent to A ⊂ {0, 1}³. Let #A denote the
number of elements of A. Now, define a function P₃ : 𝔓({0, 1}³) → [0, 1] :=
{ x | 0 ≤ x ≤ 1 } by
P₃(A) := #A / #{0, 1}³ = #A / 2³, A ∈ 𝔓({0, 1}³)
(see Definition A.2), and functions ξᵢ : {0, 1}³ → {0, 1}, i = 1, 2, 3, by
ξᵢ(ω) := ωᵢ, ω = (ω₁, ω₂, ω₃) ∈ {0, 1}³. (1.2)
1 𝔓 is the letter P in the Fraktur typeface.
Readers may suspect that, in the first place, (1.1) is not correct for
real coin tosses. Indeed, rigorously speaking, since Heads and Tails are
carved differently, (1.1) is not exact in the real world. What we call coin
tosses is an idealized model, which can exist only in our mind, just as we
consider the equation (x − a)² + (y − b)² = c² to be a mathematical model
of a circle, although there is no true circle in the real world.
{ (ω, pω) | ω ∈ Ω }
Euclidean space, topological space, Hilbert space, etc. These are sets accompanied by
some structures, operations, or functions. In general, they have nothing to do with the
3-dimensional space where we live.
By the way, in Example 1.2, you may wish to take [0, 1) as the
whole event and P as the probability, but since [0, 1) is an infinite set, it
is not covered by Definition 1.2. By extending the definition of probability
space, it is possible to consider an infinite set as a whole event, but to do
this we need Lebesgue's measure theory, which exceeds the level of this
book.
Since we have
0 ≤ p_{j1,...,jn} ≤ 1, j1 = 1, . . . , s1, . . . , jn = 1, . . . , sn,
Σ_{j1=1,...,s1, ..., jn=1,...,sn} p_{j1,...,jn} = 1
              X1 :   a11     · · ·   a1 s1   | marginal distribution of X2
X2 : a21             p11     · · ·   ps1 1   | Σ_{i=1}^{s1} pi 1
      ⋮               ⋮                ⋮      |        ⋮
     a2 s2           p1 s2   · · ·   ps1 s2  | Σ_{i=1}^{s1} pi s2
As you see, since they are placed in the margins of the table, they are called
marginal distributions.
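The row and column sums that fill the margins of the table can be computed mechanically; a minimal sketch in Python (the joint table below is a made-up 2×2 example, not data from the text):

```python
def marginal_of_x2(joint):
    # joint[j2][j1] = p_{j1 j2}; summing each row over the values of X1
    # gives the marginal distribution of X2 (the right margin of the table)
    return [sum(row) for row in joint]

# hypothetical joint distribution of (X1, X2) with s1 = s2 = 2
joint = [[0.1, 0.3],
         [0.2, 0.4]]
marginal_x2 = marginal_of_x2(joint)              # close to [0.4, 0.6]
marginal_x1 = [sum(col) for col in zip(*joint)]  # close to [0.3, 0.7]
```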
Remark 1.1. Laplace insisted that randomness does not exist and that all
phenomena are deterministic ([Laplace (1812)]). For an intelligence
who knows all about the forces that move substances, and about the
positions and velocities of all the molecules that constitute the substances
at an initial time, and if in addition the intelligence 6 has a vast ability to
analyze the equations of motion, there would be no irregular things in this
world and everything would be deterministic. However, for us, who know
only a little part of the universe and do not have enough ability to analyze
the very complicated equations of motion, things occur as if they were random.
6 This intelligence is often referred to as Laplace's demon.
Suppose that Alice 7 chooses an ω ∈ {0, 1}ⁿ of her own will. When n
is small, she can easily write down a {0, 1}-sequence of length n for ω. For
example, if n = 10, she writes (1, 1, 1, 0, 1, 0, 0, 1, 1, 1). When n = 1000, she
can somehow do it in the same way.
It becomes a problem when n ≫ 1. 8 For example, when n = 10⁸, how
on earth can Alice choose an ω from {0, 1}^{10⁸}? In principle, she should
write down a {0, 1}-sequence of length 10⁸, but this is impossible because 10⁸
is too huge. Considering the hardness of the task, she cannot help using a
computer to choose an ω ∈ {0, 1}^{10⁸}. The computer program that produces
ω = (0, 0, 0, 0, . . . , 0, 0) ∈ {0, 1}^{10⁸}, (1.10)
7 Alice is a fictitious character who performs several thought experiments in this book.
8 a ≫ b means that a is much greater than b. See Sec. A.1.3.
i.e., the run of 0s of length 10⁸, would be simple and easy to write. The
one that produces
ω = (1, 0, 1, 0, . . . , 1, 0) ∈ {0, 1}^{10⁸}, (1.11)
i.e., 5 × 10⁷ repetitions of the pattern 1, 0, would be less simple but still
easy to write. On the other hand, for some ω ∈ {0, 1}^{10⁸}, the program that
produces it would be too long to write in practice. Let us explain this below.
In general, a program is a finite string of letters and symbols, which
is described in a computer as a {0, 1}-sequence of finite length. For each
ω ∈ {0, 1}^{10⁸}, let qω be the shortest program that produces ω, and let
L(qω) denote the length 9 of qω as a {0, 1}-sequence. If ω ≠ ω′, then
qω ≠ qω′. Now, the number of ω for which L(qω) = k is at most 2ᵏ, the
number of all elements of {0, 1}ᵏ. This implies that the total number of
ω ∈ {0, 1}^{10⁸} for which L(qω) ≤ M is at most
Remark 1.3. We have defined randomness for long {0, 1}-sequences, such
as the outcomes of many coin tosses, but we feel even a single coin toss is
random. Why do we feel so?
For example, if we quietly drop a coin with Heads up from a height
of 5 mm above a desk surface, then the outcome will be Heads without
doubt. It is thus possible to control Heads and Tails in this case. On
the other hand, if we drop it from a height of 50 cm, it is not possible.
Since rebounding obeys a deterministic physical law, we can, in principle,
control Heads and Tails even when the coin is dropped from an arbitrarily
high position (Remark 1.1). However, each rebounding motion depends so
acutely on the initial value that a slight difference in initial values causes
a big difference in the results (Fig. 1.2). In other words, to control the
Heads and Tails of a coin dropped from a height of 50 cm, we would have to
measure and set the initial mechanical state of the coin with extremely high
precision, which is impossible in practice. Thus controlling the Heads and
Tails of a coin dropped from a high position is similar to choosing a random
number, in that both are beyond our ability.
Fig. 1.2 Reboundings of a coin (Two simulations with slightly different initial values)
Bernoulli's theorem was published in 1713, after his death. Since then,
limit theorems have been the central theme of probability theory. On the
other hand, the concept of randomness was established in the 1960s. Thus
mathematicians had been studying limit theorems since long before they
defined randomness.
g(ω₀)
The set of ω such that |S(ω)/10⁶ − p| < 1/200
Fig. 1.5 The role of g : {0, 1}²³⁸ → {0, 1}^{10⁸} (Conceptual figure)
Alice can choose any ω₀ ∈ {0, 1}²³⁸ of her own will, or, by tossing a real
coin 238 times, she can get a seed ω₀ by random sampling. Thus, to solve
Exercise I, she does not need a random number; with the pseudorandom
number that g produces from her seed, she can actually get a good
approximate value of p with high probability.
Exercise I deals with a problem about coin tosses, but practical problems
to which the Monte Carlo method is applied are much more complicated.
Nevertheless, since any practical probabilistic problem can be reduced to
that of coin tosses (Sec. 1.5.2), we may assume pseudorandom numbers to
be {0, 1}-sequences.
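The workflow just described, a short random seed expanded by a pseudorandom generator into many coin tosses, can be sketched as follows (Python's `random.Random` stands in for the generator g of the text; the seed value and sample size are arbitrary):

```python
import random

def estimate_heads_probability(seed, n=10**6):
    # Expand a short seed into n pseudorandom coin tosses and
    # return the relative frequency of Heads (= 1).
    rng = random.Random(seed)
    heads = sum(rng.randint(0, 1) for _ in range(n))
    return heads / n

p_hat = estimate_heads_probability(seed=2024)
# |p_hat - 1/2| < 1/200 holds with very high probability
```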
Borel's model of coin tosses (Example 1.2) gives a mathematical model
of not only 3 coin tosses but also arbitrarily many coin tosses. Furthermore,
the sequence of functions {dᵢ}ᵢ₌₁^∞ defined in Example 1.2 can be regarded
as infinitely many coin tosses. Of course, infinitely many coin tosses do not
exist in the real world, but for several reasons it is important to consider them.
The contents of this section slightly exceed the level of this book.
it holds that
Accordingly, we see
P( { x ∈ [0, 1) | X( Σ_{j=1}^∞ 2⁻ʲ d_{i_j}(x) ) < t } ) = F(t), t ∈ R.
Namely, X( Σ_{j=1}^∞ 2⁻ʲ d_{i_j} ) obeys the standard normal distribution.
infinite set and hence this idea exceeds the level of this book. Here the quotation marks
of "random variable" show that this word is not rigorously defined in this book. In what
follows, "independent" will be used similarly.
X₂ := X( 2⁻¹d₂ + 2⁻²d₅ + 2⁻³d₉ + 2⁻⁴d₁₄ + · · · ),
X₃ := X( 2⁻¹d₄ + 2⁻²d₈ + 2⁻³d₁₃ + · · · ),
X₄ := X( 2⁻¹d₇ + 2⁻²d₁₂ + · · · ),
X₅ := X( 2⁻¹d₁₁ + · · · ),
⋮
then each Xₙ obeys the standard normal distribution. We emphasize that
each dₖ appears in only one Xₙ, which means that the value of each Xₙ has
no influence on any other Xₙ′ (n′ ≠ n). Namely, {Xₙ}ₙ₌₁^∞ are
independent.
Now we are in a position to define a Brownian motion {Bₜ}₀≤ₜ≤π
(Fig. 1.6):
Bₜ := X₁ t/√π + √(2/π) Σ_{n=1}^∞ Xₙ₊₁ sin(nt)/n, 0 ≤ t ≤ π. (1.17)
To tell the truth, the graph of Fig. 1.6 is not exactly the Brownian
motion (1.17) itself, but its approximation 18 {B̃ₜ}₀≤ₜ≤π. Let {Xₙ}ₙ₌₁¹⁰⁰⁰ be
approximated by {X̃ₙ}ₙ₌₁¹⁰⁰⁰, where
X̃₁ := X( 2⁻¹d₁ + 2⁻²d₂ + 2⁻³d₃ + · · · + 2⁻³¹d₃₁ ),
⋮
X̃₁₀₀₀ := X( 2⁻¹d₃₀₉₇₀ + 2⁻²d₃₀₉₇₁ + 2⁻³d₃₀₉₇₂ + · · · + 2⁻³¹d₃₁₀₀₀ ),
18 To be precise, approximation in the sense of distribution.
19 The sample of 31,000 coin tosses is produced by a pseudorandom generator in [Sugita
(2011)] Sec. 4.2.
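A sketch of this construction in Python (with `random.gauss` standing in for the coin-toss-based normal variables Xₙ of the text, and the series truncated at 1000 terms as above):

```python
import math
import random

def brownian_path(n_terms=1000, n_points=200, seed=0):
    # x[0] plays the role of X_1, x[n] of X_{n+1}
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_terms + 1)]
    path = []
    for i in range(n_points + 1):
        t = math.pi * i / n_points
        # truncated Fourier series (1.17)
        b = x[0] * t / math.sqrt(math.pi)
        b += math.sqrt(2.0 / math.pi) * sum(
            x[n] * math.sin(n * t) / n for n in range(1, n_terms + 1))
        path.append(b)
    return path

path = brownian_path()  # sample points of an approximate Brownian path on [0, pi]
```

Since every term vanishes at t = 0, the path starts at B₀ = 0, as a Brownian motion should.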
(Fig. 1.6: a sample path of the approximated Brownian motion {B̃ₜ})
Chapter 2
Random number
The kinds of data dealt with by a computer are diverse: as input, data
from a keyboard, mouse, scanner, or video camera; as output, documents,
images, sounds, movies, control signals for IT devices, etc. They are all
converted into finite {0, 1}-sequences, and then recorded in computer
memory or on disks (Fig. 2.1), copied, or transmitted.
0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0
Each boundary between a flat place and a hollow place records a 1; other places record 0s.
Fig. 2.1 Images of CD (left) and DVD (right) by scanning electron microscope 1
Proposition 2.1. The set { f | f : N → {0, 1} }, i.e., the set of all functions
from N to {0, 1}, is an uncountable set.
f₀(0)  f₀(1)  f₀(2)  f₀(3)  f₀(4)  · · ·
f₁(0)  f₁(1)  f₁(2)  f₁(3)  f₁(4)  · · ·
f₂(0)  f₂(1)  f₂(2)  f₂(3)  f₂(4)  · · ·
f₃(0)  f₃(1)  f₃(2)  f₃(3)  f₃(4)  · · ·
f₄(0)  f₄(1)  f₄(2)  f₄(3)  f₄(4)  · · ·
  ⋮      ⋮      ⋮      ⋮      ⋮
(The diagonal entries f₀(0), f₁(1), f₂(2), . . . are the ones flipped to build a
function outside the list.)
The above proof is called the diagonal method. The most familiar
uncountable set is the set of all real numbers R; this fact and its proof
can be found in most textbooks on analysis or set theory.
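The diagonal method is constructive enough to write down directly; a small sketch for a finite list of 0/1-valued functions:

```python
def diagonal_flip(fs):
    # Given 0/1-valued functions f_0, ..., f_{m-1}, return g with
    # g(n) = 1 - f_n(n), so g differs from every f_n at the argument n.
    return lambda n: 1 - fs[n](n)

fs = [lambda n: 0, lambda n: n % 2, lambda n: 1]
g = diagonal_flip(fs)
assert all(g(n) != fs[n](n) for n in range(len(fs)))
```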
{g0 , 0 , g1 , 1 , g2 , 2 , . . .}
There were many discussions about the definition of computable function
until a consensus was reached in the 1930s: we can compute recursive
functions (more precisely, primitive recursive functions, partial recursive
functions, and total recursive functions) and nothing else. The set of all
recursive functions coincides with the set of all functions that the Turing
machine 3 can compute. Any action of a real computer can be described
by recursive functions. It is amazing that all of the diverse, complicated
actions of computers are just combinations of a small number of basic
operations.
In this subsection, we introduce the definitions of primitive recursive
function, partial recursive function, and total recursive function, but we do
not develop rigorous arguments here.
zero : N⁰ → N, zero( ) := 0,
suc : N → N, suc(x) := x + 1,
projⁿᵢ : Nⁿ → N, projⁿᵢ(x₁, . . . , xₙ) := xᵢ, i = 1, . . . , n.
(ii) (Composition)
For g : Nᵐ → N and gⱼ : Nⁿ → N, j = 1, . . . , m, we define f : Nⁿ → N by
f(x₁, . . . , xₙ) := g(g₁(x₁, . . . , xₙ), . . . , gₘ(x₁, . . . , xₙ)).
This operation is called composition.
(iii) (Recursion)
For g : Nⁿ → N and h : Nⁿ⁺² → N, we define f : Nⁿ⁺¹ → N by
f(x₁, . . . , xₙ, 0) := g(x₁, . . . , xₙ),
f(x₁, . . . , xₙ, y + 1) := h(x₁, . . . , xₙ, y, f(x₁, . . . , xₙ, y)).
This operation is called recursion.
(iv) A function Nⁿ → N is called a primitive recursive function if and only if
it is a basic function or a function obtained from basic functions by applying
finite combinations of composition and recursion.
Example 2.1. The two-variable sum add(x, y) = x + y is a primitive recursive
function. Indeed, it is defined by
add(x, 0) := proj¹₁(x) = x,
add(x, y + 1) := proj³₃(x, y, suc(add(x, y))).
The two-variable product mult(x, y) = xy is also a primitive recursive
function. Indeed, it is defined by
mult(x, 0) := proj²₂(x, zero( )) = 0,
mult(x, y + 1) := add(proj²₁(x, y), mult(x, y)).
Using the primitive recursive function pred(x) = max{x − 1, 0} 5 :
pred(0) := zero( ) = 0,
pred(y + 1) := proj²₁(y, pred(y)),
we define the two-variable difference sub(x, y) = max{x − y, 0} by
sub(x, 0) := proj¹₁(x) = x,
sub(x, y + 1) := pred(proj³₃(x, y, sub(x, y))),
and then it is a primitive recursive function.
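The definitions of Example 2.1 translate almost verbatim into code; a sketch in Python (ordinary recursion stands in for the recursion scheme (iii), and `proj` is 1-indexed as in the text):

```python
def zero():
    return 0

def suc(x):
    return x + 1

def proj(i):
    # proj(i)(x1, ..., xn) = xi, 1-indexed as in the text
    return lambda *xs: xs[i - 1]

def add(x, y):
    if y == 0:
        return proj(1)(x)
    return proj(3)(x, y - 1, suc(add(x, y - 1)))

def mult(x, y):
    if y == 0:
        return zero()
    return add(proj(1)(x, y - 1), mult(x, y - 1))

def pred(x):
    if x == 0:
        return zero()
    return proj(1)(x - 1, pred(x - 1))

def sub(x, y):
    # truncated difference max{x - y, 0}
    if y == 0:
        return proj(1)(x)
    return pred(proj(3)(x, y - 1, sub(x, y - 1)))
```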
(1) y := 0.
(2) If p(x1 , . . . , xn , y) = 0, then output y and halt.
(3) Increase y by 1, and go to (2).
Theorem 2.1. (Kleene's normal form) For any partial recursive function
f : Nⁿ → N, there exist two primitive recursive functions g, p : Nⁿ⁺¹ → N
such that
f(x₁, x₂, . . . , xₙ) = g(x₁, x₂, . . . , xₙ, μy(p(x₁, x₂, . . . , xₙ, y))),
(x₁, x₂, . . . , xₙ) ∈ Nⁿ. (2.1)
(Fig. 2.2, Flow chart I) Input(x₁, . . . , xₙ); z := 0, then:
⟨1⟩ A? : yes → ⟨2⟩ B; no → ⟨0⟩ Output(z)
⟨3⟩ C? : yes → ⟨4⟩ D?; no → ⟨5⟩ E, then back to C?
⟨4⟩ D? : yes → back to A?; no → Output(z)
Let Fig. 2.2 (Flow chart I) be a flow chart to compute the function f.
⟨A?⟩, ⟨C?⟩, ⟨D?⟩ denote conditional branches, and ⟨B⟩, ⟨E⟩ are procedures
without loops (i.e., calculations of primitive recursive functions) that set
the value of z. This program includes the main loop A? → B → C? →
D? → A?, a nested loop C? → E → C?, and an
7 The subsections marked with (∗) can be skipped.
8 Flow charts I, II are slight modifications of Fig. 3 (p. 12) and Fig. 4 (p. 13) of [Takahashi
(1991)], respectively.
escape branch from the main loop at D?. Let us show that these loops
can be rearranged into a single loop by introducing a new variable u. To
do this, we put the numbers 0 to 5 at the top left of the boxes of the
procedures A?, C?, D?, B, E, and the output procedure, for the variable u
to refer to (Fig. 2.2).
(Fig. 2.3, Flow chart II) Input(x₁, . . . , xₙ); z := 0; u := 1; then repeat the
following single loop Q until u = 0:
u = 0 : Output(z) and halt
u = 1 : evaluate A?; if yes, u := 2; if no, u := 0
u = 2 : do B; u := 3
u = 3 : evaluate C?; if yes, u := 4; if no, u := 5
u = 4 : evaluate D?; if yes, u := 1; if no, u := 0
u = 5 : do E; u := 3
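The rearrangement can be imitated in code: every loop of the original chart becomes one branch of a single while-loop driven by the state variable u. A toy instance (the condition and body here are invented for illustration; the real chart's A?, B, etc. depend on f):

```python
def flattened(x):
    # Structured original:
    #     z = 0
    #     while z * z < x:   # A?
    #         z += 1         # B
    #     return z           # the least z with z*z >= x
    # Single-loop version with state variable u (0 = halt, 1 = A?, 2 = B):
    z, u = 0, 1
    while u != 0:
        if u == 1:
            u = 2 if z * z < x else 0
        else:  # u == 2
            z += 1
            u = 1
    return z

assert flattened(10) == 4
```

The point is that the flattened form has exactly one loop, which is what allows the whole computation to be packed into a single μ-operator search as in (2.1).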
Fig. 2.3 (Flow chart II) shows the rearrangement of Flow chart I. It is
easy to confirm that Flow chart II computes the same function f. Let
us show that f, which Flow chart II computes, can be expressed in the
form (2.1). Let Q be the process consisting of all the procedures enclosed by
the thick lines in Flow chart II. Define g(x₁, . . . , xₙ, y) as the value of the
or Theorem 2.7, self-referential versions of the diagonal method are used. The proof below
is self-referential in that we substitute the Gödel number e for the first argument of
the function defined by (2.2).
prime numbers.
Definition 2.3. Let {0, 1}* := ⋃_{n∈N} {0, 1}ⁿ. Namely, {0, 1}* is the set
of all finite {0, 1}-sequences. An element of {0, 1}*, i.e., a finite {0, 1}-
sequence, is called a word. In particular, the {0, 1}-sequence of length 0
is called the empty word. The canonical order in {0, 1}* is defined in the
following way: for x, y ∈ {0, 1}*, if x is longer than y, then define x > y; if
x and y have the same length, then define the order by regarding them as
binary integers. We identify {0, 1}* with N by the canonical order. For
example, the empty word = 0, (0) = 1, (1) = 2, (0, 0) = 3, (0, 1) = 4, (1, 0) = 5,
(1, 1) = 6, (0, 0, 0) = 7, . . . .
Definition 2.4. For each q ∈ {0, 1}*, let L(q) ∈ N denote the n such
that q ∈ {0, 1}ⁿ, i.e., L(q) is the length of q. For q ∈ N, L(q) means the
length of the {0, 1}-sequence corresponding to q in the canonical order. For
example, L(5) = L((1, 0)) = 2. In general, L(q) is equal to the integer part
of log₂(q + 1), i.e., L(q) = ⌊log₂(q + 1)⌋.
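The canonical order and the length formula are easy to check in code; a sketch (the function names are ours):

```python
def word_to_int(w):
    # canonical order: shorter words come first, ties broken by binary value;
    # the word of length n with binary value v gets the number (2^n - 1) + v
    value = 0
    for bit in w:
        value = 2 * value + bit
    return (2 ** len(w) - 1) + value

def int_to_word(q):
    n = (q + 1).bit_length() - 1          # n = floor(log2(q + 1)) = L(q)
    value = q - (2 ** n - 1)
    return tuple((value >> (n - 1 - i)) & 1 for i in range(n))

assert word_to_int((1, 0)) == 5
assert int_to_word(5) == (1, 0)
assert (5 + 1).bit_length() - 1 == 2      # L(5) = L((1, 0)) = 2
```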
For example, for (1) ∈ {0, 1}¹ and (1, 1, 0, 1, 1) ∈ {0, 1}⁵, we have
⟨(1), (1, 1, 0, 1, 1)⟩ = (1, 1, 0, 1, 1, 1, 0, 1, 1) =: u ∈ {0, 1}⁹.
Then, (u)²₁ = (1) and (u)²₂ = (1, 1, 0, 1, 1). At the same time, since
⟨(1), (1), (1)⟩ = u, we have (u)³₁ = (u)³₂ = (u)³₃ = (1).
When K_{A₀}(x) ≫ c_{A₀A}, even if K_{A₀}(x) is greater than K_A(x), the
difference is relatively small. Therefore, for x such that K_{A₀}(x) ≫ c_{A₀A},
either K_{A₀}(x) is less than K_A(x), or K_{A₀}(x) is only slightly greater
than K_A(x). In this sense, A₀ is also called an asymptotically optimal
algorithm.
Let A₀′ and A₀″ be two universal algorithms. 12 Then, putting c :=
max{c_{A₀″A₀′}, c_{A₀′A₀″}}, we have
|K_{A₀′}(x) − K_{A₀″}(x)| < c, x ∈ {0, 1}*. (2.5)
This means that when K_{A₀′}(x) or K_{A₀″}(x) is much greater than c, their
difference can be ignored.
Definition 2.6. We fix a universal algorithm A₀, and define
K(x) := K_{A₀}(x), x ∈ {0, 1}*.
We call K(x) the Kolmogorov complexity 13 of x.
Theorem 2.6. (i) There exists a constant c > 0 such that
K(x) ≤ n + c, x ∈ {0, 1}ⁿ, n ∈ N⁺.
In particular, K : {0, 1}* → N is a total function.
(ii) If n > c′ > 0, then we have
#{ x ∈ {0, 1}ⁿ | K(x) ≥ n − c′ } > 2ⁿ − 2^{n−c′}.
Proof. (i) For the algorithm A(x) := proj¹₁(x) = x, we have K_A(x) = n
for x ∈ {0, 1}ⁿ. Consequently, Theorem 2.5 implies K(x) ≤ n + c for some
constant c > 0. (ii) The number of q such that L(q) < n − c′ is equal to
2⁰ + 2¹ + · · · + 2^{n−c′−1} = 2^{n−c′} − 1,
and hence the number of x ∈ {0, 1}ⁿ such that K(x) < n − c′ is at most
2^{n−c′} − 1. From this, (ii) follows.
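True K(x) is uncomputable (Theorem 2.7 below), but a practical compressor gives a computable stand-in that illustrates the counting phenomenon of (ii): almost all words are incompressible. A sketch using zlib (the byte counts are what this particular compressor happens to produce, not complexity values):

```python
import random
import zlib

n_bytes = 1250                                   # about 10^4 bits
rng = random.Random(0)
random_data = bytes(rng.randrange(256) for _ in range(n_bytes))
zero_data = bytes(n_bytes)                       # the "run of 0s"

# The highly regular word compresses to almost nothing ...
assert len(zlib.compress(zero_data, 9)) < 50
# ... while a typical word does not compress at all
assert len(zlib.compress(random_data, 9)) > n_bytes - 100
```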
12 Since there is more than one enumerating function, there is more than one universal
algorithm.
13 It is also called the Kolmogorov-Chaitin complexity, algorithmic complexity, descrip-
Remark 2.1. Since there is more than one universal algorithm, and the
differences among them involve the ambiguity (2.5), the definition of random
number cannot avoid some ambiguity. 14
As is seen in Example 2.3, we know many x ∈ {0, 1}* that are not random.
However, we know no concrete example of a random number. Indeed,
the following theorem implies that there is no algorithm to judge whether
a given x ∈ {0, 1}ⁿ, n ≫ 1, is random or not.
K(x) is a total function, i.e., it is defined for all x ∈ {0, 1}*, but there
is no program that computes it. Thus "definable" and "computable" are
different concepts.
Theorem 2.7 is deeply related to Theorem 2.4. Consider the function
complexity(x) defined below. In its definition, A₀ is the universal algorithm
that appeared in the proof of Theorem 2.5, which is assumed to be the one
used in the definition of K(x).
complexity(x)
(1) l := 0.
(2) Let q be the first word of {0, 1}l .
(3) If A0 (q) = x, then output l and halt.
(4) If q is the last word of {0, 1}l , increase l by 1, and go to (2).
(5) Assign q the next word of it, and go to (3).
Starting from the shortest program, the function complexity(x) executes
every program q to check whether it computes x, and if it does,
complexity(x) halts with output K(x). However, this program does not
necessarily halt. Indeed, for some x, it must fall into an infinite loop at
step (3) before K(x) is computed, and this cannot be detected in advance
because of Theorem 2.4. 15
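To see why the search halts exactly when the machine is total, here is a toy variant in which A₀ is replaced by a contrived total function (our invention: programs either copy their tail literally, or emit a run of 0s). For this toy machine the exhaustive search always terminates, whereas for the genuine partial A₀ it need not:

```python
from itertools import product

def toy_a0(q):
    # Toy total machine: () -> empty word; (0, w...) -> w literally;
    # (1, b...) -> run of 0s whose length is b... read as a binary number.
    if not q:
        return ()
    if q[0] == 0:
        return tuple(q[1:])
    n = 0
    for b in q[1:]:
        n = 2 * n + b
    return (0,) * n

def toy_complexity(x):
    # Search programs in canonical order, shortest first (cf. steps (1)-(5)).
    l = 0
    while True:
        for q in product((0, 1), repeat=l):
            if toy_a0(q) == x:
                return l
        l += 1

assert toy_complexity((0,) * 100) == 8   # run mode: 1 + 7 bits encoding "100"
assert toy_complexity((1, 0, 1)) == 4    # literal mode: 1 + 3 bits
```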
Suppose, for example, that a large photo image of beautiful scenery is
stored in a computer as a long word x. Since it is far from a random image,
x is by no means a random number, i.e., K(x) ≪ L(x). This means that
there is a qₓ ∈ {0, 1}* such that A₀(qₓ) = x and K(x) = L(qₓ). Then,
storing qₓ instead of x considerably saves computer memory. This is
the principle of data compression: the raw data x is compressed into qₓ,
and A₀ expands qₓ back into x. Unfortunately, we cannot compute qₓ from x
because K(x) is not computable. In practice, alternative methods
are used for data compression. 16
Proof. We show the theorem by contradiction. Suppose that there are only
finitely many prime numbers, say p₁, p₂, . . . , pₖ. Then, define a primitive
recursive function A : N → N by
A(⟨e₁, e₂, . . . , eₖ⟩) := p₁^{e₁} p₂^{e₂} · · · pₖ^{eₖ}.
For an arbitrary m ∈ N⁺, there exist e₁(m), e₂(m), . . . , eₖ(m) ∈ N such
that
A(⟨e₁(m), e₂(m), . . . , eₖ(m)⟩) = m.
Consequently,
K_A(m) ≤ L(⟨e₁(m), . . . , eₖ(m)⟩)
= 2L(e₁(m)) + 2 + · · · + 2L(eₖ₋₁(m)) + 2 + L(eₖ(m)).
For each i, we have eᵢ(m) ≤ log_{pᵢ} m ≤ log₂ m, and hence L(eᵢ(m)) ≤
log₂(log₂ m + 1). This shows that
K_A(m) ≤ (2k − 1) log₂(log₂ m + 1) + 2(k − 1).
Therefore there exists a c ∈ N⁺ such that
K(m) ≤ (2k − 1) log₂(log₂ m + 1) + 2(k − 1) + c, m ∈ N⁺.
If m ≫ 1 is a random number, K(m) is close to log₂ m. Thus the above
inequality does not hold for large random numbers m, which is a contradiction.
Of course, this proof is much more difficult than Euclid's well-known proof.
However, by polishing it, we can unexpectedly obtain deep and interesting
knowledge about the distribution of prime numbers.
Theorem 2.9. ([Hardy and Wright (1979)], Theorem 8) Let pₙ be the n-th
smallest prime number. Then, as n → ∞, we have
pₙ ∼ n log n. (2.7)
Here ∼ indicates that the ratio of the two sides converges to 1.
Substituting (2.9) into the inequality (2.8), we get the following estimate,
where c, c′, c″, c‴ ∈ N⁺ are constants independent of n and m:
K(m) ≤ 2 log₂ log₂ log₂ n + 2 + log₂ log₂ n + log₂ n + log₂ m − log₂ pₙ + c.
Chapter 3
Limit theorem
When we toss a coin, it comes up Heads with probability 1/2 and Tails
with probability 1/2, but this does not mean that the numbers of Heads
and Tails in 100 coin tosses are always exactly 50 and 50.
The following is a real record of Heads (= 1) and Tails (= 0) from the
author's trial of 100 coin tosses.
Proof. The number of y ∈ {0, 1}ⁿ that satisfy Σᵢ₌₁ⁿ ξᵢ(y) = np is the
binomial coefficient
C(n, np) = n! / ((n − np)! (np)!).
Suppose that, among those y, the given ω is the n₁-th smallest in the
canonical order. Let A : {0, 1}* → {0, 1}* be an algorithm that computes from
⟨n, m, l⟩ the l-th smallest word in the canonical order among the words
y ∈ {0, 1}ⁿ that satisfy Σᵢ₌₁ⁿ ξᵢ(y) = m. Then, we have
A(⟨n, np, n₁⟩) = ω.
Consequently,
K_A(ω) ≤ L(⟨n, np, n₁⟩)
= 2L(n) + 2 + 2L(np) + 2 + L(n₁)
≤ 4L(n) + L(n₁) + 4
≤ 4L(n) + L(C(n, np)) + 4,
and hence
C(n, np) ≤ p^{−np} (1 − p)^{−(n−np)} = 2^{−np log₂ p − (n−np) log₂(1−p)} = 2^{nH(p)}.
(3.2)
Then, K_A(ω) is estimated as
The entropy function H(p) has the following properties (Fig. 3.1):
(i) 0 ≤ H(p) ≤ 1, and H(p) takes its maximum value 1 at p = 1/2.
(ii) For 0 < ε < 1/2, we have 0 < H(1/2 + ε) = H(1/2 − ε) < 1, the common
Remark 3.2. Since Theorem 2.6 (ii) is also valid for the computational
complexity K_A depending on A, according to (3.3), the constant c of (3.6)
can be taken as 9. Then, putting c = 9, ε = 1/2000 and n = 10⁸, (3.6) is
now
P₁₀⁸( | (ξ₁ + · · · + ξ₁₀⁸)/10⁸ − 1/2 | > 1/2000 ) ≤ 1.9502 × 10¹³. (3.7)
Since the left-hand side is less than 1, this is unfortunately a meaningless
inequality.
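For reference, the entropy function appearing in these bounds can be evaluated directly, and the properties (i) and (ii) of Fig. 3.1 can be spot-checked (assuming the standard definition H(p) = −p log₂ p − (1 − p) log₂(1 − p), which is what (3.2) uses):

```python
import math

def H(p):
    # entropy function, 0 < p < 1
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

assert H(0.5) == 1.0                      # maximum value 1 at p = 1/2
assert abs(H(0.25) - H(0.75)) < 1e-12     # symmetry H(1/2 + e) = H(1/2 - e)
assert 0 < H(0.1) < 1
```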
The assertion of Bernoulli's theorem is valid not only for coin tosses but
also for more general sequences of random variables. Such extensions of
Bernoulli's theorem are called the law of large numbers. In this book, we
present the most basic case: the law of large numbers for sequences of
independent, identically distributed random variables.
Although it is called a "law," it is of course a mathematical theorem. 2
The law of large numbers, which entered the stage when probability was
not yet recognized in mathematics, was once considered a natural law like the
law of inertia.
It took more than 200 years after Bernoulli's age for probability to be
recognized in mathematics. As the 6th of his 23 problems, Hilbert proposed
the problem of an axiomatic treatment of probability, with limit theorems
for the foundation of statistical physics, at the Paris conference of the
International Congress of Mathematicians in 1900. Probability has been
widely recognized in mathematics since Borel presented the normal number
theorem (1.16) in 1909. Finally, the axiomatization was accomplished by
[Kolmogorov (1933)].
2 Poisson named it the law of large numbers.
Remark 3.3. On the probability space ({0, 1}², 𝔓({0, 1}²), P₂), consider
the following three events:
A := {(0, 0), (0, 1)}, B := {(0, 1), (1, 1)}, C := {(0, 0), (1, 1)}.
Then, since
P₂(A ∩ B) = P₂(A)P₂(B) = 1/4,
P₂(B ∩ C) = P₂(B)P₂(C) = 1/4,
P₂(C ∩ A) = P₂(C)P₂(A) = 1/4,
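The pairwise products above can be verified by direct enumeration, along with the standard observation that these three events, while pairwise independent, are not mutually independent (since A ∩ B ∩ C = ∅):

```python
from itertools import product

omega = set(product((0, 1), repeat=2))
A = {(0, 0), (0, 1)}
B = {(0, 1), (1, 1)}
C = {(0, 0), (1, 1)}
P = lambda S: len(S) / len(omega)

# pairwise independent, as computed in the text
assert P(A & B) == P(A) * P(B) == 0.25
assert P(B & C) == P(B) * P(C) == 0.25
assert P(C & A) == P(C) * P(A) == 0.25
# but not mutually independent
assert P(A & B & C) == 0.0 != P(A) * P(B) * P(C)
```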
{ cᵢ < Xᵢ ≤ dᵢ }, i = 1, . . . , n
are independent.
for any jᵢ = 1, . . . , sᵢ, i = 1, . . . , n.
P(cᵢ < Xᵢ ≤ dᵢ, i = 1, . . . , n)
= Σ_{j1: c1<a_{1j1}≤d1} · · · Σ_{jn: cn<a_{njn}≤dn} P(Xᵢ = a_{iji}, i = 1, . . . , n)
= Σ_{j1: c1<a_{1j1}≤d1} · · · Σ_{jn: cn<a_{njn}≤dn} Πᵢ₌₁ⁿ P(Xᵢ = a_{iji}).
P(cₖ < Xₖ ≤ dₖ) = 1, k ∉ I,
and hence
Xᵢ(ω) := X̄ᵢ(ωᵢ), ω = (ω₁, . . . , ωₙ) ∈ Ω, i = 1, . . . , n.
Here the probability measure is written as
μ = μ₁ ⊗ · · · ⊗ μₙ,
and is called the product probability measure of μ₁, . . . , μₙ.
V[X] = E[X²] − E[X]².
Indeed,
V[X] = E[(X − E[X])²]
= E[X²] − E[2X E[X]] + E[E[X]²]
= E[X²] − 2E[X]E[X] + E[X]²
= E[X²] − E[X]².
E[X₁ · · · Xₙ]
= Σ_{j1=1}^{s1} · · · Σ_{jn=1}^{sn} a_{1j1} · · · a_{njn} P(X₁ = a_{1j1}, . . . , Xₙ = a_{njn})
= Σ_{j1=1}^{s1} · · · Σ_{jn=1}^{sn} a_{1j1} · · · a_{njn} P(X₁ = a_{1j1}) · · · P(Xₙ = a_{njn})
= ( Σ_{j1=1}^{s1} a_{1j1} P(X₁ = a_{1j1}) ) · · · ( Σ_{jn=1}^{sn} a_{njn} P(Xₙ = a_{njn}) )
= E[X₁] · · · E[Xₙ],
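The manipulation above can be sanity-checked numerically for two independent random variables with made-up distributions:

```python
# hypothetical distributions of independent X and Y
ax, px = [1.0, 2.0, 5.0], [0.2, 0.5, 0.3]
ay, py = [0.0, 4.0], [0.6, 0.4]

E_X = sum(a * p for a, p in zip(ax, px))
E_Y = sum(b * q for b, q in zip(ay, py))
# E[XY] over the product distribution P(X=a, Y=b) = P(X=a) P(Y=b)
E_XY = sum(a * b * p * q
           for a, p in zip(ax, px)
           for b, q in zip(ay, py))
assert abs(E_XY - E_X * E_Y) < 1e-12
```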
from the strong law of large numbers. The prototype of the latter is Borel's normal
number theorem (1.16).
Example 3.4. Let 0 < p < 1. In Example 3.2, we introduced the probability
space ({0, 1}ⁿ, 𝔓({0, 1}ⁿ), Pₙ⁽ᵖ⁾) and the coordinate functions {ξₖ}ₖ₌₁ⁿ
as n coin tosses, which are unfair if p ≠ 1/2. Since we have
E[ξₖ] = p
and Proposition 3.4 implies
V[ξₖ] = E[ξₖ²] − E[ξₖ]² = p − p² = p(1 − p),
As we can imagine from Theorem 2.6 (ii), Theorem 3.1 and Example 3.4,
it is known that for n ≫ 1, the probability that K(ω) ≈ nH(p), ω ∈ {0, 1}ⁿ,
is close to 1 under the probability measure Pₙ⁽ᵖ⁾:
lim_{n→∞} Pₙ⁽ᵖ⁾( ω ∈ {0, 1}ⁿ | | K(ω)/n − H(p) | ≥ ε ) = 0, ε > 0.
If H(p) is not extremely small, then for n ≫ 1, nH(p) is so huge that
even a computer cannot produce any ω ∈ {0, 1}ⁿ with K(ω) ≈ nH(p).
Strictly speaking, such an ω is not a random number, but we may say it is
random in the usual sense. Consequently, a long {0, 1}-sequence generated by
unfair coin tosses may well look random with high probability.
Example 3.5. On the probability space ({0, 1}²ⁿ, 𝔓({0, 1}²ⁿ), P₂ₙ), let us
consider the random variables
Xₖ := ξ₂ₖ₋₁ ξ₂ₖ, k = 1, . . . , n.
Here {ξₖ}ₖ₌₁²ⁿ are the coordinate functions. Then, {Xₖ}ₖ₌₁ⁿ are
independent, and we have
P₂ₙ(Xₖ = 1) = 1/4, P₂ₙ(Xₖ = 0) = 3/4, k = 1, . . . , n.
Namely, {Xₖ}ₖ₌₁ⁿ are n unfair coin tosses with p = 1/4. Therefore, by
Example 3.4, we see
lim_{n→∞} P₂ₙ( | (X₁ + · · · + Xₙ)/n − 1/4 | ≥ ε ) = 0, ε > 0.
This means that in fair (p = 1/2) coin tosses, the relative frequency of
occurrences of "Heads, Heads" (= 11) approaches 1/4 as n → ∞ with high
probability.
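A quick simulation of Example 3.5 (Python's generator replaces the fair coin; the sample size and seed are arbitrary):

```python
import random

def heads_heads_frequency(n_pairs, seed=1):
    # X_k = 1 exactly when the (2k-1)-th and the 2k-th tosses are both Heads
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        a = rng.randint(0, 1)
        b = rng.randint(0, 1)
        hits += a * b
    return hits / n_pairs

freq = heads_heads_frequency(100_000)
assert abs(freq - 0.25) < 0.01
```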
the rate function. Deriving a rate function from a moment generating function by the
formula (3.20) is called the Legendre transformation, which plays important roles in
convex analysis and analytical mechanics.
Remark 3.5. Note that min_{t≥0} u(t) or max_{t≥0} u(t) does not always exist.
For example, the former does not exist for u(t) = 1/(1 + t). In this book,
we deal only with cases where they exist.
= E[ e^{tξ₁} ] · · · E[ e^{tξₙ} ]
= ( E[ e^{tξ₁} ] )ⁿ
= ( e⁰ Pₙ⁽ᵖ⁾(ξ₁ = 0) + eᵗ Pₙ⁽ᵖ⁾(ξ₁ = 1) )ⁿ
= ( (1 − p) + eᵗ p )ⁿ
= Σₖ₌₀ⁿ C(n, k) (eᵗ p)ᵏ (1 − p)ⁿ⁻ᵏ
= Σₖ₌₀ⁿ eᵗᵏ C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ,
where C(n, k) denotes the binomial coefficient.
62 Limit theorem
Fig. 3.3 The histogram of the binomial distribution (Left: n = 30, Right: n = 100)
Fig. 3.4 The histogram of the distribution of Zₙ (Left: n = 30, Right: n = 100)
= ( 1 − r/(n − r + 1) ) · (√n / 2)
= ( (n − 2r + 1)/(n − r + 1) ) · (√n / 2).
By (3.25), we have r = \left\lfloor \frac{1}{2}\left(\sqrt{n}\,x + n\right) \right\rfloor, and hence
\frac{n - (\sqrt{n}\,x + n) - 1}{\frac{1}{2}(\sqrt{n}\,x + n) + 1}\cdot\frac{\sqrt{n}}{2}
= \frac{-nx - \sqrt{n}}{n + \sqrt{n}\,x + 2} \longrightarrow -x, \qquad n \to \infty.
From this and (3.28), we derive a differential equation:
\frac{f'(x)}{f(x)} = -x.
Integrating both sides, we obtain
\log f(x) = -\frac{x^2}{2} + C \qquad (C \text{ is an integration constant}).
Namely,
f(x) = \exp\left(-\frac{x^2}{2} + C\right) = e^C \exp\left(-\frac{x^2}{2}\right).
By (3.25) and (3.26), the constant e^C should be
e^C = f(0) = \lim_{n\to\infty} f_{2n}(0) = \lim_{n\to\infty} \binom{2n}{n} 2^{-2n}\cdot\frac{1}{2}\sqrt{2n}.
By Wallis' formula (Corollary 3.1), we see e^C = 1/\sqrt{2\pi}, and hence we
obtain
f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \qquad (3.29)
This is the density function of the standard normal distribution (or the
standard Gaussian distribution, Fig. 3.5).
In the above reasoning, (3.28) has no logical basis. Nevertheless, in fact,
de Moivre-Laplace's theorem (Theorem 3.6) holds. 5
Fig. 3.5 The graph of \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)
6 At high school, this is taken as the definition of the definite integral; at university, the definite
integral is defined in another way, and this formula is proved as the fundamental theorem
of calculus.
Example 3.7. (i) Applying Taylor's formula (3.32) to f(x) = \log(1+x),
and putting a = 0 and b = x, we obtain
\log(1+x) = x - \frac{1}{2}x^2 + \int_0^x \frac{(x-s)^2}{(1+s)^3}\,ds, \qquad -1 < x. \qquad (3.33)
Modifying this a little, for a > 0, we have
\log(a+x) = \log a + \frac{x}{a} - \frac{1}{2}\left(\frac{x}{a}\right)^2 + \int_0^{x/a} \frac{\left(\frac{x}{a} - s\right)^2}{(1+s)^3}\,ds, \qquad -a < x. \qquad (3.34)
(ii) Applying Taylor's formula (3.32) to f(x) = e^x, and putting a = 0,
b = x, we obtain
e^x = 1 + x + \frac{1}{2}x^2 + \int_0^x \frac{(x-s)^2}{2}\,e^s\,ds, \qquad x \in \mathbb{R}. \qquad (3.35)
purpose. For example, if |x| \ll 1, the integral terms in (3.33) and (3.35)
are very small, so that the following approximation formulas hold:
\log(1+x) \approx x - \frac{1}{2}x^2, \qquad e^x \approx 1 + x + \frac{1}{2}x^2.
In this sense, the integral term of Taylor's formula (3.30) is called the
remainder term. 7
The remainder term often becomes small as n \to \infty. For example, in the
formula
e^x = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \int_0^x \frac{(x-s)^n}{n!}\,e^s\,ds, \qquad x \in \mathbb{R},
the remainder term converges to 0 for all x \in \mathbb{R} as n \to \infty:
\left|\int_0^x \frac{(x-s)^n}{n!}\,e^s\,ds\right| \le \left|\int_0^x \frac{|x-s|^n}{n!}\,\max\{e^x, e^0\}\,ds\right|
\le \max\{e^x, 1\}\left|\int_0^x \frac{|x|^n}{n!}\,ds\right|
= \max\{e^x, 1\}\,|x|\,\frac{|x|^n}{n!} \longrightarrow 0, \qquad n \to \infty.
For the last convergence, see Proposition A.4. In other words, for all x \in \mathbb{R},
we have
1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} \longrightarrow e^x, \qquad n \to \infty.
Remark 3.6. In what follows, inequalities like those in the previous paragraph
will appear often. Basically, they are applications of the following
inequality:
\left|\int_A^B f(t)\,dt\right| \le \left|\int_A^B |f(t)|\,dt\right| \le |A - B| \max_{\min\{A,B\} \le t \le \max\{A,B\}} |f(t)|.
Corollary 3.1.
\lim_{n\to\infty} \binom{2n}{n} 2^{-2n}\cdot\frac{1}{2}\sqrt{2n} = \frac{1}{\sqrt{2\pi}}.
Proof. By Stirling's formula, as n \to \infty, we have
\binom{2n}{n} 2^{-2n}\cdot\frac{1}{2}\sqrt{2n} = \frac{(2n)!}{(n!)^2}\, 2^{-2n}\cdot\frac{1}{2}\sqrt{2n}
\sim \frac{\sqrt{2\pi}\,(2n)^{2n+\frac{1}{2}}\,e^{-2n}}{\left(\sqrt{2\pi}\,n^{n+\frac{1}{2}}\,e^{-n}\right)^2}\, 2^{-2n}\cdot\frac{1}{2}\sqrt{2n}
= \frac{1}{\sqrt{2\pi}}.
= -R e^{-R} - e^{-R} + 1.
By Proposition A.3 (i), the right-hand side of the last line converges to 1
as R \to \infty. Thus
\int_0^\infty x e^{-x}\,dx = 1,
which completes the proof of (3.37) for n = 1.
8 Note that the improper integral in the right-hand side of (3.37) makes sense in case
Next, assuming that (3.37) is valid for n = k, we show that it is also
valid for n = k + 1. For R > 0, applying integration by parts, we obtain
\int_0^R x^{k+1} e^{-x}\,dx = \left[x^{k+1}\left(-e^{-x}\right)\right]_0^R - \int_0^R (k+1)\,x^k\left(-e^{-x}\right)dx
= -R^{k+1}e^{-R} + (k+1)\int_0^R x^k e^{-x}\,dx.
Again, by Proposition A.3 (i), the first term of the last line converges to 0
as R \to \infty, while the second term converges to (k+1)\,k! = (k+1)! by the
assumption of the induction. Thus we have
\int_0^\infty x^{k+1} e^{-x}\,dx = (k+1)!,
which completes the proof.
At first glance, n! looks simpler than Euler's integral (3.37), but to study
it in detail, the integral is much more useful.
Now, let us look closely at Euler's integral:
\int_0^\infty x^n e^{-x}\,dx = \int_0^\infty \exp\left(n\log x - x\right)dx = \int_0^\infty \exp\left(n\left(\log x - \frac{x}{n}\right)\right)dx.
The change of variables t = x/n leads to
n! = n^{n+1}\int_0^\infty \exp\left(n\left(\log t - t\right)\right)dt. \qquad (3.38)
Then, we know
f(t) := \log t - t, \qquad t > 0,
is the key function.
We differentiate f to find an extremal value. The equation
f'(t) = \frac{1}{t} - 1 = 0
has a unique solution t = 1. Since f''(t) = -1/t^2 < 0, the extremal value
f(1) = -1 is the maximum value of f (Fig. 3.6).
Fig. 3.6 The graph of f(t) = \log t - t
Lemma 3.3. For any 0 < \varepsilon < 1 (Remark 3.1), it holds that
\int_0^\infty \exp\left(nf(t)\right)dt \sim \int_{1-\varepsilon}^{1+\varepsilon} \exp\left(nf(t)\right)dt, \qquad n \to \infty.
Namely, if n \gg 1, the integral of \exp(nf(t)) on [0, \infty) is almost determined
by the integral on an arbitrarily small neighborhood of t = 1, at which f takes
the maximum value.
Proof. We define a function g by
g(t) := \begin{cases} \dfrac{-1 - f(1-\varepsilon)}{\varepsilon}\,(t-1) - 1 & (0 < t \le 1), \\[2mm] \dfrac{f(1+\varepsilon) + 1}{\varepsilon}\,(t-1) - 1 & (1 < t). \end{cases}
The graph of g is a polyline that passes the following three points (Fig. 3.7):
(1-\varepsilon,\, f(1-\varepsilon)), \quad (1,\, -1) \quad \text{and} \quad (1+\varepsilon,\, f(1+\varepsilon)).
Since f is a concave function, we have
g(t) \begin{cases} > f(t) & (0 < t < 1-\varepsilon), \\ \le f(t) & (1-\varepsilon \le t \le 1+\varepsilon), \\ > f(t) & (1+\varepsilon < t). \end{cases}
Therefore
\frac{\displaystyle\int_{t>0,\; |t-1|>\varepsilon} \exp\left(nf(t)\right)dt}{\displaystyle\int_{1-\varepsilon}^{1+\varepsilon} \exp\left(nf(t)\right)dt} \le \frac{\displaystyle\int_{t>0,\; |t-1|>\varepsilon} \exp\left(ng(t)\right)dt}{\displaystyle\int_{1-\varepsilon}^{1+\varepsilon} \exp\left(ng(t)\right)dt}, \qquad (3.39)
Fig. 3.7 The graph of the polyline g
where \int_{t>0,\;|t-1|>\varepsilon} denotes the integral \int_0^{1-\varepsilon} + \int_{1+\varepsilon}^{\infty} on the set \{t > 0\} \cap \{|t-1| > \varepsilon\}.
Since g is a polyline, we can explicitly calculate the right-hand side of
(3.39). First, in the region 0 < t \le 1,
a := \frac{-1 - f(1-\varepsilon)}{\varepsilon} > 0
is the slope of g(t). Then, as n \to \infty,
\frac{\displaystyle\int_0^{1-\varepsilon} \exp\left(ng(t)\right)dt}{\displaystyle\int_{1-\varepsilon}^{1+\varepsilon} \exp\left(ng(t)\right)dt} \le \frac{\displaystyle\int_{-\infty}^{1-\varepsilon} \exp\left(ng(t)\right)dt}{\displaystyle\int_{1-\varepsilon}^{1} \exp\left(ng(t)\right)dt}
= \frac{\displaystyle\int_{-\infty}^{1-\varepsilon} \exp\left(n(a(t-1)-1)\right)dt}{\displaystyle\int_{1-\varepsilon}^{1} \exp\left(n(a(t-1)-1)\right)dt}
= \frac{\dfrac{1}{na}\,e^{-n}\exp\left(-na\varepsilon\right)}{\dfrac{1}{na}\,e^{-n}\left(1 - \exp\left(-na\varepsilon\right)\right)}
= \frac{\exp\left(-na\varepsilon\right)}{1 - \exp\left(-na\varepsilon\right)} \longrightarrow 0. \qquad (3.40)
From (3.38) and Lemma 3.3, it follows that for any 0 < \varepsilon < 1,
n! \sim n^{n+1}\int_{1-\varepsilon}^{1+\varepsilon} \exp\left(nf(t)\right)dt, \qquad n \to \infty. \qquad (3.42)
Furthermore, looking carefully at the above proof, if we replace \varepsilon by a
positive decreasing sequence \{\varepsilon(n)\}_{n=1}^{\infty} converging to 0, there is still a
possibility for (3.42) to hold. Indeed, for example,
\varepsilon(n) := n^{-1/4}, \qquad n = 2, 3, \ldots
satisfies the following:
n! \sim n^{n+1}\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(nf(t)\right)dt, \qquad n \to \infty. \qquad (3.43)
Since f(t) = \log t - t and \varepsilon(n) = n^{-1/4}, what we have to show are
\lim_{n\to\infty} n\left(\log\left(1 - n^{-1/4}\right) + n^{-1/4}\right) = -\infty, \qquad (3.44)
\lim_{n\to\infty} n\left(\log\left(1 + n^{-1/4}\right) - n^{-1/4}\right) = -\infty. \qquad (3.45)
First, as for (3.44), recalling (3.33), we see
n\left(\log\left(1 - n^{-1/4}\right) + n^{-1/4}\right) = n\left(-\frac{1}{2}\left(n^{-1/4}\right)^2 + \int_0^{-n^{-1/4}} \frac{\left(-n^{-1/4} - s\right)^2}{(1+s)^3}\,ds\right)
< -\frac{1}{2}\,n^{1/2} \longrightarrow -\infty, \qquad n \to \infty.
Next, as for (3.45), we see
n\left(\log\left(1 + n^{-1/4}\right) - n^{-1/4}\right) = n\left(-\frac{1}{2}\left(n^{-1/4}\right)^2 + \int_0^{n^{-1/4}} \frac{\left(n^{-1/4} - s\right)^2}{(1+s)^3}\,ds\right)
\le n\left(-\frac{1}{2}\left(n^{-1/4}\right)^2 + \int_0^{n^{-1/4}} \frac{\left(n^{-1/4}\right)^2}{1}\,ds\right)
= -\frac{1}{2}\,n^{1/2} + n^{1/4} \longrightarrow -\infty, \qquad n \to \infty.
From these (3.43) follows.
namely,
-1 - \frac{(t-1)^2}{2\left(1-\varepsilon(n)\right)^2} \le f(t) \le -1 - \frac{(t-1)^2}{2\left(1+\varepsilon(n)\right)^2}.
If we put b_{\pm}(n) := \sqrt{2}\left(1 \pm \varepsilon(n)\right), then b_{\pm}(n) \to \sqrt{2} as n \to \infty, and
e^{-n}\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(-\frac{n(t-1)^2}{b_-(n)^2}\right)dt
\le \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{nf(t)}\,dt \le e^{-n}\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(-\frac{n(t-1)^2}{b_+(n)^2}\right)dt.
Multiplying all by e^n,
\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(-\frac{n(t-1)^2}{b_-(n)^2}\right)dt
\le e^n \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{nf(t)}\,dt \le \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(-\frac{n(t-1)^2}{b_+(n)^2}\right)dt. \qquad (3.46)
Changing variables u = \sqrt{n}\,(t-1)/b_{\pm}(n) leads to
\sqrt{n}\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left(-n\left(\frac{t-1}{b_{\pm}(n)}\right)^2\right)dt = b_{\pm}(n)\int_{-\sqrt{n}\,\varepsilon(n)/b_{\pm}(n)}^{\sqrt{n}\,\varepsilon(n)/b_{\pm}(n)} e^{-u^2}\,du.
Here, as n \to \infty, \sqrt{n}\,\varepsilon(n)/b_{\pm}(n) = n^{1/4}/b_{\pm}(n) \to \infty, and hence the right-hand side of the above is convergent to (Remark 3.7)
\sqrt{2}\int_{-\infty}^{\infty} e^{-u^2}\,du = \sqrt{2\pi}, \qquad n \to \infty.
From this and (3.46), by the squeeze theorem, it follows that
\sqrt{n}\,e^n \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{nf(t)}\,dt \longrightarrow \sqrt{2\pi}, \qquad n \to \infty.
Therefore (3.43) implies
\sqrt{n}\,e^n\,\frac{n!}{n^{n+1}} \longrightarrow \sqrt{2\pi}, \qquad n \to \infty,
from which Stirling's formula (3.36) immediately follows.
Remark 3.7. An improper integral
\int_{-\infty}^{\infty} \exp\left(-u^2\right)du = \sqrt{\pi},
or its equivalent
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp\left(-\frac{u^2}{2}\right)du = 1,
is called the Gaussian integral, which appears very often in mathematics,
physics, etc.
Lemma 3.4. Let A, B (A < B) be any two real numbers. Suppose that
\mathbb{N} \ni n, k \to \infty under the condition
\frac{1}{2}n + \frac{A}{2}\sqrt{n} \le k \le \frac{1}{2}n + \frac{B}{2}\sqrt{n}. \qquad (3.47)
Then, we have
b(k; n) := \binom{n}{k} 2^{-n} \sim \frac{1}{\frac{1}{2}\sqrt{n}}\cdot\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\left(k - \frac{1}{2}n\right)^2}{\frac{1}{2}n}\right)
uniformly in k satisfying (3.47), where \max_{k;\,(3.47)} denotes the maximum value over k satisfying (3.47) with n
fixed.
Proof. 9 If we put
n! = \sqrt{2\pi}\,n^{n+\frac{1}{2}}\,e^{-n}\left(1 + \delta_n\right), \qquad (3.50)
we have \delta_n \to 0, as n \to \infty, by Stirling's formula (3.36). The following
holds (Proposition A.2):
\max_{k;\,(3.47)} |\delta_k| \to 0, \qquad n \to \infty. \qquad (3.51)
b(k; n) = \frac{1}{\sqrt{2\pi\,\dfrac{k}{n}\,\dfrac{n-k}{n}\,n}}\left(\frac{k}{n}\right)^{-k}\left(\frac{n-k}{n}\right)^{-(n-k)} 2^{-n}\,\frac{1 + \delta_n}{\left(1 + \delta_k\right)\left(1 + \delta_{n-k}\right)}.
As n \to \infty,
\max_{k;\,(3.47)}\left|\frac{k}{n} - \frac{1}{2}\right| = \max_{k;\,(3.47)}\frac{\left|k - \frac{1}{2}n\right|}{n} \le \frac{\max\{|A|, |B|\}}{2\sqrt{n}} \to 0. \qquad (3.52)
9 The proof of this lemma (cf. [Sinai (1992)] Theorem 3.1) is difficult. Readers should
Then, putting
b(k; n) = \frac{1}{\sqrt{2\pi}\cdot\frac{1}{2}\sqrt{n}}\left(\frac{k}{n}\right)^{-k}\left(\frac{n-k}{n}\right)^{-(n-k)} 2^{-n}\left(1 + r_{n,k}\right),
T^{(1)}_{n,k} = -k\log\frac{2k}{n} = -kz\sqrt{\frac{1}{n}} + \frac{1}{2}\,\frac{kz^2}{n} - k\int_0^{z\sqrt{1/n}} \frac{\left(z\sqrt{1/n} - s\right)^2}{(1+s)^3}\,ds.
Let \eta_{n,k} denote the last integral term (Remark 3.6). Then,
|\eta_{n,k}| \le k\int_0^{|z|\sqrt{1/n}} \frac{\left(|z|\sqrt{1/n}\right)^2}{\left(1 - |z|\sqrt{1/n}\right)^3}\,ds = k\left(|z|\sqrt{\frac{1}{n}}\right)^3\left(1 - |z|\sqrt{\frac{1}{n}}\right)^{-3}
= \frac{k}{n}\,|z|^3\,\sqrt{\frac{1}{n}}\left(1 - |z|\sqrt{\frac{1}{n}}\right)^{-3}. \qquad (3.54)
10 The proof of (3.53) needs the knowledge of continuity of functions of several variables.
= \frac{1}{\frac{1}{2}\sqrt{n}\cdot\sqrt{2\pi}}\exp\left(-\frac{\left(k - \frac{1}{2}n\right)^2}{\frac{1}{2}n}\right)\left(1 + \eta''_{n,k}\right)\left(1 + r_{n,k}\right).
= \sum_{k;\,(3.47)} \frac{1}{\frac{1}{2}\sqrt{n}\cdot\sqrt{2\pi}}\exp\left(-\frac{\left(k - \frac{1}{2}n\right)^2}{\frac{1}{2}n}\right)\left(1 + r_n(k)\right),
where \sum_{k;\,(3.47)} denotes the sum over k satisfying the condition (3.47) with
n fixed. For each k \in \mathbb{N}_+, let
z_k := \frac{k - \frac{1}{2}n}{\frac{1}{2}\sqrt{n}}.
Noting that the length between two adjacent z_k and z_{k+1} is 1/\left(\frac{1}{2}\sqrt{n}\right), we
rewrite P_n(A \le Z_n \le B) as a Riemann sum and a remainder:
P_n(A \le Z_n \le B) = \frac{1}{\frac{1}{2}\sqrt{n}}\sum_{A \le z_k \le B} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z_k^2}{2}\right)\left(1 + r_n(k)\right)
= \frac{1}{\frac{1}{2}\sqrt{n}}\sum_{A \le z_k \le B} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z_k^2}{2}\right)
+ \frac{1}{\frac{1}{2}\sqrt{n}}\sum_{A \le z_k \le B} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z_k^2}{2}\right) r_n(k).
11 The proof of (3.49) needs the knowledge of continuity of functions of several variables.
As for the second term, since the first factor
\frac{1}{\frac{1}{2}\sqrt{n}}\sum_{A \le z_k \le B} \frac{1}{\sqrt{2\pi}}
is bounded by a constant M > 0 that is independent
of n (Proposition A.1), we see
\left|\frac{1}{\frac{1}{2}\sqrt{n}}\sum_{A \le z_k \le B} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z_k^2}{2}\right) r_n(k)\right| \le M \max_{k;\,(3.47)} |r_n(k)| \to 0,
\qquad n \to \infty.
This completes the proof of de Moivre-Laplace's theorem.
This is the probability that the number of Heads is more than or equal to 55
among 100 coin tosses. Although the random variable \xi_1 + \cdots + \xi_{100} takes
only the integers 0, \ldots, 100, just as we considered the event \{\xi_1 + \cdots + \xi_{100} = k\}
as \{k - 1/2 \le \xi_1 + \cdots + \xi_{100} < k + 1/2\} in the histogram of Fig. 3.3 (right),
if we consider (3.56) as
P_{100}\left(\,\xi_1 + \cdots + \xi_{100} \ge 54.5\,\right),
and apply de Moivre-Laplace's theorem, the accuracy of approximation
will be improved. This is called the continuity correction. By this method,
we obtain
P_{100}\left(\,\xi_1 + \cdots + \xi_{100} \ge 54.5\,\right) = P_{100}\left(\frac{\xi_1 + \cdots + \xi_{100} - 50}{\frac{1}{2}\sqrt{100}} \ge \frac{54.5 - 50}{\frac{1}{2}\sqrt{100}}\right)
= P_{100}\left(\frac{\xi_1 + \cdots + \xi_{100} - 50}{\frac{1}{2}\sqrt{100}} \ge 0.9\right)
\approx \int_{0.9}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx = 0.18406\cdots.
Since the true value of (3.56) is
\sum_{k=55}^{100} \binom{100}{k} 2^{-100} = 233375500604595657604761955760 \times 2^{-100} = 0.1841008087\cdots,
the approximation error is only about 0.00004.
Example 3.11. Let 0 < p < 1 and let (\{0,1\}^n, P(\{0,1\}^n), P_n^{(p)}) be the
probability space of Example 3.2. Then, the coordinate functions \{\xi_k\}_{k=1}^{n}
are unfair coin tosses if p \ne 1/2. Since E[\xi_k] = p and V[\xi_k] = p(1-p),
Theorem 3.9 implies
\lim_{n\to\infty} P_n^{(p)}\left(A \le \frac{(\xi_1 - p) + \cdots + (\xi_n - p)}{\sqrt{p(1-p)\,n}} \le B\right) = \int_A^B \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.
Proof. Let a_1 < \cdots < a_s be all possible values that X takes. We have
\lim_{t\to\infty} \frac{1}{t}\log M_X(t)
= \lim_{t\to\infty} \frac{1}{t}\log \sum_{i=1}^{s} \exp(t a_i)\,P(X = a_i)
= \lim_{t\to\infty} \frac{1}{t}\log\left(\exp(t a_s)\sum_{i=1}^{s} \exp\left(t(a_i - a_s)\right)P(X = a_i)\right)
= \lim_{t\to\infty} \left(\frac{1}{t}\log \exp(t a_s) + \frac{1}{t}\log \sum_{i=1}^{s} \exp\left(t(a_i - a_s)\right)P(X = a_i)\right)
= a_s + \lim_{t\to\infty} \frac{1}{t}\log\left(P(X = a_s) + \sum_{i=1}^{s-1} \exp\left(t(a_i - a_s)\right)P(X = a_i)\right)
= a_s.
After knowing a_s,
\lim_{t\to\infty} M_X(t)\exp(-t a_s) = \lim_{t\to\infty} \sum_{i=1}^{s} \exp\left(t(a_i - a_s)\right)P(X = a_i)
= P(X = a_s) + \lim_{t\to\infty} \sum_{i=1}^{s-1} \exp\left(t(a_i - a_s)\right)P(X = a_i)
= P(X = a_s).
Thus we can obtain a_s and P(X = a_s) from M_X(t). If we apply this
procedure to
M_X(t) - \exp(t a_s)\,P(X = a_s),
then we obtain a_{s-1} and P(X = a_{s-1}). Repeating this procedure, we
obtain the distribution of X from M_X(t). In particular, M_X(t) = M_Y(t),
t \in \mathbb{R}, implies that X and Y are identically distributed.
= E\left[\exp\left(t\,\frac{\xi_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}}\right)\cdots\exp\left(t\,\frac{\xi_n - \frac{1}{2}}{\frac{1}{2}\sqrt{n}}\right)\right]
= E\left[\exp\left(t\,\frac{\xi_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}}\right)\right]\cdots E\left[\exp\left(t\,\frac{\xi_n - \frac{1}{2}}{\frac{1}{2}\sqrt{n}}\right)\right]
= \left(E\left[\exp\left(t\,\frac{\xi_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}}\right)\right]\right)^n
= \left(\frac{1}{2}\exp\left(\frac{t}{\sqrt{n}}\right) + \frac{1}{2}\exp\left(-\frac{t}{\sqrt{n}}\right)\right)^n
= \left(1 + \frac{1}{2}\left(\exp\left(\frac{t}{2\sqrt{n}}\right) - \exp\left(-\frac{t}{2\sqrt{n}}\right)\right)^2\right)^n
= \left(1 + \frac{c_n(t)}{n}\right)^n, \qquad (3.60)
where
c_n(t) := \frac{n}{2}\left(\exp\left(\frac{t}{2\sqrt{n}}\right) - \exp\left(-\frac{t}{2\sqrt{n}}\right)\right)^2 \longrightarrow \frac{t^2}{2}, \qquad n \to \infty. \qquad (3.61)
\le \frac{|c_n|^2}{2n} + \frac{|c_n|^3}{n^2}\left(1 - \frac{|c_n|}{n}\right)^{-3} + |c_n - c| \longrightarrow 0, \qquad n \to \infty.
Therefore
\lim_{n\to\infty} n\log\left(1 + \frac{c_n}{n}\right) = c.
The function exp(t2 /2), which is the limit in (3.62), has a deep relation
with the density function (3.29) of the standard normal distribution.
Proof. We complete the square for the quadratic function of x in the exponent:
\int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(tx - \frac{1}{2}x^2\right)dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x - t)^2 + \frac{1}{2}t^2\right)dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x - t)^2}{2}\right)dx \cdot \exp\left(\frac{t^2}{2}\right)
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx \cdot \exp\left(\frac{t^2}{2}\right)
= \exp\left(\frac{t^2}{2}\right).
3.5.1 Inference
Let us look at applications of limit theorems in mathematical statistics.
We begin with statistical inference. It provides guidelines to construct
stochastic models (probability spaces, random variables, etc.) from given
statistical data. Consider the following exercise. 12
Solving for p in P_{1000}^{(p)}(\,\cdot\,) of the left-hand side, we see that the probability that
\frac{S}{1000} - 0.158 < p < \frac{S}{1000} + 0.158
is not less than 0.99. We now consider that S(\omega) = 400 is an outcome of
the occurrence of the above event. Then, we obtain
0.242 < p < 0.558.
This estimate is not so good because Chebyshev's inequality is loose.
The central limit theorem (Example 3.11) gives a much more precise estimate:
P_{1000}^{(p)}\left(\left|\frac{S - 1000p}{\sqrt{1000\,p(1-p)}}\right| \ge z\right) \approx 2\int_z^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx, \qquad z > 0.
Now, for 0 < \alpha < 1/2, let z(\alpha) denote the positive real number z such that
2\int_z^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx = \alpha.
z(\alpha) is called the 100\alpha\% point of the standard normal distribution
(Fig. 3.8).
Fig. 3.8 The 100\alpha\% point z(\alpha) of the standard normal distribution
Now, we have
P_{1000}^{(p)}\left(\left|\frac{S - 1000p}{\sqrt{1000\,p(1-p)}}\right| < z(\alpha)\right) \approx 1 - \alpha,
In the same way as above, we conclude that the 99% confidence interval of
the audience rating is [0.36, 0.44].
3.5.2 Test
The statistical test provides guidelines to judge whether or not given
stochastic models (probability spaces, random variables, etc.) contradict
observed statistical data. Consider the following exercise. 13
Exercise III A coin was tossed 200 times, and it came up Heads
115 times. Is it a fair coin?
P_{200}\left(\,|S - 100| \ge 15\,\right) = 0.040305\cdots.
In the same way as above, one of the possible answers to Exercise III' is
that the hypothesis "the ratio of boys and girls of newborn babies in this
region is 1:1" is rejected at the significance level 5%.
Chapter 4
The history of the Monte Carlo method started when Ulam, von Neumann
and others applied it to the simulation of nuclear fission 1 by a newly
invented computer in the 1940s, i.e., in the midst of World War II. Since then,
along with the development of computers, the Monte Carlo method has
been used in various fields of science and technology, and has produced
remarkable results. The development of the Monte Carlo method will surely
continue.
In this chapter, however, we do not mention such brilliant applications
of the method, but we discuss its theoretical foundation. More concretely,
through a solving process of Exercise I in Sec. 1.4, we study the basic theory
and implementation of sampling of random variables by computer.
4.1.1 Purpose
The Monte Carlo method is a kind of gambling, as its name indicates. The
purpose of the player, Alice, is to get a generic value (a typical value, not
an exceptional value) of a given random variable by sampling. Of course,
the mathematical problem in question is assumed to be solved by a generic
value of the random variable. The sampling is done by her will, but she has
1 In order to make atomic bombs.
Fig. 4.1 The distribution and a generic value of a random variable (Conceptual figure)
The following is a very small example of the Monte Carlo method (without
a computer).
Example 4.1. An urn contains 100 balls, 99 of which are numbered r and
one of which is numbered r+1. Alice draws a ball from the urn, and guesses
the number r to be the number of her ball. The probability that she fails
to guess the number r correctly is 1/100.
Finally, we define S_{10^6}: \{0,1\}^{10^8} \to \mathbb{R} by
S_{10^6}(\omega) := \sum_{k=1}^{10^6} X_k(\omega), \qquad \omega \in \{0,1\}^{10^8}.
Then, the mean and the variance of S_{10^6}/10^6 are, by (3.12) and (3.13),
E\left[\frac{S_{10^6}}{10^6}\right] = p, \qquad V\left[\frac{S_{10^6}}{10^6}\right] = \frac{p(1-p)}{10^6} \le \frac{1}{4\cdot 10^6}.
Let U_0 be the set of \omega that give exceptional values to S_{10^6}:
U_0 := \left\{\omega \in \{0,1\}^{10^8} \;\middle|\; \left|\frac{S_{10^6}(\omega)}{10^6} - p\right| \ge \frac{1}{200}\right\}. \qquad (4.1)
Now, Chebyshev's inequality shows
P_{10^8}(\omega \in U_0) \le \frac{1}{4\cdot 10^6}\cdot 200^2 = \frac{1}{100}. \qquad (4.2)
Namely, a generic value of S106 /106 will be an approximate value of p.
4.2.1 Definition
To play the gamble of Example 4.2, anyhow, Alice has to choose an
\omega \in \{0,1\}^{10^8}. Let us suppose that she uses the most commonly used device
to do it, i.e., a pseudorandom generator.
directly from a keyboard, and the program of the function g should work
sufficiently fast.
4.2.2 Security
Let us continue to consider the case of Example 4.3. Alice can choose any
seed \omega' \in \{0,1\}^{238} of g freely of her own will. Her risk is now estimated by
P_{238}\left(\left|\frac{S_{10^6}(g(\omega'))}{10^6} - p\right| \ge \frac{1}{200}\right). \qquad (4.3)
Of course, the probability (4.3) depends on g. If this probability (the
probability that her sample S_{10^6}(g(\omega')) is an exceptional value of S_{10^6}) is
large, then it is difficult for her to win the game, which is not desirable.
Now, we give the following (somewhat vague) definition: we say a pseudorandom generator g: \{0,1\}^l \to \{0,1\}^n, l < n, is secure against a set
U \subset \{0,1\}^n if it holds that
P_n(\omega \in U) \approx P_l(g(\omega') \in U).
In Example 4.3, if g: \{0,1\}^{238} \to \{0,1\}^{10^8} is secure against U_0 in (4.1),
then for the majority of the seeds \omega' \in \{0,1\}^{238} that Alice can choose of her
own will, the samples S_{10^6}(g(\omega')) will be generic values of S_{10^6}. In this
case, no random number is needed. In other words, in sampling a value
of S_{10^6}, using g does not make Alice's risk high, and hence g is said to be
secure. The problem of sampling in a Monte Carlo method will be solved
by finding a suitable secure pseudorandom generator.
In general, a pseudorandom generator g: \{0,1\}^l \to \{0,1\}^n is desired
to be secure against as many subsets of \{0,1\}^n as possible, but there is
3 The origin of the number 238 will soon be clear in Example 4.4.
2m 2m
Lemma 4.1. Let X : {0, 1}m R and let X : [0, 1) R be the corre-
sponding function defined by (4.4). Then, for any j N+ , we have
Z 1 2m+j
X1 q
1
X(x)dx = m+j X m+j .
0 2 q=0
2
Proof. It is enough to prove the lemma for \bar{X}(x) = 1_{[2^{-m}i,\; 2^{-m}(i+1))}(x).
\frac{1}{2^{m+j}}\sum_{q=0}^{2^{m+j}-1} 1_{[2^{-m}i,\; 2^{-m}(i+1))}\left(\frac{q}{2^{m+j}}\right) = \frac{1}{2^{m+j}}\sum_{q=2^j i}^{2^j(i+1)-1} 1 = \frac{1}{2^m}.
= \frac{1}{2^{2m+2j}}\sum_{q=0}^{2^{m+j}-1}\sum_{p=0}^{2^{m+j}-1} F\left(\frac{p}{2^{m+j}} + \frac{k'q}{2^{m+j}}\right) G\left(\frac{p}{2^{m+j}} + \frac{kq}{2^{m+j}}\right)
= \frac{1}{2^{2m+2j}}\sum_{q=0}^{2^{m+j}-1}\;\sum_{p=kq}^{2^{m+j}+kq-1} F\left(\frac{p}{2^{m+j}} + \frac{(k'-k)q}{2^{m+j}}\right) G\left(\frac{p}{2^{m+j}}\right)
= \frac{1}{2^{2m+2j}}\sum_{q=0}^{2^{m+j}-1}\sum_{p=0}^{2^{m+j}-1} F\left(\frac{p}{2^{m+j}} + \frac{(k'-k)q}{2^{m+j}}\right) G\left(\frac{p}{2^{m+j}}\right). \qquad (4.12)
then
s(q - q') \bmod 2^{m+j-i} = 0,
i.e., s(q - q') is divisible by 2^{m+j-i}. Since s is odd, q - q' is divisible by
2^{m+j-i}; but q, q' \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}, so it means q = q'. Therefore the
correspondence
\{0, 1, 2, \ldots, 2^{m+j-i} - 1\} \ni q \mapsto sq \bmod 2^{m+j-i} \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}
is one-to-one. Let q_r \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\} be the unique solution to
sq_r \bmod 2^{m+j-i} = r, \qquad r \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}.
Then, for each r, we have
\#\{0 \le q \le 2^{m+j}-1 \mid sq \bmod 2^{m+j-i} = r\}
= \#\{0 \le q \le 2^{m+j}-1 \mid sq \bmod 2^{m+j-i} = sq_r \bmod 2^{m+j-i}\}
= \#\{0 \le q \le 2^{m+j}-1 \mid q - q_r \text{ is divisible by } 2^{m+j-i}\} = 2^i.
From this, (4.13) is calculated as
\frac{1}{2^{m+j}}\sum_{q=0}^{2^{m+j}-1} F\left(\frac{p}{2^{m+j}} + \frac{sq}{2^{m+j-i}}\right) = \frac{1}{2^{m+j}}\sum_{r=0}^{2^{m+j-i}-1} 2^i\,F\left(\frac{p}{2^{m+j}} + \frac{r}{2^{m+j-i}}\right)
= \frac{1}{2^{m+j-i}}\sum_{r=0}^{2^{m+j-i}-1} F\left(\frac{r}{2^{m+j-i}}\right)
= \int_0^1 F(u)\,du. \qquad (4.14)
The last equality is due to Lemma 4.1. From (4.12), (4.13), and (4.14), it follows
that
E\left[F(Z_{k'})G(Z_k)\right]
= \frac{1}{2^{m+j}}\sum_{p=0}^{2^{m+j}-1}\left(\frac{1}{2^{m+j}}\sum_{q=0}^{2^{m+j}-1} F\left(\frac{p}{2^{m+j}} + \frac{(k'-k)q}{2^{m+j}}\right)\right) G\left(\frac{p}{2^{m+j}}\right)
= \frac{1}{2^{m+j}}\sum_{p=0}^{2^{m+j}-1}\left(\frac{1}{2^{m+j}}\sum_{q=0}^{2^{m+j}-1} F\left(\frac{p}{2^{m+j}} + \frac{sq}{2^{m+j-i}}\right)\right) G\left(\frac{p}{2^{m+j}}\right)
= \int_0^1 F(u)\,du \cdot \frac{1}{2^{m+j}}\sum_{p=0}^{2^{m+j}-1} G\left(\frac{p}{2^{m+j}}\right)
= \int_0^1 F(u)\,du \int_0^1 G(v)\,dv.
This completes the proof of (4.11).
c [0, 1) is called the Weyl transformation, after which this pseudorandom generator
is named.
Example 4.4. Applying the random Weyl sampling, we can solve Exercise I. Let S_{10^6} be the random variable defined in Example 4.2. We take
m := 100 and N := 10^6 in Definition 4.2. In order to let N \le 2^{j+1}, it is
enough to take j := 19. Then, we have 2m + 2j = 238, so that the pseudorandom generator (4.9) is now a function g: \{0,1\}^{238} \to \{0,1\}^{10^8}. The
risk (4.3) is estimated, by Theorem 4.1, as
P_{238}\left(\left|\frac{S_{10^6}(g(\omega'))}{10^6} - p\right| \ge \frac{1}{200}\right) \le \frac{1}{100}. \qquad (4.15)
Thus g is secure against U_0 in (4.1). Since Alice can freely choose any seed
\omega' \in \{0,1\}^{238}, this risk estimate has a practical meaning and she no longer
needs a long random number.
Here is a concrete example. Instead of her, the author chose the following seed \omega' = (x, \alpha) \in D_{119} \times D_{119} = \{0,1\}^{238}, written in the binary
numeral system:
Then, we obtained S_{10^6}(g(\omega')) = 546,177 by a computer (see Sec. A.5). In
this case,
\frac{S_{10^6}(g(\omega'))}{10^6} = 0.546177 \qquad (4.16)
is the estimated value of the probability p. This result, together with the risk estimate
(4.15), is expressed in statistical terms as "The 99% confidence interval of p
is 0.546177 \pm 0.005, i.e., 0.541 < p < 0.551." Indeed, the true value of p is
692255904222999797557597756576 \times 2^{-100} = 0.5460936192\cdots.
Thus the error of the estimated value (4.16) is only about 0.00008.
In a practical Monte Carlo integration, the sample size N is usually not
determined in advance; it is determined in the course of the numerical experiments.
To be ready for such a situation, we should take j somewhat large.
Remark 4.1. We can advise Alice a little in choosing a seed \omega' = (x, \alpha) \in
\{0,1\}^{2m+2j} for the random Weyl sampling: she should not choose an extremely simple \alpha. Indeed, if she chooses \alpha = (0, 0, \ldots, 0) \in \{0,1\}^{m+j}, the
sampling will certainly end in failure.
Appendix A
A.1 Symbols and terms
Definition A.1. For two sets A and B, the set of all ordered pairs (x, y)
of x \in A and y \in B is written as
A \times B := \{(x, y) \mid x \in A,\; y \in B\},
and it is called the direct product of A and B.
For more than two sets, the direct product is similarly defined. For
example,
\mathbb{R}^3 = \mathbb{R}\times\mathbb{R}\times\mathbb{R} := \{(x, y, z) \mid x \in \mathbb{R},\; y \in \mathbb{R},\; z \in \mathbb{R}\}
is the set of all ordered triplets of real numbers, i.e., the set of all points in
3-dimensional space. \{0,1\}^3 in Example 1.1 is nothing but \{0,1\} \times \{0,1\} \times \{0,1\}.
\{b_{ij}\}_{i=1,2,\ldots,m,\; j=1,2,\ldots,n}.
This is called a double sequence. As we write the sum of a sequence \{a_i\}_{i=1}^n
as \sum_{i=1}^n a_i, the sum of a double sequence \{b_{ij}\}_{i=1,2,\ldots,m,\; j=1,2,\ldots,n} is written as
\sum_{i=1,2,\ldots,m,\; j=1,2,\ldots,n} b_{ij}.
Similarly, triple sequences and triple sums (more generally, multiple sequences
and multiple sums) are defined.
The product a_1 a_2 \cdots a_n of a sequence \{a_i\}_{i=1}^n is written as
\prod_{i=1}^n a_i.
Obviously,
\prod_{i=1,2,\ldots,m,\; j=1,2,\ldots,n} b_{ij} = \prod_{i=1}^m\left(\prod_{j=1}^n b_{ij}\right) = \prod_{j=1}^n\left(\prod_{i=1}^m b_{ij}\right).
The summation symbol \sum is applied not only to sequences but also to
any finite set of numbers. For example, suppose that to each element \omega of
a finite set there corresponds a number p_\omega \in \mathbb{R}. Then, the total sum of
such p_\omega is written as
\sum_{\omega} p_\omega.
The sum of p_\omega over \omega that satisfies a condition X(\omega) = a_i is written
as
\sum_{\omega;\; X(\omega)=a_i} p_\omega.
The product symbol \prod is used similarly.
To describe numbers, we usually use the decimal numeral system (or the
base-10 numeral system). It is a positional numeral system employing 10
as the base and requiring 10 different numerals, the digits 0, 1, 2, 3, 4, 5, 6,
7, 8, 9. It also requires a dot (decimal point) to represent decimal fractions.
The same thing can be done with only two different numerals, the digits 0,
1. This is called the binary numeral system (or the base-2 numeral system).
It is Leibniz who first systematized it in mathematics.
The binary numeral system is the simplest positional numeral system,
which can be expressed by ON(= 1) and OFF(= 0) of electronic circuits,
so that it is now a mathematical basis of all digital technologies.
For example, dividing 1000110011 (= 563 in the decimal system) by 1111101000 (= 1000) by long division in the binary system yields 0.1001\cdots.
At high school, about the limits of sequences and functions, students learn,
for example, \lim_{n\to\infty} a_n = a as "If n gets larger and larger, a_n gets closer
and closer to a." This description is too vague for advanced mathematics.
Here we introduce the rigorous treatment of limits that was established by
Cauchy, Weierstrass and others in the 19th century.
\delta(\delta + 2|c|) < \varepsilon.
Let us solve this inequality in \delta; adding |c|^2 to both sides, we get
|x^2 - c^2| < \varepsilon.
For example, in the proof of Lemma 3.4, (3.53) is proved by the continuity of a function of 5 variables
f(x_1, x_2, x_3, x_4, x_5) := \frac{1}{\sqrt{4x_1 x_2}}\,\frac{1 + x_3}{(1 + x_4)(1 + x_5)}
at the point \left(\frac{1}{2}, \frac{1}{2}, 0, 0, 0\right). Similarly, for the proof of (3.49), we use, in the
last paragraph of the proof of the lemma, the continuity of a function of
two variables
Proposition A.3.
(i) \lim_{x\to\infty} x^a b^{-x} = 0, \qquad a > 0,\; b > 1.
More precisely, it means that for any \varepsilon > 0, there exists a \delta > 0 such that for any x
satisfying 0 < x < \delta, we have |x \log x| < \varepsilon.
Then,
0 < \frac{c(x+1)}{c(x)} = \left(1 + \frac{1}{x}\right)^a b^{-1} < r, \qquad x > x_0,
and hence
0 < \frac{c(x+n)}{c(x)} = \frac{c(x+n)}{c(x+n-1)}\cdot\frac{c(x+n-1)}{c(x+n-2)}\cdots\frac{c(x+1)}{c(x)} < r^n, \qquad x > x_0.
We therefore have
0 < c(x) < r^{\lfloor x - x_0 \rfloor} \max_{x_0 \le y \le x_0 + 1} c(y) \longrightarrow 0, \qquad x \to \infty.
/*==========================================================*/
/* file name: example4_4.c */
/*==========================================================*/
#include <stdio.h>
/* seed */
char xch[] =
"1110110101" "1011101101" "0100000011" "0110101001"
"0101000100" "0101111101" "1010000000" "1010100011"
"0100011001" "1101111101" "1101010011" "111100100";
char ach[] =
"1100000111" "0111000100" "0001101011" "1001000001"
"0010001000" "1010101101" "1110101110" "0010010011"
"1000000011" "0101000110" "0101110010" "010111111";
int main()
{
    int n, s = 0;
    /* x[], a[], longadd(), maxLength() and the constants M_PLUS_J,
       SAMPLE_NUM are declared in the part of the listing omitted here. */
    for( n = 0; n <= M_PLUS_J-1; n++ ){
        if( xch[n] == '1' ) x[n] = 1; else x[n] = 0;
        if( ach[n] == '1' ) a[n] = 1; else a[n] = 0;
    }
    for ( n = 1; n <= SAMPLE_NUM; n++ ){
        longadd();
        if ( maxLength() >= 6 ) s++;
    }
    printf( "s=%6d, p=%7.6f\n", s, (double)s/(double)SAMPLE_NUM );
    return 0;
}
/*================ End of example4_4.c =====================*/
List of mathematicians
Further reading
This book exclusively dealt with limit theorems to emphasize the most im-
portant mission of probability theoryanalysis of randomness. Of course,
there are many other limit theorems not presented in this book, not neces-
sarily determining events of probability close to 1. Anyhow, limit theorem
is not the only theme of probability theory. To learn richness of probability
theory, [Feller (1968)] and [Sinai (1992)] are recommended to read.
Before then, readers should master calculus and linear algebra. What
follows are hints for those who have completed them.
In this book, we restricted the sample space to be a finite set. To
deal with infinite sample space, we need measure theory (1.5.1). To study
it, [Bilingsley (2012)] is recommended to read. Random number is merely
an item of computation theory. Including it, to study computation theory,
[Sipser (2012)] is recommended to read. About recursive function and algo-
rithmic randomness, [Rogers (1967)], [Downey and Hirschfeld (2010)] and
[Nies (2009)] are textbooks for graduate students and researchers. Books
about the Monte Carlo method are really numerous. To study rigorous
basic theory of it, [Sugita (2011)] is recommended to read. For advanced
theory of it, [Bouleau and Lepingle (1994)] is recommended to read.
Bibliography
Index

Symbols
#, viii, 2
×, viii, 4, 106
:=, viii
⊂, ⊃, viii, 10
≒, viii
R, viii
Σ_ω, Σ_{ω; X(ω)=a_i}, 108
≡, 99
↦, 106
∏_{i=1}^{n}, ∏_{i∈I}, viii, 48, 107
∼, viii, 39, 67
⊕, viii, 11
⟨x_1, x_2, ..., x_n⟩, 34
(u)_i^n, 34
{0, 1}*, 34
⌊t⌋, viii, 34, 109
1_A(x), viii, 106
A^c, viii, 106
\binom{n}{k}, viii, 44
d_i(x), 3, 110
E, 51
exp(x), 14
H(p), 43
K(x), 36
K_A(x), 35
L(q), 34
max[min], viii, 58
mod, 100
μ_y(p(x_1, ..., x_n, y)), 28
N, viii, 24
N_+, viii, 34
P_n, 10
P(Ω), viii, 2
P, 3
V, 51
ξ_i, 13

Terms
algorithm, 35
Alice, 10
Bernoulli's theorem, 45
Borel
  's model of coin tosses, 3
  's normal number theorem, 19
Brownian motion, 21
canonical
  order, 34
  realization, 9
central limit theorem, 81
Chebyshev's inequality, 55
coin tosses
  infinite, 18
  n, 10
  unfair, 51
complexity
  computational, depending on algorithm, 35
  Kolmogorov, 36