Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

(KAKURITSU TO RANSU) by Hiroshi Sugita


Copyright © Hiroshi Sugita 2014
English translation published by arrangement with Sugakushobo Co.

PROBABILITY AND RANDOM NUMBER


A First Guide to Randomness
Copyright © 2018 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

ISBN 978-981-3228-25-2

Printed in Singapore


Preface

Imagine old times when the word probability did not exist. Facing difficult
situations that could be described as irregular, unpredictable, random, etc.
(in what follows, we call it random), people were helpless. After a long
time, they found how to describe randomness, how to analyze it, how
to define it, and how to make use of it. What is really amazing is that all
of this has been done in very rigorous mathematics, just like geometry
and algebra.

At high school, students calculate probabilities by counting the number
of permutations, combinations, etc. At university, counting is also the
most basic method to study probability. The only difference is that we
count huge numbers at university; e.g., we ask how large 10,000! is. To
count huge numbers, calculus (differentiation and integration) is useful.
To deal with extremely huge numbers, taking the limit to infinity often
makes things simpler, in which case calculus is again useful. In short,
at university, counting huge numbers by calculus is the most basic
method to study probability.
Why do we count huge numbers? It is because we want to find as
many limit theorems as possible. Limit theorems are very useful for solving
practical problems; e.g., for analyzing statistical data of 10,000 people, or for
studying properties of a certain substance consisting of $6.02 \times 10^{23}$ molecules.
They have a yet more important mission: to unlock the secrets of
randomness, which is the ultimate aim of studying probability.
What is randomness? Why can limit theorems unlock its secrets? To
answer these questions, we feature random number as one of the two main
subjects of this book. 1 Without learning random number, we can do
1 The other one is, of course, probability.


calculations and prove theorems about probability, but to understand the


essential relation between probability and randomness, the knowledge of
random number is necessary.
Another reason why we feature random number is that for proper understanding
and implementation of the Monte Carlo method, the knowledge
of random number is indispensable. The Monte Carlo method is a numerical
method to solve mathematical problems by computer-aided sampling
of random variables. Thus, not only in theory but also in practice, learning
random number is important.

The prerequisite for this book is first-year university calculus. University
mathematics is really difficult. There are three elements of the difficulty.
First, the subtle nuance of concepts described by unfamiliar technical
terms. For example, a random variable needs a probability space as a setup,
and should be accompanied by a distribution, to make sense. In particular,
special attention must be paid to terms, such as event, that have
mathematical meanings other than the usual ones.
Secondly, long proofs and complicated calculations. This book includes
many of them. They are unavoidable, for it is not easy to obtain important
concepts or theorems. In this book, reasoning by many inequalities, which
readers may not be used to, appears here and there. We hope readers will
follow the logic patiently.
Thirdly, the fact that infinity plays an essential role. Since the latter
half of the 19th century, mathematics has developed very much by dealing
with infinity directly. However, infinity essentially differs from finity 2 in
many respects, and our usual intuition does not work at all for it. Therefore
mathematical concepts involving infinity cannot help being so delicate that we
must be very careful in dealing with them. In this book, we discuss the
distinction between countable sets and uncountable sets, and the rigorous
definition of limit.

This book presents well-known basic theorems with proofs that are not
seen in usual probability textbooks, for we want readers to learn that a
good solution is not always unique. In general, breakthroughs in science
have been made by unusual solutions. We hope readers will come to know
more than one proof for every important theorem.

This is an English translation of my Japanese book Kakuritsu to ransū,
published by Sugakushobo Co. For that book, Professors Masato Takei and
2 finiteness

Tetsuya Hattori gave me suggestions for a better future publication. Indeed,


they were very useful in preparing the present English version. Mr. Shin
Yokoyama at Sugakushobo encouraged me to translate the book. Professor
Nicolas Bouleau carefully read the translated English manuscript, and gave
me valuable comments, which helped me improve it. Dr. Pan Suqi and
Ms. Tan Rok Ting at World Scientific Publishing Co. kindly supported me
in producing this English version. I am really grateful to all of them.

Osaka, September 2017 Hiroshi Sugita



Notations and symbols


A := B : A is defined by B (B =: A as well).
P ⇒ Q : P implies Q (logical inclusion).
□ : end of proof.
N := {0, 1, 2, . . .}, the set of all non-negative integers.
N+ := {1, 2, . . .}, the set of all positive integers.
R := the set of all real numbers.
$\prod_{i=1}^{n} a_i := a_1 \cdots a_n$.
max[min] A := the maximum [minimum] value of $A \subset \mathbf{R}$.
$\max_{t \ge 0}[\min_{t \ge 0}]\, u(t)$ := the maximum [minimum] value of u(t) over all $t \ge 0$.
$\lfloor t \rfloor$ := the largest integer not exceeding $t \ge 0$ (rounding down).
$\binom{n}{k} := \dfrac{n!}{(n-k)!\,k!}$.
a ≈ b : a and b are approximately equal to each other.
a ≫ b : a is much greater than b (b ≪ a as well).
$a_n \sim b_n$ : $a_n / b_n \to 1$, as $n \to \infty$.
∅ := the empty set.
$\mathfrak{P}(\Omega)$ := the set of all subsets of Ω.
$1_A(x) := 1\ (x \in A),\ 0\ (x \notin A)$ (the indicator function of A).
#A := the number of elements of A.
$A^c$ := the complement of A.
$A \times B := \{\,(x, y) \mid x \in A,\ y \in B\,\}$ (the direct product of A and B).

Table of Greek letters


A, α  alpha        I, ι  iota          P, ρ (ϱ)  rho
B, β  beta         K, κ  kappa         Σ, σ (ς)  sigma
Γ, γ  gamma        Λ, λ  lambda        T, τ  tau
Δ, δ  delta        M, μ  mu            Υ, υ  upsilon
E, ε (ϵ)  epsilon  N, ν  nu            Φ, φ (ϕ)  phi
Z, ζ  zeta         Ξ, ξ  xi            X, χ  chi
H, η  eta          O, o  omicron       Ψ, ψ  psi
Θ, θ (ϑ)  theta    Π, π (ϖ)  pi        Ω, ω  omega

Contents

Preface v
Notations and symbols . . . . . . . . . . . . . . . . . . . . . . . viii
Table of Greek letters . . . . . . . . . . . . . . . . . . . . . . . . viii

1. Mathematics of coin tossing 1


1.1 Mathematical model . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Probability space . . . . . . . . . . . . . . . . . . 4
1.1.2 Random variable . . . . . . . . . . . . . . . . . . . 6
1.2 Random number . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Analysis of randomness . . . . . . . . . . . . . . . 12
1.3.2 Mathematical statistics . . . . . . . . . . . . . . . 15
1.4 Monte Carlo method . . . . . . . . . . . . . . . . . . . . . 16
1.5 Infinite coin tosses . . . . . . . . . . . . . . . . . . . . . . 18
1.5.1 Borel's normal number theorem . . . . . . . . . . 19
1.5.2 Construction of Brownian motion . . . . . . . . . 19

2. Random number 23
2.1 Recursive function . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Computable function . . . . . . . . . . . . . . . . 25
2.1.2 Primitive recursive function and partial recursive
function . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.3 Kleene's normal form (∗) 3 . . . . . . . . . . . . . 29
2.1.4 Enumeration theorem . . . . . . . . . . . . . . . . 31
2.2 Kolmogorov complexity and random number . . . . . . . 33
3 The subsections with (∗) can be skipped.


2.2.1 Kolmogorov complexity . . . . . . . . . . . . . . . 34


2.2.2 Random number . . . . . . . . . . . . . . . . . . . 37
2.2.3 Application: Distribution of prime numbers (∗) . . 38

3. Limit theorem 42
3.1 Bernoulli's theorem . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Law of large numbers . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Sequence of independent random variables . . . . 48
3.2.2 Chebyshev's inequality . . . . . . . . . . . . . . . 54
3.2.3 Cramér-Chernoff's inequality . . . . . . . . . . . . 57
3.3 De Moivre-Laplace's theorem . . . . . . . . . . . . . . . . 60
3.3.1 Binomial distribution . . . . . . . . . . . . . . . . 60
3.3.2 Heuristic observation . . . . . . . . . . . . . . . . 61
3.3.3 Taylor's formula and Stirling's formula . . . . . . 65
3.3.4 Proof of de Moivre-Laplace's theorem . . . . . . . 75
3.4 Central limit theorem . . . . . . . . . . . . . . . . . . . . 80
3.5 Mathematical statistics . . . . . . . . . . . . . . . . . . . . 86
3.5.1 Inference . . . . . . . . . . . . . . . . . . . . . . . 86
3.5.2 Test . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4. Monte Carlo method 91


4.1 Monte Carlo method as gambling . . . . . . . . . . . . . . 91
4.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.2 Exercise I, revisited . . . . . . . . . . . . . . . . . 93
4.2 Pseudorandom generator . . . . . . . . . . . . . . . . . . . 94
4.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.2 Security . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Monte Carlo integration . . . . . . . . . . . . . . . . . . . 96
4.3.1 Mean and integral . . . . . . . . . . . . . . . . . . 96
4.3.2 Estimation of mean . . . . . . . . . . . . . . . . . 97
4.3.3 Random Weyl sampling . . . . . . . . . . . . . . . 98
4.4 From the viewpoint of mathematical statistics . . . . . . . 104

Appendix A 105
A.1 Symbols and terms . . . . . . . . . . . . . . . . . . . . . . 105
A.1.1 Set and function . . . . . . . . . . . . . . . . . . . 105
A.1.2 Symbols for sum and product . . . . . . . . . . . . 106
A.1.3 Inequality symbol ≫ . . . . . . . . . . . . . . . . 108

A.2 Binary numeral system . . . . . . . . . . . . . . . . . . . . 108


A.2.1 Binary integers . . . . . . . . . . . . . . . . . . . . 108
A.2.2 Binary fractions . . . . . . . . . . . . . . . . . . . 110
A.3 Limit of sequence and function . . . . . . . . . . . . . . . 111
A.3.1 Convergence of sequence . . . . . . . . . . . . . . 111
A.3.2 Continuity of function of one variable . . . . . . . 114
A.3.3 Continuity of function of several variables . . . . . 115
A.4 Limits of exponential function and logarithm . . . . . . . 115
A.5 C language program . . . . . . . . . . . . . . . . . . . . . 116

List of mathematicians 119


Further reading 120

Bibliography 121
Index 123

Chapter 1

Mathematics of coin tossing

Tossing a coin many times, record 1 if it comes up Heads and record 0 if it
comes up Tails at each coin toss. Then, we get a long sequence consisting
of 0 and 1 (let us call such a sequence a {0, 1}-sequence) that is random.
In this chapter, with such random {0, 1}-sequences as material, we study
outlines of
how to describe randomness (Sec. 1.1),
how to define randomness (Sec. 1.2),
how to analyze randomness (Sec. 1.3.1), and
how to make use of randomness (Sec. 1.3.2, Sec. 1.4).
Readers may think that coin tosses are too simple as a random object,
but as a matter of fact, virtually all random objects can be mathematically
constructed from them (Sec. 1.5.2). Thus analyzing coin tosses means
analyzing all random objects.
In this chapter, we present only basic ideas, and do not prove theorems.

1.1 Mathematical model

For example, the concept circle is obtained by abstracting an essence from
various round objects in the world. To deal with circle in mathematics, we
consider an equation $(x-a)^2 + (y-b)^2 = c^2$ as a mathematical model.
Namely, what we call a circle in mathematics is the set of all solutions of
this equation:
$$\{\,(x, y) \mid (x-a)^2 + (y-b)^2 = c^2\,\}.$$
Similarly, to analyze random objects, since we cannot deal with them
directly in mathematics, we consider their mathematical models. For
example, when we say n coin tosses, it does not mean that we toss a real coin

n times, but it means a mathematical model of it, which is described by
mathematical expressions in the same way as circle.
Let us consider a mathematical model of 3 coin tosses. Let $X_i \in \{0,1\}$
be the outcome (Heads = 1 and Tails = 0) of the i-th coin toss. At high
school, students learn that the probability that the consecutive outcomes of 3
coin tosses are Heads, Tails, Heads, is
$$P(X_1 = 1,\ X_2 = 0,\ X_3 = 1) = \left(\frac{1}{2}\right)^3 = \frac{1}{8}. \qquad (1.1)$$
Here, however, the mathematical definitions of P and $X_i$ are not clear.
After making them clear, we can call them a mathematical model of 3 coin
tosses.

Fig. 1.1 Heads and Tails of 1 JPY coin

Example 1.1. Let $\{0,1\}^3$ denote the set of all {0, 1}-sequences of length 3:
$$\{0,1\}^3 := \{\,\omega = (\omega_1, \omega_2, \omega_3) \mid \omega_i \in \{0,1\},\ 1 \le i \le 3\,\}$$
$$= \{\,(0,0,0),\ (0,0,1),\ (0,1,0),\ (0,1,1),\ (1,0,0),\ (1,0,1),\ (1,1,0),\ (1,1,1)\,\}.$$
Let $\mathfrak{P}(\{0,1\}^3)$ be the power set of $\{0,1\}^3$, 1 i.e., the set of all subsets of
$\{0,1\}^3$. $A \in \mathfrak{P}(\{0,1\}^3)$ is equivalent to $A \subset \{0,1\}^3$. Let #A denote the
number of elements of A. Now, define a function $P_3 : \mathfrak{P}(\{0,1\}^3) \to [0,1] := \{\,x \mid 0 \le x \le 1\,\}$ by
$$P_3(A) := \frac{\#A}{\#\{0,1\}^3} = \frac{\#A}{2^3}, \quad A \in \mathfrak{P}(\{0,1\}^3)$$
(see Definition A.2), and functions $\xi_i : \{0,1\}^3 \to \{0,1\}$, i = 1, 2, 3, by
$$\xi_i(\omega) := \omega_i, \quad \omega = (\omega_1, \omega_2, \omega_3) \in \{0,1\}^3. \qquad (1.2)$$
1 $\mathfrak{P}$ is the letter P in the Fraktur typeface.

Each $\xi_i$ is called a coordinate function. Then, we have
$$P_3(\{\,\omega \in \{0,1\}^3 \mid \xi_1(\omega) = 1,\ \xi_2(\omega) = 0,\ \xi_3(\omega) = 1\,\})$$
$$= P_3(\{\,\omega \in \{0,1\}^3 \mid \xi_1 = 1,\ \xi_2 = 0,\ \xi_3 = 1\,\})$$
$$= P_3(\{(1, 0, 1)\}) = \frac{1}{2^3}. \qquad (1.3)$$
Although (1.3) has nothing to do with real coin tosses, it is formally
the same as (1.1). Readers can easily examine the formal identity not only
for the case Heads, Tails, Heads, but also for any other possible outcome
of 3 coin tosses. Thus we can compute every probability concerning 3 coin
tosses by using $P_3$ and $\{\xi_i\}_{i=1}^3$. This means that P and $\{X_i\}_{i=1}^3$ in (1.1)
can be considered as $P_3$ and $\{\xi_i\}_{i=1}^3$, respectively. In other words, by the
correspondence
$$P \leftrightarrow P_3, \qquad \{X_i\}_{i=1}^3 \leftrightarrow \{\xi_i\}_{i=1}^3,$$
$P_3$ and $\{\xi_i\}_{i=1}^3$ are a mathematical model of 3 coin tosses.
The equation $(x-a)^2 + (y-b)^2 = c^2$ is not the unique mathematical
model of circle. There are different models of it; e.g., a parametrized
representation
$$x = c \cos t + a, \quad y = c \sin t + b, \qquad 0 \le t \le 2\pi.$$
You can select suitable mathematical models according to your particular
purposes. In the case of coin tosses, it is all the same. We can present
another mathematical model of 3 coin tosses.
Example 1.2. (Borel's model of coin tosses) For each $x \in [0,1) := \{\,x \mid 0 \le x < 1\,\}$, let $d_i(x) \in \{0,1\}$ denote the i-th digit of x in its binary expansion
(Sec. A.2.2). We write the length of each semi-open interval $[a, b) \subset [0,1)$
as
$$P([a, b)) := b - a.$$
Here, the function P that returns the lengths of semi-open intervals is called
the Lebesgue measure. Then, the length of the set of $x \in [0,1)$ for which
$d_1(x), d_2(x), d_3(x)$ are 1, 0, 1, respectively, is
$$P(\{\,x \in [0,1) \mid d_1(x) = 1,\ d_2(x) = 0,\ d_3(x) = 1\,\})$$
$$= P\left(\left\{\,x \in [0,1)\ \middle|\ \frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} \le x < \frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} + \frac{1}{2^3}\,\right\}\right)$$
$$= P\left(\left[\frac{5}{8}, \frac{6}{8}\right)\right) = \frac{1}{8}.$$
In the number line with binary scale, it is expressed as a segment:

[Figure: the segment from 0.101 to 0.11 marked on the number line with binary scale, with ticks at 0, 0.001, 0.01, 0.011, 0.1, 0.101, 0.11, 0.111, 1]

Under the correspondence
$$P \leftrightarrow P, \qquad \{X_i\}_{i=1}^3 \leftrightarrow \{d_i\}_{i=1}^3,$$
P and $\{d_i\}_{i=1}^3$ are also a mathematical model of 3 coin tosses.
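Before moving on, here is a minimal C sketch (ours, not the book's) of Borel's model: d(x, i) computes the i-th binary digit $d_i(x)$, and the main routine measures the event $\{d_1 = 1, d_2 = 0, d_3 = 1\}$ by scanning a fine grid of points in [0, 1); the grid size $2^{20}$ is an arbitrary illustrative choice.

#include <stdio.h>

/* i-th binary digit of x in [0,1): d_i(x) = floor(2^i x) mod 2 */
int d(double x, int i) {
    for (int k = 0; k < i; k++) x *= 2.0;   /* shift i digits to the left */
    return (int)x % 2;                      /* the digit now before the point */
}

int main(void) {
    const int N = 1 << 20;                  /* grid of 2^20 sample points */
    int hits = 0;
    for (int k = 0; k < N; k++) {
        double x = (k + 0.5) / N;           /* midpoint of the k-th cell */
        if (d(x, 1) == 1 && d(x, 2) == 0 && d(x, 3) == 1) hits++;
    }
    /* prints 0.125 = 1/8, the Lebesgue measure of [5/8, 6/8) */
    printf("measure ~ %f\n", (double)hits / N);
    return 0;
}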

Readers may suspect that, in the first place, (1.1) is not correct for
real coin tosses. Indeed, rigorously speaking, since Heads and Tails are
carved differently, (1.1) is not exact in the real world. What we call coin
tosses is an idealized model, which can exist only in our mind, just as we
consider the equation $(x-a)^2 + (y-b)^2 = c^2$ as a mathematical model of
circle, although there is no true circle in the real world.

1.1.1 Probability space


Let us present what we stated in the previous section in a general setup.
In what follows, probability theory means the axiomatic system for probability
established by [Kolmogorov (1933)] 2 and all its derived theorems.
Let us begin with probability distribution and probability space.
Let us begin with probability distribution and probability space.

Definition 1.1. (Probability distribution) Let Ω be a non-empty finite
set, i.e., Ω ≠ ∅ 3 and #Ω < ∞. Suppose that to each $\omega \in \Omega$, there
corresponds a real number $0 \le p_\omega \le 1$ so that
$$\sum_{\omega \in \Omega} p_\omega = 1$$
(Sec. A.1.2). Then, we call the set of all pairs ω and $p_\omega$
$$\{\,(\omega, p_\omega) \mid \omega \in \Omega\,\}$$
a probability distribution (or simply, a distribution) in Ω.

Definition 1.2. (Probability space) Let Ω be a non-empty finite set and
let $\mathfrak{P}(\Omega)$ be the power set of Ω. If a function $P : \mathfrak{P}(\Omega) \to \mathbf{R}$ satisfies
(i) $0 \le P(A) \le 1$, $A \in \mathfrak{P}(\Omega)$,
(ii) $P(\Omega) = 1$, and
2 See Bibliography at the end of the book.
3 ∅ denotes the empty set.

(iii) if $A, B \in \mathfrak{P}(\Omega)$ are disjoint, i.e., $A \cap B = \emptyset$, then
$$P(A \cup B) = P(A) + P(B),$$
then the triplet $(\Omega, \mathfrak{P}(\Omega), P)$ is called a probability space. 4 An element of
$\mathfrak{P}(\Omega)$ (i.e., a subset of Ω) is called an event; in particular, Ω is called the
whole event (or the sample space), and ∅ the empty event. A one-point
set {ω} or ω itself is called an elementary event. P is called a probability
measure (or simply, a probability) and P(A) the probability of A.

For a non-empty finite set Ω, to give a distribution in it and to give a
probability space are equivalent. In fact, if a distribution $\{\,(\omega, p_\omega) \mid \omega \in \Omega\,\}$
is given, by defining a probability $P : \mathfrak{P}(\Omega) \to \mathbf{R}$ as
$$P(A) := \sum_{\omega \in A} p_\omega, \quad A \in \mathfrak{P}(\Omega),$$
the triplet $(\Omega, \mathfrak{P}(\Omega), P)$ becomes a probability space. Conversely, if a
probability space $(\Omega, \mathfrak{P}(\Omega), P)$ is given, by defining
$$p_\omega := P(\{\omega\}), \quad \omega \in \Omega,$$
$\{\,(\omega, p_\omega) \mid \omega \in \Omega\,\}$ becomes a distribution in Ω.
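Concretely, the passage from a distribution to a probability measure is just a finite sum. A minimal C sketch (ours), with Ω represented as indices 0, ..., n−1 and an event A given by its indicator array:

#include <stdio.h>

/* P(A) := sum of p[w] over w in A, where A is given by its indicator. */
double prob(const double p[], const int in_A[], int n) {
    double s = 0.0;
    for (int w = 0; w < n; w++)
        if (in_A[w]) s += p[w];
    return s;
}

int main(void) {
    /* A distribution on Omega = {0,1,2,3}: the p[w] must sum to 1. */
    double p[4] = {0.1, 0.2, 0.3, 0.4};
    int A[4]    = {0, 1, 0, 1};                /* the event A = {1, 3} */
    printf("P(A) = %.2f\n", prob(p, A, 4));    /* prints 0.60 */
    return 0;
}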

A triplet $(\Omega, \mathfrak{P}(\Omega), P)$ is a probability space provided that the conditions
(i)(ii)(iii) of Definition 1.2 are satisfied, no matter whether it is related to
a random phenomenon or not. Thus the triplet
$$(\{0,1\}^3, \mathfrak{P}(\{0,1\}^3), P_3),$$
whose components have been defined in Example 1.1, is a probability space
not because it is related to 3 coin tosses but because it satisfies all the
conditions (i)(ii)(iii).
Like $P_3$, in general, a probability measure satisfying
$$P(A) = \frac{\#A}{\#\Omega}, \quad A \in \mathfrak{P}(\Omega),$$
is called a uniform probability measure. Equivalently, a distribution satisfying
$$p_\omega = \frac{1}{\#\Omega}, \quad \omega \in \Omega,$$
4 In mathematics, many kinds of spaces enter the stage, such as linear space, Euclidean
space, topological space, Hilbert space, etc. These are sets accompanied by
some structures, operations, or functions. In general, they have nothing to do with the
3-dimensional space where we live.

is called a uniform distribution. Setting the uniform distribution means
that we assume every element of Ω is equally likely to be chosen.

By the way, in Example 1.2, you may wish to assume [0, 1) to be the
whole event and P to be the probability, but since [0, 1) is an infinite set, it
is not covered by Definition 1.2. By extending the definition of probability
space, it is possible to consider an infinite set as a whole event, but to do
this, we need Lebesgue's measure theory, which exceeds the level of this
book.

Each assertion of the following proposition is easy to derive from Definition 1.2.

Proposition 1.1. Let $(\Omega, \mathfrak{P}(\Omega), P)$ be a probability space. For $A, B \in \mathfrak{P}(\Omega)$, we have
(i) $P(A^c) = 1 - P(A)$, 5 in particular, $P(\emptyset) = 0$,
(ii) $A \subset B \Rightarrow P(A) \le P(B)$, and
(iii) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, in particular, $P(A \cup B) \le P(A) + P(B)$.
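For instance, assertion (iii) follows from the additivity (iii) of Definition 1.2 applied to two disjoint decompositions; a sketch of one possible derivation:
$$A \cup B = A \cup (B \setminus A) \quad\text{and}\quad B = (A \cap B) \cup (B \setminus A),$$
both unions being disjoint, so
$$P(A \cup B) = P(A) + P(B \setminus A) = P(A) + \bigl(P(B) - P(A \cap B)\bigr).$$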

1.1.2 Random variable

Definition 1.3. Let $(\Omega, \mathfrak{P}(\Omega), P)$ be a probability space. We call a function
$X : \Omega \to \mathbf{R}$ a random variable. Let $\{a_1, \ldots, a_s\} \subset \mathbf{R}$ be the set of all possible
values that X can take, which is called the range of X, and let $p_i$ be the
probability that $X = a_i$:
$$P(\{\,\omega \in \Omega \mid X(\omega) = a_i\,\}) =: p_i, \quad i = 1, \ldots, s. \qquad (1.4)$$
Then, we call the set of all pairs $(a_i, p_i)$
$$\{\,(a_i, p_i) \mid i = 1, \ldots, s\,\} \qquad (1.5)$$
the probability distribution (or simply, the distribution) of X. Since we have
$$0 \le p_i \le 1, \quad i = 1, \ldots, s, \qquad p_1 + \cdots + p_s = 1,$$
(1.5) is a distribution in the range $\{a_1, \ldots, a_s\}$ of X. The left-hand side of
(1.4) and the event inside $P(\ \cdot\ )$ are often abbreviated as
$$P(X = a_i) \quad\text{and}\quad \{X = a_i\},$$
respectively.
5 $A^c := \{\,\omega \in \Omega \mid \omega \notin A\,\}$ is the complement of A.

For several random variables $X_1, \ldots, X_n$, let $\{a_{i1}, \ldots, a_{is_i}\} \subset \mathbf{R}$ be the
range of $X_i$, $i = 1, \ldots, n$, and let
$$P(X_1 = a_{1j_1}, \ldots, X_n = a_{nj_n}) =: p_{j_1,\ldots,j_n}, \quad j_1 = 1, \ldots, s_1,\ \ldots,\ j_n = 1, \ldots, s_n. \qquad (1.6)$$
Then, the set
$$\{\,((a_{1j_1}, \ldots, a_{nj_n}), p_{j_1,\ldots,j_n}) \mid j_1 = 1, \ldots, s_1,\ \ldots,\ j_n = 1, \ldots, s_n\,\} \qquad (1.7)$$
is called the joint distribution of $X_1, \ldots, X_n$. Of course, the left-hand side
of (1.6) is an abbreviation of
$$P(\{\,\omega \in \Omega \mid X_1(\omega) = a_{1j_1},\ X_2(\omega) = a_{2j_2},\ \ldots,\ X_n(\omega) = a_{nj_n}\,\}).$$
Since we have
$$0 \le p_{j_1,\ldots,j_n} \le 1, \quad j_1 = 1, \ldots, s_1,\ \ldots,\ j_n = 1, \ldots, s_n,$$
$$\sum_{j_1=1,\ldots,s_1,\ \ldots,\ j_n=1,\ldots,s_n} p_{j_1,\ldots,j_n} = 1$$
(Sec. A.1.2), (1.7) is a distribution in the direct product of the ranges of
$X_1, \ldots, X_n$ (Definition A.1):
$$\{a_{11}, \ldots, a_{1s_1}\} \times \cdots \times \{a_{n1}, \ldots, a_{ns_n}\}.$$
In contrast with the joint distribution, the distribution of each individual
$X_i$ is called the marginal distribution.

Example 1.3. Let us look closely at joint distribution and marginal
distribution in the case of n = 2. Suppose that the joint distribution of two
random variables $X_1$ and $X_2$ is given by
$$P(X_1 = a_{1i},\ X_2 = a_{2j}) = p_{ij}, \quad i = 1, \ldots, s_1,\ j = 1, \ldots, s_2.$$
Then, their marginal distributions are computed as
$$P(X_1 = a_{1i}) = \sum_{j=1}^{s_2} P(X_1 = a_{1i},\ X_2 = a_{2j}) = \sum_{j=1}^{s_2} p_{ij}, \quad i = 1, \ldots, s_1,$$
$$P(X_2 = a_{2j}) = \sum_{i=1}^{s_1} P(X_1 = a_{1i},\ X_2 = a_{2j}) = \sum_{i=1}^{s_1} p_{ij}, \quad j = 1, \ldots, s_2.$$
This situation is illustrated in the following table.



                                X_1 = a_{11}   · · ·   X_1 = a_{1s_1}      marginal distribution of X_2
X_2 = a_{21}                    p_{11}         · · ·   p_{s_1 1}           $\sum_{i=1}^{s_1} p_{i1}$
  ...                             ...          · · ·     ...                 ...
X_2 = a_{2s_2}                  p_{1s_2}       · · ·   p_{s_1 s_2}         $\sum_{i=1}^{s_1} p_{is_2}$
marginal distribution of X_1    $\sum_{j=1}^{s_2} p_{1j}$   · · ·   $\sum_{j=1}^{s_2} p_{s_1 j}$   $\sum_{i=1,\ldots,s_1,\ j=1,\ldots,s_2} p_{ij} = 1$

As you see, since they are placed in the margins of the table, they are called
marginal distributions.
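In code, the marginal distributions are just row and column sums of the joint table. A minimal C sketch (ours, with an arbitrary 2 × 3 joint distribution chosen for illustration):

#include <stdio.h>

#define S1 2   /* size of the range of X1 */
#define S2 3   /* size of the range of X2 */

int main(void) {
    /* joint[i][j] = P(X1 = a_{1,i+1}, X2 = a_{2,j+1}); entries sum to 1 */
    double joint[S1][S2] = {
        {0.10, 0.20, 0.10},
        {0.30, 0.15, 0.15},
    };
    double m1[S1] = {0}, m2[S2] = {0};

    for (int i = 0; i < S1; i++)
        for (int j = 0; j < S2; j++) {
            m1[i] += joint[i][j];   /* marginal of X1: sum over j */
            m2[j] += joint[i][j];   /* marginal of X2: sum over i */
        }

    for (int i = 0; i < S1; i++) printf("P(X1 = a_1%d) = %.2f\n", i + 1, m1[i]);
    for (int j = 0; j < S2; j++) printf("P(X2 = a_2%d) = %.2f\n", j + 1, m2[j]);
    return 0;
}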

The joint distribution uniquely determines the marginal distributions
(Example 1.3), but in general the marginal distributions do not uniquely
determine the joint distribution.

Example 1.4. The coordinate functions $\xi_i : \{0,1\}^3 \to \mathbf{R}$, i = 1, 2, 3, defined
on the probability space $(\{0,1\}^3, \mathfrak{P}(\{0,1\}^3), P_3)$ by (1.2) are random
variables. They all have the same distribution:
$$\{\,(0, 1/2),\ (1, 1/2)\,\}.$$
Their joint distribution is the uniform distribution in $\{0,1\}^3$:
$$\{\,((0,0,0), 1/8),\ ((0,0,1), 1/8),\ ((0,1,0), 1/8),\ ((0,1,1), 1/8),$$
$$((1,0,0), 1/8),\ ((1,0,1), 1/8),\ ((1,1,0), 1/8),\ ((1,1,1), 1/8)\,\}.$$

Example 1.5. A constant can be considered to be a random variable. Let
$(\Omega, \mathfrak{P}(\Omega), P)$ be a probability space and $c \in \mathbf{R}$ be a constant. Then, the
random variable $X(\omega) := c$, $\omega \in \Omega$, has the distribution {(c, 1)}.

Random variables play the leading role in probability theory. That a random
variable X is defined on a probability space $(\Omega, \mathfrak{P}(\Omega), P)$ is interpreted
as ω is randomly chosen from Ω with probability P({ω}), and accordingly
the value X(ω) becomes random. In general, choosing an $\omega \in \Omega$
and getting the value X(ω) is called sampling, and X(ω) is called a sample
value of X.

In probability theory, we always deal with random variables as functions,
and we are indifferent to individual sample values or sampling methods.
Therefore random variables need not have the interpretation that they
are random, and need not be chosen randomly. However, in practical
applications, such as mathematical statistics or the Monte Carlo method,
sample values or sampling methods may become significant.

Usually, a probability space is just a stage that random variables enter.
Given a distribution or a joint distribution, we often make a suitable
probability space and define a random variable or several random variables
on it, so that its distribution or their joint distribution coincides with the
given one. For example, for any given distribution $\{\,(a_i, p_i) \mid i = 1, \ldots, s\,\}$,
define a probability space $(\Omega, \mathfrak{P}(\Omega), P)$ and a random variable X by
$$\Omega := \{a_1, \ldots, a_s\}, \quad P(\{a_i\}) := p_i, \quad X(a_i) := a_i, \quad i = 1, \ldots, s.$$
Then, the distribution of X coincides with the given one. Similarly, for any
given joint distribution
$$\{\,((a_{1j_1}, \ldots, a_{nj_n}), p_{j_1,\ldots,j_n}) \mid j_1 = 1, \ldots, s_1,\ \ldots,\ j_n = 1, \ldots, s_n\,\}, \qquad (1.8)$$
define a probability space $(\Omega, \mathfrak{P}(\Omega), P)$ and random variables $X_1, \ldots, X_n$
by
$$\Omega := \{a_{11}, \ldots, a_{1s_1}\} \times \cdots \times \{a_{n1}, \ldots, a_{ns_n}\},$$
$$P(\{(a_{1j_1}, \ldots, a_{nj_n})\}) := p_{j_1,\ldots,j_n}, \quad j_1 = 1, \ldots, s_1,\ \ldots,\ j_n = 1, \ldots, s_n,$$
$$X_i((a_{1j_1}, \ldots, a_{nj_n})) := a_{ij_i}, \quad i = 1, \ldots, n.$$
Each $X_i$ is a coordinate function. Then, the joint distribution of $X_1, \ldots, X_n$
coincides with the given one (1.8). Such a realization of a probability space and
random variable(s) is called the canonical realization. Example 1.4 shows
the canonical realization of 3 coin tosses.

Remark 1.1. Laplace insisted that randomness should not exist and all
phenomena should be deterministic ([Laplace (1812)]). For an intelligence
who knows all the forces that move substances and the positions
and velocities of all the molecules that constitute the substances at an
initial time, and who in addition 6 has a vast ability to analyze
the equations of motion, there would be no irregular things in this world and
everything would be deterministic. However, for us who know only a little
part of the universe and do not have enough ability to analyze the very
complicated equations of motion, things occur as if they do at random.
6 This intelligence is often referred to as Laplace's demon.

The formulation of random variable in probability theory reflects
Laplace's determinism. Namely, the whole event Ω can be interpreted as
the set of all possible initial values, each elementary event ω as one of the
initial values, and X(ω) as the solution of the very complicated equation of
motion under the initial value ω.
Laplace's determinism had been the dominating thought in the history of
science, until quantum mechanics was discovered.

1.2 Random number

Randomness is involved in the procedure by which an ω is chosen from Ω; on the
other hand, probability theory is indifferent to that procedure. Therefore
it seems impossible to analyze randomness by probability theory, but as a
matter of fact, it is possible. We explain why it is possible in this and in
the following sections.

Randomness can be defined in mathematics by formulating the procedure
of choosing an ω from Ω.
Generalizing Example 1.1, consider
$$(\{0,1\}^n, \mathfrak{P}(\{0,1\}^n), P_n)$$
as a probability space for n coin tosses, where $P_n : \mathfrak{P}(\{0,1\}^n) \to \mathbf{R}$ is
the uniform probability measure on $\{0,1\}^n$, i.e., the probability measure
satisfying
$$P_n(A) := \frac{\#A}{\#\{0,1\}^n} = \frac{\#A}{2^n}, \quad A \in \mathfrak{P}(\{0,1\}^n). \qquad (1.9)$$

Suppose that Alice 7 chooses an $\omega \in \{0,1\}^n$ of her own will. When n
is small, she can easily write down a {0,1}-sequence of length n for ω. For
example, if n = 10, she writes (1, 1, 1, 0, 1, 0, 0, 1, 1, 1). When n = 1000, she
can do it somehow in the same way.
It comes into question when n ≫ 1. 8 For example, when $n = 10^8$, how
on earth can Alice choose an ω from $\{0,1\}^{10^8}$? In principle, she should
write down a {0,1}-sequence of length $10^8$, but it is impossible because $10^8$
is too huge. Considering the hardness of the task, she cannot help using a
computer to choose an $\omega \in \{0,1\}^{10^8}$. The computer program that produces
$$\omega = (0, 0, 0, 0, \ldots, 0, 0) \in \{0,1\}^{10^8}, \qquad (1.10)$$
7 Alice is a fictitious character who makes several thought experiments in this book.
8 a ≫ b means a is much greater than b. See Sec. A.1.3.

i.e., the run of 0s of length $10^8$, would be simple and easy to write. The
one that produces
$$\omega = (1, 0, 1, 0, \ldots, 1, 0) \in \{0,1\}^{10^8}, \qquad (1.11)$$
i.e., $5 \times 10^7$ repetitions of the pattern 1, 0, would be less simple but still
easy to write. On the other hand, for some $\omega \in \{0,1\}^{10^8}$, the program to
produce it would be too long to write in practice. Let us explain this below.
In general, a program is a finite string of letters and symbols, which
is described in a computer as a {0,1}-sequence of finite length. For each
$\omega \in \{0,1\}^{10^8}$, let $q_\omega$ be the shortest program that produces ω, and let
$L(q_\omega)$ denote the length 9 of $q_\omega$ as a {0,1}-sequence. If $\omega \ne \omega'$, then
$q_\omega \ne q_{\omega'}$. Now, the number of ω for which $L(q_\omega) = k$ is at most $2^k$, the
number of all elements of $\{0,1\}^k$. This implies that the total number of
$\omega \in \{0,1\}^{10^8}$ for which $L(q_\omega) \le M$ is at most
$$\#\{0,1\}^1 + \#\{0,1\}^2 + \cdots + \#\{0,1\}^M = 2^1 + 2^2 + \cdots + 2^M = 2^{M+1} - 2.$$
Conversely, the number of $\omega \in \{0,1\}^{10^8}$ for which $L(q_\omega) \ge M + 1$ is at
least $2^{10^8} - 2^{M+1} + 2$. More concretely, the number of $\omega \in \{0,1\}^{10^8}$ for
which $L(q_\omega) \ge 10^8 - 10$ is at least $2^{10^8} - 2^{10^8 - 10} + 2$, which accounts for at
least $1 - 2^{-10} = 1023/1024$ of all elements in $\{0,1\}^{10^8}$. On the other hand,
for any $\omega \in \{0,1\}^{10^8}$, there is a program that outputs ω as it is, whose
length would be a little greater than, and hence approximately equal to, the
length of ω $(= 10^8)$. These facts show that for nearly all $\omega \in \{0,1\}^{10^8}$, we
have $L(q_\omega) \approx 10^8$. 10 In general, when n ≫ 1, we call an $\omega \in \{0,1\}^n$ with
$L(q_\omega) \approx n$ a random {0,1}-sequence or a random number. 11
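For the record, the fraction 1023/1024 quoted above is immediate arithmetic:
$$\frac{2^{10^8} - 2^{10^8-10} + 2}{2^{10^8}} > 1 - \frac{2^{10^8-10}}{2^{10^8}} = 1 - 2^{-10} = \frac{1023}{1024}.$$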

An ω that can be produced by a short program $q_\omega$ can be so produced
because it has some regularity; or rather, $q_\omega$ itself describes the regularity
of ω. The longer $q_\omega$ is, the less regular ω is. Therefore we can say that
random numbers are the least regular {0,1}-sequences. Random numbers
require too long programs to produce them. This means that they cannot
be chosen by Alice. Thus the notion of random number well expresses the
randomness that we intuitively know.
9 The length of the shortest program depends on the programming language, but do
not mind it here. We will explain it in detail in Sec. 2.2.1.
10 x ≈ y means x and y are approximately equal to each other.
11 In this manner, the definition of random number cannot help being somewhat ambiguous.
We will explain it in detail in Remark 2.1.



When n ≫ 1, if each $\omega \in \{0,1\}^n$ is chosen with probability $P_n(\{\omega\}) = 2^{-n}$,
a random number will be chosen with very high probability. At the
opening of this chapter, we mentioned: Tossing a coin many times, record
1 if it comes up Heads and record 0 if it comes up Tails at each coin toss.
Then, we get a long {0,1}-sequence that is random. Exactly speaking,
we must revise it as: we get a long random {0,1}-sequence with very high
probability. Indeed, with very small probability, we may get a non-random
{0,1}-sequence such as (1.10) or (1.11).

Remark 1.2. In probability theory, the term random usually means
that the quantity in question is a random variable. In the sciences in general,
random is used in many contexts with various meanings. The randomness
introduced in this section is called algorithmic randomness to specify its
meaning.

Remark 1.3. We have defined randomness for long {0,1}-sequences, such
as the outcomes of many coin tosses, but we feel that even a single coin toss is
random. Why do we feel so?
For example, if we drop a coin with Heads up quietly from a height
of 5 mm above a desk surface, then the outcome will be Heads without
doubt. It is thus possible to control Heads and Tails in this case. On
the other hand, if we drop it from a height of 50 cm, it is not possible.
Since rebounding obeys a deterministic physical law, we can, in principle,
control Heads and Tails, even though the coin is dropped from an arbitrarily
high position (Remark 1.1). However, each rebounding motion depends so
acutely on the initial value that a slight difference of initial values causes
a big difference in the results (Fig. 1.2). In other words, when we drop a
coin from a height of 50 cm, in order to control its Heads and Tails, we must
measure and set the initial mechanical state of the coin with extremely high
precision, which is impossible in practice. Thus controlling the Heads and
Tails of a coin dropped from a high position is similar to choosing a random
number in that both are beyond our ability.

1.3 Limit theorem

1.3.1 Analysis of randomness


Since mathematical models of virtually all random phenomena can be
constructed from coin tosses (Sec. 1.5.2), to analyze randomness we have
essentially only to study properties of random numbers. However, we cannot

Fig. 1.2 Reboundings of a coin (Two simulations with slightly different initial values)

know any property of each individual random number. Indeed, if ω is a random
number, we cannot know that it is so (Theorem 2.7). What is sure is
that when n ≫ 1, random numbers account for nearly all {0,1}-sequences.
In such a situation, is it possible to study properties of random numbers?
To study their properties as a whole is possible. The answer is
unexpectedly simple: study properties that nearly all long {0,1}-sequences
share. Such properties are stated by various limit theorems in probability
theory. Therefore the study of limit theorems is the most important part of
probability theory.

Here is an example of a limit theorem. For each i = 1, . . . , n, being
defined on the probability space $(\{0,1\}^n, \mathfrak{P}(\{0,1\}^n), P_n)$, the coordinate
function 12
$$\xi_i(\omega) := \omega_i, \quad \omega = (\omega_1, \ldots, \omega_n) \in \{0,1\}^n, \qquad (1.12)$$
represents the i-th outcome (0 or 1) of n coin tosses. Then, no matter how
small ε > 0 is, it holds that
$$\lim_{n \to \infty} P_n\left(\left\{\,\omega \in \{0,1\}^n\ \middle|\ \left|\frac{\xi_1(\omega) + \cdots + \xi_n(\omega)}{n} - \frac{1}{2}\right| > \varepsilon\,\right\}\right) = 0. \qquad (1.13)$$
12 Strictly speaking, since the domain of definition of $\xi_i$ is $\{0,1\}^n$, which depends on n,
we must write it as $\xi_{n,i}$. We drop n because its meaning is common to any n: the i-th
coordinate of ω.

Since $\xi_1(\omega) + \cdots + \xi_n(\omega)$ is the number of Heads (= 1) in the n coin tosses ω,
(1.13) asserts that when n ≫ 1, the relative frequency of Heads should be
approximately 1/2 for nearly all $\omega \in \{0,1\}^n$. (1.13) is a limit theorem called
Bernoulli's theorem (Theorem 3.2, which is a special case of the law of large
numbers (Theorem 3.3)). As a quantitative estimate, we have Chebyshev's
inequality (Example 3.3):
$$P_n\left(\left\{\,\omega \in \{0,1\}^n\ \middle|\ \left|\frac{\xi_1(\omega) + \cdots + \xi_n(\omega)}{n} - \frac{1}{2}\right| \ge \varepsilon\,\right\}\right) \le \frac{1}{4n\varepsilon^2}.$$
As a special case where ε = 1/2000 and $n = 10^8$, we have
$$P_{10^8}\left(\left\{\,\omega \in \{0,1\}^{10^8}\ \middle|\ \left|\frac{\xi_1(\omega) + \cdots + \xi_{10^8}(\omega)}{10^8} - \frac{1}{2}\right| \ge \frac{1}{2000}\,\right\}\right) \le \frac{1}{100}.$$
A much more advanced theorem called de Moivre-Laplace's theorem (Theorem
3.6, a special case of the central limit theorem (Theorem 3.9)) shows
that this probability is very precisely estimated as (Example 3.10) 13
$$P_{10^8}\left(\left\{\,\omega \in \{0,1\}^{10^8}\ \middle|\ \left|\frac{\xi_1(\omega) + \cdots + \xi_{10^8}(\omega)}{10^8} - \frac{1}{2}\right| \ge \frac{1}{2000}\,\right\}\right)$$
$$\approx 2\int_{9.9999}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx = 1.52551 \times 10^{-23}.$$
Subtracting both sides from 1, we get
$$P_{10^8}\left(\left\{\,\omega \in \{0,1\}^{10^8}\ \middle|\ \left|\frac{\xi_1(\omega) + \cdots + \xi_{10^8}(\omega)}{10^8} - \frac{1}{2}\right| < \frac{1}{2000}\,\right\}\right) \approx 1 - 1.52551 \times 10^{-23}.$$
Namely, $1 - 1.52551 \times 10^{-23}$ of all the elements of $\{0,1\}^{10^8}$ are those that
satisfy
$$\left|\frac{\xi_1(\omega) + \cdots + \xi_{10^8}(\omega)}{10^8} - \frac{1}{2}\right| < \frac{1}{2000}. \qquad (1.14)$$
On the other hand, nearly all elements of $\{0,1\}^{10^8}$ are random numbers.
Therefore nearly all random numbers share the property (1.14), and
conversely, nearly all ω that satisfy (1.14) are random numbers.
As in (1.11), there are $\omega \in \{0,1\}^{10^8}$ that are not random but
satisfy (1.14). In general, an event of probability close to 1 that a limit theorem
specifies does not completely coincide with the set of random numbers,
but the difference between them is very small (Fig. 1.3).
13 exp(x) is an alternative description of the exponential function $e^x$; hence $\exp(-x^2/2)$
stands for $e^{-x^2/2}$, and $\int_{9.9999}^{\infty}$ is an abbreviation of $\lim_{R \to \infty} \int_{9.9999}^{R}$, which is called an
improper integral.
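In passing, the Chebyshev figure 1/100 used above is also immediate arithmetic:
$$\frac{1}{4n\varepsilon^2} = \frac{1}{4 \cdot 10^8 \cdot (1/2000)^2} = \frac{4 \cdot 10^6}{4 \cdot 10^8} = \frac{1}{100}.$$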

[Fig. 1.3 Random number and limit theorem (conceptual figure): within the set of all {0,1}-sequences, the set of random numbers and an event of probability close to 1 that a limit theorem specifies overlap except for a very small difference]

Bernoulli's theorem was published in 1713, after his death. Since then,
limit theorems have been the central theme of probability theory. On the
other hand, the concept of randomness was established in the 1960s. Thus
mathematicians had been studying limit theorems since long before they
defined randomness.

1.3.2 Mathematical statistics


Applying the law of large numbers and the central limit theorem, we can
estimate various probabilities in the real world.
For example, a thumbtack (drawing pin) can land either point up or
point down. When we toss it, what is the probability that it lands point up? To
answer this, toss it many times, count the number of landings with point
up, and calculate the relative frequency of landings with point up, which
will be an approximate value of the probability. A similar method applies to
opinion polls: without interviewing all citizens, choose a part of the citizens by
dice or lot (random sampling), interview them, and compute the proportion
of those who say yea among them.
Suppose we are given a certain coin and asked: When it is tossed,
is the probability that it comes up Heads 1/2? To answer this, toss it
many times, calculate the relative frequency of Heads, and examine whether it is

Fig. 1.4 Thumbtacks (Left: point up, Right: point down)

a reasonable value in light of de Moivre-Laplace's theorem under the
assumption that the probability is 1/2. This method is also useful when we
examine whether the ratio of boys and girls among newborn babies in a certain
country is 1 : 1.
In this way, through experiments, observations, and investigations, making
a mathematical model of random phenomena, testing it, or predicting
the random phenomena by it are all included in a branch of study called
mathematical statistics.

1.4 Monte Carlo method

The Monte Carlo method is a numerical method to solve mathematical
problems by computer-aided sampling of random variables.
When #Ω is small, sampling of a random variable $X : \Omega \to \mathbf{R}$ is easy.
If $\#\Omega = 10^8$, we have only to specify a number of at most 9 decimal digits
to choose an $\omega \in \Omega$, but when $\Omega = \{0,1\}^{10^8}$, a computer is indispensable
for sampling. Let us consider the following exercise.

Exercise I  When we toss a coin 100 times, what is the probability
p that it comes up Heads at least 6 times in succession?

We apply one of the ideas of mathematical statistics stated in Sec. 1.3.2.
Repeat 100 coin tosses $10^6$ times, and let S be the number of occurrences
of the event the coin comes up Heads 6 times in succession among the $10^6$
trials. Then, by the law of large numbers, $S/10^6$ will be a good approximate
value of p with high probability. To do this, the total number of coin
tosses we need is $100 \times 10^6 = 10^8$. Of course, we do not toss a real coin so
many times; instead we use a computer.
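To make the experiment concrete, here is a minimal C sketch of it (ours, not the book's program; the C library's rand(), seeded arbitrarily, stands in for the pseudorandom generator, whereas the book's very point, developed in Chapter 4, is that the choice of generator must be justified):

#include <stdio.h>
#include <stdlib.h>

/* One trial: toss a coin 100 times; return 1 if Heads occurs
   at least 6 times in succession, else 0. */
static int trial(void) {
    int run = 0;
    for (int i = 0; i < 100; i++) {
        if (rand() & 1) {              /* 1 = Heads, 0 = Tails */
            if (++run >= 6) return 1;  /* a run of 6 Heads found */
        } else {
            run = 0;                   /* the run is broken by Tails */
        }
    }
    return 0;
}

int main(void) {
    const int N = 1000000;             /* 10^6 repetitions of 100 tosses */
    int hits = 0;
    srand(12345);                      /* fixed seed for reproducibility */
    for (int k = 0; k < N; k++) hits += trial();
    printf("p ~ %f\n", (double)hits / N);   /* S/10^6, the estimate of p */
    return 0;
}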

Let us sort out the problem. S is formulated as a random variable defined
on the probability space $(\{0,1\}^{10^8}, \mathfrak{P}(\{0,1\}^{10^8}), P_{10^8})$. In this case, Chebyshev's
inequality shows
$$P_{10^8}\left(\left\{\,\omega\ \middle|\ \left|\frac{S(\omega)}{10^6} - p\right| \ge \frac{1}{200}\,\right\}\right) \le \frac{1}{100}$$
(Example 4.2). Subtracting both sides from 1, we have
$$P_{10^8}\left(\left\{\,\omega\ \middle|\ \left|\frac{S(\omega)}{10^6} - p\right| < \frac{1}{200}\,\right\}\right) \ge \frac{99}{100}. \qquad (1.15)$$
8
Namely, if Alice chooses an from {0, 1}10 , and computes the value of
S()/106 , she can get an approximate value of p with error less than 1/200
with probability at least 0.99.
Now, to give the inequality (1.15) a practical meaning, Alice should be
8
equally likely to choose an {0, 1}10 . This means that she should choose
from mainly among random numbers because they account for nearly all
8
elements of {0, 1}10 . 14 However, as we saw in Sec. 1.2, it is impossible to
choose a random number even by computer.
In most practical Monte Carlo methods, pseudorandom numbers are
used instead of random numbers. A program that produces pseudorandom
numbers, mathematically speaking, a function
$$g : \{0,1\}^l \to \{0,1\}^n, \quad l < n,$$
is called a pseudorandom generator. For practical use, l is assumed to be
small enough for Alice to be equally likely to choose $\omega' \in \{0,1\}^l$, and on the
other hand, n is assumed to be too large for her to be equally likely to choose
$\omega \in \{0,1\}^n$. Namely, l ≪ n. The program produces $g(\omega') \in \{0,1\}^n$ from the
$\omega' \in \{0,1\}^l$ that Alice has chosen. Here, $g(\omega')$ is called a pseudorandom
number, and $\omega'$ is called its seed.
For any $\omega'$, the pseudorandom number $g(\omega')$ is not a random number.
Nevertheless, it is useful in some situations. In fact, in the case of Exercise I,
there exists a suitable pseudorandom generator $g : \{0,1\}^{238} \to \{0,1\}^{10^8}$
such that an inequality
$$P_{238}\left(\left\{\,\omega'\ \middle|\ \left|\frac{S(g(\omega'))}{10^6} - p\right| < \frac{1}{200}\,\right\}\right) \ge \frac{99}{100},$$
which is similar to (1.15), holds (Fig. 1.5, the random Weyl sampling (Example 4.4)).
14 This is the reason why random number is said to be needed for the Monte Carlo
method.

[Fig. 1.5 The role of $g : \{0,1\}^{238} \to \{0,1\}^{10^8}$ (conceptual figure): g maps the small set of seeds $\{0,1\}^{238}$ into the huge set $\{0,1\}^{10^8}$, so that for most seeds $\omega'$, the value $g(\omega')$ lands in the set of ω such that $|S(\omega)/10^6 - p| < 1/200$, which contains nearly all random numbers]

Alice can choose any $\omega' \in \{0,1\}^{238}$ of her own will, or, by tossing a real
coin 238 times, she can get a seed $\omega'$ by random sampling. Thus, to solve
Exercise I, she does not need a random number; by the pseudorandom
number that g produces from her seed, she can actually get a good
approximate value of p with high probability.

Exercise I deals with a problem about coin tosses, but practical problems
to which the Monte Carlo method is applied are much more complicated.
Nevertheless, since any practical probabilistic problem can be reduced to
one about coin tosses (Sec. 1.5.2), we may assume pseudorandom numbers to
be {0,1}-sequences.

1.5 Infinite coin tosses

Borel's model of coin tosses (Example 1.2) can give a mathematical model
of not only 3 coin tosses but also arbitrarily many coin tosses. Furthermore,
the sequence of functions $\{d_i\}_{i=1}^{\infty}$ defined in Example 1.2 can be regarded
as infinite coin tosses. Of course, there do not exist infinite coin tosses in
the real world, but for several reasons, it is important to consider them.

The contents of this section slightly exceed the level of this book.

1.5.1 Borel's normal number theorem


Rational numbers are sufficient for practical computation, but for calculus
to be available, real numbers are necessary. Just like this, the first reason
why we consider infinite coin tosses is that they are useful when we analyze
the limit behavior of n coin tosses as $n \to \infty$. Indeed, the fact that the
probability space for n coin tosses varies as n varies is awkward, bad-looking,
and deficient for advanced study of probability theory.
For example, Borel's normal number theorem
$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} d_i = \frac{1}{2}\right) = 1 \qquad (1.16)$$
asserts just one of the analytic properties of the sequence of functions
$\{d_i\}_{i=1}^{\infty}$, but it is interpreted in the context of probability theory as: When
we toss a coin infinitely many times, the asymptotic limit of the relative
frequency of Heads is 1/2 with probability 1. It is known that Borel's
normal number theorem implies Bernoulli's theorem (1.13). Note that it is
not easy to grasp the exact meaning of (1.16). Intuitively, it means that
the length of the set
$$A := \left\{\,x \in [0,1)\ \middle|\ \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} d_i(x) = \frac{1}{2}\,\right\} \subset [0,1)$$
is equal to 1, but since A is not a simple set like a semi-open interval, how
to define its length and how to compute it come into question. To solve
these questions, we need measure theory.

1.5.2 Construction of Brownian motion


The second reason why we consider infinite coin tosses is that we can
construct a random variable with an arbitrary distribution from infinite coin
tosses. What is more, except in very special cases 15, any probabilistic object
can be constructed from them. As an example, we here construct from
them a Brownian motion, the most important stochastic process both in
theory and in practice.
Define a function $F : \mathbf{R} \to (0,1) := \{\,x \mid 0 < x < 1\,\}$ as follows: 16
$$F(t) := \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du, \quad t \in \mathbf{R}.$$
15 For example, construction of uncountably (cf. p. 25) many independent random variables.
16 $\int_{-\infty}^{t}$ is an abbreviation of $\lim_{R \to \infty} \int_{-R}^{t}$, which is also called an improper integral. See
Remark 3.7.

The integrand on the right-hand side is the probability density function of
the standard normal distribution (or the standard Gaussian distribution).
Since $F : \mathbf{R} \to (0,1)$ is a continuous increasing function, its inverse function
$F^{-1} : (0,1) \to \mathbf{R}$ exists. Then, putting
$$X(x) := \begin{cases} F^{-1}(x) & (0 < x < 1), \\ 0 & (x = 0), \end{cases}$$
it holds that
$$P(X < t) := P(\{\,x \in [0,1) \mid X(x) < t\,\}) = P(\{\,x \in [0,1) \mid x < F(t)\,\}) = P([0, F(t))) = F(t), \quad t \in \mathbf{R}.$$
In the context of probability theory, X is interpreted as a random
variable. 17 Accordingly, the above expression shows that the probability
that X < t is F(t); in other words, X obeys the standard normal distribution.
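A small C sketch of this construction (ours: F is evaluated with the C99 erf function and inverted by bisection, and 31 bits from rand() play the role of the coin tosses $d_1(x), \ldots, d_{31}(x)$):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Standard normal distribution function F(t). */
static double F(double t) { return 0.5 * (1.0 + erf(t / sqrt(2.0))); }

/* X(x) = F^{-1}(x), computed by bisection on [-10, 10]. */
static double X(double x) {
    double lo = -10.0, hi = 10.0;
    for (int k = 0; k < 60; k++) {
        double mid = 0.5 * (lo + hi);
        if (F(mid) < x) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main(void) {
    /* x = sum of d_i * 2^{-i}, with 31 coin tosses d_1, ..., d_31 */
    double x = 0.0;
    for (int i = 1; i <= 31; i++)
        x += ldexp((double)(rand() & 1), -i);   /* d_i * 2^{-i} */
    printf("x = %f, X(x) = %f\n", x, X(x));     /* a standard normal sample */
    return 0;
}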
Now, since $x = \sum_{i=1}^{\infty} 2^{-i} d_i(x)$, it is clear that
$$P\left(\left\{\,x \in [0,1)\ \middle|\ \sum_{i=1}^{\infty} 2^{-i} d_i(x) < t\,\right\}\right) = P([0, t)) = t, \quad t \in [0,1).$$
Any subsequence $\{d_{i_j}\}_{j=1}^{\infty}$, $1 \le i_1 < i_2 < \cdots$, is also infinite coin tosses,
and hence we have
$$P\left(\left\{\,x \in [0,1)\ \middle|\ \sum_{j=1}^{\infty} 2^{-j} d_{i_j}(x) < t\,\right\}\right) = t, \quad t \in [0,1).$$
Accordingly, we see
$$P\left(\left\{\,x \in [0,1)\ \middle|\ X\left(\sum_{j=1}^{\infty} 2^{-j} d_{i_j}(x)\right) < t\,\right\}\right) = F(t), \quad t \in \mathbf{R}.$$
Namely, $X\left(\sum_{j=1}^{\infty} 2^{-j} d_{i_j}\right)$ obeys the standard normal distribution.

17 We want to regard [0, 1) as a whole event and P as a probability measure, but [0, 1) is an
infinite set and hence this idea exceeds the level of this book. Here the quotation marks
of random variable show that this word is not rigorously defined in this book. In what
follows, independent will be used similarly.

Here is an amazing idea: if we put
$$X_1 := X(2^{-1} d_1 + 2^{-2} d_3 + 2^{-3} d_6 + 2^{-4} d_{10} + 2^{-5} d_{15} + \cdots),$$
$$X_2 := X(2^{-1} d_2 + 2^{-2} d_5 + 2^{-3} d_9 + 2^{-4} d_{14} + \cdots),$$
$$X_3 := X(2^{-1} d_4 + 2^{-2} d_8 + 2^{-3} d_{13} + \cdots),$$
$$X_4 := X(2^{-1} d_7 + 2^{-2} d_{12} + \cdots),$$
$$X_5 := X(2^{-1} d_{11} + \cdots),$$
$$\vdots$$
then each $X_n$ obeys the standard normal distribution. We emphasize that
each $d_k$ appears in only one $X_n$, which means the value of each $X_n$ does
not influence any other $X_{n'}$ $(n' \ne n)$. Namely, $\{X_n\}_{n=1}^{\infty}$ are
independent.
Now, we are in a position to define a Brownian motion $\{B_t\}_{0 \le t \le \pi}$
(Fig. 1.6):
$$B_t := \frac{t}{\sqrt{\pi}}\, X_1 + \sqrt{\frac{2}{\pi}}\, \sum_{n=1}^{\infty} \frac{\sin nt}{n}\, X_{n+1}, \quad 0 \le t \le \pi. \qquad (1.17)$$

To tell the truth, the graph of Fig. 1.6 is not exactly the Brownian
motion (1.17) itself, but its approximation 18 $\{\tilde{B}_t\}_{0 \le t \le \pi}$. Let $\{X_n\}_{n=1}^{1000}$ be
approximated by $\{\tilde{X}_n\}_{n=1}^{1000}$, where
$$\tilde{X}_1 := X(2^{-1} d_1 + 2^{-2} d_2 + 2^{-3} d_3 + \cdots + 2^{-31} d_{31}),$$
$$\tilde{X}_2 := X(2^{-1} d_{32} + 2^{-2} d_{33} + 2^{-3} d_{34} + \cdots + 2^{-31} d_{62}),$$
$$\tilde{X}_3 := X(2^{-1} d_{63} + 2^{-2} d_{64} + 2^{-3} d_{65} + \cdots + 2^{-31} d_{93}),$$
$$\tilde{X}_4 := X(2^{-1} d_{94} + 2^{-2} d_{95} + 2^{-3} d_{96} + \cdots + 2^{-31} d_{124}),$$
$$\vdots$$
$$\tilde{X}_{1000} := X(2^{-1} d_{30970} + 2^{-2} d_{30971} + 2^{-3} d_{30972} + \cdots + 2^{-31} d_{31000}),$$
and using these, define $\{\tilde{B}_t\}_{0 \le t \le \pi}$ by
$$\tilde{B}_t := \frac{t}{\sqrt{\pi}}\, \tilde{X}_1 + \sqrt{\frac{2}{\pi}}\, \sum_{n=1}^{999} \frac{\sin nt}{n}\, \tilde{X}_{n+1}, \quad 0 \le t \le \pi. \qquad (1.18)$$
Thus, Fig. 1.6 is based on (1.18) with the sample of 31,000 coin tosses
$\{d_i\}_{i=1}^{31000}$. 19

18 To be precise, approximation in the sense of distribution.
19 The sample of 31,000 coin tosses is produced by a pseudorandom generator in [Sugita
(2011)] Sec. 4.2.

[Fig. 1.6 A sample path of Brownian motion: the graph of $\tilde{B}_t$ for $0 \le t \le \pi$; horizontal axis from 0 to 3.0, vertical axis from -0.5 to 1.5]

We have constructed a Brownian motion from the infinite coin tosses
$\{d_i\}_{i=1}^{\infty}$. Do not be surprised yet. If we apply the same method to the infinite
coin tosses $d_1, d_3, d_6, d_{10}, d_{15}, \ldots$, which compose $X_1$, we can construct
a Brownian motion from them. Similarly, from the infinite coin tosses
$d_2, d_5, d_9, d_{14}, \ldots$, which compose $X_2$, we can construct another independent
Brownian motion. Repeating this procedure, we can construct infinitely many
independent Brownian motions from $\{d_i\}_{i=1}^{\infty}$.
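For completeness, a C sketch of the approximation (1.18) (ours; it assumes an array xt[] holding samples of $\tilde{X}_1, \ldots, \tilde{X}_{1000}$ produced as above, and uses the POSIX constant M_PI, which is not part of strict ISO C):

#include <math.h>

/* Partial-sum approximation (1.18) of Brownian motion at time t in [0, pi],
   given samples xt[0..999] of the independent standard normals X~_1..X~_1000. */
double B(double t, const double xt[1000]) {
    double s = t / sqrt(M_PI) * xt[0];
    for (int n = 1; n <= 999; n++)
        s += sqrt(2.0 / M_PI) * sin(n * t) / n * xt[n];
    return s;
}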

Chapter 2

Random number

What is randomness? To this question, which puzzled many scholars of
all ages and countries, many answers have been presented. For example, as
we saw in Remark 1.1, Laplace stated a determinist view in his book
[Laplace (1812)]. By Kolmogorov's axiomatization of probability theory
([Kolmogorov (1933)]), a random variable was formulated as a function
$X : \Omega \to \mathbf{R}$, and individual sample values X(ω) as well as sampling methods
were passed over in silence. Thanks to this, mathematicians were released
from the question.
In the 1950s, the situation changed when computer-aided sampling of
random variables was realized: the Monte Carlo method came into the
world. Before then, sampling of random variables had been done in mathematical
statistics, for which tables of random numbers (arrays of numbers
obtained by rolling dice, etc.) were used. However, they were useless for
large-scale sampling by computer. Consequently, people had to consider
how to make a huge table of random numbers by computer. Then, What
is randomness? again came into question.
In order to define randomness, we have to formulate the procedure of
choosing an ω from Ω. For this purpose, maturation of computation theory
was indispensable. Finally, in the 1960s, the definition of randomness was
declared by Kolmogorov, Chaitin, and Solomonoff, independently.

This chapter deals with computation theory. Rigorous treatment of the
theory needs considerably many preliminaries, and hence exceeds the level
of this book. Instead of giving detailed proofs of theorems, we explain the
meanings of the theorems by comparing them with the mechanisms of
computers.


2.1 Recursive function

The kinds of data dealt with by computers are diverse: as for input, data
from keyboard, mouse, scanner, and video camera; as for output, documents,
images, sound, movies, control signals for IT devices, etc. They are all
converted into finite {0,1}-sequences, and then they are recorded in
computer memory or on disks (Fig. 2.1), copied, or transmitted.

[Fig. 2.1 Images of CD (left) and DVD (right) by scanning electron microscope 1, annotated with the recorded bits 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0; each boundary of a flat place and a hollow place records 1, other places record 0]

Since each finite {0,1}-sequence corresponds to a non-negative integer
through the binary numeral system (Definition 2.3), every piece of input/output
data dealt with by a computer can be regarded as a non-negative integer.
Thus every action of a computer can be regarded as a function $f : \mathbf{N} \to \mathbf{N}$. 2
It may sound surprising that in the final analysis, a computer can calculate
only non-negative integers. Such extreme simplification seems to be useless,
but as a matter of fact, it often leads to a great development in mathematics.
Now, conversely, can an arbitrary function $f : \mathbf{N} \to \mathbf{N}$ be realized by an
action of a computer? The answer is No. Even a computer is not almighty.
In this section, we introduce the class of recursive functions, which was
presented as the class of computable functions, and their basic properties.

1 Source: Japan Science and Technology Agency Rika Network.

2 N := {0, 1, 2, . . .} is the set of all non-negative integers.

2.1.1 Computable function


Each action of a computer is described by a program, which, just like any
other input/output data, is a finite {0,1}-sequence, or a non-negative integer.
Therefore, by numbering all programs, we can make a one-to-one
correspondence between all computable functions and all non-negative integers
N. In general, if there is a one-to-one correspondence between a set
A and N, the set A is called a countable set, and if there is not, it is called
an uncountable set.

Proposition 2.1. The set $\{\,f \mid f : \mathbf{N} \to \{0,1\}\,\}$, i.e., the set of all functions
from N to {0,1}, is an uncountable set.

Proof. We show the proposition by contradiction. Suppose $\{\,f \mid f : \mathbf{N} \to \{0,1\}\,\}$ is a countable set and numbered as $\{f_0, f_1, \ldots\}$. Then, if we define
a function $g : \mathbf{N} \to \{0,1\}$ by
$$g(n) := 1 - f_n(n), \quad n \in \mathbf{N},$$
we have $g(n) \ne f_n(n)$ for every $n \in \mathbf{N}$, i.e., g does not belong to the
numbered set $\{f_0, f_1, \ldots\}$. This is a contradiction. Therefore $\{\,f \mid f : \mathbf{N} \to \{0,1\}\,\}$ is an uncountable set. □

[Diagram: the infinite array of values $f_m(n)$, $m, n = 0, 1, 2, \ldots$, with the diagonal entries $f_0(0), f_1(1), f_2(2), f_3(3), f_4(4), \ldots$ struck out; g is built by flipping each diagonal value]

The above proof is called the diagonal method. The most familiar uncountable
set is the set of all real numbers R; this fact and its proof
can be found in most textbooks of analysis or set theory.

The set of all functions $f : \mathbf{N} \to \mathbf{N}$ includes the uncountable set $\{\,f \mid f : \mathbf{N} \to \{0,1\}\,\}$, and hence it is uncountable. The set of all computable
functions is a countable subset of it, say $\{\varphi_0, \varphi_1, \varphi_2, \ldots\}$. Then, the set of
all incomputable functions is uncountable, because if it were countable and
numbered as $\{g_0, g_1, g_2, \ldots\}$, we would get
$$\{g_0, \varphi_0, g_1, \varphi_1, g_2, \varphi_2, \ldots\}$$
as a numbering of the set $\{\,f \mid f : \mathbf{N} \to \mathbf{N}\,\}$, which is a contradiction.

An important example of an incomputable function is the Kolmogorov
complexity (Definition 2.6, Theorem 2.7).

2.1.2 Primitive recursive function and partial recursive function

About the definition of computable function, there had been many discussions,
until a consensus was reached in the 1930s: we can compute recursive
functions (more precisely, primitive recursive functions, partial recursive
functions, and total recursive functions) and nothing else. The set of all
recursive functions coincides with the set of all functions that the Turing
machine 3 can compute. Any action of real computers can be described
by recursive functions. It is amazing that all of the diverse, complicated actions
of computers are just combinations of a small number of basic operations.
In this subsection, we introduce the definitions of primitive recursive
function, partial recursive function, and total recursive function, but we do
not develop rigorous arguments here.

First, we begin with the definition of primitive recursive function.

Definition 2.1. (Primitive recursive function)
(i) (Basic functions)
zero : N^0 → N, zero( ) := 0,
suc : N → N, suc(x) := x + 1,
proj^n_i : N^n → N, proj^n_i(x1, . . . , xn) := xi, i = 1, . . . , n
are basic functions. 4


3 A virtual computer with infinite memory. For details, see [Sipser (2012)].
4 zero( ) is a constant function that returns 0. We use N^0 as formal notation, but do not mind it. suc denotes successor, and proj^n_i is a coordinate function, an abbreviation of projection.

(ii) (Composition)
For g : N^m → N and gj : N^n → N, j = 1, . . . , m, we define f : N^n → N by
f(x1, . . . , xn) := g(g1(x1, . . . , xn), . . . , gm(x1, . . . , xn)).
This operation is called composition.
(iii) (Recursion)
For g : N^n → N and h : N^{n+2} → N, we define f : N^{n+1} → N by
f(x1, . . . , xn, 0) := g(x1, . . . , xn),
f(x1, . . . , xn, y + 1) := h(x1, . . . , xn, y, f(x1, . . . , xn, y)).
This operation is called recursion.
(iv) A function N^n → N is called a primitive recursive function if and only if it is a basic function or a function obtained from basic functions by applying finite combinations of composition and recursion.
Example 2.1. The two-variable sum add(x, y) = x + y is a primitive recursive function. Indeed, it is defined by
add(x, 0) := proj^1_1(x) = x,
add(x, y + 1) := proj^3_3(x, y, suc(add(x, y))).
The two-variable product mult(x, y) = xy is also a primitive recursive function. Indeed, it is defined by
mult(x, 0) := proj^2_2(x, zero( )) = 0,
mult(x, y + 1) := add(proj^2_1(x, y), mult(x, y)).
Using the primitive recursive function pred(x) = max{x - 1, 0} 5, defined by
pred(0) := zero( ) = 0,
pred(y + 1) := proj^2_1(y, pred(y)),
we define the two-variable difference sub(x, y) = max{x - y, 0} by
sub(x, 0) := proj^1_1(x) = x,
sub(x, y + 1) := pred(proj^3_3(x, y, sub(x, y))),
and then it is a primitive recursive function.
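To make the scheme concrete, here is a small Python sketch (ours, not the book's) that builds add, mult, pred and sub from the basic functions by composition and recursion alone; rec implements the recursion scheme of Definition 2.1 (iii) as a bounded loop:

    # A minimal sketch of primitive recursion in Python (illustration only).
    def zero():          return 0
    def suc(x):          return x + 1
    def proj(i):         return lambda *xs: xs[i - 1]   # proj^n_i

    def rec(g, h):
        """Recursion: f(xs, 0) = g(xs); f(xs, y+1) = h(xs, y, f(xs, y))."""
        def f(*args):
            *xs, y = args
            acc = g(*xs)
            for t in range(y):               # bounded loop: always halts
                acc = h(*xs, t, acc)
            return acc
        return f

    add  = rec(proj(1), lambda x, y, a: suc(a))        # add(x, y) = x + y
    mult = rec(lambda x: zero(), lambda x, y, a: add(x, a))
    pred = rec(zero, lambda y, a: y)                   # pred(0) = 0, pred(y+1) = y
    sub  = rec(proj(1), lambda x, y, a: pred(a))       # sub(x, y) = max{x - y, 0}

    assert add(3, 4) == 7 and mult(6, 7) == 42 and sub(3, 5) == 0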

Secondly, we introduce partial recursive functions. A partial function is a function g defined on a certain subset D of N^n and taking values in N. We do not explicitly write the domain of definition D of g, but write it simply as g : N^n → N. 6 A function defined on the whole set N^n is called a total function. Since N^n is a subset of N^n itself, a total function is a partial function.
5 pred is an abbreviation of predecessor. max{a, b} denotes the greater of a and b.
6 This notation applies to this chapter only. In other chapters, if we write f : E → F, the function f is defined for all elements of E.

Definition 2.2. (Partial recursive function)
(i) (Minimization, μ-operator)
For a partial function p : N^{n+1} → N, we define μy(p( · , . . . , · , y)) : N^n → N by
μy(p(x1, . . . , xn, y)) := min Ap(x1, . . . , xn) if Ap(x1, . . . , xn) ≠ ∅, and μy(p(x1, . . . , xn, y)) is not defined if Ap(x1, . . . , xn) = ∅.
Here min Ap(x1, . . . , xn) is the minimum value of the following set:
Ap(x1, . . . , xn) := { y ∈ N | p(x1, . . . , xn, z) is defined for every z such that 0 ≤ z ≤ y, and p(x1, . . . , xn, y) = 0 }.

(ii) A function N^n → N is called a partial recursive function if and only if it is a basic function or a partial function obtained from basic functions by applying finite combinations of composition, recursion, and minimization.

The computer-programming counterpart of the μ-operator is the loop, shown below.

μy(p(x1, . . . , xn, y)):
(1) y := 0.
(2) If p(x1, . . . , xn, y) = 0, then output y and halt.
(3) Increase y by 1, and go to (2).

The loop does not halt if Ap(x1, . . . , xn) = ∅; this is called an infinite loop.

If a partial recursive function f does not include the μ-operator (and hence is a primitive recursive function), or if it includes μy(p(x1, . . . , xn, y)) only when Ap(x1, . . . , xn) ≠ ∅, then f is a total function. Such an f is called a total recursive function.

Example 2.2. The following partial recursive function
f(x) := μy( add( sub(mult(y, y), x), sub(x, mult(y, y)) ) )
returns the non-negative square root of x if x is a squared integer. If x is not a squared integer, f(x) is not defined.

2.1.3 Kleene's normal form (∗) 7

Let us stop along the way. Partial recursive functions are so diverse that, at a glance, we cannot expect any general structure among them. However, in fact, there is an amazing theorem.

Theorem 2.1. (Kleene's normal form) For any partial recursive function f : N^n → N, there exist two primitive recursive functions g, p : N^{n+1} → N such that
f(x1, x2, . . . , xn) = g(x1, x2, . . . , xn, μy(p(x1, x2, . . . , xn, y))), (x1, x2, . . . , xn) ∈ N^n. (2.1)

We here only give an idea of the proof by explaining an example. The point is that if a given program has more than one loop (loops correspond to μ-operators), we can rearrange them into a single loop.

[Fig. 2.2 Flow chart I 8: after Input(x1, . . . , xn); z := 0, branch A? leads to Output(z) (no) or to procedure B (yes); after B, branch C? leads to D? (yes) or to procedure E and back to C? (no); branch D? leads back to A? (yes) or escapes to Output(z) (no). The procedures are numbered 0 to 5 at their top left.]

Let Fig. 2.2 (Flow chart I) be a flow chart to compute the function f. A?, C?, D? indicate branch conditions, and B, E are procedures without loops (i.e., calculations of primitive recursive functions) that set the value of z. This program includes the main loop A? → B → C? → D? → A?, a nested loop C? → E → C?, and an escape branch from the main loop at D?. Let us show that these loops can be rearranged into a single loop by introducing a new variable u. To do this, we put the numbers 0 to 5 at the top left of the boxes of the procedures A?, C?, D?, B, E and the output procedure, for the variable u to refer to (Fig. 2.2).

7 The subsections marked with (∗) can be skipped.
8 Flow charts I and II are slight modifications of Fig. 3 (p. 12) and Fig. 4 (p. 13) of [Takahashi (1991)], respectively.

[Fig. 2.3 Flow chart II: after Input(x1, . . . , xn); z := 0 and u := 1, a single loop repeatedly executes the block Q, which dispatches on u as follows. u = 1: test A?, and set u := 2 (yes) or u := 0 (no). u = 2: run B, and set u := 3. u = 3: test C?, and set u := 4 (yes) or u := 5 (no). u = 4: test D?, and set u := 1 (yes) or u := 0 (no). u = 5: run E, and set u := 3. The loop exits when u = 0, and then Output(z) is executed.]

Fig. 2.3 (Flow chart II) shows the rearrangement of Flow chart I. It is easy to confirm that Flow chart II computes the same function f. Let us show that f, which Flow chart II computes, can be expressed in the form (2.1). Let Q be the process consisting of all the procedures enclosed by the thick lines in Flow chart II. Define g(x1, . . . , xn, y) as the value of the output variable z produced after Q has been executed y times under the input (x1, . . . , xn), and define p(x1, . . . , xn, y) as the value of u after Q has been executed y times under the input (x1, . . . , xn). Then we see that (2.1) holds.

Kleene's normal form is reflected in the design of computers. Indeed, a piece of software called a compiler converts Flow chart I, which we can easily understand, into Flow chart II, which a computer can easily deal with. The number written at the top left of each box in Flow chart I corresponds to the address of the memory where the procedure in the box is physically stored. The new variable u corresponds to the program counter of the central processing unit (CPU), which indicates the address of the memory that the CPU is looking at.
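As a toy illustration of this rearrangement (ours, not from the book), the following Python sketch flattens a two-loop multiplication into a single loop in which the variable u plays the role of the program counter; the g and p of Theorem 2.1 are then "the value of z after y executions of the loop body" and "the value of u after y executions", respectively:

    def mult_flat(x, y):
        """x*y via one while-loop; u acts as a program counter."""
        z = 0           # output variable
        i = j = 0       # former outer/inner loop counters
        u = 1           # program counter; u == 0 means halt
        while u != 0:   # the single remaining loop
            if u == 1:              # outer test: all y rounds done?
                u = 0 if i == y else 2
            elif u == 2:            # enter inner loop
                j = 0; u = 3
            elif u == 3:            # inner test: x increments done?
                u = 4 if j == x else 5
            elif u == 4:            # leave inner loop, next outer round
                i += 1; u = 1
            elif u == 5:            # inner body
                z += 1; j += 1; u = 3
        return z

    assert mult_flat(6, 7) == 42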

2.1.4 Enumeration theorem


The set of all partial recursive functions is a countable set, which we can
actually enumerate.

Theorem 2.2. (Enumeration theorem) There exists a partial recursive function
univn : N × N^n → N
with the following property: for each partial recursive function f : N^n → N, there exists an ef ∈ N such that
univn(ef, x1, . . . , xn) = f(x1, . . . , xn), (x1, . . . , xn) ∈ N^n.

The function univn is called an enumerating function or a universal function, and ef is called a Gödel number (or an index) of f. We here give only an idea of the proof of Theorem 2.2. First, write a computer program for a given partial recursive function f, and regard it as a {0, 1}-sequence, or a non-negative integer. The Gödel number ef is such an integer. The enumerating function univn(e, x1, . . . , xn) checks whether e is a Gödel number of some partial recursive function f of n variables. If it is not, univn(e, x1, . . . , xn) is not defined. If it is, univn(e, x1, . . . , xn) reconstructs the partial recursive function f from e = ef, and finally computes f(x1, . . . , xn).

A personal computer (PC), although it has only finite memory, can be regarded as a realization of an enumerating function. On the other hand, a partial recursive function is usually realized as a single-purpose computer, such as a pocket calculator. If we install a program (= Gödel number) of a pocket calculator into a PC, it instantly acts as a pocket calculator. In this sense, an enumerating function, or a PC, is universal. Just as there is more than one type of PC, there is more than one enumerating function. Just as there is more than one program for a given function, there is more than one Gödel number for a given partial recursive function.
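A toy version of this idea can be sketched in Python (our illustration; the numbering of programs is entirely ad hoc): take the Gödel number e of a program to be the integer whose bytes are its UTF-8 source, and let univ1 decode and run it.

    def godel_number(src: str) -> int:
        """Ad hoc numbering: the integer whose bytes are the UTF-8 source."""
        return int.from_bytes(src.encode("utf-8"), "big")

    def univ1(e: int, x: int):
        """Toy universal function: decode e, run it, apply its f to x.
        It fails or loops when e encodes no program: univ1 is partial."""
        src = e.to_bytes((e.bit_length() + 7) // 8, "big").decode("utf-8")
        env = {}
        exec(src, env)              # may loop forever
        return env["f"](x)

    e = godel_number("def f(x):\n    return x * x")
    print(univ1(e, 9))              # -> 81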
In order for the enumeration theorem to hold, the notion of partial
function is essential.

Theorem 2.3. Any total function that is an extension of univn is not recursive.

Proof. We show the theorem by contradiction. 9 Suppose that there exists a total recursive function g that is an extension of univn. Then,
ψ(z, x2, . . . , xn) := g(z, z, x2, . . . , xn) + 1 (2.2)
is also a total recursive function. Therefore, using a Gödel number eψ of ψ, we can write
ψ(z, x2, . . . , xn) = univn(eψ, z, x2, . . . , xn).
Since ψ is total, so is univn(eψ, · , . . . , · ). Since g is an extension of univn, we see
ψ(z, x2, . . . , xn) = g(eψ, z, x2, . . . , xn).
Putting z = eψ in the above equality, we get
ψ(eψ, x2, . . . , xn) = g(eψ, eψ, x2, . . . , xn).
On the other hand, (2.2) implies that
ψ(eψ, x2, . . . , xn) = g(eψ, eψ, x2, . . . , xn) + 1,
which is a contradiction. □

Let us introduce an important consequence of Theorem 2.3. Define a function haltn : N × N^n → {0, 1} by
haltn(z, x1, . . . , xn) := 1 (univn(z, x1, . . . , xn) is defined),
haltn(z, x1, . . . , xn) := 0 (univn(z, x1, . . . , xn) is not defined).
9 In proving impossibilities in such theorems of computer science as this theorem or Theorem 2.7, self-referential versions of the diagonal method are used. The proof below is self-referential in that we substitute the Gödel number eψ into the first argument of ψ defined by (2.2).

haltn is a total function. Let f : N^n → N be a partial recursive function and let ef be its Gödel number. Then, haltn(ef, x1, . . . , xn) judges whether f(x1, . . . , xn) is defined or not. Since "not defined" means an infinite loop for actual computers, asking, for given f and (x1, . . . , xn), whether f(x1, . . . , xn) is defined or not is called the halting problem. Here is an important theorem.
Theorem 2.4. haltn is not a total recursive function.
Proof. Consider the function g : N × N^n → N defined by
g(z, x1, x2, . . . , xn) := univn(z, x1, x2, . . . , xn) (haltn(z, x1, . . . , xn) = 1),
g(z, x1, x2, . . . , xn) := 0 (haltn(z, x1, . . . , xn) = 0).
If haltn is a total recursive function, so is g, but this is impossible by Theorem 2.3 because g is a total extension of univn. □

Theorem 2.4 implies that there is no program that computes haltn. Namely, there is no program that judges whether an arbitrarily given program halts or falls into an infinite loop for an arbitrarily given input (x1, . . . , xn). In more familiar words, there is no program that judges whether an arbitrarily given program has a bug or not.
A little knowledge of number theory helps us feel Theorem 2.4 more familiar. Define a subset B(x) ⊂ N, x ∈ N, by
B(x) := { y ≥ x | y is a positive even integer that cannot be expressed as a sum of two prime numbers },
and a partial recursive function f : N → N by
f(x) := min B(x) (B(x) ≠ ∅); f(x) is not defined (B(x) = ∅).
Then, if halt1 were a total recursive function, there would exist a program that computes halt1(ef, 4), i.e., we would be able to know whether Goldbach's conjecture 10 is true or not. Likewise, the function halt1 would solve many other unsolved problems in number theory. This is quite unlikely.
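For illustration (ours, not the book's), here is this f as a Python program; deciding whether f(4) halts is deciding Goldbach's conjecture:

    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    def is_goldbach(y: int) -> bool:
        """Is the even number y a sum of two primes?"""
        return any(is_prime(a) and is_prime(y - a) for a in range(2, y // 2 + 1))

    def f(x: int) -> int:
        """min B(x): least even y >= max(x, 4) violating Goldbach."""
        y = max(x + (x % 2), 4)
        while is_goldbach(y):   # if the conjecture is true, this never ends
            y += 2
        return y
    # f(4) halts if and only if Goldbach's conjecture is false.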

2.2 Kolmogorov complexity and random number

In Sec. 1.2, we mentioned that a long {0, 1}-sequence is called a random number if the shortest program q producing it is almost as long as the sequence itself. In this section, we define random number exactly in terms of recursive functions.

10 A conjecture that any even integer not less than 4 can be expressed as a sum of two prime numbers.

2.2.1 Kolmogorov complexity


Each finite {0, 1}-sequence can be identified with a non-negative integer in
the following way.

Definition 2.3. Let {0, 1}* := ∪_{n∈N} {0, 1}^n. Namely, {0, 1}* is the set of all finite {0, 1}-sequences. An element of {0, 1}*, i.e., a finite {0, 1}-sequence, is called a word. In particular, the {0, 1}-sequence of length 0 is called the empty word. The canonical order in {0, 1}* is defined in the following way: for x, y ∈ {0, 1}*, if x is longer than y, then define x > y; if x and y have the same length, then define the order by regarding them as binary integers. We identify {0, 1}* with N by the canonical order. For example, the empty word = 0, (0) = 1, (1) = 2, (0, 0) = 3, (0, 1) = 4, (1, 0) = 5, (1, 1) = 6, (0, 0, 0) = 7, . . . .

Definition 2.4. For each q ∈ {0, 1}*, let L(q) ∈ N denote the n such that q ∈ {0, 1}^n, i.e., L(q) is the length of q. For q ∈ N, L(q) means the length of the word corresponding to q in the canonical order. For example, L(5) = L((1, 0)) = 2. In general, L(q) is equal to the integer part of log2(q + 1), i.e., L(q) = ⌊log2(q + 1)⌋.

L(q) is a non-decreasing function of q. In particular,
L(q) ≤ log2 q + 1, q ∈ N+. 11 (2.3)
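The canonical order and the length function are easy to program (our Python sketch): a word of length n whose binary value is v corresponds to the integer 2^n - 1 + v.

    def word_to_int(w: tuple) -> int:
        """Canonical order: () -> 0, (0) -> 1, (1) -> 2, (0,0) -> 3, ..."""
        v = int("".join(map(str, w)), 2) if w else 0
        return 2 ** len(w) - 1 + v

    def int_to_word(q: int) -> tuple:
        n = (q + 1).bit_length() - 1        # L(q) = floor(log2(q + 1))
        v = q - (2 ** n - 1)
        return tuple(int(b) for b in format(v, f"0{n}b")) if n else ()

    def L(q: int) -> int:
        return (q + 1).bit_length() - 1

    assert int_to_word(5) == (1, 0) and L(5) == 2
    assert [word_to_int(int_to_word(q)) for q in range(8)] == list(range(8))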

Now, we introduce a useful operation that makes a function of one variable from one of several variables. If xi ∈ {0, 1}^{mi}, i = 1, 2, are
x1 = (x11, x12, . . . , x1m1), x2 = (x21, x22, . . . , x2m2),
we define ⟨x1, x2⟩ ∈ {0, 1}^{2m1+2+m2} by
⟨x1, x2⟩ := (x11, x11, x12, x12, . . . , x1m1, x1m1, 0, 1, x21, x22, . . . , x2m2). (2.4)
By induction, we define
⟨x1, x2, . . . , xn⟩ := ⟨x1, ⟨x2, . . . , xn⟩⟩, n = 3, 4, . . . .
The inverse functions of u = ⟨x1, . . . , xn⟩ are written as
(u)^n_i := xi, i = 1, 2, . . . , n,
which are primitive recursive functions.
11 N+ := {1, 2, . . .} is the set of all positive integers.

For example, for (1) ∈ {0, 1}^1 and (1, 1, 0, 1, 1) ∈ {0, 1}^5, we have
⟨(1), (1, 1, 0, 1, 1)⟩ = (1, 1, 0, 1, 1, 1, 0, 1, 1) =: u ∈ {0, 1}^9.
Then, (u)^2_1 = (1) and (u)^2_2 = (1, 1, 0, 1, 1). At the same time, since ⟨(1), (1), (1)⟩ = u, we have (u)^3_1 = (u)^3_2 = (u)^3_3 = (1).

The function ⟨x1, . . . , xn⟩ is a mathematical description of a way to hand more than one parameter to a function. For example, as is seen in a description like f(2, 3), the string of letters "2, 3" is put into f( ), the parameters being divided by ",". In a computer, "2, 3" is coded as a certain word u. Since u has the information about ",", we can decode 2 and 3 from u. The delimiter "," corresponds to the "0, 1" in the definition (2.4) of ⟨x1, x2⟩.
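A small Python sketch of (2.4) (ours): each bit of x1 is doubled, the pair (0, 1) serves as the delimiter, and x2 follows untouched.

    def pair(x1: tuple, x2: tuple) -> tuple:
        """<x1, x2>: each bit of x1 doubled, delimiter (0, 1), then x2."""
        return tuple(b for a in x1 for b in (a, a)) + (0, 1) + tuple(x2)

    def unpair(u: tuple):
        """The inverse maps (u)^2_1 and (u)^2_2."""
        x1, i = [], 0
        while (u[i], u[i + 1]) != (0, 1):   # doubled bits are (0,0) or (1,1)
            x1.append(u[i]); i += 2
        return tuple(x1), tuple(u[i + 2:])

    u = pair((1,), (1, 1, 0, 1, 1))
    assert u == (1, 1, 0, 1, 1, 1, 0, 1, 1)
    assert unpair(u) == ((1,), (1, 1, 0, 1, 1))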

Definition 2.5. (Computational complexity depending on an algorithm) Let A : {0, 1}* → {0, 1}* be a partial recursive function, regarded as a function N → N. We call A an algorithm. The computational complexity of x ∈ {0, 1}* under the algorithm A is defined by
KA(x) := min{ L(q) | q ∈ {0, 1}*, A(q) = x }.
If there is no q such that A(q) = x, we set KA(x) := ∞.

In Definition 2.5, Kolmogorov named A an algorithm, but today it may be better to call it a programming language. The input q of A is then a program. Thus, KA(x) is interpreted as the length of the shortest program that computes x under the programming language A.
Since KA depends on one particular algorithm A, it is not a universal index of complexity. Now, we introduce the following theorem.

Theorem 2.5. There exists an algorithm A0 : {0, 1}* → {0, 1}* such that for any algorithm A : {0, 1}* → {0, 1}*, we can find a constant cA0A ∈ N such that
KA0(x) ≤ KA(x) + cA0A, x ∈ {0, 1}*.
A0 is called a universal algorithm.

Proof. Using an enumerating function univ1, we define an algorithm A0 by
A0(z) := univ1((z)^2_1, (z)^2_2), z ∈ {0, 1}*.
If z is not of the form z = ⟨e, q⟩, we do not define A0(z). If eA is a Gödel number of A, we have A0(⟨eA, q⟩) = univ1(eA, q) = A(q). Take an arbitrary x ∈ {0, 1}*. If there is no q such that A(q) = x, then KA(x) = ∞ and the desired inequality holds. If there is such a q, let qx be the shortest one, i.e., KA(x) = L(qx). Since A0(⟨eA, qx⟩) = x, we have
KA0(x) ≤ L(⟨eA, qx⟩).
It follows from (2.4) that
L(⟨eA, qx⟩) = L(qx) + 2L(eA) + 2 = KA(x) + 2L(eA) + 2.
Hence
KA0(x) ≤ KA(x) + 2L(eA) + 2, x ∈ {0, 1}*.
Putting cA0A := 2L(eA) + 2, the theorem holds. □

When KA0(x) ≫ cA0A, even if KA0(x) is greater than KA(x), the difference is relatively small. Therefore, for x such that KA0(x) ≫ cA0A, either KA0(x) is less than KA(x), or KA0(x) is only relatively slightly greater than KA(x). In this sense, A0 is also called an asymptotically optimal algorithm.
Let A0 and A0′ be two universal algorithms. 12 Then, putting c := max{cA0′A0, cA0A0′}, we have
|KA0(x) - KA0′(x)| < c, x ∈ {0, 1}*. (2.5)
This means that when KA0(x) or KA0′(x) is much greater than c, their difference can be ignored.
Definition 2.6. We fix a universal algorithm A0, and define
K(x) := KA0(x), x ∈ {0, 1}*.
We call K(x) the Kolmogorov complexity 13 of x.
Theorem 2.6. (i) There exists a constant c > 0 such that
K(x) ≤ n + c, x ∈ {0, 1}^n, n ∈ N+.
In particular, K : {0, 1}* → N is a total function.
(ii) If n > c′ > 0, then we have
#{ x ∈ {0, 1}^n | K(x) ≥ n - c′ } > 2^n - 2^{n-c′}.
Proof. (i) For the algorithm A(x) := proj^1_1(x) = x, we have KA(x) = n for x ∈ {0, 1}^n. Consequently, Theorem 2.5 implies K(x) ≤ n + c for some constant c > 0. (ii) The number of q such that L(q) < n - c′ is equal to 2^0 + 2^1 + ··· + 2^{n-c′-1} = 2^{n-c′} - 1, and hence the number of x ∈ {0, 1}^n such that K(x) < n - c′ is at most 2^{n-c′} - 1. From this, (ii) follows. □
12 Since there is more than one enumerating function, there is more than one universal algorithm.
13 It is also called the Kolmogorov–Chaitin complexity, algorithmic complexity, description complexity, etc.



2.2.2 Random number


For n ≫ 1, we call x ∈ {0, 1}^n a random number if K(x) ≈ n. Theorem 2.6 implies that when n ≫ 1, nearly all x ∈ {0, 1}^n are random numbers.

Remark 2.1. Since there is more than one universal algorithm, and the differences among them involve the ambiguities (2.5), the definition of random number cannot help having some ambiguity. 14

Example 2.3. The world record of the computation of π is 2,576,980,370,000 decimal digits, or approximately 8,560,543,490,000 bits (as of August 2009). Since the program that produced the record is much shorter than this, the {0, 1}-sequence of π in its binary expansion up to the 8,560,543,490,000-th digit is not a random number.

As is seen in Example 2.3, we know many x ∈ {0, 1}* that are not random. However, we know no concrete example of a random number. Indeed, the following theorem implies that there is no algorithm to judge whether a given x ∈ {0, 1}^n, n ≫ 1, is random or not.

Theorem 2.7. The Kolmogorov complexity K(x) is not a total recursive function.

Proof. Let us identify {0, 1}* with N. We show the theorem by contradiction. Suppose that K(x) is a total recursive function. Then, the function
ψ(x) := min{ z ∈ N | K(z) ≥ x }, x ∈ N,
is also a total recursive function. We see x ≤ K(ψ(x)) by definition. Considering ψ to be an algorithm, we have
Kψ(ψ(x)) = min{ L(q) | q ∈ N, ψ(q) = ψ(x) },
and consequently, Kψ(ψ(x)) ≤ L(x). Therefore, by Theorem 2.5, we know that there exists a constant c > 0 such that for any x,
x ≤ K(ψ(x)) ≤ L(x) + c. (2.6)
However, (2.6) is impossible for x ≫ 1, because L(x) ≤ log2 x + 1 ((2.3) and Proposition A.3 (ii) in the case a = 1). Thus K(x) is not a total recursive function. □
14 We can also define random infinite {0, 1}-sequences, i.e., infinite random numbers. In that case, there is no ambiguity.



K(x) is a total function, i.e., it is defined for all x ∈ {0, 1}*, but there is no program that computes it. Thus "definable" and "computable" are different concepts.
Theorem 2.7 is deeply related to Theorem 2.4. Consider the function complexity(x) defined below. In its definition, A0 is the universal algorithm that appeared in the proof of Theorem 2.5, which is assumed to be the one used in the definition of K(x).
complexity(x):
(1) l := 0.
(2) Let q be the first word of {0, 1}^l.
(3) If A0(q) = x, then output l and halt.
(4) If q is the last word of {0, 1}^l, increase l by 1 and go to (2).
(5) Assign to q the next word after it, and go to (3).
Starting from the shortest program, the function complexity(x) executes every program q to check whether it computes x or not, and if it does, complexity(x) halts with output K(x). However, this program does not necessarily halt. Indeed, for some x, it must fall into an infinite loop at step (3) before K(x) is computed, which cannot be avoided in advance because of Theorem 2.4. 15
Suppose, for example, that a large photo image of beautiful scenery is stored in a computer as a long word x. Since it is far from a random image, x is by no means a random number, i.e., K(x) ≪ L(x). This means that there is a qx ∈ {0, 1}* such that A0(qx) = x and K(x) = L(qx). Then, storing qx instead of x considerably saves computer memory. This is the principle of data compression: the raw data x is compressed into qx, and A0 expands qx back into x. Unfortunately, we cannot compute qx from x, because K(x) is not computable. In practice, alternative methods are used for data compression. 16

15 It is known that the incomputability of the halting problem and that of the Kolmogorov complexity are equivalent.
16 One such method will be used in the proof of Theorem 3.1.
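The effect is easy to observe with a practical, computable (and therefore suboptimal) compressor such as zlib standing in for the ideal compression; the following Python snippet is our illustration:

    import os, zlib

    structured = b"0123456789" * 1000      # very regular 10000-byte data
    random_ish = os.urandom(10000)         # high-entropy 10000-byte data

    print(len(zlib.compress(structured)))  # small, a few dozen bytes
    print(len(zlib.compress(random_ish)))  # about 10000 or slightly more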

2.2.3 Application: Distribution of prime numbers (∗)

Let us stop along the way, again. The Kolmogorov complexity has many applications not only in probability theory but also in other fields of mathematics ([Li and Vitányi (2008)]). We here present one of its applications in number theory. First, we look at Euclid's theorem:

Theorem 2.8. There are infinitely many prime numbers.

Proof. We show the theorem by contradiction. Suppose that there are only finitely many prime numbers, say p1, p2, . . . , pk. Then, define a primitive recursive function A : N → N by
A(⟨e1, e2, . . . , ek⟩) := p1^{e1} p2^{e2} ··· pk^{ek}.
For an arbitrary m ∈ N+, there exist e1(m), e2(m), . . . , ek(m) ∈ N such that
A(⟨e1(m), e2(m), . . . , ek(m)⟩) = m.
Consequently,
KA(m) ≤ L(⟨e1(m), . . . , ek(m)⟩) = 2L(e1(m)) + 2 + ··· + 2L(ek-1(m)) + 2 + L(ek(m)).
For each i, we have ei(m) ≤ log_{pi} m ≤ log2 m, and hence L(ei(m)) ≤ log2(log2 m + 1). This shows that
KA(m) ≤ (2k - 1) log2(log2 m + 1) + 2(k - 1).
Therefore there exists a c ∈ N+ such that
K(m) ≤ (2k - 1) log2(log2 m + 1) + 2(k - 1) + c, m ∈ N+.
If m ≫ 1 is a random number, K(m) ≈ log2 m. Thus the above inequality does not hold for large random numbers m, which is a contradiction. □

Of course, this proof is much more difficult than Euclid's well-known proof. However, by polishing it, we can unexpectedly obtain deep and interesting knowledge about the distribution of prime numbers.

Theorem 2.9. ([Hardy and Wright (1979)], Theorem 8) Let pn be the n-th smallest prime number. Then, as n → ∞, we have
pn ∼ n log n. (2.7)
Here ∼ indicates that the ratio of the two sides converges to 1.

Theorem 2.9 is a paraphrase of the well-known prime number theorem: as n → ∞,
#{ 2 ≤ p ≤ n | p is a prime number } ∼ n / log n.

In what follows, we approach Theorem 2.9 by using the Kolmogorov complexity, although the discussion below is somewhat lacking in rigor.
Let pn be a divisor of m ∈ N+. Then, we can compute m from the two integers n and m/pn. 17 Therefore, with c, c′, c″, c‴ ∈ N+ being constants independent of n and m,
K(m) ≤ L(⟨n, m/pn⟩) + c
= 2L(n) + 2 + L(m/pn) + c
≤ 2 log2 n + log2 m - log2 pn + c′. (2.8)
If m is a random number, we have K(m) ≥ log2 m - c″, and hence
log2 pn ≤ 2 log2 n + c‴,
namely,
pn ≤ 2^{c‴} n^2.
This inequality is much weaker than (2.7). Let us improve it by describing the information about n and m/pn as concisely as possible.
For x = (x1, . . . , xk) ∈ {0, 1}* and y = (y1, . . . , yl) ∈ {0, 1}*, we define their concatenation xy by
xy := (x1, . . . , xk, y1, . . . , yl).
For more than two words, we define their concatenation similarly. It is impossible to decode x and y from xy uniquely, because xy includes no delimiter. To decode them, we consider ⟨L(x), xy⟩. This time, since the length of x is known, we can decode x and y from xy uniquely. The length of the word ⟨L(x), xy⟩ is
L(⟨L(x), xy⟩) = 2L(L(x)) + 2 + L(x) + L(y),
which is much shorter than L(⟨x, y⟩) for x ≫ 1. This is again more compressed into
⟨L(L(x)), L(x)xy⟩,
whose length is estimated as
L(⟨L(L(x)), L(x)xy⟩) ≈ 2 log2 log2 log2 x + 2 + log2 log2 x + log2 x + log2 y. (2.9)
17 The function n ↦ pn is a primitive recursive function.

Substituting (2.9) into the inequality (2.8), we get the following estimate: with c, c′, c″, c‴ ∈ N+ being constants independent of n and m,
K(m) ≤ 2 log2 log2 log2 n + 2 + log2 log2 n + log2 n + log2 m - log2 pn + c.
If m is a random number, we have K(m) ≥ log2 m - c′, and hence
log2 pn ≤ 2 log2 log2 log2 n + log2 log2 n + log2 n + c″,
namely,
pn ≤ c‴ n log2 n (log2 log2 n)^2. (2.10)
Although (2.10) gives only an upper bound with an unknown constant c‴ ∈ N+, its growth rate is very close to that of (2.7). It is amazing that (2.10) is obtained only from the definition of random number and a rough estimate of the Kolmogorov complexity.

Chapter 3

Limit theorem

It is extremely important to study and understand limit theorems, because they are concrete means and expression forms to explore randomness, i.e., properties of random numbers. To explore randomness, the main purpose of the study of limit theorems is to discover as many non-trivial examples of events as possible whose probabilities are very close to 1.
Many practical applications of probability theory make use of limit theorems, for limit theorems that specify events whose probabilities are close to 1 enable us to predict the future with very high probability even in random phenomena. Therefore limit theorems are the central theme in both theoretical and practical aspects.
In this chapter, we study two of the most important limit theorems for coin tosses, i.e., Bernoulli's theorem and de Moivre–Laplace's theorem. Furthermore, we learn that these two limit theorems extend to the law of large numbers and the central limit theorem, respectively, for arbitrary sequences of independent identically distributed random variables.
Even in the forefront of probability theory, these two theorems are still being extended in many ways and in diverse situations.

3.1 Bernoulli's theorem

When we toss a coin, it comes up Heads with probability 1/2 and Tails with probability 1/2, but this does not mean that the numbers of Heads and Tails in 100 coin tosses are always exactly 50 and 50. The following is a real record of Heads (= 1) and Tails (= 0) in the author's trial of 100 coin tosses.

1110110101 1011101101 0100000011 0110101001 0101000100
0101111101 1010000000 1010100011 0100011001 1101111101


The number of 1s in the above sequence, i.e., the number of Heads in the 100 coin tosses, is 51. Another trial will produce a different sequence, but the number of Heads seldom deviates from 50 very much. In fact, when n ≫ 1, if ω ∈ {0, 1}^n is a random number, the relative frequency of 1s in ω is approximately 1/2. This follows from Theorem 3.1, as we will explain on p. 45.
This section exclusively deals with n coin tosses; recall that Pn is the uniform probability measure on {0, 1}^n, and ξi : {0, 1}^n → R, i = 1, . . . , n, are the coordinate functions defined by (1.12). Then, the random variables ξ1, . . . , ξn defined on the probability space ({0, 1}^n, P({0, 1}^n), Pn) represent n coin tosses.

Theorem 3.1. There exists a constant c ∈ N+ such that for any n ∈ N+ and any ω ∈ {0, 1}^n, putting p := Σ_{i=1}^n ξi(ω)/n, we have
K(ω) ≤ nH(p) + 4 log2 n + c.
Here H(p) is the (binary) entropy function:
H(p) := -p log2 p - (1 - p) log2(1 - p) for 0 < p < 1, and H(p) := 0 for p = 0, 1. (3.1)

[Fig. 3.1 The graph of H(p): a concave curve on 0 ≤ p ≤ 1 with maximum value 1 at p = 1/2.]



Proof. The number of y ∈ {0, 1}^n that satisfy Σ_{i=1}^n ξi(y) = np is the binomial coefficient
C(n, np) = n! / ((n - np)!(np)!).
Suppose that, among those y, the given ω is the n1-th smallest in the canonical order. Let A : {0, 1}* → {0, 1}* be an algorithm that computes from ⟨n, m, l⟩ the l-th smallest word in the canonical order among the words y ∈ {0, 1}^n that satisfy Σ_{i=1}^n ξi(y) = m. Then, we have
A(⟨n, np, n1⟩) = ω.
Consequently,
KA(ω) ≤ L(⟨n, np, n1⟩)
= 2L(n) + 2 + 2L(np) + 2 + L(n1)
≤ 4L(n) + L(n1) + 4
≤ 4L(n) + L(C(n, np)) + 4.
Using (2.3), we obtain
KA(ω) ≤ 4 log2 n + log2 C(n, np) + 9.
Now, the binomial theorem implies that
C(n, np) p^{np} (1 - p)^{n-np} ≤ Σ_{k=0}^n C(n, k) p^k (1 - p)^{n-k} = 1,
and hence
C(n, np) ≤ p^{-np} (1 - p)^{-(n-np)} = 2^{-np log2 p - (n-np) log2(1-p)} = 2^{nH(p)}. (3.2)
Then, KA(ω) is estimated as
KA(ω) ≤ nH(p) + 4 log2 n + 9. (3.3)
Finally, Theorem 2.5 implies that there exists a c ∈ N+ such that
K(ω) ≤ nH(p) + 4 log2 n + c. □



The entropy function H(p) has the following properties (Fig. 3.1).
(i) 0 ≤ H(p) ≤ 1, and H(p) takes its maximum value 1 at p = 1/2.
(ii) For 0 < ε < 1/2, we have 0 < H(1/2 + ε) = H(1/2 - ε) < 1, and the common value, written H(1/2 + ε) below, decreases as ε increases.

In the proof of Theorem 3.1, we showed that if L(ω) = n ≫ 1 and p ≠ 1/2, the word ω can be compressed into the shorter word ⟨n, np, n1⟩. This means that ω is not a random number. Taking the contrapositive, we can say that if L(ω) ≫ 1 and ω is a random number, then Σ_{i=1}^n ξi(ω)/n should be close to 1/2.
Theorem 3.1 implies Bernoulli's theorem (Theorem 3.2). 1
Theorem 3.2. For any ε > 0, it holds that
lim_{n→∞} Pn( |(ξ1 + ··· + ξn)/n - 1/2| > ε ) = 0. (3.4)
Remark 3.1. If 0 < ε < ε′ and if (3.4) holds for ε, it also holds for ε′. Namely, the smaller ε is, the stronger the assertion (3.4) becomes. Therefore "for any ε > 0" in Theorem 3.2 actually means "no matter how small ε > 0 is". Such descriptions will often be seen below.
Proof of Theorem 3.2. Theorem 3.1 and property (ii) of H(p) imply that
(ξ1(ω) + ··· + ξn(ω))/n > 1/2 + ε implies K(ω) < nH(1/2 + ε) + 4 log2 n + c,
(ξ1(ω) + ··· + ξn(ω))/n < 1/2 - ε implies K(ω) < nH(1/2 + ε) + 4 log2 n + c.
Consequently, we have
{ ω ∈ {0, 1}^n | |(ξ1(ω) + ··· + ξn(ω))/n - 1/2| > ε }
= { ω ∈ {0, 1}^n | (ξ1(ω) + ··· + ξn(ω))/n > 1/2 + ε } ∪ { ω ∈ {0, 1}^n | (ξ1(ω) + ··· + ξn(ω))/n < 1/2 - ε }
⊂ { ω ∈ {0, 1}^n | K(ω) < nH(1/2 + ε) + 4 log2 n + c }
⊂ { ω ∈ {0, 1}^n | K(ω) < ⌊nH(1/2 + ε) + 4 log2 n + c⌋ + 1 }. (3.5)
By Theorem 2.6 (ii), we have
#{ ω ∈ {0, 1}^n | |(ξ1(ω) + ··· + ξn(ω))/n - 1/2| > ε }
≤ #{ ω ∈ {0, 1}^n | K(ω) < ⌊nH(1/2 + ε) + 4 log2 n + c⌋ + 1 }
≤ 2^{⌊nH(1/2+ε) + 4 log2 n + c⌋ + 1} ≤ n^4 2^{c+1} 2^{nH(1/2+ε)}.
Dividing both sides by #{0, 1}^n = 2^n, it holds that
Pn( |(ξ1 + ··· + ξn)/n - 1/2| > ε ) ≤ n^4 2^{c+1} 2^{-n(1 - H(1/2+ε))}. (3.6)
Since 1 - H(1/2 + ε) > 0, the right-hand side of (3.6) converges to 0 as n → ∞ (see Proposition A.3 (i)). □

1 Bernoulli's theorem includes the case of unfair coin tosses (Example 3.4). Theorem 3.2 deals with a special case of it.

Let c′ ∈ N+ be a constant, and let An := { ω ∈ {0, 1}^n | K(ω) > n - c′ }. Taking complements of both sides of (3.5), if n ≫ 1, we have
An ⊂ { ω ∈ {0, 1}^n | K(ω) ≥ ⌊nH(1/2 + ε) + 4 log2 n + c⌋ + 1 }
⊂ { ω ∈ {0, 1}^n | |(ξ1(ω) + ··· + ξn(ω))/n - 1/2| ≤ ε } =: Bn.
An is considered to be the set of random numbers of length n, and it is included in an event Bn, of probability close to 1, that Theorem 3.2 specifies. Thus, in this case, Fig. 1.3 looks like Fig. 3.2.
Here we derived a limit theorem (Theorem 3.2) directly from a property of random numbers (Theorem 3.1). In general, such a derivation is difficult, and it is more usual to derive properties of randomness indirectly from limit theorems (Sec. 1.3.1).

Remark 3.2. Since Theorem 2.6 (ii) is also valid for the computational complexity KA depending on A, according to (3.3), the constant c of (3.6) can be taken to be 9. Then, putting c = 9, ε = 1/2000 and n = 10^8, (3.6) becomes
P_{10^8}( |(ξ1 + ··· + ξ_{10^8})/10^8 - 1/2| > 1/2000 ) ≤ 1.9502 × 10^13. (3.7)
Since the left-hand side is less than 1, this is unfortunately a meaningless inequality.
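The number on the right-hand side of (3.7) can be reproduced directly (our Python check, with H as in (3.1)):

    from math import log2

    def H(p):                       # the binary entropy function (3.1)
        return -p * log2(p) - (1 - p) * log2(1 - p)

    n, eps, c = 10**8, 1 / 2000, 9
    bound = n**4 * 2**(c + 1) * 2.0**(-n * (1 - H(0.5 + eps)))
    print(f"{bound:.4e}")           # -> about 1.9502e+13, as in (3.7)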

In the following sections, the inequality (3.7) will be much improved by the strong power of calculus. See (3.15), (3.22) and Example 3.10.

[Fig. 3.2 Theorem 3.2 and random numbers (conceptual figure): within the set of all {0, 1}-sequences, the set of random numbers is contained in the event of probability close to 1 that Theorem 3.2 specifies.]

3.2 Law of large numbers

The assertion of Bernoulli's theorem is valid not only for coin tosses but also for more general sequences of random variables. Such extensions of Bernoulli's theorem are called the law of large numbers. In this book, we present the most basic case: the law of large numbers for sequences of independent identically distributed random variables.
Although it is called a law, it is of course a mathematical theorem. 2 The law of large numbers, which entered the stage when probability was not yet recognized in mathematics, was considered to be a natural law like the law of inertia.
It took more than 200 years after Bernoulli's age for probability to be recognized in mathematics. As the 6th of his 23 problems, Hilbert proposed the problem of an axiomatic treatment of probability, with limit theorems for the foundation of statistical physics, at the Paris conference of the International Congress of Mathematicians in 1900. Probability has been recognized widely in mathematics since Borel presented the normal number theorem (1.16) in 1909. Finally, the axiomatization was accomplished by [Kolmogorov (1933)].
2 Poisson named it the law of large numbers.

3.2.1 Sequence of independent random variables


Let (Ω, P(Ω), P) be a probability space. For two events A, B ∈ P(Ω), if P(B) > 0, we define
P(A|B) := P(A ∩ B) / P(B).
P(A|B) is called the conditional probability of A given B. One of the following three holds:
(i) P(A|B) > P(A), (ii) P(A|B) = P(A), (iii) P(A|B) < P(A).
(i) is interpreted as "A is more likely to occur conditional on B", (ii) as "the likelihood of the occurrence of A does not change conditional on B", and (iii) as "A is less likely to occur conditional on B".
Let us say A is independent of B if (ii) holds. If P(B) > 0, (ii) is equivalent to P(A ∩ B) = P(A)P(B); in addition, if P(A) > 0, then it is also equivalent to P(B|A) = P(B), i.e., B is independent of A. Thus independence of two events is symmetric. Making the symmetry clear, and including the cases P(A) = 0 or P(B) = 0, we say A and B are independent if
P(A ∩ B) = P(A)P(B).
In general, independence of n events is defined as follows.

Definition 3.1. We say events A1, . . . , An are independent if, for any subset I ⊂ {1, . . . , n}, I ≠ ∅,
P( ∩_{i∈I} Ai ) = Π_{i∈I} P(Ai)
holds. Here ∩_{i∈I} Ai denotes the intersection of the Ai over all i ∈ I, and Π_{i∈I} P(Ai) denotes the product of the P(Ai) over all i ∈ I (Sec. A.1.2).

Remark 3.3. On the probability space ({0, 1}^2, P({0, 1}^2), P2), consider the following three events:
A := {(0, 0), (0, 1)}, B := {(0, 1), (1, 1)}, C := {(0, 0), (1, 1)}.
Then, since
P2(A ∩ B) = P2(A)P2(B) = 1/4,
P2(B ∩ C) = P2(B)P2(C) = 1/4,
P2(C ∩ A) = P2(C)P2(A) = 1/4,
any two out of the three are independent. However, since
P2(A ∩ B ∩ C) = P2(∅) = 0, P2(A)P2(B)P2(C) = 1/8,
A, B, C are not independent.

Definition 3.2. (i) If random variables X and Y (not necessarily defined on the same probability space) have the same distribution, we say they are identically distributed. If random variables X1, . . . , Xn and Y1, . . . , Yn (not necessarily defined on the same probability space) have the same joint distribution, we say they are identically distributed.
(ii) We say random variables X1, . . . , Xn are independent if for any ci < di, i = 1, . . . , n, the events
{ ci < Xi ≤ di }, i = 1, . . . , n
are independent.

Events A1, . . . , An ∈ P(Ω) are independent if and only if their indicator functions 1A1, . . . , 1An (Sec. A.1.1) are independent as random variables.

Proposition 3.1. Let X1, . . . , Xn be random variables, and let { aij | j = 1, . . . , si } be the range of each Xi, i = 1, . . . , n. The following two are equivalent.
(i) X1, . . . , Xn are independent.
(ii) It holds that
P( Xi = aiji, i = 1, . . . , n ) = Π_{i=1}^n P( Xi = aiji )
for any ji = 1, . . . , si, i = 1, . . . , n.

Proof. (i) implies (ii): If we take ci and di so that {ci < Xi ≤ di} = {Xi = aiji} holds, (i) implies that the events {Xi = aiji}, i = 1, . . . , n, are independent, i.e., (ii) follows.
(ii) implies (i): We have
P(c1 < X1 ≤ d1) = Σ_{j1: c1 < a1j1 ≤ d1} P(X1 = a1j1),
where Σ_{j1: c1 < a1j1 ≤ d1} denotes the sum over all j1 such that c1 < a1j1 ≤ d1 (Sec. A.1.2). Similarly,
P(ci < Xi ≤ di, i = 1, . . . , n)
= Σ_{j1: c1 < a1j1 ≤ d1} ··· Σ_{jn: cn < anjn ≤ dn} P(Xi = aiji, i = 1, . . . , n)
= Σ_{j1: c1 < a1j1 ≤ d1} ··· Σ_{jn: cn < anjn ≤ dn} Π_{i=1}^n P(Xi = aiji).
Resolving this into factors (Example A.2), we obtain
P(ci < Xi ≤ di, i = 1, . . . , n) = Π_{i=1}^n Σ_{ji: ci < aiji ≤ di} P(Xi = aiji) = Π_{i=1}^n P(ci < Xi ≤ di).
Take any subset I ⊂ {1, 2, . . . , n}, I ≠ ∅. For each k ∉ I, if we put ck := min{akj | j = 1, 2, . . . , sk} - 1 and dk := max{akj | j = 1, 2, . . . , sk}, we have
P(ck < Xk ≤ dk) = 1, k ∉ I,
and hence
P(ci < Xi ≤ di, i ∈ I) = P(ci < Xi ≤ di, i = 1, 2, . . . , n) = Π_{i=1}^n P(ci < Xi ≤ di) = Π_{i∈I} P(ci < Xi ≤ di). □

Example 3.1. The coordinate functions ξ1, . . . , ξn defined on the probability space ({0, 1}^n, P({0, 1}^n), Pn) are independent identically distributed (abbreviated as i.i.d.) random variables (Example 1.4).

Proposition 3.1 (ii) shows that if X1, . . . , Xn are independent, their joint distribution can be constructed from their marginal distributions. Suppose that for each i = 1, . . . , n, a random variable X̃i : Ωi → R is defined on a probability space (Ωi, P(Ωi), μi). Then, on a suitable probability space (Ω, P(Ω), μ), we can construct independent random variables X1, . . . , Xn so that Xi and X̃i are identically distributed for each i = 1, . . . , n. For example,
Ω := Ω1 × ··· × Ωn,
μ({ω}) := Π_{i=1}^n μi({ωi}), ω = (ω1, . . . , ωn) ∈ Ω,
Xi(ω) := X̃i(ωi), ω = (ω1, . . . , ωn) ∈ Ω, i = 1, . . . , n.
Here the probability measure μ is written as
μ = μ1 ⊗ ··· ⊗ μn,
and is called the product probability measure of μ1, . . . , μn.

Example 3.2. Let us construct a mathematical model of general coin tosses, assuming the probabilities of Heads (= 1) and Tails (= 0) are 0 < p < 1 and 1 - p, respectively. If p ≠ 1/2, it is a model of unfair coin tosses.
For each 1 ≤ i ≤ n, define a probability space (Ωi, P(Ωi), μi) by
Ωi := {0, 1}, μi({1}) := p, μi({0}) = 1 - p.
Then, we construct a probability space (Ω, P(Ω), Pn^(p)), where
Ω := Ω1 × ··· × Ωn = {0, 1}^n,
Pn^(p) := μ1 ⊗ ··· ⊗ μn.
On this probability space, the coordinate functions {ξi}_{i=1}^n defined by (1.12) represent n coin tosses, which are unfair if p ≠ 1/2. Namely, {ξi}_{i=1}^n are i.i.d. with
Pn^(p)( ξi = 1 ) = p, Pn^(p)( ξi = 0 ) = 1 - p, i = 1, . . . , n.
The product probability measure Pn^(p) is given by
Pn^(p)({ω}) := p^{ω1+···+ωn} (1 - p)^{n-(ω1+···+ωn)}, ω = (ω1, . . . , ωn) ∈ {0, 1}^n.
In particular, Pn^(1/2) = Pn.

Definition 3.3. Let (Ω, P(Ω), P) be a probability space, and let X : Ω → R be a random variable defined on it. We define the mean (or the expectation) E[X] of X and the variance V[X] of X by
E[X] := Σ_{ω∈Ω} X(ω) P({ω}),
V[X] := E[(X - E[X])^2] = Σ_{ω∈Ω} (X(ω) - E[X])^2 P({ω}).


Proposition 3.2. Let X be a random variable, and let {a1, a2, . . . , as} be the range of X. Then, we have
E[X] = Σ_{i=1}^s ai P(X = ai),
V[X] = Σ_{i=1}^s (ai - E[X])^2 P(X = ai).
In particular, for an event A ∈ P(Ω), we have E[1A] = P(A).
Proof.
E[X] = Σ_{ω∈Ω} X(ω) P({ω})
= Σ_{i=1}^s Σ_{ω: X(ω)=ai} X(ω) P({ω}) (Sec. A.1.2)
= Σ_{i=1}^s Σ_{ω: X(ω)=ai} ai P({ω})
= Σ_{i=1}^s ai Σ_{ω: X(ω)=ai} P({ω})
= Σ_{i=1}^s ai P(X = ai).
The proof is similar for V[X]. □

By Proposition 3.2, the mean and the variance of a random variable are determined by its distribution. Therefore, if X and Y are identically distributed, their means coincide and their variances coincide.
Proposition 3.3. Let (Ω, P(Ω), P) be a probability space. For any random variables X1, . . . , Xn : Ω → R and any constants c1, . . . , cn ∈ R, it holds that
E[c1X1 + ··· + cnXn] = c1E[X1] + ··· + cnE[Xn]. (3.8)
Proof.
E[c1X1 + ··· + cnXn] = Σ_{ω∈Ω} (c1X1(ω) + ··· + cnXn(ω)) P({ω})
= c1 Σ_{ω∈Ω} X1(ω) P({ω}) + ··· + cn Σ_{ω∈Ω} Xn(ω) P({ω})
= c1E[X1] + ··· + cnE[Xn]. □

Proposition 3.4. The variance of a random variable X can be calculated as follows:
V[X] = E[X^2] - E[X]^2.
Proof. By Proposition 3.3, we have
V[X] = E[X^2 - 2XE[X] + E[X]^2]
= E[X^2] - E[2XE[X]] + E[E[X]^2]
= E[X^2] - 2E[X]E[X] + E[X]^2
= E[X^2] - E[X]^2. □

Proposition 3.5. Let (Ω, P(Ω), P) be a probability space. For any independent random variables X1, . . . , Xn : Ω → R and any constants c1, . . . , cn ∈ R, it holds that
E[X1 ··· Xn] = E[X1] ··· E[Xn], (3.9)
V[c1X1 + ··· + cnXn] = c1^2 V[X1] + ··· + cn^2 V[Xn]. (3.10)

Proof. For i = 1, . . . , n, let {aij | j = 1, . . . , si} be the range of Xi. By Proposition 3.2, we have
E[X1 ··· Xn] = Σ_{j1=1}^{s1} ··· Σ_{jn=1}^{sn} a1j1 ··· anjn P(X1 = a1j1, . . . , Xn = anjn)
= Σ_{j1=1}^{s1} ··· Σ_{jn=1}^{sn} a1j1 ··· anjn P(X1 = a1j1) ··· P(Xn = anjn)
= Σ_{j1=1}^{s1} a1j1 P(X1 = a1j1) ··· Σ_{jn=1}^{sn} anjn P(Xn = anjn)
= E[X1] ··· E[Xn],
thus (3.9) is proved. For the variance, we have
V[c1X1 + ··· + cnXn]
= E[ (c1X1 + ··· + cnXn - E[c1X1 + ··· + cnXn])^2 ]
= E[ (c1(X1 - E[X1]) + ··· + cn(Xn - E[Xn]))^2 ]
= Σ_{i=1}^n ci^2 E[(Xi - E[Xi])^2] + Σ_{i≠j} ci cj E[(Xi - E[Xi])(Xj - E[Xj])]
= Σ_{i=1}^n ci^2 V[Xi] + Σ_{i≠j} ci cj E[(Xi - E[Xi])(Xj - E[Xj])].
Since Xi and Xj are independent if i ≠ j, so are Xi - E[Xi] and Xj - E[Xj]. Hence
E[(Xi - E[Xi])(Xj - E[Xj])] = E[Xi - E[Xi]] E[Xj - E[Xj]] = 0 · 0 = 0,
from which (3.10) follows. □

Remark 3.4. In order for (3.10) to hold, the assumption that X1, . . . , Xn are independent is somewhat more than needed; the assumption that Xi and Xj are independent whenever i ≠ j (pairwise independence) suffices. For example, in Remark 3.3, the indicator functions 1A, 1B and 1C of the events A, B, and C are not independent, but they are pairwise independent.

3.2.2 Chebyshev's inequality


In general, for any non-negative random variable X ≥ 0, the following inequality holds:
P(X ≥ a) ≤ E[X]/a, a > 0. (3.11)
This is called Markov's inequality. Indeed, if {a1, a2, . . . , as}, ai ≥ 0, is the range of X, Proposition 3.2 implies that
E[X] = Σ_{i=1}^s ai P(X = ai)
= Σ_{i: 0 ≤ ai < a} ai P(X = ai) + Σ_{i: ai ≥ a} ai P(X = ai)
≥ Σ_{i: ai ≥ a} ai P(X = ai)
≥ a Σ_{i: ai ≥ a} P(X = ai) = a P(X ≥ a).

Dividing both sides by a, we obtain (3.11).

Lemma 3.1. The following inequality holds:
P( |X - E[X]| ≥ ε ) ≤ V[X]/ε^2, ε > 0.
This is called Chebyshev's inequality.

Proof. Applying Markov's inequality to Y := (X - E[X])^2, we obtain
P( |X - E[X]| ≥ ε ) = P(Y ≥ ε^2) ≤ E[Y]/ε^2 = V[X]/ε^2. □

Suppose that for each n ∈ N+, a probability space (Ωn, P(Ωn), μn) and i.i.d. random variables Xn,1, Xn,2, . . . , Xn,n are given. Suppose further that their common mean and variance satisfy
E[Xn,k] = mn, V[Xn,k] = σn^2 ≤ σ^2, k = 1, . . . , n,
where σ^2 is a constant that does not depend on n. Then, Proposition 3.3 and Proposition 3.5 imply that
E[ (Xn,1 + ··· + Xn,n)/n ] = mn, (3.12)
V[ (Xn,1 + ··· + Xn,n)/n ] = σn^2/n ≤ σ^2/n. (3.13)
Applying Chebyshev's inequality (Lemma 3.1), we have
μn( |(Xn,1 + ··· + Xn,n)/n - mn| ≥ ε ) ≤ σ^2/(nε^2), ε > 0.
From this, the next theorem immediately follows.

Theorem 3.3. For any ε > 0, we have
lim_{n→∞} μn( |(Xn,1 + ··· + Xn,n)/n - mn| ≥ ε ) = 0.
Theorems, like Theorem 3.3, asserting that the distribution of the arithmetic mean of n identically distributed random variables concentrates at their common mean as n → ∞ are called the law of large numbers. 3

3 More exactly, Theorem 3.3 is called the weak law of large numbers, to distinguish it from the strong law of large numbers. The prototype of the latter is Borel's normal number theorem (1.16).

Example 3.3. Applying Theorem 3.3 to the case where
(Ωn, P(Ωn), μn) = ({0, 1}^n, P({0, 1}^n), Pn)
and
Xn,k(ω) := ξk(ω) = ωk, ω = (ω1, . . . , ωn) ∈ {0, 1}^n, k = 1, . . . , n,
we obtain Theorem 3.2. In this case, Chebyshev's inequality shows
Pn( |(ξ1 + ··· + ξn)/n - 1/2| ≥ ε ) ≤ 1/(4nε^2). (3.14)
In particular, putting ε = 1/2000 and n = 10^8, we obtain
P_{10^8}( |(ξ1 + ··· + ξ_{10^8})/10^8 - 1/2| ≥ 1/2000 ) ≤ 1/100. (3.15)
Compared with the meaningless (3.7), the inequality (3.15) has a practical meaning.

Example 3.4. Let 0 < p < 1. In Example 3.2, we introduced the probability space ({0, 1}^n, P({0, 1}^n), Pn^(p)) and the coordinate functions {ξk}_{k=1}^n as n coin tosses, which are unfair if p ≠ 1/2. Since we have
E[ξk] = p,
and Proposition 3.4 implies
V[ξk] = E[ξk^2] - E[ξk]^2 = p - p^2 = p(1 - p),
it follows from Theorem 3.3 that
lim_{n→∞} Pn^(p)( |(ξ1 + ··· + ξn)/n - p| ≥ ε ) = 0, ε > 0.

As we can imagine from Theorem 2.6 (ii), Theorem 3.1 and Example 3.4, it is known that for n ≫ 1, the probability that K(ω) ≈ nH(p), ω ∈ {0, 1}^n, is close to 1 under the probability measure Pn^(p):
lim_{n→∞} Pn^(p)( { ω ∈ {0, 1}^n | |K(ω)/n - H(p)| ≥ ε } ) = 0, ε > 0.
If H(p) is not extremely small, then for n ≫ 1, nH(p) is so huge a number that even a computer cannot produce an ω ∈ {0, 1}^n with K(ω) ≈ nH(p). Strictly speaking, such an ω is not a random number, but we may say it is random in the usual sense. Consequently, a long {0, 1}-sequence generated by unfair coin tosses may well look random with high probability.

Example 3.5. On the probability space ({0, 1}^{2n}, P({0, 1}^{2n}), P2n), let us consider the following random variables:
Xk(ω) := ξ_{2k-1}(ω) ξ_{2k}(ω), ω ∈ {0, 1}^{2n}, k = 1, . . . , n.
Here {ξk}_{k=1}^{2n} are the coordinate functions. Then, {Xk}_{k=1}^n are independent, and we have
P2n(Xk = 1) = 1/4, P2n(Xk = 0) = 3/4, k = 1, . . . , n.
Namely, {Xk}_{k=1}^n are n unfair coin tosses with p = 1/4. Therefore, by Example 3.4, we see
lim_{n→∞} P2n( |(X1 + ··· + Xn)/n - 1/4| ≥ ε ) = 0, ε > 0.
This means that in fair (p = 1/2) coin tosses, the relative frequency of occurrences of "Heads, Heads" (= 11) over the successive non-overlapping pairs approaches 1/4 as n → ∞ with high probability.
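A quick simulation (ours) illustrates Example 3.5: generate 2n fair coin tosses and observe that the relative frequency of the pair 11 among the n non-overlapping pairs is near 1/4.

    import random

    n = 10**6
    toss = [random.randint(0, 1) for _ in range(2 * n)]          # fair tosses
    pairs11 = sum(toss[2*k] * toss[2*k + 1] for k in range(n))   # X_1 + ... + X_n
    print(pairs11 / n)    # close to 0.25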

According to Kolmogorov's axioms of probability theory, if a triplet (Ω, P(Ω), P) satisfies the required mathematical conditions of a probability space, we call P a probability measure, and P(A) the probability that A occurs, no matter whether it is related to a random phenomenon or not. Before his axiomatization, the question "What is probability?" had been very serious. Among the answers, many people supported the following definition. Let A be an event that is obtained as an outcome of a certain trial. Repeating the trial many times, if the relative frequency of the occurrences of A is considered to converge to a constant p, we define the probability of A to be this value p. This is called the empirical probability. The law of large numbers gave a mathematical ground to the idea of the empirical probability.

3.2.3 Cramér–Chernoff's inequality

Bernoulli's theorem (Theorem 3.2) was proved via the inequalities (3.6) and (3.14). The former is very loose for small n but converges to 0 exponentially fast as n → ∞, while the latter gives a better estimate than the former for small n but converges to 0 only as fast as 1/n as n → ∞, and hence gives a worse estimate than the former for n ≫ 1. In this subsection, we prove an inequality (3.21) that improves both (3.6) and (3.14).

Applying Markov's inequality (3.11) to the exponential function e^{tX} of a random variable X, we have, for any t > 0,
P(X ≥ x) = P(e^{tX} ≥ e^{tx}) ≤ E[e^{tX}]/e^{tx} = MX(t) e^{-tx}, x ∈ R. (3.16)
Here
MX(t) := E[e^{tX}], t ∈ R,
is called the moment generating function of X. Since P(X ≥ x) ≤ MX(t) e^{-tx} clearly holds for t = 0, (3.16) is most improved if we select t ≥ 0 so that the right-hand side takes its minimum value. Namely, the left-hand side of (3.16) is bounded from above by the minimum value of the function MX(t) e^{-tx} over all t ≥ 0:
P(X ≥ x) ≤ min_{t≥0} MX(t) e^{-tx}, x ∈ R. (3.17)
We call this inequality Cramér–Chernoff's inequality.

The moment generating function originated with [Laplace (1812)]. The k-th derivative of MX(t) is
MX^(k)(t) := (d^k/dt^k) E[e^{tX}] = E[(d^k/dt^k) e^{tX}] = E[X^k e^{tX}].
Putting t = 0, we obtain
MX^(k)(0) = E[X^k], k ∈ N+. (3.18)
E[X^k] is called the k-th moment of X. Thus, as its name indicates, MX(t) generates every moment of X.

Let us apply Cramér–Chernoff's inequality to the sum of i.i.d. random variables.

Theorem 3.4. For i.i.d. random variables X1, . . . , Xn defined on a probability space (Ω, P(Ω), P), it holds that
P(X1 + ··· + Xn ≥ nx) ≤ exp(-nI(x)), x ∈ R. (3.19)
Here
I(x) := max_{t≥0} (tx - log MX1(t)), 4 (3.20)
where max_{t≥0} u(t) is the maximum value of u(t) over all t ≥ 0.

4 Since the function I(x) in (3.20) characterizes the rate of decay of (3.19), it is called the rate function. Deriving a rate function from a moment generating function by the formula (3.20) is called the Legendre transformation, which plays important roles in convex analysis and analytical mechanics.

Proof. By Cramér–Chernoff's inequality,
P(X1 + ··· + Xn ≥ nx) ≤ min_{t≥0} E[e^{t(X1+···+Xn)}] e^{-tnx}
= min_{t≥0} E[e^{tX1} ··· e^{tXn}] e^{-tnx}
= min_{t≥0} E[e^{tX1}] ··· E[e^{tXn}] e^{-tnx}
= min_{t≥0} MX1(t)^n e^{-tnx}
= min_{t≥0} exp(-n(tx - log MX1(t)))
= exp(-n max_{t≥0} (tx - log MX1(t))). □

Let us apply Theorem 3.4 to the sum of n coin tosses ξ1 + ··· + ξn. Since
Mξ1(t) = e^{t·0} Pn(ξ1 = 0) + e^{t·1} Pn(ξ1 = 1) = (1 + e^t)/2,
to calculate I(x) for a fixed x, we have to find the maximum value of
g(t) := tx - log((1 + e^t)/2), t ≥ 0.
Here let us assume
1/2 < x < 1.
We first note that g(0) = 0 and lim_{t→∞} g(t) = -∞. By differentiating g(t), we determine its increase and decrease. Solving
g′(t) = x - e^t/(1 + e^t) = 0,
we get
e^t = x/(1 - x), t = log(x/(1 - x)) > 0.
Since g″(s) = -e^s/(1 + e^s)^2 < 0, g is a concave function, and hence the above solution t gives the maximum value of g. Thus
I(x) = g(log(x/(1 - x))) = x log x + (1 - x) log(1 - x) + log 2.
Here the entropy function H(p) appears, i.e.,
I(x) = -H(x) log 2 + log 2.

Therefore (3.19) now reads
Pn(ξ1 + ··· + ξn ≥ nx) ≤ 2^{-n(1-H(x))}.
In particular, putting x = 1/2 + ε, 0 < ε < 1/2, we see
Pn( (ξ1 + ··· + ξn)/n ≥ 1/2 + ε ) ≤ 2^{-n(1-H(1/2+ε))}.
Exchanging the roles of 1 and 0 in the coin tosses, we have
Pn( (ξ1 + ··· + ξn)/n ≤ 1/2 - ε ) = Pn( (ξ1 + ··· + ξn)/n ≥ 1/2 + ε ).
Consequently, we finally obtain
Pn( |(ξ1 + ··· + ξn)/n - 1/2| ≥ ε ) ≤ 2 · 2^{-n(1-H(1/2+ε))}. (3.21)
This is an improvement of both (3.6) and (3.14).

Example 3.6. Putting ε = 1/2000 and n = 10^8 in (3.21), we obtain
P_{10^8}( |(ξ1 + ··· + ξ_{10^8})/10^8 - 1/2| ≥ 1/2000 ) ≤ 3.85747 × 10^{-22}. (3.22)
This inequality is much more precise than Chebyshev's inequality (3.15).
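The three bounds for the same event can be compared numerically (our Python check, with H as in (3.1)):

    from math import log2

    def H(p):                       # the binary entropy function (3.1)
        return -p * log2(p) - (1 - p) * log2(1 - p)

    n, eps = 10**8, 1 / 2000
    kolmogorov = n**4 * 2**10 * 2.0**(-n * (1 - H(0.5 + eps)))   # (3.7)
    chebyshev  = 1 / (4 * n * eps**2)                            # (3.15)
    chernoff   = 2 * 2.0**(-n * (1 - H(0.5 + eps)))              # (3.22)
    print(f"{kolmogorov:.4e}  {chebyshev:.4e}  {chernoff:.4e}")
    # -> about 1.9502e+13, 1.0000e-02, 3.8575e-22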

Remark 3.5. Note that min_{t≥0} u(t) or max_{t≥0} u(t) does not always exist. For example, the former does not exist for u(t) = 1/(1 + t). In this book, we deal only with cases where they exist.

3.3 De Moivre–Laplace's theorem

The inequality (3.21), derived from Cramér–Chernoff's inequality, can be improved still further. The ultimate form (Example 3.10) is given by de Moivre–Laplace's theorem.

3.3.1 Binomial distribution

Theorem 3.5. (Binomial distribution) We use the setup of general coin tosses introduced in Example 3.2. Then, we have
Pn^(p)( ξ1 + ··· + ξn = k ) = C(n, k) p^k (1 - p)^{n-k}, k = 0, 1, . . . , n, (3.23)
and in particular, if p = 1/2,
Pn( ξ1 + ··· + ξn = k ) = C(n, k) 2^{-n}, k = 0, 1, . . . , n.

Proof. We prove the theorem using the moment generating function of ξ1 + ··· + ξn. For each t ∈ R,
E[e^{t(ξ1+···+ξn)}] = E[e^{tξ1} ··· e^{tξn}]
= E[e^{tξ1}] ··· E[e^{tξn}]
= (E[e^{tξ1}])^n
= (e^{t·0} Pn^(p)(ξ1 = 0) + e^{t·1} Pn^(p)(ξ1 = 1))^n
= ((1 - p) + e^t p)^n
= Σ_{k=0}^n C(n, k) (e^t p)^k (1 - p)^{n-k}
= Σ_{k=0}^n e^{tk} C(n, k) p^k (1 - p)^{n-k}.
On the other hand,
E[e^{t(ξ1+···+ξn)}] = Σ_{k=0}^n e^{tk} Pn^(p)( ξ1 + ··· + ξn = k ).
Comparing the coefficients of e^{tk} in the two expressions, we obtain (3.23). □

Figure 3.3 shows the distribution of ξ1 + ··· + ξn (n = 30, 100) for p = 1/2. The rectangles (columns) composing the histogram are
[k - 1/2, k + 1/2] × [0, C(n, k) 2^{-n}], k = 0, 1, . . . , n
(Example A.1). The area of the k-th rectangle is Pn( ξ1 + ··· + ξn = k ) = C(n, k) 2^{-n}, and the sum of the areas of all rectangles is 1.

3.3.2 Heuristic observation


Let p = 1/2. Then, we have E[ξi] = 1/2 and V[ξi] = 1/4, and hence (3.12) and (3.13) imply that
E[ξ1 + ··· + ξn] = n/2, V[ξ1 + ··· + ξn] = n/4.
From these, it follows that
Yn(α) := (ξ1 - 1/2 + ··· + ξn - 1/2)/n^α, α > 0

[Fig. 3.3 The histogram of the binomial distribution (Left: n = 30, Right: n = 100)]

satisfies E[Yn(α)] = 0 and V[Yn(α)] = 1/(4n^{2α-1}). In particular, if α > 1/2, we have lim_{n→∞} V[Yn(α)] = 0, and hence, by Chebyshev's inequality,
Pn( |Yn(α)| ≥ ε ) ≤ 1/(4n^{2α-1}ε^2) → 0, n → ∞.
When α = 1, this is Bernoulli's theorem.
Now, what happens if α = 1/2? Modifying Yn(1/2) a little, we define
Zn := (ξ1 - 1/2 + ··· + ξn - 1/2) / (√n / 2). (3.24)
Then, we see E[Zn] = 0 and V[Zn] = 1. In general, for a random variable X, the mean and the variance of
(X - E[X]) / √V[X]
are 0 and 1, respectively. This is called the standardization of X.
The histogram of the distribution of Zn seems to converge to a smooth function as n → ∞ (Fig. 3.4). Sacrificing rigor a little, let us find the limit function heuristically.
For x ∈ R satisfying
x = (r - n/2) / (√n / 2) (3.25)

[Fig. 3.4 The histogram of the distribution of Zn (Left: n = 30, Right: n = 100)]

with some r = 0, 1, . . . , n, the probability that Zn = x is given by
Pn(Zn = x) = C(n, r) 2^{-n}.
The area of the column of the histogram standing on x is the probability Pn(Zn = x), and its width is 2/√n, and hence its height fn(x) is given by
fn(x) = Pn(Zn = x) / (2/√n) = C(n, r) 2^{-n} · (√n/2). (3.26)
Now, consider the following ratio:
( fn(x) - fn(x - 2/√n) ) / ( (2/√n) · fn(x) ). (3.27)
If fn(x) converges to a smooth function f(x) as n → ∞, we may well expect that
(3.27) → f′(x)/f(x), n → ∞. (3.28)
(3.27) can be calculated, by using (3.26), as
( 1 - fn(x - 2/√n)/fn(x) ) · (√n/2) = ( 1 - C(n, r-1)/C(n, r) ) · (√n/2)
= ( 1 - r/(n - r + 1) ) · (√n/2)
= ( (n - 2r + 1)/(n - r + 1) ) · (√n/2).

By (3.25), we have $r = \frac{1}{2}\sqrt{n}\, x + \frac{n}{2}$, and hence
$$\frac{n - 2r + 1}{n - r + 1} \cdot \frac{1}{2}\sqrt{n} = \frac{-n x + \sqrt{n}}{n - \sqrt{n}\, x + 2} \to -x, \quad n \to \infty.$$
From this and (3.28), we derive a differential equation:
$$\frac{f'(x)}{f(x)} = -x.$$
Integrating both sides, we obtain
$$\log f(x) = -\frac{x^2}{2} + C \quad (C \text{ is an integration constant}).$$
Namely,
$$f(x) = \exp\left( -\frac{x^2}{2} + C \right) = e^C \exp\left( -\frac{x^2}{2} \right).$$
By (3.25) and (3.26), the constant $e^C$ should be
$$e^C = f(0) = \lim_{n \to \infty} f_{2n}(0) = \lim_{n \to \infty} \binom{2n}{n} 2^{-2n} \cdot \frac{1}{2}\sqrt{2n}.$$
By Wallis' formula (Corollary 3.1), we see $e^C = 1/\sqrt{2\pi}$, and hence we
obtain
$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right). \qquad (3.29)$$
This is the density function of the standard normal distribution (or the
standard Gaussian distribution, Fig. 3.5).
In the above reasoning, (3.28) has no logical basis. Nevertheless, in fact,
de Moivre–Laplace's theorem (Theorem 3.6) holds. 5

Theorem 3.6. Let $Z_n$ be the standardized sum of $n$ coin tosses defined by
(3.24). Then, for any real numbers $A < B$, it holds that
$$\lim_{n \to \infty} P_n(\, A \leqq Z_n \leqq B \,) = \int_A^B \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx.$$

We will prove Theorem 3.6 in Sec. 3.3.4.

5 De Moivre–Laplace's theorem includes the case of unfair coin tosses (Example 3.2).
Theorem 3.6 is a special case of it.

Fig. 3.5 The graph of $\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right)$

3.3.3 Taylor's formula and Stirling's formula

In order to prove de Moivre–Laplace's theorem, we introduce two basic
formulas of calculus.

Theorem 3.7. (Taylor's formula) If a function $f$ is $n$ times continuously
differentiable in an interval that contains $a, b \in \mathbb{R}$, it holds that
$$f(b) = f(a) + (b-a) f'(a) + \frac{(b-a)^2}{2!} f''(a) + \cdots + \frac{(b-a)^{n-1}}{(n-1)!} f^{(n-1)}(a) + \int_a^b \frac{(b-s)^{n-1}}{(n-1)!} f^{(n)}(s)\, ds$$
$$= \sum_{k=0}^{n-1} \frac{(b-a)^k}{k!} f^{(k)}(a) + \int_a^b \frac{(b-s)^{n-1}}{(n-1)!} f^{(n)}(s)\, ds. \qquad (3.30)$$
Here $f^{(k)}$ denotes the $k$-th derivative of $f$ and $f^{(0)} := f$.

Proof. The basic relation of differentiation and integration 6
$$f(b) = f(a) + \int_a^b f'(s)\, ds$$

6 At high school, this is the definition of definite integral, but at university, definite
integral is defined in another way, and this formula is proved as the fundamental theorem
of calculus.

is Taylor's formula for $n = 1$. Applying integration by parts, we obtain
$$f(b) = f(a) + \int_a^b \left( -(b-s) \right)' f'(s)\, ds$$
$$= f(a) + \left[ -(b-s) f'(s) \right]_{s=a}^{s=b} + \int_a^b (b-s) f''(s)\, ds$$
$$= f(a) + (b-a) f'(a) + \int_a^b (b-s) f''(s)\, ds. \qquad (3.31)$$
This is Taylor's formula for $n = 2$. Applying integration by parts again, we
obtain
$$\int_a^b (b-s) f''(s)\, ds = \int_a^b \left( -\frac{(b-s)^2}{2} \right)' f''(s)\, ds$$
$$= \left[ -\frac{(b-s)^2}{2} f''(s) \right]_{s=a}^{s=b} + \int_a^b \frac{(b-s)^2}{2} f'''(s)\, ds$$
$$= \frac{(b-a)^2}{2} f''(a) + \int_a^b \frac{(b-s)^2}{2} f'''(s)\, ds.$$
Substituting this into (3.31), we obtain Taylor's formula for $n = 3$:
$$f(b) = f(a) + (b-a) f'(a) + \frac{(b-a)^2}{2} f''(a) + \int_a^b \frac{(b-s)^2}{2} f'''(s)\, ds. \qquad (3.32)$$
Repeating this procedure completes the proof.

Example 3.7. (i) Applying Taylor's formula (3.32) to $f(x) = \log(1+x)$,
and putting $a = 0$ and $b = x$, we obtain
$$\log(1+x) = x - \frac{1}{2} x^2 + \int_0^x \frac{(x-s)^2}{(1+s)^3}\, ds, \quad -1 < x. \qquad (3.33)$$
Modifying this a little, for $a > 0$, we have
$$\log(a+x) = \log a + \frac{x}{a} - \frac{1}{2} \left( \frac{x}{a} \right)^2 + \int_0^{x/a} \frac{\left( \frac{x}{a} - s \right)^2}{(1+s)^3}\, ds, \quad -a < x. \qquad (3.34)$$
(ii) Applying Taylor's formula (3.32) to $f(x) = e^x$, and putting $a = 0$,
$b = x$, we obtain
$$e^x = 1 + x + \frac{1}{2} x^2 + \int_0^x \frac{(x-s)^2}{2}\, e^s\, ds, \quad x \in \mathbb{R}. \qquad (3.35)$$

In order to study a general function $f$, it is often very effective to approximate it by a quadratic function $g$, and to study $g$ in detail with the knowledge
we learned at high school. Indeed, Taylor's formula (3.32) is useful for this

purpose. For example, if $|x| \ll 1$, the integral terms in (3.33) and (3.35)
are very small, so that the following approximation formulas hold:
$$\log(1+x) \approx x - \frac{1}{2} x^2, \qquad e^x \approx 1 + x + \frac{1}{2} x^2.$$
In this sense, the integral term of Taylor's formula (3.30) is called the
remainder term. 7

The remainder term often becomes small as $n \to \infty$. For example, in
the formula
$$e^x = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \int_0^x \frac{(x-s)^n}{n!}\, e^s\, ds, \quad x \in \mathbb{R},$$
the remainder term converges to 0 for all $x \in \mathbb{R}$ as $n \to \infty$:
$$\left| \int_0^x \frac{(x-s)^n}{n!}\, e^s\, ds \right| \leqq \left| \int_0^x \frac{|x-s|^n}{n!} \max\{e^x, e^0\}\, ds \right| \leqq \max\{e^x, 1\} \left| \int_0^x \frac{|x|^n}{n!}\, ds \right|$$
$$= \max\{e^x, 1\}\, |x|\, \frac{|x|^n}{n!} \to 0, \quad n \to \infty.$$
For the last convergence, see Proposition A.4. In other words, for all $x \in \mathbb{R}$,
we have
$$1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} \to e^x, \quad n \to \infty.$$
Remark 3.6. In what follows, inequalities like those in the previous paragraph will appear quite often. Basically, they are applications of the following
inequality:
$$\left| \int_A^B f(t)\, dt \right| \leqq \left| \int_A^B |f(t)|\, dt \right| \leqq |A - B| \max_{\min\{A,B\} \leqq t \leqq \max\{A,B\}} |f(t)|.$$

Theorem 3.8. (Stirling's formula) As $n \to \infty$,
$$n! \sim \sqrt{2\pi}\, n^{n + \frac{1}{2}}\, e^{-n}. \qquad (3.36)$$
Here $\sim$ means that the ratio of the two sides converges to 1.

7 The remainder term can also be described by higher order derivatives, which is usually
taught in the first year at university.



Example 3.8. Applying Stirling's formula to the calculation of $10000!$, we
obtain an approximate value of it:
$$\sqrt{2\pi}\, 10000^{10000 + \frac{1}{2}}\, e^{-10000} = 2.84623596219 \times 10^{35659}.$$
The true value is $2.84625968091 \times 10^{35659}$, and hence the ratio (approximate
value)/(true value) $= 0.999992$.
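This ratio can be verified without touching the 35,660-digit integer itself (a minimal sketch, assuming Python; math.lgamma computes $\log \Gamma(n+1) = \log n!$):

from math import lgamma, log, exp, pi

n = 10000
log_true = lgamma(n + 1)                                 # log(n!) via the Gamma function
log_stirling = 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n

# ratio (approximate value)/(true value) = exp(log_stirling - log_true)
print(exp(log_stirling - log_true))                      # 0.9999917..., the 0.999992 above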

Stirling's formula is necessary to estimate the binomial distribution precisely. For example, Wallis' formula follows immediately from it.

Corollary 3.1.
$$\lim_{n \to \infty} \binom{2n}{n} 2^{-2n} \cdot \frac{1}{2}\sqrt{2n} = \frac{1}{\sqrt{2\pi}}.$$

Proof. By Stirling's formula, as $n \to \infty$, we have
$$\binom{2n}{n} 2^{-2n} \cdot \frac{1}{2}\sqrt{2n} = \frac{(2n)!}{(n!)^2}\, 2^{-2n} \cdot \frac{1}{2}\sqrt{2n} \sim \frac{\sqrt{2\pi}\, (2n)^{2n + \frac{1}{2}}\, e^{-2n}}{\left( \sqrt{2\pi}\, n^{n + \frac{1}{2}}\, e^{-n} \right)^2}\, 2^{-2n} \cdot \frac{1}{2}\sqrt{2n} = \frac{1}{\sqrt{2\pi}}.$$

We prove Stirling's formula by Laplace's method ([Laplace (1812)]).

Lemma 3.2. (Euler's integral (of the second kind))
$$n! = \int_0^\infty x^n e^{-x}\, dx, \quad n \in \mathbb{N}^+. \,^8 \qquad (3.37)$$

Proof. We show (3.37) by mathematical induction. First, we show it for
$n = 1$. For $R > 0$, applying integration by parts, we obtain
$$\int_0^R x e^{-x}\, dx = \left[ x \left( -e^{-x} \right) \right]_0^R - \int_0^R (x)' \left( -e^{-x} \right) dx$$
$$= -R e^{-R} + \int_0^R e^{-x}\, dx = -R e^{-R} + \left[ -e^{-x} \right]_0^R = -R e^{-R} - e^{-R} + 1.$$

8 Note that the improper integral in the right-hand side of (3.37) makes sense in case
$n$ is not an integer. $\Gamma(s+1) := \int_0^\infty x^s e^{-x}\, dx$ is called the Gamma function, which defines
the factorial for any real $s > 0$.



By Proposition A.3 (i), the right-hand side of the last line converges to 1
as $R \to \infty$. Thus
$$\int_0^\infty x e^{-x}\, dx = 1,$$
which completes the proof of (3.37) for $n = 1$.

Next, assuming that (3.37) is valid for $n = k$, we show that it is also
valid for $n = k + 1$. For $R > 0$, applying integration by parts, we obtain
$$\int_0^R x^{k+1} e^{-x}\, dx = \left[ x^{k+1} \left( -e^{-x} \right) \right]_0^R + \int_0^R (k+1) x^k e^{-x}\, dx$$
$$= -R^{k+1} e^{-R} + (k+1) \int_0^R x^k e^{-x}\, dx.$$
Again, by Proposition A.3 (i), the first term of the last line converges to 0
as $R \to \infty$, while the second term converges to $(k+1) \cdot k! = (k+1)!$ by the
assumption of the induction. Thus we have
$$\int_0^\infty x^{k+1} e^{-x}\, dx = (k+1)!,$$
which completes the proof.

At first glance, $n!$ looks simpler than Euler's integral (3.37), but to study
it in detail, the integral is much more useful.

Now, let us look closely at Euler's integral:
$$\int_0^\infty x^n e^{-x}\, dx = \int_0^\infty \exp(n \log x - x)\, dx = \int_0^\infty \exp\left( n \left( \log x - \frac{x}{n} \right) \right) dx.$$
The change of variables $t = x/n$ leads to
$$n! = n^{n+1} \int_0^\infty \exp\left( n (\log t - t) \right) dt. \qquad (3.38)$$
Then, we know
$$f(t) := \log t - t, \quad t > 0,$$
is the key function.

We differentiate $f$ to find an extremal value. The equation
$$f'(t) = \frac{1}{t} - 1 = 0$$
has a unique solution $t = 1$. Since $f''(t) = -1/t^2 < 0$, the extremal value
$f(1) = -1$ is the maximum value of $f$ (Fig. 3.6).

The following lemma is really amazing.



Fig. 3.6 The graph of f(t)

Lemma 3.3. For any $0 < \varepsilon < 1$ (Remark 3.1), it holds that
$$\int_0^\infty \exp(n f(t))\, dt \sim \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt, \quad n \to \infty.$$
Namely, if $n \gg 1$, the integral of $\exp(n f(t))$ on $[0, \infty)$ is almost determined
by the integral on an arbitrarily small neighborhood of $t = 1$, at which $f$ takes
its maximum value.
Proof. We define a function $g$ by
$$g(t) := \begin{cases} \dfrac{-1 - f(1-\varepsilon)}{\varepsilon}\,(t-1) - 1 & (0 < t \leqq 1), \\[2mm] \dfrac{f(1+\varepsilon) + 1}{\varepsilon}\,(t-1) - 1 & (1 < t). \end{cases}$$
The graph of $g$ is a polyline that passes through the following three points (Fig. 3.7):
$(1-\varepsilon, f(1-\varepsilon))$, $(1, -1)$ and $(1+\varepsilon, f(1+\varepsilon))$.
Since $f$ is a concave function, we have
$$g(t) \begin{cases} > f(t) & (0 < t < 1-\varepsilon), \\ \leqq f(t) & (1-\varepsilon \leqq t \leqq 1+\varepsilon), \\ > f(t) & (1+\varepsilon < t). \end{cases}$$
Therefore
$$\frac{\displaystyle \int_{t>0,\ |t-1|>\varepsilon} \exp(n f(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt} \leqq \frac{\displaystyle \int_{t>0,\ |t-1|>\varepsilon} \exp(n g(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n g(t))\, dt}, \qquad (3.39)$$

Fig. 3.7 The graph of g(t) (ε = 4/5, polyline)

where $\displaystyle \int_{t>0,\ |t-1|>\varepsilon}$ denotes the integral $\displaystyle \int_0^{1-\varepsilon} + \int_{1+\varepsilon}^\infty$ on the set $\{t > 0\} \cap \{|t-1| > \varepsilon\}$.

Since $g$ is a polyline, we can explicitly calculate the right-hand side of
(3.39). First, in the region $0 < t \leqq 1$,
$$a := \frac{-1 - f(1-\varepsilon)}{\varepsilon} > 0$$
is the slope of $g(t)$. Then, as $n \to \infty$,
$$\frac{\displaystyle \int_0^{1-\varepsilon} \exp(n g(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n g(t))\, dt} \leqq \frac{\displaystyle \int_{-\infty}^{1-\varepsilon} \exp(n g(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1} \exp(n g(t))\, dt} = \frac{\displaystyle \int_{-\infty}^{1-\varepsilon} \exp\left( n (a(t-1) - 1) \right) dt}{\displaystyle \int_{1-\varepsilon}^{1} \exp\left( n (a(t-1) - 1) \right) dt}$$
$$= \frac{\dfrac{1}{na}\, e^{-n} \exp(-na\varepsilon)}{\dfrac{1}{na}\, e^{-n} \left( 1 - \exp(-na\varepsilon) \right)} = \frac{\exp(-na\varepsilon)}{1 - \exp(-na\varepsilon)} \to 0. \qquad (3.40)$$

Similarly, in the region $1 < t$,
$$a' := \frac{f(1+\varepsilon) + 1}{\varepsilon} < 0$$
is the slope of $g(t)$. Then, as $n \to \infty$,
$$\frac{\displaystyle \int_{1+\varepsilon}^\infty \exp(n g(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n g(t))\, dt} \leqq \frac{\displaystyle \int_{1+\varepsilon}^\infty \exp(n g(t))\, dt}{\displaystyle \int_{1}^{1+\varepsilon} \exp(n g(t))\, dt} = \frac{\exp(na'\varepsilon)}{1 - \exp(na'\varepsilon)} \to 0. \qquad (3.41)$$
By (3.40) and (3.41), the right-hand side (and hence the left-hand side)
of (3.39) converges to 0 as $n \to \infty$. Therefore
$$\frac{\displaystyle \int_0^\infty \exp(n f(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt} = \frac{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt + \int_{t>0,\ |t-1|>\varepsilon} \exp(n f(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt}$$
$$= 1 + \frac{\displaystyle \int_{t>0,\ |t-1|>\varepsilon} \exp(n f(t))\, dt}{\displaystyle \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt} \to 1, \quad n \to \infty.$$
Thus the proof of Lemma 3.3 is complete.

From (3.38) and Lemma 3.3, it follows that for any $0 < \varepsilon < 1$,
$$n! \sim n^{n+1} \int_{1-\varepsilon}^{1+\varepsilon} \exp(n f(t))\, dt, \quad n \to \infty. \qquad (3.42)$$
Furthermore, looking carefully at the above proof, if we replace $\varepsilon$ by a
positive decreasing sequence $\{\varepsilon(n)\}_{n=1}^\infty$ converging to 0, there is still a possibility for (3.42) to hold. Indeed, for example,
$$\varepsilon(n) := n^{-1/4}, \quad n = 2, 3, \ldots$$
satisfies the following:
$$n! \sim n^{n+1} \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp(n f(t))\, dt, \quad n \to \infty. \qquad (3.43)$$

Let us prove (3.43). To this end, considering (3.40) and (3.41), it is
enough to show
$$\lim_{n \to \infty} \exp(-n a \varepsilon(n)) = 0, \quad \lim_{n \to \infty} \exp(n a' \varepsilon(n)) = 0.$$

To show these two, considering the definitions of $a$ and $a'$, it is enough to
show
$$\lim_{n \to \infty} n \left( 1 + f(1 - \varepsilon(n)) \right) = -\infty, \quad \lim_{n \to \infty} n \left( 1 + f(1 + \varepsilon(n)) \right) = -\infty.$$
Since $f(t) = \log t - t$ and $\varepsilon(n) = n^{-1/4}$, what we have to show are
$$\lim_{n \to \infty} n \left( \log(1 - n^{-1/4}) + n^{-1/4} \right) = -\infty, \qquad (3.44)$$
$$\lim_{n \to \infty} n \left( \log(1 + n^{-1/4}) - n^{-1/4} \right) = -\infty. \qquad (3.45)$$
First, as for (3.44), recalling (3.33), we see
$$n \left( \log(1 - n^{-1/4}) + n^{-1/4} \right) = n \left( -\frac{1}{2} \left( n^{-1/4} \right)^2 + \int_0^{-n^{-1/4}} \frac{\left( -n^{-1/4} - s \right)^2}{(1+s)^3}\, ds \right)$$
$$< -\frac{1}{2}\, n^{1/2} \to -\infty, \quad n \to \infty.$$
Next, as for (3.45), we see
$$n \left( \log(1 + n^{-1/4}) - n^{-1/4} \right) = n \left( -\frac{1}{2} \left( n^{-1/4} \right)^2 + \int_0^{n^{-1/4}} \frac{\left( n^{-1/4} - s \right)^2}{(1+s)^3}\, ds \right)$$
$$\leqq n \left( -\frac{1}{2} \left( n^{-1/4} \right)^2 + \int_0^{n^{-1/4}} \frac{\left( n^{-1/4} \right)^2}{1}\, ds \right) = -\frac{1}{2}\, n^{1/2} + n^{1/4} \to -\infty, \quad n \to \infty.$$
From these, (3.43) follows.

Applying Taylor's formula to the function $f(t)$, since $f(1) = -1$, $f'(1) = 0$, and $f''(s) = -1/s^2$, we have
$$f(t) = f(1) + (t-1) f'(1) + \int_1^t (t-s) f''(s)\, ds = -1 - \int_1^t \frac{t-s}{s^2}\, ds.$$
By (3.43), to prove Stirling's formula, it is enough to look at the behavior
of $f(t)$ in the region $1 - \varepsilon(n) \leqq t \leqq 1 + \varepsilon(n)$. In this region, it holds that
$$-1 - \int_1^t \frac{t-s}{(1-\varepsilon(n))^2}\, ds \leqq f(t) \leqq -1 - \int_1^t \frac{t-s}{(1+\varepsilon(n))^2}\, ds,$$

namely,
$$-1 - \frac{(t-1)^2}{2(1-\varepsilon(n))^2} \leqq f(t) \leqq -1 - \frac{(t-1)^2}{2(1+\varepsilon(n))^2}.$$
If we put $b_\pm(n) := \sqrt{2}\,(1 \pm \varepsilon(n))$, then $b_\pm(n) \to \sqrt{2}$ as $n \to \infty$, and
$$e^{-n} \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left( -\frac{n(t-1)^2}{b_-(n)^2} \right) dt \leqq \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{n f(t)}\, dt \leqq e^{-n} \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left( -\frac{n(t-1)^2}{b_+(n)^2} \right) dt.$$
Multiplying all by $e^n$,
$$\int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left( -\frac{n(t-1)^2}{b_-(n)^2} \right) dt \leqq e^n \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{n f(t)}\, dt \leqq \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left( -\frac{n(t-1)^2}{b_+(n)^2} \right) dt. \qquad (3.46)$$

Changing variables $u = \sqrt{n}\,(t-1)/b_\pm(n)$ leads to
$$\sqrt{n} \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} \exp\left( -n \left( \frac{t-1}{b_\pm(n)} \right)^2 \right) dt = b_\pm(n) \int_{-\sqrt{n}\,\varepsilon(n)/b_\pm(n)}^{\sqrt{n}\,\varepsilon(n)/b_\pm(n)} e^{-u^2}\, du.$$
Here, as $n \to \infty$, $\sqrt{n}\,\varepsilon(n)/b_\pm(n) = n^{1/4}/b_\pm(n) \to \infty$, and hence the right-hand side of the above converges to (Remark 3.7)
$$\sqrt{2} \int_{-\infty}^\infty e^{-u^2}\, du = \sqrt{2\pi}, \quad n \to \infty.$$
From this and (3.46), by the squeeze theorem, it follows that
$$\sqrt{n}\, e^n \int_{1-\varepsilon(n)}^{1+\varepsilon(n)} e^{n f(t)}\, dt \to \sqrt{2\pi}, \quad n \to \infty.$$
Therefore (3.43) implies
$$\sqrt{n}\, e^n\, \frac{n!}{n^{n+1}} \to \sqrt{2\pi}, \quad n \to \infty,$$
from which Stirling's formula (3.36) immediately follows. $\square$
Remark 3.7. The improper integral
$$\int_{-\infty}^\infty \exp\left( -u^2 \right) du = \sqrt{\pi}$$
or its equivalent
$$\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right) du = 1$$
is called the Gaussian integral, which appears very often in mathematics,
physics, etc.

3.3.4 Proof of de Moivre–Laplace's theorem

The following is the key lemma to prove de Moivre–Laplace's theorem.

Lemma 3.4. Let $A, B$ ($A < B$) be any two real numbers. Suppose that
$n, k \in \mathbb{N}$ satisfy the condition
$$\frac{1}{2} n + \frac{A}{2} \sqrt{n} \leqq k \leqq \frac{1}{2} n + \frac{B}{2} \sqrt{n}. \qquad (3.47)$$
Then, we have
$$b(k; n) := \binom{n}{k} 2^{-n} \sim \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{\left( k - \frac{1}{2} n \right)^2}{\frac{1}{2} n} \right), \quad n \to \infty.$$
More precisely, if we put
$$b(k; n) = \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{\left( k - \frac{1}{2} n \right)^2}{\frac{1}{2} n} \right) (1 + r_n(k)), \qquad (3.48)$$
then $r_n(k)$ satisfies the following:
$$\max_{k;\,(3.47)} |r_n(k)| \to 0, \quad n \to \infty, \qquad (3.49)$$
where $\max_{k;\,(3.47)}$ denotes the maximum value over $k$ satisfying (3.47) with $n$
fixed.

Proof. 9 If we put
$$n! = \sqrt{2\pi}\, n^{n + \frac{1}{2}}\, e^{-n} (1 + \delta_n), \qquad (3.50)$$
we have $\delta_n \to 0$ as $n \to \infty$ by Stirling's formula (3.36). The following
holds (Proposition A.2):
$$\max_{k;\,(3.47)} |\delta_k| \to 0, \quad n \to \infty. \qquad (3.51)$$
Substituting (3.50) into $b(k; n) = \binom{n}{k} 2^{-n} = \frac{n!}{(n-k)!\, k!}\, 2^{-n}$, we have
$$b(k; n) = \frac{1}{\sqrt{2\pi n\, \frac{k}{n}\, \frac{n-k}{n}}} \left( \frac{k}{n} \right)^{-k} \left( \frac{n-k}{n} \right)^{-n+k} 2^{-n}\, \frac{1 + \delta_n}{(1 + \delta_k)(1 + \delta_{n-k})}.$$
As $n \to \infty$,
$$\max_{k;\,(3.47)} \left| \frac{k}{n} - \frac{1}{2} \right| = \max_{k;\,(3.47)} \left| \frac{n-k}{n} - \frac{1}{2} \right| \leqq \frac{\max\{|A|, |B|\}}{2\sqrt{n}} \to 0. \qquad (3.52)$$

9 The proof of this lemma (cf. [Sinai (1992)] Theorem 3.1) is difficult. Readers should
challenge it after reading Sec. A.3.



Then, putting
$$b(k; n) = \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \left( \frac{k}{n} \right)^{-k} \left( \frac{n-k}{n} \right)^{-n+k} 2^{-n} (1 + r_{n,k}),$$
by (3.51) and (3.52), we have 10
$$\max_{k;\,(3.47)} |r_{n,k}| \to 0, \quad n \to \infty. \qquad (3.53)$$

Using the exponential function, we have
$$b(k; n) = \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp(T_{n,k})\, (1 + r_{n,k}),$$
where
$$T_{n,k} := -k \log \frac{k}{n} - (n-k) \log \frac{n-k}{n} + n \log \frac{1}{2}.$$
To show the lemma, putting
$$z := \frac{k - \frac{1}{2} n}{\frac{1}{2}\sqrt{n}},$$
we estimate the difference between $T_{n,k}$ and $-z^2/2$. First we write $T_{n,k}$ in
terms of $z$:
$$T_{n,k} = -T_{n,k}^{(1)} - T_{n,k}^{(2)} + n \log \frac{1}{2},$$
where
$$T_{n,k}^{(1)} := k \log\left( \frac{1}{2} + \frac{1}{2} z \sqrt{\frac{1}{n}} \right), \quad T_{n,k}^{(2)} := (n-k) \log\left( \frac{1}{2} - \frac{1}{2} z \sqrt{\frac{1}{n}} \right).$$
Now, applying (3.34) with $a = \frac{1}{2}$ and $x = \frac{1}{2} z \sqrt{\frac{1}{n}}$, we see
$$T_{n,k}^{(1)} = k \log \frac{1}{2} + k z \sqrt{\frac{1}{n}} - \frac{k}{2}\, \frac{z^2}{n} + k \int_0^{z\sqrt{1/n}} \frac{\left( z\sqrt{\frac{1}{n}} - s \right)^2}{(1+s)^3}\, ds.$$
Let $\Delta_{n,k}$ denote the last integral term (Remark 3.6). Then,
$$|\Delta_{n,k}| \leqq k \left| \int_0^{z\sqrt{1/n}} \frac{\left( z\sqrt{\frac{1}{n}} \right)^2}{\left( 1 - \left| z\sqrt{\frac{1}{n}} \right| \right)^3}\, ds \right| = \frac{k}{n}\, |z|^3 \sqrt{\frac{1}{n}} \left( 1 - |z| \sqrt{\frac{1}{n}} \right)^{-3}. \qquad (3.54)$$

10 The proof of (3.53) needs the knowledge of continuity of functions of several variables.
See Sec. A.3.3.
See Sec. A.3.3.



Under the condition (3.47), we have $|z| \leqq C := \max\{|A|, |B|\}$, so that
$$\max_{k;\,(3.47)} |\Delta_{n,k}| \leqq \frac{\frac{1}{2} n + \frac{B}{2} \sqrt{n}}{n}\; C^3 \sqrt{\frac{1}{n}} \left( 1 - C \sqrt{\frac{1}{n}} \right)^{-3} \to 0, \quad n \to \infty.$$
Similarly, applying (3.34) with $a = \frac{1}{2}$ and $x = -\frac{1}{2} z \sqrt{\frac{1}{n}}$, we see
$$T_{n,k}^{(2)} = (n-k) \log \frac{1}{2} - (n-k)\, z \sqrt{\frac{1}{n}} - \frac{n-k}{2}\, \frac{z^2}{n} + (n-k) \int_0^{-z\sqrt{1/n}} \frac{\left( -z\sqrt{\frac{1}{n}} - s \right)^2}{(1+s)^3}\, ds.$$
Let $\Delta'_{n,k}$ denote the last integral term. Then, we also have
$$\max_{k;\,(3.47)} |\Delta'_{n,k}| \to 0, \quad n \to \infty.$$
Putting $\Delta''_{n,k} := -\Delta_{n,k} - \Delta'_{n,k}$, it holds that
$$T_{n,k} = -\left( k \log \frac{1}{2} + k z \sqrt{\frac{1}{n}} - \frac{k}{2}\, \frac{z^2}{n} + \Delta_{n,k} \right)$$
$$\quad - \left( (n-k) \log \frac{1}{2} - (n-k)\, z \sqrt{\frac{1}{n}} - \frac{n-k}{2}\, \frac{z^2}{n} + \Delta'_{n,k} \right) + n \log \frac{1}{2}$$
$$= (n - 2k)\, z \sqrt{\frac{1}{n}} + \frac{z^2}{2} + \Delta''_{n,k}.$$
Recall that $k = \frac{1}{2} z \sqrt{n} + \frac{1}{2} n$, and we see
$$T_{n,k} = -\frac{z^2}{2} + \Delta''_{n,k}$$
with
$$\max_{k;\,(3.47)} |\Delta''_{n,k}| \to 0, \quad n \to \infty. \qquad (3.55)$$

From all the above,
$$b(k; n) = \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp(T_{n,k})\, (1 + r_{n,k}) = \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} + \Delta''_{n,k} \right) (1 + r_{n,k})$$
$$= \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{\left( k - \frac{1}{2} n \right)^2}{\frac{1}{2} n} \right) \exp\left( \Delta''_{n,k} \right) (1 + r_{n,k}).$$

On the other hand, by (3.48),
$$r_n(k) = b(k; n) \Bigg/ \left( \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{\left( k - \frac{1}{2} n \right)^2}{\frac{1}{2} n} \right) \right) - 1,$$
and hence
$$r_n(k) = \exp\left( \Delta''_{n,k} \right) (1 + r_{n,k}) - 1.$$
From this, (3.53) and (3.55), the desired (3.49) follows. 11

Proof of Theorem 3.6. By Theorem 3.5 and Lemma 3.4,
$$P_n(A \leqq Z_n \leqq B) = P_n\left( \frac{n}{2} + \frac{A}{2} \sqrt{n} \leqq \sigma_1 + \cdots + \sigma_n \leqq \frac{n}{2} + \frac{B}{2} \sqrt{n} \right) = \sum_{k;\,(3.47)} b(k; n)$$
$$= \sum_{k;\,(3.47)} \frac{1}{\frac{1}{2}\sqrt{n}\,\sqrt{2\pi}} \exp\left( -\frac{\left( k - \frac{1}{2} n \right)^2}{\frac{1}{2} n} \right) (1 + r_n(k)),$$
where $\sum_{k;\,(3.47)}$ denotes the sum over $k$ satisfying the condition (3.47) with
$n$ fixed. For each $k \in \mathbb{N}^+$, let
$$z_k := \frac{k - \frac{1}{2} n}{\frac{1}{2}\sqrt{n}}.$$
Noting that the distance between two adjacent $z_k$ and $z_{k+1}$ is $1/\left(\frac{1}{2}\sqrt{n}\right)$, we
rewrite $P_n(A \leqq Z_n \leqq B)$ as a Riemann sum and a remainder:
$$P_n(A \leqq Z_n \leqq B) = \frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right) (1 + r_n(k))$$
$$= \frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right) + \frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right) r_n(k).$$
The limit of the first term, the Riemann sum, is
$$\lim_{n \to \infty} \frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right) = \int_A^B \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) dz.$$

11 The proof of (3.49) needs the knowledge of continuity of functions of several variables.
See Sec. A.3.3.
See Sec. A.3.3.



As for the second term, since the factor
$\frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right)$ is bounded by a constant $M > 0$ that is independent
of $n$ (Proposition A.1), we see
$$\left| \frac{1}{\frac{1}{2}\sqrt{n}} \sum_{A \leqq z_k \leqq B} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z_k^2}{2} \right) r_n(k) \right| \leqq M \max_{k;\,(3.47)} |r_n(k)| \to 0, \quad n \to \infty.$$
This completes the proof of de Moivre–Laplace's theorem. $\square$

Example 3.9. Applying de Moivre–Laplace's theorem, let us calculate
an approximate value of the following probability:
$$P_{100}(\, \sigma_1 + \cdots + \sigma_{100} \geqq 55 \,) = \sum_{k=55}^{100} \binom{100}{k} 2^{-100}. \qquad (3.56)$$
This is the probability that the number of Heads is more than or equal to 55
among 100 coin tosses. Although the random variable $\sigma_1 + \cdots + \sigma_{100}$ takes
only integers $0, \ldots, 100$, just as we considered the event $\{\sigma_1 + \cdots + \sigma_{100} = k\}$
as $\{k - (1/2) \leqq \sigma_1 + \cdots + \sigma_{100} < k + (1/2)\}$ in the histogram of Fig. 3.3 (right),
if we consider (3.56) as
$$P_{100}(\, \sigma_1 + \cdots + \sigma_{100} \geqq 54.5 \,),$$
and apply de Moivre–Laplace's theorem, the accuracy of approximation
will be improved. This is called the continuity correction. By this method,
we obtain
$$P_{100}(\, \sigma_1 + \cdots + \sigma_{100} \geqq 54.5 \,) = P_{100}\left( \frac{\sigma_1 + \cdots + \sigma_{100} - 50}{\frac{1}{2}\sqrt{100}} \geqq \frac{54.5 - 50}{\frac{1}{2}\sqrt{100}} \right)$$
$$= P_{100}\left( \frac{\sigma_1 + \cdots + \sigma_{100} - 50}{\frac{1}{2}\sqrt{100}} \geqq 0.9 \right) \approx \int_{0.9}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 0.18406.$$
Since the true value of (3.56) is
$$\sum_{k=55}^{100} \binom{100}{k} 2^{-100} = 233375500604595657604761955760 \cdot 2^{-100} = 0.1841008087\ldots,$$
the approximation error is only 0.00004.
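Both numbers of Example 3.9 can be reproduced in a few lines (a minimal sketch, assuming Python; the normal tail is expressed through math.erf):

from math import comb, erf, sqrt

# Exact tail probability P_100( heads >= 55 ) for a fair coin.
exact = sum(comb(100, k) for k in range(55, 101)) / 2**100

# Normal approximation with continuity correction:
# P( Z >= (54.5 - 50) / (0.5 * sqrt(100)) ) = 1 - Phi(0.9),
# where Phi(z) = (1 + erf(z / sqrt(2))) / 2.
z = (54.5 - 50) / (0.5 * sqrt(100))
approx = 1 - (1 + erf(z / sqrt(2))) / 2

print(exact, approx)   # 0.18410080866... and 0.18406012534...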

Example 3.10. After making a continuity correction, we apply de
Moivre–Laplace's theorem to the following case:
$$P_{10^8}\left( \left| \frac{\sigma_1 + \cdots + \sigma_{10^8}}{10^8} - \frac{1}{2} \right| \geqq \frac{1}{2000} \right) \approx 2 \int_{9.9999}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 1.52551 \times 10^{-23}.$$
This is even more precise than the inequality (3.22). Recalling that the
Avogadro constant is $6.02 \times 10^{23}$ helps us imagine how small this probability
is. The same probability was estimated as $\leqq 1/100$ by Chebyshev's
inequality (3.15). This fact tells us that Chebyshev's inequality gives very loose
bounds. However, do not jump to the conclusion that it is therefore a
bad inequality. Rather, we should admire Chebyshev's great insight in realizing
that this loose inequality is enough to prove the law of large numbers.
Do readers remember that the inequality (3.2) used in the proof of
Theorem 3.1 was also very loose?

3.4 Central limit theorem

The assertion of de Moivre–Laplace's theorem is valid not only for coin
tosses but also for general sequences of i.i.d. random variables.

Suppose that for each $n \in \mathbb{N}^+$, a probability space $(\Omega_n, P(\Omega_n), \mu_n)$ and
i.i.d. random variables $X_{n,1}, X_{n,2}, \ldots, X_{n,n}$ are given. Let their common
mean be
$$E[X_{n,k}] = m_n.$$
In addition, suppose that there exist constants $\sigma^2 > 0$ and $R > 0$ independent of $n$, such that
$$V[X_{n,k}] = \sigma_n^2 \geqq \sigma^2, \qquad (3.57)$$
$$\max_{\omega \in \Omega_n} |X_{n,k}(\omega) - m_n| \leqq R. \qquad (3.58)$$
Consider the standardization of $X_{n,1} + \cdots + X_{n,n}$:
$$Z_n := \frac{(X_{n,1} - m_n) + \cdots + (X_{n,n} - m_n)}{\sqrt{\sigma_n^2\, n}}. \qquad (3.59)$$

Theorem 3.9. For any $A < B$, it holds that
$$\lim_{n \to \infty} \mu_n(\, A \leqq Z_n \leqq B \,) = \int_A^B \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx.$$
n A 2 2

Example 3.11. Let $0 < p < 1$ and let $(\{0,1\}^n, P(\{0,1\}^n), P_n^{(p)})$ be the
probability space of Example 3.2. Then, the coordinate functions $\{\sigma_k\}_{k=1}^n$
are unfair coin tosses if $p \neq 1/2$. Since $E[\sigma_k] = p$ and $V[\sigma_k] = p(1-p)$,
Theorem 3.9 implies
$$\lim_{n \to \infty} P_n^{(p)}\left( A \leqq \frac{(\sigma_1 - p) + \cdots + (\sigma_n - p)}{\sqrt{p(1-p)\, n}} \leqq B \right) = \int_A^B \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.$$

In general, a theorem such as Theorem 3.9 that asserts that the distribution
of the standardized sum of $n$ random variables converges to the standard
normal distribution as $n \to \infty$ is called a central limit theorem. Among
numerous limit theorems, it is the most important, and hence Pólya named
it central. Proving Theorem 3.9 rigorously exceeds the level of this book,
so we here give a convincing explanation that it must be true.

As the proof of Theorem 3.5 somewhat implies, the moment generating
function $M_X(t)$ of a random variable $X$ determines the distribution of $X$.

Proposition 3.6. Let $X$ and $Y$ be random variables. If $M_X(t) = M_Y(t)$,
$t \in \mathbb{R}$, then $X$ and $Y$ are identically distributed.

Proof. Let $a_1 < \cdots < a_s$ be all possible values that $X$ takes. We have
$$\lim_{t \to \infty} \frac{1}{t} \log M_X(t) = \lim_{t \to \infty} \frac{1}{t} \log \sum_{i=1}^s \exp(t a_i)\, P(X = a_i)$$
$$= \lim_{t \to \infty} \frac{1}{t} \log \left( \exp(t a_s) \sum_{i=1}^s \exp(t(a_i - a_s))\, P(X = a_i) \right)$$
$$= \lim_{t \to \infty} \left( \frac{1}{t} \log \exp(t a_s) + \frac{1}{t} \log \sum_{i=1}^s \exp(t(a_i - a_s))\, P(X = a_i) \right)$$
$$= a_s + \lim_{t \to \infty} \frac{1}{t} \log \left( P(X = a_s) + \sum_{i=1}^{s-1} \exp(t(a_i - a_s))\, P(X = a_i) \right) = a_s.$$

After knowing $a_s$,
$$\lim_{t \to \infty} M_X(t) \exp(-t a_s) = \lim_{t \to \infty} \sum_{i=1}^s \exp(t(a_i - a_s))\, P(X = a_i)$$
$$= P(X = a_s) + \lim_{t \to \infty} \sum_{i=1}^{s-1} \exp(t(a_i - a_s))\, P(X = a_i) = P(X = a_s).$$
Thus we can obtain $a_s$ and $P(X = a_s)$ from $M_X(t)$. If we apply this
procedure to
$$M_X(t) - \exp(t a_s)\, P(X = a_s),$$
then we obtain $a_{s-1}$ and $P(X = a_{s-1})$. Repeating this procedure, we
obtain the distribution of $X$ from $M_X(t)$. In particular, $M_X(t) = M_Y(t)$,
$t \in \mathbb{R}$, implies that $X$ and $Y$ are identically distributed.

Now, let us calculate the moment generating function of the standardized sum (3.24) of $n$ coin tosses $\{\sigma_i\}_{i=1}^n$, and observe its asymptotic behavior:
$$E\left[ \exp\left( t\, \frac{\sigma_1 + \cdots + \sigma_n - \frac{n}{2}}{\frac{1}{2}\sqrt{n}} \right) \right] = E\left[ \exp\left( t\, \frac{\sigma_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}} \right) \cdots \exp\left( t\, \frac{\sigma_n - \frac{1}{2}}{\frac{1}{2}\sqrt{n}} \right) \right]$$
$$= E\left[ \exp\left( t\, \frac{\sigma_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}} \right) \right] \cdots E\left[ \exp\left( t\, \frac{\sigma_n - \frac{1}{2}}{\frac{1}{2}\sqrt{n}} \right) \right] = \left( E\left[ \exp\left( t\, \frac{\sigma_1 - \frac{1}{2}}{\frac{1}{2}\sqrt{n}} \right) \right] \right)^n$$
$$= \left( \frac{1}{2} \exp\left( -\frac{t}{\sqrt{n}} \right) + \frac{1}{2} \exp\left( \frac{t}{\sqrt{n}} \right) \right)^n = \left( 1 + \frac{1}{2} \left( \exp\left( \frac{t}{2\sqrt{n}} \right) - \exp\left( -\frac{t}{2\sqrt{n}} \right) \right)^2 \right)^n$$
$$= \left( 1 + \frac{c_n(t)}{n} \right)^n, \qquad (3.60)$$
where
$$c_n(t) := \frac{n}{2} \left( \exp\left( \frac{t}{2\sqrt{n}} \right) - \exp\left( -\frac{t}{2\sqrt{n}} \right) \right)^2 \to \frac{t^2}{2}, \quad n \to \infty. \qquad (3.61)$$

The last convergence can be proved in various ways. Here we prove it by
Taylor's formula (3.31):
$$e^x = 1 + x + \int_0^x (x-s)\, e^s\, ds, \quad x \in \mathbb{R}.$$
According to this formula, if $|x| \ll 1$, we have $e^x \approx 1 + x$, and hence
$$c_n(t) \approx \frac{n}{2} \left( \left( 1 + \frac{t}{2\sqrt{n}} \right) - \left( 1 - \frac{t}{2\sqrt{n}} \right) \right)^2 = \frac{n}{2} \left( \frac{t}{\sqrt{n}} \right)^2 = \frac{t^2}{2}.$$
To prove (3.61), we have to show that the remainder term converges to 0
as $n \to \infty$, which can be done in the same way as (3.54).

Finally, from (3.60) and (3.61), and Lemma 3.5 below, it follows that
for each $t \in \mathbb{R}$, we have
$$E\left[ \exp\left( t\, \frac{\sigma_1 + \cdots + \sigma_n - \frac{n}{2}}{\frac{1}{2}\sqrt{n}} \right) \right] \to \exp\left( \frac{t^2}{2} \right), \quad n \to \infty. \qquad (3.62)$$

Lemma 3.5. If a sequence of real numbers $\{c_n\}_{n=1}^\infty$ converges to $c \in \mathbb{R}$,
then
$$\lim_{n \to \infty} \left( 1 + \frac{c_n}{n} \right)^n = e^c.$$

Proof. Since $c_n/n \to 0$, $n \to \infty$, we may assume $|c_n/n| < 1$ for $n \gg 1$. By
(3.33), using the estimation that led to (3.54), we get the following:
$$\left| n \log\left( 1 + \frac{c_n}{n} \right) - c \right| \leqq \left| n \log\left( 1 + \frac{c_n}{n} \right) - c_n \right| + |c_n - c|$$
$$= \left| n \left( -\frac{1}{2} \left( \frac{c_n}{n} \right)^2 + \int_0^{c_n/n} \frac{\left( \frac{c_n}{n} - t \right)^2}{(1+t)^3}\, dt \right) \right| + |c_n - c|$$
$$\leqq \frac{c_n^2}{2n} + n \left| \int_0^{c_n/n} \frac{\left( \frac{c_n}{n} \right)^2}{\left( 1 - \left| \frac{c_n}{n} \right| \right)^3}\, dt \right| + |c_n - c|$$
$$= \frac{c_n^2}{2n} + \frac{|c_n|^3}{n^2} \left( 1 - \left| \frac{c_n}{n} \right| \right)^{-3} + |c_n - c| \to 0, \quad n \to \infty.$$
Therefore
$$\lim_{n \to \infty} n \log\left( 1 + \frac{c_n}{n} \right) = c.$$

By the continuity of the exponential function,
$$\lim_{n \to \infty} \left( 1 + \frac{c_n}{n} \right)^n = \lim_{n \to \infty} \exp\left( n \log\left( 1 + \frac{c_n}{n} \right) \right) = \exp\left( \lim_{n \to \infty} n \log\left( 1 + \frac{c_n}{n} \right) \right) = e^c. \quad \square$$

The function $\exp(t^2/2)$, which is the limit in (3.62), has a deep relation
with the density function (3.29) of the standard normal distribution.

Lemma 3.6. For each $t \in \mathbb{R}$,
$$\int_{-\infty}^\infty e^{tx}\, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \exp\left( \frac{t^2}{2} \right).$$

Proof. We complete the square for the quadratic function of $x$ in the exponent:
$$\int_{-\infty}^\infty e^{tx}\, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left( tx - \frac{1}{2} x^2 \right) dx$$
$$= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (x-t)^2 + \frac{1}{2} t^2 \right) dx = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x-t)^2}{2} \right) dx \cdot \exp\left( \frac{t^2}{2} \right)$$
$$= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx \cdot \exp\left( \frac{t^2}{2} \right) = \exp\left( \frac{t^2}{2} \right).$$
The last $=$ is due to Remark 3.7. $\square$

Considering Proposition 3.6, the convergence (3.62), and Lemma 3.6,
let us show, as a substitute for the proof of Theorem 3.9, that the moment
generating function of $Z_n$ defined by (3.59) satisfies
$$\lim_{n \to \infty} M_{Z_n}(t) = \exp\left( \frac{t^2}{2} \right), \quad t \in \mathbb{R}.$$

We calculate $M_{Z_n}(t)$ explicitly:
$$M_{Z_n}(t) = E\left[ \exp\left( t\, \frac{(X_{n,1} - m_n) + \cdots + (X_{n,n} - m_n)}{\sqrt{\sigma_n^2\, n}} \right) \right] = \left( E\left[ \exp\left( t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right) \right] \right)^n.$$
By Taylor's formula (3.35), we see
$$E\left[ \exp\left( t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right) \right] = E\left[ 1 + t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} + \frac{t^2}{2}\, \frac{(X_{n,1} - m_n)^2}{\sigma_n^2\, n} + \frac{r(n,t)}{n} \right]$$
$$= 1 + \frac{t^2}{2n}\, E\left[ \frac{(X_{n,1} - m_n)^2}{\sigma_n^2} \right] + E\left[ \frac{r(n,t)}{n} \right] = 1 + \frac{t^2}{2n} + \frac{E[r(n,t)]}{n},$$
where the remainder term $r(n,t)$ is given by
$$r(n,t) = n \int_0^{t \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2 n}}} \frac{1}{2} \left( t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} - s \right)^2 e^s\, ds,$$
whose mean is then estimated by (3.57) and (3.58) as
$$|E[r(n,t)]| \leqq E\left[ \frac{n}{2} \left| t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right|^3 \exp\left( \left| t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right| \right) \right]$$
$$\leqq \frac{n}{2}\, \frac{|t|^3}{n^{3/2}} \left( \frac{R}{\sigma} \right)^3 \exp\left( |t|\, \frac{R}{\sigma \sqrt{n}} \right) = \frac{1}{2\sqrt{n}}\, |t|^3 \left( \frac{R}{\sigma} \right)^3 \exp\left( |t|\, \frac{R}{\sigma \sqrt{n}} \right) \to 0, \quad n \to \infty.$$
Therefore
$$E\left[ \exp\left( t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right) \right] = 1 + \frac{c_n(t)}{n}, \quad c_n(t) = \frac{t^2}{2} + E[r(n,t)] \to \frac{t^2}{2}, \quad n \to \infty.$$
Hence by Lemma 3.5, as $n \to \infty$, we see
$$M_{Z_n}(t) = \left( E\left[ \exp\left( t\, \frac{X_{n,1} - m_n}{\sqrt{\sigma_n^2\, n}} \right) \right] \right)^n = \left( 1 + \frac{c_n(t)}{n} \right)^n \to \exp\left( \frac{t^2}{2} \right).$$


3.5 Mathematical statistics

3.5.1 Inference

Let us look at applications of limit theorems in mathematical statistics.
We begin with statistical inference. It provides guidelines to construct
stochastic models—probability spaces, random variables, etc.—from given
statistical data. Consider the following exercise. 12

Exercise II 1,000 thumbtacks were thrown on a flat floor, and
400 of them landed point up. From this, estimate the probability
that a thumbtack lands point up.

We consider Exercise II using the mathematical model of unfair coin
tosses, i.e., the probability space $(\{0,1\}^n, P(\{0,1\}^n), P_n^{(p)})$ introduced in
Example 3.2. The situation of Exercise II is interpreted as follows: $n = 1000$,
$0 < p < 1$ is the probability that we want to know, and the coordinate
function $\sigma_i : \{0,1\}^{1000} \to \{0,1\}$ is the state (point up ($= 1$) or point down ($= 0$)) of the $i$-th thumbtack for each $i$. Then, the random variable
$$S := \sigma_1 + \cdots + \sigma_{1000}$$
is the number of thumbtacks that land point up. The experiment is interpreted as a choice of an $\omega$ from $\{0,1\}^{1000}$, and its outcome is interpreted
as
$$S(\omega) = \sigma_1(\omega) + \cdots + \sigma_{1000}(\omega) = 400.$$
Since $E[S] = 1000p$ and $V[S] = 1000\, p(1-p)$, Chebyshev's inequality
implies
$$P_{1000}^{(p)}\left( \left| \frac{S}{1000} - p \right| \geqq \varepsilon \right) = P_{1000}^{(p)}(\, |S - 1000p| \geqq 1000\varepsilon \,) \leqq \frac{V[S]}{(1000\varepsilon)^2} = \frac{p(1-p)}{1000\, \varepsilon^2} \leqq \frac{1}{4000\, \varepsilon^2}.$$
Here, putting $\varepsilon := \sqrt{10}/20 = 0.158\ldots$, we obtain
$$P_{1000}^{(p)}\left( \left| \frac{S}{1000} - p \right| \geqq 0.158 \right) \leqq \frac{1}{100}.$$
Therefore
$$P_{1000}^{(p)}\left( \left| \frac{S}{1000} - p \right| < 0.158 \right) \geqq \frac{99}{100}.$$
12 Here we deal with interval estimation of population proportion.



Solving for $p$ inside $P_{1000}^{(p)}(\cdot)$ on the left-hand side, we see that the probability that
$$\frac{S}{1000} - 0.158 < p < \frac{S}{1000} + 0.158$$
is not less than 0.99. We now consider that $S(\omega) = 400$ is an outcome of
the occurrence of the above event. Then, we obtain
$$0.242 < p < 0.558.$$
This estimate is not so good because Chebyshev's inequality is loose.

The central limit theorem (Example 3.11) gives a much more precise estimate:
$$P_{1000}^{(p)}\left( \left| \frac{S - 1000p}{\sqrt{1000\, p(1-p)}} \right| \geqq z \right) \approx 2 \int_z^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx, \quad z > 0.$$
Now, for $0 < \alpha < 1/2$, let $z(\alpha)$ denote the positive real number $z$ such that
$$2 \int_z^\infty \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \alpha.$$
$z(\alpha)$ is called the $100\alpha\%$ point of the standard normal distribution
(Fig. 3.8).
Fig. 3.8 The 100α% point z(α) of the standard normal distribution

Now, we have
$$P_{1000}^{(p)}\left( \left| \frac{S - 1000p}{\sqrt{1000\, p(1-p)}} \right| < z(\alpha) \right) \approx 1 - \alpha,$$

i.e., the probability that
$$\frac{S}{1000} - z(\alpha) \sqrt{\frac{p(1-p)}{1000}} < p < \frac{S}{1000} + z(\alpha) \sqrt{\frac{p(1-p)}{1000}}$$
is approximately $1 - \alpha$. Let $0 < \alpha \ll 1$. As before, we consider that
$S(\omega) = 400$ is an outcome of the occurrence of the above event. Namely,
we judge that
$$0.4 - z(\alpha) \sqrt{\frac{p(1-p)}{1000}} < p < 0.4 + z(\alpha) \sqrt{\frac{p(1-p)}{1000}}.$$
We have to solve the above inequality in $p$, but here, we approximate it by
assuming $p \approx 0.4$ due to the law of large numbers:
$$0.4 - z(\alpha) \sqrt{\frac{0.4\,(1 - 0.4)}{1000}} < p < 0.4 + z(\alpha) \sqrt{\frac{0.4\,(1 - 0.4)}{1000}}.$$
For example, if $\alpha = 0.01$, then $z(0.01) = 2.58$, and hence the above inequality becomes
$$0.36 < p < 0.44.$$
We express this result as "The 99% confidence interval of $p$ is [0.36, 0.44]."
The greater the confidence level $1 - \alpha$ is, the wider the confidence interval
becomes.
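The interval can be computed directly (a minimal sketch, assuming Python; here the point z(α) is found by bisection on the normal CDF instead of a statistical table):

from math import erf, sqrt

def z_alpha(alpha):
    """100*alpha % point: solve 2*(1 - Phi(z)) = alpha by bisection."""
    phi = lambda z: (1 + erf(z / sqrt(2))) / 2
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * (1 - phi(mid)) > alpha:
            lo = mid            # tail still too large: z must grow
        else:
            hi = mid
    return (lo + hi) / 2

n, p_hat, alpha = 1000, 0.4, 0.01
half = z_alpha(alpha) * sqrt(p_hat * (1 - p_hat) / n)
print(z_alpha(alpha))                # 2.5758...
print(p_hat - half, p_hat + half)    # roughly 0.36 and 0.44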

Let us present Exercise II in another form.

Exercise II′ In some region, to make an audience rating survey
of a certain TV program, 1000 people were chosen at random,
and 400 of them said they watched the program. Estimate the
audience rating.

In the same way as above, we conclude that the 99% confidence interval of
the audience rating is [0.36, 0.44].

3.5.2 Test

Statistical testing provides guidelines to judge whether or not given
stochastic models—probability spaces, random variables, etc.—contradict
observed statistical data. Consider the following exercise. 13

13 Here we deal with a test of population proportion.



Exercise III A coin was tossed 200 times, and it came up Heads
115 times. Is it a fair coin?

First, we state a hypothesis H.

H : The coin is fair.

Under the hypothesis H, we consider the probability space
$$(\{0,1\}^{200}, P(\{0,1\}^{200}), P_{200})$$
and the coordinate functions $\{\sigma_i\}_{i=1}^{200}$. The number of Heads in 200 coin
tosses $\omega \in \{0,1\}^{200}$ is $S(\omega) := \sigma_1(\omega) + \cdots + \sigma_{200}(\omega)$. Let us then calculate
the probability
$$P_{200}(\, |S - 100| \geqq 15 \,).$$
After making a continuity correction, we apply de Moivre–Laplace's theorem:
$$P_{200}(\, |S - 100| \geqq 15 \,) = P_{200}\left( \frac{|S - 100|}{\frac{1}{2}\sqrt{200}} \geqq \frac{14.5}{\frac{1}{2}\sqrt{200}} \right)$$
$$\approx 2 \int_{14.5/(\frac{1}{2}\sqrt{200})}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 2 \int_{2.05061}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 0.040305.$$

Therefore, under the hypothesis H, the event of Exercise III has probability
0.040305. Since it is a rare event, the hypothesis H may not be true.

To discuss things without ambiguity, we use the following terms. Fix
$0 < \alpha \ll 1$. If an event of probability less than $\alpha$ under the hypothesis H
occurs, we say that H is rejected at the significance level $\alpha$. If it does not
occur, we say that H is accepted at the significance level $\alpha$. The smaller
the significance level $\alpha$ is, the more difficult it is to reject H. With these terms,
possible answers to Exercise III are "the hypothesis H is rejected at the
significance level 5%" and "the hypothesis H is accepted at the significance
level 1%".
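The probability 0.040305 plays the role of a two-sided p-value, and reproducing it takes only a few lines (a sketch, assuming Python):

from math import erf, sqrt

# Two-sided tail probability P_200( |S - 100| >= 15 ) with continuity correction.
z = 14.5 / (0.5 * sqrt(200))
phi = (1 + erf(z / sqrt(2))) / 2     # standard normal CDF at z
p_value = 2 * (1 - phi)

print(z, p_value)   # 2.05061... and 0.040305...
# p_value < 0.05: reject H at the 5% level; p_value > 0.01: accept H at the 1% level.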

Let us present Exercise III in another form.



Exercise III′ In a certain region, 200 newborn babies were chosen
at random, and 115 of them were boys. May we say that the ratio
of boys and girls among newborn babies in this region is 1:1?

In the same way as above, one of the possible answers to Exercise III′ is
that the hypothesis "the ratio of boys and girls among newborn babies in this
region is 1:1" is rejected at the significance level 5%.
Chapter 4

Monte Carlo method

The history of the Monte Carlo method started when Ulam, von Neumann
and others applied it to the simulation of nuclear fissions 1 on a newly
invented computer in the 1940s, i.e., in the midst of World War II. Since then,
along with the development of computers, the Monte Carlo method has
been used in various fields of science and technology, and has produced
remarkable results. The development of the Monte Carlo method will surely
continue.

In this chapter, however, we do not describe such brilliant applications
of the method; rather, we discuss its theoretical foundation. More concretely,
through a solving process of Exercise I in Sec. 1.4, we study the basic theory
and implementation of sampling of random variables by computer.

4.1 Monte Carlo method as gambling

It is of course desirable to solve mathematical problems by sure methods,
but some problems, such as extremely complicated ones or those that lack a
lot of information, can only be solved by stochastic methods. The problems treated
by the Monte Carlo method are such problems.

4.1.1 Purpose

The Monte Carlo method is a kind of gambling, as its name indicates. The
purpose of the player, Alice, is to get a generic value—a typical value, not
an exceptional value—of a given random variable by sampling. Of course,
the mathematical problem in question is assumed to be solved by a generic
value of the random variable. The sampling is done by her will, but she has
1 In order to make the atomic bombs.


a risk of getting an exceptional value, and this risk should be measured in terms
of probability.

Fig. 4.1 The distribution and a generic value of a random variable (conceptual figure)

The following is a very small example of the Monte Carlo method (without a computer).

Example 4.1. An urn contains 100 balls, 99 of which are numbered $r$ and
one of which is numbered $r+1$. Alice draws a ball from the urn, and guesses
the number $r$ to be the number of her ball. The probability that she fails
to guess the number correctly is 1/100.

If we state Example 4.1 in terms of gambling, it should be: Alice wins if
she draws a generic ball, i.e., a ball numbered $r$, and loses otherwise. The
probability that she loses is 1/100.

In general, the player cannot know whether the aim has been attained
or not even after the sampling. Indeed, in the above example, although the
risk is measured, Alice cannot tell if her guess is correct even after drawing
a ball. 2

2 Many choices in our lives are certainly gambles. It is often the case that we do not
know whether they were correct or not . . . .



4.1.2 Exercise I, revisited

To implement Example 4.1, we need only an urn and 100 numbered balls
and nothing else, but actual Monte Carlo methods are implemented on such
large scales that we need computers. Let us revisit Exercise I in Sec. 1.4.

Exercise I When we toss a coin 100 times, what is the probability
$p$ that it comes up Heads at least 6 times in succession?

We apply the interval estimation of mathematical statistics. Repeat independent trials of 100 coin tosses $N$ times, and let $S_N$ be the number of
occurrences of "the coin comes up Heads at least 6 times in succession"
among the trials. Then, by the law of large numbers, the sample mean
$S_N/N$ is a good estimator for $p$ when $N$ is large. More concretely, we put
$N := 10^6$.
Example 4.2. On the probability space $(\{0,1\}^{10^8}, P(\{0,1\}^{10^8}), P_{10^8})$, we
construct the random variable $S_{10^6}$ in the following way.

First, we define a function $X : \{0,1\}^{100} \to \{0,1\}$ by
$$X(\omega_1, \ldots, \omega_{100}) := \max_{1 \leqq r \leqq 100 - 5} \prod_{i=r}^{r+5} \omega_i, \quad (\omega_1, \ldots, \omega_{100}) \in \{0,1\}^{100}.$$
This means that $X = 1$ if there are 6 successive 1's in $(\omega_1, \ldots, \omega_{100})$ and
$X = 0$ otherwise. Next, we define $X_k : \{0,1\}^{10^8} \to \{0,1\}$, $k = 1, 2, \ldots, 10^6$,
for $\omega = (\omega_1, \ldots, \omega_{10^8}) \in \{0,1\}^{10^8}$ by
$$X_1(\omega) := X(\omega_1, \ldots, \omega_{100}),$$
$$X_2(\omega) := X(\omega_{101}, \ldots, \omega_{200}),$$
$$\vdots$$
i.e., for each $k = 1, \ldots, 10^6$,
$$X_k(\omega) := X(\omega_{100(k-1)+1}, \ldots, \omega_{100k}).$$
$\{X_k\}_{k=1}^{10^6}$ are i.i.d. under $P_{10^8}$, and
$$P_{10^8}(X_k = 1) = p, \quad P_{10^8}(X_k = 0) = 1 - p,$$
i.e., $\{X_k\}_{k=1}^{10^6}$ are nothing but unfair coin tosses. Hence, as in Example 3.4,
we have
$$E[X_k] = p, \quad V[X_k] = p(1-p).$$

Finally, we define $S_{10^6} : \{0,1\}^{10^8} \to \mathbb{R}$ by
$$S_{10^6}(\omega) := \sum_{k=1}^{10^6} X_k(\omega), \quad \omega \in \{0,1\}^{10^8}.$$
Then, the mean and the variance of $S_{10^6}/10^6$ are, by (3.12) and (3.13),
$$E\left[ \frac{S_{10^6}}{10^6} \right] = p, \quad V\left[ \frac{S_{10^6}}{10^6} \right] = \frac{p(1-p)}{10^6} \leqq \frac{1}{4 \times 10^6}.$$
Let $U_0$ be a set of $\omega$ that give exceptional values to $S_{10^6}$:
$$U_0 := \left\{ \omega \in \{0,1\}^{10^8} \;\middle|\; \left| \frac{S_{10^6}(\omega)}{10^6} - p \right| \geqq \frac{1}{200} \right\}. \qquad (4.1)$$
Now, Chebyshev's inequality shows
$$P_{10^8}(\, \omega \in U_0 \,) \leqq \frac{200^2}{4 \times 10^6} = \frac{1}{100}. \qquad (4.2)$$
Namely, a generic value of $S_{10^6}/10^6$ will be an approximate value of $p$.
Namely, a generic value of S106 /106 will be an approximate value of p.

Example 4.2 is regarded as gambling in the following way. Alice chooses
an $\omega \in \{0,1\}^{10^8}$. If $\omega \notin U_0$, she wins, and if $\omega \in U_0$, she loses. The
probability that she loses is at most 1/100.

The risk estimate (4.2) assumes that Alice is equally likely to choose any
$\omega \in \{0,1\}^{10^8}$. This assumption, however, can never be satisfied because
she cannot choose $\omega$ from random numbers, which account for nearly all
sequences. This is the most essential problem of sampling in the large-scale
Monte Carlo method.

4.2 Pseudorandom generator

4.2.1 Definition

To play the gamble of Example 4.2, Alice has, anyhow, to choose an $\omega \in
\{0,1\}^{10^8}$. Let us suppose that she uses the most used device to do it, i.e.,
a pseudorandom generator.

Definition 4.1. A function $g : \{0,1\}^l \to \{0,1\}^n$ is called a pseudorandom
generator if $l < n$. The input $\omega' \in \{0,1\}^l$ of $g$ is called a seed, and the
output $g(\omega') \in \{0,1\}^n$ a pseudorandom number.

To produce a pseudorandom number, we need to choose a seed $\omega' \in
\{0,1\}^l$ of $g$; this procedure is called initialization (or randomization). For
practical use, $l$ should be so small that Alice can input any seed $\omega' \in \{0,1\}^l$

directly from a keyboard, and the program of the function $g$ should work
sufficiently fast.

Example 4.3. In Example 4.2, suppose that Alice uses a pseudorandom
generator $g : \{0,1\}^{238} \to \{0,1\}^{10^8}$. 3 She chooses a seed $\omega' \in \{0,1\}^{238}$
of $g$ and inputs it from a keyboard to a computer. Since $\omega'$ is only a 238
bit datum ($\approx$ 30 letters of the alphabet), it is easy to input from the keyboard.
Then, the computer produces $S_{10^6}(g(\omega'))$.

The reason why Alice uses a pseudorandom generator is that the
$\omega \in \{0,1\}^{10^8}$ she has to choose is too long. If it were short enough, no pseudorandom generator would be necessary. For example, when drawing a ball from
the urn in Example 4.1, who on earth would use a pseudorandom generator?

4.2.2 Security

Let us continue to consider the case of Example 4.3. Alice can choose any
seed $\omega' \in \{0,1\}^{238}$ of $g$ freely, of her own will. Her risk is now estimated by
$$P_{238}\left( \left| \frac{S_{10^6}(g(\omega'))}{10^6} - p \right| \geqq \frac{1}{200} \right). \qquad (4.3)$$
Of course, the probability (4.3) depends on $g$. If this probability—the
probability that her sample $S_{10^6}(g(\omega'))$ is an exceptional value of $S_{10^6}$—is
large, then it is difficult for her to win the game, which is not desirable.

Now, we give the following (somewhat vague) definition: we say a pseudorandom generator $g : \{0,1\}^l \to \{0,1\}^n$, $l < n$, is secure against a set
$U \subset \{0,1\}^n$ if it holds that
$$P_n(\, \omega \in U \,) \approx P_l(\, g(\omega') \in U \,).$$
In Example 4.3, if $g : \{0,1\}^{238} \to \{0,1\}^{10^8}$ is secure against $U_0$ in (4.1),
for the majority of the seeds $\omega' \in \{0,1\}^{238}$ that Alice can choose of her
own will, the samples $S_{10^6}(g(\omega'))$ will be generic values of $S_{10^6}$. In this
case, no random number is needed. In other words, in sampling a value
of $S_{10^6}$, using $g$ does not make Alice's risk high, and hence $g$ is said to be
secure. The problem of sampling in a Monte Carlo method will be solved
by finding a suitable secure pseudorandom generator.

In general, a pseudorandom generator $g : \{0,1\}^l \to \{0,1\}^n$ is desired
to be secure against as many subsets of $\{0,1\}^n$ as possible, but there is

3 The origin of the number 238 will soon be clear in Example 4.4.

no pseudorandom generator that is secure against all subsets of $\{0,1\}^n$.
Indeed, for any $g : \{0,1\}^l \to \{0,1\}^n$, let
$$U_g := \{\, g(\omega') \mid \omega' \in \{0,1\}^l \,\} \subset \{0,1\}^n.$$
Then, we have $P_n(\, \omega \in U_g \,) \leqq 2^{l-n}$ but $P_l(\, g(\omega') \in U_g \,) = 1$, which means
that $g$ is not secure against $U_g$.

4.3 Monte Carlo integration

For a while, leaving Exercise I aside, let us consider a general problem.
Let $X$ be a function of $m$ coin tosses, i.e., $X : \{0,1\}^m \to \mathbb{R}$, and consider
calculating the mean
$$E[X] = \frac{1}{2^m} \sum_{\omega \in \{0,1\}^m} X(\omega)$$
of $X$ numerically. When $m$ is small, we can directly calculate the finite
sum of the right-hand side, but when $m$ is large, e.g., $m = 100$, the direct
calculation becomes impossible in practice because of the huge amount of
computation. In such a case, we estimate the mean of $X$ by applying the
law of large numbers, which is called Monte Carlo integration (Example 4.2). Most scientific Monte Carlo methods aim at calculating some
characteristics of distributions of random variables, e.g., means, variances,
etc., which are actually Monte Carlo integrations.

4.3.1 Mean and integral

In general, a mean can be considered to be an integral. Indeed, for a function $X$ of
$m$ coin tosses, we define
$$\bar{X}(x) := X(d_1(x), \ldots, d_m(x)), \quad x \in [0,1), \qquad (4.4)$$
where $\{d_i(x)\}_{i=1}^m$ is Borel's model of $m$ coin tosses (Example 1.2). Then,
we have
$$E[X] = \int_0^1 \bar{X}(x)\, dx. \qquad (4.5)$$
The Monte Carlo integration is named after this fact.

Let us show (4.5). Note that $\bar{X} : [0,1) \to \mathbb{R}$ is a step function (Fig. 4.2):
$$\bar{X}(x) = \sum_{i=0}^{2^m - 1} X(d_1(2^{-m} i), \ldots, d_m(2^{-m} i))\, \mathbf{1}_{[2^{-m} i,\, 2^{-m}(i+1))}(x), \quad x \in [0,1).$$

Fig. 4.2 The graph of $\bar{X}(x)$ (example)

Here $\mathbf{1}_{[2^{-m} i,\, 2^{-m}(i+1))}(x)$ is the indicator function of the interval
$[2^{-m} i,\, 2^{-m}(i+1))$.
Then, we have
$$\int_0^1 \bar{X}(x)\, dx = \sum_{i=0}^{2^m - 1} X(d_1(2^{-m} i), \ldots, d_m(2^{-m} i)) \int_0^1 \mathbf{1}_{[2^{-m} i,\, 2^{-m}(i+1))}(x)\, dx$$
$$= \sum_{i=0}^{2^m - 1} X(d_1(2^{-m} i), \ldots, d_m(2^{-m} i)) \cdot 2^{-m} = \frac{1}{2^m} \sum_{\omega \in \{0,1\}^m} X(\omega) = E[X].$$

Lemma 4.1. Let $X : \{0,1\}^m \to \mathbb{R}$ and let $\bar{X} : [0,1) \to \mathbb{R}$ be the corresponding function defined by (4.4). Then, for any $j \in \mathbb{N}^+$, we have
$$\int_0^1 \bar{X}(x)\, dx = \frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j} - 1} \bar{X}\left( \frac{q}{2^{m+j}} \right).$$

Proof. It is enough to prove the lemma for $\bar{X}(x) = \mathbf{1}_{[2^{-m} i,\, 2^{-m}(i+1))}(x)$:
$$\frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j} - 1} \mathbf{1}_{[2^{-m} i,\, 2^{-m}(i+1))}\left( \frac{q}{2^{m+j}} \right) = \frac{1}{2^{m+j}} \sum_{q = 2^j i}^{2^j (i+1) - 1} 1 = \frac{1}{2^m}. \quad \square$$

4.3.2 Estimation of mean

We state Example 4.2 in a more general setup. Let $S_N$ be the sum of i.i.d.
random variables $\{X_k\}_{k=1}^N$ whose common distribution is identical to that

of $X : \{0,1\}^m \to \mathbb{R}$. Then, $S_N$ is a function of $Nm$ coin tosses; more
concretely,
$$X_k(\omega) := X(\omega_k), \quad \omega_k \in \{0,1\}^m, \quad \omega = (\omega_1, \ldots, \omega_N) \in \{0,1\}^{Nm},$$
$$S_N(\omega) := \sum_{k=1}^N X_k(\omega). \qquad (4.6)$$
By (3.12) and (3.13), the means of $S_N/N$ and $X$ under $P_{Nm}$ and $P_m$,
respectively, are equal ($E[S_N/N] = E[X]$), and as for the variance, we have
$V[S_N/N] = V[X]/N$.

For $N \gg 1$, a generic value of $S_N/N$ will be an approximate value of
$E[X]$ by the law of large numbers. Let us estimate the risk by Chebyshev's
inequality:
$$P_{Nm}\left( \left| \frac{S_N(\omega)}{N} - E[X] \right| \geqq \varepsilon \right) \leqq \frac{V[X]}{N \varepsilon^2}. \qquad (4.7)$$
This means that putting
$$U_1 := \left\{ \omega \in \{0,1\}^{Nm} \;\middle|\; \left| \frac{S_N(\omega)}{N} - E[X] \right| \geqq \varepsilon \right\}, \qquad (4.8)$$
we are considering a gamble: if Alice's choice $\omega \in \{0,1\}^{Nm}$ is not in $U_1$,
she wins; if $\omega \in U_1$, she loses.

In general, as $E[X]$ is unknown, $V[X]$ is also unknown. In this sense,
the risk estimate (4.7) is not complete. However, if we can find a constant
$M > 0$ such that $V[X] \leqq M$, the risk estimate then becomes complete
(Example 4.2).

Of course, Monte Carlo integration can be applied not only to random variables but also to functions that have no relation to probability. If a
given integrand is not as complicated as the $X$ in Example 4.2, then deterministic sampling methods, such as the quasi-Monte Carlo method, may be
applicable. See [Bouleau and Lepingle (1994)] for details.

4.3.3 Random Weyl sampling

In Monte Carlo integration, the random variable $S_N$ in (4.6) has a
special form. Using this fact, we can construct a pseudorandom generator
that is secure against $U_1$ in (4.8).

First, we introduce some notations. For each $m \in \mathbb{N}^+$, we define
$$D_m := \{\, 2^{-m} i \mid i = 0, \ldots, 2^m - 1 \,\} \subset [0,1).$$

Let $P^{(m)}$ denote the uniform probability measure on $D_m$. For each $x \geqq 0$,
we define
$$\lfloor x \rfloor_m := \lfloor 2^m (x - \lfloor x \rfloor) \rfloor\, 2^{-m} \in D_m.$$
Namely, $\lfloor x \rfloor_m$ denotes the truncation of $x$ to $m$ decimal places in its binary
expansion. By the one-to-one correspondence
$$D_m \ni 2^{-m} i \longleftrightarrow (d_1(2^{-m} i), \ldots, d_m(2^{-m} i)) \in \{0,1\}^m,$$
we identify $D_m$ with $\{0,1\}^m$, and write it as $D_m \cong \{0,1\}^m$.

Definition 4.2. Let $j \in \mathbb{N}^+$. We define
$$Z_k(\omega') := \lfloor x + k\alpha \rfloor_m \in D_m, \quad \omega' = (x, \alpha) \in D_{m+j} \times D_{m+j} \cong \{0,1\}^{2m+2j}, \quad k = 1, 2, 3, \ldots, 2^{j+1}.$$
Then, for $N \leqq 2^{j+1}$, we define a pseudorandom generator
$$g : \{0,1\}^{2m+2j} \to \{0,1\}^{Nm} \qquad (4.9)$$
by
$$g(\omega') := (Z_1(\omega'), \ldots, Z_N(\omega')) \in D_m^N \cong \{0,1\}^{Nm}.$$
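In integer arithmetic, a point of $D_{m+j}$ is just an $(m+j)$-bit integer, and $\lfloor x + k\alpha \rfloor_m$ amounts to an addition modulo $2^{m+j}$ followed by dropping the lowest $j$ bits. A minimal sketch of Definition 4.2 along these lines (assuming Python; the function name is ours):

def rws(x, alpha, m, j, N):
    """Random Weyl sampling: the seed (x, alpha) is a pair of (m+j)-bit
    integers representing points of D_{m+j}; returns N points of D_m
    as m-bit integers. Z_k = floor(x + k*alpha)_m, i.e., add mod 2^(m+j)
    and keep the top m bits."""
    assert N <= 2 ** (j + 1)
    mod = 1 << (m + j)
    return [((x + k * alpha) % mod) >> j for k in range(1, N + 1)]

# Tiny example: m = 4, j = 2, so the seed is 2*(4+2) = 12 bits long.
print(rws(x=0b101101, alpha=0b011011, m=4, j=2, N=8))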

Lemma 4.2. Under the product (uniform) probability measure $P^{(m+j)} \otimes P^{(m+j)}$ on $D_{m+j} \times D_{m+j}$, the random variables $\{Z_k\}_{k=1}^{2^{j+1}}$ are pairwise independent (Remark 3.4), and each $Z_k$ is uniformly distributed in $D_m$.

Proof. Take arbitrary $1 \leqq k < k' \leqq 2^{j+1}$ and arbitrary $t, t' \in D_m$. Then,
it is enough to show that
$$P^{(m+j)} \otimes P^{(m+j)}(\, Z_k = t,\ Z_{k'} = t' \,) = \frac{1}{2^{2m}}. \qquad (4.10)$$
Indeed, (4.10) implies
$$P^{(m+j)} \otimes P^{(m+j)}(\, Z_k = t \,) = \sum_{t' \in D_m} P^{(m+j)} \otimes P^{(m+j)}(\, Z_k = t,\ Z_{k'} = t' \,) = 2^m \cdot \frac{1}{2^{2m}} = \frac{1}{2^m},$$
and hence each $Z_k$ is uniformly distributed in $D_m$. Then, since
$$P^{(m+j)} \otimes P^{(m+j)}(\, Z_k = t,\ Z_{k'} = t' \,) = P^{(m+j)} \otimes P^{(m+j)}(\, Z_k = t \,) \cdot P^{(m+j)} \otimes P^{(m+j)}(\, Z_{k'} = t' \,),$$
$Z_k$ and $Z_{k'}$ are independent.

We rewrite (4.10) in an integral form: defining two periodic functions
$F, G : [0, \infty) \to \{0,1\}$ with period 1 by
$$F(x) := \mathbf{1}_{[t',\, t' + 2^{-m})}(x - \lfloor x \rfloor), \quad G(x) := \mathbf{1}_{[t,\, t + 2^{-m})}(x - \lfloor x \rfloor),$$
we will show the following equality, which is equivalent to (4.10):
$$E[F(Z_{k'})\, G(Z_k)] = \int_0^1 F(u)\, du \int_0^1 G(v)\, dv. \qquad (4.11)$$
Here $E$ stands for the mean under $P^{(m+j)} \otimes P^{(m+j)}$.

Since $Z_{k'}$ and $Z_k$ are random variables on $(D_{m+j} \times D_{m+j}, P(D_{m+j} \times D_{m+j}), P^{(m+j)} \otimes P^{(m+j)})$, the mean of $F(Z_{k'})\, G(Z_k)$ is calculated as follows:
$$E[F(Z_{k'})\, G(Z_k)] = \frac{1}{2^{m+j}} \sum_{\alpha \in D_{m+j}} \frac{1}{2^{m+j}} \sum_{x \in D_{m+j}} F(\lfloor x + k'\alpha \rfloor_m)\, G(\lfloor x + k\alpha \rfloor_m)$$
$$= \frac{1}{2^{m+j}} \sum_{\alpha \in D_{m+j}} \frac{1}{2^{m+j}} \sum_{x \in D_{m+j}} F(x + k'\alpha)\, G(x + k\alpha)$$
$$= \frac{1}{2^{2m+2j}} \sum_{q=0}^{2^{m+j}-1} \sum_{p=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{k' q}{2^{m+j}} \right) G\left( \frac{p}{2^{m+j}} + \frac{k q}{2^{m+j}} \right)$$
$$= \frac{1}{2^{2m+2j}} \sum_{q=0}^{2^{m+j}-1} \sum_{p=kq}^{2^{m+j}+kq-1} F\left( \frac{p}{2^{m+j}} + \frac{(k'-k)\, q}{2^{m+j}} \right) G\left( \frac{p}{2^{m+j}} \right)$$
$$= \frac{1}{2^{2m+2j}} \sum_{q=0}^{2^{m+j}-1} \sum_{p=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{(k'-k)\, q}{2^{m+j}} \right) G\left( \frac{p}{2^{m+j}} \right). \qquad (4.12)$$
(In the second equality we used the fact that truncation $\lfloor \cdot \rfloor_m$ does not change
the values of $F$ and $G$, since $t, t' \in D_m$; in the last two, we shifted $p$ by $kq$ and
used the periodicity of the summand in $p$.)

We here assume that $0 < k' - k = 2^i s \leqq 2^{j+1} - 1$, where $0 \leqq i \leqq j$ and $s$ is
an odd integer. Then, we see
$$\frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{(k'-k)\, q}{2^{m+j}} \right) = \frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{s q}{2^{m+j-i}} \right). \qquad (4.13)$$
Now, we need an algebraic argument. For $q, q' \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}$, if we have
$$sq \bmod 2^{m+j-i} = sq' \bmod 2^{m+j-i}, \,^4$$

4 $a \bmod m$ stands for the remainder on division of $a$ by $m$.



then
$$s\,(q - q') \bmod 2^{m+j-i} = 0,$$
i.e., $s\,(q - q')$ is divisible by $2^{m+j-i}$. Since $s$ is odd, $q - q'$ is divisible by
$2^{m+j-i}$; but $q, q' \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}$, so it means $q = q'$. Therefore the
correspondence
$$\{0, 1, 2, \ldots, 2^{m+j-i} - 1\} \ni q \mapsto sq \bmod 2^{m+j-i} \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}$$
is one-to-one. Let $q_r \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}$ be the unique solution to
$$s q_r \bmod 2^{m+j-i} = r, \quad r \in \{0, 1, 2, \ldots, 2^{m+j-i} - 1\}.$$
Then, for each $r$, we have
$$\#\{\, 0 \leqq q \leqq 2^{m+j} - 1 \mid sq \bmod 2^{m+j-i} = r \,\} = \#\{\, 0 \leqq q \leqq 2^{m+j} - 1 \mid sq \bmod 2^{m+j-i} = s q_r \bmod 2^{m+j-i} \,\}$$
$$= \#\{\, 0 \leqq q \leqq 2^{m+j} - 1 \mid q - q_r \text{ is divisible by } 2^{m+j-i} \,\} = 2^i.$$
From this, (4.13) is calculated as
$$\frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{sq}{2^{m+j-i}} \right) = \frac{2^i}{2^{m+j}} \sum_{r=0}^{2^{m+j-i}-1} F\left( \frac{p}{2^{m+j}} + \frac{r}{2^{m+j-i}} \right)$$
$$= \frac{1}{2^{m+j-i}} \sum_{r=0}^{2^{m+j-i}-1} F\left( \frac{r}{2^{m+j-i}} \right) = \int_0^1 F(u)\, du, \qquad (4.14)$$
where we used the fact that the shift $p/2^{m+j}$ does not change the sum, because $F$
is the indicator of a half-open interval of length $2^{-m}$, which always contains exactly
$2^{j-i}$ points of a shifted grid of spacing $2^{-(m+j-i)}$. The last $=$ is due to Lemma 4.1.
From (4.12), (4.13), and (4.14), it follows that
$$E[F(Z_{k'})\, G(Z_k)] = \frac{1}{2^{m+j}} \sum_{p=0}^{2^{m+j}-1} \left( \frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{(k'-k)\, q}{2^{m+j}} \right) \right) G\left( \frac{p}{2^{m+j}} \right)$$
$$= \frac{1}{2^{m+j}} \sum_{p=0}^{2^{m+j}-1} \left( \frac{1}{2^{m+j}} \sum_{q=0}^{2^{m+j}-1} F\left( \frac{p}{2^{m+j}} + \frac{sq}{2^{m+j-i}} \right) \right) G\left( \frac{p}{2^{m+j}} \right)$$
$$= \int_0^1 F(u)\, du \cdot \frac{1}{2^{m+j}} \sum_{p=0}^{2^{m+j}-1} G\left( \frac{p}{2^{m+j}} \right) = \int_0^1 F(u)\, du \int_0^1 G(v)\, dv.$$
This completes the proof of (4.11). $\square$

Theorem 4.1. The pseudorandom generator $g : \{0,1\}^{2m+2j} \to \{0,1\}^{Nm}$
in (4.9) satisfies, for $S_N$ in (4.6),
$$E[S_N(g(\omega'))] = E[S_N(\omega)] \ (= N\, E[X]),$$
$$V[S_N(g(\omega'))] = V[S_N(\omega)] \ (= N\, V[X]).$$
Here $\omega'$ and $\omega$ are assumed to be uniformly distributed in $\{0,1\}^{2m+2j}$ and
in $\{0,1\}^{Nm}$, respectively. From this, just like (4.7), Chebyshev's inequality
$$P_{2m+2j}(\, g(\omega') \in U_1 \,) = P_{2m+2j}\left( \left| \frac{S_N(g(\omega'))}{N} - E[X] \right| \geqq \varepsilon \right) \leqq \frac{V[X]}{N \varepsilon^2}$$
follows. In this sense, $g$ is secure against $U_1$ in (4.8).

The Monte Carlo integration method using the secure pseudorandom
generator (4.9) is called the random Weyl sampling. 5

Proof of Theorem 4.1. First, since each $Z_k(\omega')$ is uniformly distributed
in $\{0,1\}^m$, we see
$$E[S_N(g(\omega'))] = E\left[ \sum_{k=1}^N X(Z_k(\omega')) \right] = N\, E[X].$$
Next, as we mentioned in Remark 3.4, the pairwise independence of
$\{Z_k\}_{k=1}^N$ implies that
$$V[S_N(g(\omega'))] = E\left[ \left( \sum_{k=1}^N \left( X(Z_k(\omega')) - E[X] \right) \right)^2 \right]$$
$$= \sum_{k=1}^N \sum_{k'=1}^N E\left[ \left( X(Z_k(\omega')) - E[X] \right) \left( X(Z_{k'}(\omega')) - E[X] \right) \right]$$
$$= \sum_{k=1}^N E\left[ \left( X(Z_k(\omega')) - E[X] \right)^2 \right] + 2 \sum_{1 \leqq k < k' \leqq N} E\left[ \left( X(Z_k(\omega')) - E[X] \right) \left( X(Z_{k'}(\omega')) - E[X] \right) \right]$$
$$= \sum_{k=1}^N E\left[ \left( X(Z_k(\omega')) - E[X] \right)^2 \right] = N\, V[X].$$
Thus we know that $g$ has the required properties. $\square$

5 When $0 < \alpha < 1$ is an irrational number, the transformation $[0,1) \ni x \mapsto x + \alpha - \lfloor x + \alpha \rfloor \in [0,1)$ is called the Weyl transformation, after which this pseudorandom generator
is named.

Example 4.4. Applying the random Weyl sampling, we can solve Exercise I. Let $S_{10^6}$ be the random variable defined in Example 4.2. We take
$m := 100$ and $N := 10^6$ in Definition 4.2. In order to have $N \leqq 2^{j+1}$, it is
enough to take $j := 19$. Then, we have $2m + 2j = 238$, so that the pseudorandom generator (4.9) is now a function $g : \{0,1\}^{238} \to \{0,1\}^{10^8}$. The
risk (4.3) is estimated, by Theorem 4.1, as
$$P_{238}\left( \left| \frac{S_{10^6}(g(\omega'))}{10^6} - p \right| \geqq \frac{1}{200} \right) \leqq \frac{1}{100}. \qquad (4.15)$$
Thus $g$ is secure against $U_0$ in (4.1). Since Alice can freely choose any seed
$\omega' \in \{0,1\}^{238}$, this risk estimate has a practical meaning and she no longer
needs a long random number.

Here is a concrete example. Instead of her, the author chose the following seed $\omega' = (x, \alpha) \in D_{119} \times D_{119} \cong \{0,1\}^{238}$ written in the binary
numeral system:

x = 0.1110110101 1011101101 0100000011 0110101001 0101000100
    0101111101 1010000000 1010100011 0100011001 1101111101
    1101010011 111100100,

α = 0.1100000111 0111000100 0001101011 1001000001 0010001000
    1010101101 1110101110 0010010011 1000000011 0101000110
    0101110010 010111111.

Then, we obtained $S_{10^6}(g(\omega')) = 546{,}177$ by a computer (see Sec. A.5). In
this case,
$$\frac{S_{10^6}(g(\omega'))}{10^6} = 0.546177 \qquad (4.16)$$
is the estimated value of the probability $p$. This result, with the risk estimate
(4.15), is expressed in statistical terms as "The 99% confidence interval of $p$
is $0.546177 \pm 0.005$", i.e., $0.541 < p < 0.551$. Indeed, the true value of $p$ is
$$692255904222999797557597756576 \cdot 2^{-100} = 0.5460936192\ldots.$$
Thus the error of the estimated value (4.16) is only 0.00008.

In a practical Monte Carlo integration, usually, the sample size $N$ is not
determined in advance, but is determined in the course of doing numerical experiments. To be ready for such a situation, we should take $j$ somewhat large.
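Example 4.4 can be imitated end to end at a reduced scale (a sketch, assuming Python; the helper names are ours, not the book's Sec. A.5 program, and a smaller N keeps the run short):

import random

def has_six_heads_run(bits):
    """X of Example 4.2: 1 if there are 6 successive 1's among the 100 bits."""
    run = 0
    for b in bits:
        run = run + 1 if b else 0
        if run >= 6:
            return 1
    return 0

# Parameters as in Example 4.4, but N = 10**4 instead of 10**6.
m, j, N = 100, 19, 10**4
mod = 1 << (m + j)

# Seed: Sec. 4.4 suggests choosing it by actual coin tossing; we draw it here.
x, alpha = random.getrandbits(m + j), random.getrandbits(m + j)

count = 0
for k in range(1, N + 1):
    z = ((x + k * alpha) % mod) >> j                    # Z_k, a 100-bit integer
    bits = [(z >> (m - 1 - i)) & 1 for i in range(m)]
    count += has_six_heads_run(bits)

print(count / N)    # estimate of p = 0.5460936192...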
Remark 4.1. We can give Alice a little advice in choosing a seed $\omega' = (x, \alpha) \in \{0,1\}^{2m+2j}$ for the random Weyl sampling: she should not choose an extremely simple $\alpha$. Indeed, if she chooses $\alpha = (0, 0, \ldots, 0) \in \{0,1\}^{m+j}$, the
sampling will certainly end in failure.

4.4 From the viewpoint of mathematical statistics

We have formulated the Monte Carlo method as gambling, and we have
assumed that Alice chooses a seed $\omega' \in \{0,1\}^l$ of a pseudorandom generator
$g : \{0,1\}^l \to \{0,1\}^n$ of her own will. From the viewpoint of mathematical
statistics, this is not a good formulation, because sampling should be done
randomly in order to guarantee the objectivity of the result. Indeed, in
the case of the random Weyl sampling, as we mentioned in Remark 4.1,
Alice can choose a bad seed on purpose, i.e., the result may depend on the
player's will.

Of course, it is impossible to discuss the objectivity of sampling rigorously. We here simply assume that Heads or Tails of coin tosses do not
depend on anyone's will (Remark 1.3). Then, for example, when we choose
a seed $\omega' \in \{0,1\}^l$, we toss a coin $l$ times, record 1 if it comes up Heads
and 0 if it comes up Tails at each coin toss, and define $\omega'$ as the recorded
$\{0,1\}$-sequence, which completes an objective sampling. As a matter of
fact, the seed $\omega' \in \{0,1\}^{238}$ in Example 4.4 was chosen in this way.

This method cannot be used to choose a very long $\omega \in \{0,1\}^n$ directly. The
point is that, keeping the risk low, the random Weyl sampling makes the
input $\omega'$ much shorter, so that this method becomes executable.
Appendix A
A.1 Symbols and terms

A.1.1 Set and function

Definition A.1. For two sets $A$ and $B$, the set of all ordered pairs $(x, y)$
of $x \in A$ and $y \in B$ is written as
$$A \times B := \{(x, y) \mid x \in A,\ y \in B\},$$
and it is called the direct product of $A$ and $B$.

Example A.1. For two intervals $A = [a, b] := \{\, x \mid a \leqq x \leqq b \,\}$ and $B =
[c, d]$ of the real line, their direct product $A \times B$ is a rectangle in the coordinate
plane.


For more than two sets, the direct product is similarly defined. For
example,
$$\mathbb{R}^3 = \mathbb{R} \times \mathbb{R} \times \mathbb{R} := \{(x, y, z) \mid x \in \mathbb{R},\ y \in \mathbb{R},\ z \in \mathbb{R}\}$$
is the set of all ordered triplets of real numbers, i.e., the set of all points in
3-dimensional space. $\{0,1\}^3$ in Example 1.1 is nothing but $\{0,1\} \times \{0,1\} \times \{0,1\}$.


Definition A.2. Let $E$ and $F$ be non-empty sets. If to each element of $E$
there corresponds an element of $F$, we call the correspondence a function
(or a mapping), and write it as $f : E \to F$. When $E = F$, we also call it a
transformation. The corresponding element of $F$ to an individual $a \in E$ is
written as $f(a)$, and this individual correspondence is written as $a \mapsto f(a)$.

Let $\Omega \neq \emptyset$ be a set. 6 For $A \in P(\Omega)$, define a function $\mathbf{1}_A : \Omega \to \{0,1\}$
by
$$\mathbf{1}_A(\omega) = \begin{cases} 1 & (\omega \in A), \\ 0 & (\omega \notin A), \end{cases}$$
which is called the indicator function of $A$. Relations and operations for
sets are realized as relations and calculations of indicator functions. For
example, $A \subset B$ is equivalent to $\mathbf{1}_A(\omega) \leqq \mathbf{1}_B(\omega)$, $\omega \in \Omega$. Among
others,
$$\mathbf{1}_{A \cup B}(\omega) = \max\{\mathbf{1}_A(\omega), \mathbf{1}_B(\omega)\},$$
$$\mathbf{1}_{A \cap B}(\omega) = \min\{\mathbf{1}_A(\omega), \mathbf{1}_B(\omega)\} = \mathbf{1}_A(\omega)\, \mathbf{1}_B(\omega),$$
$$\mathbf{1}_{A^c}(\omega) = 1 - \mathbf{1}_A(\omega).$$
In the last expression, $A^c := \{\, \omega \in \Omega \mid \omega \notin A \,\}$ is the complement of $A$.
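These identities are easy to check mechanically (a toy sketch, assuming Python, with sets over a small Ω):

Omega = set(range(10))
A, B = {1, 2, 3, 4}, {3, 4, 5}
ind = lambda S: {w: 1 if w in S else 0 for w in Omega}   # indicator of S

iA, iB = ind(A), ind(B)
assert ind(A | B) == {w: max(iA[w], iB[w]) for w in Omega}
assert ind(A & B) == {w: iA[w] * iB[w] for w in Omega}
assert ind(Omega - A) == {w: 1 - iA[w] for w in Omega}
print("all identities hold")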

A.1.2 Symbols for sum and product

A function $a : \{1, 2, \ldots, n\} \to \mathbb{R}$ can be described as a sequence
$\{a(1), a(2), \ldots, a(n)\}$ of length $n$. This is usually written as $\{a_i\}_{i=1}^n$. Similarly, a function $b : \{1, 2, \ldots, m\} \times \{1, 2, \ldots, n\} \to \mathbb{R}$ is usually written
as
$$\{b_{ij}\}_{i=1,2,\ldots,m,\ j=1,2,\ldots,n}.$$
This is called a double sequence. As we write the sum of a sequence $\{a_i\}_{i=1}^n$
as $\sum_{i=1}^n a_i$, the sum of a double sequence $\{b_{ij}\}_{i=1,2,\ldots,m,\ j=1,2,\ldots,n}$ is written as
$$\sum_{\substack{i=1,2,\ldots,m \\ j=1,2,\ldots,n}} b_{ij}.$$

6 $\emptyset$ denotes the empty set.



This is called a double sum. Obviously,
$$\sum_{\substack{i=1,2,\ldots,m \\ j=1,2,\ldots,n}} b_{ij} = \sum_{i=1}^m \left( \sum_{j=1}^n b_{ij} \right) = \sum_{j=1}^n \left( \sum_{i=1}^m b_{ij} \right).$$
Similarly, triple sequences and triple sums (more generally, multiple sequences
and multiple sums) are defined.

The product $a_1 a_2 \cdots a_n$ of a sequence $\{a_i\}_{i=1}^n$ is written as
$$\prod_{i=1}^n a_i.$$
Similarly, the product of a double sequence $\{b_{ij}\}_{i=1,2,\ldots,m,\ j=1,2,\ldots,n}$ is written
as
$$\prod_{\substack{i=1,2,\ldots,m \\ j=1,2,\ldots,n}} b_{ij}.$$
Obviously,
$$\prod_{\substack{i=1,2,\ldots,m \\ j=1,2,\ldots,n}} b_{ij} = \prod_{i=1}^m \left( \prod_{j=1}^n b_{ij} \right) = \prod_{j=1}^n \left( \prod_{i=1}^m b_{ij} \right).$$

Example A.2. Let us look at a slightly complicated double sequence. Sup-
pose that for each i = 1, . . . , m, a sequence {a_ij}_{j=1}^{n_i} of length n_i is given.
Then, the product

    (a_11 + a_12 + ⋯ + a_{1 n_1})(a_21 + a_22 + ⋯ + a_{2 n_2}) ⋯ (a_m1 + a_m2 + ⋯ + a_{m n_m})

can be written in terms of Σ and ∏ as

    ∏_{i=1}^m Σ_{j=1}^{n_i} a_ij.    (A.1)

Developing it, we obtain

    Σ_{j_1=1}^{n_1} ⋯ Σ_{j_m=1}^{n_m} ∏_{i=1}^m a_{i j_i}.    (A.2)

Conversely, resolving (A.2) into factors, we obtain (A.1).
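
For a concrete feeling of how (A.1) develops into (A.2), here is a toy
verification (not from the book) with m = 2, n_1 = 2, n_2 = 3; the particular
values of a_ij are arbitrary.

#include <stdio.h>

int main(void)
{
    double a1[] = {1.0, 2.0};            /* n1 = 2 */
    double a2[] = {3.0, 4.0, 5.0};       /* n2 = 3 */
    double s1 = 0.0, s2 = 0.0, expanded = 0.0;
    int j1, j2;
    for (j1 = 0; j1 < 2; j1++) s1 += a1[j1];
    for (j2 = 0; j2 < 3; j2++) s2 += a2[j2];
    for (j1 = 0; j1 < 2; j1++)           /* develop the product: (A.2) */
        for (j2 = 0; j2 < 3; j2++)
            expanded += a1[j1] * a2[j2];
    /* both print 36.000000, since (1+2)(3+4+5) = 36 */
    printf("product of sums = %f, expanded = %f\n", s1 * s2, expanded);
    return 0;
}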



The summation symbol Σ is applied not only to sequences but also to
any finite set of numbers. For example, suppose that to each element ω of
a finite set Ω there corresponds a number p_ω ∈ R. Then, the total sum of
such p_ω is written as

    Σ_{ω∈Ω} p_ω.

The sum of p_ω over ω ∈ Ω that satisfies a condition X(ω) = a_i is written
as

    Σ_{ω∈Ω; X(ω)=a_i} p_ω.

The product symbol ∏ is used similarly.

A.1.3 Inequality symbol ≫

a ≫ b, as well as b ≪ a, means that a is much greater than b. Without
making it precise how much greater a is than b, we use the symbol ≫ to
express that a is much greater than b. It may not sound mathematical, but
if you look at how it is used, you will accept it: since lim_{x→∞} x^100/e^x = 0
(Proposition A.3 (i)), we have x^100 ≪ e^x for x ≫ 1.

A.2 Binary numeral system

To describe numbers, we usually use the decimal numeral system (or the
base-10 numeral system). It is a positional numeral system employing 10
as the base and requiring 10 different numerals, the digits 0, 1, 2, 3, 4, 5, 6,
7, 8, 9. It also requires a dot (decimal point) to represent decimal fractions.
The same thing can be done with only two different numerals, the digits 0
and 1. This is called the binary numeral system (or the base-2 numeral system).
It is Leibniz who first systematized it in mathematics.
The binary numeral system is the simplest positional numeral system.
Its digits can be expressed by the ON (= 1) and OFF (= 0) states of
electronic circuits, so it is now the mathematical basis of all digital technologies.

A.2.1 Binary integers

Let D_i^(10)(n) ∈ {0, 1, . . . , 9} denote the i-th digit of n ∈ N in the decimal
numeral system. Then, we have

    n = Σ_{i=1}^∞ 10^{i−1} D_i^(10)(n),

which is actually a finite sum for each n. For example,

    563 = 10^2 · 5 + 10^1 · 6 + 10^0 · 3.

Note that

    D_i^(10)(n) := ⌊10^{−i+1} n⌋ − 10 ⌊10^{−i} n⌋,  i ∈ N+,

where ⌊t⌋ denotes the integer part of t ≥ 0. For example,

    D_2^(10)(563) = ⌊10^{−1} · 563⌋ − 10 ⌊10^{−2} · 563⌋
                  = ⌊56.3⌋ − 10 ⌊5.63⌋ = 56 − 50
                  = 6.

Similarly, for each n ∈ N, we can find D_i(n) ∈ {0, 1}, i ∈ N+, so that

    n = Σ_{i=1}^∞ 2^{i−1} D_i(n),

which is actually a finite sum. Indeed, the digits are given by

    D_i(n) := ⌊2^{−i+1} n⌋ − 2 ⌊2^{−i} n⌋,  i ∈ N+.
Example A.3. For example,

    D_1(563) = ⌊563⌋ − 2 ⌊2^{−1} · 563⌋ = 563 − 562
             = 1,
    D_2(563) = ⌊2^{−1} · 563⌋ − 2 ⌊2^{−2} · 563⌋
             = ⌊281.5⌋ − 2 ⌊140.75⌋ = 281 − 280
             = 1.

This computation needs some patience. There is an easier method to obtain
binary integers from decimal integers: repeat the division by 2, and line
the remainders up in the reverse order.

    563 ÷ 2 = 281 remainder 1
    281 ÷ 2 = 140 remainder 1
    140 ÷ 2 =  70 remainder 0
     70 ÷ 2 =  35 remainder 0
     35 ÷ 2 =  17 remainder 1
     17 ÷ 2 =   8 remainder 1
      8 ÷ 2 =   4 remainder 0
      4 ÷ 2 =   2 remainder 0
      2 ÷ 2 =   1 remainder 0
      1 ÷ 2 =   0 remainder 1

Thus the binary representation of 563 is 1000110011.
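
The repeated-division method is also how one would program the conversion.
A short sketch (not from the book):

#include <stdio.h>

int main(void)
{
    unsigned int n = 563;
    char digits[64];
    int len = 0, i;
    while (n > 0) {
        digits[len++] = '0' + (n % 2);   /* remainder of division by 2 */
        n /= 2;                          /* quotient for the next step */
    }
    for (i = len - 1; i >= 0; i--)       /* remainders in reverse order */
        putchar(digits[i]);
    putchar('\n');                       /* prints 1000110011 */
    return 0;
}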



A.2.2 Binary fractions


Next, let us consider how to describe a fraction x ∈ [0, 1). In the decimal
numeral system,

    x = 0.d_1^(10)(x) d_2^(10)(x) . . . = Σ_{i=1}^∞ 10^{−i} d_i^(10)(x).

Here d_i^(10)(x) ∈ {0, 1, . . . , 9} is the i-th digit of x in its decimal expansion.
For example,

    0.563 = 10^{−1} · 5 + 10^{−2} · 6 + 10^{−3} · 3 + 10^{−4} · 0 + 10^{−5} · 0 + ⋯.

d_i^(10)(x) is described as

    d_i^(10)(x) := ⌊10^i x⌋ − 10 ⌊10^{i−1} x⌋,  i ∈ N+.    (A.3)

For example,

    d_2^(10)(0.563) = ⌊100 · 0.563⌋ − 10 ⌊10 · 0.563⌋
                    = ⌊56.3⌋ − 10 ⌊5.63⌋
                    = 56 − 10 · 5 = 6.

In the case of the binary numeral system, like (A.3), define

    d_i(x) := ⌊2^i x⌋ − 2 ⌊2^{i−1} x⌋,  x ∈ [0, 1),    (A.4)

and we have

    x = Σ_{i=1}^∞ 2^{−i} d_i(x),  x ∈ [0, 1).

Example A.4. How is 0.563 in the decimal numeral system described in
the binary numeral system? Following (A.4), we see

    d_1(0.563) = ⌊2 · 0.563⌋ − 2 ⌊0.563⌋ = 1 − 0 = 1,
    d_2(0.563) = ⌊4 · 0.563⌋ − 2 ⌊2 · 0.563⌋ = 2 − 2 = 0,
    d_3(0.563) = ⌊8 · 0.563⌋ − 2 ⌊4 · 0.563⌋ = 4 − 4 = 0,
    d_4(0.563) = ⌊16 · 0.563⌋ − 2 ⌊8 · 0.563⌋ = 9 − 8 = 1,
    . . .

and hence it is an infinite fraction 0.1001 . . .. An equivalent easier method
is

    2 × 0.563 = 1.126 = 1 + 0.126,
    2 × 0.126 = 0.252 = 0 + 0.252,
    2 × 0.252 = 0.504 = 0 + 0.504,
    2 × 0.504 = 1.008 = 1 + 0.008,
    2 × 0.008 = 0.016 = 0 + 0.016,
    . . .

Lining up the integer parts 1, 0, 0, 1, 0, . . ., we get the binary representation
0.10010 . . ..


Alternatively, describing 563/1000 in the binary numeral system as

    1000110011 / 1111101000

(1111101000 is the binary representation of 1000), we get the same digits
by carrying out the long division in binary:

    1000110011 ÷ 1111101000 = 0.1001 . . . .
Remark A.1. Recall that 1 = 0.9999 . . . in the decimal numeral system.
Just like this, the description of fractions in the binary numeral system is not
necessarily unique. For example, 1/2 = 0.1 = 0.01111 . . .. Adopting Defini-
tion (A.4) means that 1/2 should be described as 0.1 in the binary numeral
system.
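
The doubling method of Example A.4 is equally mechanical. The following
sketch (not from the book) computes the first 10 binary digits of 0.563;
note that a double only approximates 0.563, which is harmless for the
leading digits.

#include <stdio.h>

int main(void)
{
    double x = 0.563;
    int i;
    printf("0.");
    for (i = 1; i <= 10; i++) {
        x *= 2.0;                                  /* the next digit is the */
        if (x >= 1.0) { putchar('1'); x -= 1.0; }  /* integer part of 2x    */
        else          { putchar('0'); }
    }
    putchar('\n');   /* prints 0.1001000000, cf. Example A.4 */
    return 0;
}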

A.3 Limit of sequence and function

At high school, students learn the limit of sequences and functions in terms
like: lim_{n→∞} a_n = a means "if n gets larger and larger, a_n gets closer
and closer to a." This description is too vague for advanced mathematics.
Here we introduce the rigorous treatment of limits that was established by
Cauchy, Weierstrass and others in the 19th century.

A.3.1 Convergence of sequence


Let us consider quantitatively the situation "if n gets larger and larger, a_n
gets closer and closer to a." Namely, we ask: how large shall we take n so
that a_n is as close to a as we want? The answer should be: if we take n greater
than some N, then the difference between a_n and a is less than ε > 0. Since
the difference between a_n and a can be as small as we want, we may take
any ε provided that it is positive. Therefore, if for any given ε > 0 we can
take an N that satisfies the condition "if n is greater than N, the difference
between a_n and a is less than ε," we may well say that a_n converges to a
as n tends to infinity.
Describing the above idea with only strictly selected words, we reach
the following sophisticated definition.

Definition A.3. We say a real sequence {a_n}_{n=1}^∞ is convergent to a ∈ R if
for any ε > 0, there exists an N ∈ N+ such that |a_n − a| < ε holds for any
n > N. This is written as lim_{n→∞} a_n = a.

In this definition, there is no phrase like "n gets larger and larger"
or "a_n gets closer and closer to a," but it describes the situation admitting
no possibility of misunderstanding.

Example A.5. Based on Definition A.3, let us prove

    lim_{n→∞} 1/√n = 0.    (A.5)

First, take any ε > 0. We want to see

    0 < 1/√n < ε    (A.6)

for large n. Solving this inequality in n, we get

    n > 1/ε².

Now, let N := ⌊1/ε²⌋ + 1. Then, for any n > N, (A.6) holds. Thus we have
proved (A.5).

Proposition A.1. A convergent sequence {a_n}_{n=1}^∞ is bounded, i.e., there
exists an M > 0 such that |a_n| ≤ M for any n ∈ N+.

Proof. Let lim_{n→∞} a_n = a. Then (taking ε = 1), there exists an N ∈ N+
such that |a_n − a| < 1 for any n > N; in particular, |a_n| < |a| + 1 for any
n > N. Now, let M > 0 be

    M := max{ |a_1|, |a_2|, . . . , |a_N|, |a| + 1 }.

Then, for any n ∈ N+, we have |a_n| ≤ M.

Proposition A.2. Assume that {a_n}_{n=1}^∞ converges to 0. Then, for any A <
B, it holds that

    lim_{n→∞} max_{n+A√n ≤ k ≤ n+B√n} |a_k| = 0.    (A.7)

Proof. By the assumption, for any ε > 0, there exists an N ∈ N+ such
that for any k > N, we have |a_k| < ε. If A ≥ 0, then for n > N, we see

    max_{n+A√n ≤ k ≤ n+B√n} |a_k| ≤ max_{n ≤ k ≤ n+B√n} |a_k| < ε,

which implies (A.7). If A < 0, when n > N′ := ⌊4A²⌋ + 1, we see

    A/√n > A/√(4A² + 1) > A/√(4A²) = A/(2|A|) = −1/2,

and hence

    n + A√n = n (1 + A/√n) > n/2.

Therefore for any n > max{N′, 2N}, we have n + A√n > N, which implies
that

    max_{n+A√n ≤ k ≤ n+B√n} |a_k| ≤ max_{N+1 ≤ k ≤ n+B√n} |a_k| < ε.

Thus (A.7) holds.

Example A.6. Let us show that lim_{n→∞} a_n = a implies

    lim_{n→∞} (a_1 + a_2 + ⋯ + a_n)/n = a.    (A.8)

First, take any ε > 0. By the assumption, there exists an N_1 ∈ N+ such
that for any n > N_1, we have

    |a_n − a| < ε/2.    (A.9)

On the other hand, since {a_n − a}_{n=1}^∞ converges (to 0), it is bounded (Propo-
sition A.1). Namely, there exists an M > 0 such that for any n ∈ N+, we
have |a_n − a| ≤ M. Then, putting N_2 := ⌊2 N_1 M/ε⌋ + 1, we see for any
n > N_2,

    |(a_1 − a) + ⋯ + (a_{N_1} − a)|/n ≤ N_1 M/n < N_1 M/N_2 < ε/2.    (A.10)

Finally, putting N := max{N_1, N_2}, it follows from (A.9) and (A.10) that
for any n > N, we have

    |(a_1 + a_2 + ⋯ + a_n)/n − a|
      ≤ |(a_1 − a) + ⋯ + (a_{N_1} − a)|/n + (|a_{N_1+1} − a| + ⋯ + |a_n − a|)/n
      < ε/2 + ((n − N_1)/n)(ε/2)
      < ε/2 + ε/2 = ε.

This completes the proof of (A.8).
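
A numerical illustration (not from the book) of (A.8): taking a_n = 1/√n
from Example A.5, which converges to 0, the running averages also tend
to 0, only more slowly (in fact like 2/√n).

#include <stdio.h>
#include <math.h>

int main(void)
{
    double sum = 0.0;
    long n;
    for (n = 1; n <= 1000000; n++) {
        sum += 1.0 / sqrt((double)n);        /* partial sum a_1+...+a_n */
        if (n == 100 || n == 10000 || n == 1000000)
            printf("n=%7ld  a_n=%.6f  average=%.6f\n",
                   n, 1.0 / sqrt((double)n), sum / n);
    }
    return 0;
}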

A.3.2 Continuity of function of one variable

Definition A.4. (i) We say a function f is convergent to r as x tends
to a if for any ε > 0, there exists a δ > 0 such that for any x satisfying
0 < |x − a| < δ, we have |f(x) − r| < ε. This is written as lim_{x→a} f(x) = r.
(ii) We say f is continuous at x = a if lim_{x→a} f(x) = f(a).
(iii) We say f is continuous in an interval (a, b) := { x | a < x < b } if for
any c ∈ (a, b), it is continuous at x = c.

These rigorous definitions of limit (Definition A.3 and Definition A.4)
are called the (ε, δ)-definition.

Example A.7. Let us show that f(x) := x² is continuous in the whole
interval R. First, take any c ∈ R and any ε > 0. We have to show the
existence of a δ > 0 such that

    |f(x) − f(c)| = |x² − c²| = |x − c||x + c|

becomes smaller than ε if |x − c| < δ. Now, if |x − c| < δ, we have

    |x − c||x + c| = |x − c||(x − c) + 2c|
                  ≤ |x − c|(|x − c| + 2|c|) < δ(δ + 2|c|),

and hence it is enough to find a δ > 0 such that

    δ(δ + 2|c|) < ε.

Let us solve this inequality in δ; adding |c|² to both sides, we get

    δ² + 2|c|δ + |c|² < |c|² + ε, i.e., (δ + |c|)² < |c|² + ε,

from which we derive

    0 < δ < √(|c|² + ε) − |c|.    (A.11)

Conversely, for any δ satisfying (A.11) and any x satisfying |x − c| < δ, we
see

    |x² − c²| < ε.

Thus f(x) = x² is continuous at x = c. Since c is an arbitrary element of R,
it is continuous in R.
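
The δ produced by (A.11) can also be tested numerically. The following
sketch (not from the book) picks c = 3 and ε = 0.01, computes
δ = √(c² + ε) − |c|, and checks |x² − c²| < ε on a grid of points with
|x − c| < δ; the particular c and ε are arbitrary.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double c = 3.0, eps = 0.01;
    double delta = sqrt(c * c + eps) - fabs(c);   /* delta from (A.11) */
    int k, ok = 1;
    for (k = -1000; k <= 1000; k++) {
        double x = c + delta * k / 1001.0;        /* points with |x-c| < delta */
        if (fabs(x * x - c * c) >= eps) ok = 0;
    }
    printf("delta = %.8f, all sampled x passed: %s\n",
           delta, ok ? "yes" : "no");
    return 0;
}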

A.3.3 Continuity of function of several variables

Definition A.4 naturally extends to the continuity of functions of several
variables.

Definition A.5. We say a function of d variables f : R^d → R is continuous
at a point (a_1, . . . , a_d) ∈ R^d if for any ε > 0, there exists a δ > 0 such that
for any (x_1, . . . , x_d) ∈ R^d satisfying |x_1 − a_1| + ⋯ + |x_d − a_d| < δ, we have

    |f(x_1, . . . , x_d) − f(a_1, . . . , a_d)| < ε.

We say f is continuous in a domain D ⊂ R^d if it is continuous at each point
of D.

For example, in the proof of Lemma 3.4, (3.53) is proved by the conti-
nuity of a function of 5 variables

    f(x_1, x_2, x_3, x_4, x_5) := (1 + x_3) / (4 x_1 x_2 (1 + x_4)(1 + x_5))

at the point (1/2, 1/2, 0, 0, 0). Similarly, for the proof of (3.49), we use, in the
last paragraph of the proof of the lemma, the continuity of a function of
two variables

    g(x_1, x_2) := exp(x_1)(1 + x_2) − 1

at the origin (0, 0).

A.4 Limits of exponential function and logarithm

Proposition A.3.
(i) lim_{x→∞} x^a b^{−x} = 0,  a > 0, b > 1.
(ii) lim_{x→∞} x^{−a} log x = 0,  a > 0.
(iii) lim_{x→+0} x log x = 0.

The limit in (iii) is called the right-sided limit: the limit as x tends to 0
from above. More precisely, it means that for any ε > 0, there exists a δ > 0
such that for any x satisfying 0 < x < δ, we have |x log x| < ε.

Proof. (i) Let c(x) := x^a b^{−x}. We can take a large x_0 > 0 so that

    0 < c(x_0 + 1)/c(x_0) = (1 + 1/x_0)^a b^{−1} =: r < 1.

Then,

    0 < c(x + 1)/c(x) = (1 + 1/x)^a b^{−1} < r,  x > x_0,

and hence

    0 < c(x + n)/c(x) = [c(x + n)/c(x + n − 1)] [c(x + n − 1)/c(x + n − 2)] ⋯ [c(x + 1)/c(x)] < r^n,  x > x_0.

We therefore have

    0 < c(x) < r^{⌊x − x_0⌋} max_{x_0 ≤ y ≤ x_0 + 1} c(y) → 0,  x → ∞.

(ii) By putting y := log x, (i) implies that

    lim_{x→∞} x^{−a} log x = lim_{y→∞} e^{−ay} y = 0.

(iii) By putting y := 1/x, (ii) implies that

    lim_{x→+0} x log x = lim_{y→∞} (1/y) log(1/y) = − lim_{y→∞} y^{−1} log y = 0.

Proposition A.4. For any x > 0,

    lim_{n→∞} x^n / n! = 0.

Proof. Take N ∈ N+ so that x < N/2. Then, for n > N, we see

    x^n / n! = (x^N / N!) · (x/(N+1)) (x/(N+2)) ⋯ (x/n)
             < (x^N / N!) (x/N)^{n−N}
             < (x^N / N!) (1/2)^{n−N},

which converges to 0 as n → ∞.
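
Numerically, the decay of x^n/n! is visible even for a fairly large x, if the
terms are computed iteratively as t_n = t_{n−1} · x/n so that x^n and n! are
never formed separately. A quick check (not from the book):

#include <stdio.h>

int main(void)
{
    double x = 10.0, t = 1.0;
    int n;
    for (n = 1; n <= 40; n++) {
        t *= x / n;                     /* now t = x^n / n! */
        if (n % 10 == 0)
            printf("n=%2d  x^n/n! = %e\n", n, t);
    }
    return 0;
}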

A.5 C language program

Today, thanks to the spread of computers, we can easily execute large-scale com-
putations. This has led to an explosive expansion of the concretely computable
areas of mathematical theory. To make mathematical theory more useful
in practice, readers should study computer basics.
The calculation of Example 4.4 was done by the following C program
([Kernighan and Ritchie (1988)]). It outputs 546177 as the value of
S_{10^6}(g(ω_0)) and the estimated value 0.546177 of the probability p.

/*==========================================================*/
/* file name: example4_4.c */
/*==========================================================*/
#include <stdio.h>

#define SAMPLE_NUM 1000000


#define M 100
#define M_PLUS_J 119

/* seed */
char xch[] =
"1110110101" "1011101101" "0100000011" "0110101001"
"0101000100" "0101111101" "1010000000" "1010100011"
"0100011001" "1101111101" "1101010011" "111100100";
char ach[] =
"1100000111" "0111000100" "0001101011" "1001000001"
"0010001000" "1010101101" "1110101110" "0010010011"
"1000000011" "0101000110" "0101110010" "010111111";

int x[M_PLUS_J], a[M_PLUS_J];

void longadd(void) /* x = x + a (long-digit addition in base 2) */
{
    int i, s, carry = 0;
    for ( i = M_PLUS_J-1; i >= 0; i-- ){
        s = x[i] + a[i] + carry;
        if ( s >= 2 ) { carry = 1; s = s - 2; } else carry = 0;
        x[i] = s;
    }
}

int maxLength(void) /* count the longest run of 1s in the first M bits */
{
    int len = 0, count = 0, i;
    for ( i = 0; i <= M-1; i++ ){
        if ( x[i] == 0 )
            { if ( len < count ) len = count; count = 0; }
        else count++; /* if x[i]==1 */
    }
    if ( len < count ) len = count;
    return len;
}

int main()
{
    int n, s = 0;
    for( n = 0; n <= M_PLUS_J-1; n++ ){
        if( xch[n] == '1' ) x[n] = 1; else x[n] = 0;  /* parse the seed strings */
        if( ach[n] == '1' ) a[n] = 1; else a[n] = 0;
    }
    for ( n = 1; n <= SAMPLE_NUM; n++ ){
        longadd();
        if ( maxLength() >= 6 ) s++;
    }
    printf( "s=%6d, p=%7.6f\n", s, (double)s/(double)SAMPLE_NUM );
    return 0;
}
/*================ End of example4_4.c =====================*/
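
On a typical Unix-like system the program can be compiled and run with,
for example, cc example4_4.c -o example4_4 followed by ./example4_4,
though the exact compiler invocation of course depends on the environment.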

List of mathematicians

Mathematician     Birth–Death   Related subject in this book

Euclid            3rd c. B.C.   Euclid's theorem
I. Newton         1642–1727     Motion equation
G.W. Leibniz      1646–1716     Binary numeral system
J. Bernoulli      1654–1705     Bernoulli's theorem
A. de Moivre      1667–1754     Limit of binomial distribution
B. Taylor         1685–1731     Taylor's formula
C. Goldbach       1690–1764     Goldbach's conjecture
J. Stirling       1692–1770     Stirling's formula
L. Euler          1707–1783     Euler's integral
P.S. Laplace      1749–1827     Limit of binomial distribution
A.-M. Legendre    1752–1833     Legendre transformation
C.F. Gauss        1777–1855     Gaussian distribution
S.D. Poisson      1781–1840     Namer of law of large numbers
A.L. Cauchy       1789–1857     (ε, δ)-definition
K. Weierstrass    1815–1897     (ε, δ)-definition
P.L. Chebyshev    1821–1894     Chebyshev's inequality
A.A. Markov       1856–1922     Markov's inequality
D. Hilbert        1862–1943     Hilbert's 6th problem
E. Borel          1871–1956     Normal number theorem
H. Lebesgue       1875–1941     Measure theory
H. Weyl           1885–1955     Weyl transformation
G. Pólya          1887–1985     Namer of central limit theorem
H. Cramér         1893–1985     Cramér–Chernoff's inequality
A.N. Kolmogorov   1903–1987     Axioms of probability theory
J. von Neumann    1903–1957     Monte Carlo method
K. Gödel          1906–1978     Gödel number
S. Kleene         1909–1994     Kleene's normal form
S.M. Ulam         1909–1984     Monte Carlo method
A.M. Turing       1912–1954     Turing machine
H. Chernoff       1923–         Cramér–Chernoff's inequality
R. Solomonoff     1926–2009     Random number
G. Chaitin        1947–         Random number


Further reading

This book exclusively dealt with limit theorems in order to emphasize the most im-
portant mission of probability theory: the analysis of randomness. Of course,
there are many other limit theorems not presented in this book, not neces-
sarily determining events of probability close to 1. Anyhow, limit theorems
are not the only theme of probability theory. To learn the richness of probability
theory, [Feller (1968)] and [Sinai (1992)] are recommended.
Before then, readers should master calculus and linear algebra. What
follows are hints for those who have completed them.
In this book, we restricted the sample space to be a finite set. To
deal with an infinite sample space, we need measure theory (1.5.1). To study
it, [Billingsley (2012)] is recommended. Random number is merely
an item of computation theory. To study computation theory, including
random number, [Sipser (2012)] is recommended. About recursive functions and algo-
rithmic randomness, [Rogers (1967)], [Downey and Hirschfeldt (2010)] and
[Nies (2009)] are textbooks for graduate students and researchers. Books
about the Monte Carlo method are really numerous. To study its rigorous
basic theory, [Sugita (2011)] is recommended. For its advanced
theory, [Bouleau and Lépingle (1994)] is recommended.


Bibliography

Billingsley, P. (2012). Probability and Measure, Anniversary edn. (Wiley).
Bouleau, N. and Lépingle, D. (1994). Numerical Methods for Stochastic Processes,
Wiley Series in Probability and Statistics (Wiley).
Downey, R. G. and Hirschfeldt, D. R. (2010). Algorithmic Randomness and Com-
plexity (Springer).
Feller, W. (1968). An Introduction to Probability Theory and Its Applications,
Vol. 1, 3rd edn. (Wiley).
Hardy, G. H. and Wright, E. M. (1979). An Introduction to the Theory of Numbers,
5th edn. (Oxford Science Publications).
Kernighan, B. W. and Ritchie, D. M. (1988). The C Programming Language, 2nd
edn. (Prentice Hall).
Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergeb-
nisse der Mathematik und Ihrer Grenzgebiete Vol. 2. English translation:
Foundations of the Theory of Probability, 2nd edn. (Chelsea (1956)).
Laplace, P. S. (1812). Théorie analytique des probabilités.
Li, M. and Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and
Its Applications, 3rd edn. (Springer).
Nies, A. (2009). Computability and Randomness, Oxford Logic Guides Vol. 51
(Oxford Univ. Press).
Rogers, H. Jr. (1967). Theory of Recursive Functions and Effective Computability
(McGraw-Hill).
Sinai, Y. G. (1992). Probability Theory: An Introductory Course (Springer).
Sipser, M. (2012). Introduction to the Theory of Computation, 3rd edn. (Course
Technology Inc.).
Sugita, H. (2011). Monte Carlo Method, Random Number, and Pseudorandom
Number, MSJ Memoirs Vol. 25 (World Scientific).
Takahashi, M. (1991). Theory of Computation (Computability and λ-calculus)
(Kindai-Kagaku sha, in Japanese).


Index

Symbols
#, viii, 2
∅, viii, 4, 106
:=, viii
⊂, ⊃, viii, 10
≤, ≥, viii
≪, ≫, viii
≡, 99
↦, 106
∏_{i=1}^n, ∏_{i∈I}, viii, 48, 107
∼, viii, 39, 67
⊕, viii, 11
⟨x_1, x_2, . . . , x_n⟩, 34
(u)_i^n, 34
{0, 1}*, 34
⌊t⌋, viii, 34, 109
1_A(x), viii, 106
A^c, viii, 106
(n k) (binomial coefficient), viii, 44
d_i(x), 3, 110
E, 51
exp(x), 14
H(p), 43
K(x), 36
K_A(x), 35
L(q), 34
max[min], viii, 58
mod, 100
μ_y(p(x_1, . . . , x_n, y)), 28
N, viii, 24
N_+, viii, 34
P_n, 10
P(Ω), viii, 2
P, 3
R, viii
Σ_{ω∈Ω}, Σ_{ω; X(ω)=a_i}, viii, 108
V, 51
ω_i, 13

Terms
algorithm, 35
Alice, 10
Bernoulli's theorem, 45
Borel
  Borel's model of coin tosses, 3
  Borel's normal number theorem, 19
Brownian motion, 21
canonical
  canonical order, 34
  canonical realization, 9
central limit theorem, 81
Chebyshev's inequality, 55
coin tosses
  infinite coin tosses, 18
  n coin tosses, 10
  unfair coin tosses, 51
complexity
  computational complexity depending on algorithm, 35
  Kolmogorov complexity, 36

computation theory, 23
confidence
  confidence interval, 88
  confidence level, 88
continuity correction, 79
coordinate function, 3
countable set, 25
  uncountable set, 25
Cramér–Chernoff's inequality, 58
data compression, 38
de Moivre–Laplace's theorem, 64
diagonal method, 25
direct product, viii, 105
disjoint, 5
distribution, 4, 6
  distribution of prime numbers, 39
  binomial distribution, 60
  identically distributed, 49
  independent identically distributed, 50
  joint distribution, 7
  marginal distribution, 7
  probability distribution, 4, 6
  standard normal (Gaussian) distribution, 20, 64
  uniform distribution, 6
empirical probability, 57
empty
  empty event, 5
  empty set, viii, 4, 106
  empty word, 34
entropy function, 43
enumerating function, 31
enumeration theorem, 31
(ε, δ)-definition, 114
Euclid's theorem, 39
Euler's integral, 68
event, 5
  elementary event, 5
  empty event, 5
  whole event, 5
formula
  Stirling's formula, 67
  Taylor's formula, 65
  Wallis' formula, 68
function (mapping), 106
gambling, 91
Gamma function, 68
Gauss
  Gaussian integral, 74
  standard Gaussian distribution, 20, 64
generic value, 91
Gödel number (index), 31
halting problem, 33
hypothesis, 89
i.i.d., 50
improper integral, 14, 19
independent
  independent identically distributed, 50
  independent events, 48
  pairwise independent, 54
  independent random variables, 49
indicator function, viii, 106
inequality
  Chebyshev's inequality, 55
  Cramér–Chernoff's inequality, 58
  Markov's inequality, 54
integral
  Euler's integral, 68
  Gaussian integral, 74
  improper integral, 14, 19
Kleene's normal form, 29
Kolmogorov complexity, 36
law of large numbers, 55
Lebesgue measure, 3
Legendre transformation, 58
limit theorem, 13
  central limit theorem, 81
loop, 28
  infinite loop, 28
Markov's inequality, 54
mathematical model, 1
mean (expectation), 51
measure theory, 6, 19
moment, 58
  moment generating function, 58
Monte Carlo
  Monte Carlo method, 16, 91
  Monte Carlo integration, 96
μ-operator, 28
pairwise independent, 54

partial function, 27
partial recursive function, 28
power set, viii, 2
prime number
  prime number theorem, 39
  distribution of prime numbers, 39
probability, 5
  probability distribution, 4, 6
  probability measure, 5
  probability of A, 5
  probability space, 5
  probability that X = a_i, 6
  probability theory, 4
  conditional probability, 48
  empirical probability, 57
product measure, 51
uniform measure, 5
pseudorandom
  pseudorandom generator, 17, 94
  pseudorandom number, 17, 94
  seed of pseudorandom number, 17, 94
random
  random {0, 1}-sequence, 11
  randomness, v, 12
  random number, 11, 37
  random sampling, 15
  random variable, 6
  random Weyl sampling, 102
  algorithmic randomness, 12
range, 6
rate function, 58
recursive function, 26
  partial recursive function, 28
  primitive recursive function, 26
  total recursive function, 28
relative frequency, 14
remainder term, 67
sample
  sample space, 5
  sample value, 8
sampling, 8
  random sampling, 15
  random Weyl sampling, 102
secure against U, 95
semi-open interval, 3
significance level, 89
standardization, 62
statistical inference, 86
statistical test, 88
Stirling's formula, 67
Taylor's formula, 65
theorem
  Bernoulli's theorem, 45
  Borel's normal number theorem, 19
  central limit theorem, 81
  de Moivre–Laplace's theorem, 64
  Euclid's theorem, 39
  prime number theorem, 39
theory
  computation theory, 23
  measure theory, 6, 19
  probability theory, 4
total function, 27
total recursive function, 28
transformation, 106
  Legendre transformation, 58
  Weyl transformation, 102
Turing machine, 26
uncountable set, 25
universal
  universal algorithm, 35
  universal function, 31
variance, 51
Wallis' formula, 68
Weyl transformation, 102
word, 34
  empty word, 34
