course
Jonathan G. Campbell
Department of Computing,
Letterkenny Institute of Technology,
Co. Donegal, Ireland.
Revision 0.3
1 Introduction
1.1 Purpose and Scope
1.2 Why use R?
1.3 Relevant textbooks and web sources
1.3.1 General Books on Probability and Statistics
1.3.2 Books on R and Statistics using R
1.3.3 Bayesian Statistics
1.3.4 Web Links
1.4 Outline
3 Averages
3.1 Introduction
3.2 Arithmetic Mean
3.2.1 Arithmetic Mean using Frequencies
3.3 Median
3.4 Mode
3.5 Other Means
5.2.5 Finite Sample Spaces
5.3 Random Variables
5.4 Computing probabilities
5.5 Enumerating more complex events and sample spaces
5.5.1 Multiplication of outcomes
5.5.2 Addition of outcomes
5.5.3 Permutations
5.5.4 Combinations
5.6 Conditional Probability
5.6.1 Venn diagrams
5.6.2 Probability Trees
5.6.3 Joint Probability
5.7 Bayes’ Rule
5.8 Independent Events
5.9 Betting and Odds
5.10 Classical versus Bayesian Interpretations of Probability
7.6 Independent Random Variables
7.7 Two-dimensional (Bivariate) Normal Distribution
10 Statistical Inference
10.1 Introduction
11 Statistical Estimation
11.1 Introduction
11.2 Populations and Samples
11.3 Estimating the Mean
11.4 Estimating the Standard Deviation
11.5 Sampling Distributions
11.5.1 Sampling Distribution of the mean
11.5.2 Sampling Distribution for Estimates of the Standard Deviation
11.6 Confidence Intervals
12 Hypothesis Testing
12.1 Introduction
13 Sampling
13.1 Introduction
15.7 Template Matching and Discriminants
15.8 Nearest neighbour methods
20 Regression
20.1 Linear Regression
B.5.4 Transpose of a Matrix
B.6 Inverse Matrix
B.7 Multidimensional (Multivariate) Random Variables
Chapter 1
Introduction
This report is written as the basis for a short course on statistics to be presented for postgraduate
students at Letterkenny Institute of Technology.
The notes have a mixed objective. I started writing a set of notes based on the traditional approach
to probability and statistics, namely: basic probability, up to and including conditional probability,
independence, Bayes’ Law; then some one-dimensional discrete and continuous distributions and
some of their properties, and so on; and then on to sampling, parameter estimation, point estimates,
confidence intervals, and hypothesis testing.
However, after discussion with someone who knows potential consumers of the course, I was
persuaded to start with a more gentle introduction. Hence I start off with simple visualisation, then
look at averages (central tendency), then variance, and then return to the main line.
As I say, the notes have a mixed objective. One objective is as notes for a gentle introduction to
statistics; another is to include a set of reference results that one would refer to during a course.
That is, a course presenter might not want to spend time on the details of, for example, the Binomial
distribution, or even full details of the Normal, but it would be useful for students to have access
to some of these details without having to consult one or more textbooks.
When I give a course, I may give attendees a printout of all the notes — including an outline of
the objective of the course and the plan of coverage, mentioning the chapters that will be used.
Or, alternatively, I may do a specialised printout that includes only the chapters to be covered.
Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and
colleagues. R can be considered as a different implementation of S. There are some
important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, ...) and graphical techniques,
and is highly extensible. The S language is often the vehicle of choice for research in
statistical methodology, and R provides an Open Source route to participation in that
activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can
be produced, including mathematical symbols and formulae where needed. Great care
has been taken over the defaults for the minor design choices in graphics, but the user
retains full control.
When I have to choose a software package for teaching or for practical use (I mean generally; it
could be a development system for a programming language, a computer games engine, a statistics
package, ...) I look primarily at the following criteria:
• Is it easily available, i.e. is it already installed on our laboratory machines, or is it easy (and
cheap) to acquire?
R does well on this criterion: it is free to download and install, see 2.1.1.
https://stat.ethz.ch/mailman/listinfo/r-help
Via that mailing list, I have received assistance from world-class statisticians.
These notes are mostly based on (Meyer 1966) (which was used for a college course on statistics
that I attended), (Wasserman 2004), which is a good summary of all the statistics you might
ever need, but is not an introduction, (Griffiths 2009) and (Milton 2009) which are excellent
introductions though very wordy, (Crawley 2005), (Spiegel & Stephens 2008). The latter, (Spiegel
& Stephens 2008), has plenty of examples including some examples on the use of the Excel
spreadsheet.
(Dytham 2009) seems to be a good introduction for biologists and the more advanced (Quinn &
Keough 2002) receives a lot of recommendations.
Hacking’s book (Hacking 2001) is maybe a good introduction to probability and the philosophy
and practice of probabilistic inference.
The bibliography contains books in my collection which I may have used in some small way
and/or which may be useful to users of these notes.
Crawley may be the best general book (Crawley 2005); for bio-scientists it has the advantage that
Crawley’s research area is bio-science.
Venables and Ripley’s MASS (Venables & Ripley 2002) is top class — note, do not be confused
by the title Modern Applied Statistics with S; R is an open-source version of S (and S-Plus) and
the book covers any differences, which are minimal. Maindonald (Maindonald & Braun 2007) is
good for R graphics; R code for all his diagrams is available online (free).
Matloff’s R for Programmers (Matloff 2008) has the advantage that it is available online.
http://www.r-project.org/doc/bib/R-books.html
(Sivia 2006) (best introduction to Bayesian statistics), (MacKay 2002), (Lee 2004).
• R: http://www.r-project.org/.
1.4 Outline
Chapter 5 gives an introduction to probability; if you want to understand basic statistics you must
have a basic understanding of probability — however we note that probability is to a great extent
common sense. Before starting you should have a quick run through Appendix A just to familiarise
yourself with basic mathematical notation; we note that the mathematical notation used is no
more than shorthand; it would be difficult to write these notes without employing that shorthand;
in addition, you will encounter similar shorthand in books and research papers.
Chapter 2 gives a very brief introduction to simple statistical techniques and visualisation and to
the statistical package R.
Chapter 3 gives a brief introduction to averages or what statisticians call central tendency.
Chapter 4 introduces methods of describing data variability, most notably variance
and standard deviation.
Chapter 6 introduces random variables and lists the common one-dimensional probability
distributions.
Chapter 7 gives a brief introduction to multivariate random variables and some distributions. Note
that Appendix B gives a gentle introduction to vector and matrix mathematics which are necessary
in multivariate statistics.
Chapter 8 discusses important characteristics of random variables such as mean and variance.
Chapter 9 gives specialised treatment to the normal distribution — in view of its importance in
applications.
Chapter 10 introduces statistical inference, that is, how can we infer properties of a population
from statistics derived from a sample. One aspect of statistical inference is parameter estimation;
Chapter 11 introduces point estimation and confidence interval estimation. Hypothesis testing is
strongly related to estimation; Chapter 12 gives an introduction to hypothesis testing.
As of 2009-08-18 this is work in progress and will remain so for the foreseeable future.
Chapter 2
2.1 Introduction
The objectives of this chapter are to give a very brief introduction to simple statistical techniques
and visualisation and to the statistical package R.
2.1.1 Installation of R
Click on http://www.r-project.org/ and find the Download link. For Windows users there is
an exe file which does everything. You may need Administrator rights on your machine; contact
Computer Services as necessary.
Linux users are probably best advised to rely on the installer of their particular Linux distribution.
2.1.2 Running R
Start R by clicking on the R desktop icon. R will open up a window with something like the following
in it.
>
> 2 + 3
[1] 5
> sqrt(26)
[1] 5.09902
> 3^4
[1] 81
>
For the remainder of this chapter we’ll look at a significant example involving visualisation and
exploratory data analysis on a data set.
We’re going to read in some examination result data and analyse them. The file exam.txt contains
data as follows:
exam
65
60
47
... etc. 66 results in total
The name of the column is exam and we tell R to pay attention to that.
In what follows, # is a comment symbol and R ignores anything after the # until the next line.
Anything after > is something that you typed (a request to R). If something appears without
a >, that is an R response.
> ex <- read.table("exam.txt", header = TRUE)
> attach(ex)
> exam # print 'exam' data on the screen
[1] 65 60 47 43 51 32 62 71 0 56 52 59 15 49 54 67 44 2 47 61 45 95 62 80 46
[26] 52 61 12 62 69 78 62 48 56 56 58 60 0 48 71 50 90 51 53 5 51 63 35 39 10
[51] 57 53 20 54 22 44 53 52 25 60 55 39 30 53 67 50
>
That printout is quite uninformative: for example, you have no idea what the maximum is, nor the
range, nor have you even a rough idea of what the average mark is.
> hist(exam)
Often, like me here, you want to save the diagram to a file so that you can include it in a report.
Here is how to do that; vis1-1.pdf is a filename that I made up.
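The listing for this step is not reproduced in these notes. A minimal sketch, written by analogy with the pdf() call used for the CA histogram later in this chapter, might look as follows; the page-size settings are copied from that later call and are an assumption here, and the marks are retyped so that the block runs on its own.

```r
# Examination marks as printed above (66 results).
exam <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56, 52, 59, 15, 49, 54, 67, 44, 2, 47, 61, 45, 95, 62, 80, 46,
          52, 61, 12, 62, 69, 78, 62, 48, 56, 56, 58, 60, 0, 48, 71, 50, 90, 51, 53, 5, 51, 63, 35, 39, 10,
          57, 53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)

# Open a PDF graphics device; vis1-1.pdf is the made-up filename from the text.
# The size settings below are assumed, not taken from the original listing.
pdf("vis1-1.pdf", onefile = FALSE, height = 4, width = 6, pointsize = 8, paper = "special")
hist(exam)   # the histogram is drawn into the file, not onto the screen
dev.off()    # close the device so that the file is written out
```

Until dev.off() is called the file is incomplete, so it is worth getting into the habit of always pairing pdf() with dev.off().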
[Figure 2.1: Histogram of exam; x-axis: exam, y-axis: Frequency.]
Let us see what the average mark is and the range of marks:
> mean(exam)
[1] 49.07576
> range(exam)
[1] 0 95
>
> length(exam)
[1] 66 # 66 results in 'exam'
> sum(exam)/length(exam)
[1] 49.07576
Let us see the data in sorted order — a good deal more informative than unsorted:
> sort(exam)
[1] 0 0 2 5 10 12 15 20 22 25 30 32 35 39 39 43 44 44 45 46 47 47 48 48 49
[26] 50 50 51 51 51 52 52 52 53 53 53 53 54 54 55 56 56 56 57 58 59 60 60 60 61
[51] 61 62 62 62 62 63 65 67 67 69 71 71 78 80 90 95
>
Now read in the corresponding continuous assessment (CA) marks (coursework); they came from a
spreadsheet, so there is a load of digits after the decimal point, which makes the data even more
incomprehensible, so we use round to round them to the nearest integer. It looks like
the CA marks are more generous than the exam marks; mean(ca) confirms this, as does the
histogram in Figure 2.2.
> cw <- read.table("ca.txt", header = TRUE)
> attach(cw)
> ca
[1] 91.34390 85.54622 72.65543 63.10473 73.22074 50.99642 85.69151 97.06528
[9] 18.58191 83.30836 78.78221 77.68898 21.07860 76.04457 76.56793 86.90106
[17] 61.70048 16.28892 69.57387 83.08058 74.19594 97.12300 81.58833 98.12345
[25] 60.17263 79.49133 89.35610 27.89478 98.06673 92.34510 96.19500 88.69131
[33] 69.70333 85.23094 86.99767 82.89807 77.35877 15.12655 72.41332 90.07670
[41] 75.20815 97.17500 65.78075 70.29256 14.20315 73.02363 87.38178 52.74194
[49] 60.66164 20.05529 78.16085 73.58862 34.07182 78.03601 39.31353 69.57565
[57] 77.53929 77.20521 52.67979 89.10232 76.78222 54.16873 40.23080 81.09443
[65] 89.12518 67.58763
> car <- round(ca)
> car
[1] 91 86 73 63 73 51 86 97 19 83 79 78 21 76 77 87 62 16 70 83 74 97 82 98 60
[26] 79 89 28 98 92 96 89 70 85 87 83 77 15 72 90 75 97 66 70 14 73 87 53 61 20
[51] 78 74 34 78 39 70 78 77 53 89 77 54 40 81 89 68
>
> sort(car)
[1] 14 15 16 19 20 21 28 34 39 40 51 53 53 54 60 61 62 63 66 68 70 70 70 70 72
[26] 73 73 73 74 74 75 76 77 77 77 77 78 78 78 78 79 79 81 82 83 83 83 85 86 86
[51] 87 87 87 89 89 89 89 90 91 92 96 97 97 97 98 98
>
> mean(ca)
[1] 70.10692
>
> hist(ca)
# and save another one to a file
> pdf("vis1-ca.pdf", onefile=FALSE, height=4, width=6, pointsize=8, paper="special")
> hist(ca)
> dev.off()
[Figure 2.2: Histogram of ca; x-axis: ca, y-axis: Frequency.]
Boxplots are another way of examining a data set. Figure 2.3 shows boxplots for the examination
and CA results.
The construction of the boxplot is as follows: (a) the heavy line across the interior of the box
corresponds to the median value (see Chapter 3); (b) the bottom and top of the box correspond
to, respectively, the lower quartile and the upper quartile, i.e. 25% of the data are below the lower
quartile and 25% are above the upper quartile (or, if you like, 75% are below it).
The so called whiskers show the smallest and largest values — excluding boxplot’s interpretation
of outliers. The outliers are then shown as single points.
Quartile is a specialisation of the general term quantile, see Chapter 4. In Chapters 9, 11 and
12, we’ll come across, for example, 5% and 95% quantiles. The median is the centre of the data,
i.e. as many of the data are above the median as are below it; see Chapter 3.
To determine what are outliers, boxplot (with its default settings) uses the interquartile range,
i.e. the height of the box: any data that lie more than 1.5 times the interquartile range below the
lower quartile or above the upper quartile are labelled as outliers.
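With its default settings, R's boxplot flags as outliers the points lying more than 1.5 times the height of the box beyond the box edges. The sketch below checks this by hand on the examination marks: it recomputes the two fences from fivenum() and compares the flagged marks with those reported by boxplot.stats(), the function that boxplot calls internally. The names lower, upper, by_hand and from_r are illustrative only.

```r
# Examination marks as printed earlier in this chapter (66 results).
exam <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56, 52, 59, 15, 49, 54, 67, 44, 2, 47, 61, 45, 95, 62, 80, 46,
          52, 61, 12, 62, 69, 78, 62, 48, 56, 56, 58, 60, 0, 48, 71, 50, 90, 51, 53, 5, 51, 63, 35, 39, 10,
          57, 53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)

five  <- fivenum(exam)         # minimum, lower hinge, median, upper hinge, maximum
box   <- five[4] - five[2]     # height of the box (interquartile range of the hinges)
lower <- five[2] - 1.5 * box   # marks below this fence are flagged
upper <- five[4] + 1.5 * box   # marks above this fence are flagged

by_hand <- exam[exam < lower | exam > upper]
from_r  <- boxplot.stats(exam)$out   # what boxplot itself would draw as single points

sort(by_hand)
sort(from_r)
```

The factor 1.5 is the default of the range argument of boxplot (coef in boxplot.stats); a larger value makes the whiskers longer and flags fewer points.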
[Figure 2.3: Boxplots of the examination results and the CA results.]
How to look at the two data sets together? There must be a way of superimposing one histogram
on another, but I haven’t found that yet.
So let us display a two-dimensional scatter plot of the two data sets, see Figure 2.4.
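One possible way of superimposing two histograms is the add = TRUE argument of hist(). The sketch below is only an illustration: it uses the rounded CA marks (car) printed above, and the common breaks, the axis limit and the semi-transparent colours are my own choices. The last line draws the scatter plot, whose command is not shown in these notes.

```r
# Marks as printed earlier in this chapter (66 results each).
exam <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56, 52, 59, 15, 49, 54, 67, 44, 2, 47, 61, 45, 95, 62, 80, 46,
          52, 61, 12, 62, 69, 78, 62, 48, 56, 56, 58, 60, 0, 48, 71, 50, 90, 51, 53, 5, 51, 63, 35, 39, 10,
          57, 53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)
car  <- c(91, 86, 73, 63, 73, 51, 86, 97, 19, 83, 79, 78, 21, 76, 77, 87, 62, 16, 70, 83, 74, 97, 82, 98, 60,
          79, 89, 28, 98, 92, 96, 89, 70, 85, 87, 83, 77, 15, 72, 90, 75, 97, 66, 70, 14, 73, 87, 53, 61, 20,
          78, 74, 34, 78, 39, 70, 78, 77, 53, 89, 77, 54, 40, 81, 89, 68)

# Use the same bins for both data sets so that the bars line up.
bins <- seq(0, 100, by = 10)
hist(exam, breaks = bins, col = rgb(0, 0, 1, 0.4), ylim = c(0, 25),
     main = "exam (blue) and CA (red)", xlab = "mark")
hist(car, breaks = bins, col = rgb(1, 0, 0, 0.4), add = TRUE)  # drawn on top of the first

# Two-dimensional scatter plot: one point per student.
plot(car, exam, xlab = "ca", ylab = "exam")
```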
[Figure 2.4: Scatter plot of exam (y-axis) against ca (x-axis).]
Someone says: those CA and exam marks look quite correlated; I wonder how accurately we could
have predicted the exam results using the CA? This is regression territory and, given that
Figure 2.4 shows a sort of straight-line relationship, we’ll try linear regression, your old friend
y = mx + c, or in this case exam = m × ca + c. It is more usual to use a and b: exam = a + b × ca.
Here a is the intercept, where the fitted straight line meets the y-axis at x = 0, and b is the slope.
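The commands that produce the fit are not reproduced in these notes. A minimal sketch, assuming the marks are held in the vectors exam and ca exactly as printed earlier in this chapter (they are retyped here so that the block runs on its own), uses lm() for the fit and summary() for the printout; the name fit is illustrative.

```r
exam <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56, 52, 59, 15, 49, 54, 67, 44, 2, 47, 61, 45, 95, 62, 80, 46,
          52, 61, 12, 62, 69, 78, 62, 48, 56, 56, 58, 60, 0, 48, 71, 50, 90, 51, 53, 5, 51, 63, 35, 39, 10,
          57, 53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)
ca <- c(91.34390, 85.54622, 72.65543, 63.10473, 73.22074, 50.99642, 85.69151, 97.06528,
        18.58191, 83.30836, 78.78221, 77.68898, 21.07860, 76.04457, 76.56793, 86.90106,
        61.70048, 16.28892, 69.57387, 83.08058, 74.19594, 97.12300, 81.58833, 98.12345,
        60.17263, 79.49133, 89.35610, 27.89478, 98.06673, 92.34510, 96.19500, 88.69131,
        69.70333, 85.23094, 86.99767, 82.89807, 77.35877, 15.12655, 72.41332, 90.07670,
        75.20815, 97.17500, 65.78075, 70.29256, 14.20315, 73.02363, 87.38178, 52.74194,
        60.66164, 20.05529, 78.16085, 73.58862, 34.07182, 78.03601, 39.31353, 69.57565,
        77.53929, 77.20521, 52.67979, 89.10232, 76.78222, 54.16873, 40.23080, 81.09443,
        89.12518, 67.58763)

fit <- lm(exam ~ ca)   # model: exam = a + b * ca
summary(fit)           # prints residuals, coefficients and significance codes
coef(fit)              # just the intercept a and the slope b

plot(ca, exam)         # scatter plot of the data
abline(fit)            # add the fitted straight line to the plot
```

The formula exam ~ ca reads "exam modelled by ca"; the intercept is included automatically.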
Call:
lm(formula = exam ~ ca)
Residuals:
     Min       1Q   Median       3Q      Max
-10.9697  -3.1181  -0.7405   3.1036  22.8368
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.83639    2.21002  -4.903 6.77e-06 ***
ca            0.85458    0.03002  28.469  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R prints a lot of information that we’ll find out about in Chapter 20; for now all we need to know
are a = −10.83639 (intercept) and b = 0.85458 (coefficient multiplying ca), i.e. the fitted line is
exam = −10.83639 + 0.85458 × ca. Figure 2.5 shows the results of the straight line fitting.
[Figure 2.5: Scatter plot of exam against ca, with the fitted straight line.]
Finally, we can save all those commands:
> savehistory("20090508-3.txt")
# which we could load again at a later time with
> loadhistory("20090508-3.txt")
# but in any case, when you use q() to quit, R will offer you the
# option of saving and these saved commands will be loaded the
# next time you run R.
> q()
Save workspace image? [y/n/c]: y
Chapter 3
Averages
3.1 Introduction
This chapter gives a brief introduction to “averages”, or what statisticians call central tendency.
These are often, but not always, useful in summarising a set of data, especially when we wish to
compare the data set with another.
There are some pitfalls in using the common-or-garden average and we will note some of these.
The most familiar average value is the arithmetic mean, i.e. sum the values and divide by the
number of data. Just to get used to some mathematical notation, see A.2, we’ll write this as you’ll
see it in textbooks (the data are x_i, i = 1, ..., n):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{3.1} \]
R-Example 1 .
As before, we’ll read the data and print them. This time they are already sorted, so much easier
to read, even in list form.
We can compute the mean by summing and dividing, see below, but not unexpectedly, R has a
function mean that does it for us.
> sum(exam2)
[1] 1647
> length(exam2)
[1] 30
> sum(exam2)/length(exam2)
[1] 54.9
> mean(exam2)
[1] 54.9
>
R-Example 2. The following are a set of homework marks, marked out of 10. We read the data
in and print them. Then we produce a summarising table, marks versus frequency, which tells us
that we have three students with four (4) marks, three with five, six with six, etc.
If we were not using a computer, we might think that we have a quick way to compute the mean, we
have just five marks, namely, 4 5 6 7 8, so we’ll take the average of those 4 + 5 + 6 + 7 + 8 = 30,
so mean = 30/5 = 6. But R thinks differently:
> mean(hw)
[1] 6.15
The method we used works only if the frequencies are the same for each mark; it would be a rare
fluke if this were the case.
But we’ll pursue the matter further, because (a) computing an arithmetic mean using a frequency
table — done properly — can be a (correct) shortcut if you have a lot of numbers and just a
calculator or pencil and paper; (b) using frequencies prepares the ground for topics covered in later
chapters.
We’ll rewrite the table, now calling the data (marks) x; we’ll label them with i so that we have
x_i, i = 1, ..., n, with n = 5.
> table(hw)
hw
i =  1 2 3 4 5
----------------
x_i  4 5 6 7 8 # marks
f_i  3 3 6 4 4 # frequencies
>
If we want to use the frequency table, we have to replace eqn. 3.1 with
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}. \tag{3.2} \]
If we look at the sum divided by number calculation in R, we see that the frequency calculation
ends up with not only the same result, but the same division,
> length(hw)
[1] 20
> sum(hw)
[1] 123
> sum(hw)/length(hw)
[1] 6.15
If you look at the sum of f_i × x_i you will see that it is the same as
4 + 4 + 4 + 5 + ... + 8 + 8 + 8 + 8;
the sorted hw marks are below:
> sort(hw)
[1] 4 4 4 5 5 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8
And the sum of the frequencies is 20, i.e. the number of data.
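The frequency calculation of eqn. 3.2 can be sketched in R as follows. The hw marks are retyped from the sorted listing above; the names freq, x and f are illustrative, and weighted.mean() is shown only as a built-in cross-check.

```r
# Homework marks, retyped from the sorted listing.
hw <- c(4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8)

freq <- table(hw)               # frequency of each distinct mark
x <- as.numeric(names(freq))    # the distinct marks x_i: 4 5 6 7 8
f <- as.vector(freq)            # their frequencies f_i: 3 3 6 4 4

sum(f * x) / sum(f)             # eqn. 3.2: 123 / 20 = 6.15
weighted.mean(x, f)             # the same calculation, built in
mean(x)                         # the naive "average of the five marks": 6
```

The naive value differs from the true mean because it silently gives every mark the same weight.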
3.3 Median
Sometimes neither the mean nor the mode gives us what we would expect from a central value.
Look at the following speed data (speed of cars at a speed check). Here the mean, 37.1, is well off
the centre, and that offset is caused by an outlier, the 75. The offset would be a lot worse if the
outlier was 1000 — not likely in the case of speeds, but outliers of this magnitude are possible
in the case of some measurement systems. A common example is a mineralisation survey taken
across an area of land. For the sake of argument, assume that we are looking for zinc. A sample
that coincides with the dumping of an old bucket will produce a huge outlier. Now if we want to
produce contour plots based on smoothed values (averages over regions), then mean smoothing
will show a (false) hot-spot, while median smoothing will not.
> sp <- read.table("cars.txt", header = TRUE)
> attach(sp)
> speed
[1] 25 31 33 31 30 35 75
> mean(speed)
[1] 37.14286
The median gives the true central value. If we sort the speeds, we see that the central value (the
fourth) is 31; median gives the same result.
> sort(speed)
[1] 25 30 31 31 33 35 75
> median(speed)
[1] 31
> sort(speed)[4] # the fourth of the sorted values
[1] 31
In the example above there are seven values, so the central one is the fourth; if we had an even
number of values, we would take the average of the two central values.
It can be said that the median is a measure of central tendency that is robust against outliers.
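The rule for odd and even numbers of values, and the robustness against outliers, can be sketched as below. The function median_by_hand and the vector speed_bad are hypothetical names introduced for this illustration; the speed data are retyped from the listing above.

```r
speed <- c(25, 31, 33, 31, 30, 35, 75)

# Median by hand: sort, then take the middle value (odd n)
# or the average of the two middle values (even n).
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) s[(n + 1) / 2] else (s[n / 2] + s[n / 2 + 1]) / 2
}

median_by_hand(speed)       # seven values: the fourth sorted value
median_by_hand(speed[-7])   # six values (the 75 dropped): average of third and fourth

# Make the outlier far worse, as discussed in the text.
speed_bad <- replace(speed, 7, 1000)
mean(speed_bad)             # dragged a long way from the centre
median(speed_bad)           # unchanged by the size of the outlier
```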
3.4 Mode
Sometimes the mean does not give us what we would expect from a central value; for example,
in the homework example, the mean (6.15) gives us a value that appears nowhere in the original
data; that’s normally not a big deal, but it suggests the mode as a possible “average value”.
The mode is the most frequent value, i.e. obtained from a frequency table or from a histogram,
Figure 3.1.
> table(hw)
hw
x_i  4 5 6 7 8 # marks
f_i  3 3 6 4 4 # frequencies
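As far as I know, base R has no built-in function for this statistical mode (the function called mode reports the storage type of an object), so it has to be read off the frequency table. A minimal sketch, with freq and modes as illustrative names:

```r
# Homework marks, retyped from the sorted listing in the previous section.
hw <- c(4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8)

freq <- table(hw)   # frequency of each distinct mark

# Keep every mark whose frequency equals the largest frequency; more than
# one value comes back if two or more marks tie for most frequent.
modes <- as.numeric(names(freq)[as.vector(freq) == max(freq)])
modes
```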
[Figure 3.1: Histogram of hw; x-axis: hw, y-axis: Frequency.]
Multimodal Data Now that we’ve mentioned the mode, we’d better take the opportunity of
warning about multi-modal data.
File hw2.txt contains data which has two peaks in its histogram, Figure 3.2.
[Figure 3.2: Histogram of hw2; x-axis: hw2, y-axis: Frequency.]
We can calculate the mean, but does it convey much about the centre of the data? No, and
using the mean as such may be quite misleading. For example, an average of 6.15 may indicate
that the homework was, on average, completed satisfactorily; in fact, however, we had two sets of
results, one good, one poor, and the average of 6.15 adequately represents neither.
Multimodality is pretty obvious in that small and one-dimensional data set. In much larger data
sets and especially in multidimensional data, multimodality may be difficult to detect.
Much later, in Chapter 19, we’ll look at methods for separating multimodal data into different classes
or clusters.
Chapter 4
4.1 Introduction
This chapter introduces methods of describing data variability, most notably variance and standard
deviation.
We are now going to work through an example based on two examination results, exam3 and
exam4, see below.
> exam3
[1] 68 70 71 72 72 73 73 73 74 75 75 75 75 75 76 76 76 76 76 77 77 78 78 80 82
> exam4
[1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69 73
>
We are going to assume that these examinations are from two optional modules that final year
BSc Honours students can take, that is students take one or other of these modules and not both.
Final Honours classifications depend on these results; but we can see already that the students
who took exam3 are at an advantage; except for one, they all achieved first class honours in that
examination. If we assume that the exam3 students are as capable as the exam4 students, then
can we correct the imbalance? Before you start to be incredulous, this technique was practised at
a well-known university where I worked.
First of all let us look at the histograms, Figure 4.1 and the box-plots, Figure 4.2.
> hist(exam3)
> hist(exam4)
[Figure 4.1: Histograms of exam3 and exam4; y-axis: Frequency.]
> boxplot(exam3)
> boxplot(exam4)
[Figure 4.2: Boxplots of exam3 and exam4.]
The means confirm the difference.
> mean(exam3)
[1] 74.92
> mean(exam4)
[1] 55.48387
>
> diff <- mean(exam3) - mean(exam4)
> diff
[1] 19.43613
>
Can we shift one of the means so that the two data sets have the same mean?
> diff
[1] 19.43613
> exam4new <- round(exam4 + diff)
> exam4new
[1] 62 62 62 63 65 67 67 69 70 72 72 72 74 75 75 76 76 77 77 78 78 78 79 79 79
[26] 80 81 81 83 88 92
> fpdfsmall() # not defined in these notes; presumably a helper of mine that opens a graphics file
> hist(exam4new)
[Figure 4.3: Histograms of exam3 and exam4new (exam4 with its mean shifted); y-axis: Frequency.]
4.2.2 Variability and spread
That is a bit better, but there remains a greater spread in exam4new (mean shifted). Can we
quantify spread? range gives us the minimum and maximum, but we would like a single
number.
> range(exam3)
[1] 68 82
> range(exam4new)
[1] 62 92
>
From our experience with the mean, maybe we can take the mean (expected value) of deviations
from the means,
> mean(exam3 - mean(exam3))
[1] -1.705372e-15 # effectively zero
> mean(exam4new - mean(exam4new))
[1] -4.586385e-16
Not much good; from the definition of the mean we should have known in advance that these
means (or sums) of deviations would be zero — the negative deviations cancel the positive.
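The cancellation can be written out in one line from the definition of the mean in eqn. 3.1 (a short derivation added for reference; the tiny non-zero values printed by R are floating-point rounding):

```latex
\[
\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})
  = \frac{1}{n}\sum_{i=1}^{n} x_i \;-\; \frac{1}{n}\, n\, \bar{x}
  = \bar{x} - \bar{x}
  = 0 .
\]
```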
Squaring the deviations before taking the mean avoids this cancellation:
> mean((exam4new - mean(exam4new))^2)
[1] 53.6691
> mean((exam3 - mean(exam3))^2)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
The variance is the expected value of the squared deviations from the mean, see eqn. 4.1; the
built-in function to use in R is var.
\[ \mathrm{Var}(X) = E[(X - \mu)^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.1} \]
> var(exam3)
[1] 9.41
> var(exam4new)
[1] 55.45806
Immediately, we see that it is not an illusion that the variability of exam4new is much greater than
that of exam3. Note that the variance as calculated by var is slightly different from that calculated
using mean — we’ll return to that below.
The variance values, since they are sums of squares, give us a measure of squared variability; that
can be hard to interpret and use; what we want is the square-root of the variance, or the standard
deviation (sd in R), see eqn. 4.2,
\[ \sigma_X = \mathrm{SD}(X) = \sqrt{\mathrm{Var}[X]}. \tag{4.2} \]
> sqrt(var(exam4new))
[1] 7.447017
> sqrt(var(exam3))
[1] 3.067572
> sd(exam4new)
[1] 7.447017
> sd(exam3)
[1] 3.067572
>
Variance different from mean of squared deviations? We return to the problem of the variance
being different from the mean of squared deviations. The clue is given below,
> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/(length(exam3) - 1)
[1] 9.41
In fact, rather than eqn. 4.1, this particular implementation of var computes what is called the
sample variance using eqn. 4.3,
\[ \mathrm{Var}(X) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.3} \]
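The relation between the two divisors can be sketched as follows. pop_var is a hypothetical helper written for this illustration (it is not a built-in R function), and exam3 is retyped from the listing at the start of the chapter.

```r
exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)
n <- length(exam3)

# Hypothetical helper: variance with divisor n, as in eqn. 4.1.
pop_var <- function(x) mean((x - mean(x))^2)

pop_var(exam3)             # divisor n
var(exam3)                 # divisor n - 1, as in eqn. 4.3
var(exam3) * (n - 1) / n   # rescaling var() recovers the divisor-n value
```

For large n the factor (n - 1)/n is close to one, so the two values differ noticeably only for small samples.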
We now return to our desire to manipulate (fairly) the two data sets, exam3, exam4, such that
students in each class have roughly the same opportunity; see section 4.2.1 where we equalised
the means, but where we noted that the difference in variability remained a problem.
4.3.1 Standard Scores
The normal way to equalise data sets like these (the proper term is either standardise or normalise)
is to use the standard score as in,
\[ X_{ss} = \frac{X - \mu}{\sigma}. \tag{4.4} \]
Eqn. 4.4 gives a set of scores with mean zero and standard deviation one, µ_ss = 0, σ_ss = 1. Thus,
if we apply eqn. 4.4 to the two sets of marks, using the mean and standard deviation of each, we
get two sets of marks with the same mean (0) and the same spread (standard deviation 1).
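Eqn. 4.4 can be checked numerically on the two sets of marks. standard_score is a hypothetical helper written for this sketch; the built-in scale() is shown as a cross-check and, like sd(), uses the n - 1 divisor. The data are retyped from the listing at the start of the chapter.

```r
exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)
exam4 <- c(43, 43, 43, 44, 46, 48, 48, 50, 51, 53, 53, 53, 55, 56, 56, 57,
           57, 58, 58, 59, 59, 59, 60, 60, 60, 61, 62, 62, 64, 69, 73)

# Hypothetical helper implementing eqn. 4.4.
standard_score <- function(x) (x - mean(x)) / sd(x)

z3 <- standard_score(exam3)
z4 <- standard_score(exam4)

c(mean(z3), sd(z3))   # mean effectively zero, standard deviation one
c(mean(z4), sd(z4))   # the same for the other class

# scale() returns a one-column matrix; as.vector() strips that structure.
all.equal(as.vector(scale(exam4)), z4)
```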
That is fine for purely comparison purposes, but what if we need marks to publish? What we are
going to do is: (i) use eqn. 4.4 to standardise the scores; then (ii) multiply by whatever (new)
standard deviation, call it σ_new, that we require; finally, (iii) add the (new) mean, µ_new, that we
require. The whole operation is given in eqn. 4.5,
\[ X_{new} = \frac{X_{old} - \mu_{old}}{\sigma_{old}} \times \sigma_{new} + \mu_{new}. \tag{4.5} \]
We’ll now apply this to exam4, i.e. we want to make exam4 as close as possible to exam3 (in terms
of mean and standard deviation).
> sd3 <- sd(exam3)
> sd3
[1] 3.067572
> m3 <- mean(exam3)
> sd4 <- sd(exam4)
> sd4
[1] 7.447017
> m4 <- mean(exam4)
> m4
[1] 55.48387
> m3
[1] 74.92
> exam4new <- round(((exam4 - m4)/sd4)*sd3 + m3)
> exam4new
[1] 70 70 70 70 71 72 72 73 73 74 74 74 75 75 75 76 76 76 76 76 76 76 77 77 77
[26] 77 78 78 78 80 82
> mean(exam3)
[1] 74.92
> mean(exam4new)
[1] 74.96774 # difference due to rounding
> sd(exam3)
[1] 3.067572
> sd(exam4new)
[1] 2.99426 # difference due to rounding
>
And let us compare the histograms in Figure 4.4
Figure 4.4: Histograms of exam3 and exam4new (exam4 equalised with exam3).
Chapter 5
5.1 Introduction
This chapter gathers together some basic definitions, symbols and terminology to do with
probability, random variables, and random processes; the topics are chosen according to their applicability
to basic statistics for bio-scientists, as well as pattern recognition, image processing and data
compression. We will use some of the notation from Appendix A; you should have a quick look at that
first. We emphasise that such notation is merely shorthand for common sense concepts which
would otherwise be confusing and long-winded if written in English.
5.2.1 Introduction
\[ 0 \le p_i \le 1, \tag{5.1} \]
\[ \sum_{i=1}^{n} p_i = 1. \tag{5.2} \]
The above simple definition of probability over outcomes is satisfactory for simple applications, but
for many applications we need to extend it to apply to subsets of Ω.
We could call the outcomes above elementary events, i.e. indivisible events, and we could call the
subsets below composite, i.e. they are a composition of one or more outcomes.
Ω is often called the sample space, i.e. as defined above, the set of all possible outcomes of the
experiment. Elements of Ω are called outcomes, sample outcomes, or realisations. One of the
problems of learning probability and statistics is the confusion caused by the multiplicity of terms
for the same concept. In addition, different fields of study, e.g. bio-science, engineering, social
science, . . . have their own terminology.
Example 1 Six sided dice. Ω = {i | i ∈ {1, . . . 6}} = {1, 2, . . . 6}.
Let there be subsets of Ω called events, with a general event a_i; the set of all the a_i is A. We define
a probability measure P on A; P(a) is a number and satisfies the following axioms (eqn. 5.5 holds for disjoint a_i):
$$P(a) \geq 0, \tag{5.3}$$
$$P\left(\bigcup_{i=1}^{\infty} a_i\right) = \sum_{i=1}^{\infty} P(a_i). \tag{5.5}$$
Disjoint (subsets) is another term for mutually exclusive, i.e. they cannot possibly happen together.
Example 4 Six sided dice. Ω = {1, 2, . . . 6}. Let a be the event score greater than three; i.e.
a = {4, 5, 6}.
Example 5 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, . . . 6}}. Let a be the event score less
than four. Then a = {(1, 1), (1, 2), (2, 1)}.
5.2.3 A Point on Terminology
Above we have P (ai ) for probability that the outcome is in set ai . “The outcome is in set ai ” is
what is called a proposition. A proposition is a sentence which may be true or false — but only
one or the other and not in between.
We should note that in most textbooks and later in these notes the arguments of probability
functions, P (.) will be propositions, e.g. P (A) means the probability that A will occur, or that A
will be true.
Then, when we write P (AB) or P (A, B) (they mean the same), we mean probability of A and B
being both true; logical and.
Not or set complement We may want to talk about the probability that A will be false, i.e. the
probability that the outcome will be in the complement set to A, i.e. any of the outcomes in Ω
that are not in A's set. Not A is denoted Ā.
Example 6 Six sided dice. Ω = {1, 2, . . . 6}. Let A = {1, 2, 3, 4}, so Ā = {5, 6}.
P(Ā) = 1 − P(A) = 1 − 4/6 = 2/6 = 1/3.
We saw in eqn. 5.5 that to compute the probability of the union of two disjoint events you can add probabilities.
For events A and B that are not necessarily disjoint (there may be overlap), we can write
$$P(A \cup B) = P(A) + P(B) - P(AB). \tag{5.8}$$
Example 7 Six sided dice. Ω = {1, 2, . . . 6}. Let A = {1, 2, 3, 4} and B = {4, 5}; so A ∪ B =
{1, 2, 3, 4, 5} and A ∩ B = {4}.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/6 + 2/6 − 1/6 = 5/6, and we can see that, computed directly,
P(A ∪ B) = P({1, 2, 3, 4, 5}) = 5/6.
We note that eqn. 5.8 collapses to eqn. 5.5 when AB is false (no overlap, the two cannot be true
together), because of eqn. 5.6, i.e. P (∅) = 0, and
$$P(A \cup B) = P(A) + P(B) - P(\emptyset) = P(A) + P(B) - 0 = P(A) + P(B).$$
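The addition rule for overlapping events can be checked by direct counting. The sketch below uses the sets of Example 7 and assumes equally likely outcomes; the helper name prob is ours.

```r
omega <- 1:6
A <- c(1, 2, 3, 4)
B <- c(4, 5)

# Probability of an event by counting, assuming equally likely outcomes.
prob <- function(event) length(event) / length(omega)

direct <- prob(union(A, B))                          # count A or B directly
by_rule <- prob(A) + prob(B) - prob(intersect(A, B)) # addition rule
c(direct = direct, by_rule = by_rule)                # both are 5/6
```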
5.2.5 Finite Sample Spaces
In Example 1 we could identify and list all possible outcomes and we have a finite sample space.
On the other hand, if the outcome was a weight, for example of a precipitate, then we could not
list all possible weights and we would have an infinite sample space.
5.3 Random Variables
If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable
(r.v.). X is a function over the set Ω = {ω1 , ω2 , . . .} of outcomes; if the range of X is the real
numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set,
then X is a discrete r.v. Chapter 6 contains an extensive discussion on random variables and an
introduction to probability distributions.
5.4 Computing probabilities
We have already done this in examples, but we need to formalise a bit. The number of elements
in a (finite) set, say a, is called its cardinality and written |a|.
If the outcomes are equally likely (which {1, 2, . . . 6} are), then we can compute the probability of
an event a as the ratio:
$$P(a) = \frac{|a|}{|\Omega|}. \tag{5.9}$$
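As a sketch of eqn. 5.9 in R, we can enumerate a sample space and count. The events are those of Examples 4 and 5; the variable names are ours, and equally likely outcomes are assumed.

```r
# One dice: event "score greater than three".
omega1 <- 1:6
a1 <- omega1[omega1 > 3]
p1 <- length(a1) / length(omega1)       # |a| / |Omega| = 3/6

# Two dice: event "total score less than four".
omega2 <- expand.grid(i = 1:6, j = 1:6) # all 36 ordered pairs (i, j)
a2 <- omega2[omega2$i + omega2$j < 4, ] # (1,1), (2,1), (1,2)
p2 <- nrow(a2) / nrow(omega2)           # 3/36
```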
5.5.1 Multiplication of outcomes
Let an event correspond to the combined outcomes of two experiments performed in sequence.
Let the first have n1 outcomes and the second n2 outcomes.
Any of the n1 outcomes of the first may be followed by any of the n2 outcomes of the second, so
the number of outcomes in the combined experiment is n1 × n2 .
Example 10 Toss two six sided dice in sequence (but the result is the same if we throw them
together). n1 = |Ω1 | = 6, n2 = |Ω2 | = 6, so, for the combined experiment, |Ω| = n1 × n2 = 36,
which we can also compute by counting the elements in Ω = {(i, j) | i, j ∈ {1, . . . 6}}.
5.5.2 Addition of outcomes
Suppose again that we have two experiments. Let the first have n1 outcomes and the second n2
outcomes. This time we perform the first experiment or the second, but not both and which of
them gets performed is chosen randomly; how many outcomes?
We have n1 outcomes of the first, or the n2 outcomes of the second, so the total number of
outcomes in the combined experiment is n1 + n2 .
Example 11 Toss one six sided dice or toss a two sided coin. n1 = |Ω1 | = 6, n2 = |Ω2 | = 2,
so, for the combined experiment, |Ω| = n1 + n2 = 8, which we can also compute by counting the
elements in Ω = {1, 2, 3, 4, 5, 6, H, T }.
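The two counting rules can be illustrated by building the sample spaces of Examples 10 and 11 explicitly. This is a minimal sketch; the variable names are ours.

```r
dice <- 1:6
coin <- c("H", "T")

# Multiplication: two dice in sequence, every pairing is possible.
both <- expand.grid(first = dice, second = dice)
n_mult <- nrow(both)                 # 6 * 6 = 36

# Addition: toss the dice OR the coin, so the outcome lists are pooled.
either <- c(as.character(dice), coin)
n_add <- length(either)              # 6 + 2 = 8
```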
5.5.3 Permutations
Suppose we have n items and we wish to place them in a sequence — just any sequence, not
ordered according to size or any other attribute. How many ways to do this?
The first position may be filled by any of the n items; the second position may be filled by any of the
remaining n − 1 items, and so on, so that the number of possible different sequences (orderings) is
$$n(n-1)(n-2)\cdots 1 = n!. \tag{5.10}$$
Suppose now we have n items and we wish to choose any r of them and place these in a sequence. How
many ways to do this? The first position may be filled by any of the n items; the second position
may be filled by any of the remaining n − 1 items, and so on until we have r in the sequence. The
number of possible different sequences (orderings) is
$$n(n-1)(n-2)\cdots(n-(r-1)) = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} = {}^{n}P_{r}. \tag{5.11}$$
${}^{n}P_{r}$ is the name for the number of permutations of r from n.
5.5.4 Combinations
Suppose again we have n items and we wish to choose any r of them, but we do not need to place
the r in a sequence. How many ways, ${}^{n}C_{r}$, are there to do this? We can appeal to eqns. 5.11 and 5.10:
$$\frac{n!}{(n-r)!} = {}^{n}C_{r} \times (\text{number of ways of permuting } r) = r!\,{}^{n}C_{r},$$
which leads to
$${}^{n}C_{r} = \frac{n!}{r!(n-r)!} = \binom{n}{r}. \tag{5.12}$$
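A short sketch of the permutation and combination counts in R, using the base functions factorial() and choose(). The values n = 5 and r = 3 are arbitrary choices for illustration.

```r
n <- 5
r <- 3

n_orderings <- factorial(n)                 # all orderings of n items: 120
n_perm <- factorial(n) / factorial(n - r)   # ordered choices of r from n: 60
n_comb <- choose(n, r)                      # unordered choices of r from n: 10

# Each unordered choice can be ordered in r! ways.
n_perm == factorial(r) * n_comb
```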
Example 12 Ω = {1, 2, 3, 4, 5, 6}. I throw the dice. What is the probability of getting greater-
than-three, P (> 3)? Let A be greater-than-three so that A = {4, 5, 6}, and the cardinality of
this set is nA = |A| = 3, and ndice = |Ω| = 6, see section 5.4; there are three possibilities
greater-than-3, so P (A) = P (> 3) = nA /ndice = 3/6 = 1/2.
Now, I have a peek and I tell you that we have an odd number; let us call this event B (odd). What
now is the probability of A (> 3)? The probability surely has changed because the only possibilities
now are B = odd = {1, 3, 5}. Within this set, 5 is the only (one) possibility that satisfies greater-
than-three, so, forgetting about any ideas we had before, we say that the conditional probability
of greater-than-three, given that we already know that an odd number has occurred, is 1/3, i.e. the
probability has fallen from 1/2 to 1/3 based on the information that an odd number has occurred.
We write this P(> 3 | odd), the conditional probability of (> 3) conditional on the fact that we
already know that an odd number has occurred.
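The reasoning of Example 12 can be reproduced by counting within the reduced sample space. This is a sketch with our own variable names, assuming a fair dice.

```r
omega <- 1:6
gt3 <- omega[omega > 3]          # {4, 5, 6}
odd <- omega[omega %% 2 == 1]    # {1, 3, 5}

p_gt3 <- length(gt3) / length(omega)                          # 1/2
# Given odd, the sample space shrinks to the odd outcomes only.
p_gt3_given_odd <- length(intersect(gt3, odd)) / length(odd)  # 1/3
```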
Venn diagrams, see section A.1.4, can be used to think about conditional probabilities such as
the one in Example 12. Here Ω = {1, 2, 3, 4, 5, 6} corresponds to the universal set (the set of all
possibilities).
Once we have been told that the number is odd, we can reduce our sample space to set odd; then
odd ∩ (> 3) = {5}.
Example 13 If after hearing first that we have an odd number, then secondly we are told that
greater-than-three has occurred, we are then asked (a) what is the probability of a six?, (b) what
is the probability of a five?
Think about it, once we have the two pieces of information: odd, then greater-than-three, the
possibilities are very greatly reduced. To what?
[Figure: the six dice outcomes laid out in two rows, 1 3 5 (odd) and 2 4 6 (even).]
Figure 5.1: Dice: (a) universal set; (b) sets odd, even; (c) sets (> 3) and (≤ 3) superimposed
to show that, for example, odd & (> 3) = (set-odd) ∩ (set > 3) = {1, 3, 5} ∩ {4, 5, 6} = {5}.
Probability trees, see (Griffiths 2009, p. 158), are another way to think graphically about condi-
tional probability. In mathematics, trees can grow sideways or even upside down.
When we split into branches as in Figure 5.2, any branching must represent all possibilities; in this
case we first have odd and even; if we call odd B, we have even = not-odd = B̄. In the diagram
we have no bar symbol, so we use B′ = B̄. Next we have (> 3) and (≤ 3).
Thus, at any branching the probabilities in the branches must sum to one.
The diagram shows how to compute joint probabilities using conditional probabilities and the
probability of the conditioning event, for example P (> 3 & odd) = P (> 3 | odd) × P (odd).
The following may help us to think about conditional probability and joint probability. Think of the
tree as having probability flowing in its branches.
We start off at the root with all the probability (one, 1); proportions of the probability flow into
the first set of branches (the proportions sum to one); follow one of those branches, at the next
branching point, we split the remaining probability into proportions that again sum to one (it is
just the proportions that sum to one, if there is, for example, 0.4 flowing into the branching point,
and the proportions are 0.4, 0.4, 0.2 — three-way branch, then we will have probability flows of
0.16, 0.16, 0.08). And so on.
[Probability tree; recovered branch labels: "odd has occurred", then "> 3 and odd has occurred".]
Figure 5.2: Probability tree for the dice example. We start off on the left with the root and
everything possible. Then we split into branches odd and even. Next we split odd into (> 3) and
(≤ 3); same for the even branch.
[Figure 5.3, general probability tree; recovered labels: the root splits into B (we know B has occurred), with probability P(B), and B′ (not B, i.e. B has not occurred), with probability P(B′). The B branch splits into A, giving P(A & B) = P(AB) = P(A | B) × P(B), and A′ (not A), giving P(A′B) = P(A′ | B) × P(B). The B′ branch splits into A, giving P(AB′) = P(A | B′) × P(B′), and A′, giving P(A′B′) = P(A′ | B′) × P(B′).]
Symbolically, and referring to Figure 5.3: suppose we have proportion P(B) in a branch, which then
splits into proportions P(A|B) and P(Ā|B) (these relative proportions again sum to one,
but their total probability sums to whatever flowed into the branching point). Then the P(A|B)
branch must carry an absolute amount of probability equal to P(A|B) × P(B), and this is P(AB).
Formula for Conditional Probability We now give the formula for computing conditional probabilities,
$$P(A|B) = \frac{P(AB)}{P(B)}. \tag{5.13}$$
Sometimes we write P(AB), sometimes P(A&B), sometimes P(A and B), and sometimes, using
set notation, P(A ∩ B).
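If we reverse the conditionality in eqn. 5.13 and note that P(AB) = P(BA), we have
$$P(B|A) = \frac{P(AB)}{P(A)}, \tag{5.15}$$
leading to
$$P(AB) = P(B|A)\,P(A),$$
so that, since eqn. 5.13 also gives P(AB) = P(A|B) P(B),
$$P(A|B)\,P(B) = P(B|A)\,P(A),$$
leading to Bayes' rule:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}. \tag{5.19}$$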
Example 14 Let A be has disease-X; let B be has swollen ankles. From a sample of former
disease-X patients, we can estimate P (B|A); say it is P (B|A) = 0.3. Let us assume that we also
know the proportion of the general population that have swollen ankles, P (B) = 0.01. Also we
assume that we have the incidence of disease-X in the general population, P (A) = 0.005.
Eqn. 5.19 allows us to compute the probability that the patient has disease-X given that the swollen
ankles symptom (B) is present, P(A|B). Of course, in general, P(A|B) ≠ P(B|A).
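With the figures of Example 14, Bayes' rule, P(A|B) = P(B|A) P(A) / P(B), can be evaluated directly. This is a sketch; the variable names are ours.

```r
p_b_given_a <- 0.3    # P(B|A): swollen ankles among disease-X patients
p_a <- 0.005          # P(A): incidence of disease-X
p_b <- 0.01           # P(B): swollen ankles in the general population

# Bayes' rule.
p_a_given_b <- p_b_given_a * p_a / p_b
p_a_given_b           # 0.15
```

So, on these assumed figures, P(A|B) = 0.15, quite different from P(B|A) = 0.3.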
Bayes’ rule may be written in a more general manner. First we need a result called the law of total
probabilities.
Let A1, A2, . . . , An be a partition of Ω (see section 5.2.2 for a definition of partition), then
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i). \tag{5.21}$$
Using eqn. 5.21 for P(B), Bayes' rule becomes
$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\,P(A_j)}. \tag{5.22}$$
Let us return to Example 14 and apply eqn. 5.22. When we said proportion of the general population
that have swollen ankles, P(B) = 0.01, we strictly meant the probability for people with disease-X
together with those without disease-X = 0.01. We can restate the problem with A1 = has disease-X
and A2 = does not have disease-X, so that they form a partition of the general population.
Assume that we now have P(B|A2) = 0.01 (i.e. we are changing the story slightly to associate
this probability with people who do not have disease-X) and, as before, P(B|A1) = 0.3; we need
also P(A1) = 0.005, as before. What is P(A2)? It is P(Ā1) (the probability that a general person does
not have disease-X) and this is 1 − P(A1) = 0.995.
Eqn. 5.21 now gives a revised figure for P(B),
$$P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = 0.3 \times 0.005 + 0.01 \times 0.995 = 0.01145.$$
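The law of total probability and the general form of Bayes' rule vectorise naturally in R. The sketch below uses the restated figures for Example 14; the vector names prior and lik are ours.

```r
prior <- c(disease = 0.005, no_disease = 0.995)  # P(A1), P(A2)
lik   <- c(disease = 0.3,   no_disease = 0.01)   # P(B|A1), P(B|A2)

# Law of total probability: sum over the partition.
p_b <- sum(lik * prior)          # 0.0015 + 0.00995 = 0.01145

# General Bayes' rule: each term divided by the total.
posterior <- lik * prior / p_b
posterior["disease"]             # about 0.131
```

The posterior probabilities over the partition sum to one, which is a useful sanity check.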
5.8 Independent Events
We have already discussed disjoint events, i.e. events which cannot occur simultaneously; thus,
for disjoint events A and B, A ∩ B = ∅. Consequently, we can state that P(A|B) = 0 (if B has occurred,
A cannot).
At the opposite extreme, let A ⊂ B, i.e. A is a subset of B and if A has occurred, then so must
B, with certainty, so in this case P (B|A) = 1.
Example 15 Ω = {1, 2, 3, 4, 5, 6}. Let B = {2, 4, 6} (even number) and A = {6}. If we know
that a 6 has been thrown (A has occurred), what is P (B|A)? The answer is 1 — we know that 6
is even so B is a sure thing — in punter parlance :-).
But there are cases where A and B are totally unrelated — they are independent events.
Example 16 Throw a dice (1) and toss a coin (2). Ω1 = {1, 2, 3, 4, 5, 6}, Ω2 = {H, T } and
the combined sample space Ω = {(1, H), (1, T ), (2, H), . . . , (6, H), (6, T )} and |Ω| = 12. Let
A = {4, 6} and B = {H}, so that AB = {(4, H), (6, H)} (two out of 12 equally likely events), so
P(AB) = 1/6. Also P(A) = 1/3, P(B) = 1/2.
$$P(B|A) = \frac{P(AB)}{P(A)} = \frac{1/6}{1/3} = \frac{1}{2}.$$
Because the result of the dice throw is unrelated to the result of the coin toss we are not surprised
to find that
$$P(B|A) = P(B) = \frac{1}{2}.$$
$$P(B|A) = P(B) = \frac{P(AB)}{P(A)},$$
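The independence of A and B in Example 16 can be checked by enumeration: for independent events P(AB) equals P(A) × P(B). This is a sketch assuming all 12 combined outcomes are equally likely; the variable names are ours.

```r
# Combined sample space of a dice throw and a coin toss (12 outcomes).
omega <- expand.grid(dice = 1:6, coin = c("H", "T"))

in_A <- omega$dice %in% c(4, 6)   # event A: dice shows 4 or 6
in_B <- omega$coin == "H"         # event B: coin shows heads

p_A  <- mean(in_A)                # 4/12 = 1/3
p_B  <- mean(in_B)                # 6/12 = 1/2
p_AB <- mean(in_A & in_B)         # 2/12 = 1/6

p_B_given_A <- p_AB / p_A         # equals p_B for independent events
```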
5.9 Betting and Odds
In circumstances where the terms have meaning, the probability of A can be computed as the ratio
of the number of equal probability events favourable to A, n_A, to the total number of equal
probability events, n_T,
$$P(A) = \frac{n_A}{n_T}.$$
Odds, on the other hand, are computed as the ratio of the number of equal probability events
favourable to A, n_A, to the number of equal probability events unfavourable to A, n_Ā,
$$O(A) = \frac{n_A}{n_{\bar{A}}}.$$
Thus, the probability of a 1 on the throw of a dice is 1/6, whilst the odds are 1/5; bookmakers express
this as five-to-one against.
The probability for any number less than five (1–4) would be 4/6, whilst the odds are 4/2 = 2/1;
bookmakers express this as two-to-one on.
$$P(A) = \frac{O(A)}{1 + O(A)}. \tag{5.26}$$
Thus, for any number less than five (1–4) on a dice throw,
$$P(A) = \frac{O(A)}{1 + O(A)} = \frac{2/1}{1 + 2/1} = \frac{2}{3}.$$
$$O(A) = \frac{P(A)}{1 - P(A)}, \tag{5.27}$$
$$O(A) = \frac{1/6}{1 - 1/6} = \frac{1}{5}.$$
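Eqns. 5.26 and 5.27 are inverses of one another, which is easy to confirm with two one-line functions. The function names below are ours.

```r
# Odds (favourable : unfavourable, as a single ratio) to probability.
odds_to_prob <- function(o) o / (1 + o)
# Probability to odds.
prob_to_odds <- function(p) p / (1 - p)

odds_to_prob(2 / 1)   # less than five on a dice: 2/3
prob_to_odds(1 / 6)   # a 1 on a dice: 1/5

# Round trip returns the starting value.
prob_to_odds(odds_to_prob(0.25))
```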
Bookmakers' odds and probabilities Bookmakers' "probabilities" do not add to 1, unlike proper
probabilities, which add to one over all possible events, see eqn. 5.2.
Let's say we have four horses, each with an equal probability of winning (P(A_i) = 1/4, for i =
1, 2, 3, 4). We would expect odds of
$$O(A) = \frac{1/4}{1 - 1/4} = \frac{1}{3},$$
or three-to-one against. But the bookmaker has to make a living, and not just provide a mutual
service for his punters. In this case, if four punters bet 10 Euro on each horse (bookie gets 40
Euro), one punter gets paid 30 Euro plus his stake returned = 40 Euro, and the bookie makes
nothing for his work.
The bookie is likely to give odds of something like two-to-one against, O′(A) = 1/2, and, computing
probabilities, we find
$$P'(A) = \frac{O'(A)}{1 + O'(A)} = \frac{1/2}{1 + 1/2} = \frac{1}{3},$$
so the four "probabilities" sum to 4/3 rather than 1.
In this amended case, if four punters bet 10 Euro on each horse (bookie gets 40 Euro), one punter
gets paid 20 Euro plus his stake returned = 30 Euro, and the bookie makes 10 Euro.
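The bookmaker's arithmetic for the four-horse example can be sketched as follows. The function bookie_profit and its argument odds_against (the "x-to-one against" figure, i.e. 1/O(A) in the notation above) are our own names.

```r
stake <- 10                    # each of four punters bets 10 Euro
n_horses <- 4
takings <- stake * n_horses    # 40 Euro taken in

# Profit when the winner is paid at "odds_against-to-one against".
bookie_profit <- function(odds_against) {
  payout <- stake * odds_against + stake   # winnings plus returned stake
  takings - payout
}

profit_fair   <- bookie_profit(3)   # fair odds: 40 - 40
profit_quoted <- bookie_profit(2)   # quoted odds: 40 - 30

# Implied "probabilities" at two-to-one against sum to more than one.
overround <- n_horses * (1 / (1 + 2))
```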
In many books and discussions you will see a distinction made between the classical and the
Bayesian interpretation of probability; also, in this context the term frequentist may be used as a
synonym for classical. As an interpretation of probability, the term Bayesian has little to do with
Bayes’ rule, section 5.7, that is until we get to statistical inference, Chapter 10.
Bayesian (belief) interpretation Take the case of the tossed (fair) dice. If you were asked to
rate, on a scale of [0, 1], your belief that 2 will be the outcome, you would, I hope, agree that
the probability is 1/6; for an even number of dots: 3/6 = 1/2; and for any number 1–6, a sure thing,
the probability is 1.
Relative frequency interpretation The frequentist says that the probability of 2 is the relative
frequency with which 2 occurs in a large number of hypothetical throws.
Let us then run an experiment involving a large number (n = 600) of throws, and let y_i = the
count of outcome i obtained. We might expect to obtain something like y1 = 95, y2 = 110, y3 =
90, y4 = 97, y5 = 105, y6 = 103. We then use p̂(i) = y_i/n; the hat, ˆ, indicates that p̂(i) is an
approximation to p(i); however, p̂(i) → p(i) as n → ∞.
We have p̂(i) = y_i/n = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} =
{0.158, 0.183, 0.15, 0.162, 0.175, 0.172}. The correct value is p(i) = 1/6 = 0.1667.
The errors above are not a real indictment of the frequentist method; a thought experiment allows
us to reason that p(i) = 1/6.
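The 600-throw experiment is easy to simulate with sample(). This is a sketch: the seed is an arbitrary choice made only so that the run is repeatable, and the counts obtained will differ from the illustrative ones quoted above.

```r
set.seed(1)                    # arbitrary seed, assumed for repeatability
n <- 600
throws <- sample(1:6, size = n, replace = TRUE)  # fair dice, n throws

# Counts y_i for each face; factor() keeps faces that never occur.
y <- table(factor(throws, levels = 1:6))
p_hat <- as.numeric(y) / n     # relative frequencies, estimates of 1/6

p_hat
max(abs(p_hat - 1/6))          # largest error in this run
```

Increasing n should, in general, shrink the largest error, in line with p̂(i) → p(i) as n → ∞.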
On the other hand, when you want to bet on a football match and would like to estimate the
probability and hence the odds, it makes no sense to think of an infinity of matches.
Chapter 6
6.1 Introduction
We have already introduced the notion of a random variable in section 5.3, i.e. where we associate
a number with the outcome of an experiment governed by probability.
In most cases, your (scientific) data will already be numerical, but it nonetheless remains worthwhile
to be cognisant of the details of probability and sample space described in Chapter 5.
In some of the examples in Chapter 5, namely those involving the dice, the outcome already is a
number, i.e. {1, . . . , 6}; in some considerations, this number is more a label than a number, but
in any case, the association of a number with the outcome is made trivial. In the coin example we
had {H, T }; in this case we could use the association {H → 1, T → 0}.
If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable
(r.v.). X is a function over the set Ω = {ω1 , ω2 , . . .} of outcomes; if the range of X is the real
numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set,
then X is a discrete r.v. The space of all possible values of X is called the range space of X, RX .
In discussing random variables we label the r.v. with an upper case letter, e.g. X, but particular
values of it are labelled with lower case, e.g. x, or xi .
Example 17 Toss two coins. Ω = {T T, T H, HT, HH}. Let a r.v. X be defined as the number of
heads in the outcome, i.e. {T T → 0, T H → 1, HT → 1, HH → 2}. Notice that two outcomes
map to the same number (1); this is not a problem or a mistake. RX = {0, 1, 2}.
Suppose we have an event B with respect to a range space R_X. Let the event A with respect to Ω be
defined as
$$A = \{\omega \in \Omega \mid X(\omega) \in B\}. \tag{6.1}$$
Then A and B are equivalent events and we can carry the definitions and equations of Chapter 5
over to random variables.
Example 18 Two coins as in Example 17. Examples of equivalent events are: A = {T T }, B = {0};
A = {T H, HT }, B = {1}; A = {HH}, B = {2}.
Let a r.v. X have a range space RX = {x1 , x2 , . . . , xn }. We denote the probability of a particular
value X = xi as pX (xi ) = P (X = xi ). The probabilities pX (xi ), i = 1, 2, . . . , n, in keeping with
eqns. 5.3 and 5.4, must satisfy
$$p_X(x_i) \geq 0, \quad i = 1, 2, \ldots, n, \tag{6.3}$$
$$\sum_{i=1}^{n} p_X(x_i) = 1. \tag{6.4}$$
pX is called the probability function or the probability mass function of the r.v. X. We’ll attempt to
standardise on probability mass function and its abbreviation pmf. We use the shorthand X ∼ pX
to state that the r.v. X has a pmf pX . Often, where there is no ambiguity, you will find the
subscript X omitted — pX (x) → p(x).
This section identifies and describes the pmfs of some commonly occurring discrete random vari-
ables.
6.3.2 Discrete Uniform Distribution
$$p_X(x) = \frac{1}{k}, \text{ for } x = 1, \ldots, k; \text{ and } 0 \text{ elsewhere}. \tag{6.6}$$
Example 20. Lottery machine, k balls. First draw, X ∼ U(1, k).
Let X be the result of a (binary outcome) experiment with probability p of one outcome, X = 1,
say, and 1 − p for the other, X = 0; for example a coin flip. There’s overuse of the symbol p
here, but we need to keep to standard notation; context should resolve any ambiguities between
the parameter p = P(X = 1) and the pmf p_X(x).
Repeat the experiment above (Bernoulli distribution — coin flip) n times and let X be the number
of 1s (e.g. heads) obtained.
$$p_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \ 0 \text{ otherwise}. \tag{6.8}$$
Where does the $\binom{n}{x}$ come from? We have already introduced it in eqn. 5.12; it is the number
of ways of selecting x items from n. The probability of the x 1s is p^x and the probability of
the n − x 0s is (1 − p)^(n−x); the flips are independent so we can multiply the probabilities
to get p^x (1 − p)^(n−x) for any one particular sequence. However, there are
$\binom{n}{x} = \frac{n!}{x!(n-x)!}$ possible ways of getting the X = x 1s.
Take n = 3; the sample space is Ω = {T T T, T T H, T HT, T HH, HT T, HT H, HHT, HHH} and the
event corresponding to x = 2 (two heads, any two heads) is A = {T HH, HT H, HHT }, i.e. there
are three outcomes that give two heads.
$$\binom{n}{x} = \binom{3}{2} = \frac{3!}{2!\,1!} = \frac{6}{2} = 3.$$
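The binomial pmf can be evaluated from the formula and compared with R's built-in dbinom(). This sketch uses n = 3 flips of a fair coin (p = 0.5 is our assumption).

```r
n <- 3
p <- 0.5          # assumed: fair coin
x <- 0:n

# Formula: (number of arrangements) * (probability of one arrangement).
pmf_manual <- choose(n, x) * p^x * (1 - p)^(n - x)
# Built-in binomial pmf.
pmf_builtin <- dbinom(x, size = n, prob = p)

rbind(x, pmf_manual, pmf_builtin)   # 1/8, 3/8, 3/8, 1/8 for x = 0..3
```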
Example 21. Distribution of the number of coin flips until the first head.
6.3.6 Poisson Distribution
$$p_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \geq 0. \tag{6.10}$$
Example 22 . Distribution of rare events like traffic accidents; there can be long periods of
inactivity, but clumping of events is possible, e.g. waiting a long time for a town bus and three
arrive in quick succession!
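The Poisson pmf can likewise be computed from the formula p(x) = e^(−λ) λ^x / x! and checked against dpois(). The rate λ = 2 is an arbitrary value chosen for illustration.

```r
lambda <- 2       # assumed rate, for illustration only
x <- 0:5

pmf_manual  <- exp(-lambda) * lambda^x / factorial(x)
pmf_builtin <- dpois(x, lambda)

round(rbind(x, pmf_manual, pmf_builtin), 4)
```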
This section identifies and describes the probability density functions of some commonly occurring
continuous random variables. First we must introduce a continuous alternative to the probability
mass function.
• the random variable is now continuous, i.e. the elements of the range space are not countable;
• the probability of any particular value of the r.v. is in fact zero. Example: you buy 0.5-kg
of cheese in Tesco; what is the chance of it being exactly 0.5-kg? Zero. Same goes for
the weight of a product of a chemical experiment. Hence we cannot use probability mass
functions.
We now must use a different probability function called a probability density function (pdf). A pdf,
over a range space RX , must satisfy (c.f. eqns. 6.3 and 6.4 for discrete r.v.’s)
$$\int_{R_X} f_X(x)\,dx = 1. \tag{6.12}$$
We emphasise that f_X(x) is not a probability, but f_X(x)dx is. If you want to speak of a probability
over a continuous r.v. you must state something like the probability that X is in the range a to b,
inclusive, is P(a ≤ X ≤ b), i.e. $\int_a^b f_X(x)\,dx$.
The term probability density function is used (in contrast to probability mass function (for discrete
r.v.’s)) because, with a continuous r.v. you simply cannot pick a value (X = x), say, and state
P (X = x), which is in fact zero.
Discrete probability mass versus Continuous probability density Think of a ruler upon which
we place (stick with Blu-Tack) ball bearings of various sizes along its length; the ball bearings
represent discrete masses and we can state that we have a mass m1 at ruling x1; we can also
compute the total mass as $\sum_i m_i$.
Now think of a rod of varying diameter laid along the ruler; we cannot pick a point x and say that
the mass at precisely that point is m(x), but we can say that the mass in a little length, x to x + ∆x,
is d(x)∆x, where d(x) is the mass per unit length at x (the density). In this case we can compute
the total mass as $\int_{\text{length}} d(x)\,dx$.
Many textbooks base their treatment of continuous r.v.’s on the cumulative distribution function
(cdf); the cdf does give a probability.
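$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du. \tag{6.14}$$
$$f_X(x) = \begin{cases} \frac{1}{(b-a)}, & \text{for } x \in [a, b] \\ 0, & \text{otherwise.} \end{cases} \tag{6.15}$$
$$F_X(x) = \begin{cases} 0, & x < a \\ \frac{(x-a)}{(b-a)}, & x \in [a, b] \\ 1, & x > b. \end{cases} \tag{6.16}$$
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty. \tag{6.17}$$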
The Normal distribution is often used to model measurements taken in the presence of error or
noise. If the true value of a variable X is µ, then the measurement (random) variable is distributed as
N(µ, σ) where σ (the standard deviation) is a measure of the ‘size’ of the errors.
We say X has a standard Normal distribution if µ = 0 and σ = 1; standard Normal r.v.'s are
typically denoted by Z; Z ∼ N(0, 1). The CDF for Z is denoted by Φ(z); although there is no
closed-form formula for Φ(z), it is tabulated. In the days before widespread use of computers, tables such as
those for Φ(z) were of great importance to those involved in statistics and statistical inference.
Nowadays statistics packages and even some calculators will compute Φ(z) for you, or even remove
the necessity by directly calculating the quantity that required Φ(z) as an intermediate value.
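In R, Φ(z) is available as pnorm(). The sketch below shows typical uses; the values µ = 10 and σ = 2 in the last step are assumptions made purely for illustration.

```r
# Standard Normal CDF, Phi(z).
pnorm(0)                        # 0.5, by symmetry
pnorm(1.96) - pnorm(-1.96)      # P(-1.96 <= Z <= 1.96), about 0.95

# General Normal: P(a <= X <= b) for X ~ N(mu, sigma).
mu <- 10                        # assumed mean
sigma <- 2                      # assumed standard deviation
p_within <- pnorm(12, mean = mu, sd = sigma) - pnorm(8, mean = mu, sd = sigma)
p_within                        # within one sd of the mean, about 0.68
```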
$$f_X(x) = \frac{1}{\beta} \exp(-x/\beta). \tag{6.18}$$
The Exponential distribution is used to model the waiting times between infrequent events, c.f.
the Poisson distribution, see section 6.3.6.
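$$f_X(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)} x^{\alpha-1} \exp(-x/\beta), \quad x > 0. \tag{6.19}$$
$$\Gamma(\alpha) = \int_{0}^{\infty} y^{\alpha-1} e^{-y}\,dy. \tag{6.20}$$
$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 < x < 1. \tag{6.21}$$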
6.4.8 Student t Distribution
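$$f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu+1)/2}}. \tag{6.22}$$
$$f_X(x) = \frac{1}{\pi(1 + x^2)}. \tag{6.23}$$
$$f_X(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}} x^{(n/2)-1} e^{-x/2}, \quad x > 0. \tag{6.24}$$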
In discussing discrete r.v.’s we mentioned, for example, a range space RX = {x1 , x2 , . . . , xn }. If the
range space is all the integers, we could use the common symbol RX = Z. If the range space is
all the real numbers, we could use the common symbol RX = R. If the range space is a subset of
R, we use, for example, RX = [0, 1] to state that the r.v. can take any value from 0 to 1 inclusive. For a discrete
(integer) subset we use, for example, {1, 2, . . . , 10}.
6.6 Parameters
In discussing the Binomial distribution, eqn. 6.8, and the Normal, eqn. 6.17, see below,
$$p_X(x) = \binom{n}{x} q^x (1-q)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \ 0 \text{ otherwise},$$
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty,$$
we note that n and q for the Binomial, and µ, σ for the Normal, completely specify the distributions. We
call these parameters and we will see distributions written as, for example, f_X(x; θ1, θ2), where θ is
a common symbol for a parameter.
A lot of practical statistics involves parameter estimation, where, for example, we may have a set
(sample) of data x1 , x2 , . . . , xn , which we know to be drawn from a population with distribution
f_X(x; θ1, θ2) and we want to compute an estimate θ̂1 for θ1.
Chapter 7
7.1 Introduction
Chapter 6 has introduced one dimensional random variables and certain well known distributions.
Both discrete and continuous r.v.’s were covered.
In many cases, your (scientific) data will consist not just of single numbers, for example, the weight
of a chemical in a mixture, but two or more numbers. If the numbers correspond to independent
events, see section 5.8, it may be possible or desirable to treat them separately as individual
one-dimensional r.v.’s, but, generally, you will want to treat pairs or triples or multiple numbers
together.
In section 5.6 and eqn. 5.13 we introduced the notion of the probability of two events happening
together, P (AB), the joint probability of A and B.
Range spaces — terminology for two and more dimensions See section 6.5 where we intro-
duced some symbols and terminology used in describing range spaces for one-dimensional r.v.’s.
If we have a two-dimensional continuous random variable, a pair (X, Y), each member of which
can take on any real value, we say that the range space is R × R; for general multi-dimensions, say
p dimensions, where the random variable is a random vector, we use R^p. For subsets of R, we
use, for example, [0, 1] × [0, 1] and [0, 1]^p. The term for a combination (product) of sets such as
[0, 1] × [0, 1] is Cartesian product.
Much of what we present here is just a two-dimensional analogue of what was covered in Chap-
ter 6. Also, what is described here in terms of two-dimensions transfers immediately to multiple
dimensions.
By analogy with eqns. 6.3 and 6.4, for one-dimension, we have pX,Y (xi , yj ) = P (X = xi , Y = yj )
(or just p(xi , yj )) and it must satisfy the following
p(xi , yj ) ≥ 0, i = 1, 2, . . . ; j = 1, 2, . . . (7.1)
$$\sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) = 1. \tag{7.2}$$
As with one-d., pX,Y or just p is called the probability function or the joint probability function for
the r.v. (X, Y ).
Example 23 From (Meyer 1966, p. 85). There are two production lines; the first has a capacity
to produce up to five items in a day; its actual production is a random variable X; the second has
a capacity to produce up to three items in a day and its actual production is a random variable Y .
The pair of random variables is the two-dimensional random variable (X, Y ) and the joint probability
function is given in Table 7.1. Each entry represents P (X = xi , Y = yj ); so p(2, 3) = 0.04. Such
a table could be estimated by noting (X, Y ) over a large number of days.
Y \ X     0     1     2     3     4     5
0      0.00  0.01  0.03  0.05  0.07  0.09
1      0.01  0.02  0.04  0.05  0.06  0.08
2      0.01  0.03  0.05  0.05  0.05  0.06
3      0.01  0.02  0.04  0.06  0.06  0.05
We can verify that the table does represent a proper probability function in that requirement eqn.
7.1 is satisfied, and, by summing over all entries, that requirement eqn. 7.2 is satisfied — the
entries sum to 1.
By analogy with eqns. 6.11 and 6.12, for one-dimension, we have the (joint) PDF f (x, y ) and it
must satisfy the following
$$f(x, y) \geq 0, \quad \text{all } (x, y) \in \mathbb{R} \times \mathbb{R}, \tag{7.3}$$
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \tag{7.4}$$
We emphasise again that f (x, y ) is not a probability, but f (x, y )dxdy is.
Example 24 Suppose in Example 23 (Table 7.1) we want to compute the probability functions for
X and Y on their own. These are called marginal probability functions. The marginal probability
function for X is given by
$$p_X(x_i) = P(X = x_i) = P(X = x_i, Y = y_1, \text{ or } \ldots, \text{ or } X = x_i, Y = y_m) = \sum_{j=1}^{m} p(x_i, y_j). \tag{7.5}$$
$$p_Y(y_j) = \sum_{i=1}^{n} p(x_i, y_j).$$
Y \ X     0     1     2     3     4     5   Sum
0      0.00  0.01  0.03  0.05  0.07  0.09  0.25
1      0.01  0.02  0.04  0.05  0.06  0.08  0.26
2      0.01  0.03  0.05  0.05  0.05  0.06  0.25
3      0.01  0.02  0.04  0.06  0.06  0.05  0.24
Sum    0.03  0.08  0.16  0.21  0.24  0.28  1.00
We can verify that the sums corresponding to p(xi ) and p(yj ) do represent proper probability
functions in that requirement 6.3 is satisfied, and, by summing the marginals, that requirement
6.4 is satisfied — both sets of marginals sum to 1.
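The marginal sums are one-liners in R once the joint table is held as a matrix. This is a sketch using the production-line table, with rows for Y and columns for X; the object names are ours.

```r
# Joint probability table: rows are Y = 0..3, columns are X = 0..5.
joint <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
                  0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
                  0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
                  0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
                nrow = 4, byrow = TRUE,
                dimnames = list(Y = as.character(0:3),
                                X = as.character(0:5)))

p_x <- colSums(joint)   # marginal of X: sum over the Y values
p_y <- rowSums(joint)   # marginal of Y: sum over the X values

sum(joint)              # all entries sum to 1
```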
For continuous random variables, we can state the equivalent equation for marginal PDFs:
$$f_X(x) = \int_{R_Y} f_{X,Y}(x, y)\,dy. \tag{7.6}$$
7.5 Conditional Probability Distributions
In section 5.6 we introduced conditional probability, i.e. the probability of an event B when we
know that event A has occurred:
$$P(B|A) = \frac{P(AB)}{P(A)}. \tag{7.7}$$
Example 25 Suppose in Example 24 (Table 7.2) we want to compute the conditional probability
P (X = 2|Y = 1). Applying eqn. 7.7 we have
$$P(X = 2 | Y = 1) = \frac{P(X = 2, Y = 1)}{P(Y = 1)} = \frac{0.04}{0.26} = 0.154.$$
We can give general rules, noting that q(yj ), p(xi ) are marginal probability functions given by
eqn. 7.5,
$$p(x_i | y_j) = \frac{p(x_i, y_j)}{q(y_j)} \text{ if } q(y_j) > 0, \tag{7.8}$$
$$p(y_j | x_i) = \frac{p(x_i, y_j)}{p(x_i)} \text{ if } p(x_i) > 0. \tag{7.9}$$
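A whole conditional distribution, not just one value, can be read off the joint table by dividing a row by its marginal. The sketch below conditions on Y = 1 using the production-line table; the object names are ours.

```r
# Joint probability table: rows are Y = 0..3, columns are X = 0..5.
joint <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
                  0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
                  0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
                  0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
                nrow = 4, byrow = TRUE,
                dimnames = list(Y = as.character(0:3),
                                X = as.character(0:5)))

p_y <- rowSums(joint)                          # marginal of Y
# p(x | Y = 1): the Y = 1 row divided by P(Y = 1).
p_x_given_y1 <- joint["1", ] / p_y[["1"]]

round(p_x_given_y1, 3)                         # entry for X = 2 is 0.154
```

A conditional distribution is itself a proper probability function, so its entries sum to one.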
We can give similar general rules for continuous random variables, noting that h(y), g(x) are
marginal probability density functions given by eqn. 7.6,
$$f(x|y) = \frac{f(x, y)}{h(y)} \text{ if } h(y) > 0, \tag{7.10}$$
$$f(y|x) = \frac{f(x, y)}{g(x)} \text{ if } g(x) > 0. \tag{7.11}$$
We can define the notion of independent random variables using the definition of independent
events given in section 5.8; we had: A and B are independent events if and only if
$$P(AB) = P(A)\,P(B).$$
Independent Discrete Random Variables Given the two-d. discrete random variable (X, Y), X
and Y are said to be independent if and only if
$$p(x_i, y_j) = r(x_i)\,q(y_j) \text{ for all } i, j,$$
noting that q(y_j), r(x_i) are marginal probability functions given by eqn. 7.5.
Independent Continuous Random Variables Similarly, given the two-d. continuous random
variable (X, Y), X and Y are said to be independent if and only if
$$f(x, y) = g(x)\,h(y) \text{ for all } (x, y),$$
where g(x), h(y) are the marginal PDFs given by eqn. 7.6.
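$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \left(\frac{y-\mu_y}{\sigma_y}\right)^2\right]\right), \tag{7.15}$$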
Before you start protesting that eqn. 7.15 is incomprehensible: (i) it isn’t and I can explain it; (ii) there is a much better way of handling multivariate random variables, which is preferable even in two dimensions. See Chapter B and section B.7.
Chapter 8
8.1 Introduction
Here we identify and define some parameters (numbers) that characterise some aspects of r.v.
distributions.
Generally, the expected value or expectation of some function of the r.v. is found useful and the
expected value of the r.v. itself (the mean) is first amongst these.
Definition: Expected Value, Discrete R.V. Discrete r.v., range space R_X = {x_1, ..., x_N}; probability mass function p(x_i) = P(X = x_i). The expected value or expectation, E(X), or mean of X is given by
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i). \tag{8.1} \]
Continuous r.v., range space R_X = R; probability density function f(x). The expected value or expectation, E(X), or mean of X is given by
\[ E(X) = \mu_X = \int_{\mathbb{R}} x f(x)\,dx. \tag{8.2} \]
Example 26 Toss two coins as in Example 18. X = number of heads. A = {TT}, P(A) = 1/4, X = 0, P(X = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, X = 1, P(X = 1) = 1/2; A = {HH}, P(A) = 1/4, X = 2, P(X = 2) = 1/4.
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 0 + 0.5 + 0.5 = 1. \]
Example 27 Toss a die and take X = the number of dots obtained; p(x_i) = 1/6, i = 1, ..., 6.
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = \frac{1}{6}\sum_{i=1}^{6} x_i = 21/6 = 3.5. \tag{8.3} \]
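Examples 26 and 27 are easy to check in R. The sketch below is one possible way of coding eqn. 8.1; the helper name `expected_value` is an assumption made here, not a built-in R function.

```r
# Expected value of a discrete r.v.: E(X) = sum_i x_i p(x_i)  (eqn. 8.1)
expected_value <- function(x, p) {
  # p must be a probability mass function over the values in x
  stopifnot(length(x) == length(p), abs(sum(p) - 1) < 1e-12)
  sum(x * p)
}

# Example 26: number of heads in two coin tosses
e_coins <- expected_value(x = 0:2, p = c(1/4, 1/2, 1/4))

# Example 27: number of dots on a fair die
e_die <- expected_value(x = 1:6, p = rep(1/6, 6))

print(c(coins = e_coins, die = e_die))
```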
Aside — Sample Averages In later chapters we will encounter samples and sample averages.
By sample we mean that we run an experiment and take some example values, say n of them, of
the r.v., x1 , x2 , . . . , xn .
Here we use n for the size of the sample rather than N as in eqn. 8.1, and note that the sample space R_X = {x_1, ..., x_N} denotes the population, rather than a sample of it.
Then we can compute a sample mean, X̄ (pronounced x-bar), as
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{8.4} \]
If each distinct value x_i occurs y_i times in the sample, the same mean can be written as a sum over the N distinct values,
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{N} y_i \times x_i = \sum_{i=1}^{N} \frac{y_i}{n}\,x_i, \tag{8.5} \]
and, comparing with eqn. 8.1, we have y_i/n in place of p(x_i); we note that y_i/n = p̄(x_i) = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} = {0.158, 0.183, 0.15, 0.162, 0.175, 0.172}, i.e. we have sample estimates of the probability mass function, which are inexact. The error, X̄ ≠ µ_X, is due to the errors in the p̄(x_i). Generally, as n → ∞, p̄(x_i) → p(x_i) and X̄ → µ_X.
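The frequencies quoted above can be turned into a sample mean directly. This is a small sketch of eqn. 8.5 using the six counts from the text; the variable names are illustrative.

```r
# Observed frequencies y_i of faces 1..6 in n = 600 throws (values from the text)
x <- 1:6
y <- c(95, 110, 90, 97, 105, 103)
n <- sum(y)                 # total number of throws

p_hat <- y / n              # sample estimates of p(x_i)
x_bar <- sum(p_hat * x)     # eqn. 8.5: sum_i (y_i / n) x_i

print(round(p_hat, 3))
print(x_bar)                # compare with the population mean 3.5
```

The sample mean comes out a little above 3.5 because faces 5 and 6 happen to be slightly over-represented relative to face 1 in this sample.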
Definition: Expected Value of a function of a r.v. The expected value, E(r(X)), of a function Y = r(X) of X is given by
\[ E(Y) = E(r(X)) = \sum_{i=1}^{N} r(x_i)\,p(x_i). \tag{8.6} \]
\[ E(Y) = \sum_{i=1}^{N} r(x_i)\,p(x_i) = \sum_{i=1}^{6} (x_i - 5)\frac{1}{6} = -4/6 - 3/6 - 2/6 - 1/6 + 0/6 + 1/6 = -9/6 = -1.5. \]
That is, we lose on average 1.5c for every play and would lose 1500c in 1000 plays. (Maybe better
than the average slot-machine?)
Expected values for two-dimensions and higher Eqns. 8.1 and 8.2 carry over to two and more
dimensions.
Discrete r.v., range space R_{X,Y} = {x_1, ..., x_N} × {y_1, ..., y_M}; probability mass function p(x_i, y_j) = P(X = x_i, Y = y_j). The expected value or expectation, E[(X, Y)], or mean of the pair (X, Y) is given by
\[ E[(X, Y)] = \mu_{X,Y} = (\mu_X, \mu_Y) = \sum_{i=1}^{N}\sum_{j=1}^{M} (x_i, y_j)\,p(x_i, y_j). \tag{8.7} \]
And similarly for two-d. (and multidimensional) continuous, where multiple integrals replace single
integrals.
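\[ E\Big(\sum_i a_i X_i\Big) = \sum_i a_i E(X_i). \tag{8.8} \]
\[ E\Big(\prod_{i=1}^{n} X_i\Big) = \prod_{i=1}^{n} E(X_i) \quad \text{(for independent } X_i\text{)}. \tag{8.9} \]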
8.3 Variance of a Random Variable
Variance gives the spread of a distribution. The variance is the expected value (mean value) of
the squared deviation from the mean.
Definition: Variance Discrete r.v., range space R_X = {x_1, ..., x_N}; probability mass function p(x_i), mean µ_X. The variance is given by
\[ V(X) = \sigma^2 = E[(X - \mu_X)^2] = \sum_{i=1}^{N} (x_i - \mu_X)^2\,p(x_i). \tag{8.10} \]
Continuous r.v.
\[ V(X) = \sigma^2 = E[(X - \mu_X)^2] = \int_{\mathbb{R}} (x - \mu_X)^2 f(x)\,dx. \tag{8.11} \]
Aside — Sample Variance Eqn. 8.4 gives the sample mean of a random variable; the sample variance is given by
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2. \tag{8.13} \]
You may wonder about the (n − 1) instead of n; if we divided by n, the estimate would be biased.
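As a quick check of eqn. 8.13, the sketch below codes the formula by hand and compares it with R's built-in `var()`, which also uses the (n − 1) divisor. The data vector is made up for illustration and `sample_variance` is a name introduced here.

```r
# Sample variance with the (n - 1) divisor, eqn. 8.13
sample_variance <- function(x) {
  n <- length(x)
  sum((x - mean(x))^2) / (n - 1)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data for illustration

s2 <- sample_variance(x)
print(s2)
print(var(x))                    # R's built-in var() also divides by n - 1
```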
Standard Deviation Standard deviation: σ_X = √V(X).
\[ V\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} V(X_i) \quad \text{(for independent } X_i\text{)}. \tag{8.15} \]
8.4 Expectations in Two-dimensions
8.4.1 Mean
Two-d. discrete r.v., range space R_{X,Y} = {x_1, ..., x_N} × {y_1, ..., y_M}; probability mass function p(x_i, y_j). The expected value or expectation, E[(X, Y)], or mean of (X, Y) is given by
\[ E[(X, Y)] = \mu_{X,Y} = \sum_{j=1}^{M}\sum_{i=1}^{N} (x_i, y_j)\,p(x_i, y_j). \tag{8.17} \]
Similarly for a continuous r.v. — double integral replaces summation, pdf replaces probability mass
function.
8.4.2 Covariance
Let X, Y be r.v.’s with means µ_X, µ_Y and standard deviations σ_X, σ_Y. The covariance between X and Y is defined as
\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]. \]
Chapter 9
9.1 Introduction
Here we introduce some uses of the Normal distribution, eqn. 6.17. The Normal distribution can
be used as a model or approximate model in so many cases that a large amount of mathematics
has been built up around it. Note: we use Normal (capitalised) to distinguish from the word normal
(expected, typical) and because most other distribution names are capitalised.
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \right), \quad -\infty < x < \infty. \tag{9.1} \]
We say X ∼ N(µ, σ); note: some writers use X ∼ N(µ, σ²), i.e. they use the variance for the second parameter of N; we will attempt to standardise on N(µ, σ). It is well worth checking carefully when reading books and papers: there can be a great difference between σ and σ²!
Because the pdf is different for each µ, σ, it is convenient to create a standardised Normal in which
µ = 0, σ = 1. We standardise the r.v. X as follows; first we shift to zero mean, and then we divide
by σ to obtain unit standard deviation.
Z = (X − µ)/σ. (9.2)
When we standardise X, we obtain Z = (X − µ)/σ ∼ N(0, 1), and eqn. 9.1 becomes eqn. 9.3,
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2). \tag{9.3} \]
The pdf for N(0, 1) is shown in Figure 9.1. As you can see, most of the probability is located in −3 < Z < 3; between these limits we have probability 0.9973, i.e. P(−3 < Z < 3) = 0.9973. That is, if we have a standard Normal random variable Z, we can be pretty sure it will fall between these limits; you may have heard the term three-sigma to denote nearly all occurrences. Likewise P(−1.96 < Z < 1.96) = 0.95, so that the probability outside these limits is 0.05 or 5%.
R-Example 3 The following R code computes and plots Figure 9.1.
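A minimal sketch of how such a plot could be produced with base R follows; it is not necessarily the listing that was used to create Figure 9.1. It relies only on the standard functions `dnorm`, `pnorm` and `plot`; the grid size and axis labels are choices made here.

```r
# Standard Normal pdf evaluated on a grid of z values
z <- seq(-4, 4, length.out = 201)
pdf_z <- dnorm(z, mean = 0, sd = 1)

# Draw the curve only when running in an interactive session
if (interactive()) {
  plot(z, pdf_z, type = "l", xlab = "z", ylab = "f(z)",
       main = "Standardised Normal pdf, N(0, 1)")
}

# Probability between -3 and 3, obtained from the cdf
p3 <- pnorm(3) - pnorm(-3)
print(round(p3, 4))
```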
As we indicated in section 6.4.2, the pdf does not represent a probability, but a probability density; the numbers we refer to above, for example P(−1.96 < Z < 1.96) = 0.95, are obtained by integration,
\[ P(-1.96 < Z < 1.96) = 0.95 = \int_{-1.96}^{1.96} f_Z(z)\,dz. \tag{9.4} \]
However, for the Normal distribution, there is no easy way to compute the integral of f_X(x) from a to b, which is where the cdf comes in; we recall that the cdf is given by eqns. 9.5 and 9.6,
\[ F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du, \tag{9.5} \]
\[ \Phi(z) = F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\,du. \tag{9.6} \]
Because it is so commonly used, the standardised Normal cdf gets its own symbol, Φ(z). Φ(z) is plotted in Figure 9.2 which was created using the code in R-Example 4.
Following the discussion above on how most of the probability is located between (−3 < Z < 3), we are not surprised to see that Φ(z) is close to zero at z = −3; it rises to 0.5 at z = 0 (one half of the probability is below 0, the other above 0) and then flattens out at z = 3 after which there is almost no probability for the integral to add in.
Figure 9.1: Standardised Normal distribution, N(0, 1), probability density function (pdf).
9.3 Normal Cdf
Traditionally, statistics books, and books of tables contained tabulations of the Normal cdf, Φ(z).
We will see below how these tables are used. However, because most statistics is now conducted
using software packages, tables may be less frequently used, and may be less commonly encountered
in textbooks.
> z = seq(-4, 4, length = 9)
> cdf = pnorm(z, 0, 1)
> z
[1] -4 -3 -2 -1  0  1  2  3  4
> cdf
[1] 3.167124e-05 1.349898e-03 2.275013e-02 1.586553e-01 5.000000e-01
[6] 8.413447e-01 9.772499e-01 9.986501e-01 9.999683e-01
z -4 -3 -2 -1 0 1 2 3 4
Phi(z) 3.2e-05 1.35e-03 2.28e-02 0.159 0.5 0.84 0.977 0.999 0.99997
What does Φ(z = −2) = 2.28 × 10^−2 = 0.0228 mean? Referring to Figure 9.1 it means that the amount of probability to the left of Z = −2 is 0.0228, i.e. as indicated by eqn. 9.5.
Owing to the symmetry of Figure 9.1, we can state that the amount of probability to the right of Z = +2 is also 0.0228. Hence the probability P(Z < −2 or Z > +2) = 2 × 0.0228 = 0.0456 or 4.56%. If we move a little closer to the mean, we get P(Z < −1.96 or Z > +1.96) = 2 × 0.025 = 0.05 or 5%. This 5% two-tailed cut-off (±1.96) is used a lot in statistics.
If P(Z < −1.96 or Z > +1.96) = 0.05 then P(−1.96 < Z < +1.96) = 0.95.
In a similar way, we can determine that P(Z < −1 or Z > +1) = 2 × 0.159 = 0.318; that is, a standard Normal random variable Z is more than one standard deviation away from the mean 31.8% of the time, and within one standard deviation of the mean 68.2% of the time. The 0.159 number is used below in Example 29.
Example 29 Suppose we have a manufacturing process which takes fixed quantities of raw materials A (1000 g) and B (500 g) which react together to produce a product C in the form of a solid cake. The weights of the cakes, X, are monitored and those below a certain weight are set aside as B-grade. The manufacturer of the machine gives the yield expected value as E(X) = 165 grams with a variance of 9 and has determined that the yield follows the Normal distribution; that is, µ_X = 165, σ_X = √9 = 3 and X ∼ N(165, 3). We have decided that cakes below 162 grams should be marked as B-grade.
What is the probability that a randomly selected output will be less than 162 grams?
We have no tables for N(165, 3), but we do have for N(0, 1), that is the cdf for the standardised
Normal Φ(z).
Solution.
(i) First we standardise using eqn. 9.2; our standardisation formula is
Z = (X − µ)/σ = (X − 165)/3,
in which case the standardised weight corresponding to 162 is Z_162 = (162 − 165)/3 = −1.
(ii) The probability that Z < Z_162 is just Φ(Z_162) = Φ(−1) and we can read that from Table 9.1, i.e. the probability is 0.159 and 15.9% of the output will be B-grade.
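The same answer can be obtained in R without tables. The sketch below follows the two steps of the solution and then, as a cross-check, asks `pnorm` for the probability directly on the original scale; the variable names are illustrative.

```r
mu <- 165; sigma <- 3; limit <- 162

z162 <- (limit - mu) / sigma                      # standardised weight, eqn. 9.2
p_std <- pnorm(z162)                              # Phi(-1), the standard Normal cdf
p_direct <- pnorm(limit, mean = mu, sd = sigma)   # same probability without standardising

print(c(z = z162, p_std = p_std, p_direct = p_direct))
```

Note that `pnorm` takes the standard deviation as its third argument, which matches the N(µ, σ) convention adopted in these notes.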
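For independent X_1 ∼ N(µ_1, σ_1) and X_2 ∼ N(µ_2, σ_2), the sum X_1 + X_2 is Normal with mean µ_1 + µ_2 and variance σ_1² + σ_2². That is, add the means and add the variances; note that we do not add the standard deviations.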
Need example here.
Eqn. 9.7 generalises to give the distribution of a sum of n independent observations of the same random variable. If X_i ∼ N(µ, σ),
\[ X = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i \sim N(n\mu, \sqrt{n}\,\sigma). \tag{9.8} \]
That is, add n means, and add n variances, so that σ_sum = √(n Var(X)) = √n σ.
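For the difference of independent X_1 ∼ N(µ_1, σ_1), X_2 ∼ N(µ_2, σ_2),
\[ X_1 - X_2 \sim N\Big(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\Big). \]
Take the difference of the means and add the variances (not difference of variances).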
If X ∼ N(µ, σ),
Why is the Normal distribution (a) so common; (b) so popular amongst statisticians? First, the Central Limit Theorem (CLT) states, roughly speaking, that if a random variable has been created by summing a large number of (independent) random variables, then the sum will have an approximately Normal distribution. Second, it is popular not just because of its common occurrence but because mathematics involving the distribution, eqn. 9.1 and its multivariate counterpart, is in many cases rather easy, or at least a good deal easier than mathematics involving some other distributions.
Let X_1, X_2, ..., X_n be independent and identically distributed r.v.’s with mean µ and standard deviation σ. Let X̄_n = (1/n) Σ_{i=1}^{n} X_i. Then, as n → ∞,
\[ Z_n = \frac{\bar{X}_n - \mu}{\sqrt{Var(\bar{X}_n)}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to Z, \tag{9.11} \]
where Z ∼ N(0, 1).
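The theorem can be illustrated by simulation. The sketch below is one possible experiment, under assumptions chosen here: the X_i are Uniform(0, 1) (so decidedly not Normal), each sample has n = 30 values, and the experiment is repeated m = 5000 times with an arbitrary seed.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
n <- 30                           # size of each sample
m <- 5000                         # number of repeated samples
mu <- 0.5; sigma <- sqrt(1 / 12)  # mean and sd of Uniform(0, 1)

# m sample means of n Uniform(0, 1) draws, standardised as in eqn. 9.11
x_bar <- replicate(m, mean(runif(n)))
z_n <- (x_bar - mu) / (sigma / sqrt(n))

print(c(mean = mean(z_n), sd = sd(z_n)))   # should be near 0 and 1
print(mean(abs(z_n) < 1.96))               # should be close to 0.95
```

If the standardised means behave like N(0, 1), about 95% of them should fall within ±1.96, which is what the last line estimates.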
Chapter 10
Statistical Inference
10.1 Introduction
We use the Normal distribution, eqn. 6.17, repeated here, to introduce statistical inference.
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \right), \quad -\infty < x < \infty. \tag{10.1} \]
Let us say we have performed an experiment and have collected a sample of values of the random variable X, x_1, x_2, ..., x_n; we assume that X ∼ N(µ, σ) but we do not know one or other (or both) of the parameters.
Point Estimation Parameter estimation is concerned with estimating parameters. A point esti-
mate for say µ is an approximate value µ̂ computed from the sample. Typically, in addition to the
estimate, µ̂, we give some qualifications such as the variance of the estimate, that is, an indication
of how variable we think µ̂ might be if we repeated the experiment a number of times.
Interval Estimation An interval estimate (set estimate, confidence interval) for say µ is an interval [µ_1, µ_2] computed from the sample which we claim to contain the real µ. Typically, we give some indication of how plausible the interval is in the form of some sort of probability value.
Hypothesis Testing A typical hypothesis testing example is when a scientist needs to test the efficacy of a new method.
An experiment is performed where there are two methods, M_1 and M_2. Often, M_1 is a control (say the old method) and M_2 is the new method whose efficacy we wish to test.
Let us keep the hypothesis simple by assuming that we wish to test whether M_2 will give a better yield than M_1.
Chapter 11
Statistical Estimation
11.1 Introduction
When we state for example X ∼ fX (x; θ1 , θ2 ), we indicate that the distribution depends on parame-
ters θ1 , θ2 . For example, we may think of a family of Normal distributions, N(θ1 , θ2 ), parametrised
or labelled or indexed by θ1 = µ, θ2 = σ.
When we quote values of parameters, for example the mean and standard deviation of a Normally
distributed r.v., X ∼ N(µ, σ), we are talking about population parameters.
Note the difference: population versus sample. A population includes all possible random variables;
a sample contains, well, a sample taken from the population. If you wanted a quick estimate of
the mean salary of lecturers in the college, you could ask a number of lecturers you know and take
the average of that sample.
The Human Resources Department could give you an exact figure, because they have the data for the (complete) population of N lecturers. They would compute the true population parameters as,
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{11.1} \]
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2. \tag{11.2} \]
You could imagine that the larger your sample, the better the sample mean would approximate the
population mean.
Random Sample However, apart from being a small sample, lecturers you know could contain
another source of inaccuracy, namely that the sample is not random and so it may contain a bias
due to the fact that, for example, the lecturers in your sample tend to be younger.
By random sample we mean that each member of the population has an equal chance of being
sampled. Achieving a random sample is not always easy, see Chapter 13.
A point estimate for say µ is an approximate value µ̂ computed from the sample. Typically, in
addition to the estimate, µ̂, we give some qualifications such as the variance of the estimate, that
is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of
times. The hat symbol, θ̂, is used to indicate that we have an estimate of θ.
The most obvious estimate for µ is to copy eqn. 11.1, noting that we use capital N for the size of
the population and lower-case n for the size of the sample,
\[ \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{11.3} \]
The “best” estimate for σ² is less obvious and eqn. 11.2 is modified slightly to,
\[ \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{11.4} \]
Thus, we not only replace µ by its estimate, x̄, we divide by n − 1 instead of n. It is usual to use
s 2 to denote sample variance.
The reason for the n − 1 is that dividing by n would generally lead to a systematic underestimate
— a so-called bias. This may be discussed in a later chapter; (reference it if we do).
11.5 Sampling Distributions
The estimate of the mean given by eqn. 11.3 is itself a random variable; we can imagine taking m samples, each of size n, and each of these yielding an x̄_j for j = 1, 2, ..., m.
\[ E(\bar{x}) = \mu, \tag{11.5} \]
\[ \sqrt{V(\bar{x})} = \sigma/\sqrt{n}. \tag{11.6} \]
Both eqns. 11.5 and 11.6 are rather comforting: (a) the expected value of x̄ is µ and (b) the standard deviation of x̄ is σ/√n, that is, as n increases the standard deviation decreases and will decrease to zero as n → ∞.
Finally, we can state that the sampling distribution of µ̂ is N(µ, σ/√n). This means that if we conduct a number of sample experiments (take a sample of n Xs and compute the mean x̄), then x̄ will be found to have a Normal distribution centred on the true mean µ.
We note emphatically that we do not know µ. In the first part of the discussion below, we assume
that σ 2 is known. However, this is typically untrue, and we must use an estimate for the standard
deviation, as in eqn. 11.4.
Figure 11.1 (Maindonald & Braun 2007, p. 103) shows two sampling distributions, for a random
variable X which has µX = 10, σ = 1; Figure 11.1(a) shows the sampling distribution for a sample
size of n = 4, while Figure 11.1(b) shows the sampling distribution for a sample size of n = 9; the
distribution of X, corresponding to a sample size of n = 1 is shown for comparison.
\[ \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \tag{11.7} \]
On the other hand, if σ is unknown, and we must replace σ with an estimate, s, see eqn. 11.4,
then
\[ \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}, \tag{11.8} \]
where tn−1 is the Student t distribution with n − 1 degrees of freedom; see section 6.22. As with
N(0, 1), we have tables for the t distribution.
Figure 11.1: (a) Sampling distribution for a sample size of n = 4; (b) sampling distribution for a
sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1 is shown for
comparison.
If the estimator for σ (unknown) is s, see eqn. 11.4, and µ is also unknown, with estimate x̄, then
\[ \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^2 = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}, \tag{11.9} \]
where χ²_n is the Chi-squared distribution with n degrees of freedom; see section 6.4.10. As with N(0, 1) and t_ν we have tables for the χ²_n distribution.
11.6 Confidence Intervals
In section 11.5.1 we established that the distribution of the sample mean is x̄ ∼ N(µ, σ/√n) or equivalently eqn. 11.7, (x̄ − µ)/(σ/√n) ∼ N(0, 1). This tells us that the estimate has a distribution that is centred on the mean, that the expected value of the estimate is the mean, and that the distribution will have a standard-deviation (spread) of σ/√n.
Thus referring to Figure 11.1(a), we can say that the mean of x̄_4 is µ, the true mean (which we do not know), and that different samples would vary between about 1.5σ above and below the true mean. Hence if the true mean is 10 as in the diagram, and we kept repeating our sampling experiment, we would expect the estimate x̄_4 to vary between about 8.5 and 11.5.
On the other hand, if we used sample size n = 9, we would expect the estimate x̄9 to vary between
about 9.0 and 11.0, see Figure 11.1(b).
The previous few sentences should be suggesting that we should be able to give a plausible interval
estimate such as we estimate that the mean is between 9 and 11, together with a probability for
that assertion, e.g. about 0.95 as discussed in section 9.3 for P (−1.96 < Z < +1.96). But
unfortunately we cannot, for we do not know the true mean.
What can we say? Well, for example, that P(−1.96 < (x̄ − µ)/(σ/√n) < +1.96) = 0.95. Still not much good, for we do not know µ and we must be satisfied with the less useful statement that the estimate x̄ is within plus-or-minus 1.96 × σ/√n of µ, with a probability of 0.95.
More explanation may be needed. What if x̄ is at one of these extremes, namely µ − 1.96 × σ/√n? This would correspond to about 9 in Figure 11.1(a). We can then say that x̄ + 1.96 × σ/√n just about reaches up to µ. If we repeat the sampling, x̄ + 1.96 × σ/√n will reach at least up to µ with a probability 1 − 0.025, i.e. the amount of probability up to Z = −1.96 is 0.025.
Similarly, take the case that x̄ is at the other extreme, namely µ + 1.96 × σ/√n; this would correspond to about 11 in Figure 11.1(a). We can now say that x̄ − 1.96 × σ/√n just about reaches down to µ. If we repeat the sampling, x̄ − 1.96 × σ/√n will reach at least down to µ with a probability 1 − 0.025 (recall the symmetry argument in section 9.3).
Consequently, if we take x̄ ± 1.96 × σ/√n we can say that this interval will capture µ with probability 0.95.
This allows us to construct a confidence interval which we can claim contains µ; that is, we compute
not µ̂, but (L, U), an interval between (L)ower and (U)pper limits which we believe contains µ.
\[ (L, U) = \left( \bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96 \times \frac{\sigma}{\sqrt{n}} \right) \tag{11.10} \]
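Eqn. 11.10 translates directly into a few lines of R. The function name `ci_known_sigma` and the sample values are assumptions made for this sketch; `qnorm` supplies the 1.96 multiplier so that other confidence levels can be tried.

```r
# Confidence interval for the mean when sigma is known (eqn. 11.10)
ci_known_sigma <- function(x, sigma, conf = 0.95) {
  n <- length(x)
  z <- qnorm(1 - (1 - conf) / 2)      # approximately 1.96 for conf = 0.95
  half <- z * sigma / sqrt(n)         # half-width of the interval
  c(L = mean(x) - half, U = mean(x) + half)
}

x <- c(9.6, 10.4, 10.1, 9.9)          # made-up sample of size n = 4
ci <- ci_known_sigma(x, sigma = 1)
print(ci)
```

With n = 4 and σ = 1 the half-width is 1.96 × 0.5, so an estimate x̄ near 10 gives an interval of roughly 9 to 11, in line with the discussion of Figure 11.1(a).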
Summary on Point Estimation and Confidence Interval for the Mean when Variance Known
Refer to Figure 11.1, part (b) of which is based on a sample size of n = 9.
• If we take a point estimate for the mean, it will be distributed according to the narrow
distribution, i.e. if the true mean is 10, our estimate can be anywhere between 9 and 11.
• If we decide to give an interval estimate, we need to decide on a confidence (probability);
the wider the interval, the greater the confidence we can have in it — but a huge interval
with confidence of 100% is not much use to anyone. The usual confidence that is chosen is
95%.
• We would like to be able to look at Figure 11.1(a), for n = 4, and say that our interval for the mean is 9 to 11 with confidence 95% (based on the diagram this is approximate; 10 − 1.96 × 0.5 to 10 + 1.96 × 0.5 are the precise values for 95%).
But we cannot make a statement like the latter, for we do not know that µ = 10.
• The best we can do is (a) take our estimate, x̄; (b) place a distribution like that in Figure 11.1(b) about it; (c) compute the x̄ ± 1.96 × σ/√n interval (eqn. 11.10), where 1.96 ≈ 2.
This allows us to state:
if we repeated our sampling a large number of times, and we computed eqn. 11.10 each time
(getting a different interval), then 95% of these intervals would contain the true mean µ.
Need section on t-distribution and small sample sampling distrib. for mean with std.-dev.
unknown.
Chapter 12
Hypothesis Testing
12.1 Introduction
In Chapter 11 we discussed estimation of parameters, both point estimates and interval estimates
(with confidence value attached). This chapter is also based on sampling theory but here we are
interested in decisions rather than estimates. For example, based on a sample of occurrences of
heads and tails in a sample of n = 10 tosses of a coin, we might wish to come to the decision
whether the coin is fair. We might want to decide whether application of a new fertiliser really
does increase cropping yield, based on samples involving (i) the current fertiliser and (ii) the new
one.
The hypothesis testing technique involves the postulation of a hypothesis (an assumption, a state-
ment about population distributions or their parameters) and then designing an experiment which
will yield a sample upon which we can decide whether the hypothesis is true — based on sample
data.
A typical hypothesis test is as follows. We make a hypothesis that a random variable is distributed
according to fX (x), e.g. X ∼ N(µ, σ), where we assume that σ is known.
We compute a test statistic (a sample estimate with sample size n), for example µ̂ = X̄n and
reject H0 if X̄n > c, where c is some constant to be determined; X̄n > c is the critical region;
X̄n ≤ c is called the acceptance region.
The greater we make c, the smaller (that is, the more stringent) the significance level of the test X̄_n > c. We can set c using the same considerations we used in setting confidence levels for a confidence interval in section 11.6. As in eqn. 11.7, we know that
\[ Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \tag{12.1} \]
so that we can use Φ(z) to choose a c′ such that P(Z > c′) = 0.025 = P((X̄_n − µ)/(σ/√n) > c′) = P(X̄_n > c′σ/√n + µ), say, for a 2.5% significance level. (I have chosen 2.5% = 0.025 because it corresponds to a cut-off point (Z = 1.96) that we have already encountered.)
That is, Z > c′ would occur only 2.5% of the time if H0 is true; in other words the critical region stretches from c′ to the right of it. The acceptance region stretches to the left of c′, i.e. including everywhere that X̄_n ≤ c, where c = c′σ/√n + µ.
Recalling P (Z > +1.96) = 0.025, we can set c 0 = 1.96 for a significance level of 0.025.
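The one-sided test can be sketched in R as follows. This assumes the null hypothesis is H0: µ = µ0 with σ known and that large values of X̄_n count against H0; the function name `z_test_upper` and the data are invented for illustration.

```r
# One-sided z test: reject H0 (mu = mu0) when the sample mean exceeds c
z_test_upper <- function(x, mu0, sigma, alpha = 0.025) {
  n <- length(x)
  c_prime <- qnorm(1 - alpha)                 # about 1.96 for alpha = 0.025
  c_crit <- c_prime * sigma / sqrt(n) + mu0   # cut-off on the scale of the sample mean
  list(x_bar = mean(x), c = c_crit, reject = mean(x) > c_crit)
}

x <- c(10.9, 11.4, 10.7, 11.2)   # made-up sample, n = 4
res <- z_test_upper(x, mu0 = 10, sigma = 1)
print(res)
```

Here the cut-off is c = 1.96 × 0.5 + 10, about 10.98, so a sample mean of 11.05 falls just inside the critical region.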
The standard normal pdf and the relevant critical region is shown in Figure 12.1 (Maindonald &
Braun 2007, p. 106).
Figure 12.1: One side hypothesis test, significance level = 0.025; critical region is shaded to the
right of 1.96. For a two sided test with significance level = 0.05, we include in the critical region
also the marked region to the left of -1.96.
Let us keep the original null hypothesis, H0: µ = µ0, and now choose a different alternative hypothesis, namely HA: µ ≠ µ0. A suitable acceptance region for this might be c_l < X̄_n < c_h, with the critical (rejection) region being all points below c_l and all points above c_h.
If we now choose a significance level of 0.05, we arrive at the familiar P(Z < −1.96 or Z > +1.96) = 0.05; that is, if we have µ = µ0, then values of Z < −1.96 or Z > +1.96, i.e. (X̄_n − µ)/(σ/√n) < −1.96 or (X̄_n − µ)/(σ/√n) > +1.96, should occur only 5% of the time and this is a sufficiently significant deviation for us to reject the null hypothesis.
The significance level, usually denoted α, corresponds to the probability of rejecting H0 when H0
is true, that is, the extreme values in the critical region could occur, but with a small probability,
α.
            H0 true                  HA true
Accept H0   correct                  Type 2 error, prob. β
Reject H0   Type 1 error, prob. α    correct
Chapter 13
Sampling
13.1 Introduction
To be completed.
Chapter 14
14.1 Introduction
The terms classification and pattern recognition are used almost synonymously; statisticians tend to favour classification, while engineers tend to use pattern recognition. This chapter merely introduces the concepts; Chapters 15, 16, 17, 18 and 19 fill in the details.
These chapters are a reworking of some of the basic pattern recognition and neural network material
covered in (Campbell 2005) and (Campbell & Murtagh 1998) and (Campbell 2000).
We define/summarize a pattern recognition system using the block diagram in Figure 14.1.
Figure 14.1: Pattern recognition system; x a tuple of p measurements, output ω — class label.
Unsupervised classification Unsupervised classification is more of an exploratory data analysis
technique than is supervised classification.
In this case we have a set of patterns (random vectors) X_T = {x_i}_{i=1}^{n} and we want to explore structure in the set. For example, are they clustered, thereby suggesting that the clusters identify a number of classes? Clustering involves assigning class labels to the X_T = {x_i}_{i=1}^{n} based not on training data but on proximity of the x’s or some other criterion.
Chapter 15
Let us assume that we want to classify a chemical product, for example fake pharmaceutical drugs, according to the results of a chemical analysis. The analysis data comprise a vector x where x_1 might be percentage mass of component 1, x_2 component 2, etc. The label ω might be country of origin, and it is this that we want to predict, given the results x from an analysis of a newly seized batch.
For the moment, we’ll assume just two classes ω_0 and ω_1; two-class problems are easy to describe, and extension to n-class problems is easy.
In our simplistic recognition system we need to recognise two sources, country 0 and country 1, ω_0 and ω_1. We start off with two components x = (x_1 x_2)^T.
As described in Chapter 14, we have earlier obtained examples of the drug from both countries, X_T = {x_i, ω_i}_{i=1}^{n}, i.e. we have training data, or a sample.
Let us see whether we can recognise using component 1 alone (x_1). Figure 15.1 shows some (training) data. We see that a threshold (T) set at about x_1 = 2.8 is the best we can do; the classification algorithm is:
ω = 1 when x_1 ≥ T, (15.1)
  = 0 otherwise. (15.2)
Use of histograms, see Figure 15.2 might be a more methodical way of determining the threshold,
T.
If enough training data were available, n → ∞, the histograms, h0 (x1 ), h1 (x1 ), properly normalised
would approach probability densities: p0 (x1 ), p1 (x1 ), more properly called class conditional proba-
bility densities (pdfs): p(x1 | ω), ω = 0, 1, see Figure 15.3.
[Figure 15.1: training data for classes 0 and 1 plotted along x1, with threshold T.]
[Figure 15.2: histograms h0(x1), h1(x1) (frequency against x1), with threshold T.]
[Figure 15.3: class conditional densities p(x1 | 0), p(x1 | 1) against x1, with threshold T.]
the class conditional pdfs using parameters estimated from a sample (training data; estimation = training); see Chapter 11.
The use of explicitly statistical methods is described in Chapter 16 but for now we’ll try some intuitive methods, although as you will see we are never far from statistics.
15.2 Linear separating lines/planes for two-dimensions
Since there is overlap in the component-1, x1 , measurement, let us use the two components,
x = (x1 x2 )T , i.e. (component-1, component-2). Figure 15.4 shows a scatter plot of these data
(the sample).
[Figure 15.4: two-dimensional scatter plot of the class 0 and class 1 training data, x2 against x1, with a dotted separating line.]
The dotted line shows that the data are separable by a straight line; it intercepts the axes at
x1 = 4.5 and x2 = 6.
Apart from plotting the data and drawing the line, how could we derive the separating line from the data? (Thinking of a computer program.)
Figure 15.5 shows the line joining the class means and the perpendicular bisector of this line; the
perpendicular bisector turns out to be the separating line. We can derive the equation of the
separating line using the fact that points on it are equidistant to both means, µ0 , µ1 , and expand
using Pythagoras’s theorem,
\[ |x - \mu_0|^2 = |x - \mu_1|^2, \tag{15.3} \]
\[ (x_1 - \mu_{01})^2 + (x_2 - \mu_{02})^2 = (x_1 - \mu_{11})^2 + (x_2 - \mu_{12})^2. \tag{15.4} \]
We eventually obtain
\[ 2(\mu_{01} - \mu_{11})x_1 + 2(\mu_{02} - \mu_{12})x_2 - (\mu_{01}^2 + \mu_{02}^2 - \mu_{11}^2 - \mu_{12}^2) = 0, \tag{15.5} \]
Figure 15.5: Two dimensional scatter plot showing means and separating line.
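\[ b_1 x_1 + b_2 x_2 - b_0 = 0, \tag{15.6} \]
where b_1, b_2 are the coefficients of x_1, x_2 and b_0 is the constant term.
In Figure 15.5, µ_01 = 4, µ_02 = 3, µ_11 = 2, µ_12 = 1.5; with these values, eqn 15.6 becomes
\[ 4x_1 + 3x_2 - 18.75 = 0, \tag{15.7} \]
which intercepts the x1 axis at 18.75/4 ≈ 4.7 and the x2 axis at 18.75/3 = 6.25.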
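The perpendicular-bisector calculation is easily automated. The sketch below expands eqn. 15.4 for the two class means quoted for Figure 15.5 and also computes the normal form of the line; the variable names are illustrative.

```r
mu0 <- c(4, 3)      # class 0 mean (values quoted for Figure 15.5)
mu1 <- c(2, 1.5)    # class 1 mean

# Expanding |x - mu0|^2 = |x - mu1|^2 gives b1*x1 + b2*x2 - b0 = 0 with
b  <- 2 * (mu0 - mu1)             # (b1, b2)
b0 <- sum(mu0^2) - sum(mu1^2)     # constant term

intercepts <- b0 / b              # where the line crosses the x1 and x2 axes
print(c(b1 = b[1], b2 = b[2], b0 = b0))
print(intercepts)

# Normal form: divide through by the length of (b1, b2)
len <- sqrt(sum(b^2))
a  <- b / len                     # unit vector normal to the line
a0 <- b0 / len                    # perpendicular distance of the line from the origin
print(c(a1 = a[1], a2 = a[2], a0 = a0))
```

In a program the means would be estimated from the training data, for example with `colMeans` applied to the rows belonging to each class.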
Eqn 15.6 becomes more interesting and useful in its normal form,
\[ a_1 x_1 + a_2 x_2 - a_0 = 0, \tag{15.8} \]
where a_1² + a_2² = 1; eqn 15.8 can be obtained from eqn 15.6 by dividing across by √(b_1² + b_2²).
Figure 15.6 shows interpretations of the normal form straight line equation, eqn 15.8. The coefficients form the unit vector normal to the line, n = (a_1 a_2)^T, and a_0 is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of n = (a_1 a_2)^T = (cos θ sin θ)^T. Thus (Foley, van Dam, Feiner, Hughes & Phillips 1994), n corresponds to one row of a (frame) rotation matrix; in other words (see below, section 15.5), the dot product of the vector expression of a point with n corresponds to projection onto n. (Note that cos(π/2 − θ) = sin θ.)
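[Figure 15.6: the line a1x1 + a2x2 − a0 = 0 in normal form, with axis intercepts a0/a1 and a0/a2 and perpendicular distance a0 from the origin at angle theta; a point (x1', x2') on one side of the line has a1x1' + a2x2' − a0 > 0, and a point (x1'', x2'') on the other side has a1x1'' + a2x2'' − a0 < 0.]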
Also as shown in Figure 15.6, points x = (x1 x2 )T on the side of the line to which n = (a1 a2 )T
points have a1 x1 + a2 x2 − a0 > 0, whilst points on the other side have a1 x1 + a2 x2 − a0 < 0; as we
know, points on the line have a1 x1 + a2 x2 − a0 = 0.
We know that a_1x_1 + a_2x_2 = n^T x, the dot product of n = (a_1 a_2)^T and x, represents the projection of points x onto n, yielding the scalar value along n, with a_0 fixing the origin. This is plausible: projecting onto n yields optimum separability.
Such a projection,
g(x) = a1 x1 + a2 x2 , (15.9)
g(x) = a1 x1 + a2 x2 − a0 , (15.13)
ω = 0 when g(x) > 0, (15.14)
= 1, g(x) < 0, (15.15)
= tie, g(x) = 0. (15.16)
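The decision rule of eqns. 15.13 to 15.16 might be coded as below. The coefficients are assumptions: they are the normal-form values that follow from class means (4, 3) and (2, 1.5), as quoted for Figure 15.5, and the function name `classify` is introduced here.

```r
# Linear discriminant g(x) = a1*x1 + a2*x2 - a0 (eqn. 15.13)
# Coefficients assume class means (4, 3) and (2, 1.5), as in Figure 15.5.
a  <- c(0.8, 0.6)
a0 <- 3.75

classify <- function(x, a, a0) {
  g <- sum(a * x) - a0                                 # projection onto n, shifted by a0
  if (g > 0) 0L else if (g < 0) 1L else NA_integer_    # NA marks a tie
}

print(classify(c(5, 4), a, a0))   # a point on the class-0 side of the line
print(classify(c(1, 1), a, a0))   # a point on the class-1 side of the line
```

Because `sum(a * x)` is a dot product of vectors of any length, the same function works unchanged in p dimensions.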
Equation 15.13 readily generalises to p dimensions; n is then a unit vector in p-dimensional space, normal to the (p − 1)-dimensional separating hyperplane. For example, when p = 3, n is the unit vector normal to the separating plane.
Other important projections used in pattern recognition are Principal Components Analysis (PCA) and Fisher’s Linear Discriminant Analysis (LDA), see Chapter 17.
An intuitive (but well founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of classes stored in vectors {z_j}_{j=1}^{c}, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length.
Classification of a newly arrived vector x entails computing its template/correlation match with all c templates:
\[ x^T z_j; \tag{15.17} \]
Yet again we see that classification involves dot product, projection, and a linear discriminant.
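As a hedged sketch of eqn 15.17, the following Python computes the match of a vector against each template and picks the largest. The helper names and the two three-dimensional templates are made up for illustration; both x and the templates are normalised to unit length, as assumed in the text.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def template_match(x, templates):
    """Return (index of best class, list of scores x^T z_j) as in eqn 15.17."""
    x = normalise(x)
    scores = [sum(a * b for a, b in zip(x, normalise(z))) for z in templates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical templates, one per class.
templates = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
best, scores = template_match([0.1, 0.9, 1.1], templates)
```

Because all vectors have unit length, each score is the cosine of the angle between x and a template, so it cannot exceed 1.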
Obviously, we may not always have the linear separability of Figure 15.5. One non-parametric
method is to go beyond nearest mean, see eqn. 15.4, to compute the nearest neighbour in the
entire training data set, and to decide class according to the class of the nearest neighbour.
A variation is k-nearest neighbour, where a vote is taken over the classes of the k nearest neighbours.
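The nearest neighbour and k-nearest neighbour rules can be sketched as follows. This is an illustrative implementation under simple assumptions (squared Euclidean distance, a small made-up training set, majority vote); the function name `knn_classify` is not from these notes.

```python
from collections import Counter

def knn_classify(x, training, k=1):
    """Classify x by a vote over the k nearest training vectors.

    training is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(training, key=lambda item: sq_dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data; the last class-1 point lies among the class-0 points.
training = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
            ((5.0, 5.0), 1), ((6.0, 5.5), 1), ((2.2, 2.2), 1)]

one_nn = knn_classify((2.0, 2.0), training, k=1)
three_nn = knn_classify((2.0, 2.0), training, k=3)
```

With k = 1 the stray class-1 point decides the outcome; with k = 3 the vote is dominated by the surrounding class-0 points, which is the usual motivation for taking k > 1.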
Chapter 16
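[Figure 16.1: class-conditional densities $p(x_1 \mid 0)$ and $p(x_1 \mid 1)$ plotted against $x_1$, with samples of class 0 and class 1 along the axis and the threshold T where the two densities cross.]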
We have class conditional pdfs: $p(x_1 \mid \omega)$, $\omega = 0, 1$; given a newly arrived $x_1'$ we might decide on its class according to the maximum class conditional pdf at $x_1'$, i.e. set a threshold T where $p(x_1 \mid 0)$ and $p(x_1 \mid 1)$ cross, see Figure 16.1.
This is not completely correct. What we want is the probability of each class (its posterior probability) based on the evidence supplied by the data, combined with any prior evidence.
In what follows, $P(\omega_i \mid \mathbf{x})$ is the posterior probability or a posteriori probability of class $\omega_i$ given the observation x; $P(\omega_i)$ is the prior probability or a priori probability. We use upper case P(.) for discrete probabilities, whilst lower case p(.) denotes probability densities.
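Bayes’ Rule Recall Bayes’ rule from eqn. 5.22, repeated here:
\[ P(A_i \mid B) = \frac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j) P(A_j)}. \tag{16.1} \]
For our two classes, $\omega_0$ and $\omega_1$, and observation x, this becomes:
\[ P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i) P(\omega_i)}{\sum_{j=0}^{1} P(\mathbf{x} \mid \omega_j) P(\omega_j)}. \tag{16.2} \]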
$P(\omega_i \mid \mathbf{x})$ is the posterior probability of class $\omega_i$ given that our analysis has yielded x; $P(\omega_i)$ is the prior probability; if we have no prior preference, then $P(\omega_0) = 0.5$, $P(\omega_1) = 0.5$.
Eqn. 16.2 forms a Bayes decision rule: compute the two posterior probabilities and take the class
which has the maximum.
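A minimal Python sketch of this decision rule follows. It assumes, purely for illustration, one-dimensional Gaussian class-conditional densities with made-up means and standard deviations, and uses densities in place of the discrete likelihoods of eqn 16.2; the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    """One-dimensional normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, params, priors):
    """Posterior probability of each class given x (Bayes' rule, eqn 16.2)."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]
    evidence = sum(joint)          # the denominator: sum over all classes
    return [j / evidence for j in joint]

def bayes_decide(x, params, priors):
    """Return the index of the class with the maximum posterior."""
    post = posteriors(x, params, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical class-conditional densities: (mean, sd) for classes 0 and 1.
params = [(2.0, 1.0), (4.0, 1.0)]

equal = posteriors(3.0, params, [0.5, 0.5])    # midway between the means
skewed = posteriors(3.0, params, [0.9, 0.1])   # same x, unequal priors
```

At x = 3 the two densities are equal, so the posteriors simply reproduce the priors. This illustrates why thresholding where the densities cross is only correct when the priors are equal.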
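Let the Bayes decision rule be represented by a function g(.) of the feature vector x:
\[ g(\mathbf{x}) = \omega_j, \quad \text{where } j = \arg\max_{k} P(\omega_k \mid \mathbf{x}). \tag{16.3} \]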
To show that the Bayes decision rule, eqn. 16.3, achieves the minimum probability of error, we compute the probability of error conditional on the feature vector x (the conditional risk) associated with it:
\[ R(g(\mathbf{x}) = \omega_j \mid \mathbf{x}) = \sum_{k=1,\, k \neq j}^{c} P(\omega_k \mid \mathbf{x}). \tag{16.4} \]
That is to say, for the point x we compute the posterior probabilities of all the c − 1 classes not
chosen.
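Since $\Omega = \{\omega_1, \dots, \omega_c\}$ form a partition (they are mutually exclusive and exhaustive) and the $\{P(\omega_k \mid \mathbf{x})\}_{k=1}^{c}$ are probabilities and so sum to unity, eqn. 16.4 reduces to:
\[ R(g(\mathbf{x}) = \omega_j \mid \mathbf{x}) = 1 - P(\omega_j \mid \mathbf{x}). \tag{16.5} \]
It immediately follows that, to minimise $R(g(\mathbf{x}) = \omega_j \mid \mathbf{x})$, we maximise $P(\omega_j \mid \mathbf{x})$, thus establishing the optimality of eqn. 16.3.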
and, owing to the fact that the events in a joint probability are interchangeable, we can equate the
joint probabilities :
p(ω, x) = p(x, ω) = p(x | ω)P (ω). (16.7)
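Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes’ rule for the posterior probability $P(\omega \mid \mathbf{x})$:
\[ P(\omega \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega) P(\omega)}{p(\mathbf{x})}. \tag{16.8} \]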
$P(\omega)$ expresses our belief that ω will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: $P(\omega_1) = P(\omega_2) = \dots = P(\omega_c)$, $\sum_{j=1}^{c} P(\omega_j) = 1$. Although we avoid further discussion here, we note that the choice of prior probabilities is the subject of considerable discussion, especially in the literature on Bayesian inference; see, for example, (Sivia 1996).
p(x) is the unconditional probability density of x, and can be obtained by summing the conditional densities, each weighted by its prior:
\[ p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j) P(\omega_j). \tag{16.9} \]
Where we can assume that the densities follow a particular form, for example Gaussian, the density
estimation problem is reduced to that of estimation of parameters.
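The p-dimensional multivariate normal density, see section B.7, is given by:
\[ p(\mathbf{x} \mid \omega_j) = \frac{1}{(2\pi)^{p/2} |\mathbf{K}_j|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{K}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right] \tag{16.10} \]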
p(x | ωj ) is completely specified by µj , the p-dimensional mean vector, and Kj the corresponding
p × p covariance matrix:
µj = E[x]ω=ωj , (16.11)
Kj = E[(x − µj )(x − µj )T ]ω=ωj . (16.12)
and,
\[ \mathbf{K}_j = \frac{1}{N_j - 1} \sum_{n=1}^{N_j} (\mathbf{x}_n - \boldsymbol{\mu}_j)(\mathbf{x}_n - \boldsymbol{\mu}_j)^T, \tag{16.14} \]
16.4 Discriminants based on Normal Density
Since p(x), the denominator of eqn. 16.15, is the same for all $g_j(\mathbf{x})$ and since eqn. 16.16 involves comparison only, we may rewrite eqn. 16.15 as
\[ g_j(\mathbf{x}) = p(\mathbf{x} \mid \omega_j) P(\omega_j). \tag{16.17} \]
We may derive a further possible discriminant by taking the logarithm of eqn. 16.17 — since
logarithm is a monotonically increasing function, application of it preserves relative order of its
arguments:
gj (x) = log p(x | ωj ) + log P (ωj ). (16.18)
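In the multivariate Gaussian case, eqn. 16.18 becomes (Duda & Hart 1973),
\[ g_j(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{K}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) - \frac{p}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{K}_j| + \log P(\omega_j) \tag{16.19} \]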
The multivariate normal (Gaussian) density provides a good characterisation of pattern (vector)
distribution where we can model the generation of patterns as ideal pattern plus measurement
noise; for an instance of a measured vector x from class ωj :
xn = µj + en , (16.20)
Revealing comparisons with the other learning paradigms which play an important role in this thesis
are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss
classifier decays to certain interesting limiting forms:
• Equal and Diagonal Covariances ($\mathbf{K}_j = \sigma^2 \mathbf{I}$, $\forall j$, where I is the unit matrix); in this case certain important equivalences with eqn. 16.19 can be demonstrated:
– Linear discriminant;
– Template matching;
– Matched filter;
– Single layer neural network classifier.
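When each class has the same covariance matrix, and these are diagonal, we have $\mathbf{K}_j = \sigma^2 \mathbf{I}$, so that $\mathbf{K}_j^{-1} = \frac{1}{\sigma^2} \mathbf{I}$. Since the covariance matrices are equal, we can eliminate the $\frac{1}{2} \log |\mathbf{K}_j|$ term; the $\frac{p}{2} \log 2\pi$ term is constant in any case; thus, including the simplification of the $(\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{K}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)$ term, eqn. 16.19 may be rewritten:
\begin{align}
g_j(\mathbf{x}) &= -\frac{1}{2\sigma^2} (\mathbf{x} - \boldsymbol{\mu}_j)^T (\mathbf{x} - \boldsymbol{\mu}_j) + \log P(\omega_j) \tag{16.21}\\
&= -\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_j\|^2 + \log P(\omega_j). \tag{16.22}
\end{align}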
Nearest mean classifier If we assume equal prior probabilities P (ωj ), the second term in
eqn. 16.22 may be eliminated for comparison purposes and we are left with a nearest mean classifier.
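where
\[ w_{j0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_j^T \boldsymbol{\mu}_j + \log P(\omega_j), \tag{16.25} \]
and
\[ \mathbf{w}_j = \frac{1}{\sigma^2} \boldsymbol{\mu}_j. \tag{16.26} \]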
Template matching In this latter form the Bayes-Gauss classifier may be seen to be performing
template matching or correlation matching, where wj = constant × µj , that is, the prototypical
pattern for class j, the mean µj , is the template.
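The equivalence between the linear discriminant with weights from eqns 16.25 and 16.26 and the nearest mean classifier can be checked numerically. The sketch below assumes the usual linear form $g_j(\mathbf{x}) = \mathbf{w}_j^T \mathbf{x} + w_{j0}$ (obtained by expanding eqn 16.21 and dropping the $\mathbf{x}^T\mathbf{x}$ term common to all classes), equal priors, and made-up class means and variance; all names are illustrative.

```python
import math

def linear_discriminant(x, mu, sigma2, prior):
    """g_j(x) = w_j^T x + w_j0 with w_j, w_j0 as in eqns 16.25-16.26."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * sigma2) + math.log(prior)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify_linear(x, means, sigma2, priors):
    """Pick the class with the largest linear discriminant."""
    scores = [linear_discriminant(x, mu, sigma2, p)
              for mu, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

def classify_nearest_mean(x, means):
    """Pick the class whose mean is closest in squared Euclidean distance."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for mu in means]
    return min(range(len(dists)), key=dists.__getitem__)

# Hypothetical class means (the templates) and equal priors.
means = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
priors = [1.0 / 3] * 3
points = [(1.0, 0.5), (3.0, 1.0), (0.5, 3.5), (2.5, 2.6)]

agree = all(classify_linear(p, means, 1.5, priors)
            == classify_nearest_mean(p, means) for p in points)
```

The two classifiers agree on every test point, which is what the algebra predicts when the priors are equal; with unequal priors the log P term shifts the boundaries and the agreement no longer holds in general.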
Matched filter In radar and communications systems a matched filter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector x is written as a time series (a digital signal), x[n], n = 0, 1, . . ., then the matched filter for each signal j may be implemented as a convolution:
\[ y_j[n] = x[n] \circ h_j[n] = \sum_{m=0}^{N-1} x[n-m]\, h_j[m], \tag{16.27} \]
where the kernel $h_j[.]$ is a time reversed template; that is, at each time instant, the correlation between the template and the last N samples of x[.] is computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected.
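The convolution of eqn 16.27 can be written directly as a double loop. This is a minimal sketch for a single template; the three-sample template and the test signal are invented for the example, and samples before the start of the signal are taken as zero (an assumption the notes do not state).

```python
def matched_filter(x, template):
    """y[n] = sum_m x[n-m] h[m], with h the time-reversed template (eqn 16.27)."""
    N = len(template)
    h = template[::-1]              # time reversal turns convolution into correlation
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(N):
            if n - m >= 0:          # treat samples before the start as zero
                acc += x[n - m] * h[m]
        y.append(acc)
    return y

# Hypothetical symbol and a signal containing it at samples 2..4.
template = [1.0, 2.0, -1.0]
signal = [0.0, 0.0, 1.0, 2.0, -1.0, 0.0, 0.0]

y = matched_filter(signal, template)
peak = max(range(len(y)), key=y.__getitem__)
```

The output peaks at the sample where the last N inputs line up with the template, and the peak value equals the template energy (1 + 4 + 1 = 6).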
Single Layer Neural Network If we restrict the problem to two classes, we can write the clas-
sification rule as:
and $\mathbf{w} = \frac{1}{\sigma^2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$.
In other words, eqn. 16.29 implements a linear combination, adds a bias, and thresholds the result
— that is, a single layer neural network with a hard-limit activation function.
(Duda & Hart 1973) further demonstrate that eqn. 16.22 implements a hyper-plane partitioning
of the feature space.
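When each class has the same covariance matrix, K, eqn. 16.19 reduces to:
\[ g_j(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{K}^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) + \log P(\omega_j). \tag{16.30} \]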
Nearest Mean Classifier, Mahalanobis Distance If we have equal prior probabilities $P(\omega_j)$, we arrive at a nearest mean classifier where the distance calculation is weighted. The Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{K}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)$ effectively weights contributions according to inverse variance. Points of equal Mahalanobis distance correspond to points of equal conditional density $p(\mathbf{x} \mid \omega_j)$.
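A small numerical illustration of the inverse-variance weighting, restricted to two dimensions so that the matrix inverse can be written out by hand. The function name and the covariance matrix are assumptions made for the example.

```python
def mahalanobis_sq(x, mu, K):
    """(x - mu)^T K^{-1} (x - mu) for a 2 x 2 covariance matrix K."""
    (a, b), (c, d) = K
    det = a * d - b * c
    inv = [[d / det, -b / det],
           [-c / det, a / det]]     # closed-form inverse of a 2 x 2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

# Hypothetical covariance: variance 4 along x1, variance 1 along x2.
K = [[4.0, 0.0], [0.0, 1.0]]
origin = (0.0, 0.0)

d_along_x1 = mahalanobis_sq((2.0, 0.0), origin, K)
d_along_x2 = mahalanobis_sq((0.0, 2.0), origin, K)
```

Both points are at Euclidean distance 2 from the mean, but the point displaced along the high-variance axis is "closer" in the Mahalanobis sense (1 versus 4). With K equal to the unit matrix the measure falls back to squared Euclidean distance.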
Linear Discriminant Eqn. 16.30 may be rewritten as a linear discriminant, see also section 15.5:
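where
\[ w_{j0} = -\frac{1}{2} \boldsymbol{\mu}_j^T \mathbf{K}^{-1} \boldsymbol{\mu}_j + \log P(\omega_j), \tag{16.32} \]
and
\[ \mathbf{w}_j = \mathbf{K}^{-1} \boldsymbol{\mu}_j. \tag{16.33} \]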
Weighted template matching, matched filter In this latter form the Bayes-Gauss classifier may
be seen to be performing weighted template matching.
Single Layer Neural Network As for the diagonal covariance matrix, it can be easily demonstrated that, for two classes, eqns. 16.31–16.33 may be implemented by a single neuron. The only difference from eqn. 16.29 is that the non-bias weights, instead of being simply a difference between means, are now weighted by the inverse of the covariance matrix.
We can formulate the problem of classification as a least-square-error problem. If we require the classifier to output a class membership indicator ∈ [0, 1] for each class, we can write:
d = f (x) (16.34)
where d = (d1 , d2 , . . . dc )T is the c-dimensional vector of class indicators and x, as usual, the
p-dimensional feature vector.
In order to continue the analysis we need to refer to the theory of linear regression, see Chapter 20.
ŷ = B̂x. (16.37)
16.7 Generalised linear discriminant function
Eqn. 15.13 may be adapted to cope with any function(s) of the features $x_i$; we can define a new feature vector $\mathbf{x}'$ where:
\[ x'_k = f_k(\mathbf{x}). \tag{16.38} \]
In the pattern recognition literature, the solution of eqn. 16.38 involving now the vector $\mathbf{x}'$ is called the generalised linear discriminant function (Duda & Hart 1973).
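As a hedged example of a generalised linear discriminant, the sketch below adds the product $x_1 x_2$ as a third feature and applies a linear discriminant to the new vector. The feature functions, the coefficients and the convention of labelling g > 0 as 1 are all choices made for this illustration, not taken from the notes.

```python
def features(x):
    """x'_k = f_k(x): the original two features plus their product."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def g(x, a=(1.0, 1.0, -2.0), a0=0.5):
    """Linear discriminant applied to the transformed vector x'."""
    xp = features(x)
    return sum(ai * xi for ai, xi in zip(a, xp)) - a0

# Label 1 when g > 0, otherwise 0 (labelling chosen for this example).
xor_table = {x: int(g(x) > 0) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The discriminant is linear in $\mathbf{x}'$ but not in x, and it reproduces the exclusive-or function, which no discriminant that is linear in the original features can do.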
It is desirable to escape from the fixed model of eqn. 16.38: the form of the fk (x) must be
known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have
already shown the correspondence between the linear model, eqn. 20.8, and a single layer neural
network with a single output node and linear activation function. An MLP with appropriate non-
linear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to
learning the mapping between x and y (Bishop 1995).
Chapter 17
Principal component analysis (PCA), also called the Karhunen-Loève transform (Duda, Hart & Stork 2000), is a linear transformation which maps a p-dimensional feature vector $\mathbf{x} \in R^p$ to another vector $\mathbf{y} \in R^p$, where the transformation is optimised such that the components of y contain maximum information in a least-square-error sense. In other words, if we take the first q ≤ p components ($\mathbf{y}' \in R^q$), then, using the inverse transformation, we can reproduce x with minimum error. Yet another view is that the first few components of y contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the first two principal components give an optimum view of the spread of the data.
We note however, unlike linear discriminant analysis, see section 17.2, PCA does not take account
of class labels. Hence it is typically a more useful visualisation of the inherent variability of the
data.
\[ \mathbf{x} = \mathbf{U}\mathbf{y} = \sum_{i=1}^{p} y_i \mathbf{u}_i \tag{17.1} \]
where
U = (u1 , u2 , . . . , up ) (17.3)
is an orthonormal matrix:
If we truncate the expansion at i = q,
\[ \mathbf{x}' = \mathbf{U}_q \mathbf{y}' = \sum_{i=1}^{q} y_i \mathbf{u}_i, \tag{17.5} \]
\[ |\mathbf{x} - \mathbf{x}'| = \text{minimum}. \tag{17.6} \]
The optimum transformation matrix U turns out to be the eigenvector matrix of the sample
covariance matrix C:
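\[ \mathbf{C} = \frac{1}{N} \mathbf{A}^t \mathbf{A}, \tag{17.7} \]
\[ \mathbf{U} \mathbf{C} \mathbf{U}^t = \boldsymbol{\Lambda}, \tag{17.8} \]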
In contrast with PCA (see section 17.1), linear discriminant analysis (LDA) transforms the data
to provide optimal class separability (Duda et al. 2000) (Fisher 1936).
Fisher’s original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant
u (a p-dimensional vector of weights — the weights are very similar to the weights used in neural
networks) which, via a dot product, maps a feature vector x to a scalar,
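\[ y = \mathbf{u}^t \mathbf{x}. \tag{17.9} \]
u is optimised to maximise simultaneously, (a) the separability of the classes (between-class separability), and (b) the clustering together of same class data (within-class clustering). Mathematically, this criterion can be expressed as:
\[ J(\mathbf{u}) = \frac{\mathbf{u}^t \mathbf{S}_B \mathbf{u}}{\mathbf{u}^t \mathbf{S}_W \mathbf{u}}. \tag{17.10} \]
where $\mathbf{S}_B$ is the between-class covariance,
\[ \mathbf{S}_W = \mathbf{C}_1 + \mathbf{C}_2, \tag{17.12} \]
\[ \mathbf{u} = \mathbf{S}_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2). \tag{17.13} \]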
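Eqns 17.12 and 17.13 can be tried out on a toy two-class, two-dimensional data set. The sketch below is illustrative only: the data are invented, the helper names are hypothetical, and the class covariances are estimated with the N − 1 divisor, which is an assumption.

```python
def mean(points):
    """Mean vector of a list of 2-d points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def covariance(points):
    """Sample covariance matrix (2 x 2) with the N - 1 divisor."""
    m = mean(points)
    n = len(points)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / (n - 1)
    return c

def fisher_direction(class1, class2):
    """u = S_W^{-1} (m1 - m2), with S_W = C1 + C2 (eqns 17.12-17.13)."""
    c1, c2 = covariance(class1), covariance(class2)
    sw = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    m1, m2 = mean(class1), mean(class2)
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]

# Hypothetical data: small spread in x1, large spread in x2.
class1 = [(0.0, 0.0), (1.0, 4.0), (0.0, 4.0), (1.0, 0.0)]
class2 = [(3.0, 1.0), (4.0, 5.0), (3.0, 5.0), (4.0, 1.0)]

u = fisher_direction(class1, class2)
proj1 = [u[0] * p[0] + u[1] * p[1] for p in class1]
proj2 = [u[0] * p[0] + u[1] * p[1] for p in class2]
```

The resulting u is dominated by its $x_1$ component, because the within-class scatter along $x_2$ is large; after projection onto u (eqn 17.9) the two classes do not overlap.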
There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly
extensions from two-class to multi-class data.
In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal to the first, which optimises the separability and clustering criteria, subject to the orthogonality constraint. The second dimension/discriminant is useful to allow the data to be viewed as a two-dimensional scatter plot.
Chapter 18
Here we show that a single neuron implements a linear discriminant (and hence also implements
a separating hyperplane). Then we proceed to a discussion which indicates that a neural network
comprising three processing layers can implement any arbitrarily complex decision region.
Recall eqn. 15.12, with $a_i \rightarrow w_i$, and now (arbitrarily) allocating discriminant value zero to class 0,
\[ g(\mathbf{x}) = \sum_{i=1}^{p} w_i x_i - w_0 \quad \begin{cases} \le 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.1} \]
Figure 18.1 shows a single artificial neuron which implements precisely eqn. 18.1.
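[Figure 18.1: a single artificial neuron; the inputs $x_1, x_2, \dots, x_p$ with weights $w_1, w_2, \dots, w_p$, and a bias input +1 with weight $w_0$, feed the summing neuron (circle).]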
The signal flows into the neuron (circle) are weighted; the neuron receives wi xi . The neuron sums
and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid
activation function (softer transition) instead of the hard limit.
The threshold term in the linear discriminant (a0 in eqn. 15.13) is provided by w0 × +1. Another
interpretation of bias, useful in mathematical analysis of neural networks, see section 16.6, is to
represent it by a constant component, +1, as the zeroth component of the augmented feature
vector.
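Just to reemphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the two-dimensional case,
\[ w_1 x_1 + w_2 x_2 - w_0 \quad \begin{cases} \le 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.2} \]
[Figure: the separating line in the $x_1$-$x_2$ plane, with axis intercepts labelled $-w_0/w_2$ and $-w_1/w_0$.]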
18.1 Neurons for Boolean Functions
Similarly, a neuron with weights w0 = −0.25, and w1 = w2 = 0.35 implements a Boolean OR.
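A hard-limit neuron is easy to check by enumerating the four Boolean inputs. In the sketch below the bias enters as $w_0 \times (+1)$ added to the weighted sum, which is the convention under which the OR weights quoted above work; the AND weights ($w_0 = -0.5$) are an assumed example and are not taken from these notes.

```python
def neuron(x, w, w0):
    """Hard-limit neuron: output 1 when w0 * (+1) + sum_i w_i x_i > 0, else 0."""
    s = w0 * 1.0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# OR with the weights quoted in the text.
or_out = [neuron(x, (0.35, 0.35), -0.25) for x in inputs]

# AND with assumed weights: the bias is lowered so that one active input is not enough.
and_out = [neuron(x, (0.35, 0.35), -0.5) for x in inputs]
```

The only difference between the two functions is the bias, which slides the separating line across the unit square.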
Figure 18.3 shows the x1 -x2 -plane representation of AND, OR, and XOR (exclusive-or).
[Figure 18.3: the four input points (0, 0), (0, 1), (1, 0) and (1, 1) in the $x_1$-$x_2$ plane, labelled with the outputs of AND, OR and XOR respectively.]
Note that XOR cannot be implemented by a single neuron; in fact it requires two layers. Networks with two layers were a big problem in the first wave of neural network research in the 1960s, when it was not known how to train more than one layer.
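One way to see that two layers suffice for XOR is to build it by hand from an OR neuron, an AND neuron and an output neuron that fires for "OR but not AND". The weights below are assumptions chosen for the illustration (the bias is added to the weighted sum); a trained network could well arrive at a different solution.

```python
def neuron(x, w, w0):
    """Hard-limit neuron: output 1 when w0 + sum_i w_i x_i > 0, else 0."""
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

def xor_net(x):
    """Two-layer network: hidden OR and AND neurons, then 'OR and not AND'."""
    h_or = neuron(x, (0.35, 0.35), -0.25)
    h_and = neuron(x, (0.35, 0.35), -0.5)
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)

xor_out = [xor_net(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```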
The purpose of this section is to give an intuitive argument as to why three processing layers can
implement an arbitrarily complex decision region.
As shown in the figure, however, each ‘island’ of class 1 may be delineated using a series of
boundaries, d11 , d12 , d13 , d14 and d21 , d22 , d23 , d24 .
Figure 18.5 shows a three-layer network which can implement this decision region.
First, just as before, input neurons implement separating lines (hyperplanes), d11, etc. Next, in
layer 2, we AND together the decisions from the separating hyperplanes to obtain decisions, ‘in
island 1’, ‘in island 2’. Finally, in the output layer, we OR together the latter decisions; thus we
can construct an arbitrarily complex partitioning.
[Figure: a decision region in the $x_1$-$x_2$ plane in which class 1 forms two ‘islands’ surrounded by class 0; the islands are delineated by the boundaries $d_{11}, d_{12}, d_{13}, d_{14}$ and $d_{21}, d_{22}, d_{23}, d_{24}$.]
Of course, this is merely an intuitive argument. A three layer neural network trained with back-
propagation or some other technique might well achieve the partitioning in quite a different manner.
If a neural network is to be trained using backpropagation or similar technique, hard limit activation
functions cause problems (associated with differentiation). Sigmoid activation functions are used
instead. A sigmoid activation function corresponding to the hard limit progresses from output
value 0 at −∞, passes through 0 with value 0.5 and flattens out at value 1 at +∞.
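The logistic function is one common sigmoid with exactly these properties; the notes do not say which sigmoid is intended, so its use here is an assumption. The sketch also shows its derivative, which is what makes it convenient for gradient-based training.

```python
import math

def sigmoid(s):
    """Logistic sigmoid: tends to 0 as s -> -inf, equals 0.5 at s = 0, tends to 1 as s -> +inf."""
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    """Derivative of the logistic sigmoid, expressed through its own output."""
    y = sigmoid(s)
    return y * (1.0 - y)

mid = sigmoid(0.0)
high = sigmoid(10.0)
low = sigmoid(-10.0)
```

Unlike the hard limit, the derivative is defined everywhere and is largest at s = 0, where the output is most uncertain.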
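[Network diagram: the inputs $x_1, x_2, \dots, x_p$ (plus a bias input +1) feed the first-layer neurons $d_{11}, \dots, d_{14}$ and $d_{21}, \dots, d_{24}$; each group of four feeds an AND neuron in the second layer; the two AND outputs feed an OR neuron whose output is the class.]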
Figure 18.5: Three-layer neural network implementing an arbitrarily complex decision region.
Chapter 19
Chapter 20
Regression
y = b0 + b1 x + e, (20.1)
which shows the dependence of the dependent variable y on the independent variable x. In other
words, y is a linear function of x and the observation is subject to noise, e; e is assumed to be
a zero-mean random process. Strictly eqn. 20.1 is affine, since b0 is included, but common usage
dictates the use of linear. Taking the nth observation of (x, y ), we have (Beck & Arnold 1977, p.
133):
yn = b0 + b1 xn + en (20.2)
Least square error estimators for $b_0$ and $b_1$, $\hat{b}_0$ and $\hat{b}_1$, may be obtained from a set of paired observations $\{x_n, y_n\}_{n=1}^{N}$ by minimising the sum of squared residuals:
\[ S = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 \tag{20.3} \]
\[ S = \sum_{n=1}^{N} (y_n - b_0 - b_1 x_n)^2 \tag{20.4} \]
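Minimising with respect to $b_0$ and $b_1$, and replacing these with their estimators, $\hat{b}_0$ and $\hat{b}_1$, gives the familiar result:
\[ \hat{b}_1 = \frac{N \sum y_n x_n - (\sum y_n)(\sum x_n)}{N \sum x_n^2 - (\sum x_n)^2} \tag{20.5} \]
\[ \hat{b}_0 = \frac{\sum y_n}{N} - \hat{b}_1 \frac{\sum x_n}{N} \tag{20.6} \]
where all sums run over n = 1, . . . , N.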
The validity of these estimates does not depend on the distribution of the errors en ; that is, as-
sumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures,
including eqns. 20.5 and 20.6, assume the xn to be error free, and that the error en is associated
with yn .
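The estimators of eqns 20.5 and 20.6 translate directly into code. The sketch below uses noise-free data generated from a known line so that the answer can be checked by eye; the function name `fit_line` and the data are illustrative.

```python
def fit_line(xs, ys):
    """Least square error estimates (b0, b1) for y = b0 + b1 x (eqns 20.5, 20.6)."""
    N = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (N * sxy - sx * sy) / (N * sxx - sx * sx)
    b0 = sy / N - b1 * sx / N
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]    # points lying exactly on y = 1 + 2x

b0, b1 = fit_line(xs, ys)
```

With real, noisy observations the same function returns the line that minimises the sum of squared residuals S of eqn 20.4.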
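In the case where y, still one-dimensional, is a function of many independent variables (p in our usual formulation of p-dimensional feature vectors), eqn. 20.2 becomes:
\[ y_n = b_0 + \sum_{i=1}^{p} b_i x_{in} + e_n \tag{20.7} \]
\[ y_n = \mathbf{x}_n^T \mathbf{b} + e_n \tag{20.8} \]
\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{20.9} \]
\[ S = (\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}). \tag{20.10} \]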
Minimising with respect to b, just as eqn. 20.3 was minimised with respect to $b_0$ and $b_1$, leads to a solution for $\hat{\mathbf{b}}$ (Beck & Arnold 1977, p. 235):
\[ \hat{\mathbf{b}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \tag{20.11} \]
The jk-th element of the (p + 1) × (p + 1) matrix $\mathbf{X}^T \mathbf{X}$ is $\sum_{n=1}^{N} x_{nj} x_{nk}$, in other words, just N× the jk-th element of the autocorrelation matrix, R, of the vector of independent variables x estimated from the N sample vectors.
If we have multiple dependent variables (y), in this case c of them, we can replace y in eqn. 20.11 with an appropriate N × c matrix Y formed by N rows each of c observations. Now, eqn. 20.11 becomes:
\[ \hat{\mathbf{B}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \tag{20.12} \]
$\mathbf{X}^T \mathbf{Y}$ is a (p + 1) × c matrix, and $\hat{\mathbf{B}}$ is a (p + 1) × c matrix of coefficients.
Eqn. 20.12 has one significant weakness: it depends on the condition of the matrix XT X. As with
any autocorrelation or auto-covariance matrix, this cannot be guaranteed; for example, linearly
dependent features will render the matrix singular. In fact, there is an elegant indirect implementa-
tion of eqn. 20.12 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky &
Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training
procedure (Widrow & Lehr 1990) developed in the early 1960s tackles the problem in a different
manner.
Bibliography
Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley &
Sons, New York.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer Verlag.
Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford,
U.K.
Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis,
University of Ulster.
Campbell, J. (2005). Lecture notes on pattern recognition and image processing, Technical report,
Letterkenny Institute of Technology. http://www.jgcampbell.com/ip/pr.pdf (accessed 2009-
05-01).
Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report,
Computer Science, Queen’s University Belfast. available at: http://www.jgcampbell.com/ip
(2009-05-01).
Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New
York.
Duntsch, I. & Gediga, G. (2000). Sets, Relations, Functions, Methodos Publishers. Available via
http://www.cosc.brocku.ca/ duentsch/papers/methprimer1.html (2009-04-30).
Dytham, C. (2009). Choosing and Using Statistics: A Biologist’s Guide, 2nd edn, Blackwell
Publishing. ISBN-13: 978-1-4051-0243-8.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, volume 1, 3rd edn,
John Wiley & Sons, New York.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics
7: 179–188.
Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer
Graphics, Addison Wesley.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall.
Gelman, A. & Nolan, D. (2002). Teaching statistics: a bag of tricks, Oxford University Press.
Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press,
Baltimore.
Griffiths, D. (2009). Head First Statistics, O’Reilly. ISBN-10: 0596527586. Excellent introduction.
Hacking, I. (2001). An Introduction to Probability and Inductive Logic, Oxford University Press.
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning, Springer.
Hsu, H. (1997). Theory and Problems of Probability, Random Variables, and Random Processes
(Schaum’s Outlines), McGraw-Hill.
Jaynes, E. & (editor), L. B. (2003). Probability Theory: The Logic of Science, Cambridge Uni-
versity Press. Jaynes was one of the chief advocates of the Bayesian method.
Jeffreys, H. (1961/1998). Theory of Probability, 3rd edn, Oxford University Press (Oxford Classics
Series – 1998), Oxford, U.K.
Larson, H. (1982). Introduction to Probability and Statistical Inference, 3rd edn, John Wiley.
Lee, P. M. (2004). Bayesian Statistics: an introduction, 3rd edn, Arnold. Reputedly one of the
best introductions to Bayesian statistics; Contains examples in R.
Maindonald, J. & Braun, J. (2007). Data Analysis and Graphics Using R: an example-based
approach, 2nd edn, Cambridge University Press, Cambridge, U.K. ISBN: 978-0-521-86116-8;
good R examples, including graphics.
Milton, M. (2009). Head First Data Analysis: A learner’s guide to big numbers, statistics, and
good decisions, O’Reilly. ISBN-10: 0596153937. Another excellent introduction. Uses R.
Murtagh, F. (2005). Correspondence Analysis and data Coding with Java and R, Chapman and
Hall/CRC Press.
O’Hagan, A. (1994). Kendall’s Advanced Theory of Statistics, Vol. 2B, Bayesian Inference, Edward
Arnold.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (revised second printing), Morgan
Kaufmann, San Francisco, CA.
Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn,
Cambridge University Press, Cambridge, UK.
Quinn, G. P. & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists,
Cambridge University Press. ISBN-13: 978-0521009768.
Ripley, B. (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cam-
bridge, U.K.
Rosenkrantz, R. D. (ed.) (1983). E.T. Jaynes. Papers on Probability, Statistics and Statistical
Physics, Kluwer, Dordrecht.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the 20th
Century, W.H. Freeman. Great introduction to the origins of statistics.
Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.
Sivia, D. (2006). Data Analysis, A Bayesian Tutorial, 2nd edn, Oxford University Press. Best
introduction to Bayesian inference there is.
Spiegel, M. R., Schiller, J. & Srinivasan, R. A. (2009). Theory and Problems of Probability and
Statistics (Schaum’s Outlines), 3rd edn, McGraw-Hill.
Spiegel, M. R. & Stephens, L. J. (2008). Statistics (Schaum’s Outlines), 4th edn, McGraw-Hill.
Highly recommended; if you have to buy one book, this is the one; has examples using a few
packages, most notably Excel.
Therrien, C. (1989). Decision, Estimation, and Classification, Chichester, UK: John Wiley and
Sons.
Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag.
Highly recommended for learning R (R is a free version of S).
Wasserman, L. (2004). All of Statistics: a concise course in statistical inference, Springer Verlag,
New York, NY. ISBN: 0-387-40272-1; top class encyclopedic reference.
Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415–
1442.
Appendix A
The notation described here is merely shorthand for common sense concepts which would otherwise be confusing and long-winded if written in English. Casual familiarity with the most important items will also allow you to read papers using statistics without becoming confused. The online book Sets, Relations, Functions (Duntsch & Gediga 2000) is an ideal introduction; we take these notes from that book.
A.1 Sets
A set is a very basic mathematical entity and hence is a bit hard to define. Let’s say that a set is
a collection of objects; there cannot be repetition (duplication) of objects. We can specify a set
by writing all its members within curly brackets, { }.
Example 30 Six sided dice, set of possible faces (identified by the number of spots); call the set
D. We can write D as, D = {1, 2, 3, 4, 5, 6}. When there is an obvious sequence, we can write,
D = {1, 2, . . . , 6}.
Sometimes we specify a rule for making the set; we have, for example, the trivial rule-generated set $D = \{i \mid i \in \{1, \dots, 6\}\} = \{1, \dots, 6\}$; the set of even numbers between 1 and 6 is given by $D_{even} = \{i \mid i \in \{1, \dots, 6\} \text{ and } i \text{ even}\} = \{2, 4, 6\}$.
We use the membership symbol ∈ to state that an object is a member of a set, for example, 1 ∈ {1, 2, 3}; we can state non-membership by ∉, for example, 6 ∉ {1, 2, 3}.
There is no ordering of position in a set. {1, 2, 3}, {2, 3, 1} represent the same set. If there is
repetition, it is understood that the repeated elements have no effect so that {1, 2, 3}, {2, 3, 1, 1, 2}
represent the same set.
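These properties map closely onto Python's built-in `set` type, which gives a quick way to experiment with them. The variable names below are illustrative.

```python
# The faces of a six-sided die, written out in full.
D = {1, 2, 3, 4, 5, 6}

# A rule-generated set: the even members of D.
D_even = {i for i in D if i % 2 == 0}

# Membership and non-membership.
one_is_member = 1 in {1, 2, 3}
six_is_not_member = 6 not in {1, 2, 3}

# Order and repetition have no effect on which set is represented.
same = {1, 2, 3} == {2, 3, 1, 1, 2}
```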
A.1.2 Important Number Sets
• Real numbers: R.
Comment. In case the notion of a universal set causes difficulty: the universal set depends
on the problem at hand; when talking about a class of students, then U would be the set of
all students in the class. You might have A as the set of all students (in that class — in that
universal set) from County Donegal; then Ā is the set of all students from outside County
Donegal — that is not from County Donegal.
Set operations such as intersection, union, difference and complement are often illustrated using
Venn diagrams such as those shown in Figure A.1.
[Figure A.1 panels, each drawn inside the universal set U: intersection of A and B; union of A and B (all shaded area); complement of A.]
Figure A.1: Set operations illustrated using Venn diagrams; (a) intersection, (b) union, (c) complement.
Subset When every member of a set A is also a member of B (so A may have none, some or all of the members of B, but no others), we say that A is a subset of B: A ⊆ B.
Equality of sets When a set A has the same members as B, or each is empty, we say that they
are equal: A = B. Another way of looking at this is, if A ⊆ B and B ⊆ A, then A = B.
Empty Set If a set contains no members, we call it the empty set; symbol ∅.
Cardinality of a Set The number of elements in a set A is called its cardinality and written |A|.
Given a set A, the power set of A, P(A), is the set of all subsets of A. $|P(A)| = 2^{|A|}$. Notice that you can have a set of sets, for example, the set of all classes in the computing department.
For example, if A = {a, b, c}, then P(A) = {∅, {c}, {b}, {a}, {b, c}, {a, c}, {a, b}, {a, b, c}}.
Finite and Infinite Sets Roughly speaking, if |A| = n where n is some number we can identify,
then we say that A is a finite set. Most of the sets in our examples are finite sets; otherwise the
set is infinite.
N, Z, R are infinite sets.
This is an example of a finite set of integer numbers A = {1, 2, . . . , n}; in contrast an infinite set
of integer numbers would be written A = {1, 2, . . .} which means A = {1, 2, . . . , ∞}.
If we want to write down the operation of summing the numbers from 1 to 6, we could write s = 1 + 2 + 3 + 4 + 5 + 6 or s = 1 + 2 + . . . + 6. But this becomes tedious or impossible for larger lists. We have the summation notation $\sum_{i=1}^{6} i$.
Similarly, if we want to write down the operation of multiplying together all the numbers from 1 to 6, we use the product notation $\prod_{i=1}^{6} i$.
If we want to write down the operation of taking the union (see section A.1.3) of a list of sets from $A_1$ to $A_6$, we could write $B = A_1 \cup A_2 \cup \dots \cup A_6$. But this becomes tedious or impossible for larger lists. Similar to the summation notation we have $B = \bigcup_{i=1}^{6} A_i$.
For intersection we have $B = \bigcap_{i=1}^{6} A_i$.
Quite often we need to make new sets by making pairs (or triples or n-tuples) from existing sets.
Example 38 Let B = {1, 2, 3, 4, 5, 6}, the set of outcomes from throwing a six-sided dice, and A = {H, T}, the set of outcomes of a coin toss. If we perform an experiment where we throw the dice and toss a coin and we want to describe the set of all possible pairs C = {(1, H), (1, T), (2, H), . . . , (6, H), (6, T)}, we call set C the Cartesian product of B and A, written C = B × A.
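The Cartesian product of Example 38 can be enumerated with `itertools.product` from the Python standard library; the variable names follow the example.

```python
from itertools import product

B = [1, 2, 3, 4, 5, 6]      # outcomes of the dice throw
A = ["H", "T"]              # outcomes of the coin toss

# Every (dice, coin) pair: the Cartesian product B x A.
C = set(product(B, A))
```

The number of pairs is the product of the two cardinalities, |B| × |A| = 6 × 2 = 12.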
Appendix B
B.1 Introduction
In Chapters 7 and 8 we introduce two-dimensional random variables, that is, pairs of random
variables which, for one reason or another, we want to treat as pairs rather than separately. Much
of what we do in one-dimension generalises to two- and generally multi-dimensions; likewise two-d.
to multi-dimensions.
Price of an apple = x1 , price of an orange = x2 (both unknown). Person A buys 3 apples, and 1
orange and the total bill is 5c (y1 ). Person B buys 2 apples and 4 oranges and the total bill is 10c
(y2 ).
Now, what is $x_1$, the price of apples, and $x_2$, the price of oranges? We want to solve for the unknowns $x_1$, $x_2$. Matrix algebra gives us a nice technique for solving such problems, see section B.6, but first we’ll see how to solve it without matrices.
Eqn. B.3 gives x2 = 5 − 3x1 , which, substituted into eqn. B.4 gives:
10 = 2x1 + 4(5 − 3x1 ),
10 = 2x1 + 20 − 12x1 ,
−10 = −10x1 ,
x1 = 1.
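The result can be checked mechanically. The sketch below solves the same pair of equations (3 apples + 1 orange = 5c, 2 apples + 4 oranges = 10c) using Cramer's rule rather than the substitution used above; the function name `solve_2x2` is hypothetical.

```python
def solve_2x2(a11, a12, a21, a22, y1, y2):
    """Solve a11*x1 + a12*x2 = y1 and a21*x1 + a22*x2 = y2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("no unique solution")
    x1 = (y1 * a22 - a12 * y2) / det
    x2 = (a11 * y2 - y1 * a21) / det
    return x1, x2

# Person A: 3 apples + 1 orange = 5c; person B: 2 apples + 4 oranges = 10c.
x1, x2 = solve_2x2(3, 1, 2, 4, 5, 10)
```

Substituting the two prices back into both bills is a useful sanity check on any hand calculation of this kind.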
Vectors We could be extra careful and continue to call objects like x and y tuples. But everyone
in the statistical world uses the term vector for tuple, and, because we are using vector and matrix
arithmetic and algebra, this gives another reason to use vector.
A vector is nothing more than an ordered collection of one-dimensional variables; however, vector
and matrix mathematics have been developed to allow us to do mathematics on vectors without
having to deal with each of the elements of (X1 , X2 , . . . , Xp ) separately.
It will rarely be helpful to think of these vectors as being like vectors of physics and having magnitude
and direction; but it is often helpful to think of two-dimensional vectors as representing points in a
Euclidean plane and to think of general multidimensional vectors (p-dimensions, say) as representing
points in p-dimensional space.
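Generally, a system of m equations in n variables, $x_1, x_2, \ldots, x_n$,
$$
\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \\
y_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \\
    &\;\;\vdots \\
y_r &= a_{r1} x_1 + a_{r2} x_2 + \cdots + a_{rn} x_n \\
    &\;\;\vdots \\
y_m &= a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n
\end{aligned}
\tag{B.6}
$$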
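can be written in matrix form as
$$ y = Ax, \tag{B.7} $$
where y is an m × 1 vector,
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, $$
x is an n × 1 vector,
$$ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, $$
and A is an m-row × n-column matrix
$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
\cdots & a_{rc} & \cdots & \cdots \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}.
$$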
That is, the matrix A is a rectangular array of numbers whose element in row r, column c is $a_{rc}$
(rows are horizontal, think rows of teeth; columns are vertical). The matrix A is said to be m × n,
i.e. m rows, n columns.
Eqn. B.7 can be interpreted as the definition of a function which takes n arguments $(x_1, x_2, \ldots, x_n)$
and returns m values $(y_1, y_2, \ldots, y_m)$. Such a function is also called a transformation: it transforms
n-dimensional vectors to m-dimensional vectors.
Such transformations are linear because there are no terms in $x_r^2$ or higher powers, only in
$x_r = x_r^1$, and no constant terms such as 5 ($5 x_r^0 = 5 \times 1 = 5$).
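To make the idea of a linear transformation concrete, here is a minimal Python sketch (Python is used only for illustration). The helper name `mat_vec` and the second vector `z` are assumptions introduced here, not part of the notes. The matrix holds the coefficients of the apples-and-oranges example.

```python
def mat_vec(A, x):
    """Return y = Ax, where A is a list of m rows and x is a list of n numbers."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A needs as many entries as x")
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in A]

A = [[3, 1],
     [2, 4]]      # coefficients of the apples-and-oranges example
x = [1, 2]
z = [4, -1]       # an arbitrary second vector, chosen only for illustration

y = mat_vec(A, x)

# Linearity, part 1: transforming a sum gives the sum of the transforms.
x_plus_z = [xi + zi for xi, zi in zip(x, z)]
sum_ok = mat_vec(A, x_plus_z) == [a + b for a, b in zip(mat_vec(A, x), mat_vec(A, z))]

# Linearity, part 2: scaling the input scales the output by the same factor.
scale_ok = mat_vec(A, [5 * xi for xi in x]) == [5 * yi for yi in y]

print(y)                 # [5, 10]
print(sum_ok, scale_ok)  # True True
```

A function with a squared term or an added constant would fail one or both of these checks.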
Why transformations?
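Here y is a 2 × 1 vector,
$$ y = \begin{pmatrix} U \\ V \end{pmatrix}, $$
x is a 2 × 1 vector,
$$ x = \begin{pmatrix} X \\ Y \end{pmatrix}, $$
and A is a 2-row × 2-column matrix
$$ A = \begin{pmatrix} a_{11} = a & a_{12} = b \\ a_{21} = c & a_{22} = d \end{pmatrix}. $$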
The general equation, eqn. B.7, allows us to create an m-dimensional random variable, y, as a linear
combination of the n random variables in the n-dimensional vector x.
$$ \underset{m \times p}{C} = \underset{m \times n}{A} \; \underset{n \times p}{B}. \tag{B.8} $$
Method: The element at the rth row and cth column of C is the product (sum of component-wise
products) of the rth row of A with the cth column of B. Pictorially:

          n                 p                 p
    +-----------+      +---------+       +---------+
    | --------> |      |  |      |       |         |
  m |     A     |    n |  |  B   |  =  m |    C    |
    |           |      |  v      |       |         |
    +-----------+      |         |       +---------+
                       +---------+
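$$ C = AB, $$
with
$$
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix},
$$
so the product is
$$
C = \begin{pmatrix}
a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\
a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22}
\end{pmatrix}.
$$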
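The row-by-column method can be sketched in a few lines of Python (used only for illustration). The function name `mat_mul` and the numerical matrices are assumptions chosen for this example. Matrices are stored as lists of rows.

```python
def mat_mul(A, B):
    """Return C = AB for A (m x n) and B (n x p), each given as a list of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("the number of columns of A must equal the number of rows of B")
    p = len(B[0])
    # C[r][c] is the sum of component-wise products of row r of A and column c of B.
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(p)]
            for r in range(len(A))]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3, values chosen only for illustration
B = [[1, 0],
     [0, 1],
     [2, 2]]             # 3 x 2

C = mat_mul(A, B)        # (2 x 3)(3 x 2) gives 2 x 2
print(C)                 # [[7, 8], [16, 17]]
```

The size check mirrors eqn. B.8: the inner dimensions must agree, and the result takes the outer dimensions.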
Example. Consider Eqn. B.7, y = Ax. The product of A (m × n) and x (n × 1) is, in summation
notation,
$$ y_r = \sum_{c=1}^{n} a_{rc} x_c, \qquad r = 1, \ldots, m. $$
The product is (m × n) × (n × 1), so the result is (m × 1), which checks out, since y is (m × 1).
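As with vectors (when represented as components), we simply multiply each component by the
scalar,
$$
c \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
= \begin{pmatrix} c a_{11} & c a_{12} \\ c a_{21} & c a_{22} \end{pmatrix}.
$$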
B.4.3 Addition
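$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
+ \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
= \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}.
$$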
Clearly, the matrices must be the same size, i.e. row and column dimensions must be equal.
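Both element-wise operations are easy to sketch in Python (used only for illustration). The names `scalar_mul` and `mat_add` are assumptions introduced for this example, and matrices are stored as lists of rows.

```python
def scalar_mul(c, A):
    """Multiply every element of A by the scalar c."""
    return [[c * a for a in row] for row in A]

def mat_add(A, B):
    """Add A and B element by element; both must have the same size."""
    same_size = len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    if not same_size:
        raise ValueError("matrices must have equal row and column dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[3, 1],
     [2, 4]]
B = [[1, 0],
     [0, 1]]

print(scalar_mul(2, A))  # [[6, 2], [4, 8]]
print(mat_add(A, B))     # [[4, 1], [2, 5]]
```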
We can define the matrix inverse as follows: if AB = I, then $B = A^{-1}$; see section B.6.
B.5.2 Orthogonal Matrix
For each row of the matrix, $(a_{r1} \; a_{r2} \; \ldots \; a_{rn})$, the scalar product with itself is 1, and with all other
rows, 0. I.e.
$$
\sum_{c=1}^{n} a_{rc} a_{pc} =
\begin{cases}
1 & \text{for } r = p, \\
0 & \text{otherwise.}
\end{cases}
$$
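The row condition can be tested directly. The Python sketch below is an illustration only: the function name `is_orthogonal` and the example matrices are assumptions, and exact fractions are used so that the comparisons with 1 and 0 are not disturbed by rounding.

```python
from fractions import Fraction

def is_orthogonal(A):
    """True if every row has scalar product 1 with itself and 0 with every other row."""
    rows = range(len(A))
    return all(
        sum(a * b for a, b in zip(A[r], A[p])) == (1 if r == p else 0)
        for r in rows for p in rows
    )

# A rotation-like matrix built from the 3-4-5 triangle (illustrative values).
R = [[Fraction(3, 5), Fraction(4, 5)],
     [Fraction(-4, 5), Fraction(3, 5)]]

# The apples-and-oranges coefficient matrix, which is not orthogonal.
S = [[3, 1],
     [2, 4]]

print(is_orthogonal(R))  # True
print(is_orthogonal(S))  # False
```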
B.5.3 Diagonal
$$ A = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix} $$
is diagonal, i.e. the only non-zero elements are on the diagonal.
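$A^t$, spoken ‘A-transpose’.
If
$$ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} $$
then
$$ A^t = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix} $$
i.e. replace column 1 with row 1, etc.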
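In code, transposition is a one-line operation. The Python sketch below is an illustration only; the function name `transpose` and the 2 × 3 example matrix are assumptions, chosen to show that the rule also applies to non-square matrices.

```python
def transpose(A):
    """Return the transpose of A: row r of A becomes column r of the result."""
    return [list(col) for col in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]           # 2 x 3, values chosen only for illustration

At = transpose(A)         # 3 x 2
print(At)                 # [[1, 4], [2, 5], [3, 6]]
print(transpose(At) == A) # True: transposing twice returns the original
```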
B.6 Inverse Matrix
Only for square matrices (m = n). Consider again Eqns. B.1 and B.2:
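$$
\begin{aligned}
y_1 &= 3x_1 + 1x_2 \\
y_2 &= 2x_1 + 4x_2
\end{aligned}
$$
i.e. y = Ax, with
$$ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}. $$
Apply this to
$$ x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, $$
to get
$$
\begin{aligned}
y_1 &= 3 \times 1 + 1 \times 2 = 5, \\
y_2 &= 2 \times 1 + 4 \times 2 = 10.
\end{aligned}
$$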
What if you know $y = (5 \;\; 10)^t$ and you want to retrieve $x = (x_1 \;\; x_2)^t$? In other words, can
matrices help us solve for $x_1, x_2$ as we did in section B.2?
The answer is yes. Find the inverse of A, written $A^{-1}$, and then apply the inverse transformation to y,
that is, multiply y by the inverse of the matrix,
$$ x = A^{-1} y. \tag{B.9} $$
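For a 2 × 2 matrix
$$ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} $$
the inverse is
$$ A^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix} \tag{B.10} $$
where the determinant of the array, A, is $|A| = a_{11} a_{22} - a_{12} a_{21}$.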
Inverse matrices give us the equivalent of division. If $|A| = 0$, attempting to find the inverse is
equivalent to calculating 1/0.
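Thus for
$$ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix} $$
we have $|A| = 3 \times 4 - 1 \times 2 = 10$, so
$$
A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -1 \\ -2 & 3 \end{pmatrix}
= \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix}.
$$
Therefore, apply $A^{-1}$ to
$$ \begin{pmatrix} 5 \\ 10 \end{pmatrix}. $$
We find:
$$
A^{-1} y =
\begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix}
\begin{pmatrix} 5 \\ 10 \end{pmatrix}
= \begin{pmatrix} 5 \times 0.4 + 10 \times (-0.1) \\ 5 \times (-0.2) + 10 \times 0.3 \end{pmatrix}
= \begin{pmatrix} 1 \\ 2 \end{pmatrix},
$$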
which is the answer we got in section B.2. In fact, in section B.2 what we did was something very
similar to how one inverts a matrix in a computer program.
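The 2 × 2 inverse formula of eqn. B.10 and the solution x = A⁻¹y can be reproduced with a short Python sketch (used only for illustration). The function names `inverse_2x2` and `mat_vec` are assumptions introduced here, and exact fractions are used so that the result is exactly (1, 2) rather than a rounded approximation. This is the closed-form 2 × 2 rule only; it is not the elimination procedure a general-purpose program would normally use.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Invert a 2 x 2 matrix using eqn. B.10; raise if the determinant is zero."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("determinant is zero; the matrix has no inverse")
    f = Fraction(1) / det          # the factor 1/|A|
    return [[f * a22, -f * a12],
            [-f * a21, f * a11]]

def mat_vec(A, x):
    """Return Ax for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[3, 1],
     [2, 4]]
y = [5, 10]

A_inv = inverse_2x2(A)
x = mat_vec(A_inv, y)

print([[float(v) for v in row] for row in A_inv])  # [[0.4, -0.1], [-0.2, 0.3]]
print([int(v) for v in x])                         # [1, 2]
```

Multiplying back, `mat_vec(A, x)` returns the original bills 5 and 10, which confirms that the inverse transformation undoes the original one.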