University of Waterloo
Contents
1. INTRODUCTION TO STATISTICAL SCIENCES
1.1 Statistical Sciences
1.2 Collecting Data
1.3 Data Summaries
1.4 Probability Distributions and Statistical Models
1.5 Data Analysis and Statistical Inference
1.6 Statistical Software and R
1.7 Chapter 1 Problems
5. TESTS OF HYPOTHESES
5.1 Introduction
5.2 Tests of Hypotheses for Parameters in the G(μ, σ) Model
5.3 Likelihood Ratio Tests of Hypotheses - One Parameter
5.4 Likelihood Ratio Tests of Hypotheses - Multiparameter
5.5 Chapter 5 Problems
6. GAUSSIAN RESPONSE MODELS
6.1 Introduction
6.2 Simple Linear Regression
6.3 Comparing the Means of Two Populations
6.4 More General Gaussian Response Models
6.5 Chapter 6 Problems
Note: Section 6.4 (More General Gaussian Response Models) may be omitted in Stat 231/221.
APPENDIX C: DATA
Preface
These notes are a work in progress, with contributions from the students taking the courses and the instructors teaching them. An original version of these notes was prepared by Jerry Lawless. Additions and revisions were made by Don McLeish, Cyntha Struthers, Jock MacKay, and others. Richard Cook supplied the example in Chapter 8. To help us provide improved versions of the notes for students in subsequent terms, please email lists of errors, sections that are confusing, or additional remarks/suggestions to your instructor or castruth@uwaterloo.ca.
Specific topics in these notes also have associated video files or PowerPoint shows that can be accessed at www.watstat.ca. Where possible we reference these videos in the text.
1. INTRODUCTION TO STATISTICAL SCIENCES

1.1 Statistical Sciences
Statistical Sciences are concerned with all aspects of empirical studies, including problem formulation, planning of an experiment, data collection, analysis of the data, and the conclusions that can be made. An empirical study is one in which we learn by observation or experiment. A key feature of such studies is that there is usually uncertainty in the conclusions, and an important task is to quantify this uncertainty. In disciplines such as insurance or finance, decisions must be made about what premium to charge for an insurance policy, or whether to buy or sell a stock, on the basis of available data. The uncertainty as to whether a policy holder will have a claim over the next year, or whether the price of a stock will rise or fall, is the basis of financial risk for the insurer and the investor. In medical research, decisions must be made about the safety and efficacy of new treatments for diseases such as cancer and HIV.
Empirical studies deal with populations and processes, both of which are collections of individual units. To increase our knowledge about a process, we examine a sample of units generated by the process. To study a population of units, we examine a sample of units carefully selected from that population. Two challenges arise: we only see a sample from the process or population, and not all of the units are the same. For example, scientists at a pharmaceutical company may conduct a study to assess the effect of a new drug for controlling hypertension (high blood pressure) because they do not know how the drug will perform on different types of people, what its side effects will be, and so on. For cost and ethical reasons, they can involve only a relatively small sample of subjects in the study. Variability in human populations is ever-present; people have varying degrees of hypertension, they react differently to the drug, and they have different side effects. One might similarly want to study variability in currency or stock values, variability in sales for a company over time, or variability in the number of hits and response times for a commercial web site. Statistical Sciences deal both with the study of variability in processes and populations, and with good (that is, informative, cost-effective) ways to collect and analyze data about such processes.
We can have various objectives when we collect and analyze data on a population or process. In addition to furthering knowledge, these objectives may include decision-making and the improvement of processes or systems. Many problems involve a combination of objectives. For example, government scientists collect data on fish stocks in order to further scientific knowledge and also to provide information to policy makers who must set quotas or limits on commercial fishing.
Statistical data analysis occurs in a huge number of areas. For example, statistical algorithms are the basis for software involved in the automated recognition of handwritten or spoken text; statistical methods are commonly used in law cases, for example in DNA profiling; statistical process control is used to increase the quality and productivity of manufacturing and service processes; and individuals are selected for direct mail marketing campaigns through a statistical analysis of their characteristics. With modern information technology, massive amounts of data are routinely collected and stored. But data do not equal information, and it is the purpose of the Statistical Sciences to provide and analyze data so that the maximum amount of information or knowledge may be obtained. Poor or improperly analyzed data may be useless or misleading; the same can be said of poorly collected data.
We use probability models to represent many phenomena, populations, or processes and to deal with problems that involve variability. You studied these models in your first probability course and you have seen how they describe variability. This course will focus on the collection, analysis, and interpretation of data, and the probability models studied earlier will be used extensively. The most important material from your probability course is that dealing with random variables, including distributions such as the Binomial, Poisson, Multinomial, Normal or Gaussian, Uniform, and Exponential. You should review this material.
Statistical Sciences is a large discipline and this course is only an introduction. Our broad objective is to discuss all aspects of problem formulation, planning of an empirical study, formal and informal analysis of data, and the conclusions and limitations of the analysis. We must remember that data are collected and models are constructed for a specific reason. In any given application we should keep the big picture in mind (e.g. Why are we studying this? What else do we know about it?) even when considering one specific aspect of a problem. We finish this introduction with a quote from Hal Varian, Google's chief economist.
"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - is going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complemintary (sic) scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it's just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills - of being able to access, understand, and communicate the insights you get from data analysis - are going to be extremely important. Managers need to be able to access and understand the data themselves."

A brilliant example of how to create information through data visualization is found in the video by Hans Rosling at http://www.youtube.com/watch?v=jbkSRLYSojo. For the complete Varian article, see "How the web challenges managers", Hal Varian, The McKinsey Quarterly, January 2009.
1.2 Collecting Data
The objects of study in this course are referred to as populations or processes. A population is a collection of units. For example, a population of interest may be all persons under the age of 18 in Canada as of September 1, 2012, or all car insurance policies issued by a company over a one-year period. A process is a mechanism by which units are produced. For example, hits on a website constitute a process (the units are the distinct hits). Another process is the sequence of claims generated by car insurance policy holders (the units are the individual claims). A key feature of processes is that they usually occur over time, whereas populations are often static (defined at one moment in time).
We pose questions about populations (or processes) by defining variates for the units, which are characteristics of the units. For example, variates can be measured or continuous quantities such as weight and blood pressure, discrete quantities such as the presence or absence of a disease or the number of damaged pixels in a monitor, categorical quantities such as colour or marital status, or more complex quantities such as an image or an open-ended response to a survey question. We are interested in functions of the variates over the whole population; for example, the average drop in blood pressure due to a treatment for individuals with hypertension. We call these functions attributes of the population or process.
We represent variates by letters such as x, y, z. For example, we might define a variate y as the size of the claim or the response time to a hit in the processes mentioned above. The values of y typically vary across the units in a population or process. This variability generates uncertainty and makes it necessary to study populations and processes by collecting data about them. By data, we mean the values of the variates for a sample of units in the population or a sample of units taken from the process.
In planning to collect data about some process or population, we must carefully specify what the objectives are. Then we must consider feasible methods for collecting data, as well as the extent to which it will be possible to answer questions of interest. This sounds simple but is usually difficult to do well, especially since resources are always limited.
There are several ways in which we can obtain data. One way is purely according to
what is available: that is, data are provided by some existing source. Huge amounts of
data collected by many technological systems are of this type, for example, data on credit
card usage or on purchases made by customers in a supermarket. Sometimes it is not
clear what available data represent and they may be unsuitable for serious analysis. For
example, people who voluntarily provide data in a web survey may not be representative of
the population at large. Alternatively, we may plan and execute a sampling plan to collect
new data. Statistical Sciences stress the importance of obtaining data that will be objective
and provide maximal information at a reasonable cost. There are three broad approaches:
(i) Sample Surveys. The object of many studies is to learn about a finite population (e.g. all persons over 19 in Ontario as of September 12 in a given year, or all cars produced by the car manufacturer General Motors in the past calendar year). In this case, information about the population may be obtained by selecting a representative sample of units from the population and determining the variates of interest for each unit in the sample. Obtaining such a sample can be challenging and expensive. Sample surveys are widely used in government statistical studies, economics, marketing, public opinion polls, sociology, quality assurance, and other areas.
(ii) Observational Studies. An observational study is one in which data are collected about a process or population without any attempt to change the value of one or more variates for the sampled units. For example, in studying risk factors associated with a disease such as lung cancer, we might investigate all cases of the disease at a particular hospital (or perhaps a sample of them) that occur over a given time period. We would also examine a sample of individuals who did not have the disease. A distinction between a sample survey and an observational study is that for observational studies the population of interest is usually infinite or conceptual. For example, in investigating risk factors for a disease, we prefer to think of the population of interest as a conceptual one consisting of persons at risk from the disease recently or in the future.
(iii) Experiments or Experimental Studies. An experiment is a study in which the experimenter (that is, the person conducting the study) intervenes and changes or sets the values of one or more variates for the units in the sample. For example, in an engineering experiment to quantify the effect of temperature on the performance of a certain type of computer chip, the experimenter might decide to run a study with 40 chips, ten of which are operated at each of four temperatures: 10, 20, 30, and 40 degrees Celsius. Since the experimenter decides the temperature level for each chip in the sample, this is an experiment.
The three types of studies described above are not mutually exclusive, and many studies involve aspects of all of them. Here are some slightly more detailed examples. One of the most important studies was conducted in the Waterloo school board; see, for example, "Six-year follow-up of the first Waterloo school smoking prevention trial" at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1350177/
Example 1.2.3
As an example of a statistical setting where the data are not obtained by a survey,
experiment, or even an observational study, consider the following.
1.3 Data Summaries
Numerical Summaries
First we describe some numerical summaries which are useful for describing features of a
single measured variate in a data set. They fall generally into three categories: measures of
location (mean, median, and mode), measures of variability or dispersion (variance, range,
and interquartile range), and measures of shape (skewness and kurtosis).
1. Measures of location:

The (sample) mean, also called the sample average: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

The median $\hat{m}$, or the middle value when n is odd and the sample is ordered from smallest to largest, and the average of the two middle values when n is even.

The mode, or the value of y which appears in the sample with the highest frequency (not necessarily unique).
2. Measures of dispersion or variability:

The (sample) variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)$.

The (sample) standard deviation: $s = \sqrt{s^2}$.

The range $= y_{(n)} - y_{(1)}$, where $y_{(n)} = \max(y_1, y_2, \ldots, y_n)$ and $y_{(1)} = \min(y_1, y_2, \ldots, y_n)$.

The interquartile range IQR, which is described below.
3. Measures of shape:

Measures of shape generally indicate how the data, in terms of a relative frequency histogram, differ from the Normal bell-shaped curve: for example, whether one tail of the relative frequency histogram is substantially larger than the other, so the histogram is asymmetric, or whether both tails of the relative frequency histogram are large, so the data are more prone to extreme values than data from a Normal distribution.

The (sample) skewness:
$$\frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{3/2}}$$
Figure 1.1: Relative frequency histogram for data with positive skewness (skewness = 1.15)
When the relative frequency histogram of the data is approximately symmetric, there is an approximately equal balance between the positive and negative values in the sum $\sum_{i=1}^{n}(y_i - \bar{y})^3$, and this results in a value for the skewness that is approximately zero. If the relative frequency histogram of the data has a long right tail (see Figure 1.1), then the positive values of $(y_i - \bar{y})^3$ dominate the negative values in the sum and the value of the skewness will be positive. Similarly, if the relative frequency histogram of the data has a long left tail (see Figure 1.2), then the value of the skewness will be negative.
Figure 1.2: Relative frequency histogram for data with negative skewness (skewness = -1.35)
The (sample) kurtosis:
$$\frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{2}}$$
measures the heaviness of the tails and the peakedness of the data relative to data that are Normally distributed. For the Normal distribution the kurtosis is equal to three.
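The course's computing is done in R (Section 1.6), but the location, dispersion, and shape measures above are easy to compute directly from their definitions. Here is a sketch in Python; the helper name `sample_moments` is our own invention, not from the notes:

```python
import math

def sample_moments(y):
    """Mean, variance, standard deviation, skewness, and kurtosis
    as defined above (the variance uses the 1/(n-1) divisor)."""
    n = len(y)
    ybar = sum(v for v in y) / n                        # sample mean
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)      # sample variance
    m2 = sum((v - ybar) ** 2 for v in y) / n            # (1/n) sum (yi - ybar)^2
    m3 = sum((v - ybar) ** 3 for v in y) / n
    m4 = sum((v - ybar) ** 4 for v in y) / n
    skewness = m3 / m2 ** 1.5        # approximately 0 for symmetric data
    kurtosis = m4 / m2 ** 2          # equal to 3 for Normally distributed data
    return ybar, s2, math.sqrt(s2), skewness, kurtosis

ybar, s2, s, skew, kurt = sample_moments([1, 2, 3, 4, 5])
print(ybar, s2, skew)   # 3.0 2.5 0.0
```

A perfectly symmetric sample gives a skewness of exactly zero, matching the balance argument above.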
Figure 1.3: Relative frequency histogram for data with kurtosis > 3 (skewness = 0.71, kurtosis = 5.24), with a G(0.15, 1.52) p.d.f. superimposed
Since the term $(y_i - \bar{y})^4$ is always positive, the kurtosis is always positive, and values greater than three indicate heavier tails (and a more peaked center) than data that are Normally distributed. See Figures 1.3 and 1.4. Typical financial data such as the S&P 500 index have kurtosis greater than three, because extreme returns (both large and small) are more frequent than one would expect for Normally distributed data.

Figure 1.4: Relative frequency histogram for data with kurtosis < 3 (skewness = 0.08, kurtosis = 1.73), with a G(0.49, 0.29) p.d.f. superimposed
Sample Quantiles and Percentiles

For $0 < p < 1$, the pth quantile (also called the 100pth percentile) is a value $q(p)$ such that approximately a fraction p of the y values in the data set are less than $q(p)$ and roughly $1 - p$ are greater. More precisely:

Definition 1 (sample percentiles and sample quantiles): The pth quantile (also called the 100pth percentile) is a value, call it $q(p)$, determined as follows:
Let $m = (n + 1)p$, where n is the sample size.
If $m \in \{1, 2, \ldots, n\}$, then take $q(p) = y_{(m)}$, the mth smallest value, where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ denotes the ordered sample values.
If $m \notin \{1, 2, \ldots, n\}$ but $1 < m < n$, then determine the integer j such that $j < m < j + 1$ and take $q(p) = \frac{1}{2}\left[y_{(j)} + y_{(j+1)}\right]$.
Depending on the size of the data set, quantiles are not uniquely defined for all values of p. For example, what is the median of the values $\{1, 2, 3, 4, 5, 6\}$? What is the lower quartile? There are different conventions for defining quantiles in these cases; if the sample size is large, the differences in the quantiles based on the various definitions are small.

Definition 2: The values $q(0.5)$, $q(0.25)$ and $q(0.75)$ are called the median, the lower or first quartile, and the upper or third quartile, respectively.
We can easily understand what the sample mean, quantiles, and percentiles tell us about the variate values in a data set. The sample variance and sample standard deviation measure the variability or spread of the variate values in a data set. We prefer the standard deviation because it has the same scale as the original variate. Another way to measure variability is to use the interquartile range, the difference between the upper or third quartile and the lower or first quartile.

Definition 3: The interquartile range is $IQR = q(0.75) - q(0.25)$.

Definition 4: The five number summary of a data set consists of the smallest observation, the lower quartile, the median, the upper quartile, and the largest value, that is, the five values $y_{(1)}, q(0.25), q(0.5), q(0.75), y_{(n)}$.
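Definitions 1-4 translate directly into code. The following Python sketch (function names are ours; in R, which the course uses, `quantile()` and `summary()` play a similar role under a slightly different convention) implements the $(n + 1)p$ rule and the five number summary:

```python
def quantile(y, p):
    """q(p) following Definition 1, using m = (n + 1) * p.
    Assumes m is an integer in {1, ..., n} or satisfies 1 < m < n."""
    ys = sorted(y)
    n = len(ys)
    m = (n + 1) * p
    if m == int(m) and 1 <= m <= n:
        return ys[int(m) - 1]            # the m-th smallest value y_(m)
    j = int(m)                            # the integer j with j < m < j + 1
    return 0.5 * (ys[j - 1] + ys[j])      # average the two neighbouring values

def five_number_summary(y):
    """y_(1), q(0.25), q(0.5), q(0.75), y_(n) as in Definition 4."""
    ys = sorted(y)
    return (ys[0], quantile(y, 0.25), quantile(y, 0.5), quantile(y, 0.75), ys[-1])

y = [1, 2, 3, 4, 5, 6]
print(quantile(y, 0.5))            # m = 3.5, so the median is (3 + 4)/2 = 3.5
print(five_number_summary(y))      # (1, 1.5, 3.5, 5.5, 6)
iqr = quantile(y, 0.75) - quantile(y, 0.25)   # Definition 3: 5.5 - 1.5 = 4.0
```

This also answers the question posed after Definition 1: under this convention, the median of $\{1, 2, 3, 4, 5, 6\}$ is 3.5 and the lower quartile is 1.5.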
Table: Numerical summaries of BMI by sex (Female, Male); table values not recoverable here.
From the table, we see that there are only small differences in any of the summary measures except for the standard deviation, which is substantially larger for females. In other words, there is more variability in the BMI measurements for females than for males in this sample.

We can also construct a relative frequency table that gives the proportion of subjects that fall within each obesity class by sex.
Table 1.4: BMI Relative Frequency Table by Sex

Obesity Classification   Males   Females
Underweight              0.01    0.02
Normal                   0.28    0.33
Overweight               0.50    0.42
Moderately Obese         0.19    0.17
Severely Obese           0.02    0.06
Total                    1.00    1.00
From Table 1.4, we see that the reason for the larger standard deviation for females is
that there is a greater proportion of females in the extreme classes.
Sample Correlation
So far we have looked only at summaries of a data set $\{y_1, y_2, \ldots, y_n\}$ on a single variate. Often we have bivariate data of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. A numerical summary of such data is the sample correlation.
Definition 5: The sample correlation, denoted by r, for data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2,$$
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}.$$
The sample correlation, which takes on values between $-1$ and $1$, is a measure of the linear relationship between the two variates x and y. If the value of r is close to 1, then we say that there is a strong positive linear relationship between the two variates, while if the value of r is close to $-1$, then we say that there is a strong negative linear relationship between the two variates. If the value of r is close to 0, then we say that there is no linear relationship between the two variates.
Example 1.3.1 Continued
If we let x = height and y = weight, then the sample correlation for the males is r = 0.55 and for the females r = 0.31, which indicates that there is a positive relationship between height and weight, exactly as we would expect.
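Definition 5 is straightforward to compute from the three sums. A Python sketch (the notes use R, where `cor()` does this directly; the function name here is ours):

```python
import math

def sample_correlation(x, y):
    """r = Sxy / sqrt(Sxx * Syy), with the sums taken about the means."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# An exact linear relationship y = 2x gives the maximum value r = 1:
print(sample_correlation([1, 2, 3], [2, 4, 6]))   # 1.0
```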
Relative Risk
Recall that categorical variates consist of group or category names that do not necessarily have any ordering. If two variates of interest in a study are categorical, then it does not make sense to use the sample correlation as a measure of the relationship between them.

Example 1.3.2 Physicians' Health Study
During the 1980s in the United States, a very large study called the Physicians' Health Study was conducted to study the relationship between taking daily aspirin and the occurrence of coronary heart disease (CHD). One set of data collected in the study is given in Table 1.5.
Table 1.5: Physicians' Health Study

                 CHD    No CHD    Total
Placebo          189    10845     11034
Daily Aspirin    104    10933     11037
Total            293    21778     22071
What measure can be used to summarize the relationship between taking daily aspirin and the occurrence of CHD? One measure used to summarize the relationship between two categorical variates is the relative risk. To define relative risk, consider a generalized version of Table 1.5 given by Table 1.6.
Table 1.6: General Two-way Table

          B            not B        Total
A         y11          y12          y11 + y12
not A     y21          y22          y21 + y22
Total     y11 + y21    y12 + y22    n
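Assuming the standard definition of relative risk (the risk of the event in one row divided by the risk in the other, that is, $[y_{11}/(y_{11} + y_{12})] / [y_{21}/(y_{21} + y_{22})]$, with the rows as the treatment groups and column B as CHD), the counts in Table 1.5 give a value of roughly 1.8. A Python sketch:

```python
# Relative risk of CHD for placebo relative to daily aspirin,
# computed from the counts in Table 1.5 (standard definition assumed):
risk_placebo = 189 / (189 + 10845)    # proportion of the placebo group with CHD
risk_aspirin = 104 / (104 + 10933)    # proportion of the aspirin group with CHD
relative_risk = risk_placebo / risk_aspirin
print(round(relative_risk, 2))        # 1.82
```

A relative risk near 1 would indicate no association; here the placebo group's CHD risk is about 1.8 times that of the aspirin group.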
Graphical Summaries
We consider several types of plots for a data set $\{y_1, y_2, \ldots, y_n\}$ and one type of plot for a data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Frequency histograms

Consider measurements $\{y_1, y_2, \ldots, y_n\}$ on a variate y. Partition the range of y into k non-overlapping intervals $I_j = [a_{j-1}, a_j)$, $j = 1, 2, \ldots, k$, and then calculate, for $j = 1, \ldots, k$,

$f_j$ = number of values from $\{y_1, y_2, \ldots, y_n\}$ that are in $I_j$.

The $f_j$ are called the observed frequencies for $I_1, \ldots, I_k$; note that $\sum_{j=1}^{k} f_j = n$. A histogram is a graph in which a rectangle is placed above each interval; the height of the rectangle for $I_j$ is chosen so that the area of the rectangle is proportional to $f_j$. Two main types of frequency histogram are used. The second is preferred.

(a) a standard frequency histogram, where the intervals $I_j$ are of equal length. The height of the rectangle for $I_j$ is the frequency $f_j$ or relative frequency $f_j/n$. This type of histogram is similar to a bar chart.

(b) a relative frequency histogram, where the intervals $I_j = [a_{j-1}, a_j)$ may or may not be of equal length. The height of the rectangle for $I_j$ is chosen so that its area equals $f_j/n$, that is, the height of the rectangle for $I_j$ is equal to
$$\frac{f_j/n}{a_j - a_{j-1}}.$$
Note that in this case the sum of the areas of the rectangles in the histogram is equal to one.
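The height rule in (b) is easy to verify numerically. Here is an illustrative Python sketch (the data and interval choices are arbitrary) confirming that the rectangle areas sum to one:

```python
def rel_freq_heights(y, breaks):
    """Heights of a relative frequency histogram: for each interval
    I_j = [a_{j-1}, a_j), height = (f_j / n) / (a_j - a_{j-1})."""
    n = len(y)
    heights = []
    for a_lo, a_hi in zip(breaks[:-1], breaks[1:]):
        fj = sum(1 for v in y if a_lo <= v < a_hi)   # observed frequency f_j
        heights.append((fj / n) / (a_hi - a_lo))
    return heights

y = [1, 2, 2, 3, 5, 7]
breaks = [0, 2, 4, 8]                  # unequal interval lengths are allowed
h = rel_freq_heights(y, breaks)
widths = [b - a for a, b in zip(breaks[:-1], breaks[1:])]
total_area = sum(hj * w for hj, w in zip(h, widths))
print(round(total_area, 10))           # 1.0
```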
We can make the two types of frequency histograms visually comparable by using intervals of equal length for both types. If we wish to compare two groups which have different sample sizes, then a relative frequency histogram must be used. If we wish to superimpose a probability density function on the relative frequency histogram to see how well the data fit the model, then a relative frequency histogram must always be used.

To construct a frequency histogram, the number and location of the intervals must be chosen. The intervals are typically selected so that there are ten to fifteen intervals and each interval contains at least one y-value from the sample (that is, each $f_j \ge 1$). If a software package is used to produce the frequency histogram (see Section 1.7), then the intervals are usually chosen automatically. An option for user-specified intervals is also usually provided.
Figure: Relative frequency histogram of BMI (kurtosis = 3.03)

Figure: Relative frequency histogram of BMI (skewness = 0.30, kurtosis = 2.79)
Figure 1.8: Empirical cumulative distribution function of heights for males and
for females
Boxplots
In many situations, we want to compare the values of a variate for two or more groups,
as in Example 1.3.1 where we compared BMI values and heights for males versus females.
Especially when the number of groups is large (or the sample sizes within groups are small),
side-by-side boxplots are a convenient way to display the data. Boxplots are also called box
and whisker plots.
The boxplot is usually displayed vertically. The center line in each box corresponds to the median, and the lower and upper sides of the box correspond to the lower quartile $q(0.25)$ and the upper quartile $q(0.75)$, respectively. The so-called whiskers extend down and up from the box to a horizontal line. The lower line is placed at the smallest observed data value that is larger than the value $q(0.25) - 1.5 \times IQR$, where $IQR = q(0.75) - q(0.25)$ is the interquartile range. Similarly, the upper line is placed at the largest observed data value that is smaller than the value $q(0.75) + 1.5 \times IQR$. Any values beyond the whiskers (often called outliers) are plotted with special symbols.
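The whisker rules can be expressed directly in code. A Python sketch, using the $(n + 1)p$ quantile convention from Definition 1 (R's `boxplot()` uses a slightly different quartile convention, so its values may differ a little):

```python
def boxplot_stats(y):
    """Quartiles, whisker positions, and outliers following the rules above."""
    ys = sorted(y)
    n = len(ys)

    def q(p):                                   # quantile via m = (n + 1) * p
        m = (n + 1) * p
        if m == int(m):
            return ys[int(m) - 1]
        j = int(m)
        return 0.5 * (ys[j - 1] + ys[j])

    q1, med, q3 = q(0.25), q(0.5), q(0.75)
    iqr = q3 - q1
    lower = min(v for v in ys if v >= q1 - 1.5 * iqr)   # lower whisker position
    upper = max(v for v in ys if v <= q3 + 1.5 * iqr)   # upper whisker position
    outliers = [v for v in ys if v < lower or v > upper]
    return q1, med, q3, lower, upper, outliers

q1, med, q3, lo, hi, out = boxplot_stats([1, 2, 3, 4, 5, 6, 100])
print(q1, med, q3, out)   # 2 4 6 [100]
```

The value 100 falls above the upper whisker limit $q(0.75) + 1.5 \times IQR = 12$, so it is flagged as an outlier.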
Figure: Side-by-side boxplots for males and females
Figure 1.10: Boxplots for miles per gallon for 100 cars from six dierent countries
The graphical summaries discussed to this point deal with a single variate. If we have data on two variates x and y for each unit in the sample, then the data set is represented as $\{(x_i, y_i),\ i = 1, \ldots, n\}$. We are often interested in examining the relationship between the two variates.
Scatterplots
A scatterplot, which is a plot of the points $(x_i, y_i)$, $i = 1, \ldots, n$, can be used to see whether the two variates are related in some way.
Figure: Scatterplots of weight versus height for males (r = 0.55) and for females (r = 0.31)

1.4 Probability Distributions and Statistical Models
Statistical models are used to describe processes such as the daily closing value of a stock or the occurrence and size of claims over time in a portfolio of insurance policies. With populations, we use a statistical model to describe the selection of the units and the measurement of the variates. The model depends on the distribution of variate values in the population (that is, the population histogram) and the selection procedure. We exploit this connection when we want to estimate attributes of the population and quantify the uncertainty in our conclusions. We use the models in several ways:

- questions are often formulated in terms of parameters of the model
- the variate values vary, so random variables can describe this variation
- empirical studies usually lead to inferences that involve some degree of uncertainty, and probability is used to quantify this uncertainty
- procedures for making decisions are often formulated in terms of models
- models allow us to characterize processes and to simulate them via computer experiments
$$P(Y = y) = \binom{500}{y}\, \theta^y (1 - \theta)^{500 - y} \quad \text{for } y = 0, 1, \ldots, 500.$$

Here the parameter $\theta$ represents the unknown proportion of smokers in the population, one attribute of interest in the study.
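A Binomial model like this can be evaluated and checked numerically. A Python sketch (the value of $\theta$ below is purely illustrative, not an estimate from the study):

```python
from math import comb

def binom_prob(y, theta, n=500):
    """P(Y = y) = C(n, y) * theta^y * (1 - theta)^(n - y), the model above."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

# The probabilities over y = 0, 1, ..., 500 sum to one for any theta in (0, 1):
theta = 0.3                                 # illustrative value only
total = sum(binom_prob(y, theta) for y in range(501))
print(round(total, 6))                      # 1.0
```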
Example 1.4.2 An Exponential Distribution Example
In Example 1.3.3, we examined the lifetime (in 1000 km) of a sample of 200 front brake pads taken from the population of all cars of a particular model produced in a given time period. We can model the lifetime of a single brake pad by a continuous random variable Y with Exponential probability density function (p.d.f.)
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0.$$
Here the parameter $\theta > 0$ represents the mean lifetime of the brake pads in the population since, in the model, the expected value of Y is $E(Y) = \theta$.
To model the sampling procedure, we assume that the data $\{y_1, \ldots, y_{200}\}$ represent 200 independent realizations of the random variable Y. That is, we let $Y_i$ = the lifetime for the ith brake pad in the sample, $i = 1, 2, \ldots, 200$, and we assume that $Y_1, Y_2, \ldots, Y_{200}$ are independent Exponential random variables, each having the same mean $\theta$.
We can use the model and the data to estimate $\theta$ and other attributes of interest, such as the proportion of brake pads that fail in the first 100,000 km of use. In terms of the model, we can represent this proportion by
$$P(Y \le 100;\, \theta) = \int_0^{100} f(y; \theta)\, dy = 1 - e^{-100/\theta}.$$
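A natural estimate of $\theta$ is the sample mean lifetime, and plugging that estimate into the formula above gives an estimated failure proportion. A simulated illustration in Python (the "true" $\theta$ of 120 is invented for the simulation and is not from the notes' data):

```python
import math
import random

random.seed(1)                 # reproducible simulated brake-pad lifetimes
true_theta = 120.0             # invented mean lifetime, in 1000 km
y = [random.expovariate(1 / true_theta) for _ in range(200)]

theta_hat = sum(y) / len(y)                   # estimate of the mean lifetime
p_fail = 1 - math.exp(-100 / theta_hat)       # estimated P(Y <= 100; theta)
print(0 < p_fail < 1, theta_hat > 0)          # True True
```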
However, it would be possible to reverse the roles of the two variates here and consider weight to be an explanatory variate and height the response variate if, for example, we wished to predict height using data on individuals' weights.
Models for describing the relationships among two or more variates are considered in more detail in Chapters 6 and 7.
1.5 Data Analysis and Statistical Inference
Whether we are collecting data to increase our knowledge or to serve as a basis for making decisions, proper analysis of the data is crucial. We distinguish between two broad aspects of the analysis and interpretation of data. The first is what we refer to as descriptive statistics: the portrayal of the data, or parts of it, in numerical and graphical ways so as to show features of interest. (On a historical note, the word "statistics" in its original usage referred to numbers generated from data; today the word is used both in this sense and to denote the discipline of Statistics.) We have considered a few methods of descriptive statistics in Section 1.3. The terms data mining and knowledge discovery in databases (KDD) refer to exploratory data analysis where the emphasis is on descriptive statistics. This is often carried out on very large databases. The goal, often vaguely specified, is to find interesting patterns and relationships.
A second aspect of a statistical analysis of data is what we refer to as statistical inference.
That is, we use the data obtained in the study of a process or population to draw general
conclusions about the process or population itself. This is a form of inductive inference, in
which we reason from the specific (the observed data on a sample of units) to the general
(the target population or process). This may be contrasted with deductive inference (as
in logic and mathematics) in which we use general results (e.g. axioms) to prove specific
things (e.g. theorems).
This course introduces some basic methods of statistical inference. Three main types
of problems will be discussed, loosely referred to as estimation problems, hypothesis testing
problems and prediction problems. In the first type, the problem is to estimate one or more
attributes of a process or population. For example, we may wish to estimate the proportion
of Ontario residents aged 14-20 who smoke, or to estimate the distribution of survival
times for certain types of AIDS patients. Another type of estimation problem is that of
fitting or selecting a probability model for a process.
Hypothesis testing problems involve using the data to assess the truth of some question
or hypothesis. For example, we may hypothesize that in the 14-20 age group a higher
proportion of females than males smoke, or that the use of a new treatment will increase
the average survival time of AIDS patients by at least 50 percent.
In prediction problems, we use the data to predict a future value for a process variate
or a unit to be selected from the population. For example, based on the results of a clinical
trial such as Example 1.2.3, we may wish to predict how much an individual's blood pressure
would drop for a given dosage of a new drug. Or, given the past performance of a stock
and other data, to predict the value of the stock at some point in the future.
Statistical analysis involves the use of both descriptive statistics and formal methods of
estimation, prediction and hypothesis testing. As brief illustrations, we return to the first
two examples of Section 1.2.

              Smokers   Non-smokers   Total
    Female       82         168        250
    Male         71         179        250
    Total       153         347        500
Suppose we are interested in the question "Is the smoking rate among teenage girls higher
than the rate among teenage boys?" From the data, we see that the sample proportion of
girls who smoke is 82/250 = 0.328 or 32.8% and the sample proportion of males who smoke
is 71/250 = 0.284 or 28.4%. In the sample, the smoking rate for females is higher. But
what can we say about the whole population? To proceed, we formulate the hypothesis
that there is no difference in the population rates. Then, assuming the hypothesis is true,
we construct two Binomial models as in Example 1.4.1, each with a common parameter θ.
We can estimate θ using the combined data, so that θ̂ = 153/500 = 0.306 or 30.6%. Then,
using the model and the estimate, we can calculate the probability of such a large difference
in the observed rates. A difference this large occurs about 20% of the time (if we selected
samples over and over and the hypothesis of no difference is true), so a difference this large
happens fairly often and therefore, based on the observed data, there is no
evidence of a difference in the population smoking rates. In Chapter 7 we discuss a formal
method for testing the hypothesis of no difference in rates between teenage girls and boys.
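A sketch of this comparison in R; prop.test() is one standard way to formalize such a test, while the calculation described in the text is developed by hand in Chapter 7:

```r
# Sample proportions of smokers, and the pooled estimate under the
# hypothesis of no difference between the two groups
smokers <- c(82, 71)
n <- c(250, 250)
smokers / n                   # 0.328 and 0.284
theta.hat <- sum(smokers) / sum(n)
theta.hat                     # pooled estimate 0.306
prop.test(smokers, n)         # formal two-sample test of equal proportions
```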
Example 1.5.2 A can filler study
Recall Example 1.2.2 where the purpose of the study was to compare the performance
of the two machines in the future. Suppose that every hour, one can is selected from the
new machine and one can from the old machine over a period of 40 hours. You can find
measurements of the amounts of liquid in the cans in the file ch1example152.txt, also
listed in Appendix C. The variates (column headings) are hour, machine (new = 1, old = 2)
and volume (ml). We display the first few rows of the file below.
    Hour   Machine   Volume
     1        1      357.8
     1        2      358.7
     2        1      356.6
     2        2      358.5
    ...      ...       ...
First we examine if the behaviour of the two machines is stable over time. In Figures
1.13 and 1.14, we show a run chart of the volumes over time for each machine. There is no
Figure 1.13: Run chart of the volume for the new machine over time
indication of a systematic pattern for either machine, so we have some confidence that the
data can be used to predict the performance of the machines in the near future.
Figure 1.14: Run chart of the volume for old machine over time
The sample mean and standard deviation for the new machine are 356.8 and 0.54 ml
respectively and, for the old machine, are 357.5 and 0.80 ml. Figures 1.15 and 1.16 show the
relative frequency histograms of the volumes for the new machine and the old machine
respectively. To see how well a Gaussian model might fit these data, we superimpose Gaussian
probability density functions, with the mean equal to the sample mean and the standard
deviation equal to the sample standard deviation, on each histogram. The agreement is
reasonable given that the sample size for both data sets is only forty. Note that it only
makes sense to compare density functions with relative frequency histograms (not standard
frequency histograms) since in both cases the areas equal one.
None of the 80 cans had volume less than the required 355 ml. However, we examined
Figure 1.15: Relative frequency histogram of volumes for the new machine, with the
G(356.76, 0.54) probability density function superimposed (skewness = 0.22, kurtosis = 2.38)
Figure 1.16: Relative frequency histogram of volumes for the old machine, with the
G(357.5, 0.80) probability density function superimposed (skewness = 0.54, kurtosis = 2.84)
only 40 cans per machine. We can use the Gaussian models to estimate the long term
proportion of cans that fall below the required volume. For the new machine, we find
that if V ~ G(356.8, 0.54) then P(V ≤ 355) = 0.0005, so about 5 in 10,000 cans will be
underfilled. The corresponding rate for the old machine is about 8 in 10,000 cans. These
estimates are subject to a high degree of uncertainty because they are based on a small
sample and we have no way to test that the models are appropriate so far into the tails of
the distribution.
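These tail probabilities can be computed in R with pnorm(); note that R's Normal functions take the standard deviation, not the variance, as the second parameter. The means below are the sample means quoted above:

```r
# Estimated long-run underfill rates under the fitted Gaussian models
p.new <- pnorm(355, mean = 356.76, sd = 0.54)  # new machine
p.old <- pnorm(355, mean = 357.5,  sd = 0.80)  # old machine
c(p.new, p.old)  # on the order of 5 and 8 per 10,000
```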
We can also see that the new machine is superior because of its smaller sample mean,
which translates into less overfill (and hence less cost to the manufacturer). It is possible
to adjust the mean of the new machine to a lower value because of its smaller standard
deviation.
1.6 Statistical Software and R
Statistical software is essential for data manipulation and analysis. It is also used to deal
with numerical calculations, to produce graphics, and to simulate probability models. There
are many statistical software systems; some of the most comprehensive and popular are SAS,
S-Plus, SPSS, Stata, Systat, Minitab and R. Spreadsheet software such as Excel is also
useful.
In this course we use the R software system. It is an open source package that has
extensive statistical capabilities and very good graphics procedures. The R home page is
www.r-project.org where a free download is available for most common operating systems.
Some of the basics of R are described in the next section. We use R for several purposes:
to manipulate and graph data, to fit and check statistical models, to estimate attributes or
test hypotheses, to simulate data from probability models.
Using R
Lots of online help is available in R. You can use a search engine to find the answer to most
questions. For example, if you search for "R tutorial", you will find a number of excellent
introductions to R that explain how to carry out most tasks. Within R, you can find help
for a specific function using the command help(function name), but it is often easier to look
externally using a search engine.
Here we show how to use R on a Windows machine. You should have R open as you
read this material so you can play along.
Some R Basics
R is command-line driven. For example, if you want to define a quantity x, use the
assignment function <- (that is, < followed by -).

    x <- 15
    x <- c(1, 3, 5)
You can add comments by entering # with the comment following on the same line.
Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here.
Vectors are defined using the function c( ). For example,

    x <- c(1, 3, 5, 7, 9)
defines a vector of length 5 with the elements given. You can display the vector by typing
x and carriage return. Vectors and other objects possess certain attributes. For example,
typing
length(x)
will give the length of the vector x.
You can cut and paste comma-delimited strings of data into the function c( ). This is
one way to enter data into R. See below to learn how you can read a file into R.
Arithmetic
R can be used as a calculator. Enter the calculation after the prompt > and hit return as
shown below.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
You can save the result of the calculation by assigning it to a variable, such as y <- 7+3.
Some Functions
There are many functions in R. Most operate on vectors in a transparent way, as do
arithmetic operations. (For example, if x and y are vectors then x + y adds the vectors
element-wise; if x and y are different lengths, R may do surprising things!) Some examples,
with comments, follow.

    > x <- c(1,3,5,7,9)    # Define a vector x
    > x                    # Display x
    [1] 1 3 5 7 9
    > y <- seq(1,2,.25)    # A useful function for defining a vector whose
                           # elements form an equally spaced sequence
We often want to compare summary statistics of variate values by group (such as sex). We
can use the by() function. For example,

    > y <- rnorm(100)
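The truncated example above can be completed along the following lines; the grouping variate sex here is hypothetical:

```r
# Summary statistics of y computed separately within each group
y <- rnorm(100)
sex <- rep(c("F", "M"), each = 50)  # hypothetical grouping variate
by(y, sex, summary)                 # one five-number-style summary per group
by(y, sex, mean)                    # group means
```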
Graphs
Note that in R, a graphics window opens automatically when a graphical function is used. A
useful way to create several plots in the same window is the function par() so, for example,
following the command
par(mfrow=c(2,2))
the next 4 plots will be placed in a 2 × 2 array within the same window.
There are various plotting and graphical functions. Three useful ones are
plot(y~x)
hist(y)
boxplot(y~x)
You can control the axes of plots (especially useful when you are making comparisons) by
including xlim = c(a,b) and ylim = c(d,e) as arguments separated by commas within
the plotting function. Also you can label the axes by including xlab = "your choice" and
ylab = "your choice". A title can be added using main = "your choice". There are many
other options. Check out the Html help "An Introduction to R" for more information on
plotting.
To save a graph, you can copy and paste into a Word document, for example, or alternatively
use the Save As menu to create a file in one of several formats.
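Putting these pieces together, a minimal sketch (the data are simulated purely for illustration, and the graph is written to a temporary file):

```r
# Two plots side by side, with controlled axes, labels and titles
pdf(f <- tempfile(fileext = ".pdf"))  # file-based graphics device
par(mfrow = c(1, 2))                  # 1 x 2 array of plots
x <- rnorm(50)
y <- 2 * x + rnorm(50)
plot(y ~ x, xlim = c(-3, 3), xlab = "x", ylab = "y", main = "Scatterplot")
hist(y, main = "Histogram of y")
dev.off()
```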
Probability Distributions
There are functions which compute values of probability functions or probability density
functions, cumulative distribution functions, and quantiles for various distributions. It is
also possible to generate random samples from these distributions. Some examples follow
for the Gaussian distribution. For other distributions, type help(distributionname) or
check the Introduction to R in the Html help menu.
> y<- rnorm(10,25,5)
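The rnorm example above generates a random sample; the three related functions for the same G(25, 5) distribution work along these lines (all numbers illustrative):

```r
y <- rnorm(10, 25, 5)   # random sample of size 10 from G(25, 5)
dnorm(25, 25, 5)        # p.d.f. evaluated at y = 25
pnorm(30, 25, 5)        # c.d.f.: P(Y <= 30)
qnorm(0.975, 25, 5)     # 0.975 quantile
```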
The argument header=T tells R that the variate names are in the first row of the data file.
The object a is called a data frame in R, and the variate names are of the form a$v1, where
v1 is the name of the first column in the file. The R function attach(a) allows you to drop
the a$ from the variate names.
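A self-contained sketch (a small file is written first so the example can be run anywhere; the file name is illustrative):

```r
# Write a small data file, then read it back into a data frame
writeLines(c("hour machine volume",
             "1 1 357.8",
             "1 2 358.7"), "tmpdata.txt")
a <- read.table("tmpdata.txt", header = T)  # header=T: first row has the names
a$volume                                    # refer to a column by name
attach(a)                                   # now 'volume' works on its own
mean(volume)
```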
Writing data to a file

You can cut and paste output generated by R in the sessions window, although the format
is usually messed up. This approach works best for figures. You can write an R vector or
other object to a text le through
write(y,file="filename")
To see more about the write function use help(write).
    hour   machine   volume          hour   machine   volume
     1        1      357.8            21       1      356.5
     1        2      358.7            21       2      357.3
     2        1      356.6            22       1      356.9
    ...      ...       ...           ...      ...       ...
1.7 Chapter 1 Problems

1. The sample mean and the sample median are two different ways to measure the
location of a data set (y1, y2, ..., yn). Let ȳ be the average and m̂ be the median of
the data set.

(a) Suppose we transform the data so that ui = a + byi, i = 1, ..., n, where a and
b are constants with b ≠ 0. How are the sample mean and sample median of
u1, ..., un related to ȳ and m̂?

(b) Suppose we transform the data by squaring, so that vi = yi², i = 1, ..., n. How
are the sample mean and sample median of v1, ..., vn related to ȳ and m̂?
(c) Consider the quantities ri = yi − ȳ, i = 1, ..., n. Show that Σᵢ₌₁ⁿ ri = 0. Is it
true that Σᵢ₌₁ⁿ (yi − m̂) = 0?

(d) Suppose we include an extra observation y0 in the data set and define a(y0) to
be the mean of the augmented data set. Express a(y0) in terms of ȳ and y0.
What happens to the sample mean as y0 gets large (or small)?
(e) Repeat the previous question for the sample median. Hint: Let y(1), ..., y(n) be
the original data set with the observations arranged in increasing order.
(f) Use (d) and (e) to explain why the sample median income of a country might be
a more appropriate summary than the sample mean income.
(g) Show that V(a) = Σᵢ₌₁ⁿ (yi − a)² is minimized when a = ȳ.

(h) Show that Σᵢ₌₁ⁿ |yi − a| is minimized when a = m̂. Hint: Calculate the ...
2. The sample standard deviation and the interquartile range (IQR) are two different
measures of the variability of a data set (y1, y2, ..., yn).

(a) Suppose we transform the data so that ui = a + byi, i = 1, ..., n, where a and b are
constants and b ≠ 0. How do the sample standard deviation and IQR change?
(b) Show that Σᵢ₌₁ⁿ (yi − ȳ)² = Σᵢ₌₁ⁿ yi² − n ȳ².
(c) Suppose we include an extra observation y0 in the data set. Use the result in
(b) to write the sample standard deviation of the augmented data set in terms
of y0 and the original sample standard deviation. What happens when y0 gets
large (or small)?
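The identity in 2(b) is easy to check numerically in R (the data vector below is arbitrary):

```r
# Verify sum((y - ybar)^2) == sum(y^2) - n * ybar^2 on a small data set
y <- c(2.1, 3.5, 4.0, 1.2, 5.3)
n <- length(y)
lhs <- sum((y - mean(y))^2)
rhs <- sum(y^2) - n * mean(y)^2
all.equal(lhs, rhs)
```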
3. The sample skewness and kurtosis are two different measures of the shape of a data
set (y1, y2, ..., yn). Suppose we transform the data so that ui = a + byi, i = 1, ..., n,
where a and b are constants and b ≠ 0. How do the sample skewness and kurtosis
change?
4. Suppose we have data for the costs of production for a firm every month from January
2011 to December 2012. The data are denoted by c1, c2, ..., c24. For this data set the
mean cost was $2500, the sample standard deviation was $5500 and the range was $7500.
The relationship between cost and revenue is given by ri = 7ci + 1000, i = 1, 2, ..., 24.
Find the mean revenue, the sample variance of the revenues and the range of the
revenues.
5. Mass production of complicated assemblies such as automobiles depends on our ability
to manufacture the components to very tight specifications. The component manufacturer
tracks performance by measuring a sample of parts and comparing the measurements
to the specification. Suppose the specification for the diameter of a piston is a nominal
value ± 10 microns (10⁻⁶ m). The data below (also available in the file ch1exercise3.txt)
are the diameters of 50 pistons collected from the more than 10,000 pistons produced in
one day. (The measurements are the diameters minus the nominal value in microns.)
    -12.8  -7.3  -3.9  -3.4  -2.9  -2.7  -2.5  -2.3  -1.0  -0.9
     -0.8  -0.7  -0.6  -0.4  -0.4  -0.2   0.0   0.5   0.6   0.7
      1.2   1.8   1.8   2.0   2.1   2.5   2.6   2.6   2.7   2.8
      3.3   3.4   3.5   3.8   4.3   4.6   4.7   5.1   5.4   5.7
      5.8   6.6   6.6   7.0   7.2   7.9   8.5   8.6   8.7   8.9

    Σᵢ₌₁⁵⁰ yi = 100.7        Σᵢ₌₁⁵⁰ yi² = 1110.79
(a) Plot a relative frequency histogram of the data. Is the process producing pistons
within the specifications?

(b) Calculate the sample mean ȳ and the sample median of the diameters.

(c) Calculate the sample standard deviation s and the IQR.

(d) Such data are often summarized using a single performance index called Ppk,
defined as

    Ppk = min( (U − ȳ)/(3s), (ȳ − L)/(3s) )

where (L, U) = (−10, 10) are the lower and upper specification limits. Calculate
Ppk for these data.
(e) Explain why larger values of Ppk (i.e. greater than 1) are desirable.
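A sketch of parts (b)-(d) in R, working from the summary statistics printed with the data:

```r
# Sample mean, standard deviation and Ppk from the summaries
n <- 50
sum.y  <- 100.7        # sum of the diameters
sum.y2 <- 1110.79      # sum of squared diameters
ybar <- sum.y / n
s <- sqrt((sum.y2 - n * ybar^2) / (n - 1))
L <- -10; U <- 10
Ppk <- min((U - ybar) / (3 * s), (ybar - L) / (3 * s))
c(ybar, s, Ppk)
```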
    S² = (1/(n−1)) Σᵢ₌₁ⁿ (Yi − Ȳ)² = (1/(n−1)) [ Σᵢ₌₁ⁿ Yi² − n Ȳ² ],   where Ȳ = (1/n) Σᵢ₌₁ⁿ Yi.
9. The data below show the lengths (in cm) of 43 male coyotes and 40 female coyotes
captured in Nova Scotia. (Based on Table 2.3.2 in Wild and Seber, 1999.) The data
are available in the file ch1exercise5.txt.
    Females x:   Σᵢ₌₁⁴⁰ xi = 3569.6,    Σᵢ₌₁⁴⁰ xi² = 320223.38
    Males y:     Σᵢ₌₁⁴³ yi = 3958.4,    Σᵢ₌₁⁴³ yi² = 366276.84
(a) Plot relative frequency histograms of the lengths for females and males separately.
Be sure to use the same bins.

(b) Determine the five number summary for each data set.

(c) Compute the sample mean ȳ and sample standard deviation s for the lengths
of the female and male coyotes separately. Assuming μ = ȳ and σ = s, overlay
the corresponding G(μ, σ) probability density function on the histograms for the
females and males separately. Comment on how well the Normal model fits each
data set.

(d) Plot the empirical distribution function of the lengths for females and males
separately. Assuming μ = ȳ and σ = s, overlay the corresponding G(μ, σ)
cumulative distribution functions. Comment on how well the Normal model fits
each data set.
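A sketch of the kind of R code parts (c) and (d) call for, shown on simulated lengths since the raw values live in the data file:

```r
# Histogram with fitted Gaussian p.d.f., and empirical c.d.f. with fitted c.d.f.
y <- rnorm(40, 89.2, 6)    # simulated stand-in for the 40 female lengths
m <- mean(y); s <- sd(y)   # precompute so curve() does not capture its own x
pdf(f <- tempfile(fileext = ".pdf"))
par(mfrow = c(1, 2))
hist(y, freq = FALSE, main = "Lengths (cm)")  # relative frequency histogram
curve(dnorm(x, m, s), add = TRUE)             # fitted G(m, s) p.d.f.
plot(ecdf(y), main = "Empirical c.d.f.")
curve(pnorm(x, m, s), add = TRUE)             # fitted G(m, s) c.d.f.
dev.off()
```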
10. Does the value of an actor influence the amount grossed by a movie? The value
of an actor will be measured by the average amount the actor's movies have made.
The amount grossed by a movie is measured by taking the highest grossing movie
in which that actor played a major part. For example, Tom Hanks, whose value is
103.2, had his best results with Toy Story 3 (gross 415.0). All numbers are corrected
to 2012 dollar amounts and have units millions of U.S. dollars. Twenty actors
were selected by taking the first twenty alphabetically listed by name on the website
(http://boxofficemojo.com), and the corresponding measurements were obtained for
each actor. The data for 20 actors, their value (x) and the gross (y) of their best
movie, are given below:
    Actor       1      2      3      4      5      6      7      8      9     10
    Value (x)  67.0   49.6   37.7   47.3   47.3   32.9   36.5   92.8   17.6   14.4
    Gross (y) 177.2  201.6  183.4   55.1  154.7  182.8  277.5  415.0   90.8   83.9

    Actor      11     12     13     14     15     16     17     18     19     20
    Value (x)  51.1   54.0   30.5   42.1   23.6   62.4   32.9   26.9   43.7   50.3
    Gross (y) 158.7  242.8   37.1  220.0  146.3  168.4  173.8   58.4  199.0  533.0

    Σᵢ₌₁²⁰ xi = 860.6        Σᵢ₌₁²⁰ xi² = 43315.04
    Σᵢ₌₁²⁰ yi = 3759.5       Σᵢ₌₁²⁰ yi² = 971560.19       Σᵢ₌₁²⁰ xi yi = 184540.93
(a) What are the two variates in this data set? Choose one variate to be an
explanatory variate and the other to be a response variate. Justify your choice.

(b) Plot a scatterplot of the data.

(c) Calculate the sample correlation for the data (xi, yi), i = 1, 2, ..., 20. Is there a
strong positive or negative relationship between the two variates?

(d) Is it reasonable to conclude that the explanatory variate in this problem causes
the response variate? Explain.
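Parts (b) and (c) can be sketched in R using the data from the table:

```r
# Scatterplot and sample correlation of actor value vs. best-movie gross
x <- c(67, 49.6, 37.7, 47.3, 47.3, 32.9, 36.5, 92.8, 17.6, 14.4,
       51.1, 54, 30.5, 42.1, 23.6, 62.4, 32.9, 26.9, 43.7, 50.3)
y <- c(177.2, 201.6, 183.4, 55.1, 154.7, 182.8, 277.5, 415, 90.8, 83.9,
       158.7, 242.8, 37.1, 220, 146.3, 168.4, 173.8, 58.4, 199, 533)
plot(y ~ x, xlab = "Value", ylab = "Gross")
cor(x, y)   # the sample correlation
```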
11. In a very large population a proportion θ of people have blood type A. Suppose n
people are selected at random. Define the random variable Y = number of people
with blood type A in the sample of size n.

(a) What is the probability function for Y?

(b) What are E(Y) and Var(Y)?

(c) Suppose n = 50. What is the probability of observing 20 people with blood type
A, as a function of θ?

(d) If for n = 50 we observed y = 20 people with blood type A, what is a reasonable
estimate of θ based on this information?

(e) More generally, suppose in a given experiment the random variable of interest Y
has a Binomial(n, θ) distribution. If the experiment is conducted and y successes
are observed, what is a good estimate of θ based on this information?

(f) Let Y ~ Binomial(n, θ). Find E(Y/n) and Var(Y/n). What happens to
Var(Y/n) as n → ∞? What does this imply about how far Y/n is from θ
for large n?

(g) There are actually 4 blood types: A, B, AB, O. Let Y1 = number with type
A, Y2 = number with type B, Y3 = number with type AB, and Y4 = number
with type O in a sample of size n. What is the joint probability function of Y1,
Y2, Y3, Y4? (Let θ1 = proportion of type A, θ2 = proportion of type B, θ3 =
proportion of type AB, θ4 = proportion of type O in the population.)

(h) If in a sample of n people the observed data were y1, y2, y3, y4, what would be
reasonable estimates of θ1, θ2, θ3, θ4?
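For part (c), the probability of y = 20 as a function of θ can be examined numerically in R; the grid of θ values below is of course arbitrary:

```r
# dbinom(20, 50, theta) as a function of theta, and the maximizing theta
theta <- seq(0.01, 0.99, by = 0.01)
prob <- dbinom(20, 50, theta)
theta[which.max(prob)]   # the sample proportion 20/50 = 0.4
```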
12. The IQs of UWaterloo Math students are Normally distributed with mean μ and
standard deviation σ. Define the random variable Y = IQ of a UWaterloo Math
student.

(a) What is the probability density function of Y?

(b) What are E(Y) and Var(Y)?

(c) Suppose that the IQs for 16 students were:

127 108 127 136 125 130 127 117 123 112 129 109 109 112 91 134

    Σᵢ₌₁¹⁶ yi = 1916,    Σᵢ₌₁¹⁶ yi² = 231618

What is the distribution of Ȳ = (1/n) Σᵢ₌₁ⁿ Yi?
13. The lifetimes of a certain type of battery are Exponentially distributed with parameter θ.
Define the random variable Y = lifetime of a battery.

(a) What is the probability density function of Y?

(b) What are E(Y) and Var(Y)?

(c) Suppose the lifetimes (in hours) for 20 batteries were:

20.5 9.9 206.4 9.1 45.8 232.7 127.8 60.4 4.3 3.6
184.8 3.0 4.4 72.3 22.3 195.3 86.3 8.8 23.3 4.1

    Σᵢ₌₁²⁰ yi = 1325.1

What is a reasonable estimate of θ based on these data?

(d) Suppose Yi ~ Exponential(θ), i = 1, 2, ..., n, independently, and let Ȳ = (1/n) Σᵢ₌₁ⁿ Yi.
Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does this
imply about how far Ȳ is from θ for large n?
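A numerical sketch for 13(c): under an Exponential model with θ = E(Y), the natural estimate of θ is the sample mean of the lifetimes:

```r
# Estimate of theta (mean lifetime) from the 20 battery lifetimes
y <- c(20.5, 9.9, 206.4, 9.1, 45.8, 232.7, 127.8, 60.4, 4.3, 3.6,
       184.8, 3.0, 4.4, 72.3, 22.3, 195.3, 86.3, 8.8, 23.3, 4.1)
theta.hat <- mean(y)
theta.hat   # 1325.1 / 20 = 66.255 hours
```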
14. Accidents occur on Wednesdays at a particular intersection at random at the average
rate of θ accidents per Wednesday, according to a Poisson process. Define the random
variable

Y = number of accidents on a Wednesday at this intersection.

(a) What is the probability function for Y?

(b) What are E(Y) and Var(Y)?

(c) Suppose on 6 consecutive Wednesdays the number of accidents observed was
0, 2, 0, 1, 3, 1. What is the probability of observing these data as a function
of θ? (Remember the Poisson process assumption that the numbers of events in
non-overlapping time intervals are independent.) What is a reasonable estimate
of θ based on these data?
(d) Suppose Yi ~ Poisson(θ), i = 1, 2, ..., n, independently, and let Ȳ = (1/n) Σᵢ₌₁ⁿ Yi.
Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does this
imply about how far Ȳ is from θ for large n?
Figure 1.17: Pie chart of support for Republican Presidential candidates

15. The pie chart in Figure 1.17, from Fox News, shows the support for various Republican
Presidential candidates. What do you notice about this pie chart? Comment on how
effective pie charts are in general at conveying information.
16. For the graph in Figure 1.18, indicate whether you believe the graph is effective in
conveying information by giving at least one feature of the graph which is either good
or bad.

[Figure 1.18: bar chart of snack preferences for boys and girls; categories are Candy,
Chips, Chocolate bars, Cookies, Crackers, Fruit, Ice cream, Popcorn, Pretzels and
Vegetables, plotted against Number of Students (0 to 300)]
17. The graphs in Figures 1.19 and 1.20 are two more classic Fox News graphs. What do
you notice? What political message do you think they were trying to convey to their
audience?
18. Information about the mortality from malignant neoplasms (cancer) for females living
in Ontario is given in Figures 1.21 and 1.22 for the years 1970 and 2000 respectively.
The same information displayed in these two pie charts is also displayed in the bar
graph in Figure 1.23. Which display seems to carry the most information?

[Figure 1.21: Mortality from malignant neoplasms for females in Ontario, 1970; pie
chart with categories Lung, Leukemia & Lymphoma, Breast, Colorectal, Stomach,
Other]

[Figure 1.22: Mortality from malignant neoplasms for females in Ontario, 2000; pie
chart with categories Lung, Breast, Colorectal, Stomach, Other]
[Figure 1.23: bar graph of mortality from malignant neoplasms for females living in
Ontario, 1970 and 2000; categories Lung, Breast, Colorectal, Stomach, Other]
The material in this section is largely a review of material you have seen in a previous
probability course. This material is available in the STAT 230 Notes, which are posted on
the course website.

The University of Wisconsin-Madison statistician George E.P. Box (18 October 1919 to
28 March 2013) said of statistical models that "all models are wrong but some are useful",
which is to say that although they rarely fit very large amounts of data perfectly, they do
assist in describing and drawing inferences from real data.
In probability theory, there is a large emphasis on factor 1 above, and there are many
"families" of probability distributions that describe certain types of situations. For example,
the Binomial distribution was derived as a model for outcomes in repeated independent
trials with two possible outcomes on each trial, while the Poisson distribution was derived
as a model for the random occurrence of events in time or space. The Gaussian or Normal
distribution, on the other hand, is often used to represent the distributions of continuous
measurements such as the heights or weights of individuals. This choice is based largely on
past experience that such models are suitable and on mathematical convenience.

In choosing a model we usually consider families of probability distributions. To be
specific, we suppose that for a random variable Y we have a family of probability functions
or probability density functions f(y; θ) indexed by the parameter θ (which may be a
vector of values). In order to apply the model to a specific problem we need a value for θ.
The process of selecting a value for θ based on the observed data is referred to as estimating
the value of θ, or fitting the model. The next section describes the most widely used
method for estimating θ.
Most applications require a sequence of steps in the formulation (the word specification
is also used) of a model. In particular, we often start with some family of models in mind,
but find after examining the data set and fitting the model that it is unsuitable in certain
respects. (Methods for checking the suitability of a model will be discussed in Section 2.4.)
We then try other models, and perhaps look at more data, in order to work towards
a satisfactory model. This is usually an iterative process, which is sometimes represented
by diagrams such as:
    Collect and examine data set
                ↓
    Propose a (revised?) model
                ↓            ↑
       Fit model  →  Check model
                ↓
        Draw conclusions
Statistics devotes considerable effort to the steps of this process. However, in this
course we will focus on settings in which the models are not too complicated, so that model
formulation problems are minimized. There are several distributions that you should review
before continuing since they will appear frequently in these notes. See the Stat 230 Notes
available on the course webpage. You should also consult the Table of Distributions at the
end of these notes for a condensed table of properties of these distributions including their
moment generating functions and their moments.
                      Discrete                                  Continuous

    c.d.f.            F(x) = P(X ≤ x) = Σ_{t ≤ x} P(X = t)      F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt;
                                                                F is a continuous function for all x ∈ ℝ

    p.f. / p.d.f.     f(x) = P(X = x)                           f(x) = (d/dx) F(x) ≠ P(X = x) = 0

    Probability       P(X ∈ A) = Σ_{x ∈ A} P(X = x)             P(a < X ≤ b) = F(b) − F(a)
    of an event                = Σ_{x ∈ A} f(x)                             = ∫_a^b f(x) dx

    Total             Σ_{all x} P(X = x) = Σ_{all x} f(x) = 1   ∫_{−∞}^{∞} f(x) dx = 1
    probability

    Expectation       E[g(X)] = Σ_{all x} g(x) f(x)             E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Binomial Distribution

The discrete random variable (r.v.) Y has a Binomial distribution if its probability
function is of the form

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)    for y = 0, 1, ..., n

where θ is a parameter with 0 < θ < 1. We write Y ~ Binomial(n, θ).
Poisson Distribution

The discrete random variable Y has a Poisson distribution if its probability function is
of the form

    f(y; θ) = (θ^y e^(−θ)) / y!    for y = 0, 1, 2, ...

where θ is a parameter with θ > 0. We write Y ~ Poisson(θ). Recall that E(Y) = θ and
Var(Y) = θ.
Exponential Distribution

The continuous random variable Y has an Exponential distribution if its probability
density function is of the form

    f(y; θ) = (1/θ) e^(−y/θ)    for y > 0

where θ is a parameter with θ > 0. We write Y ~ Exponential(θ). Recall that E(Y) = θ
and Var(Y) = θ².
Gaussian (Normal) Distribution

The continuous random variable Y has a Gaussian (Normal) distribution if its probability
density function is of the form

    f(y; μ, σ) = (1 / (σ√(2π))) exp( −(1/(2σ²)) (y − μ)² )    for y ∈ ℝ

where μ and σ are parameters, with μ ∈ ℝ and σ > 0. Recall that E(Y) = μ, Var(Y) = σ²,
and the standard deviation of Y is sd(Y) = σ. We write either Y ~ G(μ, σ) or
Y ~ N(μ, σ²). Note that in the former case, G(μ, σ), the second parameter is the standard
deviation, whereas in the latter, N(μ, σ²), the second parameter is the variance σ².
Most software syntax, including R's, requires that you input the standard deviation for the
second parameter. As seen in examples in Chapter 1, the Gaussian distribution provides a suitable
model for the distribution of measurements on characteristics like the height or weight of
individuals in certain populations, but is also used in many other settings. It is particularly
useful in finance, where it is the most commonly used model for asset prices, exchange rates,
interest rates, etc.
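A quick check in R that the second argument of the Normal functions is the standard deviation σ, not the variance σ²:

```r
# pnorm's third argument is sd: P(Y <= 2) for Y ~ G(0, 2), i.e. variance 4
pnorm(2, mean = 0, sd = 2)   # equals pnorm(1) for the standard Normal
pnorm(1)
```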
Multinomial Distribution

The Multinomial distribution is a multivariate distribution in which the discrete random
variables Y1, ..., Yk (k ≥ 2) have the joint probability function

    P(Y1 = y1, ..., Yk = yk; θ) = f(y1, ..., yk; θ)
                                = (n! / (y1! y2! ⋯ yk!)) θ1^(y1) θ2^(y2) ⋯ θk^(yk)    (2.1)

where each yi, for i = 1, ..., k, is an integer between 0 and n, satisfying the condition
Σᵢ₌₁ᵏ yi = n. The elements of the parameter vector θ = (θ1, ..., θk) satisfy 0 < θi < 1 for
i = 1, ..., k and Σᵢ₌₁ᵏ θi = 1. The Multinomial distribution arises when there are repeated
independent trials, where each trial has k possible outcomes (call them outcomes 1, ..., k),
and the probability that outcome i occurs is θi. If Yi, i = 1, ..., k, is the number of times
that outcome i occurs in a sequence of n independent trials, then (Y1, ..., Yk) has a
Multinomial distribution, with Σᵢ₌₁ᵏ Yi = n.
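Multinomial probabilities can be evaluated in R with dmultinom(); the counts and probabilities below are illustrative:

```r
# P(Y1=3, Y2=2, Y3=5) for n=10 trials with outcome probabilities (0.3, 0.2, 0.5)
dmultinom(c(3, 2, 5), prob = c(0.3, 0.2, 0.5))
```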
2.2 Estimation of Parameters and the Method of Maximum Likelihood
Suppose a probability distribution that serves as a model for some random process depends
on an unknown parameter θ (possibly a vector). In order to use the model we have to
estimate or specify a value for θ. To do this we usually rely on some data set that has
been collected for the random variable in question. It is important that a data set be
collected carefully, and we consider this issue in Chapter 3. For example, suppose that the
random variable Y represents the weight of a randomly chosen female in some population,
and that we consider a Gaussian model, Y ~ G(μ, σ). Since E(Y) = μ, we might decide to
randomly select, say, 50 females from the population, measure their weights y1, y2, ..., y50,
and use the average

    μ̂ = ȳ = (1/50) Σᵢ₌₁⁵⁰ yi    (2.2)

to estimate μ. This seems sensible (why?) and similar ideas can be developed for other
parameters; in particular, note that σ must also be estimated, and you might think about
how you could use y1, ..., y50 to do this. (Hint: what does σ or σ² represent in the Gaussian
model?) Note that although we are estimating the parameter μ we did not write μ = ȳ.
We introduced a special notation, μ̂. This serves a dual purpose, both to remind you that ȳ
is not exactly equal to the unknown value of the parameter μ, but also to indicate that μ̂ is
a quantity derived from the data yi, i = 1, 2, ..., 50, and depends on the sample. A different
draw of the sample yi, i = 1, 2, ..., 50, will result in a different value for μ̂.
Definition 7 An estimate of a parameter θ is the value of a function of the observed data
y1, y2, ..., yn and other known quantities such as the sample size n. We use θ̂ to denote an
estimate of the parameter θ.

Note that θ̂ = θ̂(y1, y2, ..., yn) = θ̂(y) depends on the sample y = (y1, y2, ..., yn) drawn.
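In R, the estimate (2.2) and the corresponding estimate of σ are one line each; the simulated weights below stand in for real data:

```r
# Estimates of mu and sigma from a sample of 50 weights (simulated here)
y <- rnorm(50, mean = 60, sd = 8)  # hypothetical weights in kg
mu.hat <- mean(y)                  # the estimate (2.2)
sigma.hat <- sd(y)                 # one natural estimate of sigma
c(mu.hat, sigma.hat)
```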
is defined as

    L(θ) = L(θ; y) = P(Y = y; θ)    for θ ∈ Ω

where the parameter space Ω is the set of possible values of θ.

Note that the likelihood function is a function of the parameter θ and the given data y.
For convenience we usually write just L(θ). Also, the likelihood function is the probability
that we observe at random the observation y, considered as a function of the parameter θ.
Obviously values of the parameter that make our observation y more probable would
seem more credible or likely than those that make it less probable. Therefore values of θ
for which L(θ) is large are more consistent with the observed data y. This seems like a
sensible approach, and it turns out to have very good properties.

Definition 9 The value of θ which maximizes L(θ) for given data y is called the maximum
likelihood estimate⁸ (m.l. estimate) of θ. The value is denoted by θ̂.
Example 2.2.1 A public opinion poll9
We are surrounded by polls. They guide the policies of our political leaders, the products that are developed by manufacturers, and increasingly the content of the media. For example, the article on the next page, published in the CAUT (Canadian Association of University Teachers) Bulletin, describes a poll10 conducted by the Harris/Decima company. Harris/Decima conducts semi-annual polls for CAUT to learn about Canadian public opinion on post-secondary education in Canada. The poll described in the article was conducted in November 2010. Harris/Decima uses a telephone poll of 2000 representative adults. Figure 2.1 shows the results for the polls conducted in fall 2009 and 2010. In 2009 and 2010, 26% of respondents agreed and 48% disagreed with the statement: "University and college teachers earn too much."
8. We will often distinguish between the maximum likelihood estimator, which is the random variable given by the function of the data in general, and its numerical value for the data at hand, referred to as the maximum likelihood estimate.
9. See the corresponding video "harris decima poll and introduction to likelihoods" at www.watstat.ca
10. http://www.caut.ca/uploads/Decima_Fall_2010.pdf
Figure 2.1: Harris/Decima poll. The two bars are from polls conducted in Nov.
9, 2009 (left bar) and Nov 10, 2010 (right bar)
Harris/Decima declared their result to be accurate within $\pm 2.2\%$, 19 times out of 20 (the margin of error for regional, demographic or other subgroups is larger). What does this mean, and how were these estimates and intervals obtained?
Suppose that the random variable $Y$ represents the number of individuals who, in a randomly selected group of $n$ persons, agreed with the statement. Suppose we assume that $Y$ is closely modelled by a Binomial distribution with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where $\theta$ represents the fraction of the Canadian adult population that agree. In this case, if we select a random sample of $n = 2000$ persons and obtain their views, the observed data are $y = 520$, the number out of 2000 who were polled that agreed with the statement. Thus the likelihood function is given by
$$L(\theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad \text{for } 0 < \theta < 1 \qquad (2.3)$$
$$= \binom{2000}{520} \theta^{520} (1 - \theta)^{2000-520} \quad \text{for } 0 < \theta < 1. \qquad (2.4)$$
It is easy to see that (2.3) is maximized by the value $\theta = \hat{\theta} = y/n$. (You should show this.) The estimate $\hat{\theta} = y/n$ is called the sample proportion. For this example the value of this maximum likelihood estimate is $520/2000 = 0.26$ or 26%. This is also easily seen from a graph of the likelihood function (2.4) given in Figure 2.2.
The interval suggested by the pollsters was $26 \pm 2.2\%$ or $[23.8, 28.2]\%$. Looking at Figure 2.2 we see that the interval $[0.238, 0.282]$ is a reasonable interval for the parameter $\theta$ since it seems to contain most of the values of $\theta$ with large values of the likelihood $L(\theta)$. We will return to the construction of such interval estimates in Chapter 4.
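The maximization can also be checked numerically. A Python sketch (a grid search, for illustration only — not how pollsters work) that recovers $\hat{\theta} = y/n = 0.26$ for the poll data:

```python
# Maximize the binomial log likelihood l(theta) = y log(theta) + (n-y) log(1-theta)
# over a fine grid and compare with the closed form theta-hat = y/n.
import math

n, y = 2000, 520  # poll data: 520 of 2000 respondents agreed

def log_lik(theta):
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=log_lik)
# theta_hat agrees with the sample proportion y/n = 0.26
```

Working with the log of the likelihood here is deliberate; the raw likelihood (2.4) is astronomically small for $n = 2000$ and would underflow ordinary floating point.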
Figure 2.2: Likelihood function $L(\theta)$ for the Harris/Decima poll and corresponding interval estimate for $\theta$
Note that the likelihood function's basic properties, for example where its maximum occurs and its shape, are not affected if we multiply $L(\theta)$ by a constant. Indeed it is not the absolute value of the likelihood function that is important but the relative values at two different values of the parameter, e.g. $L(\theta_1)/L(\theta_2)$. You might think of this ratio as how much more or less consistent the data are with the parameter value $\theta_1$ versus $\theta_2$. The ratio $L(\theta_1)/L(\theta_2)$ is also unaffected if we multiply $L(\theta)$ by a constant. In view of this we might define the likelihood as $P(Y = y; \theta)$ or any constant multiple of it, so, for example, we could drop the term $\binom{n}{y}$ in (2.3) and define $L(\theta) = \theta^y (1 - \theta)^{n-y}$. This function and (2.3) are maximized by the same value $\hat{\theta} = y/n$ and have the same shape. Indeed we might rescale the likelihood function by dividing through by its maximum value $L(\hat{\theta})$ so that the new function has a maximum value equal to one.
Definition 10 The relative likelihood function is defined as
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})} \quad \text{for } \theta \in \Omega.$$
Note that $0 \le R(\theta) \le 1$ for all $\theta \in \Omega$.

Definition 11 The log likelihood function is defined as
$$l(\theta) = \log L(\theta) \quad \text{for } \theta \in \Omega.$$
Note that $\hat{\theta}$ also maximizes $l(\theta)$. In fact in Figure 2.3 we see that $l(\theta)$, the lower of the two curves, is a monotone function of $L(\theta)$, so they increase together and decrease together. This implies that both functions have a maximum at the same value $\theta = \hat{\theta}$.
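This is easy to confirm numerically. The sketch below uses small made-up numbers ($n = 20$, $y = 7$) rather than the poll's, because $\theta^{520}(1-\theta)^{1480}$ underflows ordinary floating point — which is itself a practical reason to prefer working with $l(\theta)$:

```python
# Check that L(theta) and l(theta) = log L(theta) have the same maximizer.
import math

n, y = 20, 7  # hypothetical small sample

def L(theta):
    return theta ** y * (1 - theta) ** (n - y)   # constant binom(n, y) dropped

def l(theta):
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
assert max(grid, key=L) == max(grid, key=l)      # both peak at theta = y/n = 0.35
```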
Figure 2.3: The functions $L(\theta)$ (upper graph) and $l(\theta)$ (lower graph) are both maximized at the same value $\theta = \hat{\theta}$
Because functions are often (but not always!) maximized by setting their derivatives equal to zero11, we can usually obtain $\hat{\theta}$ by solving the equation
$$\frac{dl}{d\theta} = 0.$$
For example, from $L(\theta) = \theta^y (1 - \theta)^{n-y}$ we get $l(\theta) = y \log \theta + (n - y) \log(1 - \theta)$ and
$$\frac{dl}{d\theta} = \frac{y}{\theta} - \frac{n - y}{1 - \theta},$$
which equals zero when $\theta = y/n$, confirming $\hat{\theta} = y/n$.
More generally, if $y_1, \ldots, y_n$ is an observed random sample from a distribution with probability function $f(y; \theta)$, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) \quad \text{for } \theta \in \Omega.$$
(You should recall from probability that if $Y_1, \ldots, Y_n$ are independent random variables then their joint probability function is the product of their individual probability functions.)
11. Can you think of an example of a continuous function $f(x)$ defined on the interval $[0, 1]$ for which the maximum $\max_{0 \le x \le 1} f(x)$ is NOT found by setting $f'(x) = 0$?
Similarly, if two independent studies give observed data $y_1$ and $y_2$ on the same parameter $\theta$, the combined likelihood is
$$L(\theta) = P(Y_1 = y_1; \theta) P(Y_2 = y_2; \theta) = L_1(\theta) L_2(\theta) \quad \text{for } \theta \in \Omega$$
where $L_j(\theta) = P(Y_j = y_j; \theta)$, $j = 1, 2$.
Example 2.2.2 Likelihood function for Poisson distribution
Suppose $y_1, \ldots, y_n$ is an observed random sample from a Poisson($\theta$) distribution. The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!} = \theta^{n\bar{y}} e^{-n\theta} \left( \prod_{i=1}^{n} y_i! \right)^{-1} \quad \text{for } \theta > 0$$
or more simply
$$L(\theta) = \theta^{n\bar{y}} e^{-n\theta} \quad \text{for } \theta > 0.$$
The log likelihood is $l(\theta) = n\bar{y} \log \theta - n\theta$, with derivative
$$\frac{d}{d\theta} l(\theta) = n \left( \frac{\bar{y}}{\theta} - 1 \right) = \frac{n}{\theta} (\bar{y} - \theta) \quad \text{for } \theta > 0.$$
Setting the derivative equal to zero gives the maximum likelihood estimate $\hat{\theta} = \bar{y}$.
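A quick numeric confirmation of the Poisson result, using invented counts (the derivative is the $(n/\theta)(\bar{y} - \theta)$ form from the example above):

```python
# The Poisson log likelihood derivative (n/theta)(ybar - theta) changes sign
# at theta = ybar, so theta-hat = ybar.
counts = [3, 5, 2, 4, 6, 3, 4, 5]  # hypothetical Poisson observations
n = len(counts)
ybar = sum(counts) / n

def dl(theta):
    return (n / theta) * (ybar - theta)

theta_hat = ybar
# dl is positive just below ybar, zero at ybar, and negative just above it
```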
Example 2.2.3
Suppose that the random variable $Y$ represents the number of persons infected with the human immunodeficiency virus (HIV) in a randomly selected group of $n$ persons. We assume the data are reasonably modelled by $Y \sim$ Binomial($n, \theta$) with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where $\theta$ represents the fraction of the population that are infected. In this case, if we select a random sample of $n$ persons and test them for HIV, we observe $y$, the number infected. Thus
$$L(\theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad \text{for } 0 < \theta < 1$$
or more simply
$$L(\theta) = \theta^y (1 - \theta)^{n-y} \quad \text{for } 0 < \theta < 1. \qquad (2.5)$$
Alternatively, we could record for each person $i$ a binary variate $y_i$, where $y_i = 1$ if person $i$ is infected and $y_i = 0$ otherwise, so that each $Y_i \sim$ Bernoulli($\theta$) with probability function $f(y_i; \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$ for $y_i = 0, 1$ and $0 < \theta < 1$. The likelihood function based on the observed sample $y_1, \ldots, y_n$ is then
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i} = \theta^{y} (1 - \theta)^{n - y} \quad \text{for } 0 < \theta < 1$$
where $y = \sum_{i=1}^{n} y_i$. This is the same likelihood function as (2.5). The reason for this is that the total number infected, $Y = \sum_{i=1}^{n} Y_i$, has a Binomial($n, \theta$) distribution.
Example 2.2.4
Suppose water samples are tested for the presence of bacteria. If the number of bacteria $Y$ in a water sample of volume $v_i$ millilitres has a Poisson distribution, then
$$P(Y = y; \theta) = \frac{(\theta v_i)^y e^{-\theta v_i}}{y!} \quad \text{for } y = 0, 1, \ldots \qquad (2.6)$$
where $\theta$ is the average number of bacteria per millilitre (ml) of water. There is an inexpensive test which can detect the presence (but not the number) of bacteria in a water sample. In this case we do not observe $Y$, but rather the presence indicator $I(Y > 0)$, or
$$Z = \begin{cases} 1 & \text{if } Y > 0 \\ 0 & \text{if } Y = 0. \end{cases}$$
For a sample of volume $v_i$,
$$P(Z = 1; \theta) = 1 - e^{-\theta v_i} = 1 - P(Z = 0; \theta).$$
If $n$ samples of volumes $v_1, \ldots, v_n$ give observed indicators $z_1, \ldots, z_n$, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} P(Z_i = z_i; \theta) = \prod_{i=1}^{n} (1 - e^{-\theta v_i})^{z_i} (e^{-\theta v_i})^{1 - z_i} \quad \text{for } \theta > 0$$
with log likelihood
$$l(\theta) = \sum_{i=1}^{n} \left[ z_i \log(1 - e^{-\theta v_i}) - (1 - z_i) \theta v_i \right] \quad \text{for } \theta > 0.$$
Suppose the observed data are as follows:

Volume $v_i$ (ml)    Number of samples    Number with $z_i = 1$
8                    10                   10
4                    10                   8
2                    10                   7
1                    10                   3

This gives
$$l(\theta) = 10 \log(1 - e^{-8\theta}) + 8 \log(1 - e^{-4\theta}) + 7 \log(1 - e^{-2\theta}) + 3 \log(1 - e^{-\theta}) - 21\theta \quad \text{for } \theta > 0.$$
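Here there is no closed-form maximizer, so $\hat{\theta}$ must be found numerically. The notes would do this in R; below is a golden-section search sketch in Python using volume/count pairs read off the table above (the specific pairings and search bounds are assumptions of this sketch):

```python
# Numerically maximize l(theta) = sum_i [z_i log(1 - exp(-theta v_i)) - (1 - z_i) theta v_i]
# for grouped data: (volume v in ml, number positive out of 10 samples).
import math

data = [(8, 10), (4, 8), (2, 7), (1, 3)]

def l(theta):
    total = 0.0
    for v, pos in data:
        neg = 10 - pos
        total += pos * math.log(1 - math.exp(-theta * v)) - neg * theta * v
    return total

# golden-section search for the maximizer of the unimodal l on (0.01, 5)
lo, hi = 0.01, 5.0
ratio = (math.sqrt(5) - 1) / 2
while hi - lo > 1e-8:
    a = hi - ratio * (hi - lo)
    b = lo + ratio * (hi - lo)
    if l(a) < l(b):
        lo = a
    else:
        hi = b
theta_hat = (lo + hi) / 2
```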
[Figure: plot of the log likelihood $l(\theta)$ for $\theta$ between 0.2 and 0.9.]
2.3 Likelihood Functions for Continuous Distributions
Recall that we defined likelihoods for discrete random variables as the probability of observing the data $y$, or
$$L(\theta) = L(\theta; y) = P(Y = y; \theta) \quad \text{for } \theta \in \Omega.$$
For a continuous random variable $Y$, $P(Y = y; \theta) = 0$ for every $y$, so we use the probability density function $f(y; \theta)$ in place of the probability function. The likelihood function based on an observed random sample $y_1, \ldots, y_n$ is defined as
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) \quad \text{for } \theta \in \Omega. \qquad (2.7)$$
Example 2.3.1 Likelihood function for the Exponential distribution
Suppose $y_1, \ldots, y_n$ is an observed random sample from an Exponential($\theta$) distribution with probability density function $f(y; \theta) = \frac{1}{\theta} e^{-y/\theta}$ for $y > 0$. The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-y_i/\theta} = \theta^{-n} \exp\left( -\frac{1}{\theta} \sum_{i=1}^{n} y_i \right) \quad \text{for } \theta > 0.$$
The log likelihood is
$$l(\theta) = -n \log \theta - \frac{1}{\theta} \sum_{i=1}^{n} y_i = -n \left( \log \theta + \frac{\bar{y}}{\theta} \right) \quad \text{for } \theta > 0$$
with derivative
$$\frac{d}{d\theta} l(\theta) = -\frac{n}{\theta} + \frac{n\bar{y}}{\theta^2} = \frac{n}{\theta^2} (\bar{y} - \theta).$$
Setting the derivative equal to zero gives $\hat{\theta} = \bar{y}$.
Example 2.3.2 Likelihood function for the Gaussian distribution
Suppose $y_1, \ldots, y_n$ is an observed random sample from a $G(\mu, \sigma)$ distribution with probability density function
$$f(y; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2\sigma^2} (y - \mu)^2 \right] \quad \text{for } y \in \mathbb{R}.$$
The likelihood function is
$$L(\mu, \sigma) = \prod_{i=1}^{n} f(y_i; \mu, \sigma) = (2\pi)^{-n/2} \sigma^{-n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0$$
or more simply
$$L(\mu, \sigma) = \sigma^{-n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
The log likelihood for $\theta = (\mu, \sigma)$ is (after dropping an additive constant)
$$l(\mu, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
The partial derivatives are
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu) = \frac{n}{\sigma^2} (\bar{y} - \mu)$$
and
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (y_i - \mu)^2.$$
Setting $\partial l / \partial \mu = 0$ and $\partial l / \partial \sigma = 0$ and solving, the maximum likelihood estimate is $\hat{\theta} = (\hat{\mu}, \hat{\sigma})$, where
$$\hat{\mu} = \bar{y} \quad \text{and} \quad \hat{\sigma} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}.$$
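The Gaussian maximum likelihood estimates $\hat{\mu} = \bar{y}$ and $\hat{\sigma} = [(1/n)\sum(y_i - \bar{y})^2]^{1/2}$ are one line each in code; note the divisor $n$ rather than $n - 1$ in $\hat{\sigma}$ (the data below are invented):

```python
# Maximum likelihood estimates for the Gaussian model: mu-hat = ybar and
# sigma-hat = sqrt((1/n) * sum (y_i - ybar)^2)  -- divisor n, not n - 1.
import math

ys = [27.0, 25.5, 31.2, 28.4, 26.9, 29.3]  # hypothetical observations
n = len(ys)
mu_hat = sum(ys) / n
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in ys) / n)
```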
2.4 Likelihood Functions for Multinomial Models
Multinomial models arise when each unit in a sample is classified into one of $k$ types. If $(Y_1, \ldots, Y_k)$ has a Multinomial distribution, the joint probability function is
$$f(y_1, \ldots, y_k; \theta) = \frac{n!}{y_1! \cdots y_k!} \prod_{i=1}^{k} \theta_i^{y_i} \quad \text{for } y_i = 0, 1, \ldots \text{ where } \sum_{i=1}^{k} y_i = n.$$
Here $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, where $\theta_i$ is the probability of type $i$ and $\sum_{i=1}^{k} \theta_i = 1$. The likelihood function is
$$L(\theta) = L(\theta_1, \theta_2, \ldots, \theta_k) = \frac{n!}{y_1! \cdots y_k!} \prod_{i=1}^{k} \theta_i^{y_i}$$
or more simply
$$L(\theta) = \prod_{i=1}^{k} \theta_i^{y_i}$$
with log likelihood
$$l(\theta) = \sum_{i=1}^{k} y_i \log \theta_i.$$
This must be maximized subject to the constraint $\sum_{i=1}^{k} \theta_i = 1$. The Lagrange multiplier method (Calculus III) for constrained optimization allows us to find the solution $\hat{\theta}_i = y_i / n$, $i = 1, \ldots, k$.
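A numerical sanity check of this result (the category counts are invented): the vector of sample proportions gives a larger log likelihood than another probability vector we try, as the Lagrange-multiplier solution predicts.

```python
# For Multinomial counts, l(theta) = sum y_i log(theta_i) is maximized over the
# probability simplex by the sample proportions theta_i = y_i / n.
import math

ys = [18, 7, 3, 12]                 # hypothetical category counts
n = sum(ys)
theta_hat = [y / n for y in ys]     # sample proportions

def log_lik(theta):
    return sum(y * math.log(t) for y, t in zip(ys, theta))

rival = [0.5, 0.2, 0.1, 0.2]        # another point in the simplex
```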
which gives maximum likelihood estimates $\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4$ satisfying $\sum_{i=1}^{4} \hat{\theta}_i = 1$. (Note: studies involving much larger numbers of people put the values of the $\theta_i$'s for Caucasians at close to $\theta_1 = 0.448$, $\theta_2 = 0.083$, $\theta_3 = 0.034$, $\theta_4 = 0.436$.)
In some problems the Multinomial parameters $\theta_1, \ldots, \theta_k$ may be functions of fewer than $k - 1$ parameters. The following is an example.
Example 2.4.2 MM, MN, NN blood types
Another way of classifying a person's blood is through their M-N type. Each person is one of three types, labelled MM, MN and NN, and we can let $\theta_1, \theta_2, \theta_3$ be the fractions of the population that are each of the three types. In a sample of size $n$ we let $Y_1$ = number of MM types observed, $Y_2$ = number of MN types observed and $Y_3$ = number of NN types observed. The joint probability function of $Y_1, Y_2, Y_3$ is
$$P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{n!}{y_1! \, y_2! \, y_3!} \theta_1^{y_1} \theta_2^{y_2} \theta_3^{y_3}.$$
According to a model in genetics, the $\theta_i$'s can be expressed in terms of a single parameter $\theta$ for human populations:
$$\theta_1 = \theta^2, \qquad \theta_2 = 2\theta(1 - \theta), \qquad \theta_3 = (1 - \theta)^2$$
where $0 < \theta < 1$. Under this model,
$$P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{n!}{y_1! \, y_2! \, y_3!} [\theta^2]^{y_1} [2\theta(1 - \theta)]^{y_2} [(1 - \theta)^2]^{y_3}$$
and the likelihood function for $\theta$ is
$$L(\theta) = [\theta^2]^{y_1} [2\theta(1 - \theta)]^{y_2} [(1 - \theta)^2]^{y_3} = 2^{y_2} \theta^{2y_1 + y_2} (1 - \theta)^{y_2 + 2y_3}.$$
or more simply
$$L(\theta) = \theta^{2y_1 + y_2} (1 - \theta)^{y_2 + 2y_3} \quad \text{for } 0 < \theta < 1.$$
The log likelihood is
$$l(\theta) = (2y_1 + y_2) \log \theta + (y_2 + 2y_3) \log(1 - \theta)$$
and
$$\frac{dl}{d\theta} = \frac{2y_1 + y_2}{\theta} - \frac{y_2 + 2y_3}{1 - \theta} \quad \text{for } 0 < \theta < 1.$$
Now $dl/d\theta = 0$ if
$$\theta = \frac{2y_1 + y_2}{2y_1 + 2y_2 + 2y_3} = \frac{2y_1 + y_2}{2n}$$
so
$$\hat{\theta} = \frac{2y_1 + y_2}{2n}.$$
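The estimate in code, with made-up counts for illustration; the fitted $\hat{\theta}$ can then be turned back into estimated type probabilities via the genetic model:

```python
# Maximum likelihood estimate theta-hat = (2 y1 + y2) / (2n) for the genetic model,
# using hypothetical MM, MN, NN counts.
y1, y2, y3 = 40, 45, 15
n = y1 + y2 + y3
theta_hat = (2 * y1 + y2) / (2 * n)

# estimated type probabilities under the model
p_mm = theta_hat ** 2
p_mn = 2 * theta_hat * (1 - theta_hat)
p_nn = (1 - theta_hat) ** 2
```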
2.5 Invariance Property of Maximum Likelihood Estimates
Example 2.5.1
Suppose we want to estimate attributes associated with BMI for some population of individuals (for example, Canadian males age 21-35). If the distribution of BMI values in the population is well described by a Gaussian model, $Y \sim G(\mu, \sigma)$, then by estimating $\mu$ and $\sigma$ we can estimate any attribute associated with the BMI distribution. For example:
(i) The mean BMI in the population corresponds to $\mu$, the mean of the Gaussian distribution.
(ii) The median BMI in the population corresponds to the median of the Gaussian distribution, which equals $\mu$ since the Gaussian distribution is symmetric about its mean.
(iii) For the BMI population, the 0.1 (population) quantile is $Q(0.1) = \mu - 1.28\sigma$. (To see this, note that $P(Y \le \mu - 1.28\sigma) = P(Z \le -1.28) = 0.1$, where $Z = (Y - \mu)/\sigma$ has a $G(0, 1)$ distribution.)
(iv) The fraction of the population with BMI over 35.0 is given by
$$p = 1 - \Phi\left( \frac{35.0 - \mu}{\sigma} \right)$$
where $\Phi$ is the standard Normal cumulative distribution function.
Suppose a random sample of 150 males gave observations $y_1, \ldots, y_{150}$ and that the maximum likelihood estimates based on the results derived in Example 2.3.2 were
$$\hat{\mu} = \bar{y} = 27.1 \quad \text{and} \quad \hat{\sigma} = \left[ \frac{1}{150} \sum_{i=1}^{150} (y_i - \bar{y})^2 \right]^{1/2} = 3.56.$$
Then the maximum likelihood estimate of $p$ is
$$\hat{p} = 1 - \Phi\left( \frac{35.0 - \hat{\mu}}{\hat{\sigma}} \right) = 1 - 0.98679 = 0.01321.$$
Note that (iii) and (iv) follow from the invariance property of maximum likelihood
estimates.
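The calculations in (iii) and (iv) can be reproduced with Python's standard library (`statistics.NormalDist` provides the standard Normal c.d.f.); the numbers are the estimates from the text:

```python
# Estimated 0.1 quantile and estimated P(Y > 35.0) for the fitted G(27.1, 3.56) model.
from statistics import NormalDist

mu_hat, sigma_hat = 27.1, 3.56
Z = NormalDist()  # standard Normal

q10_hat = mu_hat - 1.28 * sigma_hat                # part (iii)
p_hat = 1 - Z.cdf((35.0 - mu_hat) / sigma_hat)     # part (iv)
# p_hat is about 0.0132, as computed in the text
```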
2.6 Checking the Model
The models used in this course are probability distributions for random variables that represent variates in a population or process. A typical model has probability density function $f(y; \theta)$ if the variate $Y$ is continuous, or probability function $f(y; \theta)$ if $Y$ is discrete, where $\theta$ is (possibly) a vector of parameter values. If a family of models is to be used for some purpose then it is important to check that the model adequately represents the variability in $Y$. This can be done by comparing the model with random samples $y_1, \ldots, y_n$ of $y$-values from the population or process.
For data that have arisen from a discrete probability model, a straightforward way to check the fit of the model is to compare observed frequencies with the expected frequencies calculated using the assumed model, as illustrated in the example below.
Example 2.6.1 Rutherford and Geiger study of alpha particles and the Poisson model
In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment in which they recorded the number of alpha particles emitted from a polonium source (as detected by a Geiger counter) during 2608 time intervals each of length 1/8 minute. The number of particles $j$ detected in a time interval and the frequency $f_j$ of that number of particles is given in Table 2.1.
Table 2.1

j       Observed frequency f_j    Expected frequency e_j
0       57        54.3
1       203       210.3
2       383       407.1
3       525       525.3
4       532       508.4
5       408       393.7
6       273       254.0
7       139       140.5
8       45        68.0
9       27        29.2
10      10        11.3
11      4         4.0
12      0         1.3
13      1         0.4
14      1         0.1
Total   2608      2607.9
We can see whether a Poisson model fits these data by comparing the observed frequencies with the expected frequencies calculated assuming a Poisson model. To calculate these expected frequencies we need to specify the mean $\theta$ of the Poisson model. We estimate $\theta$ using the sample mean for the data, which is
$$\hat{\theta} = \frac{1}{2608} \sum_{j=0}^{14} j f_j = \frac{1}{2608} (10097) = 3.8715.$$
The expected frequencies are then
$$e_j = 2608 \times \frac{(3.8715)^j e^{-3.8715}}{j!}, \quad j = 0, 1, \ldots$$
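These expected frequencies are easy to reproduce; a Python check against Table 2.1:

```python
# Expected frequencies e_j = 2608 * theta-hat^j * exp(-theta-hat) / j!
# for the Rutherford/Geiger data.
import math

theta_hat = 3.8715
N = 2608
e = [N * theta_hat ** j * math.exp(-theta_hat) / math.factorial(j) for j in range(15)]
# e[0] is about 54.3 and e[3] is about 525, matching Table 2.1
```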
Table 2.2

Interval $[a_{j-1}, a_j)$    Observed frequency f_j    Expected frequency e_j
j = 1                        21        52.72
j = 2                        45        38.82
j = 3                        50        28.59
j = 4                        27        21.05
j = 5                        21        15.50
j = 6                        9         11.42
j = 7                        12        8.41
j = 8                        7         6.19
j = 9                        8         17.3
Total                        200       200

The expected frequency $e_j$ for the interval $[a_{j-1}, a_j)$ is calculated using
$$e_j = 200 \int_{a_{j-1}}^{a_j} \frac{1}{49.0275} e^{-y/49.0275} \, dy = 200 \left( e^{-a_{j-1}/49.0275} - e^{-a_j/49.0275} \right).$$
The expected frequencies are also given in Table 2.2. We notice that the observed and expected frequencies are not close in this case, and therefore the Exponential model does not seem to be a good model for these data.
The difficulty of using this method for continuous data is that the intervals must be selected, and this adds a degree of arbitrariness to the method.
We may also use graphical techniques for checking the fit of a model. These methods are particularly useful for continuous data.
The first graphical method is to superimpose the probability density function on the relative frequency histogram of the data, as we did in Figures 1.15 and 1.16 for the data from the can filler study.
16. See the video at www.watstat.ca called "The empirical c.d.f. and the qqplot" on the material in this section.
17. We usually denote the ordered values $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$, where $y_{(1)}$ is the smallest and $y_{(n)}$ is the largest. In this case $y_{(n)} = 0.88$.
More generally, for a sample of size $n$ we first order the $y_i$'s, $i = 1, \ldots, n$ to obtain the ordered values $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$. $\hat{F}(y)$ is a step function with a jump at each of the ordered observed values $y_{(i)}$. If $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ are all different values, then $\hat{F}(y_{(j)}) = j/n$ and the jumps are all of size $1/n$. In general the size of the jump at a particular point $y$ is the number of values in the sample that are equal to $y$, divided by $n$.
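The empirical c.d.f. is simple to implement directly; a minimal Python sketch (the sample values are made up):

```python
# F-hat(y) = (number of sample values <= y) / n, a step function that jumps
# at each distinct observed value.
def ecdf(sample):
    ys = sorted(sample)
    n = len(ys)
    def F_hat(y):
        return sum(1 for v in ys if v <= y) / n
    return F_hat

F = ecdf([0.2, 0.5, 0.5, 0.9])
# the repeated value 0.5 produces a jump of size 2/4 there
```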
Figure 2.5: The empirical cumulative distribution function for n = 10 data values and a superimposed Uniform(0, 1) cumulative distribution function.
By superimposing on this graph the theoretical Uniform(0; 1) cumulative distribution
function, which in this case is a straight line, we can see how well the theoretical distribution
and empirical distribution agree. Since the sample is quite small we cannot expect a perfect
straight line, but for larger samples we would expect much better agreement with the
straight line.
Because the Uniform(0, 1) cumulative distribution function is a straight line, it is easy to assess graphically how close the two curves fit, but what if the hypothesized distribution is Normal, whose cumulative distribution function is distinctly non-linear?
As an example we consider data (see Appendix C) for the times between 300 eruptions, between the first and the fifteenth of August 1985, of the geyser Old Faithful in Yellowstone National Park. One might hypothesize that the distribution of times between consecutive eruptions follows a Normal distribution. We plot the empirical cumulative distribution function in Figure 2.6 together with the cumulative distribution function of a Gaussian distribution.
Figure 2.6: Empirical c.d.f. of times between eruptions of Old Faithful and superimposed G(72.3, 13.9) c.d.f.
Of course we don't know the parameters of the appropriate Gaussian distribution, so we use the sample mean 72.3 and sample standard deviation 13.9 to approximate these parameters. Are the differences between the two curves in Figure 2.6 sufficient that we would have to conclude a distribution other than the Gaussian? There are two ways of trying to get another view of the magnitude of these differences. The first way is to plot the relative frequency histogram of the data and then superimpose the Gaussian curve. The second way is to use a qqplot, which will be discussed in the next section.
Figure 2.7: Relative frequency histogram for times between eruptions of Old Faithful and superimposed G(72.3, 13.9) p.d.f.
Figure 2.7 seems to indicate that the distribution of the times between eruptions is not very Normal, because it appears to have two modes. The plot of the empirical cumulative distribution function did not show the shape of the distribution as clearly as the histogram. The empirical cumulative distribution function does allow us to determine the $p$th quantile or $100p$th percentile (the left-most value $y_p$ on the horizontal axis where $\hat{F}(y_p) = p$). For example, from the empirical cumulative distribution function of the Old Faithful data, we see that the median time ($\hat{F}(\hat{m}) = 0.5$) between eruptions is around $\hat{m} = 78$.
Example 2.6.4 Heights of females
For the data on female heights in Chapter 1 and using the results from Example 2.3.2 we obtain $\hat{\mu} = 1.62$, $\hat{\sigma} = 0.064$ as the maximum likelihood estimates of $\mu$ and $\sigma$. Figure 2.8 shows a plot of the empirical cumulative distribution function with the G(1.62, 0.0637) cumulative distribution function superimposed. Figure 2.9 shows a relative frequency histogram for these data with the G(1.62, 0.0637) probability density function superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram plot the data are grouped into intervals to form the histogram. However, the histogram and probability density function show the distribution of heights more clearly. Both graphs indicate that a Normal model seems reasonable for these data.
Figure 2.8: Empirical c.d.f. of female heights and G(1.62, 0.064) c.d.f.
Figure 2.9: Relative frequency histogram of female heights and G(1.62, 0.064) p.d.f.
Qqplots
An alternative view, which is really just another method of graphing the empirical cumulative distribution function, tailored to the Normal distribution, is a graph called a qqplot. Suppose the data $Y_i$, $i = 1, \ldots, n$ were in fact drawn from the $G(\mu, \sigma)$ distribution, so that the standardized variables, after we order them from smallest $Y_{(1)}$ to largest $Y_{(n)}$, are
$$Z_{(i)} = \frac{Y_{(i)} - \mu}{\sigma}.$$
These behave like the ordered values from a sample of the same size taken from the $G(0, 1)$ distribution. Approximately what value do we expect $Z_{(i)}$ to take? If $\Phi$ denotes the standard Normal cumulative distribution function, then for $0 < u < 1$
$$P(\Phi(Z) \le u) = P(Z \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u$$
so that $\Phi(Z)$ has a Uniform(0, 1) distribution. It is easy to check that the expected value of the $i$th smallest value in a random sample of size $n$ from a Uniform(0, 1) distribution is equal to $\frac{i}{n+1}$,18 so we expect $\Phi(Z_{(i)})$ to be close to $\frac{i}{n+1}$. In other words we expect $Z_{(i)} = (Y_{(i)} - \mu)/\sigma$ to be approximately $\Phi^{-1}\left(\frac{i}{n+1}\right)$, or $Y_{(i)}$ to be roughly a linear function of $\Phi^{-1}\left(\frac{i}{n+1}\right)$. A plot of the ordered values $Y_{(i)}$ against the theoretical quantiles $\Phi^{-1}\left(\frac{i}{n+1}\right)$, $i = 1, \ldots, n$ should therefore be approximately linear.
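The qqplot coordinates $\Phi^{-1}(i/(n+1))$ can be computed with `statistics.NormalDist.inv_cdf` from the Python standard library (the sample values below are invented):

```python
# Pair the ordered sample values with the theoretical Normal quantiles
# Phi^{-1}(i / (n + 1)) to obtain the points of a qqplot.
from statistics import NormalDist

def qq_points(sample):
    ys = sorted(sample)
    n = len(ys)
    Z = NormalDist()  # standard Normal
    return [(Z.inv_cdf(i / (n + 1)), y) for i, y in enumerate(ys, start=1)]

pts = qq_points([4.1, 5.6, 3.0, 5.1, 4.8])
# for Normal data these points fall roughly on the line y = mu + sigma * z
```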
Since reading qqplots is an art acquired from experience, it is a good idea to generate similar plots where we know the answer. This can be done by generating data from a known distribution and then plotting a qqplot. See the R code below and Chapter 2, Problem 14. A qqplot of 100 observations randomly generated from a $G(-2, 3)$ distribution is given in Figure 2.10. The theoretical quantiles are plotted on the horizontal axis and the empirical quantiles are plotted on the vertical axis. Since the quantiles of the Normal distribution change more rapidly in the tails of the distribution, we expect the points at both ends of the plot to lie further from the line.
18. This is intuitively obvious since the $n$ values $Y_{(i)}$ break the interval into $n + 1$ spacings, and it makes sense that each should have the same expected length. For empirical evidence see http://www.math.uah.edu/stat/applets/OrderStatisticExperiment.html. More formally, we must first show that the p.d.f. of the $i$th order statistic $U_{(i)}$ of a Uniform(0, 1) sample is $\frac{n!}{(i-1)!(n-i)!} u^{i-1} (1 - u)^{n-i}$ for $0 < u < 1$, and then find $E(U_{(i)}) = \int_0^1 \frac{n!}{(i-1)!(n-i)!} u^i (1 - u)^{n-i} \, du = \frac{i}{n+1}$.
Figure 2.10: qqplot of 100 observations randomly generated from a G(−2, 3) distribution (sample quantiles against N(0, 1) quantiles).
[Figure: qqplot with sample quantiles between about 1.4 and 1.9 plotted against the N(0, 1) quantiles.]
A qqplot of the times between eruptions of Old Faithful is given in Figure 2.12. The
points form an S-shaped curve which indicates as we saw before that the Normal is not a
reasonable model for these data.
Figure 2.12: qqplot of the times between eruptions of Old Faithful (sample quantiles against N(0, 1) quantiles).
R Code for Checking Models Using Histograms, Empirical c.d.f.s and Qqplots

# Normal Data Example
library(e1071)          # skewness() and kurtosis() are from the e1071 package
y<-rnorm(100,5,2)       # generate 100 observations from a G(5,2) distribution
mn<-mean(y)             # find the sample mean
s<-sd(y)                # find the sample standard deviation
summary(y)              # five number summary
skewness(y,type=1)      # find the sample skewness as given in the Course Notes
kurtosis(y,type=1)+3    # find the sample kurtosis as given in the Course Notes
hist(y,freq=F)          # graph the relative frequency histogram
w<-mn+s*seq(-3,3,0.01)  # calculate points at which to graph the Normal pdf
d<-dnorm(w,mn,s)        # calculate values of Normal pdf at these points
points(w,d,type="l")    # superimpose the Normal pdf on the histogram
A<-ecdf(y)              # calculate the empirical cdf for the data
e<-pnorm(w,mn,s)        # calculate the values of the Normal cdf
plot(A,verticals=T,do.points=F,xlab="y",ylab="ecdf") # plot the ecdf
points(w,e,type="l")    # superimpose the Normal cdf
qqnorm(y)               # graph a qqplot of the data
#
# Exponential Data Example
y<-rexp(100,5)          # generate 100 observations from Exponential(5) distn
mn<-mean(y)             # find the sample mean
s<-sd(y)                # find the sample standard deviation
summary(y)              # five number summary
skewness(y,type=1)      # find the sample skewness as given in the Course Notes
kurtosis(y,type=1)+3    # find the sample kurtosis as given in the Course Notes
hist(y,freq=F)          # graph the relative frequency histogram
w<-mn+s*seq(-3,3,0.01)  # calculate points at which to graph the Normal pdf
d<-dnorm(w,mn,s)        # calculate values of Normal pdf at these points
points(w,d,type="l")    # superimpose the Normal pdf on the histogram
A<-ecdf(y)              # calculate the empirical cdf for the data
e<-pnorm(w,mn,s)        # calculate the values of the Normal cdf
plot(A,verticals=T,do.points=F,xlab="y",ylab="ecdf") # plot the ecdf
points(w,e,type="l")    # superimpose the Normal cdf
qqnorm(y)               # graph a qqplot of the data
2.7 Chapter 2 Problems
1. In modelling the number of transactions of a certain type received by a central computer for a company with many on-line terminals, the Poisson distribution can be used. If the transactions arrive at random at the rate of $\theta$ per minute, then the probability of $y$ transactions in a time interval of length $t$ minutes is
$$P(Y = y; \theta) = f(y; \theta) = \frac{(\theta t)^y e^{-\theta t}}{y!} \quad \text{for } y = 0, 1, \ldots \text{ and } \theta > 0.$$
(a) The numbers of transactions received in 10 separate one-minute intervals were 8, 3, 2, 4, 5, 3, 6, 5, 4, 1. Write down the likelihood function for $\theta$ and find the maximum likelihood estimate $\hat{\theta}$.
(b) Estimate the probability that during a two-minute interval, no transactions arrive.
(c) Use the R function rpois() with the value $\theta = 4.1$ to simulate the number of transactions received in 100 one-minute intervals. Calculate the sample mean and variance; are they approximately the same? (Note that $E(Y) = Var(Y) = \theta$ for the Poisson model.)
2. Suppose $y_1, y_2, \ldots, y_n$ is an observed random sample from the distribution with probability density function
$$f(y; \theta) = (\theta + 1) y^{\theta} \quad \text{for } 0 < y < 1, \text{ where } \theta > -1.$$
5. Suppose that in a population of twins, males ($M$) and females ($F$) are equally likely to occur and that the probability that a pair of twins is identical is $\theta$. If twins are not identical, their sexes are independent.
(a) Show that
$$P(MM) = P(FF) = \frac{1 + \theta}{4} \quad \text{and} \quad P(MF) = \frac{1 - \theta}{2}.$$
(b) Suppose that $n$ pairs of twins are randomly selected; it is found that $n_1$ are $MM$, $n_2$ are $FF$, and $n_3$ are $MF$, but it is not known whether each pair is identical or fraternal. Use these data to find the maximum likelihood estimate $\hat{\theta}$ of $\theta$. What is the value of $\hat{\theta}$ if $n = 50$ and $n_1 = 16$, $n_2 = 16$, $n_3 = 18$?
6. Estimation from capture-recapture studies: In order to estimate the number of animals, $N$, in a wild habitat the capture-recapture method is often used. In this scheme $k$ animals are caught, tagged, and then released. Later on, $n$ animals are caught and the number $Y$ of these that have tags is noted. The idea is to use this information to estimate $N$.
(a) Show that under suitable assumptions
$$P(Y = y) = \frac{\binom{k}{y} \binom{N-k}{n-y}}{\binom{N}{n}}.$$
(b) Determine the maximum likelihood estimate of $N$ based on the observed data.
8. The following model has been proposed for the distribution of the number of offspring $Y$ in a family, for a large population of families:
$$P(Y = 0; \theta) = \frac{1 - 2\theta}{1 - \theta} \quad \text{and} \quad P(Y = k; \theta) = \theta^k \text{ for } k = 1, 2, \ldots, \text{ where } 0 < \theta < \frac{1}{2}.$$
(a) Suppose that $n$ families are selected at random and that $f_y$ is the number of families with $y$ children ($f_0 + f_1 + \cdots = n$). Determine the maximum likelihood estimate of $\theta$.
(b) Consider a different type of sampling wherein a single child is selected at random and the size of family the child comes from is determined. Let $Y$ represent the number of children in the family. Show that
$$P(Y = y; \theta) = c\, y\, \theta^y \quad \text{for } y = 1, 2, \ldots$$
and determine $c$.
(c) Suppose that the type of sampling in part (b) was used and that with $n = 33$ the following data were obtained:

y      1    2    3    4
f_y    22   7    3    1
9. Radioactive particles are emitted randomly over time from a source at an average rate of $\theta$ per second. In $n$ time periods of varying lengths $t_1, t_2, \ldots, t_n$ (seconds), the numbers of particles emitted (as determined by an automatic counter) were $y_1, y_2, \ldots, y_n$ respectively.
(a) Determine an estimate of $\theta$. What assumptions did you make in order to do this?
(b) Suppose that instead of knowing the $y_i$'s, we know only whether or not there was one or more particles emitted in each time interval. Making a suitable assumption, give the likelihood function for $\theta$ based on these data, and describe how you could find the maximum likelihood estimate of $\theta$.
10. In a study of osteoporosis, the heights in centimeters of a sample of 351 elderly women
randomly selected from a community were recorded as follows:
156 163 150 157 166 162 157 165 159 155
156 153 164 167 163 166 157 167 158 156
163 163 153 162 152 178 163 149 167 166
155 153 155 145 158 170 163 170 164 145
158 157 169 161 155 154 153 162 169 162
160 161 156 159 153 158 154 163 159 170
150 155 156 161 170 161 173 160 164 154
153 157 157 158 156 169 161 158 151 170
164 164 168 152 154 159 158 154 156 155
146 156 153 158 164 161 160 160 162 163
155 166 161 160 156 156 170 163 162 151
155 156 163 159 153 160 158 159 163 164
153 168 170 157 153 165 163 157 158 163
165 161 157 157 156 155 157 164 159 163
158 157 152 160 163 163 159 164 156 162
163 169 162 163 147 152 161 158 163 163
169 165 159 165 148 166 161 170 158 161
159 155 166 159 151 153 158 163 161 165
158 154 160 155 154 150 162 154 150 164
154 160 160 159 157 155 152 153 167 176
157 165 164 167 167 165 153 147 164 158
155 151 165 158 156 166 157 159 162 158
158 158 165 164 158 165 168 161 159 158
164 163 163 160 160 159 162 169 158 155
168 161 157 170 159 147 163 155 157 162
163 160 158 165 170 157 168 155 163 150
161 156 167 174 152 162 160 158 166 160
164 165 153 152 158 152 155 161 147 154
165 171 142 155 158 165 165 161
(a) Construct a frequency histogram and determine whether the data appear to be approximately Normally distributed.
(b) Determine the sample mean $\bar{y}$ and the sample standard deviation $s$ for these data. Compare the proportions of observations in the intervals $[\bar{y} - s, \bar{y} + s]$ and $[\bar{y} - 2s, \bar{y} + 2s]$ with the proportions one would expect if the data were Normally distributed with these parameters.
(c) Find the interquartile range for these data. What is the relationship between the IQR and $\sigma$ for Normally distributed data?
(d) Find the five-number summary for these data.
(e) Draw a boxplot for these data. Does it resemble a boxplot for Normal data?
(f) Plot a qqplot for these data and again assess whether the data are approximately Normally distributed. What departures do you see and why?
11. Consider the data on heights of adult males and females from Chapter 1. (The data are posted on the course webpage.)
(a) Assuming that for each sex the heights $Y$ in the population from which the samples were drawn are adequately represented by $Y \sim G(\mu, \sigma)$, obtain the maximum likelihood estimates $\hat{\mu}$ and $\hat{\sigma}$ in each case.
(b) Give the maximum likelihood estimates for $Q(0.1)$ and $Q(0.9)$, the 10th and 90th percentiles of the height distribution, for males and for females.
(c) Give the maximum likelihood estimate for the probability $P(Y > 1.83)$ for males and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of $P(Y > 1.83)$ that doesn't use the Gaussian model is
$$\frac{\text{number of persons in sample with } y > 1.83}{n}$$
where here $n = 150$. Obtain these estimates for males and for females. Can you think of any advantages for this estimate over the one in part (c)? Can you think of any disadvantages?
(e) Suggest and try a method of estimating the 10th and 90th percentiles of the height distribution that is similar to that in part (d).
12. The lifetimes of 92 right front disc brake pads for a specific car model are posted in the file brakelife.text on the course webpage. The lifetimes $y$ are in km driven, and correspond to the point at which the brake pads in new cars are reduced to a specified thickness.
(a) Assuming a $G(\mu, \sigma)$ model for the lifetimes, determine the maximum likelihood estimates of $\mu$ and $\sigma$ based on the data. How well does the Gaussian model fit the data?
(b) Another model for such data is given by
$$f(y; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}\, y} \exp\left[ -\frac{1}{2} \left( \frac{\log y - \mu}{\sigma} \right)^2 \right] \quad \text{for } y > 0.$$
(Note: show, using methods you learned in your course on probability, that if $X \sim G(\mu, \sigma)$ then $Y = e^X$ has the probability density function given above.)
Using this model, determine the maximum likelihood estimates of $\mu$ and $\sigma$ based on the data. How well does this model fit the data? Which of the two models describes the data better?
13. In a large population of males ages 40-50, the proportion who are regular smokers is $\alpha$ where $0 < \alpha < 1$, and the proportion who have hypertension (high blood pressure) is $\beta$ where $0 < \beta < 1$. If the events $S$ (a person is a smoker) and $H$ (a person has hypertension) are independent, then for a man picked at random from the population the probabilities that he falls into the four categories $SH$, $S\bar{H}$, $\bar{S}H$, $\bar{S}\bar{H}$ are, respectively, $\alpha\beta$, $\alpha(1 - \beta)$, $(1 - \alpha)\beta$, $(1 - \alpha)(1 - \beta)$. Explain why this is true.
(a) Suppose that 100 men are selected and the numbers in each of the four categories are as follows:

Category     $SH$   $S\bar{H}$   $\bar{S}H$   $\bar{S}\bar{H}$
Frequency    20     15           22           43

Assuming that $S$ and $H$ are independent events, determine the likelihood function for $\alpha$ and $\beta$ based on the Multinomial distribution, and find the maximum likelihood estimates of $\alpha$ and $\beta$.
(b) Compute the expected frequencies for each of the four categories using the maximum likelihood estimates. Do you think the model used is appropriate? Why might it be inappropriate?
14. Censored lifetime data: Consider the Exponential distribution as a model for the
lifetimes of equipment. In experiments, it is often not feasible to run the study long
enough that all the pieces of equipment fail. For example, suppose that n pieces of
equipment are each tested for a maximum of C hours (C is called a censoring time).
The observed data are: k (where 0 ≤ k ≤ n) pieces fail, at times y_1, ..., y_k, and n − k
pieces are still working after time C.
(a) If Y has an Exponential(θ) distribution, show that P(Y > C; θ) = e^(−C/θ), for
C > 0.
(b) Determine the likelihood function for θ based on the observed data described
above. Show that the maximum likelihood estimate of θ is

        θ̂ = (1/k) [ Σ_{i=1}^{k} y_i + (n − k) C ].
(c) What does part (b) give when k = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them
to large fluctuations in temperature inside specially designed ovens. For one
particular type of component, 50 units were tested and k = 5 failed before 400
hours, when the test was terminated, with Σ_{i=1}^{5} y_i = 450 hours. Find the maximum
likelihood estimate of θ.
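Plugging the numbers from part (d) into the estimate from part (b) is a one-line calculation; a sketch:

```python
# Sketch for Problem 14(d): evaluating the censored-data MLE from part (b),
# theta_hat = (sum of observed failure times + (n - k) * C) / k.
n, k, C = 50, 5, 400   # units tested, failures observed, censoring time (hours)
sum_y = 450            # total of the k observed failure times (hours)

theta_hat = (sum_y + (n - k) * C) / k
print(theta_hat)       # 3690.0 hours
```

Note how strongly the 45 censored units dominate the estimate: they contribute 18000 of the 18450 total hours on test.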
15. Poisson model with a covariate: Let Y represent the number of claims in a given
year for a single general insurance policy holder. Each policy holder has a numerical
risk score x assigned by the company, based on available information. The risk score
may be used as a covariate (explanatory variable) when modeling the distribution of
Y, and it has been found that models of the form

        P(Y = y | x) = [λ(x)]^y e^(−λ(x)) / y!   for y = 0, 1, ...
[Figure: Normal qqplot — Quantiles of Input Sample versus Standard Normal Quantiles.]
3. PLANNING AND CONDUCTING EMPIRICAL STUDIES

3.1 Empirical Studies
An empirical study is one which is carried out to learn about a population or process by
collecting data. We have given several examples in the preceding two chapters but we have
not yet considered the details of such studies in any systematic way. It is the object of this
chapter to do that. Well-conducted studies are needed to produce maximal information
within existing cost and time constraints. Conversely, a poorly planned or executed study
can be worthless or even misleading.
It is helpful to think of planning and conducting a study as a set of steps. We describe
below the set of steps to which we assign the acronym PPDAC:

Problem: a clear statement of the study's objectives, usually involving one or more
questions.

Plan: the procedures used to carry out the study, including how we will collect the
data.

Data: the physical collection of the data, as described in the Plan.

Analysis: the analysis of the data collected, in light of the Problem and the Plan.

Conclusion: the conclusions that are drawn about the Problem and their limitations.
PPDAC has been designed to emphasize the statistical aspects of empirical studies. We
develop each of the five steps in more detail below. Several examples of the use of PPDAC
in an empirical study will be given. We identify the steps in the following example.
Example 3.1
The following newspaper article appeared in the Globe and Mail on February 3, 2014. It
describes an empirical investigation in the field of medicine. There are thousands of studies
in this field every year, conducted at very high cost to society and with critical consequences.
These investigations must be well planned and executed so that the knowledge they produce
is useful, reliable and obtained at reasonable cost.
Excess sugar can triple risk of dying of heart disease
New research published in the journal JAMA Internal Medicine shows that people who get 25
per cent or more of their daily calories from added sugar almost triple their risk of dying of heart
disease. Even those who have more moderate levels of sugar consumption, from 10 to 25 per cent
of their daily diet, still increase their cardiovascular risk by 30 per cent. In the U.S., the Institute
of Medicine recommends that people not consume more than 25 per cent of their daily calories in
added sugars, while the World Health Organization sets the threshold at 10 per cent. However,
Canada does not have guidelines on safe levels of sugar consumption, as it does with salt and trans
fats. While excessive sugar consumption has long been considered a health risk, a paradigm shift
in thinking now holds that sugar not only contributes to conditions like obesity and diabetes, it
damages the body's organs directly. "Too much sugar does not just make us fat; it can also make us
sick," said Laura Schmidt, a researcher at the Philip R. Lee Institute for Health Policy Studies at the
University of California San Francisco. Overall, Americans consume 15.5 per cent of calories from
added sugars; in Canada, the figure is 10.7 per cent. But there is a wide range: About 10 per cent of
adults consume 25 per cent or more of their calories from added sugar; another 72 per cent consume
between 10 and 25 per cent in the form of sugar; the balance, 18 per cent, have a diet that consists
of less than 10 per cent of calories from sugar. The new research, led by Quanhe Yang, a researcher
in the office of public health genomics at the U.S. Centers for Disease Control and Prevention,
suggests those levels are too high. "The risk of CVD [cardiovascular disease] mortality increased
exponentially with increasing the usual percentage of calories from added sugar," Dr. Yang wrote.
Views range broadly on safe levels of sugar consumption, and the issue is highly controversial with
industry and regulators. The American Heart Association is far more concerned about the dangers
of excess sugar; it recommends that women get no more than five per cent of their daily calories
from that source, and men not exceed 7.5 per cent. The Heart and Stroke Foundation of Canada
does not have specific guidelines on sugar consumption. Rather, it encourages Canadians to follow
Canada's Food Guide for Healthy Eating, which includes a vague recommendation that Canadians
limit excess fat, sugar and salt. But the Canadian Sugar Institute said the scientific consensus is
that there is no evidence of harm attributed to current sugar consumption levels. Dietary advice
must be based on the totality of evidence, not single studies suggesting an association between
individual dietary factors and disease.
The research was conducted using data from the National Health and Nutrition Examination
Survey (NHANES), which was conducted in stages between 1988 and 2010; the long-running research
project collects detailed nutritional information and tracks mortality. Data from more than 43,000
people were included in this analysis. Researchers focused on consumption of added sugars, a
category that includes all sugar, corn syrups, honey, and maple syrup added to foods. It does not
include sugars that naturally occur in fruits, vegetables and dairy products. The main sources of
added sugars, according to the CDC study, are sugar sweetened beverages like soft drinks and sports
drinks, 37 per cent; desserts like cakes and puddings, 14 per cent; fruit drinks, nine per cent; dairy
desserts, six per cent; candy, six per cent. Dr. Yang and the research team found that a person who
drinks an average of one sugar sweetened beverage daily has a 29 per cent higher risk of dying of
heart disease than a person who drinks just one a week. A single can of soda pop like Coke contains
35 grams of sugar and 140 calories.
Note that in the Problem step, we describe what we are trying to learn or what
questions we want to answer. The Plan step describes how the data are to be measured
and collected. In the Data step, the Plan is executed. The Analysis step corresponds
to what many people think Statistics is all about. We carry out both simple and complex
calculations to process the data into information. Finally, in the Conclusion step, we answer
the questions formulated at the Problem step.
PPDAC can be used in two ways: first, to actively formulate, plan and carry out investigations, and second, as a framework to critically scrutinize reported empirical investigations.
These reports include articles in the popular press (as in the above example), scientific
papers, government policy statements and various business reports. If you see the phrase
"evidence-based decision" or "evidence-based management", look for an empirical study.
To discuss the steps of PPDAC in more detail we need to introduce a number of technical
terms. Every subject has its own jargon, i.e. words with special meaning, and you need to
learn the terms describing the details of PPDAC to be successful in this course.
3.2 The Steps of PPDAC

1. Problem

Types of Problems
Three common types of statistical problems that are encountered are described below.
Descriptive: The problem is to determine a particular attribute of a population.
Much of the function of official statistical agencies such as Statistics Canada involves
problems of this type. For example, the government needs to know the national
unemployment rate and whether it has increased or decreased over the past month.

Causative: The problem is to determine the existence or non-existence of a causal
relationship between two variates. For example:
Does taking a low dose of aspirin reduce the risk of heart disease among men over the
age of 50?
Does changing from assignments to multiple term tests improve student learning in
STAT 231?
Does second-hand smoke from parents cause asthma in their children?
Does compulsory driver training reduce the incidence of accidents among new drivers?

Predictive: The problem is to predict the response of a variate for a given unit. This
is often the case in finance or in economics. For example, financial institutions need
to predict the price of a stock or interest rates in a week or a month because this
affects the value of their investments.
In the second type of problem, the experimenter is interested in whether one variate
x tends to cause an increase or a decrease in another variate Y. Where possible, this is
investigated in a controlled experiment in which x is increased or decreased while holding
everything else in the experiment constant, and we observe the changes in Y. As indicated
in Chapter 1, an experiment in which the experimenter manipulates the values of the
explanatory variates is referred to as an experimental study. On the other hand, in the
study of whether second-hand smoke causes asthma, it is unlikely that the experimenter
would be able to manipulate the explanatory variate and so the experimenter needs to
rely on a potentially less informative observational study, one that depends on data that
is collected without the ability to control explanatory variates. We will see in Chapter
8 how an empirical study must be carefully designed in order to answer such causative
questions. Important considerations in an observational study are the design of the survey
and questionnaire, who to ask, what to ask, how many to ask, where to sample etc.
Defining the Problem
The first step in describing the Problem is to define the units and the target population
or target process.

Definition 14 The target population or process is the collection of units to which the conclusions will apply.
In Chapter 1, we considered a survey of teenagers in Ontario in a specic week to learn
about their smoking behaviour. In this example the units are teenagers in Ontario at the
time of the survey and the target population is all such teenagers.
In another example, we considered the comparison of two machines with respect to
the volume of liquid in cans being filled. The units are the individual cans. The target
population (or perhaps it is better to call it a process) is all such cans filled now and into
the future under current operating conditions. Sometimes we will be vague in specifying
the target population; i.e., "cans filled under current conditions" is not very clear. What
do we mean by "current conditions", for example?
Definition 15 A variate is a characteristic of every unit.
For each teenager (unit) in the target population, the variate of primary interest is
whether or not the teenager smokes. Other variates of interest defined for each unit might
be age and sex. In the can-filling example, the volume of liquid in each can is a variate.
The machine that filled the can is another variate. A key point to notice is that the values
of the variates change from unit to unit in the population. There are usually many variates
associated with each unit. At this stage, we will be interested in only those that help specify
the questions of interest.
Definition 16 An attribute is a function of the variates over the target population.
We specify the questions of interest in the Problem in terms of attributes of the target
population. In the smoking example, one important attribute is the proportion of teenagers
in the target population who smoke. In the can-filling example, the attributes of interest were the
average volume and the variability of the volumes for all cans filled by each machine under
current conditions. Possible questions of interest (among others) are:
What proportion of teenagers in Ontario smoke?
Is the standard deviation of volumes of cans filled by the new machine less than that
of the old machine?
We can also ask questions about graphical attributes of the target population such as
the population histogram or a scatterplot of one variate versus another over the whole
population.
It is very important that the Problem step contain clear questions about one or more
attributes of the target population.
2. Plan
In most cases, we cannot calculate the attributes of interest for the target population directly
because we can only examine a subset of the units in the target population. This may be
due to a lack of resources and time, as in the smoking survey, or a physical impossibility, as
in the can-filling study where we can only look at cans available now and not in the future.
Or, in an even more difficult situation, we may be forced to carry out a clinical trial using
mice because it is unethical to use humans, and so we do not examine any units in the target
population. Obviously there will be uncertainty in our answers. The purpose of the Plan
step is to decide what units we will examine (the sample), what data we will collect and
how we will do so. The Plan depends on the questions posed in the Problem step.
Definition 17 The study population or study process is the collection of units available to
be included in the study.
Often the study population is a subset of the target population (as in the teenage
smoking survey). However, in many medical applications, the study population consists of
laboratory animals whereas the target population consists of people. In the development of
new products, we may want to draw conclusions about a production process in the future
but we can only look at units produced in a laboratory in a pilot process. In this case, the
study units are not part of the target population. In many surveys, the study population
is a list of people defined by their telephone number. The sample is selected by calling
a subset of the telephone numbers. Therefore the study population excludes those people
without telephones or with unlisted numbers.
As indicated above, the study population is usually not identical to the target population.
Definition 18 If the attributes in the study population differ from those in the target population then the difference is called study error.
We cannot quantify study error but must rely on context experts to know, for example,
that conclusions from an investigation using mice will be relevant to the human target
population. We can however warn the context experts of the possibility of such error,
especially when the study population is very different from the target population.
Definition 19 The sampling protocol is the procedure used to select a sample of units from
the study population. The number of units sampled is called the sample size.
In Chapter 2, we discussed modeling the data and often claimed that we had a "random
sample" so that our model was simple. In practice, it is exceedingly difficult and expensive
to select a random sample of units from the study population, and so other less rigorous
methods are used. Often we take what we can get. Sample size is usually driven by
economics or availability. We will show in later chapters how we can use the model to help
with sample size determination.
Definition 20 If the attributes in the sample differ from those in the study population the
difference is called sample error or sampling error.
Even with random sampling, we are looking at only a subset of the units in the study
population. Differing sampling protocols are likely to produce different sample errors. Also,
since we do not know the values of the study population attributes, we cannot know the
sampling error. However, we can use the model to get an idea of how large this error might
be. These ideas are discussed in Chapter 4.
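The effect of sample size on sampling error is easy to see by simulation. A Python sketch (the population, attribute and sample sizes below are invented for illustration, not taken from the notes): we build a study population whose attribute, the population mean, is known exactly, draw repeated random samples, and watch how far the sample mean typically strays from the attribute.

```python
# Sketch: simulating sampling error. The population below is hypothetical;
# its mean is the attribute of interest, known here because we built it.
import random

random.seed(231)
population = [random.gauss(50, 10) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

typical_error = {}
for n in (10, 100, 1000):                 # sample size
    errors = []
    for _ in range(200):                  # 200 repetitions of the sampling protocol
        sample = random.sample(population, n)
        errors.append(abs(sum(sample) / n - pop_mean))  # sample error for this draw
    typical_error[n] = sum(errors) / len(errors)
    print(n, round(typical_error[n], 2))  # typical sample error shrinks as n grows
```

Each run of the protocol produces a different sample error; only its typical size, which shrinks as the sample size grows, can be described, and that is exactly what the model is used for in Chapter 4.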
We must decide which variates we are going to measure or determine for the units in
the sample. For any attributes of interest, as defined in the Problem step, we will certainly
measure the corresponding variates for the units in the sample. As we shall see, we may
also decide to measure other variates that can aid the analysis. In the smoking survey,
we will try to determine whether each teenager in the sample smokes or not (this requires
a careful definition) and also many demographic variates such as age and sex, so that we
can compare the smoking rate across age groups, sex, etc. In experimental studies, the
experimenters assign the value of a variate to each unit in the sample. For example, in a
clinical trial, sampled units can be assigned to the treatment group or the placebo group
by the experimenters. When the value of a variate is determined for a given unit, errors
are often introduced by the measurement system which determines the value.
Definition 21 If the measured value and the true value of a variate are not identical the
difference is called measurement error.
Measurement errors are usually unknown. In practice, we need to ensure that the
measurement systems used do not contribute substantial error to the conclusions. We may
have to examine the measurement systems themselves, in separate studies, to ensure that
this is so.
Target Population
   ↓   study error
Study Population
   ↓   sample error
Sample
   ↓   measurement error
Measured variate values
3. Data
The object of the Data step is to collect the data according to the Plan. Any deviations
from the Plan should be noted. The data must be stored in a way that facilitates the
Analysis.
The previous sections noted the need to define variates clearly and to have satisfactory
methods of measuring them. It is difficult to discuss the Data step except in the context
of specific examples, but we mention a few relevant points.
Mistakes can occur in recording or entering data into a database. For complex
investigations, it is useful to put checks in place to avoid these mistakes. For example,
if a field is missed, the database should prompt the data entry person to complete
the record if possible.

In many studies the units must be tracked and measured over a long period of time
(e.g. consider a study examining the ability of aspirin to reduce strokes in which
persons are followed for 3 to 5 years). This requires careful management.

When data are recorded over time or in different locations, the time and place for
each measurement should be recorded.
There may be departures from the study Plan that arise over time (e.g. persons may
drop out of a long term medical study because of adverse reactions to a treatment; it
may take longer than anticipated to collect the data so the number of units sampled
must be reduced). Departures from the Plan should be recorded since they may have
an important impact on the Analysis and Conclusion.
In some studies the amount of data may be extremely large, so database design and
management is important.
Missing data and response bias
Suppose we wish to conduct a study to determine if ethnic residents of a city are satisfied
with police service in their neighbourhood. A questionnaire is prepared. A sample of 300
mailing addresses in a predominantly ethnic neighbourhood is chosen and a uniformed
police officer is sent to each address to interview an adult resident. Is there a possible bias
in this study? It is likely that those who are strong supporters of the police are quite happy
to respond, but those with misgivings about the police will either choose not to respond
at all or change some of their responses to favour the police. This type of bias is called
response bias. When those that do respond have somewhat different characteristics than
the population at large, the quality of the data is threatened, especially when the response
rate (the proportion who do respond to the survey) is low. For example, in Canada in
2011, the long form of the Canadian Census (response rate around 98%) was replaced by
the National Household Survey (a voluntary version with similar questions, response rate
around 68%) and there was considerable discussion19 of the resulting response bias. See for
example the CBC story "Census Mourned on World Statistics Day"20.
4. Analysis
In Chapter 1 we discussed different methods of summarizing the data using numerical and
graphical summaries. A key step in formal analyses is the selection of an appropriate model
that can describe the data and how it was collected. In Chapter 2 we discussed methods
for checking the fit of the model. We also need to describe the Problem in terms of the
model parameters and properties. You will see many more formal analyses in subsequent
chapters.
5. Conclusions
The purpose of the Conclusion step is to answer the questions posed in the Problem. In
other words, the Conclusion is directed by the Problem. An attempt should be made
to quantify (or at least discuss) potential errors as described in the Plan step and any
limitations to the conclusions.
19
20
http://www.youtube.com/watch?v=0A7ojjsmSsY
http://www.cbc.ca/news/technology/story/2010/10/20/long-form-census-world-statistics-day.html
3.3 Case Study
Introduction
This case study is an example of more than one use of PPDAC which demonstrates some
real problems that arise with measurement systems. The documentation given here has
been rewritten from the original report to emphasize the underlying PPDAC framework.
Background
An automatic in-line gauge measures the diameter of a crankshaft journal on 100% of
the 500 parts produced per shift. The measurement system does not involve an operator
directly except for calibration and maintenance. Figure 3.1 shows the diameter in question.
The journal is a cylindrical part of the crankshaft. The diameter of the journal must
be defined since the cross-section of the journal is not perfectly round and there may be
taper along the axis of the cylinder. The gauge measures the maximum diameter as the
crankshaft is rotated, at a fixed distance from the end of the cylinder.
Overall Project
A project is planned to reduce scrap/rework by reducing part-to-part variation in the
diameter. A first step involves an investigation of the measurement system itself. There
is some speculation that the measurement system contributes substantially to the overall
process variation and that bias in the measurement system is resulting in the scrapping
and reworking of good parts. To decide if the measurement system is making a substantial
contribution to the overall process variability, we also need a measure of this attribute for
the current and future population of crankshafts. Since there are three different attributes
of interest, it is convenient to split the project into three separate applications of PPDAC.
Study 1
In this application of PPDAC, we estimate the properties of the errors produced by the
measurement system. In terms of the model, we will estimate the bias and variability due
to the measurement system. We hope that these estimates can be used to predict the future
performance of the system.
Problem
The target process is all future measurements made by the gauge on crankshafts to be
produced. The response variate is the measured diameter associated with each unit. The
attributes of interest are the average measurement error and the population standard deviation of these errors. We can quantify these concepts using a model (see below). A
detailed fishbone diagram for the measurement system is also shown in Figure 3.2. In such
a diagram, we list explanatory variates, organized by the major bones, that might be responsible for variation in the response variate, here the measured journal diameter. We can
use the diagram in formulating the Plan.
Note that the measurement system includes the gauge itself, the way the part is loaded
into the gauge, who loads the part, the calibration procedure (every two hours, a master
part is put through the gauge and adjustments are made based on the measured diameter
of the master part; that is, the gauge is zeroed), and so on.
Plan
To determine the properties of the measurement errors we must measure crankshafts with
known diameters. "Known" implies that the diameters were measured by an off-line measurement system that is very reliable. For any measurement system study in which bias is
an issue, there must be a reference measurement system which is known to have negligible
bias, and variability which is much smaller than that of the system under study.
There are many issues in establishing a study process or a study population. For convenience, we want to conduct the study quickly using only a few parts. However, this
restriction may lead to study error if the bias and variability of the measurement system
[Figure 3.2: Fishbone diagram for the measured journal diameter. Major bones include
Journal, Gauge, Environment, Calibration and Operator; explanatory variates include
maintenance, temperature, actual size, position of part, condition, wear on head, dirt,
out-of-round, training, frequency and master used.]
change as other explanatory variates change over time or parts. We guard against this
latter possibility by using three crankshafts with known diameters as part of the definition
of the study process. Since the units are the taking of measurements, we define the study
population as all measurements that can be taken in one day on the three selected crankshafts. These crankshafts were selected so that the known diameters were spread out over
the range of diameters normally seen. This will allow us to see if the attributes of the system
depend on the size of the diameter being measured. The known diameters used
were −10, 0, and +10. Remember the diameters have been rescaled, so a diameter of
−10 is okay.
No other explanatory variates were measured. To define the sampling protocol, it
was proposed to measure the three crankshafts ten times each, in a random order. Each
measurement involved the loading of the crankshaft into the gauge. Note that this was to
be done quickly to avoid delaying production of the crankshafts. The whole procedure took
only a few minutes.
The preparation for the data collection was very simple. One operator was instructed
to follow the sampling protocol and write down the measured diameters in the order that
they were collected.
Data
The repeated measurements on the three crankshafts are shown below. Note that due to
poor explanation of the sampling protocol, the operator measured each part ten times in
a row and did not use a random ordering. (Unfortunately non-adherence to the sampling
protocol often happens when real data are collected and it is important to consider the
effects of this in the Analysis and Conclusion steps.)
Crankshaft 1:  −10  −8  −12  −12  −8  −10  −11  −10  −12  −10
Crankshaft 2:    2  −1    2    2   0   −1    1    1    0    0
Crankshaft 3:    9  11    8   12  10    9   12   10   10   12
Analysis
A model to describe the repeated measurement of the known diameters is

    Y_ij = μ_i + R_ij,   R_ij ~ G(0, σ_m) independent        (3.1)

where i = 1, 2, 3 indexes the three crankshafts and j = 1, ..., 10 indexes the ten repeated
measurements. The parameter μ_i represents the long-term average measurement for crankshaft i. The random variables R_ij (called the residuals) represent the variability of the
measurement system, while σ_m quantifies this variability. Note that we have assumed, for
simplicity, that the variability σ_m is the same for all three crankshafts in the study.
We can rewrite the model in terms of the random variables Y_ij so that Y_ij ~ G(μ_i, σ_m).
Now we can write the likelihood as in Example 2.3.2 and maximize it with respect to the
four parameters μ_1, μ_2, μ_3 and σ_m (the trick is to solve ∂ℓ/∂μ_i = 0, i = 1, 2, 3 first). Not
surprisingly, the maximum likelihood estimates for μ_1, μ_2, μ_3 are the sample averages for
each crankshaft, so that

    μ̂_i = ȳ_i = (1/10) Σ_{j=1}^{10} y_ij   for i = 1, 2, 3.
To examine the assumption that σ_m is the same for all three crankshafts, we can calculate
the sample standard deviation for each of the three crankshafts. Let

    s_i = √[ (1/9) Σ_{j=1}^{10} (y_ij − ȳ_i)² ]   for i = 1, 2, 3.
The data can be summarized as:

                      ȳ_i     s_i
    Crankshaft 1    −10.3    1.49
    Crankshaft 2      0.6    1.17
    Crankshaft 3     10.3    1.42
The estimate of the bias for crankshaft 1 is the difference between the observed average
ȳ_1 and the known diameter value, which is equal to −10 for crankshaft 1; that is, the
estimated bias is −10.3 − (−10) = −0.3. For crankshafts 2 and 3 the estimated biases are
0.6 − 0 = 0.6 and 10.3 − 10 = 0.3 respectively, so the estimated biases in this study are all
small.
Note that the sample standard deviations s_1, s_2, s_3 are all about the same size, so
our assumption of a common value seems reasonable. (Note: it is possible to test this
assumption more formally.) An estimate of σ_m is given by

    s_m = √[ (s_1² + s_2² + s_3²) / 3 ] = 1.37.

Note that this estimate is not the average of the three sample standard deviations but the
square root of the average of the three sample variances. (Why does this estimate make
sense? Is it the maximum likelihood estimate of σ_m? What if the number of measurements
for each crankshaft were not equal?)
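The estimates above can be reproduced directly from the measurements. A Python sketch (the course software is R; the data lists below are the thirty tabulated measurements, with signs as inferred from the reported summary statistics):

```python
# Sketch: reproducing the Study 1 estimates mu_hat_i, s_i and s_m.
from math import sqrt

data = {
    1: [-10, -8, -12, -12, -8, -10, -11, -10, -12, -10],
    2: [2, -1, 2, 2, 0, -1, 1, 1, 0, 0],
    3: [9, 11, 8, 12, 10, 9, 12, 10, 10, 12],
}

means, variances = {}, {}
for i, y in data.items():
    n = len(y)
    means[i] = sum(y) / n                                         # mu_hat_i = sample average
    variances[i] = sum((v - means[i]) ** 2 for v in y) / (n - 1)  # s_i^2

s_m = sqrt(sum(variances.values()) / 3)  # root of the average of the three sample variances
print({i: round(m, 1) for i, m in means.items()})              # {1: -10.3, 2: 0.6, 3: 10.3}
print({i: round(sqrt(v), 2) for i, v in variances.items()})    # {1: 1.49, 2: 1.17, 3: 1.42}
print(round(s_m, 2))                                           # 1.37
```

Pooling the variances rather than averaging the standard deviations is what the model suggests: under (3.1) each s_i² estimates the same σ_m², so their average is the natural combined estimate.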
Conclusion
The observed biases −0.3, 0.6, 0.3 appear to be small, especially when measured against
the estimate of σ_m, and there is no apparent dependence of bias on crankshaft diameter.
To interpret the variability, we can use the model (3.1). Recall that if Y_ij ~ G(μ_i, σ_m)
then

    P(μ_i − 2σ_m ≤ Y_ij ≤ μ_i + 2σ_m) ≈ 0.95.

Therefore if we repeatedly measure the same journal diameter, then about 95% of the time
we would expect to see the observations vary by about 2(1.37) = 2.74.
There are several limitations to these conclusions. Because we have carried out the
study on one day only and used only three crankshafts, the conclusion may not apply to
all future measurements (study error). The fact that the measurements were taken within
a few minutes on one day might be misleading if something special was happening at that
time (sampling error). Since the measurements were not taken in random order, another
source of sampling error is the possible drift of the gauge over time.
We could recommend that, if the study were to be repeated, more than three known-value crankshafts could be used, that the time frame for taking the measurements could be
extended, and that more measurements be taken on each crankshaft. Of course, we would
also note that these recommendations would add to the cost and complexity of the study.
We would also insist that the operator be better informed about the Plan.
Study 2
The second study is designed to estimate the overall population standard deviation of the
diameters of current and future crankshafts (the target population). We need to estimate
this attribute to determine what variation is due to the process and what is due to the measurement system. A cause-and-effect or fishbone diagram listing some possible explanatory
variates for the variability in journal diameter is given in Figure 3.3. Note that there are
many explanatory variates other than the measurement system. Variability in the response
variate is induced by changes in the explanatory variates, including those associated with
the measurement system.
[Figure 3.3: Fishbone diagram for journal diameter. Major bones: Measurements, Machine,
Method, Material, Environment, Operator. Explanatory variates include speed of rotation,
maintenance, set-up of tooling, angle of cut, line speed, calibration, cutting tool edge,
position in gauge, hardness, dirt on part, set-up method, training, quenchant, temperature,
casting chemistry, casting lot and environment.]
We model the measured diameters as
$$Y_i = \mu + R_i, \quad R_i \sim G(0, \sigma)$$
where $Y_i$ represents the distribution of the measurement of the $i$th diameter, $\mu$ represents the study population mean diameter and the residual $R_i$ represents the variability due to sampling and the measurement system. We let $\sigma$ quantify this variability. We have not included a bias term in the model because we assume, based on our results from Study 1,
Figure 3.4: Histogram of 2000 measured values from the gauge memory
that the measurement system bias is small. As well we assume that the sampling protocol
does not contribute substantial bias.
The histogram of the 2000 measured diameters shows that there is considerable spread in the measured diameters. About 4.2% of the parts require reworking and 1.8% are scrapped. The shape of the histogram is approximately symmetrical and centred close to zero. The sample mean is
$$\bar{y} = \frac{1}{2000}\sum_{i=1}^{2000} y_i = 0.82$$
and an estimate of $\sigma$ is given by
$$s = \sqrt{\frac{1}{1999}\sum_{i=1}^{2000} (y_i - \bar{y})^2} = 5.17.$$
Conclusion
The overall process variation is estimated by $s$. Since the sample contained 2000 parts measured consecutively, many of the explanatory variates did not have time to change as they would in the study population. Thus, there is a danger of sampling error producing an estimate of the variation that is too small.
The variability due to the measurement system, estimated to be 1.37 in Study 1, is much less than the overall variability, which is estimated to be 5.17. One way to compare the two standard deviations $\sigma_m$ and $\sigma$ is to separate the total variability into the variability due to the measurement system $\sigma_m$ and that due to all other sources. In other words, we are interested in estimating the variability that would be present if there were no variability
in the measurement system ($\sigma_m = 0$). If we assume that the total variability arises from two independent sources, the measurement system and all other sources, then we have $\sigma^2 = \sigma_m^2 + \sigma_p^2$, or
$$\sigma_p = \sqrt{\sigma^2 - \sigma_m^2}$$
where $\sigma_p$ quantifies the variability due to all other uncontrollable variates (sampling variability). An estimate of $\sigma_p$ is given by
$$s_p = \sqrt{s^2 - s_m^2} = \sqrt{(5.17)^2 - (1.37)^2} = 4.99.$$
Hence, eliminating all of the variability due to the measurement system would produce an estimated variability of 4.99, which is a small reduction from 5.17. The measurement system seems to be performing well and is not contributing substantially to the overall variation.
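As a quick check, the arithmetic of this decomposition can be reproduced in a few lines of Python (a sketch; the numerical values are the estimates from Studies 1 and 2):

```python
import math

# Estimates from Study 1 (measurement system) and Study 2 (overall process)
s_total = 5.17  # estimate of sigma, the overall standard deviation
s_m = 1.37      # estimate of sigma_m, the measurement-system standard deviation

# Assuming two independent sources of variability: sigma^2 = sigma_m^2 + sigma_p^2
s_p = math.sqrt(s_total**2 - s_m**2)
print(round(s_p, 2))  # 4.99, the estimated process-only variability
```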
Comments
Study 3 revealed that the measurement system had a serious long term problem. At first,
it was suspected that the cause of the variability was the fact that the gauge was not
calibrated over the course of the study. Study 3 was repeated with a calibration before
each measurement. A pattern similar to that for Study 3 was seen. A detailed examination
of the gauge by a repairperson from the manufacturer revealed that one of the electronic
components was not working properly. This was repaired and Study 3 was repeated. This
study showed variation similar to the variation of the short term study (Study 1) so that
the overall project could continue. When Study 2 was repeated, the overall variation and
the number of scrap and reworked crankshafts was substantially reduced. The project was
considered complete and long term monitoring showed that the scrap rate was reduced to
about 0.7%, which produced an annual savings of more than $100,000.
As well, three similar gauges that were used in the factory were put through the long
term test. All were working well.
Summary
An important part of any Plan is the choice and assessment of the measurement
system.
The measurement system may contribute substantial error that can result in poor
decisions (e.g. scrapping good parts, accepting bad parts).
We represent systematic measurement error by bias in the model. The bias can be
assessed only by measuring units with known values, taken from another reference
measurement system. The bias may be constant or depend on the size of the unit
being measured, the person making the measurements, and so on.
Variability can be assessed by repeatedly measuring the same unit. The variability
may depend on the unit being measured or any other explanatory variates.
Both bias and variability may be a function of time. This can be assessed by examining
these attributes over a sufficiently long time span as in Study 3.
3.4 Chapter 3 Problems
1. Four weeks before a national election, a political party conducts a poll to assess what
proportion of eligible voters plan to vote and, of those, what proportion support the
party. This will determine how they run the rest of the campaign. They are able to
obtain a list of eligible voters and their telephone numbers in the 20 most populated
areas. They select 3000 names from the list and call them. Of these, 1104 eligible
voters agree to participate in the survey with the results summarized in the table
below.
                      Support Party
                      YES      NO
  Plan to Vote  YES   351      381
                NO    107      265
Answer the questions below based on this information.
(a) Define the Problem for this study. What type of Problem is this?
(b) What is the target population?
(c) What are the variates in this problem?
(d) What is the study population?
(e) What is the sample?
(f) What is a possible source of study error?
(g) What is a possible source of sampling error?
(h) There are two attributes of interest in the target population. In each case,
describe the attribute and provide an estimate based on the given data.
1. U.S. to fund study of Ontario math curriculum, Globe & Mail, January 17, 2014, Caroline Alphonso, Education Reporter (article has been condensed)
The U.S. Department of Education has funded a $2.7-million (U.S.) project, led by a team of Canadian researchers at Toronto's Hospital for Sick Children. The study will look at how elementary students at several Ontario schools fare in math using the current provincial curriculum as compared to the JUMP math program, which combines the conventional way of learning the subject with so-called discovery learning. Math teaching has come under scrutiny since OECD results that measured the scholastic abilities of 15-year-olds in 65 countries showed an increasing percentage of Canadian students failing the math test in nearly all provinces. Dr. Tracy Solomon and her team are collecting and analyzing two years of data on students in primary and junior grades from one school board, which she declined to name. The students were in Grades 2 and 5 when the study began, and are now in Grades 3 and 6, which means they will participate in Ontario's standardized testing program this year. The research team randomly assigned some schools to teach math according to the Ontario
2. Suppose you wish to study the smoking habits of teenagers and young adults, in order to understand what personal factors are related to whether, and how much, a person smokes. Briefly describe the main components of such a study, using the PPDAC framework. Be specific about the target and study population, the sample, and the variates you would collect.
3. Suppose you wanted to study the relationship between a person's resting pulse rate (heart beats per minute) and the amount and type of exercise they get.
(a) List some factors (including exercise) that might affect resting pulse rate. You may wish to draw a cause and effect (fishbone) diagram to represent potential causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and exercise using (i) an observational study, and (ii) an experimental study.
4. A large company uses photocopiers leased from two suppliers A and B. The lease rates are slightly lower for B's machines but there is a perception among workers that they break down and cause disruptions in workflow substantially more often.
Describe briefly how you might design and carry out a study of this issue, with the ultimate objective being a decision whether to continue the lease with company B. What additional factors might affect this decision?
5. For a study like the one in Example 1.3.1, where heights x and weights y of individuals
are to be recorded, discuss sources of variability due to the measurement of x and y
on any individual.
4. ESTIMATION
4.1
given application we would need to design the data collection plan to ensure this
assumption is valid.
(3) Suppose in the model chosen the population mean is represented by the parameter $\mu$. The sample mean $\bar{y}$ is an estimate of $\mu$, but not usually equal to it. How far away from $\mu$ is $\bar{y}$ likely to be? If we take a sample of only $n = 50$ units, would we expect the estimate $\bar{y}$ to be as good as $\bar{y}$ based on 150 units? (What does good mean?)
We focus on the third point in this chapter and assume that we can deal with the first two points with the methods discussed in Chapters 1 and 2.
4.2
Suppose that some attribute of interest for a population or process can be represented by a parameter $\theta$ in a statistical model. We assume that $\theta$ can be estimated using a random sample drawn from the population or process in question. Recall from Chapter 2 that a point estimate of $\theta$, denoted $\hat\theta$, was defined as a function of the observed sample $y_1, \ldots, y_n$,
$$\hat\theta = g(y_1, \ldots, y_n). \tag{4.1}$$
For example,
$$\hat\theta = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
is a point estimate of $\theta$ if $y_1, \ldots, y_n$ is an observed random sample from a Poisson distribution with mean $\theta$.
The method of maximum likelihood provides a general method for obtaining estimates, but other methods exist. For example, if $\theta = E(Y) = \mu$ is the average (mean) value of $y$ in the population, then the sample mean $\hat\mu = \bar{y}$ is an intuitively sensible estimate; it is the maximum likelihood estimate of $\mu$ if $Y$ has a $G(\mu, \sigma)$ distribution, but because of the Central Limit Theorem it is a good estimate of $\mu$ more generally. Thus, while we will use maximum likelihood estimation a great deal, you should remember that the discussion below applies to estimates of any type.
The problem facing us in this chapter is how to determine or quantify the uncertainty in an estimate. We do this using sampling distributions, which are based on the following idea. If we select random samples on repeated occasions, then the estimates $\hat\theta$ obtained from the different samples will vary. For example, five separate random samples of $n = 50$ persons from the same male population described in Example 1.3.1 gave five different estimates $\hat\mu = \bar{y}$ of $E(Y)$:
$$1.723 \quad 1.743 \quad 1.734 \quad 1.752 \quad 1.736.$$
Estimates vary as we take repeated samples and therefore we associate a random variable and a distribution with these estimates.
More precisely, we define this idea as follows. Let the random variables $Y_1, \ldots, Y_n$ represent the observations in a random sample, and associate with the estimate $\hat\theta$ given by (4.1) a random variable
$$\tilde\theta = g(Y_1, \ldots, Y_n).$$
The random variable $\tilde\theta = g(Y_1, \ldots, Y_n)$ is simply a rule that tells us how to process the data to obtain a numerical value $\hat\theta = g(y_1, \ldots, y_n)$, which is an estimate of the unknown parameter $\theta$ for a given data set $y_1, \ldots, y_n$. For example,
$$\tilde\mu = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
is a random variable and $\hat\mu = \bar{y}$ is a numerical value. We call $\tilde\theta$ the estimator of $\theta$ corresponding to $\hat\theta$. (We will always use $\hat\theta$ to denote an estimate, that is, a numerical value, and $\tilde\theta$ to denote the corresponding estimator, the random variable.)
Definition 22 A (point) estimator $\tilde\theta$ is a random variable which is a function $\tilde\theta = g(Y_1, Y_2, \ldots, Y_n)$ of the random variables $Y_1, Y_2, \ldots, Y_n$. The distribution of $\tilde\theta$ is called the sampling distribution of the estimator.
For example, suppose $Y_1, \ldots, Y_n$ is a random sample from a $G(\mu, \sigma)$ distribution and consider the estimator
$$\tilde\mu = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$
Then
$$P(|\tilde\mu - \mu| \le \delta) = P\left(-\frac{\delta\sqrt{n}}{\sigma} \le Z \le \frac{\delta\sqrt{n}}{\sigma}\right) \tag{4.2}$$
where $Z = (\bar{Y} - \mu)/(\sigma/\sqrt{n}) \sim G(0, 1)$. Clearly, as $n$ increases, the probability (4.2) approaches one. Furthermore, if we know $\sigma$ (even approximately) then we can find the probability for any given $\delta$ and $n$. For example, suppose $Y$ represents the height of a male (in metres) in the population of Example 1.3.1, and that we take $\delta = 0.01$. That is, we want the probability that the estimator $\tilde\mu$ falls within 0.01 m of the unknown mean $\mu$.
Using $\sigma = 0.07$, for $n = 50$ we obtain
$$P(|\tilde\mu - \mu| \le 0.01) = P(-1.01 \le Z \le 1.01) = 0.688$$
and for $n = 100$:
$$P(|\tilde\mu - \mu| \le 0.01) = P(-1.43 \le Z \le 1.43) = 0.847.$$
This indicates that a larger sample is better in the sense that the probability is higher that $\tilde\mu$ will be within 0.01 m of the true (and unknown) average height in the population. It also allows us to express the uncertainty in an estimate $\hat\mu = \bar{y}$ from an observed sample $y_1, \ldots, y_n$ by indicating the probability that any single random sample will give an estimate within a certain distance of $\mu$.
Example 4.2.2
In Example 4.2.1 we were able to determine the distribution of the estimator exactly, using properties of Gaussian random variables. Often we are not able to do this, and in such cases we can use simulation to study the distribution^{22}. For example, suppose we have a random sample $y_1, \ldots, y_n$ which we have assumed comes from an Exponential($\theta$) distribution. The maximum likelihood estimate of $\theta$ is $\hat\theta = \bar{y}$. What is the sampling distribution of $\tilde\theta = \bar{Y}$? We can examine the sampling distribution by using simulation. This involves taking repeated samples $y_1, \ldots, y_n$, giving (possibly different) values of $\bar{y}$ for each sample, as follows:
1. Generate a sample of size $n$. In R this is done using the statement y<-rexp(n,1/theta). (Note that in R the parameter is specified as the rate $1/\theta$.)
2. Compute $\hat\theta = \bar{y}$ from the sample. In R this is done using the statement ybar<-mean(y).
Repeat these two steps $k$ times. The $k$ values $\bar{y}_1, \ldots, \bar{y}_k$ can then be considered as a sample from the distribution of $\tilde\theta$, and we can study the distribution by plotting a histogram of the values.
The histogram in Figure 4.1 was obtained by drawing $k = 10000$ samples of size $n = 15$ from an Exponential(10) distribution, calculating the values $\bar{y}_1, \ldots, \bar{y}_{10000}$ and then plotting the relative frequency histogram. What do you notice about the distribution, particularly with respect to symmetry? Does the distribution look like a Gaussian distribution?
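The two R statements above can be sketched equivalently in Python using only the standard library (the seed and variable names here are our own choices, not part of the original example):

```python
import random
import statistics

random.seed(1)   # arbitrary seed, for reproducibility

theta = 10       # Exponential mean, as in the text
n = 15           # sample size
k = 10000        # number of repeated samples

# random.expovariate takes the rate 1/theta, just as rexp does in R
ybars = [statistics.mean(random.expovariate(1 / theta) for _ in range(n))
         for _ in range(k)]

# The mean of the ybars should be close to theta = 10; a histogram of
# ybars is noticeably right-skewed for a sample size as small as n = 15.
print(round(statistics.mean(ybars), 2))
```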
The approach illustrated in the preceding example can be used more generally. The main idea is that, for a given estimator $\tilde\theta$, we need to determine its sampling distribution in order to be able to compute probabilities of the form $P(|\tilde\theta - \theta| \le \delta)$, so that we can quantify the uncertainty of the estimate.
22 This approach can also be used to study sampling from a finite population of $N$ values, $\{y_1, \ldots, y_N\}$, where we might not want to use a continuous probability distribution for $Y$.
Figure 4.1: Relative frequency histogram of means from 10000 samples of size 15 from an Exponential(10) distribution
The estimates and estimators we have discussed so far are often referred to as point estimates and point estimators, because they consist of a single value or point. The discussion of sampling distributions shows how to address the uncertainty in an estimate. We also usually prefer to indicate explicitly the uncertainty in the estimate. This leads to the concept of an interval estimate, which takes the form
$$[L(y), U(y)]$$
where $L(y)$ and $U(y)$ are functions of the observed data $y$. Notice that this provides an interval with endpoints $L$ and $U$, both of which depend on the data. If we let $L(Y)$ and $U(Y)$ represent the associated random variables, then $[L(Y), U(Y)]$ is a random interval. If we were to draw many random samples from the same population and each time construct the interval $[L(y), U(y)]$, how often would the statement $\theta \in [L(y), U(y)]$ be true? The probability that the parameter falls in this random interval is $P[L(Y) \le \theta \le U(Y)]$, and hopefully this probability is large. This probability gives an indication of how good the rule is by which the interval estimate was obtained. For example, if $P[L(Y) \le \theta \le U(Y)] = 0.95$, then 95% of the time (that is, for 95% of the different samples we might draw), the true value of the parameter falls in the interval $[L(y), U(y)]$ constructed from the data set $y$. This means we can be reasonably safe in assuming, on this occasion and for this data set, that it does so. In general, uncertainty in an estimate is explicitly stated by giving the interval estimate along with the probability $P(\theta \in [L(Y), U(Y)])$.
4.3
The likelihood function can be used to obtain interval estimates for parameters in a very straightforward way. We do this here for the case in which the probability model involves only a single scalar parameter $\theta$. Individual models often have constraints on the parameters. For example, in the Gaussian distribution the mean can be any real number, $\mu \in \mathbb{R}$, but the standard deviation must be positive, that is, $\sigma > 0$. Similarly, for the Binomial model the probability of success $\theta$ must lie in the interval $[0, 1]$. These constraints are usually identified by requiring that the parameter fall in some set $\Omega$, called the parameter space.
As mentioned in Chapter 2, we often rescale the likelihood function to have a maximum value of one to obtain the relative likelihood function.
Definition 23 Suppose $\theta$ is scalar and that some observed data (say a random sample $y_1, \ldots, y_n$) have given a likelihood function $L(\theta)$. The relative likelihood function $R(\theta)$ is defined as
$$R(\theta) = \frac{L(\theta)}{L(\hat\theta)} \quad \text{for } \theta \in \Omega$$
where $\hat\theta$ is the maximum likelihood estimate. Note that $0 \le R(\theta) \le 1$ for all $\theta \in \Omega$.
A $100p\%$ likelihood interval for $\theta$ is the set $\{\theta : R(\theta) \ge p\}$.
and the maximum likelihood estimate of $\theta$ is the sample proportion $\hat\theta = y/n$. The relative likelihood function is
$$R(\theta) = \frac{\theta^y (1-\theta)^{n-y}}{\hat\theta^y (1-\hat\theta)^{n-y}} \quad \text{for } 0 < \theta < 1.$$
Figure 4.2: Relative likelihood function and log relative likelihood function for a Binomial model ($n = 200$ and $n = 1000$)
Figure 4.2 shows the relative likelihood functions $R(\theta)$ for two polls:
Poll 1: $n = 200$, $y = 80$
Poll 2: $n = 1000$, $y = 400$.
In each case $\hat\theta = 0.40$, but the relative likelihood function is more concentrated around $\hat\theta$ for the larger poll (Poll 2). The 10% likelihood intervals also reflect this:
Poll 1: $R(\theta) \ge 0.1$ for $0.33 \le \theta \le 0.47$
Poll 2: $R(\theta) \ge 0.1$ for $0.37 \le \theta \le 0.43$.
Likelihood intervals can also be obtained from the log relative likelihood function
$$r(\theta) = \log R(\theta) = l(\theta) - l(\hat\theta) \quad \text{for } \theta \in \Omega,$$
since $R(\theta) \ge p$ if and only if $r(\theta) \ge \log p$. A rough guide to interpreting likelihood intervals is as follows:
Values of $\theta$ inside a 50% likelihood interval are very plausible in light of the observed data.
Values of $\theta$ inside a 10% likelihood interval are plausible in light of the observed data.
Values of $\theta$ outside a 10% likelihood interval are implausible in light of the observed data.
Values of $\theta$ outside a 1% likelihood interval are very implausible in light of the observed data.
Likelihood intervals have desirable properties. One is that they become narrower as the sample size increases, thus indicating that larger samples contain more information about $\theta$. They are also easy to obtain by plotting $R(\theta)$ or $r(\theta) = \log R(\theta)$. The idea of a likelihood interval for a parameter can also be extended to the case of a vector of parameters $\theta$. In this case $R(\theta) \ge p$ gives likelihood regions for $\theta$.
The one apparent shortcoming of likelihood intervals so far is that we do not know how probable it is that a given interval will contain the true parameter value. As a result we also do not have a basis for the choice of $p$. Sometimes it is argued that values like $p = 0.10$ or $p = 0.05$ make sense because they rule out parameter values for which the probability of the observed data is less than $1/10$ or $1/20$ of the probability when $\theta = \hat\theta$. However, a more satisfying approach is to apply the sampling distribution ideas of Section 4.2 to interval estimates. This leads to the concept of confidence intervals, which we describe next.
4.4
Suppose we assume that the model chosen for the data $y$ is correct and that the interval estimate for the parameter $\theta$ is given by $[L(y), U(y)]$. To quantify the uncertainty in the interval estimate we look at an important property of the corresponding interval estimator $[L(Y), U(Y)]$, called the coverage probability, which is defined as follows.
Definition 26 The value
$$C(\theta) = P[L(Y) \le \theta \le U(Y)] \tag{4.3}$$
is called the coverage probability for the interval estimator $[L(Y), U(Y)]$.
A few words are in order about the meaning of the probability statement in (4.3). The parameter $\theta$ is an unknown fixed constant associated with the population. It is not a random variable and therefore does not have a distribution. The statement (4.3) can be interpreted in the following way. Suppose we were about to draw a random sample of the same size from the same population and the true value of the parameter was $\theta$. Suppose also that we knew that we would construct an interval of the form $[L(y), U(y)]$ once we had collected the data. Then the probability that $\theta$ will be contained in this new interval is $C(\theta)^{25}$.
How then does $C(\theta)$ assist in the evaluation of interval estimates? In practice, we try to find intervals for which $C(\theta)$ is fairly close to 1 (values 0.90, 0.95 and 0.99 are often used) while keeping the interval fairly narrow. Such interval estimates are called confidence intervals.
25 When we use the observed data $y$, $L(y)$ and $U(y)$ are numerical values, not random variables. We do not know whether $\theta \in [L(y), U(y)]$ or not. $P[L(y) \le \theta \le U(y)]$ makes no more sense than $P(1 \le 2 \le 3)$, since $L(y)$, $\theta$, $U(y)$ are all numerical values: there is no random variable to which the probability statement can refer.
26 See the video at www.watstat.com called What is a confidence interval. See also the Java applet http://www.math.uah.edu/stat/applets/MeanEstimateExperiment.html
To show that confidence intervals exist, and that the confidence coefficient sometimes does not depend on the unknown parameter, we consider the following simple example.
Example 4.4.1 Gaussian distribution with known standard deviation
Suppose $Y_1, \ldots, Y_n$ is a random sample from a $G(\mu, 1)$ distribution. That is, $\mu = E(Y_i)$ is unknown but $sd(Y_i) = 1$ is known. Consider the interval
$$\left[\bar{Y} - 1.96\,n^{-1/2},\ \bar{Y} + 1.96\,n^{-1/2}\right]$$
where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Since $\bar{Y} \sim G(\mu, 1/\sqrt{n})$, then
$$P\left(\bar{Y} - 1.96/\sqrt{n} \le \mu \le \bar{Y} + 1.96/\sqrt{n}\right) = P\left(-1.96 \le \sqrt{n}\,(\bar{Y} - \mu) \le 1.96\right) = P(-1.96 \le Z \le 1.96) = 0.95$$
where $Z \sim G(0, 1)$. Thus the interval $[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$ is a 95% confidence interval for the unknown mean $\mu$. This is an example in which the confidence coefficient does not depend on the unknown parameter, an extremely desirable feature of an interval estimator.
We repeat the very important interpretation of a $100p\%$ confidence interval (since so many people get the interpretation incorrect!): if the procedure is used repeatedly, then in a fraction $p$ of cases the constructed intervals will contain the true value of the unknown parameter. If in Example 4.4.1 a particular sample of size $n = 16$ had observed mean $\bar{y} = 10.4$, then the observed 95% confidence interval would be $[\bar{y} - 1.96/4,\ \bar{y} + 1.96/4]$, or $[9.91, 10.89]$. We cannot say that the probability that $\mu \in [9.91, 10.89]$ is 0.95, but we have a high degree of confidence (95%) that the interval $[9.91, 10.89]$ contains $\mu$.
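The numbers in this example are easy to verify directly; a minimal Python sketch:

```python
import math

ybar = 10.4    # observed sample mean
n = 16         # sample size
sigma = 1.0    # known standard deviation

half_width = 1.96 * sigma / math.sqrt(n)   # 1.96/4 = 0.49
interval = (round(ybar - half_width, 2), round(ybar + half_width, 2))
print(interval)  # (9.91, 10.89)
```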
Confidence intervals become narrower as the size of the sample on which they are based increases. For example, note the effect of $n$ in Example 4.4.1. The width of the confidence interval is $2(1.96)/\sqrt{n}$, which decreases as $n$ increases. We noted this earlier for likelihood intervals, and we will show in Section 4.6 that likelihood intervals are a type of confidence interval.
Recall that the coverage probability for the interval in the above example did not depend on the unknown parameter, a highly desirable property because we'd like to know the coverage probability while not knowing the value of the unknown parameter. We next consider a general method for finding confidence intervals which have this property.
Pivotal Quantities
Definition 28 A pivotal quantity $Q = g(Y; \theta)$ is a function of the data $Y$ and the unknown parameter $\theta$ such that the distribution of the random variable $Q$ is fully known. That is, probability statements such as $P(Q \le a)$ and $P(Q \le b)$ depend on $a$ and $b$ but not on $\theta$ or any other unknown information.
We now describe how a pivotal quantity can be used to construct a confidence interval. We begin with the statement $P[a \le g(Y; \theta) \le b] = p$, where $g(Y; \theta)$ is a pivotal quantity whose distribution is completely known. Suppose that we can re-express the inequality $a \le g(Y; \theta) \le b$ in the form $L(Y) \le \theta \le U(Y)$ for some functions $L$ and $U$. Since
$$p = P[a \le g(Y; \theta) \le b] = P[L(Y) \le \theta \le U(Y)] = P(\theta \in [L(Y), U(Y)]),$$
the interval $[L(y), U(y)]$ is a $100p\%$ confidence interval for $\theta$. The confidence coefficient for the interval $[L(y), U(y)]$ is equal to $p$, which does not depend on $\theta$. The confidence coefficient does depend on $a$ and $b$, but these are determined by the known distribution of $g(Y; \theta)$.
Example 4.4.2 Confidence interval for the mean of a Gaussian distribution with known standard deviation
Suppose $Y = (Y_1, \ldots, Y_n)$ is a random sample from the $G(\mu, \sigma_0)$ distribution where $E(Y_i) = \mu$ is unknown but $sd(Y_i) = \sigma_0$ is known. Since
$$Q = Q(Y; \mu) = \frac{\bar{Y} - \mu}{\sigma_0/\sqrt{n}} \sim G(0, 1)$$
is a pivotal quantity, we have
$$P(a \le Q \le b) = P\left(\bar{Y} - b\,\sigma_0/\sqrt{n} \le \mu \le \bar{Y} - a\,\sigma_0/\sqrt{n}\right)$$
so that if $P(a \le Q \le b) = 0.95$, then
$$\left[\bar{y} - b\,\sigma_0/\sqrt{n},\ \bar{y} - a\,\sigma_0/\sqrt{n}\right]$$
is a 95% confidence interval for $\mu$ based on the observed data $y = (y_1, \ldots, y_n)$. Note that there are infinitely many pairs $(a, b)$ giving $P(a \le Q \le b) = 0.95$. A common choice for the Gaussian distribution is to pick points symmetric about zero, $a = -1.96$, $b = 1.96$. This results in the interval $[\bar{y} - 1.96\,\sigma_0/\sqrt{n},\ \bar{y} + 1.96\,\sigma_0/\sqrt{n}]$, or $\bar{y} \pm 1.96\,\sigma_0/\sqrt{n}$, which turns out to be the narrowest possible 95% confidence interval.
The interval $[\bar{y} - 1.96\,\sigma_0/\sqrt{n},\ \bar{y} + 1.96\,\sigma_0/\sqrt{n}]$ is often referred to as a two-sided confidence interval. Note that this interval takes the form
$$\text{point estimate} \pm \text{constant} \times \text{(standard deviation of the estimator)}.$$
Many two-sided confidence intervals we will encounter in this course will take a similar form.
Another choice for $a$ and $b$ would be $a = -\infty$, $b = 1.645$, which gives the interval $[\bar{y} - 1.645\,\sigma_0/\sqrt{n},\ \infty)$. This interval is usually referred to as a one-sided confidence interval. This type of interval is useful when we are interested in determining a lower bound on the value of $\mu$.
It turns out that for most distributions it is not possible to find exact pivotal quantities or confidence intervals for $\theta$ whose coverage probabilities do not depend somewhat on the true value of $\theta$. However, in general we can find quantities $Q_n = g(Y_1, \ldots, Y_n; \theta)$ such that as $n \to \infty$, the distribution of $Q_n$ ceases to depend on $\theta$ or other unknown information. We then say that $Q_n$ is asymptotically pivotal, and in practice we treat $Q_n$ as a pivotal quantity for sufficiently large values of $n$; more accurately, we call $Q_n$ an approximate pivotal quantity.
Example 4.4.3 Approximate confidence interval for the Binomial model
Suppose $Y \sim \text{Binomial}(n, \theta)$. From the Central Limit Theorem we know that for large $n$, $Q_1 = (Y - n\theta)/[n\theta(1-\theta)]^{1/2}$ has approximately a $G(0, 1)$ distribution. It can also be shown that the distribution of
$$Q = Q(Y; \theta) = \frac{Y - n\theta}{[n\tilde\theta(1-\tilde\theta)]^{1/2}}$$
where $\tilde\theta = Y/n$, is also close to $G(0, 1)$ for large $n$. Thus $Q$ can be used as an approximate pivotal quantity to construct confidence intervals for $\theta$. For example,
$$0.95 \approx P(-1.96 \le Q \le 1.96) = P\left(\tilde\theta - 1.96\left[\tilde\theta(1-\tilde\theta)/n\right]^{1/2} \le \theta \le \tilde\theta + 1.96\left[\tilde\theta(1-\tilde\theta)/n\right]^{1/2}\right).$$
Thus
$$\hat\theta \pm 1.96\left[\frac{\hat\theta(1-\hat\theta)}{n}\right]^{1/2} \tag{4.5}$$
gives an approximate 95% confidence interval for $\theta$, where $\hat\theta = y/n$ and $y$ is the observed data. As a numerical example, suppose we observed $n = 100$, $y = 18$. Then (4.5) gives $0.18 \pm 1.96\,[0.18(0.82)/100]^{1/2}$, or $[0.105, 0.255]$.
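The numerical interval above can be checked with a few lines of Python (a sketch, not part of the original example):

```python
import math

n, y = 100, 18
theta_hat = y / n   # 0.18

half_width = 1.96 * math.sqrt(theta_hat * (1 - theta_hat) / n)
interval = (round(theta_hat - half_width, 3), round(theta_hat + half_width, 3))
print(interval)  # (0.105, 0.255)
```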
Remark: It is important to understand that confidence intervals may vary a great deal when we take repeated samples. For example, in Example 4.4.3, ten samples of size $n = 100$ which were simulated from a population with $\theta = 0.25$ gave the following approximate 95% confidence intervals for $\theta$:
[0.20, 0.38]  [0.14, 0.31]  [0.14, 0.31]  [0.10, 0.26]  [0.23, 0.42]
[0.21, 0.40]  [0.22, 0.41]  [0.15, 0.33]  [0.18, 0.36]  [0.19, 0.37]
For larger samples (larger $n$), the confidence intervals are narrower and will be in better agreement. For example, try generating a few samples of size $n = 1000$ and compare the confidence intervals for $\theta$.
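A simulation along the lines of this remark can be sketched in Python using only the standard library (the seed is an arbitrary choice, so the intervals produced will differ from those listed above):

```python
import math
import random

random.seed(2)   # arbitrary seed, for reproducibility

theta, n, k = 0.25, 100, 10   # true parameter, sample size, number of samples

intervals = []
for _ in range(k):
    y = sum(random.random() < theta for _ in range(n))   # one Binomial(n, theta) draw
    th = y / n
    hw = 1.96 * math.sqrt(th * (1 - th) / n)
    intervals.append((round(th - hw, 2), round(th + hw, 2)))

covered = sum(lo <= theta <= hi for lo, hi in intervals)
print(intervals)
print(covered, "of", k, "intervals contain theta = 0.25")
```

On most runs, nine or ten of the ten intervals will contain the true value 0.25, as the 95% coverage probability suggests.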
Choosing a Sample Size
We have seen that confidence intervals for a parameter tend to get narrower as the sample size $n$ increases. When designing a study we often decide how large a sample to collect on the basis of (i) how narrow we would like confidence intervals to be, and (ii) how much we can afford to spend (it costs time and money to collect data). The following example illustrates the procedure.
Example 4.4.5 Sample size and estimation of a Binomial probability
Suppose we want to estimate the probability $\theta$ from a Binomial experiment in which $Y \sim \text{Binomial}(n, \theta)$. We use the approximate pivotal quantity
$$Q = \frac{Y - n\theta}{[n\tilde\theta(1-\tilde\theta)]^{1/2}}$$
which was introduced in Example 4.4.3 and which has approximately a $G(0, 1)$ distribution, to obtain confidence intervals for $\theta$. Here is a criterion that is widely used for choosing the size of $n$: choose $n$ large enough so that the width of a 95% confidence interval for $\theta$ is no wider than $2(0.03)$. Let us see where this leads and why this rule is used.
From Example 4.4.3, we know that
$$\hat\theta \pm 1.96\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}$$
is an approximate 0.95 confidence interval for $\theta$, with width
$$2(1.96)\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}.$$
To make this confidence interval no wider than $2(0.03)$ (or even narrower, say $2(0.025)$), we need $n$ large enough so that
$$1.96\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}} \le 0.03, \quad \text{or} \quad n \ge \left(\frac{1.96}{0.03}\right)^2 \hat\theta(1-\hat\theta).$$
Of course we don't know what $\hat\theta$ is because we have not taken a sample, but we note that the worst case scenario occurs when $\hat\theta = 0.5$. So to be conservative, we find $n$ such that
$$n \ge \left(\frac{1.96}{0.03}\right)^2 (0.5)^2 \approx 1067.1.$$
Thus, choosing $n = 1068$ (or larger) will result in an approximate 95% confidence interval of the form $\hat\theta \pm c$, where $c \le 0.03$. If you look or listen carefully when polling results are announced, you'll often hear words like "this poll is accurate to within 3 percentage points 19 times out of 20." What this really means is that the estimator $\tilde\theta$ (which is usually given in percentage form) approximately satisfies $P(|\tilde\theta - \theta| \le 0.03) = 0.95$, or equivalently, that the actual estimate $\hat\theta$ is the centre of an approximate 95% confidence interval $\hat\theta \pm c$ for which $c = 0.03$. In practice, many polls are based on 1050 to 1100 people, giving accuracy to within 3 percent with probability 0.95. Of course, one needs to be able to afford to collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we'd only need $n = 385$ (show this). In many situations, however, this might not be sufficiently accurate for the purpose of the study.
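The sample-size rule above is easy to package as a small function; the following Python sketch (the function name and its rounding guard are our own choices) reproduces the values $n = 1068$ and $n = 385$:

```python
import math

def binomial_sample_size(half_width, z=1.96, theta_hat=0.5):
    """Smallest n with z * sqrt(theta_hat*(1 - theta_hat)/n) <= half_width.

    theta_hat = 0.5 is the conservative worst case used in the text."""
    n = (z / half_width) ** 2 * theta_hat * (1 - theta_hat)
    return math.ceil(round(n, 6))   # round() guards against floating-point fuzz

print(binomial_sample_size(0.03))  # 1068
print(binomial_sample_size(0.05))  # 385
```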
Exercise: Show that to ensure that the width of the approximate 95% confidence interval is $2(0.02) = 0.04$ or smaller, you need $n \ge 2401$. What should $n$ be so that the width of a 99% confidence interval is $0.04$ or less?
Remark: Very large Binomial polls ($n \ge 2000$) are not done very often. Although we can in theory estimate $\theta$ very precisely with an extremely large poll, there are two problems:
1. It is difficult to pick a sample that is truly random, so $Y \sim \text{Binomial}(n, \theta)$ is only an approximation.
2. In many settings the value of $\theta$ changes over time, so a poll estimates it only at one point in time.
As a result, the real accuracy of a poll cannot generally be made arbitrarily high.
Sample sizes can be similarly determined so as to give confidence intervals of some desired length in other settings. We consider this topic again in Section 4.7 for the $G(\mu, \sigma)$ distribution.
4.5
In this section we introduce two new distributions, the Chi-squared distribution and the Student t distribution. These two distributions play an important role in constructing confidence intervals and the tests of hypotheses to be discussed in Chapter 5.
The $\chi^2$ (Chi-squared) Distribution
A random variable $Y$ has a Chi-squared distribution with $k$ degrees of freedom, written $Y \sim \chi^2(k)$, if its probability density function is
$$f(x; k) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{(k/2)-1} e^{-x/2} \quad \text{for } x > 0 \tag{4.6}$$
where $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx$ for $\alpha > 0$ is the gamma function. The moment generating function of $Y$ is
$$M(t) = (1-2t)^{-k/2} \quad \text{for } t < \frac{1}{2}. \tag{4.7}$$
Therefore
$$E(Y) = M'(0) = k(1-2t)^{-k/2-1}\big|_{t=0} = k$$
and
$$E(Y^2) = M''(0) = k(k+2)(1-2t)^{-k/2-2}\big|_{t=0} = k(k+2)$$
so
$$Var(Y) = E(Y^2) - [E(Y)]^2 = k(k+2) - k^2 = 2k.$$
The cumulative distribution function, $F(x; k)$, can be given in closed algebraic form for even values of $k$. In R, the functions dchisq(x,k) and pchisq(x,k) give the probability density function $f(x; k)$ and cumulative distribution function $F(x; k)$ for the $\chi^2(k)$ distribution. A table with selected values is given at the end of these course notes.
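As a sanity check on $E(Y) = k$ and $Var(Y) = 2k$, we can simulate $\chi^2(k)$ values in Python by summing squared $G(0, 1)$ values (a fact established in Corollary 31 below); this is a sketch using only the standard library, with an arbitrary seed:

```python
import random
import statistics

random.seed(3)   # arbitrary seed, for reproducibility

k = 8        # degrees of freedom
m = 20000    # number of simulated values

# A chi-squared(k) value is a sum of k squared G(0,1) values (Corollary 31)
ys = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(m)]

print(round(statistics.mean(ys), 2))      # should be close to k = 8
print(round(statistics.variance(ys), 2))  # should be close to 2k = 16
```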
Figure 4.3: Probability density functions of the $\chi^2(k)$ distribution for $k = 2$, $4$ and $8$
Theorem 29 If $Y_1, \ldots, Y_n$ are independent random variables with $Y_i \sim \chi^2(k_i)$, then $S = \sum_{i=1}^{n} Y_i \sim \chi^2\left(\sum_{i=1}^{n} k_i\right)$.
Proof. The moment generating function of $Y_i$ is $M_i(t) = (1-2t)^{-k_i/2}$. Thus, by independence,
$$M_S(t) = \prod_{i=1}^{n} M_i(t) = (1-2t)^{-\sum_{i=1}^{n} k_i/2},$$
which is the moment generating function of a $\chi^2\left(\sum_{i=1}^{n} k_i\right)$ random variable.
Theorem 30 If Z ~ G(0, 1) then W = Z² ~ χ²(1).

Proof. For w > 0,

P(W ≤ w) = P(-√w ≤ Z ≤ √w) = Φ(√w) - Φ(-√w)

where Φ is the G(0, 1) cumulative distribution function. Differentiating with respect to w gives the probability density function of W:

f(w) = φ(√w) w^(-1/2) = (1/√(2π)) e^(-w/2) w^(-1/2)   for w > 0,

which is the probability density function of a χ²(1) random variable, since Γ(1/2) = √π.
Figure 4.4: Probability density functions for the t(2) distribution (dashed red) and the G(0, 1) distribution (solid blue)
Corollary 31 If Z₁, ..., Zₙ are mutually independent G(0, 1) random variables and S = Σ_{i=1}^n Zᵢ², then S ~ χ²(n).

Proof. Since Zᵢ² ~ χ²(1) for each i by Theorem 30, the result follows from Theorem 29.
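Corollary 31 is easy to check by simulation. This Python sketch compares the sample mean and variance of S = Σ Zᵢ² with the χ²(n) values n and 2n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000
Z = rng.standard_normal((reps, n))   # reps samples of n independent G(0,1) variates
S = (Z**2).sum(axis=1)               # each S should be chi-squared with n df

print(S.mean())  # close to n = 5
print(S.var())   # close to 2n = 10
```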
Student's t Distribution

Student's t distribution (or more simply the t distribution) with k degrees of freedom has probability density function

f(x; k) = c_k (1 + x²/k)^(-(k+1)/2)   for x ∈ ℝ,   where   c_k = Γ((k+1)/2) / [√(kπ) Γ(k/2)].

Figure 4.4 plots f(x; 2) together with the G(0, 1) probability density function. For large values of k, the graph of the probability density function f(x; k) is indistinguishable from that of the G(0, 1) probability density function. The primary difference, for small k such as the one plotted, is in the tails of the distribution: the t probability density function has "fatter tails", that is, more area in the extreme left and right tails. Problem 22 at the end of this chapter considers some properties of f(x; k).
Probabilities for the t distribution are available from tables at the end of these notes or from computer software. In R, the cumulative distribution function F(x; k) = P(T ≤ x) where T ~ t(k) is obtained using pt(x, k). For example, pt(1.5, 10) gives P(T ≤ 1.5) = 0.918 for T ~ t(10).
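The same computation in Python (a sketch; scipy.stats.t.cdf plays the role of R's pt):

```python
from scipy.stats import t

# P(T <= 1.5) for T ~ t(10), the analogue of pt(1.5, 10) in R
p = t.cdf(1.5, df=10)
print(round(p, 3))  # 0.918
```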
The t distribution arises as a result of the following theorem involving the ratio of a G(0, 1) random variable and an independent Chi-squared random variable. We will not attempt to prove this theorem here.

Theorem 32 Suppose Z ~ G(0, 1) and U ~ χ²(k) independently. Let T = Z / √(U/k). Then T ~ t(k).

4.6
We will now show that likelihood intervals are also confidence intervals. Recall the relative likelihood function R(θ) = L(θ)/L(θ̂), where θ̂ is the maximum likelihood estimate. Replace the estimate θ̂ by the random variable (the estimator) θ̃ and define the random variable

Λ(θ) = -2 log [L(θ)/L(θ̃)] = 2ℓ(θ̃) - 2ℓ(θ)

where θ̃ is the maximum likelihood estimator. The random variable Λ(θ) is called the likelihood ratio statistic. The following result can be proved:

Theorem 33 If θ is the true value of the parameter then, under mild conditions, the distribution of Λ(θ) converges to a χ²(1) distribution as the sample size n → ∞.

Thus for large n, if W ~ χ²(1),

P(Λ(θ) ≤ c) = P(2ℓ(θ̃) - 2ℓ(θ) ≤ c) ≈ P(W ≤ c).

If c is chosen so that P(W ≤ c) = p, then

{θ : 2ℓ(θ̂) - 2ℓ(θ) ≤ c}

is an approximate 100p% confidence interval.
We see that the approximate 100p% confidence interval for θ based on the likelihood ratio statistic is also a likelihood interval. (Recall that likelihood intervals must usually be found numerically or by plotting R(θ) or r(θ).)

For example, for p = 0.95 we have

0.95 = P(W ≤ c)   where W ~ χ²(1),

so c = 3.841 and the approximate 95% confidence interval is

{θ : 2ℓ(θ̂) - 2ℓ(θ) ≤ 3.841}.

Since

{θ : 2ℓ(θ̂) - 2ℓ(θ) ≤ 3.841} = {θ : R(θ) ≥ e^(-3.841/2)} = {θ : R(θ) ≥ 0.147},

an approximate 95% confidence interval for θ is therefore (approximately) a 15% likelihood interval for θ.
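The constants c = 3.841 and e^(-c/2) = 0.147 can be computed directly; a Python sketch:

```python
import math
from scipy.stats import chi2

p = 0.95
c = chi2.ppf(p, df=1)          # 0.95 quantile of a chi-squared(1) distribution
threshold = math.exp(-c / 2)   # likelihood-interval cutoff: R(theta) >= threshold
print(round(c, 3), round(threshold, 3))  # 3.841 0.147
```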
Exercise: Determine the likelihood intervals which correspond to approximate 90% and 99% confidence intervals.
Conversely, if we want to determine the approximate confidence coefficient for a 100p% likelihood interval, we note that the 100p% likelihood interval defined by {θ : R(θ) ≥ p} can be rewritten as

{θ : R(θ) ≥ p} = {θ : -2 log [L(θ)/L(θ̂)] ≤ -2 log p}
             = {θ : 2 log L(θ̂) - 2 log L(θ) ≤ -2 log p}
             = {θ : 2ℓ(θ̂) - 2ℓ(θ) ≤ -2 log p}.

By Theorem 33 the confidence coefficient for this interval can be approximated by

P(Λ(θ) ≤ -2 log p) ≈ P(W ≤ -2 log p)   where W ~ χ²(1).

For example, for a 10% likelihood interval (p = 0.1),

P(W ≤ -2 log(0.1)) = P(|Z| ≤ √(-2 log(0.1)))   where Z ~ G(0, 1)
                 = 2P(Z ≤ 2.15) - 1 = 0.9684,

so a 10% likelihood interval is an approximate 96.8% confidence interval.
Figure 4.5: Relative likelihood function R(θ) for a Binomial model with n = 100 and y = 40
4.7

Suppose that Y ~ G(μ, σ) models a response variate y in some population or process. A random sample Y₁, ..., Yₙ is selected, and we want to estimate the model parameters. We have already seen in Section 2.2 that the maximum likelihood estimators of μ and σ² are

μ̃ = Ȳ = (1/n) Σ_{i=1}^n Yᵢ   and   σ̃² = (1/n) Σ_{i=1}^n (Yᵢ - Ȳ)².

Another estimator of σ² is the sample variance

S² = [1/(n-1)] Σ_{i=1}^n (Yᵢ - Ȳ)²,

which differs from σ̃² only by the choice of denominator. Indeed, if n is large there is very little difference between S² and σ̃². Note that the sample variance has the advantage that it is an unbiased estimator, that is, E(S²) = σ², since
E(S²) = [1/(n-1)] E[Σ_{i=1}^n (Yᵢ - Ȳ)²]
      = [1/(n-1)] E[Σ_{i=1}^n (Yᵢ - μ)² - n(Ȳ - μ)²]
      = [1/(n-1)] [n Var(Y₁) - n Var(Ȳ)]
      = [1/(n-1)] (nσ² - σ²)
      = σ².
We now consider interval estimation for μ and σ. If σ were known, then

Z = (Ȳ - μ) / (σ/√n) ~ G(0, 1)   (4.8)

would be a pivotal quantity that could be used to obtain confidence intervals for μ. However, σ is generally unknown. Fortunately it turns out that if we simply replace σ with either the maximum likelihood estimator σ̃ or the sample standard deviation S in Z, then we still have a pivotal quantity. We will write the pivotal quantity in terms of S. The pivotal quantity is

T = (Ȳ - μ) / (S/√n).   (4.9)
Since S, unlike σ, is a random variable in (4.9), the distribution of T is no longer G(0, 1). The random variable T actually has a t distribution, which was introduced in Section 4.5.

Theorem 34 Suppose Y₁, ..., Yₙ is a random sample from the G(μ, σ) distribution with sample mean Ȳ and sample variance S². Then

T = (Ȳ - μ) / (S/√n) ~ t(n - 1).   (4.10)
Proof. Let

Z = (Ȳ - μ) / (σ/√n) ~ G(0, 1)   and   U = (n - 1)S²/σ².

We choose this function of S² since it can be shown that U ~ χ²(n - 1). It can also be shown that Z and U are independent random variables²⁸. By Theorem 32 with k = n - 1, we have

T = Z / √(U/(n-1)) = [(Ȳ - μ)/(σ/√n)] / √(S²/σ²) = (Ȳ - μ)/(S/√n) ~ t(n - 1).

In other words, if we replace σ in the pivotal quantity (4.8) by its estimator S, the resulting pivotal quantity has a t(n - 1) distribution rather than a G(0, 1) distribution. The degrees of freedom are inherited from the degrees of freedom of the Chi-squared random variable U, or equivalently from S².
We now show how to use the t distribution to obtain a confidence interval for μ when σ is unknown. Since (4.10) has a t distribution with n - 1 degrees of freedom, which is a completely known distribution, we can use this pivotal quantity to construct a 100p% confidence interval for μ. Since the t distribution is symmetric, we determine the constant a such that P(-a ≤ T ≤ a) = p using t tables or R. Since

p = P(-a ≤ T ≤ a)
  = P(-a ≤ (Ȳ - μ)/(S/√n) ≤ a)
  = P(Ȳ - aS/√n ≤ μ ≤ Ȳ + aS/√n),

a 100p% confidence interval for μ is given by

[ȳ - as/√n, ȳ + as/√n].   (4.11)
(Note that if we attempted to use (4.8) to build a confidence interval we would have two unknowns in the inequality, since both μ and σ are unknown.) As usual, the method used to construct this interval implies that 100p% of the confidence intervals constructed from samples drawn from this population contain the true value of μ.

We note that this interval is of the form ȳ ± as/√n, that is, estimate ± a × (estimated standard deviation of the estimator).
²⁸The proof of the remarkable result that, for a random sample from a Normal distribution, the sample mean and the sample variance are independent random variables, is beyond the scope of this course.
This is similar in form to the confidence interval for μ when σ is known, except that in that case the standard deviation of the estimator is known and the value of a is taken from the G(0, 1) distribution rather than the t distribution.
Example 4.7.1 IQ test

Scores Y for an IQ test administered to ten-year-old children in a very large population have close to a G(μ, σ) distribution. A random sample of 10 children in a particular large inner-city school obtained test scores as follows:
103, 115, 97, 101, 100, 108, 111, 91, 119, 101

for which Σ_{i=1}^{10} yᵢ = 1046 and Σ_{i=1}^{10} yᵢ² = 110072.

For T ~ t(9) we have P(-2.262 ≤ T ≤ 2.262) = 0.95, so the 95% confidence interval for μ based on (4.11) is

[ȳ - 2.262 s/√10, ȳ + 2.262 s/√10].

For the given data ȳ = 104.6 and s = 8.57, so the confidence interval is 104.6 ± 6.13, or [98.47, 110.73].
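The interval in Example 4.7.1 can be reproduced numerically; a Python sketch using scipy.stats.t for the quantile 2.262:

```python
import math
from scipy.stats import t

scores = [103, 115, 97, 101, 100, 108, 111, 91, 119, 101]
n = len(scores)
ybar = sum(scores) / n
s = math.sqrt(sum((y - ybar) ** 2 for y in scores) / (n - 1))

a = t.ppf(0.975, df=n - 1)        # 2.262, since P(-a <= T <= a) = 0.95 for t(9)
half = a * s / math.sqrt(n)
print(round(ybar - half, 2), round(ybar + half, 2))  # 98.47 110.73
```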
Behaviour as n → ∞

As n increases, confidence intervals behave in a largely predictable fashion. First, the estimated standard deviation s gets closer to the true standard deviation σ. Second, as the degrees of freedom increase, the t distribution approaches the Gaussian, so the quantiles of the t distribution approach those of the G(0, 1) distribution. For example, if in Example 4.7.1 we knew that σ = 8.57, then we would use the 95% confidence interval ȳ ± 1.96(8.57)/√n instead of ȳ ± 2.262(8.57)/√n with n = 10. In general, the width of the confidence interval gets narrower as n increases (at the rate 1/√n), so the confidence intervals shrink towards the single point ȳ.
Theorem 35 If Y₁, ..., Yₙ is a random sample from the G(μ, σ) distribution, then

U = (n - 1)S²/σ² = (1/σ²) Σ_{i=1}^n (Yᵢ - Ȳ)²   (4.12)

has a Chi-squared distribution with n - 1 degrees of freedom.
While we will not prove this result, we should at least try to explain the puzzling number of degrees of freedom, n - 1, which on the surface seems wrong since Σ_{i=1}^n (Yᵢ - Ȳ)² is the sum of n squared Normal random variables. Does this contradict Corollary 31? It is in fact true that each (Yᵢ - Ȳ) is a Normally distributed random variable, but not in general standard Normally distributed and, more importantly, the terms are not independent. It is easy to see that (Yᵢ - Ȳ), i = 1, 2, ..., n are not independent, since Σ_{i=1}^n (Yᵢ - Ȳ) = 0 implies that the last term is determined by the other n - 1 terms:

Yₙ - Ȳ = -Σ_{i=1}^{n-1} (Yᵢ - Ȳ).

Although there are n terms (Yᵢ - Ȳ), i = 1, 2, ..., n in the summand for S², there are really only n - 1 that are free (that is, linearly independent). This is an intuitive explanation for the n - 1 degrees of freedom both of the Chi-squared and of the t distribution. In both cases, the degrees of freedom are inherited from S² and are related to the dimension of the subspace inhabited by the terms in the sum for S², that is, (Yᵢ - Ȳ), i = 1, ..., n.
We will now show how we can use Theorem 35 to construct a 100p% confidence interval for the parameter σ² or σ. First note that (4.12) is a pivotal quantity since its distribution is completely known. Using Chi-squared tables or R we can find constants a and b such that

P(a ≤ U ≤ b) = p

where U ~ χ²(n - 1). Since
p = P(a ≤ U ≤ b)
  = P(a ≤ (n - 1)S²/σ² ≤ b)
  = P((n - 1)S²/b ≤ σ² ≤ (n - 1)S²/a)
  = P(√((n - 1)S²/b) ≤ σ ≤ √((n - 1)S²/a)),

a 100p% confidence interval for σ² is

[(n - 1)s²/b, (n - 1)s²/a]   (4.13)

and a 100p% confidence interval for σ is

[√((n - 1)s²/b), √((n - 1)s²/a)].   (4.14)
As usual the choice of a and b is not unique. For convenience, a and b are usually chosen such that

P(U ≤ a) = P(U > b) = (1 - p)/2   (4.15)

where U ~ χ²(n - 1). The intervals (4.13) and (4.14) are called equal-tailed confidence intervals. The choice (4.15) for a, b does not give the narrowest confidence interval. The narrowest interval must be found numerically. For large n the equal-tailed interval and the narrowest interval are nearly the same.
Note that, unlike confidence intervals for μ, the confidence interval for σ² is not symmetric about s², the estimate of σ². This happens of course because the χ²(n - 1) distribution is not a symmetric distribution.
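A Python sketch of the equal-tailed construction (4.13)-(4.15), using the summary values n = 15 and s = 0.013 quoted from Example 4.7.2 later in this section:

```python
import math
from scipy.stats import chi2

n, s = 15, 0.013                         # summary values from Example 4.7.2
p = 0.95
a = chi2.ppf((1 - p) / 2, df=n - 1)      # lower 2.5% point of chi-squared(14)
b = chi2.ppf(1 - (1 - p) / 2, df=n - 1)  # upper 2.5% point

lo = math.sqrt((n - 1) * s**2 / b)       # equal-tailed 95% CI for sigma, as in (4.14)
hi = math.sqrt((n - 1) * s**2 / a)
print(round(lo, 4), round(hi, 4))
```

Note how the interval is not symmetric about s, as discussed above.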
In some applications we are interested in an upper bound on σ (because a small σ is good in some sense). In this case we take b = ∞ and find a such that P(U ≥ a) = p, or P(U ≤ a) = 1 - p, so that a one-sided 100p% confidence interval for σ is [0, √((n - 1)s²/a)).
For Example 4.7.2, with n = 15, the pivotal quantity is U = (1/σ²) Σ_{i=1}^{15} (yᵢ - ȳ)² ~ χ²(14). The one-sided 95% confidence interval for σ computed in this way is shorter on the right than the two-sided interval, and the value 0.02 is not in the interval. Why are the intervals different? Both cover the true value of the parameter σ for 95% of all samples, so they have the same confidence coefficient. However the one-sided interval, since it allows smaller values (as small as zero) at the left end of the interval, can achieve the same coverage with a smaller right end-point. If our primary concern is with values of σ being too large, that is, with an upper bound for σ, then the one-sided interval is the one that should be used for this purpose.
Suppose Y ~ G(μ, σ) is a future observation drawn independently of the sample Y₁, ..., Yₙ used to obtain μ̃ = Ȳ. Then

Y - Ȳ ~ G(0, σ√(1 + 1/n)).

Also

T = (Y - Ȳ) / (S√(1 + 1/n)) ~ t(n - 1)

is a pivotal quantity which can be used to obtain an interval of values for Y. Let a be the value such that P(-a ≤ T ≤ a) = p, which is obtained from t tables or by using R. Since

p = P(-a ≤ T ≤ a)
  = P(-a ≤ (Y - Ȳ)/(S√(1 + 1/n)) ≤ a)
  = P(Ȳ - aS√(1 + 1/n) ≤ Y ≤ Ȳ + aS√(1 + 1/n)),

the interval

[ȳ - as√(1 + 1/n), ȳ + as√(1 + 1/n)]   (4.16)
is an interval of values for the future observation Y with confidence coefficient p. The interval (4.16) is called a 100p% prediction interval rather than a confidence interval, since Y is not a parameter but a random variable. Note that the interval (4.16) is wider than a 100p% confidence interval for the mean μ. This makes sense since μ is an unknown constant with no variability, while Y is a random variable with its own variability, Var(Y) = σ².
Example 4.7.2 Revisited: Optical glass lenses

Suppose in Example 4.7.2 a 95% prediction interval is required for a glass lens drawn at random from the population of glass lenses. Now ȳ = 25.009, s = 0.013 and for T ~ t(14) we have P(-2.1448 ≤ T ≤ 2.1448) = 0.95. Therefore a 95% prediction interval for this new lens is given by

[25.009 - 2.1448(0.013)√(1 + 1/15), 25.009 + 2.1448(0.013)√(1 + 1/15)] = [24.9802, 25.0378].
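The prediction-interval arithmetic can be checked in Python (a sketch using the summary values from the example):

```python
import math
from scipy.stats import t

n, ybar, s = 15, 25.009, 0.013
a = t.ppf(0.975, df=n - 1)            # 2.1448 for t(14)
half = a * s * math.sqrt(1 + 1 / n)   # wider than the CI for mu by the factor sqrt(1 + 1/n)
print(round(ybar - half, 4), round(ybar + half, 4))  # 24.9802 25.0378
```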
4.8

Components of electronic products often must be very reliable; that is, they must perform over long periods of time without failing. Consequently, manufacturers who supply components to a company that produces, for example, personal computers must satisfy the company that their components are reliable.
Demonstrating that a component is highly reliable is difficult because if the component is used under "normal" conditions it will usually take a very long time to fail. It is generally not feasible for a manufacturer to carry out tests on components that last for years (or even months, in most cases) and therefore they use what are called accelerated life tests. These involve placing high levels of stress on the components so that they fail in much less than the normal time. If a model relating the level of stress to the lifetime of the component is known, then such experiments can be used to estimate lifetime at normal stress levels for the population from which the experimental units are taken.
We consider below some life test experiments on power supplies for personal computers, with ambient temperature being the stress factor. As the temperature increases, the lifetimes of components tend to decrease, and at a temperature of around 70°C the average lifetimes tend to be of the order of 100 hours. The normal usage temperature is around 20°C. The data in Table 4.2 show the lifetimes (i.e. times to failure) yᵢ of components tested at each of 40°, 50°, 60° and 70°C. The experiment was terminated after 600 hours, and at temperatures 40°, 50° and 60° some of the 25 components being tested had still not failed. Such observations are called censored observations: we only know in each case that the lifetime in question was over 600 hours. In Table 4.2 the asterisks denote the censored observations. Note the data have been organized so that the lifetimes are listed first, followed by the censored times.
It is known from past experience that, at each temperature level, lifetimes are approximately Exponentially distributed; let us therefore suppose that at temperature t (t = 40, 50, 60, 70), component lifetimes Y have an Exponential distribution with probability density function

f(y; θₜ) = (1/θₜ) e^(-y/θₜ)   for y ≥ 0

where E(Y) = θₜ.

³⁰May be omitted.
Table 4.2: Lifetimes (in hours) from an accelerated life test experiment on PC power supplies, by temperature

70°C: 2, 5, 9, 10, 10, 11, 64, 66, 69, 70, 71, 73, 75, 77, 97, 103, 115, 130, 131, 134, 145, 181, 242, 263, 283
60°C: 1, 20, 40, 47, 56, 58, 63, 88, 92, 103, 108, 125, 155, 177, 209, 224, 295, 298, 352, 392, 441, 489, 600*, 600*, 600*
50°C: 55, 139, 206, 263, 347, 402, 410, 563, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*
40°C: 78, 211, 297, 556, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*, 600*

Notes: Lifetimes are given in ascending order; asterisks (*) denote censored observations.
For the censored observations we only know that the lifetime is greater than 600. Since

P(Y > 600; θ) = ∫_{600}^∞ (1/θ) e^(-y/θ) dy = e^(-600/θ),

the likelihood function for θ based on the 60°C data (22 observed lifetimes y₁, ..., y₂₂ and 3 censored times) is

L(θ) = [Π_{i=1}^{22} (1/θ) e^(-yᵢ/θ)] [Π_{i=23}^{25} e^(-600/θ)] = θ^(-22) e^(-s/θ)

where s = Σ_{i=1}^{25} yᵢ is the total of the lifetimes and censored times.
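Maximizing L(θ) = θ^(-22) e^(-s/θ) in θ gives θ̂ = s/22 (set the derivative of the log likelihood to zero). A Python sketch for the 60°C column of Table 4.2, with the three censored times entered as 600:

```python
lifetimes_60 = [1, 20, 40, 47, 56, 58, 63, 88, 92, 103, 108, 125, 155, 177,
                209, 224, 295, 298, 352, 392, 441, 489]   # 22 observed failures
censored_60 = [600, 600, 600]                             # 3 times censored at 600 hours

s = sum(lifetimes_60) + sum(censored_60)   # total time on test
k = len(lifetimes_60)                      # number of uncensored lifetimes
theta_hat = s / k                          # MLE maximizing theta^(-k) * exp(-s/theta)
print(round(theta_hat, 1))
```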
Question 2 Assuming that the Exponential model is correct, the likelihood function for θₜ, t = 40, 50, 60, 70, can be obtained using the method above and is given by

L(θₜ) = θₜ^(-kₜ) e^(-sₜ/θₜ)

where kₜ is the number of uncensored lifetimes and sₜ is the total of the lifetimes and censored times at temperature t. Find the maximum likelihood estimate θ̂ₜ for each temperature.

Suppose that the mean lifetime is related to temperature by the model

θₜ = exp(α + β/(t + 273.2))   (4.17)

where t is the temperature in degrees Celsius and α and β are parameters. Plot the points (log θ̂ₜ, (t + 273.2)⁻¹) for t = 40, 50, 60, 70. If the model is correct, why should these points lie roughly along a straight line? Do they?

Using the graph, give rough point estimates of α and β. Extrapolate the line or use your estimates of α and β to estimate θ₂₀, the mean lifetime at t = 20°C, which is the normal operating temperature.
Question 5 Question 4 indicates how to obtain a rough point estimate of

θ₂₀ = exp(α + β/(20 + 273.2)).

The likelihood function for α and β based on all of the data is

L(α, β) = Π_{t=40,50,60,70} θₜ^(-kₜ) e^(-sₜ/θₜ)

where θₜ is given by (4.17). (Note that the product is only over t = 40, 50, 60, 70.) Outline how you might attempt to get an interval estimate for θ₂₀ based on the likelihood function for α and β. If you obtained an interval estimate for θ₂₀, would you have any concerns about indicating to the engineers what mean lifetime could be expected at 20°C? (Explain.)
Question 6 Engineers and statisticians have to design reliability tests like the one just discussed, and considerations such as the following are often used.

Suppose that the mean lifetime at 20°C is supposed to be about 90,000 hours and that at 70°C you know from past experience that it is about 100 hours. If the model (4.17) holds, determine approximately what α and β should be, and thus what θₜ is roughly equal to at 40°, 50° and 60°C. How might you use this information in deciding how long a period of time to run the life test? In particular, give the approximate expected number of uncensored lifetimes from an experiment that was terminated after 600 hours.
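The calculation in Question 6 amounts to solving the two equations log θ₂₀ = α + β/293.2 and log θ₇₀ = α + β/343.2. A Python sketch of one way to do it; the anchor values 90,000 and 100 hours are the ones given in the question:

```python
import math

# model (4.17): theta_t = exp(alpha + beta / (t + 273.2))
t1, theta1 = 20, 90_000   # supposed mean lifetime at normal use
t2, theta2 = 70, 100      # mean lifetime known from past experience

# two linear equations in alpha and beta
x1, x2 = 1 / (t1 + 273.2), 1 / (t2 + 273.2)
beta = (math.log(theta1) - math.log(theta2)) / (x1 - x2)
alpha = math.log(theta2) - beta * x2

for temp in (40, 50, 60):
    print(temp, round(math.exp(alpha + beta / (temp + 273.2))))
```

The rough mean lifetimes at 40°, 50° and 60°C can then be compared with the 600-hour termination time to anticipate how many censored observations each temperature will produce.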
4.9 Chapter 4 Problems
thetahat<-5
n<-25
theta<-seq(3.7,6.5,0.001)
Rtheta<-exp(n*thetahat*log(theta/thetahat)+n*(thetahat-theta))
plot(theta,Rtheta,type="l")
R10<-0.10+0*theta
points(theta,R10,type="l") # draws a horizontal line at 0.10
title(main="Poisson Likelihood for ybar=5 and n=25")
Modify this code for larger sample sizes n = 100; n = 400 and observe what happens
to the width of the 10% likelihood interval.
P(-0.03 ≤ P̃ - p ≤ 0.03), if

(b) Suppose that p = 0.40. Using an approximation, determine how large n should be in order to ensure that

P(-0.03 ≤ P̃ - p ≤ 0.03) = 0.95.
(a) Give an expression for the probability that x out of n samples will be negative, if
the nk people are a random sample from the population. State any assumptions
you make.
(b) Obtain a general expression for the maximum likelihood estimate θ̂ in terms of n, k and x.

(c) Suppose n = 100, k = 10 and x = 89. Give the maximum likelihood estimate θ̂ and the relative likelihood function, and find a 10% likelihood interval for θ.

(d) Discuss (or do it) how you would select an "optimal" value of k to use for pooled testing, if your objective was not to estimate θ but to identify persons who are infected, with the smallest number of tests. Assume that you know the value of θ and that the procedure would be to test all k persons individually each time a pooled sample was positive. (Hint: Suppose a large number n of persons must be tested, and find the expected number of tests needed.)
7. Recall Problem 5 of Chapter 2.

(a) Plot the relative likelihood function R(θ) and determine a 10% likelihood interval. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R. Is θ very accurately determined?

(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood function, the maximum likelihood estimate and a 10% likelihood interval for θ in this case. Plot the relative likelihood function on the same graph as the one in (a), and compare the accuracy of estimation in the two cases.
8. The lifetime T (in days) of a particular type of light bulb is assumed to have a distribution with probability density function

f(t; θ) = (1/2) θ³ t² e^(-θt)   for t > 0, θ > 0.

(a) Suppose t₁, t₂, ..., tₙ is a random sample from this distribution. Show that the likelihood function for θ is equal to

L(θ) = c θ^(3n) exp(-θ Σ_{i=1}^n tᵢ)   for θ > 0.
(b) Find the maximum likelihood estimate θ̂ and the relative likelihood function R(θ).

(c) If n = 20 and Σ_{i=1}^{20} tᵢ = 996, graph R(θ) and determine the 15% likelihood interval for θ. (The interval can be obtained from the graph of R(θ) or by using the function uniroot in R.)
(d) Suppose we wish to estimate the mean lifetime of a light bulb. Show that E(T) = 3/θ. (Recall that ∫₀^∞ x^(n-1) e^(-x) dx = Γ(n) = (n - 1)! for n = 1, 2, ....) Find a 95% confidence interval for the mean lifetime.
(e) The probability p that a light bulb lasts less than 50 days is

p = p(θ) = P(T ≤ 50; θ) = 1 - e^(-50θ) [1250θ² + 50θ + 1].

(Can you show this?) Thus p̂ = p(θ̂) = 0.580. Find an approximate 95% confidence interval for p from the approximate 95% confidence interval for θ. In the data referred to in part (c), the number of light bulbs which lasted less than 50 days was 11 (out of 20). Using a Binomial model, we can also obtain a 95% confidence interval for p. Find this interval. What are the pros and cons of the second interval over the first one?
9. The χ² (Chi-squared) distribution:

(i) If X ~ χ²(1), find P(X ≤ ...) using tables.
(ii) If X ~ χ²(2), find P(X ≤ ...) using tables.
(iii) If X ~ χ²(40), find P(X ≤ 24.4) and P(X ≥ 55.8). Compare these values with P(X ≤ 24.4) and P(X ≥ 55.8) if X ~ N(40, 80).
(iv) If X ~ χ²(...), find the value c such that P(X ≥ c) = 0.025.
(v) If X ~ χ²(...), find P(X > 15).
10. Suppose Y ~ χ²(k) with probability density function

f(y; k) = [1/(2^(k/2) Γ(k/2))] y^((k/2)-1) e^(-y/2)   for y > 0.

(a) Show that this probability density function integrates to one for any k ∈ {1, 2, ...}.

(b) Show that the moment generating function of Y is M(t) = (1 - 2t)^(-k/2) for t < 1/2.
(c) Plot the probability density function for k = 5, k = 10 and k = 25 on the same
graph. What do you notice?
11. In an early study concerning survival time for patients diagnosed with Acquired Immune Deficiency Syndrome (AIDS), the survival times (i.e. times between diagnosis of AIDS and death) of 30 male patients were such that Σ_{i=1}^{30} yᵢ = 11,400 days. It is known that survival times were approximately Exponentially distributed with mean θ days.

(a) Write down the likelihood function for θ and obtain the likelihood ratio statistic. Use this to obtain an approximate 90% confidence interval for θ. (Note: You will need to determine this interval from a graph of the relative likelihood function or by using the function uniroot in R.)

(b) Show that m = θ ln 2 is the median survival time. Using the interval obtained in (a), give an approximate 90% confidence interval for m.
12. Suppose Y ~ Exponential(θ).

(a) Show that W = 2Y/θ has a χ²(2) distribution. (Hint: compare the probability density function of W with (4.6).)

(b) If Y₁, ..., Yₙ is a random sample from the Exponential(θ) distribution above, prove that U = (2/θ) Σ_{i=1}^n Yᵢ ~ χ²(2n). (Use the results in Section 4.5.) U is therefore a pivotal quantity, and can be used to get confidence intervals for θ.
79:08) = 0:90
14. Consider the data on weights of adult males and females from Chapter 1. (The data
are posted on the course webpage.)
(a) Determine whether it is reasonable to assume a Normal model for the female heights and a different Normal model for the male heights.

(b) Obtain a 95% confidence interval for the mean for the females and males separately. Does there appear to be a difference in the means for females and males? (We will see how to test this formally in Chapter 6.)

(c) Obtain a 95% confidence interval for the standard deviation for the females and males separately. Does there appear to be a difference in the standard deviations?
15. Company A leased photocopiers to the federal government, but at the end of their recent contract the government declined to renew the arrangement and decided to lease from a new vendor, Company B. One of the main reasons for this decision was a perception that the reliability of Company A's machines was poor.
(a) Over the preceding year the monthly numbers of failures requiring a service call from Company A were

16 22 14 28 25 19 19 15 23 18 12 29

Assuming that the number of service calls needed in a one-month period has a Poisson distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on the data above.
(b) In the first year using Company B's photocopiers, the monthly numbers of service calls were

13 7 12 9 15 17 10 13 8 10 12 14

Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the same graph as used in (a). Do you think the government's decision was a good one, as far as the reliability of the machines is concerned?
(c) Use the likelihood ratio statistic Λ(θ) as an approximate pivotal quantity to obtain an approximate 95% confidence interval for θ for each company. (Note: the interval can be obtained from the graph of the relative likelihood function or by using the function uniroot in R.)

(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to (c) valid? What approximations are involved?
16. At the R.A.T. laboratory a large number of genetically engineered rats are raised for conducting research. Twelve rats are selected at random and fed a special diet. The weight gains (in grams) from birth to age 3 months of the rats fed this diet are:

55.3 54.8 65.9 60.7 59.4 62.0 62.1 58.7 64.5 62.3 67.6 61.1

For these data, Σ_{i=1}^{12} yᵢ = 734.4 and Σ_{i=1}^{12} (yᵢ - ȳ)² = 162.12. The model Yᵢ ~ G(μ, σ), i = 1, ..., 12 independently, is assumed.

(a) Explain what μ and σ represent.

(b) Give the maximum likelihood estimates of μ and σ. (You do not need to derive them.)

(c) Let

S² = (1/11) Σ_{i=1}^{12} (Yᵢ - Ȳ)²,   T = (Ȳ - μ)/(S/√12)   and   W = (1/σ²) Σ_{i=1}^{12} (Yᵢ - Ȳ)²,

and state the distributions of T and W.

(d) Find a such that P(-a ≤ T ≤ a) = 0.95 and show clearly how this can be used to construct a 95% confidence interval for μ. Construct a 95% confidence interval for μ for the given data.

(e) Find b and c such that P(W ≤ b) = 0.05 = P(W ≥ c). Show clearly how this can be used to construct a 90% confidence interval for σ². Construct a 90% confidence interval for σ² for the given data.

(f) Check the fit of the model using a qqplot.
17. Sixteen packages are randomly selected from the production of a detergent packaging machine. Their weights (in grams) are as follows:

287 293 295 295 297 298 299 300 300 302 302 303 306 307 308 311

(a) Assuming that the weights are independent G(μ, σ) random variables, obtain 95% confidence intervals for μ and σ.
(b) Let Y represent the weight of a future, independent, randomly selected package. Since

Y - Ȳ ~ G(0, σ√(1 + 1/n))   and   (n - 1)S²/σ² ~ χ²(n - 1),

we have

(Y - Ȳ) / (S√(1 + 1/n)) ~ t(n - 1).

Use this pivotal quantity and the given data to obtain a 95% prediction interval for Y.
18. Radon is a colourless, odourless gas that is naturally released by rocks and soils and may concentrate in highly insulated houses. Because radon is slightly radioactive, there is some concern that it may be a health hazard. Radon detectors are sold to homeowners worried about this risk, but the detectors may be inaccurate. University researchers placed 12 detectors in a chamber where they were exposed to 105 picocuries per liter of radon over 3 days. The readings given by the detectors were:

91.9 97.8 111.4 122.3 105.4 95.0 103.8 99.6 96.6 119.3 104.8 101.7

Let yᵢ = reading for the i-th detector, i = 1, ..., 12. For these data Σ_{i=1}^{12} yᵢ = 1249.6 and Σ_{i=1}^{12} (yᵢ - ȳ)² = 971.43. The model Yᵢ ~ G(μ, σ), i = 1, ..., 12 independently, is assumed, where μ and σ are unknown parameters.
20. A manufacturing process produces fibers of varying lengths. The length of a fiber Y is a continuous random variable with p.d.f.

f(y; θ) = (y/θ²) e^(-y/θ),   y ≥ 0, θ > 0,

where θ is an unknown parameter.

(a) Let y₁, y₂, ..., yₙ be the lengths of n fibers selected at random. Find the maximum likelihood estimate of θ based on these data. Be sure to show all your work.

(b) Suppose Y₁, Y₂, ..., Yₙ are independent and identically distributed random variables with p.d.f. f(y; θ) given above. If E(Yᵢ) = 2θ and Var(Yᵢ) = 2θ², then find E(Ȳ) and Var(Ȳ).

(c) Justify the statement

P(-1.96 ≤ (Ȳ - 2θ)/(θ√(2/n)) ≤ 1.96) ≈ 0.95.

(d) Explain how you would use the statement in (c) to construct an approximate 95% confidence interval for θ.
(e) Suppose n = 18 fibers were selected at random and the lengths were:

6.19 1.41 7.92 10.76 1.23 3.69 8.13 1.34 4.29 6.80 1.04 4.21 3.67 3.44 9.87 2.51 10.34 2.08

For these data, Σ_{i=1}^{18} yᵢ = ...
... w₁ = n/σ₁² and w₂ = m/σ₂².
(b) Suppose that σ₁ = 1, σ₂ = 0.5 and n = m = 10. How would you rationalize to a non-statistician why you were using the estimate (x̄ + 4ȳ)/5 instead of (x̄ + ȳ)/2?

(c) Determine the standard deviation of μ̃ and of (X̄ + Ȳ)/2 under the conditions of part (b). Why is μ̃ a better estimator?
22. Student's t distribution: Suppose that Z and U are independent variates with Z ~ N(0, 1) and U ~ χ²(k), and let

X = Z / √(U/k).

Its distribution is called the t (Student's) distribution with k degrees of freedom, and we write X ~ t(k). It can be shown by change of variables that X has probability density function

f(x; k) = [Γ((k+1)/2) / (√(kπ) Γ(k/2))] (1 + x²/k)^(-(k+1)/2).

The probability density function is symmetric about the origin, and is similar in shape to the probability density function of a N(0, 1) random variable but has more probability in the tails.
(a) Plot the probability density function for k = 1 and k = 5.

(c) Show that f(x; k) is unimodal for each k.

(d) Show that lim_{k→∞} f(x; k) = (1/√(2π)) exp(-x²/2), the N(0, 1) probability density function.
24. Challenge Problem: A sequence of random variables {Xₙ} is said to converge in probability to the constant c, written Xₙ →ᵖ c, if for each ε > 0, lim_{n→∞} P(|Xₙ - c| ≥ ε) = 0.
(a) If {Xₙ} and {Yₙ} are two sequences of random variables with Xₙ →ᵖ c₁ and Yₙ →ᵖ c₂, show that Xₙ + Yₙ →ᵖ c₁ + c₂ and XₙYₙ →ᵖ c₁c₂.

(b) Let X₁, X₂, ... be independent and identically distributed random variables with probability density function f(x; θ). A point estimator θ̃ₙ based on a random sample X₁, ..., Xₙ is said to be consistent for θ if θ̃ₙ →ᵖ θ as n → ∞.

(i) Let X₁, ..., Xₙ be independent and identically distributed Uniform(0, θ) random variables. Show that θ̃ₙ = max(X₁, ..., Xₙ) is consistent for θ.

(ii) Let X ~ Binomial(n, θ). Show that θ̃ₙ = X/n is consistent for θ.
25. Challenge Problem: Refer to the definition of consistency in Problem 24(b). Difficulties can arise when the number of parameters increases with the amount of data. Suppose that two independent measurements of blood sugar are taken on each of n individuals, and consider the model

Xᵢ₁, Xᵢ₂ ~ N(μᵢ, σ²)   for i = 1, ..., n

where Xᵢ₁ and Xᵢ₂ are the independent measurements. The variance σ² is to be estimated, but the μᵢ's are also unknown.

(a) Find the maximum likelihood estimator σ̃² and show that it is not consistent.

(c) What does σ² represent physically if the measurements are taken very close together in time?
26. Challenge Problem: Proof of the Central Limit Theorem (special case). Suppose Y₁, Y₂, ... are independent random variables with E(Yᵢ) = μ, Var(Yᵢ) = σ² and that they have the same distribution, whose moment generating function exists. Consider

Zₙ = Σ_{i=1}^n (Yᵢ - μ) / (σ√n) = √n (Ȳ - μ)/σ

and note that its moment generating function is of the form [1 + t²/(2n) + o(1/n)]ⁿ. Show that this converges to e^(t²/2), the moment generating function of a G(0, 1) random variable, as n → ∞.
5. TESTS OF HYPOTHESES

5.1 Introduction

³³For an introduction to testing hypotheses, see the video called "A Test of Significance" at www.watstat.ca

A test of hypotheses involves a null hypothesis H₀, which is tested against an alternative hypothesis, which may not always be specified. In many cases the alternative hypothesis is simply that H₀ is not true.
We will outline the logic of tests of hypotheses in the first example, the claim that I have ESP. In an effort to prove or disprove this claim, an unbiased observer tosses a fair coin 100 times and before each toss I guess the outcome of the toss. We count Y, the number of correct guesses, which we can assume has a Binomial distribution with n = 100. The probability that I guess the outcome correctly on a given toss is an unknown parameter θ. If I have no unusual ESP capacity at all, then we would assume θ = 0.5, whereas if I have some form of ESP, either a positive attraction or an aversion to the correct answer, then we expect θ ≠ 0.5. We begin by asking the following questions in this context:

(1) Which of the two possibilities, θ = 0.5 or θ ≠ 0.5, should be the null hypothesis?
(2) What observed values of Y are highly inconsistent with H₀ and what observed values of Y are compatible with H₀?

(3) What observed values of Y would lead us to conclude that the data provide no evidence against H₀, and what observed values of Y would lead us to conclude that the data provide strong evidence against H₀?

In answer to question (1), hopefully you observed that the two hypotheses ESP and NO ESP are not equally credible and decided that the null hypothesis should be H₀: θ = 0.5, or H₀: "I do not have ESP".
To answer question (2), we note that observed values of Y that are very small (e.g. 0-10) or very large (e.g. 90-100) would clearly lead us to believe that H₀ is false, whereas values near 50 are perfectly consistent with H₀. This leads naturally to the concept of a test statistic (also called a discrepancy measure), which is some function of the data D = g(Y) that is constructed to measure the degree of agreement between the data Y and the hypothesis H₀. It is conventional to define D so that D = 0 represents the best possible agreement between the data and H₀, and so that the larger D is, the poorer the agreement. Methods of constructing test statistics will be described later, but in this example it seems natural to use D(Y) = |Y - 50|.
Question (3) could be resolved easily if we could specify a threshold value for $D$, or equivalently some function of $D$. In the given example, the observed value of $Y$ was $y = 52$, and so the observed value of $D$ is $d = |52 - 50| = 2$. One might ask how unusual such a value is when $H_0$ is true; that is, what is the probability, assuming $H_0$ is true, that the discrepancy measure is greater than or equal to $d$? In other words, we want to determine $P(D \geq d; H_0)$, where the notation $;H_0$ means "assuming that $H_0$ is true". We can compute this easily in our given example:
$$
\begin{aligned}
P(D \geq d; H_0) &= P(|Y - 50| \geq |52 - 50|; H_0)\\
&= P(|Y - 50| \geq 2) \quad\text{where } Y \sim \text{Binomial}(100, 0.5)\\
&= 1 - P(49 \leq Y \leq 51)\\
&= 1 - \binom{100}{49}(0.5)^{100} - \binom{100}{50}(0.5)^{100} - \binom{100}{51}(0.5)^{100}\\
&\approx 0.76.
\end{aligned}
$$
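This tail probability is just a short Binomial sum, and can be checked directly. A minimal sketch in Python (rather than the R used in this course), using only the standard library:

```python
from math import comb

# P(|Y - 50| >= 2) for Y ~ Binomial(100, 0.5):
# the complement of P(49 <= Y <= 51)
n = 100
p_value = 1 - sum(comb(n, y) for y in (49, 50, 51)) * 0.5 ** n
print(round(p_value, 2))  # -> 0.76
```

In R the same value is given by `1 - sum(dbinom(49:51, 100, 0.5))`.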
How can we interpret this value in terms of the test of $H_0$? Roughly 76% of claimants similarly tested for ESP who have no abilities at all, but simply guess randomly, will perform as well as or better than I did (that is, will produce at least as large a value of $D$ as the observed value of 2). This does not prove I do not have ESP, but it does indicate that we have failed to find any evidence in these data to support rejecting $H_0$. There is no evidence against $H_0$ in the observed value $d = 2$, and this was indicated by the high probability that, when $H_0$ is true, we obtain at least this much measured disagreement with $H_0$. This probability, 0.76 in this example, is called the p-value or observed significance level of the test.
We now proceed to a more formal treatment of hypothesis tests. Two types of hypotheses that a statistician might use are:
(1) the hypothesis $H_0: \theta = \theta_0$, where it is assumed that the data $\mathbf{Y}$ have arisen from a family of distributions with probability (density) function $f(y; \theta)$ with parameter $\theta$;
(2) the hypothesis $H_0: \mathbf{Y} \sim f_0(\mathbf{y})$, where it is assumed that the data $\mathbf{Y}$ have a specified probability (density) function $f_0(\mathbf{y})$.
The ESP example is an example of the first type. An example of the second type is assuming that a given data set is a random sample from an Exponential(1) distribution. Hypotheses of the second type are not appropriate unless we have very good reasons, practical or theoretical, to support them.
A statistical test of hypothesis proceeds as follows. First, assume that the hypothesis $H_0$ will be tested using some random data $\mathbf{Y}$. We then adopt a test statistic or discrepancy measure $D(\mathbf{Y})$ for which, normally, large values of $D$ are less consistent with $H_0$. Let $d = D(\mathbf{y})$ be the corresponding observed value of $D$. We then calculate
$$p\text{-value} = P(D \geq d; H_0).$$
If the p-value is close to zero then we are inclined to doubt that $H_0$ is true, because if it is true the probability of getting agreement as poor as or worse than that observed is small. This makes the alternative explanation, $H_0$ is false, more appealing. In other words, if the p-value is small then there are two possible explanations:
(a) $H_0$ is true but by chance we have observed data $\mathbf{Y}$ that indicate poor agreement with $H_0$, or
(b) $H_0$ is false.
The p-value indicates how small the chance in (a) is. If it is large, there is no evidence for (b). If it is less than about 0.05, we usually interpret this as providing evidence against $H_0$ in light of the observed data. If it is very small, for example 0.001, this is taken as very strong evidence against $H_0$ in light of the observed data.
The following table gives a rough guideline for interpreting p-values. These are only guidelines for this course. The interpretation of p-values must always be made in the context of a given study.

p-value                    Interpretation
p > 0.10                   No evidence against $H_0$ based on the observed data.
0.05 < p ≤ 0.10            Weak evidence against $H_0$ based on the observed data.
0.01 < p ≤ 0.05            Evidence against $H_0$ based on the observed data.
0.001 < p ≤ 0.01           Strong evidence against $H_0$ based on the observed data.
p ≤ 0.001                  Very strong evidence against $H_0$ based on the observed data.
Suppose now that the ESP experiment is run with $n = 200$ tosses and $y = 110$ correct guesses (the data of Example 5.1.1), so that the observed value of $D = |Y - 100|$ is $d = 10$. Then
$$p\text{-value} = P(|Y - 100| \geq 10) \quad\text{where } Y \sim \text{Binomial}(200, 0.5),$$
which can be calculated using R or using the Normal approximation to the Binomial since $n = 200$ is large. Using the Normal approximation (without a continuity correction since it is not essential to have an exact value) we obtain
$$
\begin{aligned}
p\text{-value} &= P(|Y - 100| \geq 10) \quad\text{where } Y \sim \text{Binomial}(200, 0.5)\\
&= P\left(\frac{|Y - 100|}{\sqrt{200(0.5)(0.5)}} \geq \frac{10}{\sqrt{200(0.5)(0.5)}}\right)\\
&\approx P(|Z| \geq 1.41)\\
&= 2[1 - P(Z \leq 1.41)] = 2(1 - 0.92073) = 0.15854.
\end{aligned}
$$
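Both the exact Binomial calculation and the Normal approximation can be checked directly; a sketch in Python (the course itself uses R), with the Normal c.d.f. built from the standard library's error function:

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, d = 200, 10
# exact: P(|Y - 100| >= 10) = 1 - P(91 <= Y <= 109), Y ~ Binomial(200, 0.5)
p_exact = 1 - sum(comb(n, y) for y in range(91, 110)) * 0.5 ** n
# Normal approximation without continuity correction
p_approx = 2 * (1 - Phi(d / sqrt(n * 0.5 * 0.5)))
```

The exact value is a little larger (about 0.18); a continuity correction would bring the approximation closer, but as noted above an exact value is not essential here.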
Suppose in Example 5.1.2 a die is tossed $n = 180$ times to test $H_0: \theta = 1/6$, where $\theta$ is the probability of a one, and $y = 44$ ones are observed, so that the observed value of the one-sided discrepancy measure $D = \max[(Y - 30), 0]$ is $d = 44 - 30 = 14$. The p-value is
$$
\begin{aligned}
p\text{-value} &= P(D \geq 14; H_0)\\
&= P(Y \geq 44) \quad\text{where } Y \sim \text{Binomial}(180, 1/6)\\
&= \sum_{y=44}^{180} \binom{180}{y}\left(\frac{1}{6}\right)^{y}\left(\frac{5}{6}\right)^{180-y}\\
&= 0.005,
\end{aligned}
$$
which provides strong evidence against $H_0$ and suggests that $\theta$ is bigger than $1/6$. This is an example of a one-sided test, which is described in more detail below.
Example 5.1.2 Revisited
Suppose that in the experiment in Example 5.1.2 we observed $y = 35$ ones in $n = 180$ tosses. The p-value (calculated using R) is now
$$
p\text{-value} = P(Y \geq 35; \theta = 1/6) = \sum_{y=35}^{180}\binom{180}{y}\left(\frac{1}{6}\right)^{y}\left(\frac{5}{6}\right)^{180-y} = 0.18,
$$
and this probability is not especially small. Indeed almost one die in five, though fair, would show this level of discrepancy with $H_0$. We conclude that there is no evidence against $H_0$ in light of the observed data.
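Both one-sided p-values above are plain Binomial tail sums; a sketch in Python (the text's own calculations were done in R):

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(Y >= k) for Y ~ Binomial(n, p), summed directly."""
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(k, n + 1))

p44 = binom_upper_tail(44, 180, 1 / 6)  # y = 44 ones: about 0.005
p35 = binom_upper_tail(35, 180, 1 / 6)  # y = 35 ones: about 0.18
```

The R equivalent of `binom_upper_tail(44, 180, 1/6)` is `1 - pbinom(43, 180, 1/6)`.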
Note that we do not claim that $H_0$ is true, only that there is no evidence, in light of the data, that it is not true. Similarly in the legal example, if we do not find evidence against $H_0$: the defendant is innocent, this does not mean we have proven he or she is innocent, only that, for the given data, the amount of evidence against $H_0$ was insufficient to conclude otherwise.
The approach to testing a hypothesis described above is very general and straightforward, but a few points should be stressed:
1. If the p-value is very small then, as indicated in the table, there is strong evidence against $H_0$ in light of the observed data; this is often termed statistically significant evidence against $H_0$. While we believe that statistical evidence is best
measured by interpreting p-values as in the above table, it is common in some of the literature to adopt a threshold p-value such as 0.05 and to reject $H_0$ whenever the p-value is below this threshold. This may be necessary when there are only two options for your decision. For example, in a trial a person is either convicted or acquitted of a crime.
2. If the p-value is not small, we do not conclude that $H_0$ is true. We simply say there is no evidence against $H_0$ in light of the observed data. The reason for this hedging is that in most settings a hypothesis may never be strictly true. (For example, one might argue when testing $H_0: \theta = 1/6$ in Example 5.1.2 that no real die ever has a probability of exactly $1/6$ for side 1.) Hypotheses can be disproved (with a small degree of possible error) but not proved. Again, if we are limited to two possible decisions and fail to reject $H_0$ in the language above, you may say that $H_0$ is accepted when the p-value is larger than the predetermined threshold. This does not mean that we have determined that $H_0$ is true, but that there is insufficient evidence on hand to reject it.³⁵
3. Just because there is strong evidence (highly statistically significant evidence) against a hypothesis $H_0$, there is no implication about how wrong $H_0$ is. In practice, we supplement a hypothesis test with an interval estimate that indicates the magnitude of the departure from $H_0$. This is how we check whether a result is scientifically significant as well as statistically significant.
4. So far we have not refined the conclusion when we do find strong evidence against the null hypothesis. Often we have in mind an alternative hypothesis. For example, if the standard treatment for pain provides relief in about 50% of cases, and we test $H_0: P(\text{relief}) = 0.5$ for patients medicated with an alternative drug, we will obviously wish to know, if we find strong evidence against $H_0$, in what direction that evidence lies. If the probability of relief is greater than 0.5 we might consider further tests or adopting the drug, but if it is less, then the drug will be abandoned for this purpose. We will try to adapt to this type of problem with our choice of discrepancy measure $D$.
5. It is important to keep in mind that although we might be able to find evidence against a given hypothesis, this does not mean that the differences found are of practical significance. For example, a patient person willing to toss a particular coin one million times can almost certainly find evidence against $H_0: P(\text{heads}) = 0.5$. This does not mean that in a game involving a few dozen or a few hundred tosses $H_0$ is not a tenable and useful approximation. Similarly, if we collect large amounts of financial data, it is quite easy to find evidence against the hypothesis that stock or stock index returns are Normally distributed. Nevertheless, for small amounts of data
and for the pricing of options, such an assumption is usually made and considered useful.

³⁵ If the untimely demise of all of the prosecution witnesses at your trial leads to your acquittal, does this prove your innocence?
A drawback with the approach to testing described so far is that we do not have a general method for choosing the test statistic or discrepancy measure $D$. Often there are intuitively obvious test statistics that can be used; this was the case in the examples in this section. In Section 5.3 we will see how to use the likelihood function to construct a test statistic in more complicated situations where it is not always easy to come up with an intuitive test statistic.
A final point is that once we have specified a test statistic $D$, we need to be able to compute the p-value for the observed data. Calculating probabilities involving $D$ brings us back to distribution theory. In most cases the exact p-value is difficult to determine mathematically, and we must use either an approximation or computer simulation. Fortunately, for the tests considered in Section 5.3 we can use approximations based on $\chi^2$ distributions.
For the Gaussian model with unknown mean and standard deviation we use test statistics based on the pivotal quantities that were used in Chapter 4 for constructing confidence intervals.
5.2 Tests of Hypotheses for Parameters in the G(μ, σ) Model

Suppose that $Y \sim G(\mu, \sigma)$ models a variate $y$ in some population or process. A random sample $Y_1, \ldots, Y_n$ is selected, and we want to test hypotheses concerning one of the two parameters $(\mu, \sigma)$. The maximum likelihood estimators of $\mu$ and $\sigma^2$ are
$$\tilde{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \quad\text{and}\quad \tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
Recall from Chapter 4 that, with $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$,
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t(n-1).$$
We use this pivotal quantity to construct a test of hypothesis for the parameter $\mu$ when the standard deviation is unknown.
Suppose the hypothesis of interest is $H_0: \mu = \mu_0$ and the alternative is $\mu \neq \mu_0$.³⁶ A natural two-sided test statistic is
$$D = \frac{|\bar{Y} - \mu_0|}{S/\sqrt{n}}, \tag{5.1}$$
since large values of $D$ indicate disagreement between the data and $H_0$ in either direction. Let
$$d = \frac{|\bar{y} - \mu_0|}{s/\sqrt{n}} \tag{5.2}$$
be the value of $D$ observed in a sample with mean $\bar{y}$ and standard deviation $s$. Then
$$
p\text{-value} = P(D \geq d; H_0 \text{ is true}) = P(|T| \geq d) = 1 - P(-d \leq T \leq d) = 2[1 - P(T \leq d)] \quad\text{where } T \sim t(n-1). \tag{5.3}
$$
If instead the alternative of interest is one-sided, say $\mu > \mu_0$, we use the test statistic
$$D = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$
so that large values of $D$ provide evidence against $H_0$ in the direction of the alternative $\mu > \mu_0$. Under $H_0: \mu = \mu_0$ the test statistic $D$ has a $t(n-1)$ distribution. Let the observed value be
$$d = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}.$$
Then
$$
p\text{-value} = P(D \geq d; H_0 \text{ is true}) = P(T \geq d) = 1 - P(T \leq d) \quad\text{where } T \sim t(n-1).
$$
³⁶ Often when we test a hypothesis we have in mind an alternative, i.e., what if the hypothesis $H_0$ is false. In this case the alternative is $\mu \neq \mu_0$.

In Example 5.1.2, the hypothesis of interest was $H_0: \theta = 1/6$, where $\theta$ was the probability that the upturned face was a one. If the alternative of interest is that $\theta$ is not equal to $1/6$, then the alternative hypothesis is $H_A: \theta \neq 1/6$ and the test statistic $D = |Y - n/6|$ is a good choice. If the alternative of interest is that $\theta$ is bigger than $1/6$, then the alternative hypothesis is $H_A: \theta > 1/6$ and the test statistic $D = \max[(Y - n/6), 0]$ is a better choice.
Example 5.2.1 Testing for bias in a measurement system
Two cheap scales A and B for measuring weight are tested by taking 10 weighings of a one kg weight on each of the scales. The measurements on A and B are

A: 1.026  0.998  1.017  1.045  0.978  1.004  1.018  0.965  1.010  1.000
B: 1.011  0.966  0.965  0.999  0.988  0.987  0.956  0.969  0.980  0.988
For each scale we test $H_0: \mu = 1$, where $\mu$ is the mean of the distribution of measurements, using the test statistic (5.1). The p-value for A is
$$
p\text{-value} = P(D \geq 0.839; \mu = 1) = P(|T| \geq 0.839) = 2[1 - P(T \leq 0.839)] = 2[1 - 0.7884] \approx 0.42 \quad\text{where } T \sim t(9),
$$
and thus there is no evidence of bias (that is, there is no evidence against $H_0: \mu = 1$) for scale A based on the observed data.
For scale B, however, we obtain
$$
p\text{-value} = P(D \geq 3.534; \mu = 1) = P(|T| \geq 3.534) = 2[1 - P(T \leq 3.534)] = 0.0064 \quad\text{where } T \sim t(9),
$$
and thus there is very strong evidence against $H_0: \mu = 1$. The observed data suggest strongly that scale B is biased.
Finally, note that just because there is strong evidence against $H_0$ for scale B, the degree of bias in its measurements is not necessarily large enough to be of practical concern. In fact, we can get a 95% confidence interval for $\mu$ for scale B by using the pivotal quantity
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{10}} \sim t(9).$$
For $T \sim t(9)$ we have $P(T \leq 2.2622) = 0.975$, and a 95% confidence interval for $\mu$ is given by $\bar{y} \pm 2.2622\,s/\sqrt{10} = 0.981 \pm 0.012$, or $[0.969, 0.993]$. Evidently scale B consistently understates the weight, but the bias in measuring the 1 kg weight is likely fairly small (about 1%-3%).
Remark: The function t.test in R will give confidence intervals and test hypotheses about $\mu$; for a data set y use t.test(y).
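The observed test statistics in Example 5.2.1 can be reproduced from the data. A sketch in Python (the pairing of the twenty measurements into the A and B samples is assumed from the table above; the p-values themselves still require $t(9)$ probabilities, e.g. from R's pt or t.test):

```python
from math import sqrt

A = [1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000]
B = [1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988]

def t_stat(y, mu0):
    """Observed two-sided statistic d = |ybar - mu0| / (s / sqrt(n))."""
    n = len(y)
    ybar = sum(y) / n
    s = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return abs(ybar - mu0) / (s / sqrt(n))

tA = t_stat(A, 1.0)  # about 0.84
tB = t_stat(B, 1.0)  # about 3.5
```

In R, `t.test(B, mu = 1)` performs the whole test for scale B in one call.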
There is a close connection between this test and the confidence intervals of Chapter 4. For the two-sided test of $H_0: \mu = \mu_0$ based on (5.1),
$$
\begin{aligned}
p\text{-value} \geq 0.05
&\;\text{ if and only if }\; P\left(|T| \geq \frac{|\bar{y} - \mu_0|}{s/\sqrt{n}};\, H_0: \mu = \mu_0 \text{ is true}\right) \geq 0.05 \text{ where } T \sim t(n-1)\\
&\;\text{ if and only if }\; P\left(|T| \leq \frac{|\bar{y} - \mu_0|}{s/\sqrt{n}}\right) \leq 0.95\\
&\;\text{ if and only if }\; \frac{|\bar{y} - \mu_0|}{s/\sqrt{n}} \leq a \;\text{ where } P(|T| \leq a) = 0.95\\
&\;\text{ if and only if }\; \mu_0 \in \left[\bar{y} - as/\sqrt{n},\; \bar{y} + as/\sqrt{n}\right],
\end{aligned}
$$
which is a 95% confidence interval for $\mu$. In other words, the p-value for testing $H_0: \mu = \mu_0$ is greater than or equal to 0.05 if and only if the value $\mu = \mu_0$ is inside a 95% confidence interval for $\mu$ (assuming we use the same pivotal quantity).
More generally, suppose we have data $\mathbf{y}$, a model $f(\mathbf{y}; \theta)$, and we use the same pivotal quantity to construct a confidence interval for $\theta$ and a test of the hypothesis $H_0: \theta = \theta_0$. Then the parameter value $\theta = \theta_0$ is inside a $100q\%$ confidence interval for $\theta$ if and only if the p-value for testing $H_0: \theta = \theta_0$ is greater than $1 - q$.
Example 5.2.1 Revisited
For the weigh scale example, a 95% confidence interval for the mean $\mu$ for the second scale was $[0.969, 0.993]$. Since $\mu = 1$ is not in this interval we know that the p-value for testing $H_0: \mu = 1$ must be less than 0.05. (In fact we showed the p-value equals 0.0064, which is indeed less than 0.05.)
In Chapter 4 we used the pivotal quantity
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1), \quad\text{where } S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2,$$
to construct confidence intervals for the parameter $\sigma$. We may also wish to test a hypothesis such as $H_0: \sigma = \sigma_0$. One approach is to use a likelihood ratio test statistic, which is described in the next section. Alternatively we could use the test statistic
$$U = \frac{(n-1)S^2}{\sigma_0^2}$$
for testing $H_0: \sigma = \sigma_0$. Both large values and small values of $U$ provide evidence against $H_0$. (Why is this?) Now $U$ has a $\chi^2(n-1)$ distribution when $H_0$ is true, and the Chi-squared distribution is not symmetric, which makes the determination of "large" and "small" values somewhat problematic. The following simpler calculation approximates the p-value:
1. Let $u = (n-1)s^2/\sigma_0^2$ be the observed value of $U$, where $U \sim \chi^2(n-1)$.
2. If $P(U \leq u) < \frac{1}{2}$, compute the p-value as $2P(U \leq u)$; if $P(U \leq u) \geq \frac{1}{2}$, compute the p-value as $2P(U \geq u)$.
Figure 5.1 shows a picture for a large observed value of $u$. In this case $P(U \leq u) > \frac{1}{2}$ and the p-value $= 2P(U \geq u)$.
Example 5.2.2
For the manufacturing process in Example 4.7.2, test the hypothesis $H_0: \sigma = 0.008$ (0.008 is the desired or target value the manufacturer would like to achieve). Note that since the value $\sigma = 0.008$ is outside the two-sided 95% confidence interval for $\sigma$ in Example 4.5.2, the p-value for a test of $H_0$ based on the test statistic $U = (n-1)S^2/\sigma_0^2$ will be less than 0.05. To find the p-value, we follow the procedure above:
1. $u = (n-1)s^2/\sigma_0^2 = 36.67$, where $U \sim \chi^2(14)$.

[Figure 5.1: probability density function of $U$, with the areas $P(U < u)$ and $P(U > u)$ marked for a large observed value $u$.]

2. Since $P(U \leq 36.67) > \frac{1}{2}$, the p-value is
$$p\text{-value} = 2P(U \geq u) = 2P(U \geq 36.67) = 0.0017 \quad\text{where } U \sim \chi^2(14).$$
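For even degrees of freedom the $\chi^2$ upper tail has a closed form, so this p-value can be checked without statistical software. A sketch in Python (in R it is simply 2 * (1 - pchisq(36.67, 14))):

```python
from math import exp

def chi2_sf_even(u, k):
    """P(U >= u) for U ~ chi-squared(k), k even:
    exp(-u/2) * sum_{j=0}^{k/2 - 1} (u/2)^j / j!."""
    term, total = 1.0, 1.0
    for j in range(1, k // 2):
        term *= (u / 2) / j
        total += term
    return exp(-u / 2) * total

p_value = 2 * chi2_sf_even(36.67, 14)  # about 0.0017
```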
5.3 Likelihood Ratio Tests of Hypotheses - One Parameter

Recall that values of $\theta$ for which the likelihood ratio $L(\theta)/L(\hat{\theta})$ is small are implausible. This suggests using as a test statistic for $H_0: \theta = \theta_0$ the likelihood ratio test statistic, denoted $\Lambda(\theta_0)$,³⁷
$$\Lambda(\theta_0) = -2\log\left[\frac{L(\theta_0)}{L(\tilde{\theta})}\right] = 2\ell(\tilde{\theta}) - 2\ell(\theta_0), \tag{5.5}$$
with observed value
$$\lambda(\theta_0) = -2\log\left[\frac{L(\theta_0)}{L(\hat{\theta})}\right] = 2\ell(\hat{\theta}) - 2\ell(\theta_0).$$

³⁷ Recall that $L(\theta) = L(\theta; \mathbf{y})$ is a function of the observed data $\mathbf{y}$, and therefore replacing $\mathbf{y}$ by the corresponding random variable $\mathbf{Y}$ means that $L(\theta; \mathbf{Y})$ is a random variable. Therefore $L(\theta_0; \mathbf{Y})/L(\tilde{\theta}; \mathbf{Y})$ is a random variable: it is a function of $\mathbf{Y}$ in several places, including $\tilde{\theta} = g(\mathbf{Y})$.
[Figure 5.2: the relative likelihood function $R(\theta)$ (top) and $-2\log R(\theta)$ (bottom). Values of $\theta$ where $R(\theta)$ is large (equivalently, where $-2\log R(\theta)$ is small) are more plausible; values in the tails are less plausible.]

If $H_0: \theta = \theta_0$ is true, then $\Lambda(\theta_0)$ has approximately a $\chi^2(1)$ distribution.
The p-value is then
$$
\begin{aligned}
p\text{-value} &\approx P[W \geq \lambda(\theta_0)] \quad\text{where } W \sim \chi^2(1)\\
&= P\left[|Z| \geq \sqrt{\lambda(\theta_0)}\right] \quad\text{where } Z \sim G(0, 1)\\
&= 2\left[1 - P\left(Z \leq \sqrt{\lambda(\theta_0)}\right)\right]. \tag{5.6}
\end{aligned}
$$
Let us summarize the construction of a test from the likelihood function. Let the random variable (or vector of random variables) $\mathbf{Y}$ represent data generated from a distribution with probability function or probability density function $f(y; \theta)$ which depends on the scalar parameter $\theta$. Let $\Omega$ be the parameter space (set of possible values) for $\theta$. Consider a hypothesis of the form
$$H_0: \theta = \theta_0$$
where $\theta_0$ is a single point (hence of dimension 0). We can test $H_0$ using as our test statistic the likelihood ratio test statistic $\Lambda$, defined by (5.5). Then large observed values of $\lambda(\theta_0)$ correspond to a disagreement between the hypothesis $H_0: \theta = \theta_0$ and the data, and so provide evidence against $H_0$. Moreover, if $H_0: \theta = \theta_0$ is true, $\Lambda(\theta_0)$ has approximately a $\chi^2(1)$ distribution, so that an approximate p-value is obtained from (5.6). The theory behind the approximation is based on a result which shows that, under $H_0$, the distribution of $\Lambda$ approaches $\chi^2(1)$ as the size of the data set becomes large.
For the Binomial model, the likelihood ratio test statistic for testing $H_0: \theta = \theta_0$ is
$$\Lambda(\theta_0) = 2n\left[\tilde{\theta}\log\left(\frac{\tilde{\theta}}{\theta_0}\right) + (1 - \tilde{\theta})\log\left(\frac{1 - \tilde{\theta}}{1 - \theta_0}\right)\right]$$
and the observed value $\lambda(\theta_0)$ is
$$\lambda(\theta_0) = 2n\left[\hat{\theta}\log\left(\frac{\hat{\theta}}{\theta_0}\right) + (1 - \hat{\theta})\log\left(\frac{1 - \hat{\theta}}{1 - \theta_0}\right)\right]$$
where $\hat{\theta} = y/n$. If $\hat{\theta}$ and $\theta_0$ are equal then $\lambda(\theta_0) = 0$. If $\hat{\theta}$ is either much larger or much smaller than $\theta_0$, then $\lambda(\theta_0)$ will be large in value.
Suppose we use the likelihood ratio test statistic to test $H_0: \theta = 0.5$ for the ESP example and the data in Example 5.1.1, which were $n = 200$ and $y = 110$, so that $\hat{\theta} = 0.55$. The observed value of the likelihood ratio statistic for testing $H_0: \theta = 0.5$ is
$$\lambda(0.5) = 2(200)\left[0.55\log\left(\frac{0.55}{0.5}\right) + (1 - 0.55)\log\left(\frac{1 - 0.55}{1 - 0.5}\right)\right] = 2.003.$$
The approximate p-value is
$$
p\text{-value} \approx P(W \geq 2.003) = P\left(|Z| \geq \sqrt{2.003}\right) = 2[1 - P(Z \leq 1.42)] = 2(1 - 0.9222) = 0.1556,
$$
where $W \sim \chi^2(1)$ and $Z \sim G(0, 1)$, and there is no evidence against $H_0: \theta = 0.5$ based on the data. Note that the test statistic $D = |Y - 100|$ used in Example 5.1.1 and the likelihood ratio test statistic $\Lambda(0.5)$ give nearly identical results. This is because $n = 200$ is large.
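A quick check of the observed likelihood ratio statistic and its $\chi^2(1)$ p-value; a sketch in Python (the course's own computations use R):

```python
from math import log, sqrt, erf

def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, y, theta0 = 200, 110, 0.5
th = y / n  # maximum likelihood estimate, 0.55
lam = 2 * n * (th * log(th / theta0) + (1 - th) * log((1 - th) / (1 - theta0)))
p_value = 2 * (1 - Phi(sqrt(lam)))  # chi-squared(1) tail via |Z| >= sqrt(lam)
```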
Example 5.3.2 Likelihood ratio test statistic for Exponential model
Suppose $y_1, \ldots, y_n$ are the observed values of a random sample from the Exponential($\theta$) distribution. The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} \frac{1}{\theta}e^{-y_i/\theta} = \frac{1}{\theta^n}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} y_i\right) \quad\text{for } \theta > 0,$$
with log-likelihood
$$\ell(\theta) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^{n} y_i = -n\log\theta - \frac{n\bar{y}}{\theta} \quad\text{for } \theta > 0,$$
which is maximized at $\hat{\theta} = \bar{y}$.
The likelihood ratio test statistic for $H_0: \theta = \theta_0$ is therefore
$$
\Lambda(\theta_0) = 2\ell(\bar{Y}) - 2\ell(\theta_0) = 2n\left[\frac{\bar{Y}}{\theta_0} - 1 - \log\left(\frac{\bar{Y}}{\theta_0}\right)\right],
$$
and the observed value of $\Lambda(\theta_0)$ is
$$\lambda(\theta_0) = 2n\left[\frac{\bar{y}}{\theta_0} - 1 - \log\left(\frac{\bar{y}}{\theta_0}\right)\right].$$
Again we observe that if $\hat{\theta}$ and $\theta_0$ are equal then $\lambda(\theta_0) = 0$, and if $\hat{\theta}$ is either much larger or much smaller than $\theta_0$, then $\lambda(\theta_0)$ will be large in value.
The variability in lifetimes of light bulbs (in hours, say, of operation before failure) is often well described by an Exponential($\theta$) distribution, where $\theta = E(Y) > 0$ is the average (mean) lifetime. A manufacturer claims that the mean lifetime of a particular brand of bulbs is 2000 hours. We can examine this claim by testing the hypothesis $H_0: \theta = 2000$. Suppose a random sample of $n = 50$ light bulbs was tested over a long period and that the observed lifetimes were:

572    347    2090   5158   3638   2732   2739   371    5839   461
1363   411    1071   1267   2335   716    2825   1197   499    1275
231    147    173    137    3596   83     2100   2505   4082   1015
1206   3253   556    1128   2671   3952   2764   565    1513   849
3804   969    1933   8862   744    2713   1496   1132   2175   580

with $\sum_{i=1}^{50} y_i = 93840$, so that the maximum likelihood estimate of $\theta$ is $\hat{\theta} = \bar{y} = 93840/50 = 1876.8$. To check whether the Exponential model is reasonable for these data, we plot the empirical cumulative distribution function for these data and then superimpose the cumulative distribution function of an Exponential(1876.8) random variable; see Figure 5.3. Since the agreement between the empirical cumulative distribution function and the Exponential(1876.8) cumulative distribution function is quite good, we adopt the Exponential model to test the hypothesis that the mean lifetime of the light bulbs is 2000 hours.
The observed value of the likelihood ratio test statistic for testing $H_0: \theta = 2000$ is
$$\lambda(2000) = 2(50)\left[\frac{1876.8}{2000} - 1 - \log\left(\frac{1876.8}{2000}\right)\right] = 0.1979.$$
[Figure 5.3: empirical cumulative distribution function of the 50 lifetimes, with the Exponential(1876.8) cumulative distribution function superimposed.]
The approximate p-value is
$$
p\text{-value} \approx P(W \geq 0.1979) = P\left(|Z| \geq \sqrt{0.1979}\right) = 2[1 - P(Z \leq 0.44)] = 2(1 - 0.67003) = 0.65994,
$$
where $W \sim \chi^2(1)$ and $Z \sim G(0, 1)$. The p-value is large, so we conclude that there is no evidence against $H_0: \theta = 2000$, and no evidence against the manufacturer's claim that $\theta$ is 2000 hours, based on the data. Although the maximum likelihood estimate $\hat{\theta}$ was under 2000 hours (1876.8), it was not sufficiently under to give evidence against $H_0: \theta = 2000$.
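The Exponential likelihood ratio computation above can be sketched in a few lines of Python (the course itself uses R):

```python
from math import log, sqrt, erf

def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, ybar, theta0 = 50, 1876.8, 2000
r = ybar / theta0
lam = 2 * n * (r - 1 - log(r))      # observed likelihood ratio statistic
p_value = 2 * (1 - Phi(sqrt(lam)))  # approximate chi-squared(1) p-value
```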
Example 5.3.3 Likelihood ratio test of hypothesis for $\mu$ for $G(\mu, \sigma)$, $\sigma$ known
Suppose $Y \sim G(\mu, \sigma)$ with probability density function
$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right] \quad\text{for } y \in \Re.$$
Let us begin with the (rather unrealistic) assumption that the standard deviation $\sigma$ has a known value, so the only unknown parameter is $\mu$. In this case the likelihood function for an observed sample $y_1, y_2, \ldots, y_n$ from this distribution is
$$L(\mu) = \prod_{i=1}^{n} f(y_i; \mu, \sigma) = (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad\text{for } \mu \in \Re,$$
or more simply
$$L(\mu) = \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad\text{for } \mu \in \Re.$$
The log-likelihood is
$$\ell(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \quad\text{for } \mu \in \Re,$$
and solving
$$\ell'(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0$$
gives the maximum likelihood estimator
$$\tilde{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$
Using the identity³⁸
$$\sum_{i=1}^{n}(y_i - \mu_0)^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + n(\bar{y} - \mu_0)^2,$$
the likelihood ratio test statistic for $H_0: \mu = \mu_0$ is
$$
\begin{aligned}
\Lambda(\mu_0) &= 2\ell(\tilde{\mu}) - 2\ell(\mu_0)\\
&= \frac{1}{\sigma^2}\left[\sum_{i=1}^{n}(Y_i - \mu_0)^2 - \sum_{i=1}^{n}(Y_i - \tilde{\mu})^2\right]\\
&= \frac{n(\bar{Y} - \mu_0)^2}{\sigma^2} \quad\text{since } \tilde{\mu} = \bar{Y}\\
&= \left(\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}}\right)^2. \tag{5.7}
\end{aligned}
$$
The purpose of writing the likelihood ratio statistic in the form (5.7) is to draw attention to the fact that $\Lambda$ is the square of the standard Normal random variable $\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}}$ and therefore has exactly a $\chi^2(1)$ distribution. Of course it is not clear in general that the likelihood ratio test statistic has an approximate $\chi^2(1)$ distribution, but in this special case the distribution of $\Lambda$ is exactly $\chi^2(1)$ (not only asymptotically but for all values of $n$).

³⁸ For any constant $c$, $\sum_{i=1}^{n}(y_i - c)^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + n(\bar{y} - c)^2$.
5.4 Likelihood Ratio Tests of Hypotheses - Multiparameter
Let the data $\mathbf{Y}$ represent data generated from a distribution with probability or probability density function $f(\mathbf{y}; \theta)$ which depends on the $k$-dimensional parameter $\theta$. Let $\Omega$ be the parameter space (set of possible values) for $\theta$. Consider a hypothesis of the form
$$H_0: \theta \in \Omega_0$$
where $\Omega_0 \subset \Omega$ and $\Omega_0$ is of dimension $p < k$. For example, $H_0$ might specify particular values for $k - p$ of the components of $\theta$ but leave the remaining parameters alone. The dimensions of $\Omega$ and $\Omega_0$ refer to the minimum number of parameters (or coordinates) needed to specify points in them. Again we test $H_0$ using as our test statistic the likelihood ratio test statistic $\Lambda$, defined as follows. Let $\hat{\theta}$ denote the maximum likelihood estimate of $\theta$ over $\Omega$, so that, as before,
$$L(\hat{\theta}) = \max_{\theta \in \Omega} L(\theta).$$
Similarly, let $\hat{\theta}_0$ denote the maximum likelihood estimate of $\theta$ over $\Omega_0$ (i.e. we maximize subject to $H_0$) so that
$$L(\hat{\theta}_0) = \max_{\theta \in \Omega_0} L(\theta).$$
The observed value of the likelihood ratio statistic is then
$$\lambda = -2\log\left[\frac{L(\hat{\theta}_0)}{L(\hat{\theta})}\right] = 2\ell(\hat{\theta}) - 2\ell(\hat{\theta}_0). \tag{5.8}$$
If the observed value $\lambda$ is very large, then there is evidence against $H_0$ (confirm that this means $L(\hat{\theta})$ is much larger than $L(\hat{\theta}_0)$). In this case it can be shown that under $H_0$ the distribution of $\Lambda$ is approximately $\chi^2(k - p)$ as the size of the data set becomes large. Again, large values of $\lambda$ indicate evidence against $H_0$, so the p-value is given approximately by
$$p\text{-value} = P(\Lambda \geq \lambda; H_0) \approx P(W \geq \lambda) \tag{5.9}$$
where $W \sim \chi^2(k - p)$.
The likelihood ratio test covers a great many different types of examples, but we only provide a few here.
Essentially we have data from two Poisson distributions with possibly different parameters. For convenience, let $(x_1, \ldots, x_n)$ denote the observations for Company A's photocopier, which are assumed to be a random sample from the model
$$P(X = x; \lambda_A) = \frac{\lambda_A^{x}\exp(-\lambda_A)}{x!} \quad\text{for } x = 0, 1, \ldots \text{ and } \lambda_A > 0.$$
Similarly, let $(y_1, \ldots, y_m)$ denote the observations for Company B's photocopier, which are assumed to be a random sample from the model
$$P(Y = y; \lambda_B) = \frac{\lambda_B^{y}\exp(-\lambda_B)}{y!} \quad\text{for } y = 0, 1, \ldots \text{ and } \lambda_B > 0,$$
independently of the observations for Company A's photocopier. In this case the parameter vector is the two-dimensional vector $\theta = (\lambda_A, \lambda_B)$ and $\Omega = \{(\lambda_A, \lambda_B): \lambda_A > 0, \lambda_B > 0\}$. Note that the dimension of $\Omega$ is $k = 2$. Since the null hypothesis specifies that the two parameters $\lambda_A$ and $\lambda_B$ are equal but does not otherwise specify their values, we have $\Omega_0 = \{(\lambda, \lambda): \lambda > 0\}$, which is a space of dimension $p = 1$.
To construct the likelihood ratio test of $H_0: \lambda_A = \lambda_B$ we need the likelihood function for the parameter vector $\theta = (\lambda_A, \lambda_B)$. We first note that the likelihood function for $\lambda_A$ only, based on the data $(x_1, \ldots, x_n)$, is
$$L_1(\lambda_A) = \prod_{i=1}^{n} f(x_i; \lambda_A) = \prod_{i=1}^{n} \frac{\lambda_A^{x_i}\exp(-\lambda_A)}{x_i!} \quad\text{for } \lambda_A > 0,$$
or more simply
$$L_1(\lambda_A) = \lambda_A^{\sum_{i=1}^{n} x_i}\exp(-n\lambda_A) \quad\text{for } \lambda_A > 0.$$
Similarly, the likelihood function for $\lambda_B$ based on the data $(y_1, \ldots, y_m)$ is
$$L_2(\lambda_B) = \lambda_B^{\sum_{j=1}^{m} y_j}\exp(-m\lambda_B) \quad\text{for } \lambda_B > 0.$$
By independence, the likelihood function for $(\lambda_A, \lambda_B)$ is $L(\lambda_A, \lambda_B) = L_1(\lambda_A)L_2(\lambda_B)$, and the log-likelihood is
$$\ell(\lambda_A, \lambda_B) = \left(\sum_{i=1}^{n} x_i\right)\log\lambda_A - n\lambda_A + \left(\sum_{j=1}^{m} y_j\right)\log\lambda_B - m\lambda_B \quad\text{for } (\lambda_A, \lambda_B) \in \Omega. \tag{5.10}$$
The number of failures in twelve consecutive months for Company A's and Company B's copiers are given below; there were the same number of copiers from each company in use, so $n = m = 12$.

Company A: 16  14  25  19  23  12  22  28  19  15  18  29
Company B: 13   7  12   9  15  17  10  13   8  10  12  14

We note that $\sum_{i=1}^{12} x_i = 240$ and $\sum_{j=1}^{12} y_j = 140$, so the log-likelihood (5.10) becomes
$$\ell(\lambda_A, \lambda_B) = 240\log\lambda_A - 12\lambda_A + 140\log\lambda_B - 12\lambda_B.$$
The maximum likelihood estimates of $\lambda_A$ and $\lambda_B$, which maximize $\ell(\lambda_A, \lambda_B)$, solve³⁹
$$\frac{\partial \ell}{\partial \lambda_A} = 0, \qquad \frac{\partial \ell}{\partial \lambda_B} = 0,$$
which gives two equations in two unknowns:
$$-12 + \frac{240}{\lambda_A} = 0, \qquad -12 + \frac{140}{\lambda_B} = 0,$$
so that $\hat{\lambda}_A = 240/12 = 20.0$ and $\hat{\lambda}_B = 140/12 = 11.667$. Under $H_0: \lambda_A = \lambda_B = \lambda$ the log-likelihood becomes
$$\ell(\lambda, \lambda) = -12\lambda + 240\log\lambda - 12\lambda + 140\log\lambda = -24\lambda + 380\log\lambda \quad\text{for } \lambda > 0,$$
which is maximized at $\hat{\lambda} = 380/24 = 15.833$.

³⁹ One can think of this as maximizing over each parameter with the other parameter fixed.
The observed value of the likelihood ratio statistic is
$$
\lambda = 2\ell(\hat{\lambda}_A, \hat{\lambda}_B) - 2\ell(\hat{\lambda}, \hat{\lambda}) = 2\ell(20.0, 11.667) - 2\ell(15.833, 15.833) = 2(682.92 - 669.60) = 26.64.
$$
Finally, we compute the approximate p-value
$$p\text{-value} = P(\Lambda \geq 26.64; H_0) \approx P(W \geq 26.64) < 0.001 \quad\text{where } W \sim \chi^2(1).$$
Our conclusion is that there is very strong evidence against the hypothesis $H_0: \lambda_A = \lambda_B$. The data indicate that Company B's copiers have a lower rate of failure than Company A's copiers.
Note that we could also follow up this conclusion by giving a confidence interval for the mean difference $\lambda_A - \lambda_B$, since this would indicate the magnitude of the difference in the two failure rates. The maximum likelihood estimates $\hat{\lambda}_A = 20.0$ average failures per month and $\hat{\lambda}_B = 11.67$ failures per month differ a lot, but we could also give a confidence interval in order to express the uncertainty in such estimates.
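The two-sample Poisson likelihood ratio computation can be sketched as follows (a Python sketch rather than R; constant terms in the log-likelihood cancel in the difference and are dropped):

```python
from math import log, sqrt, erf

def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

sx, sy, n, m = 240, 140, 12, 12
lamA, lamB = sx / n, sy / m   # unrestricted MLEs: 20.0 and 11.667
lam0 = (sx + sy) / (n + m)    # pooled MLE under H0: 15.833

def ell(lA, lB):
    """Log-likelihood, dropping terms not involving the parameters."""
    return sx * log(lA) - n * lA + sy * log(lB) - m * lB

lam = 2 * (ell(lamA, lamB) - ell(lam0, lam0))  # about 26.6
p_value = 2 * (1 - Phi(sqrt(lam)))             # chi-squared(1) tail
```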
Example 5.4.4 Likelihood ratio test of hypothesis for $\sigma$ in the $G(\mu, \sigma)$ model, $\mu$ unknown
Consider a test of $H_0: \sigma = \sigma_0$ based on a random sample $y_1, y_2, \ldots, y_n$. In this case the unconstrained parameter space is $\Omega = \{(\mu, \sigma): -\infty < \mu < \infty, \sigma > 0\}$, obviously a 2-dimensional space, but under the constraint imposed by $H_0$ the parameter must lie in the space $\Omega_0 = \{(\mu, \sigma_0): -\infty < \mu < \infty\}$, a space of dimension 1. Thus $k = 2$ and $p = 1$.
The likelihood function is
$$L(\theta) = L(\mu, \sigma) = \prod_{i=1}^{n} f(y_i; \mu, \sigma) = (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right],$$
and the log-likelihood is
$$\ell(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 + c, \quad\text{where } c = \log\left[(2\pi)^{-n/2}\right].$$
The unconstrained maximum likelihood estimators are
$$\tilde{\mu} = \bar{Y} \quad\text{and}\quad \tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2,$$
and under $H_0: \sigma = \sigma_0$ the maximum likelihood estimator of $\mu$ is still $\bar{Y}$. The likelihood ratio test statistic is therefore
$$
\begin{aligned}
\Lambda(\sigma_0) &= 2\ell(\bar{Y}, \tilde{\sigma}) - 2\ell(\bar{Y}, \sigma_0)\\
&= -2n\log(\tilde{\sigma}) - \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 + 2n\log(\sigma_0) + \frac{1}{\sigma_0^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\\
&= 2n\log\left(\frac{\sigma_0}{\tilde{\sigma}}\right) + \frac{n\tilde{\sigma}^2}{\sigma_0^2} - n\\
&= n\left[\frac{\tilde{\sigma}^2}{\sigma_0^2} - 1 - \log\left(\frac{\tilde{\sigma}^2}{\sigma_0^2}\right)\right].
\end{aligned}
$$
This is not as obviously a Chi-squared random variable. It is, as one might expect, a function of $\tilde{\sigma}^2/\sigma_0^2$, which is the maximum likelihood estimator of the variance divided by the value of $\sigma^2$ under $H_0$. In fact the value of $\Lambda(\sigma_0)$ increases as the quantity $\tilde{\sigma}^2/\sigma_0^2$ gets further away from the value 1 in either direction.
The test proceeds by obtaining the observed value of $\Lambda(\sigma_0)$,
$$\lambda(\sigma_0) = n\left[\frac{\hat{\sigma}^2}{\sigma_0^2} - 1 - \log\left(\frac{\hat{\sigma}^2}{\sigma_0^2}\right)\right],$$
and then computing the approximate p-value $P(W \geq \lambda(\sigma_0))$ where $W \sim \chi^2(1)$.
Suppose $(Y_1, \ldots, Y_k)$ has a Multinomial distribution with joint probability function
$$f(y_1, \ldots, y_k; \theta_1, \ldots, \theta_k) = \frac{n!}{y_1!\cdots y_k!}\,\theta_1^{y_1}\theta_2^{y_2}\cdots\theta_k^{y_k} \quad\text{for } 0 \leq y_j \leq n, \text{ where } \sum_{j=1}^{k} y_j = n.$$
Suppose we wish to test a hypothesis of the form $H_0: \theta_j = \theta_j(\alpha)$, where the probabilities $\theta_j(\alpha)$ are all functions of an unknown parameter (possibly vector) $\alpha$ with dimension $\dim(\alpha) = p < k - 1$. The parameter in the original model is $\theta = (\theta_1, \ldots, \theta_k)$, and the parameter space $\Omega = \{(\theta_1, \ldots, \theta_k): 0 \leq \theta_j \leq 1, \sum_{j=1}^{k}\theta_j = 1\}$ has dimension $k - 1$. The likelihood function is
$$L(\theta) = \frac{n!}{y_1!\cdots y_k!}\prod_{j=1}^{k}\theta_j^{y_j},$$
or more simply
$$L(\theta) = \prod_{j=1}^{k}\theta_j^{y_j}.$$
The approximate p-value for the likelihood ratio test is
$$p\text{-value} = P(\Lambda \geq \lambda; H_0) \approx P(W \geq \lambda) \quad\text{where } W \sim \chi^2(k - 1 - p),$$
where $\lambda = 2\ell(\hat{\theta}) - 2\ell(\hat{\theta}_0)$ is the observed value of $\Lambda$. Tests of this type are considered further in Chapter 7.
5.5 Chapter 5 Problems
1. The accident rate over a certain stretch of highway was about $\lambda = 10$ per year for a period of several years. In the most recent year, however, the number of accidents was 25. We want to know whether this many accidents is very probable if $\lambda = 10$; if not, we might conclude that the accident rate has increased for some reason. Investigate this question by assuming that the number of accidents in the current year follows a Poisson distribution with mean $\lambda$, and then testing $H_0: \lambda = 10$. Use the test statistic $D = \max(0, Y - 10)$, where $Y$ represents the number of accidents in the most recent year.
2. A woman who claims to have special guessing abilities is given a test, as follows: a deck which contains five cards with the numbers 1 to 5 is shuffled and a card drawn out of sight of the woman. The woman then guesses the card, the deck is reshuffled with the card replaced, and the procedure is repeated several times.

(a) Let $\theta$ be the probability the woman guesses the card correctly and let $Y$ be the number of correct guesses in $n$ repetitions of the procedure. Discuss why $Y \sim \text{Binomial}(n, \theta)$ would be an appropriate model. If you wanted to test the hypothesis that the woman is guessing at random, what is the appropriate null hypothesis $H_0$ in terms of the parameter $\theta$?

(b) Suppose the woman guessed correctly 8 times in 20 repetitions. Calculate the p-value for your hypothesis $H_0$ in (a), and give a conclusion about whether you think the woman has any special guessing ability.

(c) In a longer sequence of 100 repetitions over two days, the woman guessed correctly 32 times. Calculate the p-value for these data. What would you conclude now?
3. The R function runif() generates pseudorandom U(0, 1) random variables. The command y <- runif(n) will produce a vector of $n$ values $y_1, \ldots, y_n$.

(a) Give a test statistic which could be used to test that the $y_i$'s, $i = 1, \ldots, n$, are consistent with a random sample from Uniform(0, 1).

(b) Generate 1000 $y_i$'s and carry out the test in (a).
4. A company that produces power systems for personal computers has to demonstrate a high degree of reliability for its systems. Because the systems are very reliable under normal use conditions, it is customary to "stress" the systems by running them at a considerably higher temperature than they would normally encounter, and to measure the time until the system fails. According to a contract with one personal computer manufacturer, the average time to failure for systems run at 70°C should be no less than 1,000 hours.

From one production lot, 20 power systems were put on test and observed until failure at 70°C. The 20 failure times $y_1, \ldots, y_{20}$ were (in hours):

374.2   551.9   250.2   162.8   544.0
853.2   678.1   1060.1  1113.9  3391.2
379.6   1501.4  509.4   297.0   1818.9
332.2   1244.3  63.1    1191.1  2382.0

(Note: $\sum_{i=1}^{20} y_i = 18698.6$.)
5. The following are 18 measurements on a quantity of interest:

46.0   41.5   46.6   39.6   41.3   42.0
44.8   45.8   47.8   48.9   44.5   46.6
45.1   42.9   42.9   47.0   44.5   43.7

(a) Assuming that the measurements are independent and $G(\mu, \sigma)$, obtain a 95% confidence interval for $\mu$ and test the hypothesis that $\mu = 45$.

(b) Obtain a 95% confidence interval for $\sigma$. Of what interest is this scientifically?
6. Radon is a colourless, odourless gas that is naturally released by rocks and soils and may concentrate in highly insulated houses. Because radon is slightly radioactive, there is some concern that it may be a health hazard. Radon detectors are sold to homeowners worried about this risk, but the detectors may be inaccurate. University researchers placed 12 detectors in a chamber where they were exposed to 105 picocuries per liter of radon over 3 days. The readings given by the detectors were:

91.9   97.8   111.4   122.3   105.4   95.0   103.8   99.6   96.6   119.3   104.8   101.7

Let $y_i$ = reading for the $i$'th detector, $i = 1, \ldots, 12$. For these data

$$\sum_{i=1}^{12} y_i = 1249.6 \quad \text{and} \quad \sum_{i=1}^{12} (y_i - \bar{y})^2 = 971.43.$$

Assume $Y_i \sim G(\mu, \sigma)$, $i = 1, \ldots, 12$ independently, and test the hypothesis $\mu = 105$.
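A sketch of the corresponding t-test in R, working directly from the two summary statistics above (the problem does not prescribe a particular test, so treat this as one reasonable approach):

```r
# One-sample t-test of H0: mu = 105 from the summary statistics
n     <- 12
ybar  <- 1249.6 / n                 # sample mean
s2    <- 971.43 / (n - 1)           # sample variance
tstat <- (ybar - 105) / sqrt(s2 / n)
pval  <- 2 * (1 - pt(abs(tstat), df = n - 1))
c(tstat = tstat, p = pval)
```

A large p-value here would indicate no evidence that the detectors are biased.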
175
7. Data on the number of accidents at a busy intersection in Waterloo over the last 5 years indicated that the average number of accidents at the intersection was 3 accidents per week. After the installation of new traffic signals the number of accidents per week for a 25 week period was recorded as follows:

4 5 0 4 2 0 1 4 1 3 1 1 2
2 2 1 1 3 2 3 2 0 2 2 3

Let $y_i$ = the number of accidents in week $i$, $i = 1, 2, \ldots, 25$. To analyse these data we assume $Y_i$ has a Poisson distribution with mean $\theta$, $i = 1, 2, \ldots, 25$ independently.

(a) To decide whether the mean number of accidents at this intersection has changed after the installation of the new traffic signals we wish to test the hypothesis $H_0: \theta = 3$. Why is the discrepancy measure $D = \left| \sum_{i=1}^{25} Y_i - 75 \right|$ reasonable? Calculate the exact $p$-value as well as an approximate $p$-value of the form $P(|Z| \geq c)$ where $Z \sim N(0, 1)$.

(b) Carry out a likelihood ratio test of $H_0: \theta = 3$ using the likelihood ratio statistic

$$\Lambda(\theta_0) = 2n \left[ \bar{Y} \log \left( \bar{Y}/\theta_0 \right) - \bar{Y} + \theta_0 \right].$$
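For (a), the exact $p$-value can be computed from the Poisson distribution of the total $\sum Y_i$; a sketch in R, with the observed total taken from the data listed above:

```r
# Exact p-value for (a): under H0 the total T = sum(Y_i) is
# Poisson(75), and D = |T - 75| is the discrepancy measure.
y <- c(4, 5, 0, 4, 2, 0, 1, 4, 1, 3, 1, 1, 2,
       2, 2, 1, 1, 3, 2, 3, 2, 0, 2, 2, 3)
tobs <- sum(y)                       # observed total: 51
d    <- abs(tobs - 75)               # observed discrepancy: 24
# P(T <= 75 - d) + P(T >= 75 + d) for T ~ Poisson(75)
pval <- ppois(75 - d, 75) + (1 - ppois(75 + d - 1, 75))
# Normal approximation: Z = (T - 75)/sqrt(75) since Var(T) = 75
papprox <- 2 * (1 - pnorm(d / sqrt(75)))
c(exact = pval, approx = papprox)
```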
8. In the Wintario lottery draw, six digit numbers were produced by six machines that operate independently and which each simulate a random selection from the digits 0, 1, \ldots, 9. Of 736 numbers drawn over a period from 1980-82, the following frequencies were observed for position 1 in the six digit numbers:

Digit ($i$):          0    1    2    3    4    5    6    7    8    9   Total
Frequency ($f_i$):   70   75   63   59   81   92   75  100   63   58     736

Consider the hypothesis that each digit is equally likely, that is, $\theta_i = 0.1$ for $i = 0, 1, \ldots, 9$.
(a) Test this hypothesis using a likelihood ratio test. What do you conclude?

(b) The data above were for digits in the first position of the six digit Wintario numbers. Suppose you were told that similar likelihood ratio tests had in fact been carried out for each of the six positions, and that position 1 had been singled out for presentation above because it gave the largest observed value of the likelihood ratio statistic $\lambda$. What would you now do to test the hypothesis $\theta_j = 0.1$, $j = 0, 1, 2, \ldots, 9$? (Hint: Find $P(\text{largest of 6 independent } \Lambda\text{'s is} \geq \lambda)$.)
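A sketch of the likelihood ratio computation for (a) in R. For a Multinomial model with equal cell probabilities the statistic takes the form $\lambda = 2\sum_i f_i \log(f_i/e_i)$ with expected counts $e_i = 736/10$, compared against an approximate $\chi^2(9)$ distribution; the coding below is our own, not taken from the text:

```r
# Likelihood ratio test of H0: theta_i = 0.1 for all ten digits
f <- c(70, 75, 63, 59, 81, 92, 75, 100, 63, 58)   # observed counts
e <- sum(f) * 0.1                                  # expected count 73.6
lambda <- 2 * sum(f * log(f / e))                  # LR statistic
pval   <- 1 - pchisq(lambda, df = 9)               # 10 categories - 1
c(lambda = lambda, p = pval)
```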
9. Testing a genetic model: Recall the model for the M-N blood types of people, discussed in Examples 2.4.2 and 2.5.2. In a study involving a random sample of $n$ persons the numbers $Y_1, Y_2, Y_3$ ($Y_1 + Y_2 + Y_3 = n$) who have blood types MM, MN and NN respectively have a Multinomial distribution with joint probability function

$$f(y_1, y_2, y_3) = \frac{n!}{y_1! \, y_2! \, y_3!} \, \theta_1^{y_1} \theta_2^{y_2} \theta_3^{y_3} \quad \text{for } y_j = 0, 1, \ldots; \ \sum_{j=1}^{3} y_j = n,$$

and since $\theta = (\theta_1, \theta_2, \theta_3)$ satisfies $\theta_j \geq 0$ and $\sum_{j=1}^{3} \theta_j = 1$, the parameter $\theta$ has dimension two. The genetic model discussed earlier specified that $\theta_1, \theta_2, \theta_3$ can be expressed in terms of only a single parameter $\alpha$, $0 < \alpha < 1$, as follows:

$$\theta_1 = \alpha^2, \quad \theta_2 = 2\alpha(1 - \alpha), \quad \theta_3 = (1 - \alpha)^2. \tag{5.11}$$
for each region. Test the hypothesis that the five rates of birth defects are equal.

$P_j$:   2025   1116   3210   1687   2840
$y_j$:     27     18     41     29     31
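The problem stem before this fragment is not shown in the surviving text, but if $y_j$ is a count of defects out of $P_j$ births in region $j$, a likelihood ratio test of equal rates can be sketched as follows; the independent-Poisson modelling assumption here is ours:

```r
# LR test of H0: equal defect rates across the five regions,
# assuming y_j ~ Poisson(P_j * beta_j) independently (our assumption)
P <- c(2025, 1116, 3210, 1687, 2840)   # denominators (births)
y <- c(27, 18, 41, 29, 31)             # observed defect counts
rate_hat <- sum(y) / sum(P)            # pooled rate under H0
e <- P * rate_hat                      # expected counts under H0
lambda <- 2 * sum(y * log(y / e))      # LR statistic
pval   <- 1 - pchisq(lambda, df = 4)   # 5 regions - 1 constraint
c(lambda = lambda, p = pval)
```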
11. Challenge Problem: Likelihood ratio test statistics for Gaussian model, $\mu$ and $\sigma$ unknown: Suppose that $Y_1, \ldots, Y_n$ are independent $G(\mu, \sigma)$ observations.

(a) Show that the likelihood ratio test statistic for testing $H_0: \mu = \mu_0$ ($\sigma$ unknown) is given by

$$\Lambda(\mu_0) = n \log \left( 1 + \frac{T^2}{n - 1} \right)$$

where $T = \sqrt{n}(\bar{Y} - \mu_0)/S$ and $S$ is the sample standard deviation. Note: you will want to use the identity

$$\sum_{i=1}^{n} (Y_i - \mu_0)^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + n(\bar{Y} - \mu_0)^2.$$

(b) Show that the likelihood ratio test statistic for testing $H_0: \sigma = \sigma_0$ ($\mu$ unknown) can be written as $\Lambda(\sigma_0) = U - n \log(U/n) - n$ where

$$U = \frac{(n - 1) S^2}{\sigma_0^2}.$$

See Example 5.4.4.
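The formula in (a) can be checked numerically; the sketch below compares $n\log(1 + T^2/(n-1))$ with the likelihood ratio computed directly from the two profile maximized log-likelihoods (a verification exercise of ours, not part of the problem):

```r
# Numerical check of part (a): Lambda(mu0) computed two ways
set.seed(1)
n   <- 30
mu0 <- 5
y   <- rnorm(n, mean = 5.4, sd = 2)
Tstat <- sqrt(n) * (mean(y) - mu0) / sd(y)
lam1  <- n * log(1 + Tstat^2 / (n - 1))
# Direct computation, profiling out sigma under H0 and without it:
s2hat  <- mean((y - mean(y))^2)     # MLE of sigma^2, mu unrestricted
s2null <- mean((y - mu0)^2)         # MLE of sigma^2 under H0: mu = mu0
lam2   <- n * log(s2null / s2hat)
c(lam1, lam2)                        # the two values agree
```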
6. GAUSSIAN RESPONSE MODELS

6.1 Introduction
A response variate $Y$ is one whose distribution has parameters which depend on the value of other variates. For the Gaussian models we have studied so far, we assumed that we had a random sample $Y_1, Y_2, \ldots, Y_n$ from the same Gaussian distribution $G(\mu, \sigma)$. A Gaussian response model generalizes this to permit the parameters of the Gaussian distribution for $Y_i$ to depend on a vector $\mathbf{x}_i$ of covariates (explanatory variates which are measured for the response variate $Y_i$). Gaussian models are by far the most common models used in statistics.

Definition 36 A Gaussian response model is one for which the distribution of the response variate $Y$, given the associated vector of covariates $\mathbf{x} = (x_1, x_2, \ldots, x_k)$ for an individual unit, is of the form

$$Y \sim G(\mu(\mathbf{x}), \sigma(\mathbf{x})).$$

For a sample of $n$ units this gives

$$Y_i \sim G(\mu(\mathbf{x}_i), \sigma(\mathbf{x}_i)) \quad \text{for } i = 1, \ldots, n \text{ independently.}$$

In most examples we will assume $\sigma(\mathbf{x}_i) = \sigma$ is constant. This assumption is not necessary but it does make the models easier to analyze. The choice of $\mu(\mathbf{x})$ is guided by past information and on current data from the population or process. The difference between various Gaussian response models is in the choice of the function $\mu(\mathbf{x})$ and the covariates. We often assume $\mu(\mathbf{x}_i)$ is a linear function of the covariates. These models are called Gaussian linear models and can be written as

$$Y_i \sim G(\mu(\mathbf{x}_i), \sigma) \quad \text{with} \quad \mu(\mathbf{x}_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}, \tag{6.1}$$

where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$ is the vector of known covariates associated with unit $i$ and $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters. These models are also referred to as linear regression models$^{40}$, and the $\beta_j$'s are called the regression coefficients.
Here are some examples of settings where Gaussian response models can be used.

Example 6.1.1 Can filler study
The soft drink bottle filling process of Example 1.5.2 involved two machines (Old and New). For a given machine it is reasonable to represent the distribution for the amount of liquid $Y$ deposited in a single bottle by a Gaussian distribution. In this case we can think of the machines as acting like a covariate, with $\mu$ and $\sigma$ differing for the two machines. We could write

$$Y \sim G(\mu_O, \sigma_O) \quad \text{for the Old machine,} \qquad Y \sim G(\mu_N, \sigma_N) \quad \text{for the New machine.}$$

In this case there is no formula relating $\mu$ and $\sigma$ to the machines; they are simply different. Notice that an important feature of a machine is the variability of its production, so we have, in this case, permitted the two variance parameters to be different.
Example 6.1.2 Price versus size of commercial buildings$^{41}$
Ontario property taxes are based on market value, which is determined by comparing a property to the price of those which have recently been sold. The value of a property is separated into components for land and for buildings. Here we deal with the value of the buildings only, but a similar analysis could be conducted for the value of the property.

Table 6.1: Size and Price of 30 Buildings

Size   Price     Size   Price     Size   Price
3.26   226.2     0.86   532.8     0.38   636.4
3.08   233.7     0.80   563.4     0.38   657.9
3.03   248.5     0.77   578.0     0.38   597.3
2.29   360.4     0.73   597.3     0.38   611.5
1.83   415.2     0.60   617.3     0.38   670.4
1.65   458.8     0.48   624.4     0.34   660.6
1.14   509.9     0.46   616.4     0.26   623.8
1.11   525.8     0.45   620.9     0.24   672.5
1.11   523.7     0.41   624.3     0.23   673.5
1.00   534.7     0.40   641.7     0.20   611.8

$^{40}$The term regression is used because it was introduced in the 19th century in connection with these models, but we will not explain why it was used here.
$^{41}$This reference can be found in earlier course notes for Oldford and MacKay, STAT 231 Ch. 16.
A manufacturing company was appealing the assessed market value of its property, which included a large building. Sales records were collected on the 30 largest buildings sold in the previous three years in the area. The data are given in Table 6.1 and plotted in Figure 6.1. They include the size of the building $x$ (in $m^2/10^5$) and the selling price $y$ (in \$ per $m^2$). The purpose of the analysis is to determine whether and to what extent we can determine the value of a property from the single covariate $x$, so that we know whether the assessed value appears to be too high. The building in question was $4.47 \times 10^5 \ m^2$, with an assessed market value of \$75 per $m^2$.

The scatterplot shows that the price $y$ is roughly inversely proportional to the size $x$, but there is obviously variability in the price of buildings having the same area (size). In this case we might consider a model where the price of a building of size $x_i$ is represented by a random variable $Y_i$, with

$$Y_i \sim G(\alpha + \beta x_i, \sigma) \quad \text{for } i = 1, \ldots, n \text{ independently.}$$
[Figure 6.1: Scatterplot of price (\$ per $m^2$) versus size ($m^2/10^5$) for the 30 buildings]
Example 6.1.3 Strength of steel bolts
The data below show the breaking strengths $y$ of six steel bolts at each of five different bolt diameters $x$. The data are plotted in Figure 6.2.

Diameter $x$:   0.10   0.20   0.30   0.40   0.50
Breaking        1.62   1.71   1.86   2.14   2.45
Strength        1.73   1.78   1.86   2.07   2.42
                1.70   1.79   1.90   2.11   2.33
                1.66   1.86   1.95   2.18   2.36
                1.74   1.70   1.96   2.17   2.38
                1.72   1.84   2.00   2.07   2.31
The scatterplot gives a clear picture of the relationship between $y$ and $x$. A reasonable model for the breaking strength $Y$ of a randomly selected bolt of diameter $x$ would appear to be $Y \sim G(\mu(x), \sigma)$. The variability in $y$ values appears to be about the same for bolts of different diameters, which again provides some justification for assuming $\sigma$ to be constant. It is not obvious what the best choice for $\mu(x)$ would be, although the relationship looks slightly nonlinear, so we might try a quadratic function

$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2$$

where $\beta_0, \beta_1, \beta_2$ are unknown parameters.
[Figure 6.2: Scatterplot of breaking strength versus bolt diameter]
The $G(\mu, \sigma)$ Model
In Chapters 4 and 5 we discussed estimation and testing hypotheses for samples from a Gaussian distribution. Suppose that $Y \sim G(\mu, \sigma)$ models a response variate $y$ in some population or process. A random sample $Y_1, \ldots, Y_n$ is selected, and we want to estimate the model parameters and possibly to test hypotheses about them. We can write this model in the form

$$Y_i = \mu + R_i \quad \text{where } R_i \sim G(0, \sigma), \tag{6.2}$$

so this is a special case of the Gaussian response model in which the mean function is constant. The estimator of the parameter $\mu$ that we used is the maximum likelihood estimator $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$. This estimator is also a least squares estimator: $\bar{Y}$ has the property that it minimizes the sum of squares $\sum_{i=1}^{n} (Y_i - \mu)^2$, that is,

$$\min_{\mu} \sum_{i=1}^{n} (Y_i - \mu)^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

You should be able to verify this. It will turn out that the methods for estimation, constructing confidence intervals and tests of hypotheses discussed earlier for the single Gaussian $G(\mu, \sigma)$ are all special cases of the more general methods derived in Section 6.4. In the next section we begin with a simple generalization of (6.2) to the case in which the mean is a linear function of a single covariate.
6.2 Simple Linear Regression

Suppose the mean of $Y$ is a linear function of a single covariate $x$, so that

$$Y_i \sim G(\alpha + \beta x_i, \sigma) \quad \text{for } i = 1, \ldots, n \text{ independently.} \tag{6.3}$$

The likelihood function for $(\alpha, \beta, \sigma)$ based on the observed data $(x_i, y_i)$, $i = 1, \ldots, n$, is

$$L(\alpha, \beta, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \alpha - \beta x_i)^2 \right]$$
or more simply

$$L(\alpha, \beta, \sigma) = \frac{1}{\sigma^n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \right].$$

The log-likelihood is

$$\ell(\alpha, \beta, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

Setting the partial derivatives of $\ell$ equal to zero gives the maximum likelihood equations

$$\frac{\partial \ell}{\partial \alpha} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = \frac{n}{\sigma^2} (\bar{y} - \alpha - \beta \bar{x}) = 0 \tag{6.4}$$

$$\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) x_i = \frac{1}{\sigma^2} \left( \sum_{i=1}^{n} x_i y_i - \alpha \sum_{i=1}^{n} x_i - \beta \sum_{i=1}^{n} x_i^2 \right) = 0 \tag{6.5}$$

$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = 0. \tag{6.6}$$

Solving (6.4) and (6.5) gives the maximum likelihood estimators

$$\tilde{\beta} = \frac{S_{xy}}{S_{xx}}, \qquad \tilde{\alpha} = \bar{Y} - \tilde{\beta} \bar{x} \tag{6.7}$$

and then from (6.6)

$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \tilde{\alpha} - \tilde{\beta} x_i)^2 \tag{6.8}$$

where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \bar{x}) x_i$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x}) Y_i$$

$$S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

The alternative forms for $S_{xx}$ and $S_{xy}$ follow since $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, which gives

$$\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x}) x_i - \bar{x} \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x}) x_i$$

and similarly $\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x}) Y_i$. It can also be shown that

$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \tilde{\alpha} - \tilde{\beta} x_i)^2 = \frac{1}{n} \left( S_{yy} - \tilde{\beta} S_{xy} \right).$$
In practice we use

$$S_e^2 = \frac{1}{n - 2} \sum_{i=1}^{n} (Y_i - \tilde{\alpha} - \tilde{\beta} x_i)^2$$

as the estimator of $\sigma^2$ rather than the maximum likelihood estimator $\tilde{\sigma}^2$ given by (6.8), since it can be shown that $E(S_e^2) = \sigma^2$. Note that $S_e^2$ can be more easily calculated using

$$S_e^2 = \frac{1}{n - 2} \left( S_{yy} - \tilde{\beta} S_{xy} \right),$$

which follows since $\tilde{\alpha} = \bar{Y} - \tilde{\beta} \bar{x}$ implies

$$\sum_{i=1}^{n} (Y_i - \tilde{\alpha} - \tilde{\beta} x_i)^2 = \sum_{i=1}^{n} \left[ (Y_i - \bar{Y}) - \tilde{\beta}(x_i - \bar{x}) \right]^2 = S_{yy} - 2\tilde{\beta} S_{xy} + \tilde{\beta}^2 S_{xx} = S_{yy} - \tilde{\beta} S_{xy},$$

using $\tilde{\beta} = S_{xy}/S_{xx}$ in the last step.
A different route to the same estimates is to choose $\alpha$ and $\beta$ to minimize the sum of squares

$$g(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

Such estimates are called least squares estimates. To find the least squares estimates we need to solve the two equations $\partial g/\partial \alpha = 0$ and $\partial g/\partial \beta = 0$, that is (dropping a constant factor of $-2$),

$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = n(\bar{y} - \alpha - \beta \bar{x}) = 0$$

$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i) x_i = \sum_{i=1}^{n} x_i y_i - \alpha \sum_{i=1}^{n} x_i - \beta \sum_{i=1}^{n} x_i^2 = 0$$

simultaneously. We note that this is equivalent to solving the maximum likelihood equations (6.4) and (6.5). In summary, the least squares estimates and the maximum likelihood estimates obtained assuming the model (6.3) are the same. Of course, the method of least squares only provides point estimates of the unknown parameters $\alpha$ and $\beta$, while assuming the model (6.3) allows us to obtain both estimates and confidence intervals for the unknown parameters. We now show how to obtain confidence intervals based on the model (6.3).
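The agreement between the closed-form estimates $\hat\beta = S_{xy}/S_{xx}$, $\hat\alpha = \bar{y} - \hat\beta\bar{x}$ and R's lm() fit can be confirmed numerically; the data values below are illustrative, not from the text:

```r
# Closed-form least squares / ML estimates versus lm()
x <- c(0.2, 0.5, 1.1, 1.8, 2.4, 3.0)          # illustrative covariate
y <- c(640, 600, 520, 450, 390, 250)          # illustrative response
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta_hat  <- Sxy / Sxx
alpha_hat <- mean(y) - beta_hat * mean(x)
fit <- lm(y ~ x)
# The two sets of estimates agree
rbind(formula = c(alpha_hat, beta_hat), lm = coef(fit))
```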
Since $\tilde{\beta} = S_{xy}/S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x}) Y_i / S_{xx}$, we can write

$$\tilde{\beta} = \sum_{i=1}^{n} a_i Y_i \quad \text{where } a_i = \frac{x_i - \bar{x}}{S_{xx}}$$

to make it clear that $\tilde{\beta}$ is a linear combination of the Normal random variables $Y_i$ and is therefore Normally distributed, with easily obtained expected value and variance. In fact it is easy to show that these non-random coefficients satisfy $\sum_{i=1}^{n} a_i = 0$, $\sum_{i=1}^{n} a_i x_i = 1$ and $\sum_{i=1}^{n} a_i^2 = \frac{1}{S_{xx}}$, and

$$E(\tilde{\beta}) = \sum_{i=1}^{n} a_i E(Y_i) = \sum_{i=1}^{n} a_i (\alpha + \beta x_i) = \alpha \sum_{i=1}^{n} a_i + \beta \sum_{i=1}^{n} a_i x_i = \beta$$

since $\sum a_i = 0$ and $\sum a_i x_i = 1$. Similarly,

$$Var(\tilde{\beta}) = \sigma^2 \sum_{i=1}^{n} a_i^2 = \frac{\sigma^2}{S_{xx}}.$$

In summary,

$$\tilde{\beta} \sim G\left( \beta, \ \frac{\sigma}{\sqrt{S_{xx}}} \right).$$
Confidence intervals for $\beta$ are important because the parameter $\beta$ represents the increase in the mean value of $Y$ resulting from an increase of one unit in the value of $x$. As well, if $\beta = 0$ then $x$ has no effect on $Y$ (within this model). Since

$$\tilde{\beta} \sim G\left( \beta, \ \frac{\sigma}{\sqrt{S_{xx}}} \right), \qquad \frac{(n - 2) S_e^2}{\sigma^2} \sim \chi^2(n - 2) \tag{6.9}$$

and the fact that it can be shown that $\tilde{\beta}$ and $S_e^2$ are independent random variables, it follows that

$$\frac{\tilde{\beta} - \beta}{S_e / \sqrt{S_{xx}}} \sim t(n - 2). \tag{6.10}$$

This pivotal quantity can be used to obtain confidence intervals for $\beta$ as well as tests of hypotheses about $\beta$.
Using t-tables or R, find the constant $a$ such that $P(-a \leq T \leq a) = p$ where $T \sim t(n - 2)$. Then

$$p = P(-a \leq T \leq a) = P\left( -a \leq \frac{\tilde{\beta} - \beta}{S_e / \sqrt{S_{xx}}} \leq a \right) = P\left( \tilde{\beta} - a S_e / \sqrt{S_{xx}} \leq \beta \leq \tilde{\beta} + a S_e / \sqrt{S_{xx}} \right),$$

so a $100p\%$ confidence interval for $\beta$ is given by

$$\left[ \hat{\beta} - a s_e / \sqrt{S_{xx}}, \ \hat{\beta} + a s_e / \sqrt{S_{xx}} \right].$$

To test the hypothesis $H_0: \beta = \beta_0$ we use the test statistic

$$\frac{\tilde{\beta} - \beta_0}{S_e / \sqrt{S_{xx}}} \quad \text{with observed value} \quad \frac{\hat{\beta} - \beta_0}{s_e / \sqrt{S_{xx}}}$$

and $p$-value given by

$$p\text{-value} = P\left( |T| \geq \left| \frac{\hat{\beta} - \beta_0}{s_e / \sqrt{S_{xx}}} \right| \right) = 2\left[ 1 - P\left( T \leq \left| \frac{\hat{\beta} - \beta_0}{s_e / \sqrt{S_{xx}}} \right| \right) \right] \quad \text{where } T \sim t(n - 2).$$

Note also that (6.9) can be used to obtain confidence intervals or tests for $\sigma$, but these are usually of less interest than inference about $\beta$ or the other quantities below.
Remark: In regression models we often redefine a covariate $x_i$ as $x_i' = x_i - c$, where $c$ is a constant value that makes $\sum_{i=1}^{n} x_i'$ close to zero. (Often we take $c = \bar{x}$, which makes $\sum_{i=1}^{n} x_i'$ exactly zero.) The reasons for doing this are that it reduces round-off errors in calculations, and that it makes the parameter $\alpha$ more interpretable. Note that $\beta$ does not change if we centre $x_i$ this way, because

$$E(Y | x) = \alpha + \beta x = \alpha + \beta (x' + c) = (\alpha + \beta c) + \beta x'.$$

Thus, the intercept changes if we redefine $x$, but $\beta$ does not. In the examples we consider here we have kept the given definition of $x_i$, for simplicity.
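The invariance of the slope under centring is easy to confirm numerically (illustrative data; the centring constant $c = \bar{x}$ follows the remark above):

```r
# Centring the covariate changes the intercept but not the slope
x <- c(1.2, 2.5, 3.1, 4.8, 6.0)      # illustrative covariate
y <- c(5.1, 7.9, 9.2, 12.8, 15.1)    # illustrative response
fit1 <- lm(y ~ x)                    # original covariate
fit2 <- lm(y ~ I(x - mean(x)))       # centred covariate, c = mean(x)
coef(fit1)[[2]]                      # slope with original x
coef(fit2)[[2]]                      # same slope with centred x
coef(fit2)[[1]]                      # new intercept = alpha + beta * c
```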
Confidence intervals for $\mu(x) = \alpha + \beta x$

We are often interested in estimating the quantity $\mu(x) = \alpha + \beta x$, since it represents the mean response at a specified value of the covariate $x$. We can obtain a pivotal quantity for doing this. The maximum likelihood estimator of $\mu(x)$ is obtained by replacing the unknown values $\alpha, \beta$ by their maximum likelihood estimators,

$$\tilde{\mu}(x) = \tilde{\alpha} + \tilde{\beta} x = \bar{Y} + \tilde{\beta}(x - \bar{x}),$$

since $\tilde{\alpha} = \bar{Y} - \tilde{\beta} \bar{x}$. Since

$$\tilde{\beta} = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} \frac{(x_i - \bar{x}) Y_i}{S_{xx}},$$

we can write

$$\tilde{\mu}(x) = \sum_{i=1}^{n} a_i Y_i \quad \text{where } a_i = \frac{1}{n} + \frac{(x - \bar{x})(x_i - \bar{x})}{S_{xx}}. \tag{6.11}$$

Since $\tilde{\mu}(x)$ is a linear combination of Gaussian random variables, it has a Gaussian distribution. We can use (6.11) to determine the mean and variance of the random variable $\tilde{\mu}(x)$. You should verify the following properties of the coefficients $a_i$:

$$\sum_{i=1}^{n} a_i = 1, \qquad \sum_{i=1}^{n} a_i x_i = x, \qquad \sum_{i=1}^{n} a_i^2 = \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}.$$

Therefore

$$E[\tilde{\mu}(x)] = \sum_{i=1}^{n} a_i E(Y_i) = \sum_{i=1}^{n} a_i (\alpha + \beta x_i) = \alpha \sum_{i=1}^{n} a_i + \beta \sum_{i=1}^{n} a_i x_i = \alpha + \beta x = \mu(x)$$

since $\sum a_i = 1$ and $\sum a_i x_i = x$, and

$$Var[\tilde{\mu}(x)] = \sigma^2 \sum_{i=1}^{n} a_i^2 = \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right].$$

Note that the variance of $\tilde{\mu}(x)$ is smallest in the middle of the data, that is, when $x$ is close to $\bar{x}$, and is much larger when $(x - \bar{x})^2$ is large.
In summary,

$$\tilde{\mu}(x) \sim G\left( \mu(x), \ \sigma \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right]^{1/2} \right)$$

and, since $\tilde{\mu}(x)$ and $S_e^2$ are independent,

$$\frac{\tilde{\mu}(x) - \mu(x)}{S_e \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right]^{1/2}} \sim t(n - 2), \tag{6.12}$$

which can be used to obtain confidence intervals for $\mu(x)$ in the usual manner. Using t-tables or R, find the constant $a$ such that $P(-a \leq T \leq a) = p$ where $T \sim t(n - 2)$. Since

$$p = P(-a \leq T \leq a) = P\left( -a \leq \frac{\tilde{\mu}(x) - \mu(x)}{S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \leq a \right)$$

$$= P\left( \tilde{\mu}(x) - a S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \leq \mu(x) \leq \tilde{\mu}(x) + a S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \right), \tag{6.13}$$

a $100p\%$ confidence interval for $\mu(x)$ is

$$\left[ \hat{\mu}(x) - a s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}, \ \hat{\mu}(x) + a s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \right]$$
where $\hat{\mu}(x) = \hat{\alpha} + \hat{\beta} x$ and

$$s_e^2 = \frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 = \frac{1}{n - 2} \left( S_{yy} - \hat{\beta} S_{xy} \right).$$

In particular, a confidence interval for the intercept $\alpha = \mu(0)$ is obtained by setting $x = 0$:

$$\hat{\alpha} \pm a s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}. \tag{6.14}$$

In fact one can see from (6.14) that if $\bar{x}$ is large in magnitude (which means the average $x_i$ is far from zero), then the confidence interval for $\alpha$ will be very wide. This would be disturbing if the value $x = 0$ is a value of interest, but often it is not. In the following example it refers to a building of area $x = 0$, which is nonsensical!
Remark: The results of the analyses below can be obtained using the R function lm, with the command lm(y ~ x). We give the detailed results below to illustrate how the calculations are made. In R, summary(lm(y ~ x)) gives a lot of useful output.
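For instance, fitting the building data of Table 6.1 with lm should reproduce the estimates computed by hand below; the numerical targets $\hat\alpha \approx 686.9$, $\hat\beta \approx -144.5$ and $s_e \approx 19.1$ come from the text, and the code itself is a sketch of the check:

```r
# Fit price versus size for the 30 buildings of Table 6.1
size  <- c(3.26, 3.08, 3.03, 2.29, 1.83, 1.65, 1.14, 1.11, 1.11, 1.00,
           0.86, 0.80, 0.77, 0.73, 0.60, 0.48, 0.46, 0.45, 0.41, 0.40,
           0.38, 0.38, 0.38, 0.38, 0.38, 0.34, 0.26, 0.24, 0.23, 0.20)
price <- c(226.2, 233.7, 248.5, 360.4, 415.2, 458.8, 509.9, 525.8, 523.7, 534.7,
           532.8, 563.4, 578.0, 597.3, 617.3, 624.4, 616.4, 620.9, 624.3, 641.7,
           636.4, 657.9, 597.3, 611.5, 670.4, 660.6, 623.8, 672.5, 673.5, 611.8)
fit <- lm(price ~ size)
coef(fit)            # intercept about 686.9, slope about -144.5
summary(fit)$sigma   # residual standard error, about 19.1
```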
From the data in Table 6.1 we obtain $\bar{x} = 0.954$, $\bar{y} = 549.0$, $S_{xx} = 22.945$, $S_{xy} = -3316.68$ and $S_{yy} = 489624.723$, so we find

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{-3316.68}{22.945} = -144.5469,$$

$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} = 549.0 - (-144.5)(0.954) = 686.9159,$$

$$s_e^2 = \frac{1}{n - 2} \left( S_{yy} - \hat{\beta} S_{xy} \right) = \frac{1}{28} \left[ 489624.723 - (-144.5)(-3316.68) \right] = 364.6199,$$

and $s_e = 19.0950$. Note that $\hat{\beta}$ is negative, which implies that the larger sized buildings tend to sell for less per square meter. (The estimate $\hat{\beta} = -144.55$ indicates a drop in average price of \$144.55 per square meter for each increase of one unit in $x$; remember $x$'s units are $m^2 \, (10^5)$.)

The line $y = \hat{\alpha} + \hat{\beta} x$ is often called the fitted regression line for $y$ on $x$. If we plot the fitted line on the same graph as the scatterplot of points $(x_i, y_i)$, $i = 1, \ldots, n$, as in Figure 6.3, we see the fitted line passes close to the points.
[Figure 6.3: Scatterplot and fitted line $y = 686.9 - 144.5x$ for building price versus size]
A confidence interval for $\alpha$ is not of major interest in the setting here, where the data were called on to indicate a fair assessment value for a large building with $x = 4.47$. One way to address this is to estimate $\mu(x)$ when $x = 4.47$. We get the maximum likelihood estimate for $\mu(4.47)$ as

$$\hat{\mu}(4.47) = \hat{\alpha} + \hat{\beta}(4.47) = \$40.79,$$

which we note is much below the assessed value of \$75 per square meter. However, one can object that there is uncertainty in this estimate, and that it would be better to give a confidence interval for $\mu(4.47)$. Using (6.13) and $P(-2.0484 \leq T \leq 2.0484) = 0.95$ for $T \sim t(28)$, we get a 95% confidence interval for $\mu(4.47)$ as

$$\hat{\mu}(4.47) \pm 2.0484 \, s_e \sqrt{\frac{1}{30} + \frac{(4.47 - \bar{x})^2}{S_{xx}}}$$

or $\$40.79 \pm \$29.58$, or $[\$11.21, \$70.37]$. Thus the assessed value of \$75 is outside this interval.

However (playing lawyer for the assessor), we could raise another objection: we are considering a single building, but we have constructed a confidence interval for the average of all buildings of size $x = 4.47 \, (\times 10^5) \, m^2$. The constructed confidence interval is for a point on the line, not a point $Y$ generated by adding to $\alpha + \beta(4.47)$ the random error $R \sim G(0, \sigma)$, which has a non-negligible variance. This suggests that what we should do is predict the $y$ value for a building with $x = 4.47$, instead of estimating $\mu(4.47)$. We will temporarily leave the example in order to develop a method to do this.
Suppose we wish to predict the response $Y$ at a specified value of the covariate $x$. Under the model, $Y = \mu(x) + R$ where $R \sim G(0, \sigma)$, so consider

$$Y - \tilde{\mu}(x) = \mu(x) + R - \tilde{\mu}(x) = R + \left[ \mu(x) - \tilde{\mu}(x) \right].$$

Since $R$ is independent of $\tilde{\mu}(x)$ (it is not connected to the existing sample), this is the sum of independent Normally distributed random variables and is consequently Normally distributed. Moreover,

$$E[Y - \tilde{\mu}(x)] = E\{R + [\mu(x) - \tilde{\mu}(x)]\} = E(R) + \mu(x) - E[\tilde{\mu}(x)] = 0 + \mu(x) - \mu(x) = 0.$$

Since $Y$ and $\tilde{\mu}(x)$ are independent we have

$$Var[Y - \tilde{\mu}(x)] = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right] = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right].$$

Thus

$$Y - \tilde{\mu}(x) \sim G\left( 0, \ \sigma \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right]^{1/2} \right)$$

and

$$\frac{Y - \tilde{\mu}(x)}{S_e \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right]^{1/2}} \sim t(n - 2).$$

For an interval estimate with confidence coefficient $p$ we choose $a$ such that $p = P(-a \leq T \leq a)$ where $T \sim t(n - 2)$. Since

$$p = P\left( -a \leq \frac{Y - \tilde{\mu}(x)}{S_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \leq a \right) = P\left( \tilde{\mu}(x) - a S_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \leq Y \leq \tilde{\mu}(x) + a S_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \right),$$

we obtain the interval

$$\left[ \hat{\mu}(x) - a s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}, \ \hat{\mu}(x) + a s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \right]. \tag{6.15}$$

This interval is usually called a $100p\%$ prediction interval instead of a confidence interval, since $Y$ is not a parameter but a future observation.
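In R, both the confidence interval (6.13) and the prediction interval (6.15) are available from predict(); a sketch using the building data, whose output should match the hand calculations in this example:

```r
# Confidence interval for mu(4.47) and prediction interval for Y
size  <- c(3.26, 3.08, 3.03, 2.29, 1.83, 1.65, 1.14, 1.11, 1.11, 1.00,
           0.86, 0.80, 0.77, 0.73, 0.60, 0.48, 0.46, 0.45, 0.41, 0.40,
           0.38, 0.38, 0.38, 0.38, 0.38, 0.34, 0.26, 0.24, 0.23, 0.20)
price <- c(226.2, 233.7, 248.5, 360.4, 415.2, 458.8, 509.9, 525.8, 523.7, 534.7,
           532.8, 563.4, 578.0, 597.3, 617.3, 624.4, 616.4, 620.9, 624.3, 641.7,
           636.4, 657.9, 597.3, 611.5, 670.4, 660.6, 623.8, 672.5, 673.5, 611.8)
fit <- lm(price ~ size)
new <- data.frame(size = 4.47)
ci <- predict(fit, new, interval = "confidence")  # interval (6.13) for mu(4.47)
pi <- predict(fit, new, interval = "prediction")  # interval (6.15) for Y
ci
pi
```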
193
Gaussian
~
p
Se = sxx
Student t
~=
Se2 =
1
n 2
~x
~ (x) = ~ + ~ x
~ (x)
r
Se
~ (x)
Standard Deviation
h
df = n
1
Sxx
i1=2
Gaussian
Gaussian
(x) =
Student t
df = n
+ x
1
n
1
n
x2
Sxx
i1=2
(x x)2
Sxx
i1=2
(x)
(x x)2
1
+ S
n
xx
Mean or df
~ Sxy
Syy
~=Y
Se
Distribution
Gaussian
h
1+
1
n
(x x)2
Sxx
i1=2
~ (x)
(x x)2
1
1+ n
+ S
xx
(n 2)Se2
2
Student t
df = n
Chi-squared
df = n
194
Note that a prediction interval for $Y$ can include negative values ($Y - \tilde{\mu}(x)$ can be positive or negative) in a setting where the price $Y$ must be positive. Nonetheless, the Gaussian model fits the data reasonably well. We might just truncate the prediction interval and take it to be $[0, \$89.83]$.

Now we find that the assessed value of \$75 is inside this interval. On this basis it's difficult to say that the assessed value is unfair (though it is towards the high end of the prediction interval). Note also that the value $x = 4.47$ of interest is well outside the interval of observed $x$ values, which was $[0.20, 3.26]$ in the data set of 30 buildings. Thus any conclusions we reach are based on an assumption that the linear model $E(Y|x) = \alpha + \beta x$ applies beyond $x = 3.26$, at least as far as $x = 4.47$. This may or may not be true, but we have no way to check it with the data we have.

There is a slight suggestion in Figure 6.3 that $Var(Y)$ may be smaller for larger $x$ values. There is not sufficient data to check this either. We mention these points because an important companion to every statistical analysis is a qualification of the conclusions, based on a careful examination of the applicability of the assumptions underlying the analysis.

Remark: Note from (6.13) and (6.15) that the confidence interval for $\mu(x)$ and the prediction interval for $Y$ are wider the further away $x$ is from $\bar{x}$. Thus, as we move further away from the middle of the $x$'s in the data, we get wider and wider intervals for $\mu(x)$ and $Y$.
Example 6.1.3 Revisited: Strength of steel bolts
Recall the data given in Example 6.1.3, where $Y$ represented the breaking strength of a randomly selected steel bolt and $x$ was the bolt's diameter. A scatterplot of points $(x_i, y_i)$ for 30 bolts suggested a nonlinear relationship between $Y$ and $x$. A bolt's strength might be expected to be proportional to its cross-sectional area, which is proportional to $x^2$. Figure 6.4 shows a plot of points $(x_i^2, y_i)$, which looks quite linear. Because of this, let us assign a new variable name to $x^2$, say $x_1 = x^2$. We then fit the linear model

$$Y_i \sim G(\alpha + \beta x_{1i}, \sigma) \quad \text{for } i = 1, \ldots, 30 \text{ independently.}$$

The fitted regression line $y = \hat{\alpha} + \hat{\beta} x_1$ is shown on the scatterplot in Figure 6.4. The model appears to fit the data well.

More as a numerical illustration, let us construct a confidence interval for $\beta$, which represents the increase in average strength $\mu(x_1)$ from increasing $x_1 = x^2$ by one unit. Using the pivotal quantity (6.10) and the fact that $P(-2.0484 \leq T \leq 2.0484) = 0.95$ for $T \sim t(28)$, we obtain the 95% confidence interval for $\beta$ as

$$\hat{\beta} \pm 2.0484 \, s_e / \sqrt{S_{xx}} = 2.8378 \pm 0.2228,$$

or $[2.6149, 3.0606]$.
195
2.5
2.4
2.3
2.2
Strength
y=1.67+2.84x
2.1
2
1.9
1.8
1.7
1.6
0
0.05
0.1
0.15
Diameter Squared
0.2
0.25
Figure 6.4: Scatterplot plus tted line for strength versus diameter squared
Models should always be checked. In problems with only one $x$ covariate, a plot of the fitted line superimposed on the scatterplot of the data (as in Figures 6.3 and 6.4) shows pretty clearly how well the model fits. If there are two or more covariates in the model, residual plots, which are described below, are very useful for checking the model assumptions.

Residuals are defined as the difference between the observed responses and the fitted values. Consider the simple linear regression model for which $Y_i \sim G(\mu_i, \sigma)$ where $\mu_i = \alpha + \beta x_i$ and $R_i = Y_i - \mu_i \sim G(0, \sigma)$, $i = 1, 2, \ldots, n$ independently. The residuals are given by

$$\hat{r}_i = y_i - \hat{\mu}_i = y_i - \hat{\alpha} - \hat{\beta} x_i \quad \text{for } i = 1, \ldots, n.$$

The idea behind the $\hat{r}_i$'s is that they can be thought of as observed $R_i$'s. This isn't exactly correct, since we are using $\hat{\mu}_i$ instead of $\mu_i$ in $\hat{r}_i$, but if the model is correct, then
196
the r^i s should behave roughly like a random sample from the G(0; ) distribution. The
r^i s do have some features that can be used to check the model assumptions. Recall that
the maximum likelihood estimate of is ^ = y ^ x which implies that y ^ ^ x = 0 or
0=y
n
^ x = 1 P yi
n i=1
n
^ xi = 1 P r^i
n i=1
1
standardized
residual
0
-1
-2
-3
10
20
30
40
50
Figure 6.5: Residual plot for example in which model assumptions hold
Residual plots can be used to check the model assumptions. Here are three residual plots which can be used:

(1) Plot the points $(x_i, \hat{r}_i)$, $i = 1, \ldots, n$. If the model is satisfactory the points should lie more or less horizontally within a constant band around the line $\hat{r}_i = 0$ (see Figure 6.5).

(2) Plot the points $(\hat{\mu}_i, \hat{r}_i)$, $i = 1, \ldots, n$. If the model is satisfactory the points should lie more or less horizontally within a constant band around the line $\hat{r}_i = 0$.

(3) Plot a Normal qqplot of the residuals $\hat{r}_i$. If the model is satisfactory the points should lie more or less along a straight line.

Departures from the expected pattern may suggest problems with the model. For example, Figure 6.6 suggests the function $\mu_i = \mu(x_i)$ is not correctly specified, whereas Figure 6.7 suggests that the variance is non-constant.
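The three checks above can be sketched in R as follows; the data values here are illustrative stand-ins for whatever data the model was fitted to:

```r
# Residual plots for a fitted simple linear regression
x <- c(0.10, 0.20, 0.30, 0.40, 0.50)       # illustrative covariate values
y <- c(1.70, 1.78, 1.92, 2.12, 2.38)       # illustrative responses
fit   <- lm(y ~ x)
rhat  <- residuals(fit)                    # the r-hat_i
muhat <- fitted(fit)                       # the mu-hat_i
plot(x, rhat); abline(h = 0)               # check (1): residuals vs covariate
plot(muhat, rhat); abline(h = 0)           # check (2): residuals vs fitted values
qqnorm(rhat); qqline(rhat)                 # check (3): Normal qqplot
```

Note that sum(rhat) is zero up to rounding, as shown above.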
[Figure 6.6: Example of residual plot which suggests that the form of $\mu(x)$ is not correctly specified]

[Figure 6.7: Example of residual plot which indicates that the assumption $Var(Y_i) = \sigma^2$ does not hold]
Reading these plots is something of an art, and we should try not to read too much into plots based on a small number of points.
[Figure 6.8: Standardized residuals versus diameter squared for the bolt data]
Often we prefer to use standardized residuals

$$\hat{r}_i^* = \frac{\hat{r}_i}{s_e} = \frac{y_i - \hat{\mu}_i}{s_e} = \frac{y_i - \hat{\alpha} - \hat{\beta} x_i}{s_e} \quad \text{for } i = 1, \ldots, n.$$

Standardized residuals were used in Figures 6.6 and 6.7. The patterns in the plots are unchanged whether we use $\hat{r}_i$ or $\hat{r}_i^*$; however, the $\hat{r}_i^*$ values tend to lie in the range $(-3, 3)$. (Why is this?)
Example 6.1.3 Revisited: Strength of steel bolts
Figure 6.8 shows a standardized residual plot for the steel bolt data where the explanatory variate is diameter squared. No deviation from the expected pattern is observed. This is of course also evident from Figure 6.4.

A further check on the Gaussian distribution is shown in Figure 6.9, in which the empirical distribution function based on the standardized residuals is plotted together with the $G(0, 1)$ cumulative distribution function. A qqplot of the standardized residuals is given in Figure 6.10.

Both figures indicate that there is reasonably good agreement with the Gaussian distribution. Remember that, since the quantiles of the Normal distribution change more rapidly in the tails of the distribution, we expect the points at both ends of the qqplot to lie further from the line.
[Figure 6.9: Empirical c.d.f. of the standardized residuals together with the $G(0, 1)$ c.d.f.]

[Figure 6.10: Qqplot of the standardized residuals against standard Normal quantiles]
6.3 Comparing the Means of Two Populations

Suppose $Y_{j1}, \ldots, Y_{jn_j}$ are independent $G(\mu_j, \sigma)$ observations from population $j$, $j = 1, 2$, with the two samples independent of each other. The likelihood function is

$$L(\mu_1, \mu_2, \sigma) = \prod_{j=1}^{2} \prod_{i=1}^{n_j} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2\sigma^2} \left( y_{ji} - \mu_j \right)^2 \right].$$

Maximizing gives $\tilde{\mu}_1 = \bar{Y}_1$, $\tilde{\mu}_2 = \bar{Y}_2$ and $\tilde{\sigma}^2 = \frac{1}{n_1 + n_2} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (Y_{ji} - \tilde{\mu}_j)^2$. As in the single-sample case, we prefer an estimator of $\sigma^2$ whose divisor accounts for the number of estimated means, the pooled estimator

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$

where

$$S_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y}_j)^2, \quad j = 1, 2,$$

are the sample variances obtained from the individual samples. The estimator $S_p^2$ can be written as a weighted average of the estimators $S_j^2$. In fact

$$S_p^2 = \frac{w_1 S_1^2 + w_2 S_2^2}{w_1 + w_2} \tag{6.16}$$
201
where the weights are wj = nj 1. Although you could substitute weights other than
nj 1 in (6.16)44 , when you pool various estimators in order to obtain one that is better
than any of those being pooled, you should do so with weights that relate to a measure of
precision of the estimators. For sample variances, the number of degrees of freedom is such
an indicator.
We will use the estimator Sp2 for 2 rather than ~ 2 since E Sp2 = 2 .
To determine whether the two populations dier and by how much we will need to generate
condence intervals for the dierence 1
2 . First note that the maximum likelihood
estimator of this dierence is Y 1 Y 2 which has expected value
E(Y 1
Y 2) =
and variance
2
V ar(Y 1
Y 2 ) = V ar(Y 1 ) + V ar(Y 2 ) =
n1
n2
1
1
+
n1 n2
Sp2
Theorem 37 If $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ is a random sample from the $G(\mu_1, \sigma)$ distribution and independently $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$ is a random sample from the $G(\mu_2, \sigma)$ distribution, then

$$\frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$

and

$$\frac{(n_1 + n_2 - 2) S_p^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y}_j)^2 \sim \chi^2(n_1 + n_2 - 2).$$

Confidence intervals for $\mu_1 - \mu_2$ follow in the usual way by choosing $a$ such that $P(-a \leq T \leq a) = p$, where $T \sim t(n_1 + n_2 - 2)$.
Paint A:  12.5  11.7   9.9   9.6  10.3   9.6   9.4  11.3   8.7  11.5  10.6   9.7
Paint B:   9.4  11.6   9.7  10.4   6.9   7.3   8.4   7.2   7.0   8.2  12.7   9.2
The objectives of the experiment were to test whether the average reflectivities for paints A and B are the same, and if there is evidence of a difference, to obtain a confidence interval for their difference. (In many problems where two attributes are to be compared we start by testing the hypothesis that they are equal, even if we feel there may be a difference. If there is no statistical evidence of a difference then we stop there.)

To do this it is assumed that, to a close approximation, the reflectivity measurements $Y_{1i}$, $i = 1, \ldots, 12$ for paint A are independent $G(\mu_1, \sigma_1)$ random variables, and independently the measurements $Y_{2i}$, $i = 1, \ldots, 12$ for paint B are independent $G(\mu_2, \sigma_2)$ random variables. We can test $H_0: \mu_1 - \mu_2 = 0$ and get confidence intervals for $\mu_1 - \mu_2$ by using the pivotal quantity

$$\frac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{12} + \frac{1}{12}}} \sim t(22). \tag{6.18}$$

We have assumed$^{45}$ that the two population variances are identical, $\sigma_1 = \sigma_2 = \sigma$, with $\sigma^2$ estimated by

$$s_p^2 = \frac{1}{22} \left[ \sum_{i=1}^{12} (y_{1i} - \bar{y}_1)^2 + \sum_{i=1}^{12} (y_{2i} - \bar{y}_2)^2 \right].$$
To test $H_0: \mu_1 - \mu_2 = 0$ we use the test statistic

$$\frac{|\bar{Y}_1 - \bar{Y}_2 - 0|}{S_p \sqrt{\frac{1}{12} + \frac{1}{12}}} = \frac{|\bar{Y}_1 - \bar{Y}_2|}{S_p \sqrt{\frac{1}{12} + \frac{1}{12}}}.$$

For the data above,

$$n_1 = 12, \quad \bar{y}_1 = 10.4, \quad \sum_{i=1}^{12} (y_{1i} - \bar{y}_1)^2 = 14.08,$$

$$n_2 = 12, \quad \bar{y}_2 = 9.0, \quad \sum_{i=1}^{12} (y_{2i} - \bar{y}_2)^2 = 38.64.$$

This gives $\hat{\mu}_1 - \hat{\mu}_2 = \bar{y}_1 - \bar{y}_2 = 1.4$ and $s_p^2 = 2.3964$. The observed value of the test statistic is

$$d = \frac{|\bar{y}_1 - \bar{y}_2|}{s_p \sqrt{\frac{1}{12} + \frac{1}{12}}} = \frac{1.4}{\sqrt{2.3964 \cdot \frac{1}{6}}} = 2.22$$
$^{45}$If the sample variances differed by a great deal we would not make this assumption. Unfortunately, if the variances are not assumed equal the problem becomes more difficult.
203
with
p
value = P (jT j
2:22) = 2 [1
P (T
2:22)] = 0:038
= P @ 2:074
=P
2:074Sp
2
12
^2
2:074sp
Y2 ( 1
q
1
1
Sp 12
+ 12
1
2)
2:074)A
2:074Sp
2
12
as
1
1
+
or [0:09; 2:71] :
12 12
This suggests that although the dierence in reectivity (and durability) of the paint is
statistically signicant, the size of the dierence is not really large relative to the sizes of
1 and 2 . (Look at ^ 1 = y1 = 14:08 and ^ 2 = y2 = 9:0. The relative dierences are of the
order of 10%).
Remark: The R function t.test will carry out the test above and will give condence
intervals for 1
2 . This can be done with the command t.test(y1 ,y2 ,var.equal=T),
where y1 and y2 are the data vectors from 1 and 2.
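Running t.test on the reflectivity data above reproduces the hand calculations ($d = 2.22$, $p$-value $0.038$, 95% confidence interval $[0.09, 2.71]$):

```r
# Two-sample pooled-variance t-test for the paint data
y1 <- c(12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7)
y2 <- c(9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2)
out <- t.test(y1, y2, var.equal = TRUE)
out$statistic   # observed t, about 2.22
out$p.value     # about 0.038
out$conf.int    # about [0.09, 2.71]
```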
If we are unwilling to assume that the two standard deviations are equal, we can use the fact that for large $n_1$ and $n_2$,

$$\frac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \tag{6.19}$$

has approximately a $G(0, 1)$ distribution.
small; the standard deviations $s_1 = 1.13$ and $s_2 = 1.87$ do not provide evidence against the hypothesis that $\sigma_1 = \sigma_2$ if a likelihood ratio test is carried out. Nevertheless, let us use (6.19) to obtain a 95% confidence interval for $\mu_1 - \mu_2$. The resulting approximate 95% confidence interval is

$$\bar{y}_1 - \bar{y}_2 \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \tag{6.20}$$

For the given data this equals $1.4 \pm 1.24$, or $[0.16, 2.64]$, which is not much different than the interval obtained assuming the two Gaussian distributions have the same standard deviation.
Example 6.3.2 Scholastic Achievement Test Scores
Tests that are designed to measure the achievement of students are often given in various subjects. Educators and parents often compare results for different schools or districts. We consider here the scores on a mathematics test given to Canadian students in the 5th grade. Summary statistics (sample sizes, means, and standard deviations) of the scores $y$ for the students in two small school districts in Ontario are as follows:

District 1:   $n_1 = 278$   $\bar{y}_1 = 60.2$   $s_1 = 10.16$
District 2:   $n_2 = 345$   $\bar{y}_2 = 58.1$   $s_2 = 9.02$

The average score is somewhat higher in District 1, but is this difference statistically significant? We will give a confidence interval for the difference in average scores in a model representing this setting. This is done by thinking of the students in each district as a random sample from a conceptual large population of similar students writing similar tests. We assume that the scores in District 1 have a $G(\mu_1, \sigma_1)$ distribution and that the scores in District 2 have a $G(\mu_2, \sigma_2)$ distribution. We can then test the hypothesis $H_0: \mu_1 = \mu_2$ or alternatively construct a confidence interval for the difference $\mu_1 - \mu_2$. (Achievement tests are usually designed so that the scores are approximately Gaussian, so this is a sensible procedure.)

Since $n_1 = 278$ and $n_2 = 345$ are large, we use (6.20) to construct an approximate 95% confidence interval for $\mu_1 - \mu_2$. We obtain

$$60.2 - 58.1 \pm 1.96 \sqrt{\frac{(10.16)^2}{278} + \frac{(9.02)^2}{345}} = 2.1 \pm (1.96)(0.779) \quad \text{or} \quad [0.57, 3.63].$$

Since $\mu_1 - \mu_2 = 0$ is outside the approximate 95% confidence interval (can you show that it is also outside the approximate 99% confidence interval?) we can conclude there is fairly strong evidence against the hypothesis $H_0: \mu_1 = \mu_2$, suggesting that $\mu_1 > \mu_2$. We should not rely only on a comparison of the means, however; it is a good idea to look carefully at the data and the distributions suggested for the two groups using histograms or boxplots.
The mean is a little higher for District 1 and because the sample sizes are so large, this
gives a statistically signicant dierence in a test of H0 : 1 = 2 . However, it would
205
be a mistake46 to conclude that the actual dierence in the two distributions is very large.
Unfortunately, signicant tests like this are often used to make claims about one group
or class or school is superior to another and such conclusions are unwarranted if, as is
often the case, the assumptions of the test are not satised.
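Arithmetic like this is easy to check by machine. The notes use R for computation, but the same check can be sketched in Python (the numbers below are the summary statistics quoted above; this sketch is not part of the original notes):

```python
import math

# Summary statistics from Example 6.3.2 (two school districts)
n1, ybar1, s1 = 278, 60.2, 10.16
n2, ybar2, s2 = 345, 58.1, 9.02

# Approximate 95% confidence interval (6.20) for mu1 - mu2:
# ybar1 - ybar2 +/- 1.96 * sqrt(s1^2/n1 + s2^2/n2)
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
lo = (ybar1 - ybar2) - 1.96 * se
hi = (ybar1 - ybar2) + 1.96 * se
print(round(se, 3), round(lo, 2), round(hi, 2))  # 0.779 0.57 3.63
```

Since the interval excludes 0, the same conclusion as the likelihood ratio test is reached.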
μ1 = E(Y1i) and μ2 = E(Y2i)

46. We assume independence of the samples. How likely is it that marks in a class are independent of one
another and no more alike than marks between two classes or two different years?
47. See the video at www.watstat.ca called "Paired Confidence Intervals".
48. Ask yourself: if I had (another?) brother/sister, how tall would they grow to?
consumptions Y1i, Y2i for the ith car are related, because factors such as size, weight and
engine size (and perhaps the driver) affect consumption. As in the preceding example
it would not be appropriate to treat the Y1i's (i = 1,...,50) and Y2i's (i = 1,...,50)
as two independent samples from larger populations. The observations have been paired
deliberately to eliminate some factors (like driver/car size) which might otherwise affect
the conclusion. Note that in this example it may not be of much interest to consider E(Y1i)
and E(Y2i) separately, since there is only a single observation on each car type for either
fuel.
Two types of Gaussian models are used to represent settings involving paired data.
The first involves what is called a Bivariate Normal distribution for (Y1i, Y2i), and it could
be used in the fuel consumption example. This is a continuous bivariate model for which
each component has a Normal distribution and the components may be dependent. We
will not describe this model here49 (it is studied in third year courses), except to note one
fundamental property: if (Y1i, Y2i) has a Bivariate Normal distribution then the difference
between the two is also Normally distributed:
    Y1i − Y2i ~ N(μ1 − μ2, σ²)    (6.21)

The second model assumes

    Y1i ~ G(μ1 + αi, σ1)  and  Y2i ~ G(μ2 + αi, σ2)  independently,

where the αi's are unknown constants. The αi's represent factors specific to the different
pairs, so that some pairs can have larger (smaller) expected values than others. This model
also gives a Gaussian distribution like (6.21), since

    E(Y1i − Y2i) = μ1 − μ2  and  Var(Y1i − Y2i) = σ1² + σ2²  (the αi's cancel).

This model seems relevant for the fuel consumption example, where αi refers to the ith car type.
Thus, whenever we encounter paired data in which the variation in the variables Y1i and
Y2i is adequately modeled by Gaussian distributions, we will make inferences about μ1 − μ2
by working with the model (6.21).
49. For Stat 241: Let Y = (Y1,...,Yk)ᵀ be a k × 1 random vector with E(Yi) = μi and Cov(Yi, Yj) = σij,
i, j = 1,...,k. (Note: Cov(Yi, Yi) = σii = Var(Yi) = σi².) Let μ = (μ1,...,μk)ᵀ be the mean vector and
Σ be the k × k symmetric covariance matrix whose (i, j) entry is σij. Suppose also that Σ⁻¹ exists. If the joint
p.d.f. of (Y1,...,Yk) is given by

    f(y1,...,yk) = (2π)^(−k/2) |Σ|^(−1/2) exp[−(1/2)(y − μ)ᵀ Σ⁻¹ (y − μ)],  y ∈ ℝᵏ,

where y = (y1,...,yk)ᵀ, then Y is said to have a Multivariate Normal distribution. The case k = 2 is called
bivariate normal.
interval for μ1 − μ2.
We note that it is slightly wider than the 95% confidence interval [4.76, 5.03] obtained
using the pairings.
To see why the pairing is helpful in estimating the mean difference μ1 − μ2, suppose that
Y1i ~ G(μ1, σ1) and Y2i ~ G(μ2, σ2), but that Y1i and Y2i are not necessarily independent
(i = 1, 2,...,n). The estimator of μ1 − μ2 is Ȳ1 − Ȳ2, and we have that E(Ȳ1 − Ȳ2) = μ1 − μ2
and

    Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) − 2Cov(Ȳ1, Ȳ2) = (σ1² + σ2² − 2σ12)/n,

where σ12 = Cov(Y1i, Y2i).
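The variance formula above can be checked by simulation. The Python sketch below (with hypothetical parameter values; it is not part of the original notes) generates pairs sharing a common random "pair effect", so that Cov(Y1i, Y2i) > 0, and confirms that the sampling variance of Ȳ1 − Ȳ2 is close to (σ1² + σ2² − 2σ12)/n, much smaller than the value (σ1² + σ2²)/n an unpaired analysis would assume:

```python
import random

random.seed(1)
n, reps = 50, 5000
tau, sigma = 2.0, 1.0       # hypothetical pair-effect sd and within-pair sd
diffs = []
for _ in range(reps):
    total = 0.0
    for _ in range(n):
        a = random.gauss(0, tau)          # shared pair effect alpha_i
        y1 = a + random.gauss(0, sigma)   # Var(Y1i) = tau^2 + sigma^2 = 5
        y2 = a + random.gauss(0, sigma)   # Cov(Y1i, Y2i) = tau^2 = 4
        total += y1 - y2
    diffs.append(total / n)               # one observed ybar1 - ybar2

m = sum(diffs) / reps
v = sum((d - m) ** 2 for d in diffs) / (reps - 1)
# theory: (sigma1^2 + sigma2^2 - 2*sigma12)/n = 2*sigma^2/n = 0.04,
# while ignoring the pairing suggests (sigma1^2 + sigma2^2)/n = 0.20
print(round(v, 3))
```

The simulated variance lands near 0.04, a five-fold reduction over the unpaired value.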
Subject:  1     2     3     4     5     6     7     8     9     10
yi:       0.01  0.05  0.61  0.36  0.30  0.06  0.11  0.28  0.86  0.06
Table 6.3 shows the cholesterol levels y (in mmol per liter) for each subject, measured at
the end of each 6 week period. We let the random variables Y1i, Y2i represent the cholesterol
levels for subject i on the high fibre and low fibre diets, respectively. We'll also assume that
the differences are represented by the model

    Yi = Y1i − Y2i ~ G(μ1 − μ2, σ)  for i = 1,...,20.

The differences yi are also shown in Table 6.3, and from them we calculate the sample mean
and standard deviation

    ȳ = −0.020 and s = 0.411.

Since P(T ≤ 2.093) = 1 − 0.025 = 0.975 where T ~ t(19), a 95% confidence interval for
μ1 − μ2 given by (6.17) is

    ȳ ± 2.093 s/√n = −0.020 ± 2.093(0.411)/√20 = −0.020 ± 0.192, or [−0.212, 0.172].

This confidence interval includes μ1 − μ2 = 0, and there is clearly no evidence that the high
fibre diet gives a lower cholesterol level, at least in the time frame represented in this study.
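As the remark below notes, R's t.test reproduces this interval. As a cross-check, the interval can also be computed directly from the summary statistics (Python sketch, not part of the original notes, using the t quantile 2.093 quoted above):

```python
import math

n, ybar, s = 20, -0.020, 0.411   # summary statistics of the differences
t975 = 2.093                     # P(T <= 2.093) = 0.975 for T ~ t(19)

half = t975 * s / math.sqrt(n)
lo, hi = ybar - half, ybar + half
print(round(lo, 3), round(hi, 3))  # -0.212 0.172
```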
Remark: The results here can be obtained using the R function t.test.
Exercise: Compute the p-value for the test of the hypothesis H0: μ1 = μ2 using the test
statistic (5.1).
Final Remarks: When you see data from a comparative study (that is, one whose
objective is to compare two distributions, often through their means), you have to determine
whether it involves paired data or not. Of course, a sample of Y1i's and Y2i's cannot be from
a paired study unless there are equal numbers of each, but if there are equal numbers the
study might be either paired or unpaired. Note also that there is a subtle difference in
the study populations in paired and unpaired studies. In the former it is pairs of individual
units that form the population, whereas in the latter there are (conceptually at least)
separate individual units for Y1 and Y2 measurements.
6.4 More General Gaussian Response Models

Consider the Gaussian response model

    Yi ~ G(μi, σ)  with  μi = μ(xi) = Σ_{j=1}^k βj xij  for i = 1, 2,...,n independently.

(Note: To facilitate the matrix proof below we have taken β0 = 0 in (6.1). The estimator of
β0 can be obtained from the result below by letting xi1 = 1 for i = 1,...,n and β0 = β1.)
For convenience we define the n × k (where n > k) matrix X of covariate values as

    X = (xij)  for i = 1,...,n and j = 1, 2,...,k

and the n × 1 vector of responses Y_{n×1} = (Y1,...,Yn)ᵀ. We assume that the values xij
are non-random quantities which we observe. We now summarize some results about the
maximum likelihood estimators of the parameters β = (β1,...,βk)ᵀ and σ.
The maximum likelihood estimators of β = (β1,...,βk)ᵀ and of σ are

    β̃ = (XᵀX)⁻¹XᵀY    (6.22)

and

    σ̃² = (1/n) Σ_{i=1}^n (Yi − μ̃i)²,  where μ̃i = Σ_{j=1}^k β̃j xij.    (6.23)

To derive these, note that the likelihood function is

    L(β, σ) = Π_{i=1}^n (1/(σ√(2π))) exp[−(1/(2σ²))(yi − μi)²],  where μi = Σ_{j=1}^k βj xij,

so that the log-likelihood is

    ℓ(β, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^n (yi − μi)².

Setting ∂ℓ/∂βj = 0 gives the system of equations

    Σ_{i=1}^n (yi − μi) xij = 0
for each j = 1, 2,...,k. In terms of the matrix X and the vector y = (y1,...,yn)ᵀ we can
rewrite this system of equations more compactly as

    Xᵀ(y − Xβ) = 0  or  Xᵀy = XᵀXβ.

Assuming that the k × k matrix XᵀX has an inverse, we can solve these equations to obtain
the maximum likelihood estimate of β in matrix notation as

    β̂ = (XᵀX)⁻¹Xᵀy,

with corresponding maximum likelihood estimator β̃ = (XᵀX)⁻¹XᵀY.
In order to find the maximum likelihood estimator of σ, we take the derivative with respect
to σ and set the derivative equal to zero,

    ∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (yi − μi)² = 0,

and obtain

    σ̂² = (1/n) Σ_{i=1}^n (yi − μ̂i)²,  where μ̂i = Σ_{j=1}^k β̂j xij,

with corresponding maximum likelihood estimator

    σ̃² = (1/n) Σ_{i=1}^n (Yi − μ̃i)²,  where μ̃i = Σ_{j=1}^k β̃j xij.
Recall that when we estimated the variance for a single sample from the Gaussian
distribution we considered a minor adjustment to the denominator, and with this in mind
we also define the following estimator of the variance σ²:

    Se² = (1/(n − k)) Σ_{i=1}^n (Yi − μ̃i)² = (n/(n − k)) σ̃².

Note that for large n there will be small differences between the observed values of σ̃² and
Se².
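A quick numerical check of β̂ = (XᵀX)⁻¹Xᵀy: the sketch below (plain Python, not part of the original notes; k = 2 with an intercept column, and data chosen to lie exactly on the line y = 2 + 3x) solves the normal equations Xᵀy = XᵀXβ directly:

```python
# Design matrix with x_i1 = 1 (intercept) and x_i2 = x
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [2, 5, 8, 11]          # exactly y = 2 + 3x

# X^T X (2x2) and X^T y (2x1)
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(X[r][i] * y[r] for r in range(len(y))) for i in range(2)]

# invert the 2x2 matrix X^T X and form beta_hat = (X^T X)^{-1} X^T y
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]
beta = [inv[i][0] * xty[0] + inv[i][1] * xty[1] for i in range(2)]
print([round(b, 10) for b in beta])  # [2.0, 3.0]
```

With exact straight-line data the residuals are zero, so σ̂² = 0 here; real data would give a positive σ̂².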
Theorem 39
1. The estimators β̃j are all Normally distributed random variables with
expected value βj and with variance given by the jth diagonal element of the matrix
σ²(XᵀX)⁻¹, j = 1, 2,...,k.
2. The random variable

    W = nσ̃²/σ² = (n − k)Se²/σ²    (6.24)

has a Chi-squared distribution with n − k degrees of freedom.
Proof: Write β̃ = (XᵀX)⁻¹XᵀY = BY, where B = (XᵀX)⁻¹Xᵀ, so that β̃j = Σ_{i=1}^n bji Yi
is a linear combination of the independent Gaussian random variables Yi and is therefore
Normally distributed. Its expected value is

    E(β̃j) = Σ_{i=1}^n bji E(Yi) = Σ_{i=1}^n bji μi = Σ_{i=1}^n bji Σ_{l=1}^k βl xil,

which is the jth component of the vector BXβ. But since BX = (XᵀX)⁻¹XᵀX is
the identity matrix, this is the jth component of the vector β, or βj. Thus E(β̃j) = βj for all j. The calculation of
the variance is similar:

    Var(β̃j) = Σ_{i=1}^n bji² Var(Yi) = σ² Σ_{i=1}^n bji²,

and since Σ_{i=1}^n bji² is the (j, j) element of BBᵀ = (XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = (XᵀX)⁻¹, the
variance of β̃j is σ² times the jth
diagonal element of the matrix (XᵀX)⁻¹. We will not attempt to prove part (3) here,
which is usually proved in a subsequent statistics course.
Remark: The maximum likelihood estimate β̂ is also called a least squares estimate
of β, in that it is obtained by taking the sum of squared vertical distances between the
observations yi and the corresponding fitted values μ̂i and then adjusting the values of the
estimated βj until this sum is minimized. Least squares is a method of estimation in linear
models that predates the method of maximum likelihood. Problem 21 describes the method
of least squares.
Remark: From Theorem 39 we can obtain confidence intervals and test hypotheses for
the regression coefficients using the pivotal quantity55

    (β̃j − βj)/(Se√cj) ~ t(n − k)    (6.25)

where cj is the jth diagonal element of the matrix (XᵀX)⁻¹. A 100p% confidence interval
for βj is

    [β̂j − a se√cj, β̂j + a se√cj],

where P(T ≤ a) = (1 + p)/2 with T ~ t(n − k), and

    se² = (1/(n − k)) Σ_{i=1}^n (yi − μ̂i)²  with  μ̂i = Σ_{j=1}^k β̂j xij.

55. Recall: if Z ~ G(0, 1) and W ~ χ²(m) independently, then Z/√(W/m) ~ t(m). Let
Z = (β̃j − βj)/(σ√cj), W = (n − k)Se²/σ², and m = n − k; then Z/√(W/m) is the pivotal
quantity (6.25).
We now consider a special case of the Gaussian response models. We have already
seen this case in Chapter 4, but it provides a simple example to validate the more general
formulae.

Single Gaussian distribution

Here Yi ~ G(μ, σ), i = 1,...,n, i.e. μ(xi) = μ and xi = xi1 = 1 for all i = 1, 2,...,n;
since k = 1 we use the parameter μ instead of β = (β1). Notice that X_{n×1} = (1, 1,...,1)ᵀ in
this case. This special case was also mentioned in Section 6.1. The pivotal quantity (6.25)
becomes

    (μ̃ − μ)/(Se√c1) = (μ̃ − μ)/(S/√n)

since (XᵀX)⁻¹ = 1/n. This pivotal quantity has the t distribution with n − k = n − 1
degrees of freedom. You can also verify using (6.24) that

    (n − 1)S²/σ²

has a Chi-squared(n − 1) distribution.
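This collapse of the matrix formulas to the familiar single-sample ones is easy to verify numerically. A minimal Python sketch (hypothetical data, not part of the original notes) checks that with X a column of ones, (XᵀX)⁻¹ = 1/n and the estimate (XᵀX)⁻¹Xᵀy is just ȳ:

```python
y = [4.1, 3.7, 5.2, 4.8, 4.4]   # hypothetical sample, n = 5
n = len(y)

# X is the n x 1 column of ones, so X^T X = n and (X^T X)^{-1} = 1/n
xtx_inv = 1 / n
mu_hat = xtx_inv * sum(y)        # (X^T X)^{-1} X^T y = ybar
print(xtx_inv, round(mu_hat, 2))  # 0.2 4.44
```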
6.5 Chapter 6 Problems
1. Twelve female nurses working at a large hospital were selected at random and their
age (x) and systolic blood pressure (y) were recorded. The data are:

    x:  56  46  72  36  63  47  55  49  38  42  68  60
    y: 147 125 160 118 149 128 150 145 115 140 152 155

    Sxx = Σ_{i=1}^{12} (xi − x̄)² = 1550.67,  Syy = Σ_{i=1}^{12} (yi − ȳ)²,
    Sxy = Σ_{i=1}^{12} (xi − x̄)(yi − ȳ) = 1764.67

Assuming the model Yi ~ G(α + βxi, σ), estimate the parameters α, β and σ.
2. The following temperatures (°F) were recorded:

    189.5 188.8 188.5 185.7 186.0 185.6 184.1 184.6 184.1 183.2 182.4 181.9 181.9 181.0 180.6

The data for the measurement-comparison problem below are:

    y:  3.70  6.26  7.80  9.78 12.40
    x: 13.81 15.90 17.23 20.24 24.81
    y: 13.02 16.00 17.27 19.90 24.90
    x: 24.85 28.51 30.92 31.44 33.22
    y: 24.69 27.88 30.80 31.03 33.01
    x: 36.90 37.26 38.94 39.62 40.15
    y: 37.54 37.20 38.40 40.03 39.40

    ȳ = 23.5505,  Syy = Σ_{i=1}^{20} (yi − ȳ)² = 2820.862295,
    Sxx = Σ_{i=1}^{20} (xi − x̄)² = 2818.946855,  Sxy = Σ_{i=1}^{20} (xi − x̄)(yi − ȳ) = 2818.556835

Assume the model Yi ~ G(α + βxi, σ), i = 1,...,20.

(a) Fit the model to these data and obtain 95% confidence intervals for the slope β
and test the hypothesis β = 1. Why is this hypothesis of interest?
(b) Obtain 95% confidence intervals for the intercept α and test the hypothesis
α = 0. Why is this hypothesis of interest?
(c) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(d) Describe briefly how you would characterize the cheap measurement process's
accuracy to a lay person.
(e) If the units to be measured have true concentrations in the range 0-40, do you
think that the cheap method tends to produce a value that is lower than the true
concentration? Support your answer based on the data and the assumed model.
5. Regression through the origin: Consider the model Yi ~ G(βxi, σ), i = 1,...,n
independently.

(a) Show that the maximum likelihood estimate of β is

    β̂ = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi²

and that the corresponding estimator satisfies

    β̃ = Σ_{i=1}^n xi Yi / Σ_{i=1}^n xi² ~ N(β, σ²/Σ_{i=1}^n xi²).

Hint: Write β̃ in the form Σ ai Yi.

(b) Show that

    Σ_{i=1}^n (yi − β̂xi)² = Σ_{i=1}^n yi² − (Σ_{i=1}^n xi yi)²/Σ_{i=1}^n xi².

(c) Show that

    (β̃ − β)/(S/√(Σ_{i=1}^n xi²)) ~ t(n − 1),  where S² = (1/(n − 1)) Σ_{i=1}^n (Yi − β̃xi)²,

and use this pivotal quantity to test the hypothesis β = 0.
6. For the data of Problem 4,

    Σ_{i=1}^{20} xi yi = 13984.5554,  Σ_{i=1}^{20} xi² = 14058.9097,  Σ_{i=1}^{20} yi² = 13913.3833.

Fit the model Yi ~ G(βxi, σ) and test the hypothesis β = 1.

(d) Check the model assumptions by examining the residual plot (xi, r̂i), i = 1, 2,...,20,
where r̂i = yi − β̂xi, and a qqplot of the standardized residuals.
(e) Using the results of this analysis as well as the analysis in Problem 4, what would you
conclude about using the model Yi ~ G(α + βxi, σ) versus Yi ~ G(βxi, σ) for
these data?
7. The following data were recorded concerning the relationship between drinking (x =
per capita wine consumption) and y = death rate from cirrhosis of the liver in n = 46
states of the U.S.A. (for simplicity the data has been rounded):

    x:  5  4  3  7 11  9  6  3 12  7 14 12
    y: 41 32 39 58 75 60 54 48 77 57 81 34

    x: 10 10 14  9  7 18  6  31 13  20 19 10
    y: 53 55 58 63 67 57 38 130 70 104 84 66

    x:  4 16  9  6  6 21 15 17  7 13  8  28
    y: 52 87 67 40 56 58 74 98 41 67 48 123

    x: 23 22 23  7 16  2  6  3  8 13
    y: 92 76 98 34 91 30 28 52 56 56

    Σ_{i=1}^{46} xi = 533,  Σ_{i=1}^{46} (xi − x̄)² = 2155.2,
    Σ_{i=1}^{46} yi = 2925,  Σ_{i=1}^{46} (yi − ȳ)² = 24801,
    Σ_{i=1}^{46} (xi − x̄)(yi − ȳ) = 6175

[Scatterplot: death rate from cirrhosis (y) versus per capita consumption of wine (x).]

Assume the model Yi ~ G(α + βxi, σ), i = 1,...,46.
8. Skinfold body measurements are used to approximate the body density of individuals.
The data on n = 92 men, aged 20-25, where x = skinfold measurement and Y = body
density, are given in Appendix C as well as being posted on the course website. Run
the R code below.
Note: The R function lm, with the command lm(y~x), gives the detailed calculations for linear regression. The command summary(lm(y~x)) also gives useful output.

>Dataset<-read.table("Skinfold Data.txt",header=T,sep="",strip.white=T)
# reads data and headers from file Skinfold Data.txt
>RegModel<-lm(BodyDensity~Skinfold,data=Dataset)
# runs regression BodyDensity=a+b*Skinfold
>summary(RegModel) # summary of output on next page

The output is as follows:

Call:
lm(formula = BodyDensity ~ Skinfold, data = Dataset)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0251400 -0.0040412 -0.0001752  0.0041324  0.0192336

Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  1.161139   0.005429    213.90   <2e-16 ***
Skinfold    -0.062066   0.003353    -18.51   <2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007877 on 90 degrees of freedom

Diagnostic plots:
>x<-Dataset$Skinfold
>y<-Dataset$BodyDensity
>muhat<-1.161139-0.062066*x
>plot(x,y)
>points(x,muhat,type="l")
>title(main="Scatterplot of Skinfold/BodyDensity with fitted line")

Residual plots:
>r<-RegModel$residuals
>x<-Dataset$Skinfold
>plot(x,r)
>title(main="residual plot: Skinfold vs residual")
>muhat<-1.161139-0.062066*x
>plot(muhat,r)
9. Treatment Group: 24 43 58 71 43 49 61 44 67 49 53 56 59 52 62 54 57 33 46 43 57
   Control Group: 42 43 55 26 62 37 33 41 19 54 20 85 46 10 17 60 53 42 37 42 55 28 48

Let y1j = the DRP test score for the treatment group, j = 1,...,21. Let y2j = the
DRP test score for the control group, j = 1,...,23. For these data

    ȳ1 = 51.4762,  Σ_{j=1}^{21} (y1j − ȳ1)² = 2423.2381,
    ȳ2 = 41.5217,  Σ_{j=1}^{23} (y2j − ȳ2)² = 6469.7391.

Assume the model

    Y1j ~ G(μ1, σ), j = 1,...,21 independently,  and  Y2j ~ G(μ2, σ), j = 1,...,23 independently,

where μ1, μ2 and σ are unknown parameters.

(a) Explain what μ1, μ2 and σ represent.
(d) Test the hypothesis of no difference between the means, that is, test the hypothesis H0: μ1 = μ2.
10. To compare the mathematical abilities of incoming first year students in Mathematics and Engineering, 30 Math students and 30 Engineering students were selected
randomly from their first year classes and given a mathematics aptitude test. A summary of the resulting marks xi (for the math students) and yi (for the engineering
students), i = 1,...,30, is as follows:

    Math students: n = 30, x̄ = 120, Σ_{i=1}^{30} (xi − x̄)² = 3050
    Engineering students: n = 30, ȳ = 114, Σ_{i=1}^{30} (yi − ȳ)² = 2937

Obtain a 95% confidence interval for the difference in mean scores for first year Math
and Engineering students, and test the hypothesis that the difference is zero.
11. A study was done to compare the durability of diesel engine bearings made of two
different compounds. Ten bearings of each type were tested. The following table gives
the times until failure (in units of millions of cycles):

    Type I  (y1i):  3.03  5.53  5.60  9.30  9.92 12.51 12.95 15.21 16.04 16.84
    Type II (y2i):  3.19  4.26  4.47  4.53  4.67  4.69 12.78  6.79  9.37 12.75

    ȳ1 = 10.693,  Σ_{i=1}^{10} (y1i − ȳ1)² = 209.02961,
    ȳ2 = 6.75,  Σ_{i=1}^{10} (y2i − ȳ2)² = 116.7974

(a) Assuming that Y, the number of million cycles to failure, has a Normal distribution with the same variance for each type of bearing, obtain a 90% confidence
interval for the difference in the means μ1 and μ2 of the two distributions.
(b) Test the hypothesis that μ1 = μ2.
(c) It has been suggested that log failure times are approximately Normally distributed, but not failure times. Assuming that the log Y's for the two types of
bearing are Normally distributed with the same variance, test the hypothesis
that the two distributions have the same mean. How does the answer compare
with that in part (b)?
(d) How might you check whether Y or log Y is closer to Normally distributed?
(e) Give a plot of the data which could be used to describe the data and your
analysis.
12. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and
the numbers of cycles to failure were observed. The sample mean and variance of the
log failure times were ȳ = 14.564 and s² = 0.0914. Similar tests on four additional
girders with repaired welds gave ȳ = 14.291 and s² = 0.0422. Log failure times are
assumed to be independent with a G(μ, σ) distribution. Assuming equal variances,
obtain a 90% confidence interval for the difference in mean log failure time.
13. Consider the data in Problem 9 of Chapter 1 on the lengths of male and female
coyotes.
(a) Construct a 95% confidence interval for the difference in mean lengths for the two
sexes. State your assumptions.
(b) Estimate P(Y1 > Y2) (give the maximum likelihood estimate), where Y1 is the
length of a randomly selected female and Y2 is the length of a randomly selected
male. Can you suggest how you might get a confidence interval?
(c) Give separate confidence intervals for the average length of males and females.
14. To assess the effect of a low dose of alcohol on reaction time, a sample of 24 student
volunteers took part in a study. Twelve of the students (randomly chosen from the 24)
were given a fixed dose of alcohol (adjusted for body weight) and the other twelve got
a nonalcoholic drink which looked and tasted the same as the alcoholic drink. Each
student was then tested using software that flashes a coloured rectangle randomly
placed on a screen; the student has to move the cursor into the rectangle and double
click the mouse. As soon as the double click occurs, the process is repeated, up to a
total of 20 times. The response variate is the total reaction time (i.e. time to complete
the experiment) over the 20 trials. The data are given below.
    Alcohol Group:     1.33 1.55 1.43 1.35 1.17 1.35 1.17 1.80 1.68 1.19 0.96 1.46
    Non-Alcohol Group: 1.68 1.30 1.85 1.64 1.62 1.69 1.40 1.43 1.57 1.82 1.41 1.78

    ȳ1 = 16.44/12 = 1.370,  Σ_{i=1}^{12} (y1i − ȳ1)² = 0.608
    ȳ2 = 19.19/12 = 1.599,  Σ_{i=1}^{12} (y2i − ȳ2)² = 0.35569
Analyze the data with the objective of seeing whether there is any evidence that the
dose of alcohol increases reaction time. Justify any models that you use.
15. An experiment was conducted to compare gas mileages of cars using a synthetic oil
and a conventional oil. Eight cars were chosen as representative of the cars in general
use. Each car was run twice under as similar conditions as possible (same drivers,
routes, etc.), once with the synthetic oil and once with the conventional oil, the order
of use of the two oils being randomized. The average gas mileages were as follows:

    Car:                  1     2     3     4     5     6     7     8
    Synthetic (y1i):     21.2  21.4  15.9  37.0  12.1  21.1  24.5  35.7
    Conventional (y2i):  18.0  20.6  14.2  37.8  10.6  18.5  25.9  34.7
    yi = y1i − y2i:       3.2   0.8   1.7  −0.8   1.5   2.6  −1.4   1.0

    ȳ1 = 23.6125,  Σ_{i=1}^8 (y1i − ȳ1)² = 535.16875
    ȳ2 = 22.5375,  Σ_{i=1}^8 (y2i − ȳ2)² = 644.83875
    ȳ = 1.075,  Σ_{i=1}^8 (yi − ȳ)² = 17.135
(a) Obtain a 95% confidence interval for the difference in mean gas mileage, and
state the assumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking
only one measurement on each of 16 cars?
16. The following table gives the number of staff hours per month lost due to accidents
in eight factories of similar size over a period of one year before and after the introduction
of an industrial safety program.

    Factory i:        1     2     3     4     5     6     7     8
    After (y1i):    28.7  62.2  28.9   0.0  93.5  49.6  86.3  40.2
    Before (y2i):   48.5  79.2  25.3  19.7 130.9  57.6  88.8  62.1
    yi = y1i − y2i: −19.8 −17.0   3.6 −19.7 −37.4  −8.0  −2.5 −21.9

    ȳ = −15.3375  and  Σ_{i=1}^8 (yi − ȳ)² = 1148.79875.

There is a natural pairing of the data by factory. Factories with the best safety records
before the safety program tend to have the best records after the safety program as
well. The analysis of the data must take this pairing into account, and therefore the
model

    Yi ~ N(μ, σ²) = G(μ, σ),  i = 1,...,8 independently

is assumed, where μ and σ are unknown parameters.

(a) Explain what μ and σ represent.
    Set:  1     2     3     4     5     6     7     8     9     10
    A:   3.85  2.81  6.47  7.59  4.58  5.47  4.72  3.56  3.22  5.58
    B:   2.66  2.98  5.35  6.43  4.28  5.06  4.36  3.91  3.28  5.19

    Set: 11    12    13    14    15    16    17    18    19    20
    A:   4.58  5.46  3.31  4.33  4.26  6.29  5.04  5.08  5.08  3.47
    B:   4.05  4.78  3.77  3.81  3.17  6.02  4.84  4.81  4.34  3.48
(a) Construct a 99% confidence interval for the difference in the average time to sort
with algorithms A and B, assuming a Gaussian model applies.
(b) Test the Gaussian model assumption using an appropriate plot.
(c) Suppose you are asked to estimate the probability that A will sort a randomly
selected list faster than B. Give a point estimate of this probability.
(d) Another way to estimate the probability p in part (c) is just to notice that of
the 20 sets of numbers in the study, A sorted faster on 15. Indicate how you
could also get a confidence interval for p using this approach. (It is also possible
to get a confidence interval using the Gaussian model.)
19. Challenge Problem: Let Y1,...,Yn be a random sample from the G(μ1, σ1) distribution and let X1,...,Xn be a random sample from the G(μ2, σ2) distribution.
Obtain the likelihood ratio test statistic for testing the hypothesis H0: σ1 = σ2 and
show that it is a function of F = S1²/S2², where S1² and S2² are the sample variances
from the y and x samples respectively.

20. Challenge Problem: Readings produced by a set of scales are independent and
Normally distributed about the true weight of the item being measured. A study
21. Challenge Problem: Least squares estimation. Suppose you have a model
where the mean of the response variable Yi given the covariates xi = (xi1,...,xik)
has the form

    μi = E(Yi|xi) = μ(xi; β)

where β is a k × 1 vector of unknown parameters. Then the least squares estimate
of β based on data (xi, yi), i = 1,...,n is the value that minimizes the objective
function

    S(β) = Σ_{i=1}^n [yi − μ(xi; β)]².

In the linear case considered in this chapter, μ(xi; β) = Σ_{j=1}^k βj xij.
7. MULTINOMIAL MODELS
AND GOODNESS OF FIT TESTS
7.1
Many important hypothesis testing problems can be addressed using Multinomial models.
Suppose the data arise from a Multinomial distribution with joint probability function

    f(y1,...,yk; θ1,...,θk) = [n!/(y1! ··· yk!)] θ1^{y1} ··· θk^{yk}    (7.1)

where yj = 0, 1,... with Σ_{j=1}^k yj = n, and the θj satisfy 0 < θj < 1
and Σ_{j=1}^k θj = 1, and we define θ = (θ1,...,θk).
j=1
that the probabilities are related in some way, for example that they are all functions of a
lower dimensional parameter , such that
H0 :
j(
) for j = 1; : : : ; k
(7.2)
k
Q
j=1
yj
j :
(7.3)
Let be the parameter space for . It was shown earlier that L( ) is maximized over
(of dimension m 1) by the vector ^ with ^j = yj =n, j = 1; : : : ; k. A likelihood ratio test
of the hypothesis (7.2) is based on the likelihood ratio statistic
"
#
~0 )
L(
= 2l(~) 2l(~0 ) = 2 log
;
(7.4)
L(~)
where ~0 maximizes L( ) under the hypothesis (7.2), which restricts to lie in a space
of dimension p. (Note that 0 is the space of all ( 1 ( ); 2 ( ); :::; k ( )) as
0
varies over its possible values.) If H0 is true (that is, if really lies in 0 ) and n is large the
227
228
value = P (
; H0 ) t P (W
) where W s
(k
p)
values
(7.5)
and
= 2l(^)
2l(^0 )
is the observed value of . This approximation is very accurate when n is large and none
of the j s is too small. When the observed expected frequencies under H0 are all at least
ve, it is accurate enough for testing purposes.
The test statistic (7.4) can be written in a simple form. Let θ̃0 = (θ1(α̃),...,θk(α̃))
denote the maximum likelihood estimator of θ under the hypothesis (7.2). Then, by (7.4),
we obtain

    Λ = 2ℓ(θ̃) − 2ℓ(θ̃0) = 2 Σ_{j=1}^k Yj log[θ̃j/θj(α̃)],

which, with the expected frequencies defined as Ej = nθj(α̃), can be written as

    Λ = 2 Σ_{j=1}^k Yj log(Yj/Ej).    (7.6)

An alternative test statistic that was developed historically before the likelihood ratio
test statistic is the Pearson goodness of fit statistic

    D = Σ_{j=1}^k (Yj − Ej)²/Ej.    (7.7)

The Pearson goodness of fit statistic has similar properties to Λ; for example, their observed
values both equal zero when yj = ej = nθj(α̂) for all j = 1,...,k, and both are larger when the
yj's and ej's differ greatly. It turns out that, like Λ, the statistic D also has a limiting
χ²(k − 1 − p) distribution when H0 is true.
The remainder of this chapter consists of the application of the general methods above
to some important testing problems.
7.2
Recall from Section 2.4 that one way to check the fit of a probability distribution is by
comparing the observed frequencies fj and the expected frequencies ej = np̂j. As indicated
there, we did not know how close the observed and expected frequencies needed to be to
conclude that the model was adequate. It is possible to test the correctness of a model by
using the Multinomial model. We illustrate this through two examples.

Example 7.2.1 MM, MN, NN blood types

Recall Example 2.4.2, where people in a population are classified as being one of three
blood types MM, MN, NN. The proportions of the population that are these three types
are θ1, θ2, θ3 respectively, with θ1 + θ2 + θ3 = 1. Genetic theory indicates, however, that
the θj's can be expressed in terms of a single parameter α, as

    θ1 = α²,  θ2 = 2α(1 − α),  θ3 = (1 − α)².    (7.8)

Data collected on 100 persons gave y1 = 17, y2 = 46, y3 = 37, and we can use this to test
the hypothesis H0 that (7.8) is correct. (Note that (Y1, Y2, Y3) ~ Multinomial(n; θ1, θ2, θ3)
with n = 100.) The likelihood ratio test statistic is given by (7.6), but we have to find α̃
and then the Ej's. The likelihood function under (7.8) is

    L1(α) = L(θ1(α), θ2(α), θ3(α)) = c(α²)^17 [2α(1 − α)]^46 [(1 − α)²]^37 = c α^80 (1 − α)^120,

which is maximized by α̂ = 80/200 = 0.4, so the expected frequencies are e1 = 100α̂² = 16,
e2 = 200α̂(1 − α̂) = 48 and e3 = 100(1 − α̂)² = 36. The observed value of the likelihood
ratio statistic is then

    λ = 2 Σ_{j=1}^3 yj log(yj/ej) = 2[17 log(17/16) + 46 log(46/48) + 37 log(37/36)] = 0.17

and the p-value is

    p-value = P(Λ ≥ 0.17; H0) ≈ P(W ≥ 0.17) = 0.68  where W ~ χ²(1),

so there is no evidence against the hypothesis (7.8).
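The whole calculation for this example fits in a few lines of code. The Python sketch below (not part of the original notes) reproduces α̂, the expected frequencies and λ:

```python
import math

y = [17, 46, 37]                      # observed MM, MN, NN counts
n = sum(y)                            # n = 100

# L1(alpha) = c * alpha^80 * (1-alpha)^120 is maximized at 80/200 = 0.4;
# equivalently alpha_hat = (2*y1 + y2) / (2n)
alpha = (2 * y[0] + y[1]) / (2 * n)
e = [n * alpha**2, n * 2 * alpha * (1 - alpha), n * (1 - alpha)**2]

lam = 2 * sum(yj * math.log(yj / ej) for yj, ej in zip(y, e))
print([round(v, 1) for v in e], round(lam, 2))  # [16.0, 48.0, 36.0] 0.17
```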
When a continuous Exponential model with mean θ is grouped into intervals [a_{j−1}, a_j),
the Multinomial cell probabilities are

    pj(θ) = ∫_{a_{j−1}}^{a_j} f(t; θ) dt = e^{−a_{j−1}/θ} − e^{−a_j/θ}  for j = 1,...,k.    (7.9)

    Interval      Observed yj   Expected ej
    [0, 100)      29            27.6
    [100, 200)    22            20.0
    [200, 300)    12            14.4
    [300, 400)    10            10.5
    [400, 600)    10            13.1
    [600, 800)     9             6.9
    (800, ∞)       8             7.6

It is possible to maximize L(θ) mathematically. (Hint: rewrite L(θ) in terms of the parameter φ = e^{−100/θ} and find φ̂ first; then θ̂ = −100/log φ̂.) This gives θ̂ = 310.0. The
expected frequencies, ej = 100pj(θ̂), j = 1,...,7, are given in the table.
The observed value of the likelihood ratio statistic (7.6) is

    λ = 2 Σ_{j=1}^7 yj log(yj/ej) = 2[29 log(29/27.6) + ... + 8 log(8/7.6)] = 1.91

and the p-value is

    p-value = P(Λ ≥ 1.91; H0) ≈ P(W ≥ 1.91) ≈ 0.86  where W ~ χ²(5),

so there is no evidence against the model (7.9). Note that the reason the χ² degrees of
freedom are 5 is because k − 1 = 6 and p = dim(α) = 1.
The goodness of fit test just discussed has some arbitrary elements, since we could have
used different intervals and a different number of intervals. Theory has been developed on
how best to choose the intervals. For this course we only give rough guidelines, which are:
choose 4-10 intervals, so that the expected frequencies under H0 are at least 5.
    Observed frequency fj:  57  203  383  525  532  408  273  139  45  27  10  6   (total 2608)
    Expected frequency ej:  54.3  210.3  407.1  525.3  508.4  393.7  254.0  140.5  68.0  29.2  11.3  5.8   (total 2607.9)

The observed value of the likelihood ratio statistic is

    λ = 2 Σ_{j=1}^{12} fj log(fj/ej) = 2[57 log(57/54.3) + 203 log(203/210.3) + ... + 6 log(6/5.9)] = 14.01

and the p-value is

    p-value = P(Λ ≥ 14.01; H0) ≈ P(W ≥ 14.01) ≈ 0.17  where W ~ χ²(10),

so there is no evidence against the hypothesis that a Poisson model fits these data.
The observed value of the goodness of fit statistic is

    d = Σ_{j=1}^{12} (fj − ej)²/ej = (57 − 54.3)²/54.3 + ... + (6 − 5.9)²/5.9 = 12.96

and the p-value is

    p-value = P(D ≥ 12.96; H0) ≈ P(W ≥ 12.96) ≈ 0.23  where W ~ χ²(10),

so again there is no evidence against the hypothesis that a Poisson model fits these data.
232
9
X
fj log
j=1
fj
ej
= 2 21 log
21
52:72
+ 45 log
45
38:82
+ 8 log
value = P (
50:36; H0 ) t P (W
8
17:3
= 50:36:
value is
50:36) t 0 where W s
(7)
and there is very strong evidence against the hypothesis that an Exponential model ts
these data. This conclusion is not unexpected since, as we noted in Example 2.6.2, the
observed and expected frequencies are not in close agreement at all. We could have chosen
a dierent set of intervals for these continuous data but the same conclusion of a lack of t
would be obtained for any reasonable choice of intervals.
7.3
Often we want to assess whether two factors or variates appear to be related. One tool for
doing this is to test the hypothesis that the factors are independent and thus statistically
unrelated. We will consider this in the case where both variates are discrete, and take on
a fairly small number of possible values. This turns out to cover a great many important
settings.
Two types of studies give rise to data that can be used to test independence, and in
both cases the data can be arranged as frequencies in a two-way table. These tables are
also called contingency tables.
Note that

    Σ_{i=1}^a Σ_{j=1}^b yij = n  and  Σ_{i=1}^a Σ_{j=1}^b θij = 1,

and that the a × b frequencies (Y11, Y12,...,Yab) follow a Multinomial distribution with
k = ab classes.
To test independence of the A and B classifications, we consider the hypothesis

    H0: θij = αi βj  for i = 1,...,a; j = 1,...,b    (7.10)

where 0 < αi < 1, 0 < βj < 1, Σ_{i=1}^a αi = 1 and Σ_{j=1}^b βj = 1. Note that
αi = P(Ai) and βj = P(Bj),
and that (7.10) is the standard definition for independent events: P(Ai ∩ Bj) = P(Ai)P(Bj).
We recognize that testing (7.10) falls into the general framework of Section 7.1, where
k = ab, and the dimension of the parameter space under (7.10) is p = (a − 1) + (b − 1) =
a + b − 2. All that needs to be done in order to use the statistics (7.6) or (7.7) to test H0
is to obtain the maximum likelihood estimates α̂i, β̂j under the model (7.10), and then to
calculate the expected frequencies eij.
Under the model (7.10), the likelihood function for the yij's is proportional to

    L1(α, β) = Π_{i=1}^a Π_{j=1}^b [θij(α, β)]^{yij} = Π_{i=1}^a Π_{j=1}^b (αi βj)^{yij}
             = (Π_{i=1}^a αi^{yi+})(Π_{j=1}^b βj^{y+j}),

and maximizing subject to Σαi = 1 and Σβj = 1 gives

    α̂i = yi+/n,  β̂j = y+j/n  and  eij = n α̂i β̂j = yi+ y+j/n,    (7.11)
234
where yi+ =
b
P
j=1
a
P
yij .
i=1
The observed value of the likelihood ratio statistic (7.6) for testing the hypothesis (7.10)
is then
a X
b
X
yij
=2
:
yij log
eij
i=1 j=1
The approximate p
p
The
value is computed as
value = P (
; H0 ) t P (W
degrees of freedom (a
k
1)(b
p = (ab
1)
where W s
((a
1)(b
1))
1) are determined by
(a
1+b
1) = (a
1)(b
1):
           Rh+        Rh−        Total
    O      82 (77.3)  13 (17.7)   95
    A      89 (94.4)  27 (21.6)  116
    B      54 (49.6)   7 (11.4)   61
    AB     19 (22.8)   9 (5.2)    28
    Total  244        56         300
It is of interest to see whether these two classication systems are genetically independent.
The row and column totals in the table are also shown, since they are the values yi+ and
y+j needed to compute the eij s in (7.11). In this case we can think of the Rh types as the
A-type classication and the OAB types as the B-type classication in the general theory
above. Thus a = 2, b = 4 and the 2 degrees of freedom are (a 1)(b 1) = 3.
To carry out the test that a persons Rh and OAB blood types are statistically independent, we merely need to compute the eij s by (7.11). This gives, for example,
e11 =
(244)(95)
244(116)
= 77:3; e12 =
= 94:4
300
300
and, similarly, e13 = 49:6, e14 = 22:8, e21 = 17:7, e22 = 21:6, e23 = 11:4, e24 = 5:2.
It may be noted that $e_{i+} = y_{i+}$ and $e_{+j} = y_{+j}$, so it is necessary to compute only $(a-1)(b-1)$ of the $e_{ij}$'s using (7.11); the remainder can be obtained by subtraction from row and column totals. For example, if we compute $e_{11}$, $e_{12}$, $e_{13}$ here then $e_{21} = 95 - e_{11}$, $e_{22} = 116 - e_{12}$, and so on. (This is not an advantage if we are using a computer to calculate the numbers; however, it does suggest where the degrees of freedom come from.)
The observed value of the likelihood ratio test statistic is $\lambda = 8.52$, and the $p$-value is approximately $P(W \ge 8.52) = 0.036$ where $W \sim \chi^2(3)$, so there is evidence against the hypothesis of independence. Note that by comparing the $e_{ij}$'s and the $y_{ij}$'s we get some idea about the lack of independence, or relationship, between the two classifications. We see here that the degree of dependence does not appear large.
Testing Equality of Multinomial Parameters from Two or More Groups
A similar problem arises when individuals in a population can be one of $b$ types $B_1, \ldots, B_b$, but where the population is sub-divided into $a$ groups $A_1, \ldots, A_a$. In this case, we might be interested in whether the proportions of individuals of types $B_1, \ldots, B_b$ are the same for each group. This is essentially the same as the question of independence in the preceding section: we want to know whether the probability $\theta_{ij}$ that a person in population group $i$ is B-type $B_j$ is the same for all $i = 1, \ldots, a$. That is, $\theta_{ij} = P(B_j \mid A_i)$ and we want to know if this depends on $A_i$ or not.
Although the framework is superficially the same as in the preceding section, the details are a little different. In particular, the probabilities $\theta_{ij}$ satisfy

$$\theta_{i1} + \theta_{i2} + \cdots + \theta_{ib} = 1 \quad\text{for each } i = 1, \ldots, a, \tag{7.12}$$

and the hypothesis that the type distribution is the same in every group is

$$H_0: \theta_{1j} = \theta_{2j} = \cdots = \theta_{aj} \quad\text{for each } j = 1, \ldots, b. \tag{7.13}$$

It turns out that the expected frequencies under $H_0$ are once again $e_{ij} = y_{i+}y_{+j}/n$, so the test statistic is computed exactly as in the test of independence.
                     Stroke       No Stroke       Total
    Aspirin Group     64 (75.6)   176 (164.4)      240
    Placebo Group     86 (74.4)   150 (161.6)      236
    Total            150          326              476
We can think of the persons receiving aspirin and those receiving placebo as two groups, and test the hypothesis

$$H_0: \theta_{11} = \theta_{21},$$

where $\theta_{11} = P(\text{stroke})$ for a person in the aspirin group and $\theta_{21} = P(\text{stroke})$ for a person in the placebo group. The expected frequencies under $H_0: \theta_{11} = \theta_{21}$ are

$$e_{ij} = \frac{y_{i+}\,y_{+j}}{476} \qquad\text{for } i, j = 1, 2.$$
This gives the values shown in parentheses in the table. The observed value of the likelihood ratio statistic is

$$\lambda = 2\sum_{i=1}^{2}\sum_{j=1}^{2} y_{ij}\log\left(\frac{y_{ij}}{e_{ij}}\right) = 5.25$$

and the approximate $p$-value is

$$p\text{-value} \approx P(W \ge 5.25) = 0.022, \qquad W \sim \chi^2(1),$$

so there is evidence against $H_0$. A look at the $y_{ij}$'s and the $e_{ij}$'s indicates that persons receiving aspirin have had fewer strokes than expected under $H_0$, suggesting that $\theta_{11} < \theta_{21}$.
This test can be followed up with estimates for $\theta_{11}$ and $\theta_{21}$. Because each row of the table follows a Binomial distribution, we have

$$\hat\theta_{11} = \frac{y_{11}}{n_1} = \frac{64}{240} = 0.267 \qquad\text{and}\qquad \hat\theta_{21} = \frac{y_{21}}{n_2} = \frac{86}{236} = 0.364.$$
We can also give individual confidence intervals for $\theta_{11}$ and $\theta_{21}$. Based on methods derived earlier, an approximate 95% confidence interval for $\theta_{11}$ is

$$0.267 \pm 1.96\sqrt{\frac{(0.267)(0.733)}{240}} \qquad\text{or}\qquad [0.211,\ 0.323]$$

and an approximate 95% confidence interval for $\theta_{21}$ is

$$0.364 \pm 1.96\sqrt{\frac{(0.364)(0.636)}{236}} \qquad\text{or}\qquad [0.303,\ 0.425].$$

An approximate confidence interval for the difference $\theta_{11} - \theta_{21}$ can be based on the approximate pivotal quantity

$$\frac{\tilde\theta_{11} - \tilde\theta_{21} - (\theta_{11} - \theta_{21})}{\sqrt{\tilde\theta_{11}(1-\tilde\theta_{11})/n_1 + \tilde\theta_{21}(1-\tilde\theta_{21})/n_2}}.$$
Remark: This and other tests involving Binomial probabilities and contingency tables can
be carried out using the R function prop.test.
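As a cross-check on prop.test, the same likelihood ratio computation can be sketched in Python using only the standard library; for one degree of freedom, $P(W \ge x) = \operatorname{erfc}(\sqrt{x/2})$. With exact expected frequencies the statistic is near 5.28, versus the 5.25 obtained from the rounded hand calculation above.

```python
import math

# Observed frequencies: rows = aspirin / placebo, columns = stroke / no stroke
y = [[64, 176],
     [86, 150]]

n = sum(map(sum, y))                                  # 476
row = [sum(r) for r in y]
col = [sum(r[j] for r in y) for j in range(2)]

# Expected frequencies under H0: theta_11 = theta_21
e = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

lam = 2 * sum(y[i][j] * math.log(y[i][j] / e[i][j])
              for i in range(2) for j in range(2))

# chi-squared(1) tail probability: P(W >= x) = erfc(sqrt(x/2))
p_value = math.erfc(math.sqrt(lam / 2))

# Estimates and approximate 95% confidence intervals for each stroke probability
for yi, ni in [(64, 240), (86, 236)]:
    t = yi / ni
    half = 1.96 * math.sqrt(t * (1 - t) / ni)
    print(round(t, 3), [round(t - half, 3), round(t + half, 3)])

print(round(lam, 2), round(p_value, 3))
```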
7.4 Chapter 7 Problems
1. An observational study examined whether rust-proofing prevents rust. The cars examined were classified as follows:

                      Rust present   Rust absent   Total
    Rust-proofed           14             36          50
(a) Test the hypothesis that the probability of rust occurring is the same for the
rust-proofed cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust? How might a better study be designed?
2. Two hundred volunteers participated in an experiment to examine the effectiveness of vitamin C in preventing colds. One hundred were selected at random to receive daily doses of vitamin C and the others received a placebo. (None of the volunteers knew which group they were in.) During the study period, 20 of those taking vitamin C and 30 of those receiving the placebo caught colds. Test the hypothesis that the probability of catching a cold during the study period was the same for each group.
3. Mass-produced items are packed in cartons of 12 as they come off an assembly line. The items from 250 cartons are inspected for defects, with the following results:

    Number defective:      0     1    2    3    4    5    6
    Frequency observed:   103   80   31   19   11    5    1

Test the hypothesis that the number of defective items Y in a single carton has a Binomial(12, p) distribution. Why might the Binomial not be a suitable model?
4. The numbers of service interruptions in a communications system over 200 separate weekdays are summarized in the following frequency table:

    Number of interruptions:    0    1    2    3    4    5
    Frequency observed:        64   71   42   18    4    1

Test whether a Poisson model for the number of interruptions Y on a single day is consistent with these data.
5. The table below records data on 292 litters of mice classified according to litter size and number of females in the litter. The entry $y_{nj}$ is the number of litters of size $n$ that contain $j$ females.

    Litter        Number of females = j         Total number
    size = n      0     1     2     3     4     of litters = y_{n+}
    1             8    12                        20
    2            23    44    13                  80
    3            10    25    48    13            96
    4             5    30    34    22    5       96
(a) For litters of size n (n = 1, 2, 3, 4) assume that the number of females in a litter of size n has a Binomial distribution with parameters n and $\theta_n = P(\text{female})$. Test the Binomial model separately for each of the litter sizes n = 2, n = 3 and n = 4. (Why is it of scientific interest to do this?)

(b) Assuming that the Binomial model is appropriate for each litter size, test the hypothesis that $\theta_1 = \theta_2 = \theta_3 = \theta_4$.
6. A long sequence of digits (0, 1, ..., 9) produced by a pseudo random number generator was examined. There were 51 zeros in the sequence, and for each successive pair of zeros, the number of (non-zero) digits between them was counted. The results were as follows:

     1    2    2    2    4    1   26    3    2    7
     6    1    0   21   16    8   20    5    4   18
    10    4    2    3    2   22    2    8    0   13
    12    0    1    0   22   15   10    6    7    7
     0    4   14    2    3    0   19    2    4    5

Give an appropriate probability model for the number of digits between two successive zeros, if the pseudo random number generator is truly producing digits for which $P(\text{any digit} = j) = 0.1$, $j = 0, 1, \ldots, 9$, independent of any other digit. Construct a frequency table and test the goodness of fit of your model.
7. 1398 school children with tonsils present were classified according to tonsil size and absence or presence of the carrier for Streptococcus pyogenes. The results were as follows:

                        Normal   Enlarged   Much enlarged
    Carrier present       19        29           24
    Carrier absent       497       560          269

Is there evidence of an association between the two classifications?
240
8. The following data on heights of 210 married couples were presented by Yule in 1900.

                        Tall wife   Medium wife   Short wife
    Tall husband           18           28            19
    Medium husband         20           51            28
    Short husband          12           25             9

Test the hypothesis that the heights of husbands and wives are independent.
9. In the following table, 64 sets of triplets are classified according to the age of their mother at their birth and their sex distribution:

                        3 boys   2 boys   2 girls   3 girls   Total
    Mother under 30        5        8        9          7       29
    Mother over 30         6       10       13          6       35
    Total                 11       18       22         13       64

(a) Is there any evidence of an association between the sex distribution and the age of the mother?

(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets are determined independently. Find the probability that there are x boys in a set of triplets (x = 0, 1, 2, 3), and test whether the column totals are consistent with this distribution.
10. A study was undertaken to determine whether there is an association between the
birth weights of infants and the smoking habits of their parents. Out of 50 infants of
above average weight, 9 had parents who both smoked, 6 had mothers who smoked
but fathers who did not, 12 had fathers who smoked but mothers who did not, and
23 had parents of whom neither smoked. The corresponding results for 50 infants of
below average weight were 21, 10, 6, and 13, respectively.
(a) Test whether these results are consistent with the hypothesis that birth weight
is independent of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of
the mother, the smoking habits of the father are not related to birth weight?
11. Purchase a box of Smarties and count the number of each of the colours: red, green, yellow, blue, purple, brown, orange, pink. Test the hypothesis that each of the colours has the same probability, $H_0: \theta_i = \frac{1}{8}$, $i = 1, 2, \ldots, 8$. The following R code⁵⁷ can be modified to give the two test statistics, the likelihood ratio test statistic $\Lambda$ and Pearson's chi-squared $D$.

⁵⁷ These are the frequencies of Smarties for a large number of boxes consumed in Winter 2013.
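In place of the R listing, a Python sketch with hypothetical colour counts (standing in for the Winter 2013 frequencies, which are not reproduced here) computes both statistics in the same way; each is compared with the $\chi^2(7)$ distribution.

```python
import math

# Hypothetical colour counts for n = 160 Smarties (8 colours);
# replace these with your own observed frequencies.
y = [20, 18, 25, 22, 19, 21, 17, 18]
n = sum(y)
e = [n / 8] * 8                      # expected counts under H0: theta_i = 1/8

# Likelihood ratio statistic and Pearson's chi-squared statistic
lam = 2 * sum(yi * math.log(yi / ei) for yi, ei in zip(y, e))
D = sum((yi - ei) ** 2 / ei for yi, ei in zip(y, e))
print(round(lam, 2), round(D, 2))    # compare with chi-squared, 7 df
```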
8. CAUSAL RELATIONSHIPS
8.1 Establishing Causation⁵⁸
As mentioned in Chapters 1 and 3, many studies are carried out with causal objectives in mind. That is, we would like to be able to establish or investigate a possible cause and effect relationship between variables X and Y.

We use the word "causes" often; for example, we might say that gravity causes dropped objects to fall to the ground, or that smoking causes lung cancer. The concept of causation (as in X causes Y) is nevertheless hard to define. One reason is that the strengths of causal relationships vary a lot. For example, on earth gravity may always lead to a dropped object falling to the ground; however, not everyone who smokes gets lung cancer.
Idealized definitions of causation are often of the following form. Let y be a response variate associated with units in a population or process, and let x be an explanatory variate associated with some factor that may affect y. Then, if all other factors that affect y are held constant, let us change x (or observe different values of x) and see if y changes. If y changes, then we say that x has a causal effect on y.
In fact, this definition is not broad enough, because in many settings a change in x may only lead to a change in y in some probabilistic sense. For example, giving an individual person at risk of stroke a small daily dose of aspirin instead of a placebo may not necessarily lower their risk. (Not everyone is helped by this medication.) However, on average the effect is to lower the risk of stroke. One way to measure this is by looking at the probability that a randomly selected person has a stroke (say within 3 years) if they are given aspirin versus if they are not.

Therefore, a better idealized definition of causation is to say that changing x should result in a change in some attribute of the random variable Y (for example, its mean or some probability such as P(Y > 0)). Thus we revise the definition above to say:

If all other factors that affect Y are held constant, let us change x (or observe different values of x) and see if some specified attribute of Y changes. If the specified attribute of Y changes, then we say x has a causal effect on Y.
These definitions are unfortunately unusable in most settings since we cannot hold all other factors that affect y constant; often we don't even know what all the factors are.

⁵⁸ See the video at www.watstat.ca called "Causation and the Flying Spaghetti monster"
However, the definition serves as a useful ideal for how we should carry out studies in order to show that a causal relationship exists. We try to design studies so that alternative (to the variate x) explanations of what causes changes in attributes of y can be ruled out, leaving x as the causal agent. This is much easier to do in experimental studies, where explanatory variables may be controlled, than in observational studies. The following are brief examples.
Example 8.1.1 Strength of steel bolts

Recall Example 6.1.3 concerning the (breaking) strength y of a steel bolt and the diameter x of the bolt. It is clear that bolts with larger diameters tend to have higher strength, and it seems clear on physical and theoretical grounds that increasing the diameter causes an increase in strength. This can be investigated in experimental studies like that in Example 6.1.3, where random samples of bolts of different diameters are tested and their strengths y determined.
Clearly, the value of x does not determine y exactly (different bolts with the same diameter don't have the same strength), but we can consider attributes such as the average value of y. In the experiment we can hold other factors more or less constant (e.g. the ambient temperature, the way the force is applied, the metallurgical properties of the bolts), so we feel that the observed larger average values of y for bolts of larger diameter x are due to a causal relationship.
Note that even here we have to depart slightly from the idealized definition of cause and effect. In particular, a bolt cannot have its diameter x changed so that we can see if y changes. All we can do is consider two bolts that are as similar as possible, and are subject to the same explanatory variables (aside from diameter). This difficulty arises in many experimental studies.
Example 8.1.2 Smoking and lung cancer

Suppose that data have been collected on 10,000 persons aged 40-80 who have smoked for at least 20 years, and 10,000 persons in the same age range who have not. There is roughly the same distribution of ages in the two groups. The (hypothetical) data concerning the numbers with lung cancer are as follows:

                   Lung Cancer   No Lung Cancer   Total
    Smokers            500            9500        10,000
    Non-Smokers        100            9900        10,000
There are many more lung cancer cases among the smokers, but without further information or assumptions we cannot conclude that a causal relationship (smoking causes lung cancer) exists. Alternative explanations might explain some or all of the observed difference. (This is an observational study, and other possible explanatory variables are not controlled.) For example, family history is an important factor in many cancers; maybe smoking is also related to family history. Moreover, smoking tends to be connected with other factors such as diet and alcohol consumption; these may explain some of the effect seen.
The last example illustrates that association (statistical dependence) between two variables X and Y does not imply that a causal relationship exists. Suppose for example that we observe a positive correlation between X and Y: higher values of X tend to go with higher values of Y in a unit. Then there are at least three explanations: (i) X causes Y (meaning X has a causative effect on Y), (ii) Y causes X, and (iii) some other factor(s) Z cause both X and Y.

We'll now consider the question of cause and effect in experimental and observational studies in a little more detail.
8.2 Experimental Studies
Suppose we want to investigate whether a variate x has a causal effect on a response variate Y. In an experimental setting we can control the values of x that a unit sees. In addition, we can use one or both of the following devices for ruling out alternative explanations for any observed changes in Y that might be caused by x:

(i) Hold other possible explanatory variates fixed.

(ii) Use randomization to control for other variates.

These devices are most simply explained via examples.
Example 8.2.1 Aspirin and the risk of stroke

Suppose 500 persons that are at high risk of stroke have agreed to take part in a clinical trial to assess whether aspirin lowers the risk of stroke. These persons are representative of a population of high risk individuals. The study is conducted by giving some persons aspirin and some a placebo, then comparing the two groups in terms of the number of strokes observed.

Other factors such as age, sex, weight, existence of high blood pressure, and diet also may affect the risk of stroke. These variates obviously vary substantially across persons and cannot be held constant or otherwise controlled. However, such studies use randomization in the following way: among the study subjects, who gets aspirin and who gets a placebo is determined by a random mechanism. For example, we might flip a coin (or draw a random number from {0, 1}), with one outcome (say Heads) indicating a person is to be given aspirin, and the other indicating that they get the placebo.
The effect of this randomization is to balance the other possible explanatory variables in the two treatment groups (aspirin and placebo). Thus, if at the end of the study we observe that 20% of the placebo subjects have had a stroke but only 9% of the aspirin subjects have, then we can attribute the difference to the causative effect of the aspirin. Here's how we rule out alternative explanations: suppose you claim that it's not the aspirin but dietary factors and blood pressure that cause this observed effect. I respond that the randomization procedure has led to those factors being balanced in the two treatment groups. That is, the aspirin group and the placebo group both have similar variations in dietary and blood pressure values across the subjects in the group. Thus, a difference in the two groups should not be due to these factors.
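The randomization step can be sketched in a few lines of Python. Shuffling the subject list and splitting it in half is one of several valid schemes (the coin-flip scheme described above gives random group sizes; this one guarantees 250 subjects per group). The subject identifiers here are hypothetical.

```python
import random

def randomize(subjects, rng):
    """Randomly split subjects into two equal treatment groups."""
    subjects = list(subjects)
    rng.shuffle(subjects)          # random order; all orderings equally likely
    half = len(subjects) // 2
    return subjects[:half], subjects[half:]

# 500 (hypothetical) subject identifiers, assigned to aspirin or placebo
rng = random.Random(2013)          # fixed seed so the split is reproducible
aspirin, placebo = randomize(range(500), rng)
print(len(aspirin), len(placebo))
```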
Example 8.2.2 Driving speed and fuel consumption

It is thought that fuel consumption in automobiles is greater at speeds in excess of 100 km per hour. (Some years ago during oil shortages, many U.S. states reduced speed limits on freeways because of this.) A study is planned that will focus on freeway-type driving, because fuel consumption is also affected by the amount of stopping and starting in town driving, in addition to other factors.

In this case a decision was made to carry out an experimental study at a special paved track owned by a car company. Obviously a lot of factors besides speed affect fuel consumption: for example, the type of car and engine, tire condition, fuel grade and the driver. As a result, these factors were controlled in the study by balancing them across different driving speeds. An experimental plan of the following type was employed.
- 8 cars of eight different types were used; each car was used for 8 test drives.

- The cars were each driven twice for 600 km on the track at each of four speeds: 80, 100, 120 and 140 km/hr.

- 8 drivers were involved, each driving each of the 8 cars for one test, and each driving two tests at each of the four speeds.

- The cars had similar initial mileages and were carefully checked and serviced so as to make them as comparable as possible; they used comparable fuels.

- The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30 minute rest stop after 300 km.

- The order in which each driver did his or her 8 test drives was randomized. The track was large enough that all 8 drivers could be on it at the same time. (The tests were conducted over 8 days.)
The response variate was the amount of fuel consumed for each test drive. Obviously in the analysis we must deal with the fact that the cars differ in size and engine type, and their fuel consumption will depend on that as well as on driving speed. A simple approach would be to add the fuel amounts consumed for the 16 test drives at each speed, and to compare them (other methods are also possible). Then, for example, we might find that the average consumption (across the 8 cars) at 80, 100, 120 and 140 km/hr were 43.0, 44.1, 45.8 and 47.2 liters, respectively. Statistical methods of testing and estimation could then be used to test or estimate the differences in average fuel consumption at each of the four speeds. (Can you think of a way to do this?)
Exercise: Suppose that statistical tests demonstrated a significant difference in consumption across the four driving speeds, with lower speeds giving lower consumption. What (if any) qualifications would you have about concluding there is a causal relationship?
8.3 Observational Studies
In observational studies there are often unmeasured factors that affect the response Y. If these factors are also related to the explanatory variable x whose (potential) causal effect we are trying to assess, then we cannot easily make any inferences about causation. For this reason, we try in observational studies to measure other important factors besides x.

For example, Problem 1 at the end of Chapter 7 discusses an observational study on whether rust-proofing prevents rust. It is clear that an unmeasured factor is the care a car owner takes in looking after a vehicle; this could quite likely be related to whether a person decides to have their car rust-proofed.

The following example shows how we must take note of measured factors that affect Y.
Example 8.3.1 Graduate studies admissions

Suppose that over a five year period, the applications and admissions to graduate studies in Engineering and Arts faculties in a university are as follows:

                            No. Applied   No. Admitted   % Admitted
    Engineering   Men          1000            600           60%
                  Women         200            150           75%
    Arts          Men          1000            400           40%
                  Women        1800            800           44%
    Total         Men          2000           1000           50%
                  Women        2000            950           47.5%
We want to see if females have a lower probability of admission than males. If we looked
only at the totals for Engineering plus Arts, then it would appear that the probability a
male applicant is admitted is a little higher than the probability for a female applicant.
However, if we look separately at Arts and Engineering, we see the probability for females
being admitted appears higher in each case! The reason for the reverse direction in the
totals is that Engineering has a higher admission rate than Arts, but the fraction of women
applying to Engineering is much lower than for Arts.
In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts) is a causative factor with respect to probability of admission. Furthermore, it is related to the sex (male or female) of an applicant, so we cannot ignore it in trying to see if sex is also a causative factor.
Remark: The feature illustrated in the example above is sometimes called Simpson's Paradox. In probabilistic terms, it says that for events $A$, $B_1$, $B_2$ and $C_1, \ldots, C_k$, we can have

$$P(A \mid B_1 C_i) > P(A \mid B_2 C_i) \quad\text{for each } i = 1, \ldots, k$$

but have

$$P(A \mid B_1) < P(A \mid B_2).$$

(Note that $P(A \mid B_1) = \sum_{i=1}^{k} P(A \mid B_1 C_i)\, P(C_i \mid B_1)$, and similarly for $P(A \mid B_2)$, so the direction of the marginal inequality depends on what $P(C_i \mid B_1)$ and $P(C_i \mid B_2)$ are.) In the example above we can take $B_1$ = {person is female}, $B_2$ = {person is male}, $C_1$ = {person applies to Engineering}, $C_2$ = {person applies to Arts}, and $A$ = {person is admitted}.
Exercise: Write down estimated probabilities for the various events based on Example 8.3.1, and so illustrate Simpson's paradox.
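Using the admission numbers from Example 8.3.1, the reversal can be verified directly; a minimal Python sketch:

```python
# Admission counts from Example 8.3.1: (applied, admitted) by faculty and sex.
data = {
    ("Engineering", "Men"):   (1000, 600),
    ("Engineering", "Women"): (200, 150),
    ("Arts", "Men"):          (1000, 400),
    ("Arts", "Women"):        (1800, 800),
}

def rate(pairs):
    """Estimated admission probability from a list of (applied, admitted) pairs."""
    applied = sum(a for a, _ in pairs)
    admitted = sum(m for _, m in pairs)
    return admitted / applied

# Within each faculty, women are admitted at a higher estimated rate ...
for faculty in ("Engineering", "Arts"):
    assert rate([data[(faculty, "Women")]]) > rate([data[(faculty, "Men")]])

# ... yet the marginal rates reverse the direction (Simpson's paradox).
women_all = rate([data[(f, "Women")] for f in ("Engineering", "Arts")])
men_all = rate([data[(f, "Men")] for f in ("Engineering", "Arts")])
assert women_all < men_all
print(women_all, men_all)
```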
Epidemiologists (specialists in the study of disease) have developed guidelines or criteria which should be met in order to argue that a causal association exists between a risk factor x and a disease (represented by a response variable Y = I(person has the disease), for example). These include:

- the need to account for other possible risk factors and to demonstrate that x and Y are consistently related when these factors vary;

- the demonstration that the association between x and Y holds in different types of settings;

- the existence of a plausible scientific explanation.

Similar criteria apply to other areas.
8.4 Clofibrate Study

In the early seventies, the Coronary Drug Research Group implemented a large medical trial⁵⁹ in order to evaluate an experimental drug, clofibrate, for its effect on the risk of heart attacks in middle-aged people with heart trouble. Clofibrate operates by reducing the cholesterol level in the blood and thereby potentially reducing the risk of heart disease.

⁵⁹ The Coronary Drug Research Group, New England Journal of Medicine (1980), pg. 1038.
[Fishbone (cause-and-effect) diagram: potential explanatory variates for the response "Fatal Heart Attack", grouped under Measurement (follow-up time, dose, drug, doctor), Material (age, stress, mental health, diet, personality type, gender, exercise, smoking status, drinking status, medications, family history, physical traits, personal history), Personnel, Environment (weather, location, work environment), and Methods (method of administration, dose, when taken).]
explanatory variates other than the focal explanatory variate. See the fishbone diagram above.)
- Administer treatments in identical capsules in a double-blinded fashion. (In this context, double-blind means that neither the patient nor the individual administering the treatment knows if it is clofibrate or placebo; only the person heading the investigation knows. This is to avoid differential reporting rates from physicians enthusiastic about the new drug, a form of measurement error.)

- Follow patients for 5 years and record the occurrence of any fatal heart attacks experienced in either treatment group.
Problem:

Investigate the occurrence of fatal heart attacks in the group of patients assigned to clofibrate who were adherers. The remaining parts of the problem stage are as before.
Plan:

Compare the occurrence of heart attacks in patients assigned to clofibrate who maintained the designated treatment schedule with the patients assigned to clofibrate who abandoned their assigned treatment schedule. Note that this is a further reduction of the study population.
Data:

In the clofibrate group, 708 patients were adherers and 357 were non-adherers. The remaining 38 patients could not be classified as adherers or non-adherers and so were excluded from this analysis. Of the 708 adherers, 106 had a fatal heart attack during the five years of follow up. Of the 357 non-adherers, 88 had a fatal heart attack during the five years of follow up.
Analysis:

The proportion of adherers suffering a subsequent fatal heart attack is 106/708 = 0.15, while this proportion for the non-adherers is 88/357 = 0.25.
Conclusions:

It would appear that clofibrate does reduce mortality due to heart attack for high risk patients if properly administered.

However, great care must be taken in interpreting the above results since they are based on an observational plan. While the data were collected based on an experimental plan, only the treatment was controlled. The comparison of the mortality rates between the adherers and non-adherers is based on an explanatory variate (adherence) that was not controlled in the original experiment. The investigators did not decide who would adhere to the protocol and who would not; the subjects decided themselves.

Now the possibility of confounding is substantial. Perhaps adherers are more health conscious and exercised more or ate a healthier diet. Detailed measurements of these variates are needed to control for them and reduce the possibility of confounding.
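The adherer versus non-adherer comparison can be quantified with the approximate pivotal for a difference of Binomial probabilities. The notes themselves stop at the two proportions; the following Python sketch adds the standardized difference, to be compared with the $G(0,1)$ distribution.

```python
import math

# Fatal heart attacks among patients assigned to clofibrate
y1, n1 = 106, 708    # adherers
y2, n2 = 88, 357     # non-adherers

p1, p2 = y1 / n1, y2 / n2

# Approximate standardized difference of the two Binomial proportions
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se
print(round(p1, 2), round(p2, 2), round(z, 2))
```

A value of z this far below zero is strong evidence of a difference, but as the Conclusions stress, adherence was not randomized, so this cannot be read causally.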
8.5 Chapter 8 Problems
1. In an Ontario study, 50267 live births were classified according to the baby's weight (less than or greater than 2.5 kg) and according to the mother's smoking habits (non-smoker, 1-20 cigarettes per day, or more than 20 cigarettes per day). The results were as follows:

                          No. of cigarettes
                       0        1-20      > 20
    Weight ≤ 2.5      1322      1186       793
    Weight > 2.5     27036     14142      5788
(a) Test the hypothesis that birth weight is independent of the mother's smoking habits.

(b) Explain why it is that these results do not prove that birth weights would increase if mothers stopped smoking during pregnancy. How should a study to obtain such proof be designed?

(c) A similar, though weaker, association exists between birth weight and the amount smoked by the father. Explain why this is to be expected even if the father's smoking habits are irrelevant.
2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted instruction (CAI). Seventy-five received the standard lecture course while the other 75 received some CAI. All 150 students then wrote the same examination. Fifteen students in the standard course and 29 of those in the CAI group received a mark over 80%.

(a) Are these results consistent with the hypothesis that the probability of achieving a mark over 80% is the same for both groups?

(b) Based on these results, the instructor concluded that CAI increases the chances of a mark over 80%. How should the study have been carried out in order for this conclusion to be valid?
3. (a) The following data were collected some years ago in a study of possible sex bias in graduate admissions at a large university:

                           Admitted   Not admitted
    Male applicants          3738         4704
    Female applicants        1494         2827

Test the hypothesis that admission status is independent of sex. Do these data indicate a lower admission rate for females?
(b) The following table shows the numbers of male and female applicants and the percentages admitted for the six largest graduate programs in (a):

                      Men                          Women
    Program    Applicants   % Admitted      Applicants   % Admitted
    A             825           62             108           82
    B             560           63              25           68
    C             325           37             593           34
    D             417           33             375           35
    E             191           28             393           24
    F             373            6             341            7
Test the independence of admission status and sex for each program. Do any of
the programs show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the
results for individual programs in (b) do not?
4. To assess the (presumed) beneficial effects of rust-proofing cars, a manufacturer randomly selected 200 cars that were sold 5 years earlier and were still used by the original buyers. One hundred cars were selected from purchases where the rust-proofing option package was included, and one hundred from purchases where it was not (and where the buyer did not subsequently get the car rust-proofed by a third party).

The amount of rust on the vehicles was measured on a scale in which the responses Y are assumed roughly Gaussian, as follows:

1. Rust-proofed cars: $Y \sim G(\mu_1, \sigma)$
2. Non-rust-proofed cars: $Y \sim G(\mu_2, \sigma)$

Sample means and standard deviations from the two sets of cars were found to be (higher $\bar y$ means more rust)

1. $\bar y_1 = 11.7$, $s_1 = 2.1$
2. $\bar y_2 = 12.0$, $s_2 = 2.4$

(a) Test the hypothesis that there is no difference in $\mu_1$ and $\mu_2$.

(b) The manufacturer was surprised to find that the data did not show a beneficial effect of rust-proofing. Describe problems with their study and outline how you might carry out a study designed to demonstrate a causal effect of rust-proofing.
5. In randomized clinical trials that compare two (or more) medical treatments it is customary not to let either the subject or their physician know which treatment they have been randomly assigned. (These are referred to as double blind studies.) Discuss why failing to do this might be a problem in a causative study (i.e. a study where you want to assess the causative effect of one or more treatments).
9. REFERENCES AND SUPPLEMENTARY RESOURCES

9.1 References

R.J. Mackay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat 231 Course Notes).

C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis and Inference. John Wiley and Sons, New York.

J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability. American Statistician 57, 74-79.
9.2 Supplementary Resources
Discrete Distributions

Binomial$(n, p)$: $f(y) = \binom{n}{y} p^y q^{n-y}$, $y = 0, 1, 2, \ldots, n$; $0 < p < 1$, $q = 1 - p$. Mean $np$; variance $npq$; m.g.f. $(pe^t + q)^n$.

Bernoulli$(p)$: $f(y) = p^y (1-p)^{1-y}$, $y = 0, 1$; $0 < p < 1$, $q = 1 - p$. Mean $p$; variance $p(1-p)$; m.g.f. $pe^t + q$.

Negative Binomial$(k, p)$: $f(y) = \binom{y+k-1}{y} p^k q^y$, $y = 0, 1, 2, \ldots$; $0 < p < 1$, $q = 1 - p$. Mean $kq/p$; variance $kq/p^2$; m.g.f. $\left(\frac{p}{1 - qe^t}\right)^k$, $t < -\ln q$.

Geometric$(p)$: $f(y) = p q^y$, $y = 0, 1, 2, \ldots$; $0 < p < 1$, $q = 1 - p$. Mean $q/p$; variance $q/p^2$; m.g.f. $\frac{p}{1 - qe^t}$, $t < -\ln q$.

Hypergeometric$(N, r, n)$: $f(y) = \binom{r}{y}\binom{N-r}{n-y} \big/ \binom{N}{n}$, $y = 0, 1, \ldots, \min(r, n)$; $r \le N$, $n \le N$. Mean $\frac{nr}{N}$; variance $n \frac{r}{N}\left(1 - \frac{r}{N}\right)\frac{N-n}{N-1}$; m.g.f. intractable.

Poisson$(\mu)$: $f(y) = \frac{e^{-\mu} \mu^y}{y!}$, $y = 0, 1, \ldots$; $\mu > 0$. Mean $\mu$; variance $\mu$; m.g.f. $e^{\mu(e^t - 1)}$.

Multinomial$(n; \theta_1, \ldots, \theta_k)$: $f(y_1, \ldots, y_k) = \frac{n!}{y_1! \, y_2! \cdots y_k!}\, \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}$, $y_i = 0, 1, \ldots$ with $\sum_{i=1}^{k} y_i = n$; $\theta_i \ge 0$ with $\sum_{i=1}^{k} \theta_i = 1$. $E(Y_i) = n\theta_i$; $\mathrm{Var}(Y_i) = n\theta_i(1 - \theta_i)$.

Continuous Distributions

Uniform$(a, b)$: $f(y) = \frac{1}{b-a}$, $a \le y \le b$. Mean $\frac{a+b}{2}$; variance $\frac{(b-a)^2}{12}$; m.g.f. $\frac{e^{bt} - e^{at}}{(b-a)t}$, $t \ne 0$.

Exponential$(\theta)$: $f(y) = \frac{1}{\theta} e^{-y/\theta}$, $y > 0$; $\theta > 0$. Mean $\theta$; variance $\theta^2$; m.g.f. $\frac{1}{1 - \theta t}$, $t < 1/\theta$.

$N(\mu, \sigma^2)$ or $G(\mu, \sigma)$: $f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2 / 2\sigma^2}$, $-\infty < y < \infty$. Mean $\mu$; variance $\sigma^2$; m.g.f. $e^{\mu t + \sigma^2 t^2 / 2}$.

Chi-squared$(k)$: $f(y) = \frac{1}{2^{k/2}\, \Gamma(k/2)}\, y^{k/2 - 1} e^{-y/2}$, $y > 0$; $k > 0$, where $\Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\, dx$. Mean $k$; variance $2k$; m.g.f. $(1 - 2t)^{-k/2}$, $t < 1/2$.

Student $t(k)$: $f(y) = c_k \left(1 + \frac{y^2}{k}\right)^{-(k+1)/2}$, where $c_k = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\, \Gamma\left(\frac{k}{2}\right)}$; $k > 0$. Mean $0$ if $k > 1$; variance $\frac{k}{k-2}$ if $k > 2$; m.g.f. undefined.
Formulae

$$\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - n\bar y^2, \qquad S_{xx} = \sum_{i=1}^{n}(x_i - \bar x)^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) = \sum_{i=1}^{n}(x_i - \bar x)\, y_i$$

$$s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat\alpha - \hat\beta x_i\right)^2 = \frac{1}{n-2}\left(S_{yy} - \hat\beta S_{xy}\right)$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Pivotals/Test Statistics

Random variable and its distribution:

(Ȳ − μ)/(σ/√n) ~ Gaussian G(0, 1)
(n − 1)S²/σ² ~ Chi-squared, df = n − 1
(Ȳ − μ)/(S/√n) ~ Student t, df = n − 1
(β̃ − β)/(σ/√S_xx) ~ Gaussian G(0, 1), where β̃ = S_xY/S_xx = Σ (x_i − x̄)Y_i / S_xx
(β̃ − β)/(S_e/√S_xx) ~ Student t, df = n − 2
(α̃ − α)/[S_e (1/n + x̄²/S_xx)^(1/2)] ~ Student t, df = n − 2
[μ̃(x) − μ(x)]/[σ (1/n + (x − x̄)²/S_xx)^(1/2)] ~ Gaussian G(0, 1)
[μ̃(x) − μ(x)]/[S_e (1/n + (x − x̄)²/S_xx)^(1/2)] ~ Student t, df = n − 2
[Y − μ̃(x)]/[S_e (1 + 1/n + (x − x̄)²/S_xx)^(1/2)] ~ Student t, df = n − 2
(n − 2)S_e²/σ² ~ Chi-squared, df = n − 2
[Ȳ₁ − Ȳ₂ − (μ₁ − μ₂)]/[S_p (1/n₁ + 1/n₂)^(1/2)] ~ Student t, df = n₁ + n₂ − 2
(n₁ + n₂ − 2)S_p²/σ² ~ Chi-squared, df = n₁ + n₂ − 2
Approximate Pivotals

(θ̃ − θ)/[θ̃(1 − θ̃)/n]^(1/2) ~ G(0, 1) approximately, where θ̃ = Y/n and Y ~ Binomial(n, θ)
(θ̃ − θ)/(θ̃/n)^(1/2) ~ G(0, 1) approximately, where θ̃ = Ȳ for a sample Y₁, ..., Yₙ from Poisson(θ)
[Figure: probability density function of the N(0,1) distribution, illustrating F(x) = P(Z ≤ x)]

Probabilities for the standard Normal distribution N(0,1): entries are F(z) = P(Z ≤ z)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.50000 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.52790 0.53188 0.53586
0.1  0.53983 0.54380 0.54776 0.55172 0.55567 0.55962 0.56356 0.56750 0.57142 0.57534
0.2  0.57926 0.58317 0.58706 0.59095 0.59484 0.59871 0.60257 0.60642 0.61026 0.61409
0.3  0.61791 0.62172 0.62552 0.62930 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4  0.65542 0.65910 0.66276 0.66640 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5  0.69146 0.69497 0.69847 0.70194 0.70540 0.70884 0.71226 0.71566 0.71904 0.72240
0.6  0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.75490
0.7  0.75804 0.76115 0.76424 0.76730 0.77035 0.77337 0.77637 0.77935 0.78230 0.78524
0.8  0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9  0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0  0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1  0.86433 0.86650 0.86864 0.87076 0.87286 0.87493 0.87698 0.87900 0.88100 0.88298
1.2  0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3  0.90320 0.90490 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4  0.91924 0.92073 0.92220 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5  0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6  0.94520 0.94630 0.94738 0.94845 0.94950 0.95053 0.95154 0.95254 0.95352 0.95449
1.7  0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.96080 0.96164 0.96246 0.96327
1.8  0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9  0.97128 0.97193 0.97257 0.97320 0.97381 0.97441 0.97500 0.97558 0.97615 0.97670
2.0  0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.98030 0.98077 0.98124 0.98169
2.1  0.98214 0.98257 0.98300 0.98341 0.98382 0.98422 0.98461 0.98500 0.98537 0.98574
2.2  0.98610 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.98840 0.98870 0.98899
2.3  0.98928 0.98956 0.98983 0.99010 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4  0.99180 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5  0.99379 0.99396 0.99413 0.99430 0.99446 0.99461 0.99477 0.99492 0.99506 0.99520
2.6  0.99534 0.99547 0.99560 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7  0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.99720 0.99728 0.99736
2.8  0.99744 0.99752 0.99760 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9  0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0  0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.99900
3.1  0.99903 0.99906 0.99910 0.99913 0.99916 0.99918 0.99921 0.99924 0.99926 0.99929
3.2  0.99931 0.99934 0.99936 0.99938 0.99940 0.99942 0.99944 0.99946 0.99948 0.99950
3.3  0.99952 0.99953 0.99955 0.99957 0.99958 0.99960 0.99961 0.99962 0.99964 0.99965
3.4  0.99966 0.99968 0.99969 0.99970 0.99971 0.99972 0.99973 0.99974 0.99975 0.99976
3.5  0.99977 0.99978 0.99978 0.99979 0.99980 0.99981 0.99981 0.99982 0.99983 0.99983
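The table above can be regenerated from the error function. A quick Python check using only the standard library (the notes themselves use R, where pnorm plays the same role):

```python
from math import erf, sqrt

def phi(z):
    """CDF of the standard Normal distribution, F(z) = P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Spot-check a few entries of the table (rounded to 5 decimals)
print(round(phi(0.00), 5))  # 0.5
print(round(phi(1.96), 5))  # 0.975
print(round(phi(2.58), 5))  # 0.99506
```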
Quantiles of the standard Normal distribution: entries are z(p) = F⁻¹(p)

p     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.5  0.0000 0.0251 0.0502 0.0753 0.1004 0.1257 0.1510 0.1764 0.2019 0.2275
0.6  0.2533 0.2793 0.3055 0.3319 0.3585 0.3853 0.4125 0.4399 0.4677 0.4959
0.7  0.5244 0.5534 0.5828 0.6128 0.6433 0.6745 0.7063 0.7388 0.7722 0.8064
0.8  0.8416 0.8779 0.9154 0.9542 0.9945 1.0364 1.0803 1.1264 1.1750 1.2265
0.9  1.2816 1.3408 1.4051 1.4758 1.5548 1.6449 1.7507 1.8808 2.0537 2.3263
Chi-squared Quantiles
Entries are the values x with P(W ≤ x) = p, where W ~ χ²(df).

df \ p   0.005   0.01   0.025   0.05    0.1     0.2     0.3     0.4
1        0.000   0.000  0.001   0.004   0.016   0.064   0.148   0.275
2        0.010   0.020  0.051   0.103   0.211   0.446   0.713   1.022
3        0.072   0.115  0.216   0.352   0.584   1.005   1.424   1.869
4        0.207   0.297  0.484   0.711   1.064   1.649   2.195   2.753
5        0.412   0.554  0.831   1.146   1.610   2.343   3.000   3.656
6        0.676   0.872  1.237   1.635   2.204   3.070   3.828   4.570
7        0.989   1.239  1.690   2.167   2.833   3.822   4.671   5.493
8        1.344   1.647  2.180   2.733   3.490   4.594   5.527   6.423
9        1.735   2.088  2.700   3.325   4.168   5.380   6.393   7.357
10       2.156   2.558  3.247   3.940   4.865   6.179   7.267   8.296
11       2.603   3.054  3.816   4.575   5.578   6.989   8.148   9.237
12       3.074   3.571  4.404   5.226   6.304   7.807   9.034  10.182
13       3.565   4.107  5.009   5.892   7.042   8.634   9.926  11.129
14       4.075   4.660  5.629   6.571   7.790   9.467  10.821  12.078
15       4.601   5.229  6.262   7.261   8.547  10.307  11.721  13.030
16       5.142   5.812  6.908   7.962   9.312  11.152  12.624  13.983
17       5.697   6.408  7.564   8.672  10.085  12.002  13.531  14.937
18       6.265   7.015  8.231   9.391  10.865  12.857  14.440  15.893
19       6.844   7.633  8.907  10.117  11.651  13.716  15.352  16.850
20       7.434   8.260  9.591  10.851  12.443  14.578  16.266  17.809
25      10.520  11.524 13.120  14.611  16.473  18.940  20.867  22.616
30      13.787  14.953 16.791  18.493  20.599  23.364  25.508  27.442
35      17.192  18.509 20.569  22.465  24.797  27.836  30.178  32.282
40      20.707  22.164 24.433  26.509  29.051  32.345  34.872  37.134
45      24.311  25.901 28.366  30.612  33.350  36.884  39.585  41.995
50      27.991  29.707 32.357  34.764  37.689  41.449  44.313  46.864
60      35.534  37.485 40.482  43.188  46.459  50.641  53.809  56.620
70      43.275  45.442 48.758  51.739  55.329  59.898  63.346  66.396
80      51.172  53.540 57.153  60.391  64.278  69.207  72.915  76.188
90      59.196  61.754 65.647  69.126  73.291  78.558  82.511  85.993
100     67.328  70.065 74.222  77.929  82.358  87.945  92.129  95.808

df \ p   0.5     0.6     0.7     0.8     0.9     0.95    0.975   0.99    0.995
1        0.455   0.708   1.074   1.642   2.706   3.842   5.024   6.635    7.879
2        1.386   1.833   2.408   3.219   4.605   5.992   7.378   9.210   10.597
3        2.366   2.946   3.665   4.642   6.251   7.815   9.348  11.345   12.838
4        3.357   4.045   4.878   5.989   7.779   9.488  11.143  13.277   14.860
5        4.352   5.132   6.064   7.289   9.236  11.070  12.833  15.086   16.750
6        5.348   6.211   7.231   8.558  10.645  12.592  14.449  16.812   18.548
7        6.346   7.283   8.383   9.803  12.017  14.067  16.013  18.475   20.278
8        7.344   8.351   9.525  11.030  13.362  15.507  17.535  20.090   21.955
9        8.343   9.414  10.656  12.242  14.684  16.919  19.023  21.666   23.589
10       9.342  10.473  11.781  13.442  15.987  18.307  20.483  23.209   25.188
11      10.341  11.530  12.899  14.631  17.275  19.675  21.920  24.725   26.757
12      11.340  12.584  14.011  15.812  18.549  21.026  23.337  26.217   28.300
13      12.340  13.636  15.119  16.985  19.812  22.362  24.736  27.688   29.819
14      13.339  14.685  16.222  18.151  21.064  23.685  26.119  29.141   31.319
15      14.339  15.733  17.322  19.311  22.307  24.996  27.488  30.578   32.801
16      15.338  16.780  18.418  20.465  23.542  26.296  28.845  32.000   34.267
17      16.338  17.824  19.511  21.615  24.769  27.587  30.191  33.409   35.718
18      17.338  18.868  20.601  22.760  25.989  28.869  31.526  34.805   37.156
19      18.338  19.910  21.689  23.900  27.204  30.144  32.852  36.191   38.582
20      19.337  20.951  22.775  25.038  28.412  31.410  34.170  37.566   39.997
25      24.337  26.143  28.172  30.675  34.382  37.652  40.646  44.314   46.928
30      29.336  31.316  33.530  36.250  40.256  43.773  46.979  50.892   53.672
35      34.336  36.475  38.859  41.778  46.059  49.802  53.203  57.342   60.275
40      39.335  41.622  44.165  47.269  51.805  55.758  59.342  63.691   66.766
45      44.335  46.761  49.452  52.729  57.505  61.656  65.410  69.957   73.166
50      49.335  51.892  54.723  58.164  63.167  67.505  71.420  76.154   79.490
60      59.335  62.135  65.227  68.972  74.397  79.082  83.298  88.379   91.952
70      69.334  72.358  75.689  79.715  85.527  90.531  95.023 100.430  104.210
80      79.334  82.566  86.120  90.405  96.578 101.880 106.630 112.330  116.320
90      89.334  92.761  96.524 101.050 107.570 113.150 118.140 124.120  128.300
100     99.334 102.950 106.910 111.670 118.500 124.340 129.560 135.810  140.170
Student t Quantiles
Entries are the values x with P(T ≤ x) = p, where T ~ t(df).

df \ p   0.6     0.7     0.8     0.9     0.95    0.975   0.99    0.995   0.999    0.9995
1       0.3249  0.7265  1.3764  3.0777  6.3138 12.7062 31.8205 63.6567 318.3088 636.6192
2       0.2887  0.6172  1.0607  1.8856  2.9200  4.3027  6.9646  9.9248  22.3271  31.5991
3       0.2767  0.5844  0.9785  1.6377  2.3534  3.1824  4.5407  5.8409  10.2145  12.9240
4       0.2707  0.5686  0.9410  1.5332  2.1318  2.7764  3.7469  4.6041   7.1732   8.6103
5       0.2672  0.5594  0.9195  1.4759  2.0150  2.5706  3.3649  4.0321   5.8934   6.8688
6       0.2648  0.5534  0.9057  1.4398  1.9432  2.4469  3.1427  3.7074   5.2076   5.9588
7       0.2632  0.5491  0.8960  1.4149  1.8946  2.3646  2.9980  3.4995   4.7853   5.4079
8       0.2619  0.5459  0.8889  1.3968  1.8595  2.3060  2.8965  3.3554   4.5008   5.0413
9       0.2610  0.5435  0.8834  1.3830  1.8331  2.2622  2.8214  3.2498   4.2968   4.7809
10      0.2602  0.5415  0.8791  1.3722  1.8125  2.2281  2.7638  3.1693   4.1437   4.5869
11      0.2596  0.5399  0.8755  1.3634  1.7959  2.2010  2.7181  3.1058   4.0247   4.4370
12      0.2590  0.5386  0.8726  1.3562  1.7823  2.1788  2.6810  3.0545   3.9296   4.3178
13      0.2586  0.5375  0.8702  1.3502  1.7709  2.1604  2.6503  3.0123   3.8520   4.2208
14      0.2582  0.5366  0.8681  1.3450  1.7613  2.1448  2.6245  2.9768   3.7874   4.1405
15      0.2579  0.5357  0.8662  1.3406  1.7531  2.1314  2.6025  2.9467   3.7328   4.0728
16      0.2576  0.5350  0.8647  1.3368  1.7459  2.1199  2.5835  2.9208   3.6862   4.0150
17      0.2573  0.5344  0.8633  1.3334  1.7396  2.1098  2.5669  2.8982   3.6458   3.9651
18      0.2571  0.5338  0.8620  1.3304  1.7341  2.1009  2.5524  2.8784   3.6105   3.9216
19      0.2569  0.5333  0.8610  1.3277  1.7291  2.0930  2.5395  2.8609   3.5794   3.8834
20      0.2567  0.5329  0.8600  1.3253  1.7247  2.0860  2.5280  2.8453   3.5518   3.8495
21      0.2566  0.5325  0.8591  1.3232  1.7207  2.0796  2.5176  2.8314   3.5272   3.8193
22      0.2564  0.5321  0.8583  1.3212  1.7171  2.0739  2.5083  2.8188   3.5050   3.7921
23      0.2563  0.5317  0.8575  1.3195  1.7139  2.0687  2.4999  2.8073   3.4850   3.7676
24      0.2562  0.5314  0.8569  1.3178  1.7109  2.0639  2.4922  2.7969   3.4668   3.7454
25      0.2561  0.5312  0.8562  1.3163  1.7081  2.0595  2.4851  2.7874   3.4502   3.7251
26      0.2560  0.5309  0.8557  1.3150  1.7056  2.0555  2.4786  2.7787   3.4350   3.7066
27      0.2559  0.5306  0.8551  1.3137  1.7033  2.0518  2.4727  2.7707   3.4210   3.6896
28      0.2558  0.5304  0.8546  1.3125  1.7011  2.0484  2.4671  2.7633   3.4082   3.6739
29      0.2557  0.5302  0.8542  1.3114  1.6991  2.0452  2.4620  2.7564   3.3962   3.6594
30      0.2556  0.5300  0.8538  1.3104  1.6973  2.0423  2.4573  2.7500   3.3852   3.6460
40      0.2550  0.5286  0.8507  1.3031  1.6839  2.0211  2.4233  2.7045   3.3069   3.5510
50      0.2547  0.5278  0.8489  1.2987  1.6759  2.0086  2.4033  2.6778   3.2614   3.4960
60      0.2545  0.5272  0.8477  1.2958  1.6706  2.0003  2.3901  2.6603   3.2317   3.4602
70      0.2543  0.5268  0.8468  1.2938  1.6669  1.9944  2.3808  2.6479   3.2108   3.4350
80      0.2542  0.5265  0.8461  1.2922  1.6641  1.9901  2.3739  2.6387   3.1953   3.4163
90      0.2541  0.5263  0.8456  1.2910  1.6620  1.9867  2.3685  2.6316   3.1833   3.4019
100     0.2540  0.5261  0.8452  1.2901  1.6602  1.9840  2.3642  2.6259   3.1737   3.3905
>100    0.2535  0.5247  0.8423  1.2832  1.6479  1.9647  2.3338  2.5857   3.1066   3.3101
APPENDIX A: ANSWERS TO ASSORTED PROBLEMS
Chapter 1

1.1 (a) The average and median become a + bȳ and a + bm̂ respectively. (b) No relation in general, but if all y_i ≥ 0 then median(v_i) = m̂². (c) As a rule Σ(y_i − m̂) ≠ 0, but Σ(y_i − ȳ) = 0. (d) ȳ(y₀) = (nȳ + y₀)/(n + 1) → ∞ as y₀ → ∞. (e) Suppose n is odd. Then m̂(y₀) = y₍₍ₙ₊₁₎/₂₎ if y₀ > y₍₍ₙ₊₁₎/₂₎, so it does not change as y₀ → ∞.

1.2 (a) Both are multiplied by |b|. (c) As y₀ increases to infinity, so does the sample standard deviation s. (d) Once y₀ is larger than Q(0.75), it has no effect on the interquartile range as it increases.

1.3 The sample skewness and kurtosis remain unchanged.

1.4 For the revenues, sample mean = (7)(2500) + 1000 = 18500, sample variance = (7)²(5500)² = (38500)², range = (7)(7500) = 52500.

1.5 (b) ȳ = 2.014, median = 2.3 (c) s = 4.3047, IQR = 5.1 (d) p̂ = 0.6184 (f) Q̂(0.6) = 5.7, 0.03408

1.6 The empirical c.d.f. is constructed by first ordering the data (smallest to largest) to obtain the order statistic: 0.01 0.39 0.43 0.45 0.52 0.63 0.72 0.76 0.85 0.88. The empirical c.d.f. is shown in Figure 12.1.

1.9 (b) Five-number summary for female coyotes: 71.0 85.5 89.75 93.5 102.5. Five-number summary for male coyotes: 78.0 87.0 92.0 96.0 105.0.
(c) Female coyotes: x̄ = 89.24, s₁² = 42.87887. Male coyotes: ȳ = 92.06, s₂² = 44.83586.
[Figure 12.1: the empirical c.d.f. of the data]
Chapter 2

2.1 θ̂ = 41/10; σ̂² = 0.000275.

2.2 L(θ) = Π_{i=1}^n (θ + 1) y_i^θ = (θ + 1)^n (Π_{i=1}^n y_i)^θ, so
(d/dθ) l(θ) = n/(1 + θ) + Σ_{i=1}^n log(y_i) = 0;
solving, θ̂ = −n/Σ_{i=1}^n log(y_i) − 1.
2.3 (a) L(θ) = C(100, 10) θ^10 (1 − θ)^90 (b) L(θ) = C(99, 9) θ^10 (1 − θ)^90. In both cases, if we maximize over θ we obtain θ̂ = 0.1.

2.4 θ̂ = (2x₁ + x₂)/(2n)
2.5 (a) P(MM) = P(FF) = (1/2)(1/2)(1 + θ) = (1 + θ)/4 and P(MF) = 1 − 2(1 + θ)/4 = (1 − θ)/2.
(b) The likelihood is
L(θ) = [n!/(n₁! n₂! n₃!)] [(1 + θ)/4]^(n₁) [(1 + θ)/4]^(n₂) [(1 − θ)/2]^(n₃), where n = n₁ + n₂ + n₃,
or more simply
L*(θ) = (1 + θ)^(n₁+n₂) (1 − θ)^(n₃).
Maximizing L*(θ) gives θ̂ = (n₁ + n₂ − n₃)/n. With n₁ = 16, n₂ = 16 and n₃ = 18, θ̂ = 0.28.
2.6 (a) If there is adequate mixing of the tagged animals, the number of tagged animals caught in the second round is a random sample selected without replacement, and so follows a hypergeometric distribution (see the Stat 230 Notes).
(b)
L(N + 1)/L(N) = (N + 1 − k)(N + 1 − n) / [(N + 1 − k − n + y)(N + 1)]
and L(N) reaches its maximum within an integer of kn/y.
(c) The model requires sufficient mixing between captures so that the second stage is a random sample. If they are herd animals this model will not fit well.
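The ratio L(N + 1)/L(N) above shows the likelihood climbs until N + 1 passes kn/y. A small Python sketch that maximizes L(N) by direct search; the counts k, n, y below are made up purely for illustration:

```python
from math import comb

def capture_recapture_mle(k, n, y, n_max=5000):
    """Maximize the hypergeometric likelihood
    L(N) = C(k, y) C(N - k, n - y) / C(N, n) over the population size N."""
    best_N, best_L = None, -1.0
    for N in range(max(k, n, k + n - y), n_max + 1):
        L = comb(k, y) * comb(N - k, n - y) / comb(N, n)
        if L > best_L:
            best_N, best_L = N, L
    return best_N

# k = 100 tagged, n = 50 caught in the second round, y = 10 of them tagged
print(capture_recapture_mle(100, 50, 10))  # within an integer of kn/y = 500
```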
2.7 The joint p.d.f. of the observations y₁, y₂, ..., yₙ is given by
Π_{i=1}^n f(y_i; θ) = Π_{i=1}^n (2y_i/θ) exp(−y_i²/θ) = 2^n (Π_{i=1}^n y_i) θ^(−n) exp[−(1/θ) Σ_{i=1}^n y_i²], θ > 0,
so that
l(θ) = −n log(θ) − (1/θ) Σ_{i=1}^n y_i², θ > 0.
Solving
l′(θ) = −n/θ + (1/θ²) Σ_{i=1}^n y_i² = 0
gives θ̂ = (1/n) Σ_{i=1}^n y_i².
2.8 (a) The probability that a randomly selected family has k children is π_k = θ^k, k = 1, 2, ..., and π₀ = (1 − 2θ)/(1 − θ). The joint distribution of (Y₀, Y₁, ...), the numbers of families with 0, 1, ... children respectively, is Multinomial(n; π₀, π₁, ...). Therefore
P(Y₀ = f₀, Y₁ = f₁, ...) = [n!/(f₀! f₁! ...)] π₀^(f₀) π₁^(f₁) ...
and, up to an additive constant,
l(θ) = f₀ log[(1 − 2θ)/(1 − θ)] + T log(θ), where T = Σ_{k≥1} k f_k.
Solving l′(θ) = 0 leads to the quadratic 2Tθ² − (f₀ + 3T)θ + T = 0, so
θ̂ = {(f₀ + 3T) − [(f₀ + 3T)² − 8T²]^(1/2)}/(4T).
(b) We assume that the probability that a randomly selected family has k children is θ^k, k = 1, 2, .... Suppose for simplicity there are N different families, where N is very large. Then the number of families that have y children is N × (probability a family has y children) = Nθ^y for y = 1, 2, ..., there is a total of yNθ^y children in families of y children, and a total of Σ_{y=1}^∞ yNθ^y children altogether. Therefore the probability that a randomly chosen child comes from a family with y children is
π_y = yθ^y / Σ_{y=1}^∞ yθ^y = c y θ^y, y = 1, 2, ....
Note that Σ_{y=1}^∞ y θ^(y−1) = 1/(1 − θ)², so Σ_{y=1}^∞ yθ^y = θ/(1 − θ)² and therefore c = (1 − θ)²/θ.
[Figure 12.2: frequency histogram of the heights of elderly women; Height (140 to 185 cm) on the horizontal axis, Density on the vertical axis]

2.9 θ̂ = Σ_{i=1}^n y_i / Σ_{i=1}^n t_i
2.10 (a) The frequency histogram is shown in Figure 12.2. The data appear to be approximately Normally distributed.
(b) The sample mean is ȳ = 159.77 and the sample standard deviation is s = 6.03. The number of observations in the interval (ȳ − s, ȳ + s) = (153.75, 165.80) was 244 or 69.5%, and the number of observations in the interval (ȳ − 2s, ȳ + 2s) = (147.72, 171.83) was 334 or 95.2%, very close to what one would expect if the data were Normally distributed with these parameters.
(c) The interquartile range for the data on elderly women is IQR = q(0.75) − q(0.25) = 164 − 156 = 8. It is easy to see from the Normal tables that if Y is a N(μ, σ²) random variable then P(μ − 0.675σ < Y < μ + 0.675σ) = 0.5. It follows that for the Normal distribution the interquartile range is IQR = 2(0.675σ) = 1.35σ. Notice that for these data IQR = 1.33s, so this relationship is almost exact.
(d) The five-number summary for the data is y₍₁₎, q(0.25), q(0.5), q(0.75), y₍ₙ₎ = 142, 156, 160, 164, 178.
(e) The boxplot in Figure 12.3 resembles that for the Normal with approximately equal quantiles and symmetry.
(f) The qqplot in Figure 12.4 is approximately linear, indicating that the data are approximately Normally distributed. The step-like behaviour of the plot is due to the rounding of the data to the nearest cm.
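The Normal facts used in (b) and (c) are easy to verify numerically. A short Python check using only the standard library (in the course's R one would use pnorm and qnorm):

```python
from statistics import NormalDist

# For any Normal distribution the IQR is a fixed multiple of sigma:
# q(0.75) - q(0.25) = 2(0.6745) sigma = 1.349 sigma
Z = NormalDist()  # standard Normal, mu = 0, sigma = 1
iqr = Z.inv_cdf(0.75) - Z.inv_cdf(0.25)
print(round(iqr, 3))  # about 1.349

# Coverage of mu +/- sigma and mu +/- 2 sigma under the Normal model,
# to compare with the observed 69.5% and 95.2% for the height data
print(round(Z.cdf(1) - Z.cdf(-1), 3))  # about 0.683
print(round(Z.cdf(2) - Z.cdf(-2), 3))  # about 0.954
```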
2.11 (a) μ̂ = 1.744, σ̂ = 0.0664 (M); μ̂ = 1.618, σ̂ = 0.0636 (F)

[Figure 12.3: boxplot of the heights. Figure 12.4: Normal qqplot of the heights (sample quantiles against standard Normal quantiles)]
(b) 1.659 and 1.829 (M); 1.536 and 1.670 (F)
(c) 0.098 (M) and 0.0004 (F)
(d) 11/150 = 0.073 (M); 0 (F)
2.12 The qqplots are given in Figures 12.5 and 12.6. Note that the qqplot for Y = log(X) is far more linear, indicating that Y = log(X) is much closer to the Normal distribution.
[Figures 12.5 and 12.6: qqplots of X and of Y = log(X) against theoretical Normal quantiles]
2.13 The likelihood is
L(α, θ) = [100!/(20! 15! 22! 43!)] (αθ)^20 [α(1 − θ)]^15 [(1 − α)θ]^22 [(1 − α)(1 − θ)]^43,
so that, up to an additive constant, l(α, θ) = 35 log(α) + 65 log(1 − α) + 42 log(θ) + 58 log(1 − θ). Setting the partial derivatives equal to zero gives α̂ = 35/100 and θ̂ = 42/100. The expected frequencies
100 α̂θ̂, 100 α̂(1 − θ̂), 100 (1 − α̂)θ̂, 100 (1 − α̂)(1 − θ̂)
can be compared with the observed 20, 15, 22, 43. The differences are of the order of 5 or so. This is not too far (measured, for example, in terms of Binomial(100, 0.2) standard deviations) from the theoretical frequencies, so the model may fit.
2.14 (a)
P(Y > C; θ) = ∫_C^∞ (1/θ) e^(−y/θ) dy = e^(−C/θ)
(b) For the ith piece that failed at time y_i < C, the contribution to the likelihood is (1/θ)e^(−y_i/θ). For those pieces that survive past time C, the contribution is the probability of this event, P(Y > C; θ) = e^(−C/θ). Therefore the likelihood is the product of these:
L(θ) = [Π_{i=1}^k (1/θ) e^(−y_i/θ)] (e^(−C/θ))^(n−k)
and
l(θ) = −k log(θ) − (1/θ)[Σ_{i=1}^k y_i + (n − k)C].
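Setting l′(θ) = 0 in the censored-data log likelihood above gives θ̂ = [Σ y_i + (n − k)C]/k. A minimal Python sketch; the failure times and censoring point below are invented for illustration:

```python
# MLE for exponential lifetimes censored at C:
# theta_hat = (sum of observed failure times + (n - k) * C) / k,
# where k of the n pieces failed before time C.
failures = [2.0, 5.0, 1.5, 3.5]   # observed y_i < C (hypothetical data)
C, n = 10.0, 7                    # 3 pieces still running at time C
k = len(failures)
theta_hat = (sum(failures) + (n - k) * C) / k
print(theta_hat)  # (12.0 + 30.0) / 4 = 10.5
```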
2.15 The likelihood function is
L(α, β) = Π_{i=1}^n [λ(x_i)]^(y_i) e^(−λ(x_i)) / y_i!, where λ(x) = e^(α+βx),
and, ignoring the terms y_i! which do not contain the parameters, the log likelihood is
l(α, β) = Σ_{i=1}^n [y_i(α + βx_i) − e^(α+βx_i)].
Setting the partial derivatives equal to zero gives
Σ_{i=1}^n (y_i − e^(α+βx_i)) = 0 and Σ_{i=1}^n x_i (y_i − e^(α+βx_i)) = 0.
For a given set of data we can solve this system of equations numerically but not explicitly.
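One way to solve this system numerically is Newton's method (Fisher scoring), sketched below in Python. The three data points are made up so that the exact solution happens to be (α, β) = (0, log 2):

```python
from math import exp, log

# Score equations for the Poisson regression lambda(x) = exp(alpha + beta x),
# solved by Newton's method. Toy data, chosen so the exact MLE is (0, log 2).
x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 4.0]

a, b = 0.0, 0.0
for _ in range(50):
    w = [exp(a + b * xi) for xi in x]                         # fitted means
    U0 = sum(yi - wi for yi, wi in zip(y, w))                 # dl/d alpha
    U1 = sum(xi * (yi - wi) for xi, yi, wi in zip(x, y, w))   # dl/d beta
    J11 = sum(w)
    J12 = sum(xi * wi for xi, wi in zip(x, w))
    J22 = sum(xi * xi * wi for xi, wi in zip(x, w))
    det = J11 * J22 - J12 * J12
    a += (J22 * U0 - J12 * U1) / det   # Newton step: theta += J^{-1} U
    b += (-J12 * U0 + J11 * U1) / det

print(round(a, 4), round(b, 4))  # converges to (0, log 2) = (0, 0.6931)
```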
2.17 (a) The sample median of the distribution is approximately 0.5.
(b) The IQR for these data is approximately 0.4.
(c) The frequency histogram of the data would be approximately symmetric about the sample mean.
(d) The frequency histogram would most resemble a Uniform probability density function.
Chapter 3
3.1 (a) The Problem is to determine the proportion of eligible voters who plan to vote and, of those, the proportion who plan to support the party. This is a descriptive Problem.
(b) The target population is all eligible voters. This would include those eligible voters in all regions and those with/without telephone numbers on the list.
(c) A variate is whether or not an eligible voter plans to vote, or whether or not an eligible voter supports the party.
(d) The study population is all eligible voters on the list.
(e) The sample is the 1104 eligible voters who responded to the questions.
(f) A possible source of study error is that the polling firm only called eligible voters in urban areas. Urban eligible voters may have different views than rural eligible voters; this is a difference between the target and study populations. Eligible voters with phones may have different views than those without.
Chapter 4
4.3 (a) The method which resulted in the interval [42.8, 47.8] would contain the true value of the parameter in 95% of random samples drawn from this population.
(b) An approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone is
[p̂ − 1.96 √(p̂(1 − p̂)/n), p̂ + 1.96 √(p̂(1 − p̂)/n)]
= [0.45 − 1.96 √((0.45)(0.55)/1000), 0.45 + 1.96 √((0.45)(0.55)/1000)]
= [0.41917, 0.48083].
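The interval in (b) is a one-line computation; a Python sketch:

```python
from math import sqrt

def prop_ci(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a Binomial proportion."""
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lo, hi = prop_ci(0.45, 1000)
print(round(lo, 5), round(hi, 5))  # 0.41917 0.48083
```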
4.4 (b)
P(−0.03 ≤ θ̃ − θ ≤ 0.03) = P(0.37n ≤ Y ≤ 0.43n) ≈ P(−0.03n/√(0.24n) ≤ Z ≤ 0.03n/√(0.24n)), where Z ~ N(0, 1).
Since P(−1.96 ≤ Z ≤ 1.96) = 0.95, we need 0.03n/√(0.24n) ≥ 1.96, that is, n ≥ (1.96/0.03)²(0.24) ≈ 1025.
4.5 The distribution is Binomial(n, p) and the approximate 99% confidence interval based on a Normal approximation is given by
p̂ ± 2.58 √(p̂(1 − p̂)/n), or (64/29000) ± 2.58 √((64/29000)(28936/29000)/29000).

4.6 (a) The probability a group of k samples tests negative is p = (1 − θ)^k, so the probability that x out of n groups test negative is
C(n, x) p^x (1 − p)^(n−x), x = 0, 1, ..., n.
(b) θ̂ = 1 − (x/n)^(1/k)
4.7 (a) For the data n₁ = 16, n₂ = 16 and n₃ = 18, θ̂ = 0.28 and
R(θ) = (1 + θ)^32 (1 − θ)^18 / [(1 + 0.28)^32 (1 − 0.28)^18], 0 < θ < 1.
(b) R(θ) = θ^17 (1 − θ)^33 / [(0.34)^17 (1 − 0.34)^33], 0 < θ < 1. We use
uniroot(function(x)(x^17*(1-x)^33/c-0.1),lower=0,upper=0.3), where c = (0.34)^17 (1 − 0.34)^33, to obtain the 10% likelihood interval [0.209, 0.490]. Since this interval is much narrower than the interval in (a), this indicates θ is more accurately determined by the second model.

[Figure: relative likelihood function R(θ)]
The likelihood function is
L(θ) = Π_{i=1}^n (1/2) θ³ t_i² exp(−θ t_i) = (1/2^n)(Π_{i=1}^n t_i²) θ^(3n) exp(−θ Σ_{i=1}^n t_i),
or more simply
L(θ) = θ^(3n) exp(−θ Σ_{i=1}^n t_i), θ > 0,
so that
l(θ) = 3n log(θ) − θ Σ_{i=1}^n t_i, θ > 0, and dl/dθ = 3n/θ − Σ_{i=1}^n t_i.
Solving l′(θ) = 0, we obtain the maximum likelihood estimate θ̂ = 3n/Σ_{i=1}^n t_i. The relative likelihood function is
R(θ) = (θ/θ̂)^(3n) exp[3n(1 − θ/θ̂)], θ > 0.

[Figure: relative likelihood function R(θ)]

The mean is
E(T) = ∫₀^∞ t (1/2) θ³ t² e^(−θt) dt = (1/(2θ)) ∫₀^∞ x³ e^(−x) dx (by letting x = θt) = (1/(2θ)) Γ(4) = 3!/(2θ) = 3/θ,
so a 95% confidence interval for the mean 3/θ is
[3/0.0768, 3/0.0463] = [39.1, 64.8].
(e)
p(θ) = P(T ≥ 50) = ∫_{50}^∞ (1/2) θ³ t² e^(−θt) dt = (1250θ² + 50θ + 1) e^(−50θ).
Confidence intervals are [0.408, 0.738] (using the model) and [0.332, 0.768] (using the Binomial). The Binomial model involves fewer assumptions but gives a less precise (wider) interval.
(Note: the first confidence interval can be obtained directly from the approximate confidence interval for θ in part (c).)
4.10 (a)
∫₀^∞ f(y) dy = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ y^(k/2−1) e^(−y/2) dy = [1/Γ(k/2)] ∫₀^∞ x^(k/2−1) e^(−x) dx (letting x = y/2) = Γ(k/2)/Γ(k/2) = 1.
(b)
M(t) = E[e^(Yt)] = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ y^(k/2−1) e^(−(1/2 − t)y) dy
= [1/(2^(k/2) Γ(k/2))] Γ(k/2) (1/2 − t)^(−k/2) (by letting x = (1/2 − t)y)
= [2^(k/2) (1/2 − t)^(k/2)]^(−1) = (1 − 2t)^(−k/2), t < 1/2.
Therefore
M′(0) = E[Y] = k(1 − 2t)^(−k/2 − 1)|_{t=0} = k,
M″(0) = E[Y²] = k(k + 2)(1 − 2t)^(−k/2 − 2)|_{t=0} = k² + 2k,
and Var(Y) = k² + 2k − k² = 2k.
[Figure: Chi-squared probability density functions f(y) for k = 5, 10, 25]
4.11 (a) θ̂ = (1/n) Σ_{i=1}^n y_i
(b) P(X ≤ m) = 1 − e^(−m/θ) = 0.5, so m = −θ log(0.5) = θ log 2, and the confidence interval is [197.9, 361.3], obtained by using the confidence interval for θ from (a).
4.12 (a) Using the cumulative distribution function of the Exponential distribution, F(y) = 1 − e^(−y/θ), we have
P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ wθ/2) = 1 − e^(−w/2).
Taking the derivative d/dw on both sides gives the probability density function (1/2)e^(−w/2), which can be easily verified as the p.d.f. of a χ²(2) random variable.
(b) Another way of doing this is by using the moment generating function (try to do it using the hint on your own):
M_U(t) = E[e^(Ut)] = E[exp((2t/θ) Σ_{i=1}^n Y_i)] = Π_{i=1}^n E[exp((2t/θ)Y_i)] = Π_{i=1}^n (1 − 2t)^(−1) = (1 − 2t)^(−n),
which is the m.g.f. of a χ²(2n) random variable.
(c)
P(43.19 ≤ W ≤ 79.08) = P(43.19 ≤ (2/θ) Σ_{i=1}^n Y_i ≤ 79.08) = P((2/79.08) Σ_{i=1}^n Y_i ≤ θ ≤ (2/43.19) Σ_{i=1}^n Y_i) = 0.9,
and substituting the observed value of Σ y_i gives a 90% confidence interval for θ.

4.13 (a) If Y is the number who support this information then Y ~ Binomial(n, θ). An approximate 95% confidence interval is given by
0.7 ± 1.96 √((0.7)(0.3)/200), or [0.637, 0.764].
4.14 (a) Qqplots of the weights for females and males separately are shown in Figures
12.10 and 12.11. In both cases the points lie reasonably along a straight line so it is
reasonable to assume a Normal model for each data set.
[Figures 12.10 and 12.11: Normal qqplots of the weights (sample data against standard Normal quantiles) for females and for males]
Note that since the value for t(149) is not available in the t-tables we used P(−1.9647 ≤ T ≤ 1.9647) = 0.95 where T ~ t(100). Using R we obtain P(−1.976 ≤ T ≤ 1.976) = 0.95 where T ~ t(149). The intervals will not change substantially. We note that the intervals have no values in common; the mean weight for males is higher than the mean weight for females.
(c) To obtain confidence intervals for the standard deviations we note that the pivotal quantity (n − 1)S²/σ² = 149S²/σ² has a χ²(149) distribution and the Chi-squared tables stop at 100 degrees of freedom. Since E(149S²/σ²) = 149 and Var(149S²/σ²) = 2(149) = 298, we use 149S²/σ² ~ N(149, 298) approximately to construct an approximate 95% confidence interval given by
[√(149s²/(149 + 1.96√298)), √(149s²/(149 − 1.96√298))] = [√(149s²/182.8348), √(149s²/115.1652)].
For the females (s² = 156.4806) we obtain
[√(149(156.4806)/182.8348), √(149(156.4806)/115.1652)] = [√127.5228, √202.4536] = [11.2926, 14.2286].
For the males (s² = 165.2162) we obtain
[√(149(165.2162)/182.8348), √(149(165.2162)/115.1652)] = [√134.6418, √213.7558] = [11.6035, 14.6204].
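These normal-approximation endpoints can be reproduced directly; a Python sketch using the values quoted above:

```python
from math import sqrt

# Normal approximation to the chi-squared(149) pivotal 149 S^2 / sigma^2:
# mean 149, variance 2(149) = 298, so the central 95% region is 149 +/- 1.96 sqrt(298)
lo_q = 149 - 1.96 * sqrt(298)
hi_q = 149 + 1.96 * sqrt(298)
print(round(lo_q, 4), round(hi_q, 4))  # 115.1652 182.8348

def sigma_ci(s2, n=150):
    """Approximate 95% CI for sigma from the sample variance s2 (n - 1 = 149)."""
    return sqrt((n - 1) * s2 / hi_q), sqrt((n - 1) * s2 / lo_q)

print(tuple(round(v, 4) for v in sigma_ci(156.4806)))  # females: about (11.2926, 14.2286)
print(tuple(round(v, 4) for v in sigma_ci(165.2162)))  # males: about (11.6035, 14.6204)
```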
4.15 (a) θ̂ = (1/n) Σ_{i=1}^n x_i = 20 (b) θ̂ = (1/n) Σ_{i=1}^n y_i. The relative likelihood functions are plotted in Figure 12.12.
Figure 12.12: Relative Likelihood Functions for Company A and Company B Photocopiers
(c) Note that in the following, the likelihood interval is found by solving
Λ(θ) = −2 log R(θ) ≤ 3.841, or R(θ) ≥ 0.147.
For Company A (the curve on the right hand side of the graph):
Λ(θ) = 2[(−12θ̂ + 240 log θ̂) − (−12θ + 240 log θ)]
For Company B:
Λ(θ) = 2[(−12θ̂ + 140 log θ̂) − (−12θ + 140 log θ)]
(d) The maximum likelihood estimate of μ is μ̂ = ȳ = 734.4/12 = 61.2, and
[(1/12) Σ_{i=1}^{12} (y_i − ȳ)²]^(1/2) = (162.12/12)^(1/2) = 3.68.
Since (Ȳ − μ)/(S/√12) has a t distribution with 11 degrees of freedom and P(−2.20 ≤ T ≤ 2.20) = 0.95, a 95% confidence interval for μ is
[ȳ − 2.20s/√12, ȳ + 2.20s/√12], where s = [(1/11) Σ_{i=1}^{12} (y_i − ȳ)²]^(1/2) = (162.12/11)^(1/2) = 3.84.
(e) From Chi-squared tables P(W ≤ 19.68) = 0.95 and P(W ≤ 4.57) = 0.05, where W ~ χ²(11). Since
0.90 = P(4.57 ≤ (1/σ²) Σ_{i=1}^{12} (Y_i − Ȳ)² ≤ 19.68) = P((1/19.68) Σ (Y_i − Ȳ)² ≤ σ² ≤ (1/4.57) Σ (Y_i − Ȳ)²),
a 90% confidence interval for σ² is
[(1/19.68) Σ (y_i − ȳ)², (1/4.57) Σ (y_i − ȳ)²] = [162.12/19.68, 162.12/4.57] = [8.24, 35.48].
4.17 (a) [296.91, 303.47]; [4.55, 9.53]
(b) [286.7, 313.7]
4.18 Since (Y − Ȳ)/[S(1 + 1/n)^(1/2)] ~ t(n − 1), find a such that P(T ≥ a) = p/2 where T ~ t(n − 1). Since
1 − p = P(−a ≤ (Y − Ȳ)/[S(1 + 1/n)^(1/2)] ≤ a) = P(Ȳ − aS(1 + 1/n)^(1/2) ≤ Y ≤ Ȳ + aS(1 + 1/n)^(1/2)),
the prediction interval is
[ȳ − as(1 + 1/n)^(1/2), ȳ + as(1 + 1/n)^(1/2)],
where s = [(1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)²]^(1/2) = [(1/11) Σ (y_i − ȳ)²]^(1/2) = 9.3974.
For these data the interval is [82.6148, 125.6519].
4.19 Use σ̂² ≈ 80/10 = 8 and d = 1. Hence n ≥ (1.96σ/d)² = (1.96)²(8) = 30.73. Since 10 observations have already been taken, the manufacturer should be advised to take at least 21 more additional measurements. This calculation depends on an estimate of σ, and the value 1.96 is from the Normal tables, so the manufacturer should be advised to take more than 21 additional measurements. Since (2)²(9) = 36, an additional 26 measurements seems reasonable.
4.20 (a) The likelihood function is
L(θ) = Π_{i=1}^n (y_i/θ²) e^(−y_i/θ) = θ^(−2n) (Π_{i=1}^n y_i) exp[−(1/θ) Σ_{i=1}^n y_i], θ > 0,
or more simply
L(θ) = θ^(−2n) exp[−(1/θ) Σ_{i=1}^n y_i], θ > 0.
Then
l(θ) = −2n log(θ) − (1/θ) Σ_{i=1}^n y_i, θ > 0,
and
l′(θ) = −2n/θ + (1/θ²) Σ_{i=1}^n y_i, θ > 0.
Now l′(θ) = 0 if θ = (1/(2n)) Σ_{i=1}^n y_i = ȳ/2. (Note: a First Derivative Test could be used to confirm that l(θ) has an absolute maximum at θ = ȳ/2.) The maximum likelihood estimate of θ is θ̂ = ȳ/2.
(b)
E(Ȳ) = E[(1/n) Σ_{i=1}^n Y_i] = (1/n) Σ_{i=1}^n E(Y_i) = (1/n)(n)(2θ) = 2θ
and
Var(Ȳ) = Var[(1/n) Σ_{i=1}^n Y_i] = (1/n²) Σ_{i=1}^n Var(Y_i) = (1/n²)(n)(2θ²) = 2θ²/n.
(c) Since Y₁, Y₂, ..., Yₙ are independent and identically distributed random variables, by the Central Limit Theorem
(Ȳ − 2θ)/(θ√(2/n)) has approximately a N(0, 1) distribution.
(d) If Z ~ N(0, 1), P(−1.96 ≤ Z ≤ 1.96) = 0.95. Therefore
0.95 ≈ P(−1.96 ≤ (Ȳ − 2θ)/(θ√(2/n)) ≤ 1.96) = P(Ȳ − 1.96θ√(2/n) ≤ 2θ ≤ Ȳ + 1.96θ√(2/n)) = P(Ȳ/2 − 0.98θ√(2/n) ≤ θ ≤ Ȳ/2 + 0.98θ√(2/n)).
(e) For these data the maximum likelihood estimate of θ is θ̂ = ȳ/2 = 88.92/(2 × 18) = 2.47 and the approximate 95% confidence interval for θ is
[2.47 − 0.98(2.47)√(2/18), 2.47 + 0.98(2.47)√(2/18)] = [1.66, 3.28].
4.21 (c)
Var(θ̃) = Var[(X + 4Y)/5] = (1/25)[Var(X) + 16 Var(Y)] = (1/25)[1/10 + 16(0.25/10)] = 0.02,
so the standard deviation is √0.02 = 0.1414. Also
Var[(X + Y)/2] = (1/4)[Var(X) + Var(Y)] = (1/4)[1/10 + 0.25/10] = 0.03125,
with standard deviation √0.03125 = 0.1768.
4.22 (a) The graphs of a t distribution with 1 degree of freedom (k = 1) and with 5 degrees of freedom (k = 5) appear in Figure 12.13.

[Figure 12.13: probability density functions of the t distribution for k = 1 and k = 5]

(b) X follows a t distribution with 15 degrees of freedom. (d) P(−a ≤ X ≤ a) = 0.98; the tail probability is 0.01, so from the t-distribution table, a = 2.602. (e) P(X ≤ b) = 0.95; from the t-distribution table, b = 1.753.
Chapter 5

5.2 (a) Assume Y ~ Binomial(n, θ) and H0: θ = 1/5. (b) n = 20. We can use D = |Y − 4|. Then p-value = P(Y = 0; θ = 1/5) + P(Y ≥ 8; θ = 1/5) = 0.032, so there is evidence against the model or H0. (c) n = 100, D = |Y − 20|, and the observed value is d = |32 − 20|. Then the p-value = P(Y ≥ 32; θ = 1/5) + P(Y ≤ 8; θ = 1/5) = 0.006, so there is strong evidence against the model or against H0: θ = 1/5.
5.3 A test statistic that could be used is the mean of the generated sample. The mean should be close to 0.5 if the random number generator is working well.

5.4 (a) The likelihood ratio statistic gives Λ = 0.0885 and the corresponding p-value gives no evidence against the hypothesis H0: θ = 100.

5.5 (a) ȳ = 44.405, s² = 5.734, P(−2.09 ≤ T ≤ 2.09) = 0.95 and the 95% confidence interval for μ is
[ȳ − 2.09s/√20, ȳ + 2.09s/√20] = [43.28, 45.53].
(b) The confidence interval for σ is [1.82, 3.50].
5.6 To test the hypothesis H0: μ = 105 we use the test statistic
D = |Ȳ − 105|/(S/√12), where S = [(1/11) Σ_{i=1}^{12} (Y_i − Ȳ)²]^(1/2) and T = (Ȳ − 105)/(S/√12) ~ t(11).
The observed value of D is
d = |ȳ − 105|/(s/√12) = |104.13 − 105|/(9.40/√12) = 0.3194
and
p-value = P(D ≥ d; H0) = P(|T| ≥ 0.3194) = 2[1 − P(T ≤ 0.3194)] = 2(0.3777) = 0.7554 (calculated using R).
Alternatively, using the t-tables in the Course Notes we have P(T ≤ 0.54) = 0.7, so
2(1 − 0.7) ≤ p-value ≤ 2(1 − 0.6), or 0.6 ≤ p-value ≤ 0.8.
In either case, since the p-value is much larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: μ = 105. (Note: This does not imply the hypothesis is true!)
5.7 (a) If H0: θ = 3 is true then Σ_{i=1}^{25} Y_i has a Poisson distribution with mean 75. We use D = |Σ_{i=1}^{25} Y_i − 75|. The observed value is
d = |Σ_{i=1}^{25} y_i − 75| = |51 − 75| = 24
and
p-value = P(D ≥ 24; H0) = P(Σ Y_i ≤ 51) + P(Σ Y_i ≥ 99)
= Σ_{x=0}^{51} 75^x e^(−75)/x! + Σ_{x=99}^∞ 75^x e^(−75)/x! = 1 − Σ_{x=52}^{98} 75^x e^(−75)/x! = 0.006716.
Since 0.001 < 0.006716 < 0.01, we would conclude that, based on the data, there is strong evidence against the hypothesis H0: θ = 3.
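The exact p-value in (a) can be reproduced by summing Poisson probabilities; a Python sketch using only the standard library:

```python
from math import exp, factorial

# With H0 true, the total of the 25 counts ~ Poisson(75); the observed total
# is 51, so D = |total - 75| >= 24 means total <= 51 or total >= 99.
def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), by direct summation."""
    return sum(mu**x * exp(-mu) / factorial(x) for x in range(k + 1))

p_value = poisson_cdf(51, 75) + (1 - poisson_cdf(98, 75))
print(round(p_value, 6))  # about 0.006716
```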
(b) If each Y_i has a Poisson distribution with mean θ independently, then by the Central Limit Theorem
(Ȳ − E(Ȳ))/√(Var(Ȳ)) = (Ȳ − θ)/√(θ/n)
has approximately a N(0, 1) distribution. The observed value of D = |Ȳ − 3| is
|ȳ − 3| = |51/25 − 3| = |2.04 − 3| = 0.96
and also
|ȳ − 3|/√(3/25) = 0.96/√(3/25) = 2.77.
Therefore
p-value = P(D ≥ d; H0) ≈ P(|Z| ≥ 0.96/√(3/25)) = 2[1 − P(Z ≤ 2.77)] = 0.005584, where Z ~ N(0, 1).
The approximate p-value of 0.005584 is close to the p-value 0.006716 calculated in (a), which is the exact p-value. Since we are only interested in whether the p-value is bigger than 0.1, or between 0.1 and 0.05, etc., we are not as worried about how good the approximation is. In this example the conclusion about H0 is the same for the approximate p-value as it is for the exact p-value.
(d) The observed value of the likelihood ratio test statistic for testing H0: θ = 3 is
Λ(3) = 2(25)[2.04 log(2.04/3) + 3 − 2.04] = 8.6624
and
p-value = P(Λ(3) ≥ 8.6624; H0) ≈ 2[1 − P(Z ≤ 2.94)] = 0.00328, where Z ~ N(0, 1).
The p-value is again very small, so the conclusion is the same.
5.8 (a) Since this question is tedious to do manually, here is the R code for it:
> data<-c(70,75,63,59,81,92,75,100,63,58)
> L<-dmultinom(data,prob=data)        # likelihood at the m.l.e.
> L1<-dmultinom(data,prob=rep(1,10))  # likelihood under H0 (equal probabilities)
> lambda<-2*(log(L)-log(L1))
> pvalue<-1-pchisq(lambda,9)
Λ = 23.605, p-value = 0.005.
(b) p-value = 1 − (0.995)⁶ = 0.03
5.9 θ̂ = (0.18, 0.5, 0.32) by maximizing the likelihood
L(θ) = θ₁^18 θ₂^50 θ₃^32 subject to θ₁ + θ₂ + θ₃ = 1.
Under H0 the likelihood is
L(θ) = (θ²)^18 [2θ(1 − θ)]^50 [(1 − θ)²]^32
and
Λ = 2 log[L(θ̂)/L(θ₀(θ̂))] = 0.04,
with p-value = P(U > 0.04) ≈ 0.84 where U ~ χ²(1). Note that there are essentially two parameters in the full model and one in the model under H0, so the difference is 1. There is no evidence against the model H0.

5.10 θ̂ = (18/10125, 41/5580, 29/16050, 31/8435, 146/14200) by the usual maximum likelihood method, and θ̂₀ = 265/54390 by maximizing the likelihood with the constraint. The observed value of the likelihood ratio statistic is Λ = 3.73 and p-value = P(W ≥ 3.73) = 0.44 where W ~ χ²(4). There is no evidence that the rates are not equal.
5.11 (a) μ̃ = Ȳ and σ̃² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)²; under H0: μ = μ₀, μ̂₀ = μ₀ and σ̃₀² = (1/n) Σ_{i=1}^n (Y_i − μ₀)², so that Λ = n log(σ̃₀²/σ̃²). Since
Σ_{i=1}^n (Y_i − μ₀)² = Σ_{i=1}^n (Y_i − Ȳ)² + n(Ȳ − μ₀)²,
we have
Λ = n log[1 + n(Ȳ − μ₀)²/Σ_{i=1}^n (Y_i − Ȳ)²] = n log[1 + T²/(n − 1)]
where T = (Ȳ − μ₀)/(S/√n) ~ t(n − 1).
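The sum-of-squares decomposition used above is worth checking numerically on any data set; a minimal Python sketch (the data are arbitrary):

```python
# Numerical check of the decomposition used in 5.11:
# sum (y_i - mu0)^2 = sum (y_i - ybar)^2 + n (ybar - mu0)^2
y = [4.1, 5.3, 3.8, 6.0, 5.1]   # arbitrary illustrative data
mu0 = 4.0
n = len(y)
ybar = sum(y) / n
lhs = sum((yi - mu0) ** 2 for yi in y)
rhs = sum((yi - ybar) ** 2 for yi in y) + n * (ybar - mu0) ** 2
print(abs(lhs - rhs) < 1e-12)  # True
```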
Chapter 6

6.1 (a) β̂ = S_xy/S_xx = 1.138, α̂ = ȳ − β̂x̄ = 80.78
6.2
(a) (i) Here n = 30; ^ = 1:667; ^ = 2:838; se = 0:0515, Sxx = 0:2244; x = 0:3: Recall
this was a regression of the form E(Yi ) = + x1i where x1i = x2i ; and xi =
bolt diameter. The average of the values of x1i is the average of the squared
diameters, x1 = 0:11:
+ (0:35)2
^ (0:35) = ^ + ^ (0:35)2 = 1:667 + 2:838(0:35)2 = 2:015
(0:35) =
^ (0:35)
From t-table, P ( 2:048
interval is
1:667 + 2:838(0:35)2
2:015
0:195
ase
v
u
u
t1
h
(0:35)2
Sxx
x1
i2
(ii) This asks for an interval for the strength of a single bolt at a given x value (not the average of all bolts at a certain x value). Hence it asks for a prediction interval. A 95% prediction interval for the strength is

μ̂(x) ± 2.048 se √(1 + 1/n + (x1 − x̄1)²/Sxx)
= 2.015 ± 2.048(0.0515) √(1 + 1/30 + [(0.35)² − x̄1]²/0.224)
= 2.015 ± 0.107.

Note that this interval is wider since it is required to capture a single observation at x = 0.35 rather than the average of many.
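These interval calculations can be verified numerically. The sketch below is an illustration only (it assumes Python with scipy is available; the manual's own code examples use R) and recomputes the estimate and both half-widths from the summary statistics quoted above.

```python
# Numerical check of the 6.2 intervals, from the quoted summary statistics.
from math import sqrt
from scipy.stats import t

n, alpha_hat, beta_hat, se = 30, 1.667, 2.838, 0.0515
Sxx, x1bar = 0.2244, 0.11
x1 = 0.35 ** 2                      # the regressor is the squared bolt diameter

mu_hat = alpha_hat + beta_hat * x1  # estimated mean strength at x = 0.35
c = t.ppf(0.975, n - 2)             # about 2.048 for 28 degrees of freedom

half_ci = c * se * sqrt(1 / n + (x1 - x1bar) ** 2 / Sxx)      # for the mean
half_pi = c * se * sqrt(1 + 1 / n + (x1 - x1bar) ** 2 / Sxx)  # for one bolt

print(round(mu_hat, 3), round(half_ci, 4), round(half_pi, 3))
```

The prediction half-width (0.107) exceeds the confidence half-width, which is exactly the point of the remark above.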
6.3 (b) β̂ = 0.02087, α̂ = −1.022, s = 0.008389
6.4 (a) β̂ = 0.9999, α̂ = −0.1527, se = 0.3870. A 95% confidence interval for β is [0.9845, 1.0152]. For testing H0: β = 1 the p-value = 0.99 and there is no evidence against the hypothesis.

(b) A 95% confidence interval for α is [−0.5587, 0.2533]. For testing H0: α = 0 the p-value = 0.76 and there is no evidence against the hypothesis.

(c) Scatterplot and residual plots indicate that the model fits the data well.
6.5
6.6 (a) β̂ = 0.9947 and the fitted model is y = 0.9947x.

(b) Scatterplot indicates that the model fits the data well.

(c) se = 0.3831 and a 95% confidence interval for β is [0.9879, 1.0015]. For testing H0: β = 1 the p-value = 0.12 and there is no evidence against the hypothesis.

(d) Residual plots indicate that the model fits the data well.
6.7 (a) Maximum likelihood estimates of the parameters:

β̂ = Sxy/Sxx = 6175/2155.2 = 2.8652,  α̂ = ȳ − β̂ x̄ = 30.36.

(b) An unbiased estimate of σ² is

se² = [1/(n − 2)][Syy − β̂ Sxy] = (1/44)[24801 − (2.8652)(6175)] = 161.55.

(c) To test H0: β = 0: if H0 is true then T = (β̂ − 0)/(se/√Sxx) ~ t(44). The observed value is

d = |2.8652 − 0| / √(161.55/2155.2) = 10.4

and p-value = P(|T| > 10.4) < 0.001 (you can look at 40 and 50 df for confirmation) so there is very strong evidence against H0.
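The 6.7 computations can be reproduced from the summary quantities alone. A sketch (Python with scipy assumed, for illustration):

```python
# Slope estimate, unbiased variance estimate and t test for H0: beta = 0 (6.7).
from math import sqrt
from scipy.stats import t

n, Sxx, Sxy, Syy = 46, 2155.2, 6175.0, 24801.0
beta_hat = Sxy / Sxx                       # least squares slope
se2 = (Syy - beta_hat * Sxy) / (n - 2)     # unbiased estimate of sigma^2
d = abs(beta_hat) / sqrt(se2 / Sxx)        # observed value of |T|
p = 2 * (1 - t.cdf(d, n - 2))              # two-sided p-value
print(round(beta_hat, 4), round(se2, 2), round(d, 2), p < 0.001)
```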
6.8 (a) The scatterplot and residual plots indicate that the model fits the data well.
6.9
(a) We assume that the study population is the set of all Grade 3 students who
are being taught the same curriculum. (For example in Ontario all Grade 3
students must be taught the same Grade 3 curriculum set out by the Ontario
Government.) The parameter 1 represents the mean score on the DRP test
if all Grade 3 students in the study population took part in the new directed
readings activities for an 8-week period.
The parameter 2 represents the mean score on the DRP test for all Grade 3
students in the study population without the directed readings activities.
The parameter represents the standard deviation of the DRP scores for all
Grade 3 students in the study population which is assumed to be the same
whether the students take part in the new directed readings activities or not.
(b) The qqplot of the responses for the treatment group and the qqplot of the responses for the control group are given in Figures 12.14 and 12.15. Looking at
these plots we see that the points lie reasonably along a straight line in both plots
and so we would conclude that the normality assumptions seem reasonable.
Figure 12.14: Normal Qqplot of the Responses for the Treatment Group
(c) For the given data

sp = [(2423.2381 + 6469.7391)/(21 + 23 − 2)]^(1/2) = 14.5512.

Also P(T ≤ 2.018) = 0.975 where T ~ t(42).
Figure 12.15: Normal Qqplot for the Responses in the Control Group
A 95% confidence interval for μ1 − μ2 is

ȳ1 − ȳ2 ± (2.018) sp √(1/21 + 1/23)
= 51.4762 − 41.5217 ± (2.018)(14.5512) √(1/21 + 1/23)
= [1.0916, 18.8173].
(d) To test the hypothesis of no difference between the means, that is, to test the hypothesis H0: μ1 = μ2, we use the discrepancy measure

D = |Ȳ1 − Ȳ2 − 0| / [Sp √(1/n1 + 1/n2)]

where

T = (Ȳ1 − Ȳ2 − 0) / [Sp √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)

assuming H0 is true. The observed value is

d = |ȳ1 − ȳ2 − 0| / [sp √(1/n1 + 1/n2)] = |51.4762 − 41.5217 − 0| / [14.5512 √(1/21 + 1/23)] = 2.2666

and

p-value = P(D ≥ d; H0) = P(|T| ≥ 2.2666) = 2[1 − P(T ≤ 2.2666)] = 0.02863.

Since the p-value is less than 0.05 there is evidence against the hypothesis H0: μ1 = μ2 based on the data.
Although the data suggest there is a difference between the treatment group and the control group we cannot conclude that the difference is due to the new directed readings activities. The difference could simply be due to the differences in the two Grade 3 classes. Since randomization was not used to determine which student received the treatment and which student was in the control group, the difference in the DRP scores could have existed before the treatment was applied.
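The pooled two-sample analysis in 6.9 (c) and (d) can be verified from the summary statistics. A sketch (Python with scipy assumed, for illustration):

```python
# Pooled two-sample t interval and test for the DRP data (6.9).
from math import sqrt
from scipy.stats import t

n1, n2 = 21, 23
ybar1, ybar2 = 51.4762, 41.5217
ss1, ss2 = 2423.2381, 6469.7391                 # within-group sums of squares

sp = sqrt((ss1 + ss2) / (n1 + n2 - 2))          # pooled standard deviation
scale = sp * sqrt(1 / n1 + 1 / n2)
half = t.ppf(0.975, n1 + n2 - 2) * scale
ci = (ybar1 - ybar2 - half, ybar1 - ybar2 + half)

d = abs(ybar1 - ybar2) / scale                  # observed discrepancy
p = 2 * (1 - t.cdf(d, n1 + n2 - 2))
print(round(sp, 4), [round(v, 3) for v in ci], round(d, 4), round(p, 5))
```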
6.10 [0:75; 11:25]
6.11 (a) The pooled estimate of variance is

sp = √[(209.02961 + 116.7974)/18] = 4.25.

From t tables, P(−1.734 < T < 1.734) = 0.90 where T ~ t(18). The 90% confidence interval takes the form

10.693 − 6.750 ± 1.734(4.25)√(1/n1 + 1/n2) = [0.647, 7.239].
(b) We test the hypothesis using the pivotal

D = |Ȳ1 − Ȳ2| / [Sp √(1/n1 + 1/n2)]

which, under H0, is distributed as |T| where T ~ t(18). The observed value of this statistic is

d = (10.693 − 6.750) / [4.25 √(1/10 + 1/10)] = 2.074

so the p-value is P(|T| > 2.074) > 2(0.025) = 0.05, so there is weak evidence against H0.
(c) We repeat the above using as data Zij = log(Yij). This time the sample means are 2.248 and 1.7950 and the sample variances 0.320 and 0.240 respectively, so sp = √[(0.320 + 0.240)/2] = 0.529. The observed value of the statistic is

d = (2.248 − 1.7950) / [0.529 √(1/10 + 1/10)] = 1.9148

and p-value = P(|T| > 1.91) ≈ 0.07 so there is even less evidence against H0.
(d) One could check this with qqplots for each of the variables Yij and Zij = log(Yij), although with such a small sample size these will be difficult to interpret.
6.12 [−0.011, 0.557]
6.13 (a) For the female coyotes we have ȳf = 89.24, s²f = 42.87887, nf = 40. For the male coyotes we have ȳm = 92.06, s²m = 44.83586, nm = 43. Since nf = 40 and nm = 43 are reasonably large, Ȳf has approximately a N(89.24, 42.87887/40) distribution and Ȳm has approximately a N(92.06, 44.83586/43) distribution. Therefore an approximate 95% confidence interval for μf − μm is given by

89.24 − 92.06 ± 1.96 √(42.87887/40 + 44.83586/43) = [−5.67, 0.03].

The value μf − μm = 0 is just inside the right-hand endpoint and the p-value for testing H0: μf − μm = 0 would be close to 0.05. There is weak evidence of a difference between mean length for male and female coyotes. Since the interval contains mostly negative values the data suggest the mean length for males is slightly larger.
(c) For separate confidence intervals we use

T = (Ȳ − μ)/(S/√n) ~ t(n − 1).

Female coyotes: Tf ~ t(39). Since P(−2.022 ≤ Tf ≤ 2.022) = 0.95 the 95% confidence interval is [87.15, 91.33].

Male coyotes: Tm ~ t(42). Since P(−2.018 ≤ Tm ≤ 2.018) = 0.95 the 95% confidence interval is [90.00, 94.12].
6.14 We assume that the observations for the Alcohol group are a random sample from a G(μ1, σ) distribution and that the observations for the Non-Alcohol group are a random sample from a G(μ2, σ) distribution. To see if there is any difference between the two groups we construct a 95% confidence interval for the mean difference in reaction times μ1 − μ2.
r
0:608 + 0:4380
2
sp =
= 0:2093:
22
295
Since P (T
y2
2:0739 (0:2093)
is
1
1
+
= [ 0:4064; 0:0520]:
12 12
(a) s = [(1/7)(1148.79875)]^(1/2) = 12.8107.

(b) A 95% confidence interval for μ is

[ȳ − 2.36 s/√n, ȳ + 2.36 s/√n]
= [−15.3375 − 2.36(12.8107)/√8, −15.3375 + 2.36(12.8107)/√8]
= [−15.3375 − 10.6891, −15.3375 + 10.6891]
= [−26.0266, −4.6484].
(c) To test the hypothesis of no difference due to the safety program, that is, to test the hypothesis H0: μ = 0, we use the discrepancy measure

D = |Ȳ − 0| / (S/√n)

where T = (Ȳ − μ)/(S/√n) ~ t(n − 1) if H0 is true. The observed value is

d = |ȳ − 0| / (s/√n) = |−15.3375 − 0| / (12.8107/√8) = 3.39

and

p-value = P(D ≥ d; H0) = P(|T| ≥ 3.39) = 2[1 − P(T ≤ 3.39)] = 0.012.

Since the p-value is between 0.01 and 0.05 there is reasonable evidence against the hypothesis H0: μ = 0 based on the data.
Since this experimental study was conducted as a matched pairs study, an analysis of the differences, yi = ai − bi, allows for a more precise comparison since differences between the 8 pairs have been eliminated. That is, by analyzing the differences we do not need to worry that there may have been large differences in the safety records between factories due to other variates such as differences in the management at the different factories, differences in the type of work being conducted at the factories, etc. Note however that a drawback to the study was that we were not told how the 8 factories were selected. To do the analysis above we have assumed that the 8 factories are a random sample from the study population of all similar size factories but we do not know if this is the case.
6.18 (a) Since two algorithms are each run on the same 20 sets of numbers we analyse the differences yi = yAi − yBi, i = 1, ..., 20. These differences are all positive, indicating strong evidence against H0: μ = 0 (p-value < 0.01).
(b) To check the Normality assumption we plot a qqplot of the differences. See Figure 12.16. The data lie reasonably along a straight line and therefore a Normal model is reasonable.
Figure 12.16: QQ plot of sample data versus standard Normal
Chapter 7

7.1 (a) The expected frequencies Eij for the 2 × 2 table (rust present/absent by rust-proofed/not rust-proofed) are computed as Eij = (row i total)(column j total)/100; for example E11 = 21, and the other Eij are computed in similar ways. The likelihood ratio statistic gives λ = 8.17 and the Pearson statistic gives d = 8.05. The p-value is about 0.004 in each case so there is strong evidence against H0.
7.2 If the probability of catching the cold is the same for each group, then it is estimated as 50/200 = 0.25, in which case the expected frequencies Ej in the four categories are 25, 75, 25, 75 respectively. The observed frequencies Yj are 20, 80, 30, 70. The likelihood ratio statistic is

λ = 2 Σ_{j=1}^{4} Yj log(Yj/Ej) = 2[20 log(20/25) + 80 log(80/75) + 30 log(30/25) + 70 log(70/75)] = 2.68.

The p-value is

P(U > 2.68) ≈ 0.1

where U ~ χ²(1), so there is no evidence against the hypothesis.
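The statistic can be computed directly from the counts; a sketch (Python with scipy assumed, for illustration):

```python
# Likelihood ratio statistic for the common-probability hypothesis (7.2).
from math import log
from scipy.stats import chi2

obs = [20, 80, 30, 70]
exp = [25, 75, 25, 75]          # expected counts when both groups share 0.25
lam = 2 * sum(y * log(y / e) for y, e in zip(obs, exp))
p = 1 - chi2.cdf(lam, 1)        # (2 - 1)(2 - 1) = 1 degree of freedom
print(round(lam, 2), round(p, 3))
```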
7.3 The total number of defectives is 274, where we have assumed that the last category consisted of exactly 6 defectives. Then the maximum likelihood estimate of θ, the proportion of defectives, is θ̂ = 274/3000 = 0.09133. We wish to test the hypothesis that the number of defectives in a box is Binomial(12, θ). Under this hypothesis and using the estimated value θ̂ = 0.091333 we obtain the expected numbers in each category, for example

E2 = 250 · C(12, 2) θ̂² (1 − θ̂)^10.
Number defective   0      1      2      3      4     5     6
Ei               79.21  95.54  52.82  17.70  4.00  0.64  0.08

The likelihood ratio statistic gives λ = 43.08262 and the Pearson statistic gives d = 70.71893. The p-value ≈ 0 (df = 7 − 1 − 1 = 5) in each case so there is strong evidence against the null hypothesis that the model is Binomial(12, θ).
However notice that the expected numbers in the last three categories are all less than 5. It is usually recommended in this case that we pool such categories until the expected values are roughly 5 or more, so that the above table may be replaced by

Number defective       0           1          2          3         ≥4
Observed (Expected)  103(79.21)  80(95.54)  31(52.82)  19(17.7)  17(4.72)

The observed value of the likelihood ratio statistic is then

λ = 2[103 log(103/79.21) + 80 log(80/95.54) + 31 log(31/52.82) + 19 log(19/17.7) + 17 log(17/4.72)] = 38.9.

How many degrees of freedom are there? The model with five categories has 4 degrees of freedom. However under the null hypothesis we had to estimate the parameter θ. The difference is 4 − 1 = 3, so in this case λ has 3 degrees of freedom. The p-value is P(W > 38.9) < 0.005 so there is very strong evidence that the Binomial model does not fit the data. The likely reason is that the defects tend to occur in batches when packed (so that there are more cartons with 0 defects than one would expect).
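The pooled-table computation above can be reproduced as follows (Python with scipy assumed, for illustration):

```python
# Likelihood ratio goodness-of-fit statistic for the pooled defectives table.
from math import log
from scipy.stats import chi2

obs = [103, 80, 31, 19, 17]
exp = [79.21, 95.54, 52.82, 17.70, 4.72]   # Binomial(12, 0.091333), last classes pooled
lam = 2 * sum(y * log(y / e) for y, e in zip(obs, exp))
df = 5 - 1 - 1                              # one estimated parameter costs a df
p = 1 - chi2.cdf(lam, df)
print(round(lam, 1), df, p < 0.005)
```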
7.4 For ease of computation, assume the values that are 5 or more are exactly 5 when calculating the maximum likelihood estimate for θ:

θ̂ = (1/n) Σ i fi = 230/200 = 1.15.

The expected frequencies are

Number of interruptions   0      1      2      3      4     ≥5
ei                      63.33  72.83  41.88  16.05  4.61  1.30

Since the last two expected frequencies are small we pool them, giving

Number of interruptions     0          1          2          3         ≥4
Observed (Expected)      64(63.33)  71(72.83)  42(41.88)  18(16.05)  5(5.91)

The observed value of the likelihood ratio statistic is

λ = 2[64 log(64/63.33) + 71 log(71/72.83) + 42 log(42/41.88) + 18 log(18/16.05) + 5 log(5/5.91)] = 0.43

and the Pearson statistic is

d = (64 − 63.33)²/63.33 + ... + (5 − 5.91)²/5.91 = 0.43

where W ~ χ²(3). The p-value is large and there is no evidence against H0 that the Poisson model fits the data.
7.5 (a) For n = 2, the likelihood function is

L2(θ2) = [(1 − θ2)²]^23 [2θ2(1 − θ2)]^44 [θ2²]^13,  0 < θ2 < 1

or more simply

L2(θ2) = θ2^(44+2(13)) (1 − θ2)^(2(23)+44) = θ2^70 (1 − θ2)^90,  0 < θ2 < 1

so θ̂2 = 70/160 = 0.4375.

For n = 3,

L3(θ3) = θ3^(1(25)+2(48)+3(13)) (1 − θ3)^(3(10)+2(25)+1(48)) = θ3^160 (1 − θ3)^128,  0 < θ3 < 1

so θ̂3 = 160/288 = 0.5556.

For n = 4,

L4(θ4) = θ4^(1(30)+2(34)+3(22)+4(5)) (1 − θ4)^(4(5)+3(30)+2(34)+1(22)) = θ4^184 (1 − θ4)^200,  0 < θ4 < 1

so θ̂4 = 184/384 = 0.4792.
The expected frequencies assuming the Binomial model are calculated using

enj = yn+ · C(n, j) θ̂n^j (1 − θ̂n)^(n−j),  j = 0, 1, ..., n;  n = 2, 3, 4

Litter size n   j = 0     j = 1     j = 2     j = 3     j = 4   Total litters yn+
2              25.3125   39.375    15.3125                          80
3               8.4280   31.6049   39.5062   16.4609                96
4               7.0643   25.9964   35.8751   22.0034   5.0608       96

For example, for litter size n = 2 the likelihood ratio statistic is

λ2 = 2 Σ_{j=0}^{2} y2j log(y2j/e2j) = 2[23 log(23/25.3125) + 44 log(44/39.375) + 13 log(13/15.3125)] = 1.11.
(b) The joint likelihood function for θ1, θ2, θ3, θ4 is

L(θ1, θ2, θ3, θ4) = Π_{n=1}^{4} Ln(θn),  0 < θn < 1, n = 1, 2, 3, 4

= θ1^12 (1 − θ1)^8 · θ2^70 (1 − θ2)^90 · θ3^160 (1 − θ3)^128 · θ4^184 (1 − θ4)^200.

If θ1 = θ2 = θ3 = θ4 = θ then

L(θ) = θ^(12+70+160+184) (1 − θ)^(8+90+128+200) = θ^426 (1 − θ)^426,  0 < θ < 1

so θ̂ = 426/852 = 0.5.
Under H0: θ1 = θ2 = θ3 = θ4 = 0.5 the expected frequencies are

enj = yn+ · C(n, j)(0.5)^n,  j = 0, 1, ..., n;  n = 1, 2, 3, 4

Litter size n   j = 0   j = 1   j = 2   j = 3   j = 4   Total litters yn+
1               10      10                                 20
2               20      40      20                         80
3               12      36      36      12                 96
4                6      24      36      24      6          96

The observed value of the likelihood ratio statistic is

λ = 2 Σn Σj ynj log(ynj/enj) = 2[8 log(8/10) + 12 log(12/10) + ... + 22 log(22/24) + 5 log(5/6)] = 14.27.
7.6 The observed numbers of observations between successive zeros, and their frequencies, are

# between 2 zeros   0  1  2  3  4  5  6  7  8  10  12  13  14  15  16  18  19  20  21  22  26
# of occurrences    6  4  9  3  5  2  2  3  2   2   1   1   1   1   1   1   1   1   1   2   1

If the Geometric model holds, the maximum likelihood estimate of θ is

θ̂ = 50/(50 + 348) = 0.1256

and the fitted probability function is f(j) = (0.1256)(1 − 0.1256)^j, j = 0, 1, ....
Grouping the observations so that the expected frequencies are roughly 5 or more gives seven categories with expected frequencies 6.28, 5.49, 9.0, 6.88, 5.26, 5.67, 11.42 (total 50). The observed value of the likelihood ratio statistic is λ = 1.96 and the observed value of the Pearson statistic is d = 1.95. The p-value ≈ P(W ≥ 1.96) ≈ 0.9 where W ~ χ²(5) and degrees of freedom = 7 − 1 − 1 = 5. There is no evidence against the hypothesis that the Geometric distribution is a good model for these data.
7.7 The expected frequencies and the row and column estimated probabilities are as follows:

Eij               Normal    Enlarged   Much Enlarged   α̂i
Carrier present    26.57      30.33        15.09       0.051502
Carrier absent    489.43     558.67       277.91       0.948498
β̂j              0.369099   0.421316     0.209585

α̂1 = (19 + 29 + 24)/1398 = 0.051502; the other values are computed similarly. The likelihood ratio statistic is 7.32 and p-value ≈ P(W ≥ 7.32) = 0.026 where W ~ χ²(2) so there is evidence against the hypothesis of independence.
7.8 The observed frequencies are:

yij              Tall wife   Medium wife   Short wife   Total
Tall husband        18           28            19         65
Medium husband      20           51            28         99
Short husband       12           25             9         46
Total               50          104            56        210
so for example e11 = (65)(50)/210 = 15.476 and the expected frequencies and the row and column estimated probabilities are as follows:

eij              Tall wife        Medium wife       Short wife       α̂i
Tall husband      15.476            32.191            17.333         65/210 = 0.310
Medium husband    23.571            49.029            26.400         99/210 = 0.471
Short husband     10.952            22.781            12.267         46/210 = 0.219
β̂j             50/210 = 0.238   104/210 = 0.495    56/210 = 0.267

The observed value of the likelihood ratio statistic is

λ = 2[18 log(18/15.476) + 20 log(20/23.571) + 12 log(12/10.952)
    + 28 log(28/32.191) + 51 log(51/49.029) + 25 log(25/22.781)
    + 19 log(19/17.333) + 28 log(28/26.400) + 9 log(9/12.267)] = 3.13

and p-value ≈ P(W ≥ 3.13) ≈ 0.55 where W ~ χ²(4) and df = (3 − 1)(3 − 1) = 4. There is no evidence against independence. Using the Pearson statistic,

d = (18 − 15.476)²/15.476 + ... = 2.9

and p-value ≈ P(W ≥ 2.9) = 0.6 where W ~ χ²(4), so again there is no evidence against the hypothesis of independence.
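Both statistics for the 7.8 table can be recomputed from the observed counts (Python with scipy assumed, for illustration):

```python
# Likelihood ratio and Pearson statistics for the husband/wife height table (7.8).
from math import log
from scipy.stats import chi2

obs = [[18, 28, 19],   # tall husband
       [20, 51, 28],   # medium husband
       [12, 25, 9]]    # short husband
n = sum(map(sum, obs))
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

lam = pearson = 0.0
for i in range(3):
    for j in range(3):
        e = row[i] * col[j] / n            # expected count under independence
        lam += 2 * obs[i][j] * log(obs[i][j] / e)
        pearson += (obs[i][j] - e) ** 2 / e

df = (3 - 1) * (3 - 1)
p = 1 - chi2.cdf(lam, df)
print(round(lam, 2), round(pearson, 2), round(p, 2))
```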
7.9 (a) The expected frequencies and estimated row and column probabilities are:

eij      3 boys    2 boys    2 girls    3 girls    α̂i
1        4.9844    8.1563     9.9688     5.8906    0.453125
2        6.0156    9.8438    12.0313     7.1094    0.546875
β̂j     0.171875  0.281250   0.343750   0.203125

(b) Under the Binomial(3, 0.5) model the probabilities for 3 boys, 2 boys, 2 girls, 3 girls are 0.125, 0.375, 0.375, 0.125, giving expected frequencies 8, 24, 24, 8. The observed value of the likelihood ratio statistic is λ = 5.44 and p-value ≈ P(W ≥ 5.44) = 0.14 where W ~ χ²(3). There is no evidence against the Binomial model.
7.10 (a) The observed frequencies are:

yij             Both   Mother   Father   Neither
Above Average     9       6       12        23
Below Average    23      10        6        13

The expected frequencies and the estimated row and column probabilities are:

eij             Both   Mother   Father   Neither   α̂i
Above Average    15       8        9        18     0.5
Below Average    15       8        9        18     0.5
β̂j             0.3     0.16     0.18      0.36

The observed value of the likelihood ratio statistic is λ = 10.8 and p-value ≈ P(W ≥ 10.8) ≈ 0.013 where W ~ χ²(3), so there is evidence against the hypothesis of independence.
(b) The observed frequencies depending on whether the mother is a smoker or non-smoker are as follows:

Mother smokes      Father smokes   Father non-smoker
Above Average            9                 6
Below Average           23                10

Mother non-smoker   Father smokes   Father non-smoker
Above Average            12                23
Below Average             6                13

For the first table, the likelihood ratio statistic gives λ = 0.4299469, p-value ≈ 0.51; for the second table, the likelihood ratio statistic gives λ = 0.040767, p-value ≈ 0.84. In both cases there is no evidence against the hypothesis of independence. So even given the smoking status of the mother it appears that the smoking habits of the father and the birthweight of the children are independent.
Chapter 8

8.1 (a) The likelihood ratio statistic is 480.65 so the p-value ≈ 0 and there is very strong evidence against independence.

8.3 (a) The likelihood ratio statistic gives λ = 112 and p-value ≈ 0.

(b) Only Program B shows any evidence of non-independence, and that is in the direction of a lower admission rate for males.
          Agreed   Disagreed   Total
Male        68        77        145
Female      42        63        105
Total      110       140        250
[2] (b) The sum of the relative frequencies equals _______.
[2] (c) For a random sample from an Exponential(θ) distribution, the value of θ can be estimated using the _______.
[2] (d) Suppose y(1), y(2), ..., y(99), y(100) are the ordered values of a dataset with y(1) = min(y1, ..., y100) and y(100) = max(y1, ..., y100). Suppose IQR = 3.85 is the interquartile range of the dataset. Then the IQR of the dataset y(1), ..., y(99), y(100) + 5 (that is, 5 is added only to the largest value) is equal to _______.
[2] (e) Suppose s² = 2.6 is the sample variance of the dataset y1, y2, ..., y100. Then the sample variance of the dataset y1 + 2, y2 + 2, ..., y100 + 2 (that is, 2 is added to every value) is _______.
[4] (f) The data y1, y2, ..., y100 is recorded in kilometers (km) and the sample mean and sample skewness are recorded. If we decide instead to record the data in meters (1 meter is 0.001 km) then the sample mean is changed by a factor of _______ and the sample skewness is changed by a factor of _______.
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person's insulin level, and (iv) if the
person has diabetes.
[3] a. This study is an example of (check only those that apply)
i. an experimental study because we need to experiment with the genes.
ii. an observational study because we are recording observations for each sampled unit.
iii. a probability model because probability is required to predict whether a
person will contract diabetes.
iv. a causative study because the diabetes causes the gene.
v. a response study because the patient responds to the clinician.
309
[3] b. The age of the subject is an example of (check only those that apply)
i. an explanatory variate because it explains how long the subject is in the
study.
ii. an explanatory variate because it may help to explain whether a given person
will contract diabetes.
iii. a non-Normal variate because subjects may lie about their age.
iv. a response variate because it responds to many different circumstances.
[3] c. The Plan step in PPDAC for this experiment includes (check only those that
apply)
i. the question of whether or not diabetes was related to the expression of the
gene.
ii. the sampling protocol or the procedure used to select the sample.
iii. the specication of the sample size.
iv. the questions the researchers wished to investigate.
v. a determination of the units that are available to be included in
the study.
[3] d. In the Problem step of PPDAC, we (check only those that apply)
310
The summary statistics for the test scores are:

Min   1st Quartile   Median   Mean    3rd Quartile   Max   Sample s.d.
30         40          45     44.02        49         57       6.65

[Figure: histogram and boxplot of the test scores (Baumann$post.test.3)]

Figure 11.3: Normal qq plot for test scores with superimposed line and confidence region
True
False
(b) The distribution has very large tails, too large to be consistent with the Normal
distribution.
True
False
(c) The sample skewness is positive.
True
False
(d) About half of the test scores fall outside the interval (40; 49).
True
False
(e) The shape of the Normal qqplot would change if 5 marks were added to each
test score.
True
False
[13] 5. [7] a. Suppose y1, y2, ..., y25 are the observed values in a random sample from the Poisson(θ) distribution. Find the maximum likelihood estimate of θ. Show all your steps.

[6] b. Suppose y1, y2, ..., y10 are the observed values in a random sample from the probability density function

f(y; θ) = (y/θ²) e^(−y/θ) for y > 0

where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.
          Agreed   Disagreed   Total
Male        68        77        145
Female      42        63        105
Total      110       140        250
(h) Describe an attribute of interest for the target population and provide an estimate based on the given data.

An attribute of interest is the proportion of the target population that agrees with the statement. The estimate is 110/250 or 44%.
[14] 2. Fill in the blanks below. You may use a numerical value or one of the following words
or phrases: sample skewness, sample kurtosis, sample variance, sample mean, relative
frequencies, frequencies, histogram, boxplot.
[2] a. A large positive value of the sample skewness indicates that the distribution
is not symmetric and the right tail is larger than the left.
[2] b. The sum of the relative frequencies equals 1.
[2] c. For a random sample from an Exponential(θ) distribution, the value of θ can be estimated using the sample mean.
[2] d. Suppose y(1) ; y(2) ; : : : ; y(99) ; y(100) are the ordered values of a dataset with
y(1) = min (y1 ; : : : ; y100 ) and y(100) = max (y1 ; :::; y100 ). Suppose IQR = 3:85 is
the interquartile range of the dataset. Then the IQR of the dataset y(1) ; : : : ; y(99) ;
y(100) + 5 (that is, 5 is added only to the largest value) is equal to 3.85 .
[2] e. Suppose s2 = 2:6 is the sample variance of the dataset y1 ; y2 ; : : : ; y100 . Then
the sample variance of the dataset y1 + 2; y2 + 2; : : : ; y100 + 2 (that is, 2 is added
to every value) is 2.6 .
[4] f. The data y1, y2, ..., y100 is recorded in kilometers (km) and the sample mean and sample skewness are recorded. If we decide instead to record the data in meters (1 meter is 0.001 km) then the sample mean is changed by a factor of 1000 and the sample skewness is changed by a factor of one (or the same).
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person's insulin level, and (iv) if the
person has diabetes.
p
p
[3] d. In the Problem step of PPDAC, we (check only those that apply)
i. solve the problem for the maximum likelihood estimate.
ii. list all problems that might be encountered in our analysis.
p
Min   1st Quartile   Median   Mean    3rd Quartile   Max   Sample s.d.
30         40          45     44.02        49         57       6.65

Based on these plots and statistics circle True or False for the following statements.
True
False
(b) The distribution has very large tails, too large to be consistent with the Normal
distribution.
True
False
(c) The sample skewness is positive.
True    False
(d) About half of the test scores fall outside the interval (40, 49).
True    False
(e) The shape of the Normal qqplot would change if 5 marks were added to each
test score.
True
False
[13] 5. [7] a. Suppose y1, y2, ..., y25 are the observed values in a random sample from the Poisson(θ) distribution. Find the maximum likelihood estimate of θ. Show all your steps.

L(θ) = Π_{i=1}^{n} (θ^yi e^(−θ))/yi! = θ^(Σ yi) e^(−nθ) Π_{i=1}^{n} (1/yi!)   (the constant Π 1/yi! is optional)

l(θ) = (Σ_{i=1}^{n} yi) log(θ) − nθ

l′(θ) = (1/θ) Σ_{i=1}^{n} yi − n = 0  for  θ = (1/n) Σ_{i=1}^{n} yi = ȳ

so the maximum likelihood estimate is θ̂ = ȳ.
[6] b. Suppose y1, y2, ..., y10 are the observed values in a random sample from the probability density function

f(y; θ) = (y/θ²) e^(−y/θ) for y > 0

where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.

L(θ) = Π_{i=1}^{n} (yi/θ²) e^(−yi/θ) = (Π yi) (1/θ^(2n)) exp(−(1/θ) Σ yi)  for θ > 0

or more simply

L(θ) = (1/θ^(2n)) exp(−nȳ/θ)  for θ > 0.

l(θ) = −2n log(θ) − nȳ/θ

l′(θ) = −2n/θ + nȳ/θ² = (n/θ²)(−2θ + ȳ) = 0  for θ = ȳ/2

so the maximum likelihood estimate is θ̂ = ȳ/2.
(v) By reference to the confidence interval, indicate what you know about the p-value for a test of the hypothesis H0: θ = 0.8.

(c) Suppose a Binomial experiment is conducted and the observed 95% confidence interval for θ is [0.1, 0.2]. This means (circle the letter for the correct answer):

A : The probability that θ lies in [0.1, 0.2] is 0.95.

B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ.
2. [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised for conducting research. Twenty rats are selected at random and fed a special diet. The weight gains (in grams) from birth to age 3 months of the rats fed this diet are:

63.4  55.6  68.3  73.2  52.0  63.9  64.5  60.7  62.3  63.9
55.8  60.2  59.3  60.5  62.4  67.1  75.8  66.6  72.1  66.7

For these data Σ_{i=1}^{20} yi = 1273.8 and Σ_{i=1}^{20} (yi − ȳ)² = 665.718. The model

Yi ~ N(μ, σ²) = G(μ, σ),  i = 1, ..., 20 independently

is assumed where μ and σ are unknown parameters.
(a) Comment on how reasonable the Gaussian model is for these data based on the qqplot below:

[Figure: QQ plot of sample data versus standard Normal]

(b) In the context of this study indicate what the parameters μ and σ represent.

(c) The maximum likelihood estimate of μ is _______________ and the maximum likelihood estimate of σ is __________________.
(You do not need to derive these estimates.)

(d) Let

T = (Ȳ − μ)/(S/√20)  where  S² = (1/19) Σ_{i=1}^{20} (Yi − Ȳ)².

The distribution of T is _______________.

(e) The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams. The p-value for testing the hypothesis H0: μ = 67 is between _____________ and _______________. What would you conclude about R.A.T. Chow's claim?
(f) Let W = (1/σ²) Σ_{i=1}^{20} (Yi − Ȳ)² and let a and b be such that P(W ≤ a) = 0.05 = P(W ≥ b). Use W, a and b to construct a 90% confidence interval for σ.

Suppose Y has the Exponential(θ) distribution with probability density function

f(y; θ) = (1/θ) e^(−y/θ) for y > 0, θ > 0.

(a) Show that W = 2Y/θ has probability density function g(w) = (1/2) e^(−w/2) for w > 0, that is, W has a χ²(2) distribution.

(b) Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) to prove that

U = (2/θ) Σ_{i=1}^{n} Yi ~ χ²(2n).

(c) Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.

(d) Let U = (2/θ) Σ_{i=1}^{25} Yi ~ χ²(50). Find a and b such that P(U ≤ a) = 0.05 = P(U ≥ b).
or n ≥ (1.645/0.02)² θ̂(1 − θ̂), since the variance of θ̂ is given by θ̂(1 − θ̂)/n. Since we don't know θ̂ and the right side of the inequality takes on its largest value for θ̂ = 0.5, we choose n such that

n ≥ (1.645/0.02)² (0.5)² = 1691.3.
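The bound can be reproduced numerically (Python with scipy assumed, for illustration):

```python
# Worst-case sample size for a 90% interval of half-width 0.02.
from math import ceil
from scipy.stats import norm

z = norm.ppf(0.95)                         # about 1.645
n_min = (z / 0.02) ** 2 * 0.5 * (1 - 0.5)  # theta(1 - theta) is largest at 0.5
print(round(n_min, 1), ceil(n_min))
```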
The respondents to the survey are people who heard about the survey through local
media, had access to the internet and then took the time to complete the survey. These
people are probably not representative of all citizens of Kitchener. This is an example of
sampling error.
To obtain a representative sample you would need to select a random sample of all
citizens living in Kitchener.
(ii) [2] Assume the model Y ~ Binomial(n, θ) where Y = number of people who responded "no" to the question "Do you support the statue proposal in concept, by which we mean do you like the idea even if you don't agree with all aspects of the proposal?" What does the parameter θ represent in this study?

The parameter θ represents the proportion of people who would respond "no" to the question in the study population (citizens of Kitchener).
(iii) [2] A point estimate of θ is 1920/2441 = 0.7866.

(iv) [2] An approximate 95% confidence interval for θ is

1920/2441 ± 1.96 √[(1920/2441)(1 − 1920/2441)/2441] = [0.7703, 0.8029].
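A numerical check of the point estimate and the approximate interval (Python with scipy assumed, for illustration):

```python
# Approximate 95% interval for the proportion answering "no" to the survey question.
from math import sqrt
from scipy.stats import norm

y, n = 1920, 2441
p_hat = y / n
half = norm.ppf(0.975) * sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - half, p_hat + half
print(round(p_hat, 4), round(lo, 4), round(hi, 4))
```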
(v) [2] By reference to the confidence interval, indicate what you know about the p-value for a test of the hypothesis H0: θ = 0.8.

Since θ = 0.8 is a value contained in the interval [0.7703, 0.8029], the p-value for testing H0: θ = 0.8 is greater than or equal to 0.05. (Note that since θ = 0.8 is very close to the upper endpoint of the interval, the p-value would be very close to 0.05.)
(c) [2] Suppose a Binomial experiment is conducted and the observed 95% confidence interval for θ is [0.1, 0.2]. This means (circle the letter for the correct answer):

A : The probability that θ lies in [0.1, 0.2] is 0.95.

B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ. (Correct)
2. [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised for conducting research. Twenty rats are selected at random and fed a special diet. The weight gains (in grams) from birth to age 3 months of the rats fed this diet are:

63.4  55.6  68.3  73.2  52.0  63.9  64.5  60.7  62.3  63.9
55.8  60.2  59.3  60.5  62.4  67.1  75.8  66.6  72.1  66.7

For these data Σ_{i=1}^{20} yi = 1273.8 and Σ_{i=1}^{20} (yi − ȳ)² = 665.718. The model

Yi ~ N(μ, σ²) = G(μ, σ),  i = 1, ..., 20 independently

is assumed where μ and σ are unknown parameters.
(a) [2] Comment on how reasonable the Gaussian model is for these data based on the qqplot below:

Since the points in the qqplot lie reasonably along a straight line the Gaussian model seems reasonable for these data.

[Figure: QQ plot of sample data versus standard Normal]

(b) [2] In the context of this study indicate what the parameters μ and σ represent.
The parameter μ represents the mean weight gain of the rats fed the special diet from birth to age 3 months in the study population (rats at the R.A.T. laboratory). The parameter σ represents the standard deviation of the weight gains of the rats fed the special diet from birth to age 3 months in the study population (rats at the R.A.T. laboratory).

(c) [2] The maximum likelihood estimate of μ is 1273.8/20 = 63.69 and the maximum likelihood estimate of σ is [(1/20)(665.718)]^(1/2) = (33.2859)^(1/2) = 5.7694. (You do not need to derive these estimates.)
(d) [3] Let T = (Ȳ − μ)/(S/√20) where S² = (1/19) Σ_{i=1}^{20} (Yi − Ȳ)². The distribution of T is t(19).
(e) [6] The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams. The p-value for testing the hypothesis H0: μ = 67 is between 0.02 and 0.05.

Here s = [(1/19)(665.718)]^(1/2) = 5.9193 and

d = |ȳ − μ0| / (s/√n) = |63.69 − 67| / (5.9193/√20) = 2.5008

so p-value = P(|T| ≥ 2.5008) = 2[1 − P(T ≤ 2.5008)] where T ~ t(19), which gives 0.02 ≤ p-value ≤ 0.05. Since the p-value is less than 0.05 there is evidence against R.A.T. Chow's claim based on the data.
(f) Let W = (1/σ²) Σ_{i=1}^{20} (Yi − Ȳ)². The distribution of W is χ²(19). Let a and b be such that P(W ≤ a) = 0.05 = P(W ≥ b). Then a = 10.117 and b = 30.144 and a 90% confidence interval for σ is

[(665.718/30.144)^(1/2), (665.718/10.117)^(1/2)] = [4.6994, 8.1118].
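The chi-squared quantiles and the resulting interval for σ can be checked numerically (Python with scipy assumed, for illustration):

```python
# 90% confidence interval for sigma from W = (1/sigma^2) * sum (Yi - Ybar)^2.
from math import sqrt
from scipy.stats import chi2

ss = 665.718                  # observed sum of squared deviations, n = 20
a = chi2.ppf(0.05, 19)        # about 10.117
b = chi2.ppf(0.95, 19)        # about 30.144
lo, hi = sqrt(ss / b), sqrt(ss / a)
print(round(a, 3), round(b, 3), round(lo, 4), round(hi, 4))
```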
(a) [3] Suppose Y has the Exponential(θ) distribution with probability density function f(y; θ) = (1/θ) e^(−y/θ) for y > 0, θ > 0. Show that W = 2Y/θ has the χ²(2) distribution, that is, probability density function g(w) = (1/2) e^(−w/2) for w > 0.

For w ≥ 0,

G(w) = P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ θw/2) = F(θw/2)

where F(y) = P(Y ≤ y). Therefore

g(w) = G′(w) = (θ/2) f(θw/2; θ) = (θ/2)(1/θ) e^(−w/2) = (1/2) e^(−w/2) for w ≥ 0

as required.
(b) [3] Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) and theorems that you have learned in class to prove that

U = (2/θ) Σ_{i=1}^{n} Yi ~ χ²(2n).

From (a), (2/θ)Yi ~ χ²(2), i = 1, 2, ..., n independently. Since the sum of independent Chi-squared random variables has a Chi-squared distribution with degrees of freedom equal to the sum of the degrees of freedom of the Chi-squared random variables in the sum, therefore

U = (2/θ) Σ_{i=1}^{n} Yi ~ χ²(2n)

as required.
(c) [4] Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.

Using Chi-squared tables find a and b such that P(U ≤ a) = (1 − p)/2 = P(U ≥ b) where U ~ χ²(2n). Since

p = P(a ≤ U ≤ b) = P(a ≤ (2/θ) Σ_{i=1}^{n} Yi ≤ b) = P( (2/b) Σ_{i=1}^{n} Yi ≤ θ ≤ (2/a) Σ_{i=1}^{n} Yi )

a 100p% confidence interval for θ is

[ 2 Σ_{i=1}^{n} yi / b , 2 Σ_{i=1}^{n} yi / a ].
(d) [3] Let U = (2/θ) Σ_{i=1}^{25} Yi ~ χ²(50) and let a and b be such that P(U ≤ a) = 0.05 = P(U ≥ b). Then a = 34.764 and b = 67.505.
(e) [3] Suppose y1 ; : : : ; y25 is an observed random sample from the Exponential ( ) distri25
P
bution with
yi = 560.
i=1
327
The maximum likelihood estimate for
to derive this estimate.)
A 90% condence interval for
is
based on U is
560=25 = 22:4
[16:5914; 32:2172]
2 (560) 2 (560)
= [16:5914; 32:2172]
;
67:505 34:764
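A numerical check of the estimate and interval (Python sketch using the tabulated χ²(50) quantiles):

```python
sum_y = 560.0           # observed sum of the 25 exponential observations
a, b = 34.764, 67.505   # chi-squared(50) 0.05 and 0.95 quantiles from tables

theta_hat = sum_y / 25                        # maximum likelihood estimate
interval = (2 * sum_y / b, 2 * sum_y / a)     # 90% CI for theta based on U
print(theta_hat, round(interval[0], 4), round(interval[1], 4))  # 22.4 16.5914 32.2172
```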
Cost of Advertising (x)    Number of Communities    Total Sales (y)
1.2                        5
2.4                        5
3.6                        5
4.8                        5

x̄ = 3, ȳ = 7.93, Sxx = Σ_{i=1}^{20} (xi − x̄)² = 36,
Syy = Σ_{i=1}^{20} (yi − ȳ)² = 125.282, Sxy = Σ_{i=1}^{20} (xi − x̄)(yi − ȳ) = 61.32.

The model Yi = α + βxi + Ri, where Ri ~ N(0, σ²) = G(0, σ), i = 1, …, 20 independently,
is assumed where α, β and σ are unknown constants.
[Figure: scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars)]

[Figure: Normal probability (qq) plot, Probability versus Data]
(e) [2] Would you conclude that an increase in the amount of money spent in advertising
causes an increase in the sales of the product in the following week? Explain your answer.

(f) [3] If the amount of dollars spent on advertising a product on local television in one
week is 5 thousand dollars, find a 90% prediction interval for the sales of the product (in
thousands of dollars) in the following week.

2: [16] A wind farm is a group of wind turbines in the same location used for production
of electric power. The number of wind farms is increasing as we try to move to more
renewable forms of energy. Wind turbines are most efficient if the mean windspeed is 16
km/h or greater.

The windspeed Y at a specific location is modeled using the Rayleigh distribution which
has probability density function

f(y; θ) = (2y/θ) e^{−y²/θ} for y > 0 (and 0 otherwise)
where θ > 0. The 14 observed windspeeds (in km/h) are:

30.0  13.3  41.9  25.6  39.6  34.5  14.7  9.9  13.6  24.2  5.1  41.4  20.5  22.2

For these data Σ_{i=1}^{14} yi = 336.5 and Σ_{i=1}^{14} yi² = 9984.03.

(c) [5] If the random variable Y has a Rayleigh distribution then E(Y) = √(πθ)/2. Thus a
mean of 20 km/h corresponds to θ = (40)²/π ≈ 509.3. The owner of Windy Hill claims
that the average windspeed at Windy Hill is 20 km/h. Test the hypothesis H0: θ = 509.3
using the given data and the likelihood ratio test statistic. Show all your work.

(d) [3] If Yi has a Rayleigh distribution with parameter θ, i = 1, …, n independently then

W = (2/θ) Σ_{i=1}^{n} Yi² ~ χ²(2n).
Subject: i       1      2      3      4      5      6      7      8      9      10
Drug A: ai      1.08   1.19   1.22   0.60   0.55   0.53   0.56   0.93   1.43   0.67
Drug B: bi      1.48   0.62   0.65   0.32   1.48   0.79   0.43   1.69   0.73   0.71
Difference:
yi = ai − bi   −0.40   0.57   0.57   0.28  −0.93  −0.26   0.13  −0.76   0.70  −0.04
For these data Σ_{i=1}^{10} yi = −0.14 and Σ_{i=1}^{10} (yi − ȳ)² = 2.90484.

The model Yi = μ + Ri, where Ri ~ N(0, σ²) = G(0, σ), i = 1, …, 10 independently,
is assumed where μ and σ are unknown constants.

(a) [2] Explain what the parameters μ and σ represent.

(b) [5] Test the hypothesis of no difference in the mean response for the two drugs, that
is, test H0: μ = 0. Show all your work.

(c) [3] Construct a 95% confidence interval for μ.

(d) [2] This experiment is a matched pairs experiment. Explain why this type of design
is better than a design in which 20 volunteers are randomly divided into two groups of 10
with one group receiving drug A and the other group receiving drug B.

(e) [2] Explain the importance of randomizing the order of the drugs, the fact that the
drugs were given in identical tablet form and the fact that the drugs were administered
one day apart.
4: [13] Exhaust emissions produced by motor vehicles are a major source of air pollution.
One of the major pollutants in vehicle exhaust is carbon monoxide (CO). An environmental
group interested in studying CO emissions for light-duty engines purchased 11 light-duty
engines from Manufacturer A and 12 light-duty engines from Manufacturer B. The amount
of CO emitted in grams per mile for each engine was measured. The data are given below:

Manufacturer A:  5.01  8.60  4.95  7.51  14.59  11.53  9.24  9.62  14.10  16.97  24.92
Manufacturer B:  16.67  6.42  5.21  15.13  3.95  4.12  14.30  9.98  6.10  7.04  5.38  25.53

For these data Σ_{j=1}^{11} y1j = 90.22, Σ_{j=1}^{11} (y1j − ȳ1)² = 166.9860,
Σ_{j=1}^{12} y2j = 136.65 and Σ_{j=1}^{12} (y2j − ȳ2)² = 218.7656.

The model Y1j = μ1 + R1j, where R1j ~ G(0, σ), j = 1, …, 11 independently, and
Y2j = μ2 + R2j, where R2j ~ G(0, σ), j = 1, …, 12 independently, is assumed.
(d) [2] What conclusions can the environmental group draw from this study? Justify
your answer.
5: [9] In a court case challenging an Oklahoma law that differentiated the ages at which
young men and women could buy 3.2% beer, the Supreme Court examined evidence from a
random roadside survey that measured information on age, gender, and drinking behaviour.
The table below gives the results for the drivers under 20 years of age.

                      Drank Alcohol in last 2 hours
Gender of Driver        Yes      No      Totals
Male                     77      404      481
Female                   16      122      138
Totals                   93      526      619
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates
university students' motivation, study habits, and attitudes toward university. At a
small university college 19 students are selected at random and given the SSHA test. Their
scores are:

10  14  10  15  11  15  12  15  13  16  13  16  13  17  14  18  14  20  14

Let yi = score of the i'th student, i = 1, …, 19. For these data

Σ_{i=1}^{19} yi = 270 and Σ_{i=1}^{19} yi² = 3956.

For these data calculate the mean, median, mode, sample variance, range, and interquartile
range.
7: A dataset consisting of six columns of data was collected by interviewing 100 students
on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or
          (iv) strongly agreed with the statement "The University of Waterloo is the best
          university in Ontario".
(a) For this dataset give an example of each of the following types of data:
discrete__________
continuous___________
categorical____________
binary_______________
ordinal______________
(b) Two ways to graphically represent categorical data are ____________ and
________________.
(c) A graphical way to examine the relationship between heights and weights is a
______________.
(d) If the sample correlation between heights and weights was 0.4 you would conclude_____________.
The least squares estimates are

β̂ = Sxy/Sxx = 61.32/36 = 1.703 and α̂ = ȳ − β̂x̄ = 7.93 − (1.703)(3) = 2.82.
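The least squares arithmetic can be confirmed in a few lines of Python (an added check):

```python
# Summary statistics from the advertising data
xbar, ybar = 3.0, 7.93
Sxx, Sxy = 36.0, 61.32

beta_hat = Sxy / Sxx                 # least squares slope estimate
alpha_hat = ybar - beta_hat * xbar   # least squares intercept estimate
print(round(beta_hat, 3), round(alpha_hat, 2))  # 1.703 2.82
```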
[Figure: scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars) with the fitted line y = 2.82 + 1.703x]
Looking at the scatterplot and the fitted line we notice that for x = 2.4, 4 of the 5 data
points lie above the fitted line while for x = 3.6, all 5 of the data points lie below the fitted
line. This suggests that the linear model might not be the best model for these data.
(c) [3]
Calculate the estimated residuals r̂i = yi − ŷi = yi − (α̂ + β̂xi), i = 1, …, 20 and order
the residuals from smallest to largest: r̂(1), …, r̂(n).
Calculate qi, i = 1, …, 20 where qi satisfies F(qi) = (i − 0.5)/20 and F is the N(0, 1)
cumulative distribution function. Plot the points (r̂(i), qi), i = 1, …, 20.

OR:
Calculate the estimated residuals r̂i = yi − ŷi = yi − (α̂ + β̂xi), i = 1, …, 20, order
the residuals from smallest to largest: r̂(1), …, r̂(n), and plot the ordered residuals
against the theoretical quantiles of the Normal distribution.

Since there is no obvious pattern of departure from a straight line we would conclude
that there is no evidence against the normality assumption Ri ~ N(0, σ²), i = 1, …, 20.
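The theoretical quantiles qi can be computed with Python's standard library; this sketch shows only the quantiles, since the individual residuals are not reproduced here:

```python
from statistics import NormalDist

n = 20
# q_i satisfies F(q_i) = (i - 0.5)/n, where F is the N(0,1) cdf
q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# The quantiles are symmetric about 0; the extremes are about -1.96 and 1.96
print(round(q[0], 2), round(q[-1], 2))  # -1.96 1.96
```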
(d) [5] To test the hypothesis of no relationship we test H0: β = 0. We use the
discrepancy measure

D = |β̃ − 0| / (S/√Sxx)

where

T = (β̃ − 0)/(S/√Sxx) ~ t(18) assuming H0: β = 0 is true

and

S² = (1/18) Σ_{i=1}^{20} (Yi − α̃ − β̃xi)².

Since β̂ = 1.703 and

s = [(Syy − β̂Sxy)/18]^{1/2} = [(125.282 − (1.703)(61.32))/18]^{1/2} = (1.15742)^{1/2} = 1.0758,

the observed value of D is

d = |1.703 − 0| / (1.0758/√36) = 9.50

and

p-value = P(D ≥ 9.50; H0) = P(|T| ≥ 9.50) where T ~ t(18)
        ≈ 0.

Therefore there is very strong evidence based on the data against the hypothesis of no
relationship between the amount of money spent in advertising a product on local television
in one week and the sales of the product in the following week.
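As a numerical check of s and the observed discrepancy (Python sketch, added):

```python
import math

# Summary statistics from the advertising data
Sxx, Sxy, Syy = 36.0, 61.32, 125.282
beta_hat = Sxy / Sxx

# Residual standard deviation with 18 = 20 - 2 degrees of freedom
s = math.sqrt((Syy - beta_hat * Sxy) / 18)
d = abs(beta_hat - 0) / (s / math.sqrt(Sxx))
print(round(s, 4), round(d, 2))  # 1.0758 9.5
```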
(e) [2] Since this study was an experimental study, since there was strong evidence
against H0: β = 0, and since the slope of the fitted line was β̂ = 1.703 > 0, the data
suggest that an increase in the amount of money spent advertising causes an increase in
the sales of the product in the following week. However we don't know if the 4 levels of
spending on advertising were applied in the 5 different communities using randomization.
If the levels of advertising were not randomly applied then the differences in the sales of the
product could be due to differences between the communities. For example, if the highest
(lowest) level was applied to the richest (poorest) communities you might expect to see the
same pattern of response as was observed.
(f) [3] From t tables P(T ≤ 1.73) = 0.95 where T ~ t(18). A 90% prediction interval
for the sales of the product (in thousands of dollars) in the following week if x = 5 is

2.82 + 1.703(5) ± (1.73)(1.0758)[1 + 1/20 + (5 − 3)²/36]^{1/2}
= 11.3367 ± 2.0055
= [9.33, 13.34].
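The prediction interval arithmetic can be verified as follows (Python check):

```python
import math

# Fitted line and residual sd from the earlier parts of this solution
alpha_hat, beta_hat, s = 2.82, 61.32 / 36, 1.0758
x, n, xbar, Sxx = 5, 20, 3, 36
t = 1.73                     # t(18) 0.95 quantile from tables

pred = alpha_hat + beta_hat * x
half = t * s * math.sqrt(1 + 1 / n + (x - xbar) ** 2 / Sxx)
print(round(pred - half, 2), round(pred + half, 2))  # 9.33 13.34
```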
2: [16]

(a) [4] The likelihood function is

L(θ) = Π_{i=1}^{n} (2yi/θ) exp(−yi²/θ) = (Π_{i=1}^{n} 2yi) θ^{−n} exp(−(1/θ) Σ_{i=1}^{n} yi²), θ > 0,

or, ignoring the constant Π_{i=1}^{n} 2yi,

L(θ) = θ^{−n} exp(−(1/θ) Σ_{i=1}^{n} yi²), θ > 0.

The log likelihood is

l(θ) = −n log θ − (1/θ) Σ_{i=1}^{n} yi², θ > 0,

so

l′(θ) = −n/θ + (1/θ²) Σ_{i=1}^{n} yi²

and l′(θ) = 0 if θ = (1/n) Σ_{i=1}^{n} yi². The maximum likelihood estimate of θ is

θ̂ = (1/n) Σ_{i=1}^{n} yi².

For these data θ̂ = 9984.03/14 = 713.145. The relative likelihood function is

R(θ) = L(θ)/L(θ̂) = (θ̂/θ)^{14} exp(14 − 9984.03/θ) = (713.145/θ)^{14} exp(14 − 9984.03/θ), θ > 0,

and the likelihood ratio test statistic Λ(θ0) = −2 log R(θ0) has approximately a χ²(1)
distribution if H0: θ = θ0 is true.
For these data the observed value of the likelihood ratio test statistic for H0: θ = 509.3
is

d = −2r(509.3) = −2[l(509.3) − l(713.145)]
  = −2[14 log(713.145/509.3) + 14 − 9984.03/509.3]
  = −2(−0.8904) = 1.7807

and

p-value = P(D ≥ 1.7807; H0) ≈ P(W ≥ 1.7807) where W ~ χ²(1)
        = 2[1 − P(Z ≤ √1.7807)] = 2[1 − P(Z ≤ 1.33)]
        = 2(1 − 0.9082) = 2(0.0918) = 0.1836.

Since the p-value ≈ 0.18 > 0.1, there is no evidence based on the data against
H0: θ = 509.3.
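A numerical check of the likelihood ratio statistic and its χ²(1) normal-approximation p-value (Python sketch; the tail area uses P(W ≥ d) = 2[1 − Φ(√d)]):

```python
import math
from statistics import NormalDist

n, ss = 14, 9984.03          # n and observed sum of y_i^2
theta0 = 509.3               # hypothesized value under H0

def loglik(theta):
    """Rayleigh log likelihood l(theta) = -n log(theta) - ss/theta."""
    return -n * math.log(theta) - ss / theta

theta_hat = ss / n                                   # MLE
d = -2 * (loglik(theta0) - loglik(theta_hat))        # LR statistic
p = 2 * (1 - NormalDist().cdf(math.sqrt(d)))         # chi-squared(1) tail area
print(round(d, 3), round(p, 3))  # 1.781 0.182
```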
(d) [3] From χ² tables

P(W ≤ 15.31) = 0.025 = P(W ≥ 44.46)

where W ~ χ²(28). Since

0.95 = P(15.31 ≤ (2/θ) Σ_{i=1}^{14} Yi² ≤ 44.46)
     = P((2/44.46) Σ_{i=1}^{14} Yi² ≤ θ ≤ (2/15.31) Σ_{i=1}^{14} Yi²),

a 95% confidence interval for θ is

[2(9984.03)/44.46, 2(9984.03)/15.31] = [449.12, 1304.25].

(e) [2] A 95% confidence interval for the mean windspeed √(πθ)/2 based on these data is

[√(π(449.12))/2, √(π(1304.25))/2] = [18.78, 32.01].
Since the values of this interval are all above 16, the data seem to suggest a mean windspeed
greater than 16 km/h. However we don't know how the data were collected. It would be
wise to determine how the data were collected before reaching a conclusion. Suppose that
Windy Hill is only windy at one particular time of the year and that the data were collected
only during the windy period. We would not want to make a decision based only on these
data.
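Both intervals can be reproduced numerically (Python sketch using the tabulated χ²(28) quantiles):

```python
import math

ss = 9984.03             # observed sum of y_i^2
a, b = 15.31, 44.46      # chi-squared(28) 0.025 and 0.975 quantiles from tables

theta_lo, theta_hi = 2 * ss / b, 2 * ss / a        # 95% CI for theta
mean_lo = math.sqrt(math.pi * theta_lo) / 2        # corresponding CI for E(Y)
mean_hi = math.sqrt(math.pi * theta_hi) / 2
print(round(theta_lo, 2), round(theta_hi, 2))  # 449.12 1304.25
print(round(mean_lo, 2), round(mean_hi, 2))    # 18.78 32.01
```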
3: [14]

(a) [2] The parameter μ represents the mean difference in antibiotic blood serum level
between drugs A and B in the study population.
The parameter σ represents the standard deviation of the differences in antibiotic blood
serum level between drugs A and B in the study population.
(b) [5] To test the hypothesis of no difference in the mean response for the two drugs,
that is, H0: μ = 0, we use the discrepancy measure

D = |Ȳ − 0| / (S/√10)

where

T = (Ȳ − 0)/(S/√10) ~ t(9) assuming H0: μ = 0 is true

and

S² = (1/9) Σ_{i=1}^{10} (Yi − Ȳ)².

Since ȳ = −0.14/10 = −0.014 and

s = (2.90484/9)^{1/2} = (0.32276)^{1/2} = 0.5681,

the observed value of D is

d = |−0.014 − 0| / (0.5681/√10) = 0.078

and

p-value = P(D ≥ 0.078; H0) = P(|T| ≥ 0.078) = 2[1 − P(T ≤ 0.078)] where T ~ t(9).

From t tables P(T ≤ 0) = 0.5 so

p-value ≈ 2(1 − 0.5) = 1.

Therefore there is no evidence based on the data against the hypothesis of no difference in
the mean response for the two drugs, that is, H0: μ = 0.
(c) [3] From t tables P(T ≤ 2.26) = 0.975 where T ~ t(9). A 95% confidence interval
for μ based on these data is

ȳ ± 2.26(s)/√10 = −0.014 ± 2.26(0.5681)/√10 = [−0.4200, 0.3920].
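The paired analysis can be reproduced directly from the ten differences (Python check, added):

```python
import math

# Differences y_i = a_i - b_i from the table
y = [-0.40, 0.57, 0.57, 0.28, -0.93, -0.26, 0.13, -0.76, 0.70, -0.04]
n = len(y)
ybar = sum(y) / n
s = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

d = abs(ybar - 0) / (s / math.sqrt(n))       # observed discrepancy for H0: mu = 0
ci = (ybar - 2.26 * s / math.sqrt(n), ybar + 2.26 * s / math.sqrt(n))
print(round(ybar, 3), round(s, 4), round(d, 3))   # -0.014 0.5681 0.078
print(round(ci[0], 2), round(ci[1], 2))           # -0.42 0.39
```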
(d) [2] Since this experimental study was conducted as a matched pairs study, an analysis
of the differences, yi = ai − bi, allows for a more precise comparison since differences between
the 10 pairs have been eliminated. That is, by analysing the differences we do not need to
worry that there may have been large differences in the responses between subjects due to
other variates such as age, general health, etc.

(e) [2] It is important to randomize the order of the drugs in case the order in which
the drugs are taken affects the outcome.
It is important to give the drugs in identical tablet form so the subject does not know
which drug he or she is taking since knowing which drug is being taken could affect the
outcome.
It is important that the drugs be administered one day apart to ensure that the effects
of one drug are gone before the second drug is given.
4: [13]

(a) [2] The parameter μ1 represents the mean amount of CO emitted by light-duty
engines produced by Manufacturer A.
The parameter μ2 represents the mean amount of CO emitted by light-duty engines
produced by Manufacturer B.
The parameter σ represents the standard deviation of the CO emissions from light-duty
engines produced by Manufacturers A and B.

(b) [4] From t tables P(T ≤ 2.83) = 0.995 where T ~ t(21). Since

s = [(166.9860 + 218.7656)/21]^{1/2} = (18.3691)^{1/2} = 4.2860,

a 99% confidence interval for μ1 − μ2 is

ȳ1 − ȳ2 ± 2.83 s (1/11 + 1/12)^{1/2} = [−8.2487, 1.8774].
(c) [5] To test the hypothesis of no difference in the mean amount of CO emitted for the
two manufacturers, that is, H0: μ1 = μ2, we use the discrepancy measure

D = |Ȳ1 − Ȳ2 − 0| / (S(1/11 + 1/12)^{1/2})

where

T = (Ȳ1 − Ȳ2 − 0)/(S(1/11 + 1/12)^{1/2}) ~ t(21) assuming H0: μ1 = μ2 is true

and

S² = (1/21)[Σ_{i=1}^{11} (Y1i − Ȳ1)² + Σ_{i=1}^{12} (Y2i − Ȳ2)²].

Since

ȳ1 = 90.22/11 = 8.2018, ȳ2 = 136.65/12 = 11.3875, ȳ1 − ȳ2 = −3.1857

and s = 4.2860, the observed value of D is

d = |−3.1857 − 0| / (4.2860 (1/11 + 1/12)^{1/2}) = 1.7806

and

p-value = P(D ≥ 1.7806; H0) = P(|T| ≥ 1.7806) = 2[1 − P(T ≤ 1.7806)] where T ~ t(21).

From t tables P(T ≤ 1.72) = 0.95 and P(T ≤ 2.08) = 0.975 so

2(1 − 0.975) = 0.05 ≤ p-value ≤ 2(1 − 0.95) = 0.1

and therefore there is weak evidence based on the data against the hypothesis of no difference in the mean amount of CO emitted for the two manufacturers, that is, H0: μ1 = μ2.
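A numerical check of the pooled two-sample calculation (Python sketch; printed values differ from the hand calculation only in rounding):

```python
import math

n1, n2 = 11, 12
ybar1, ybar2 = 90.22 / n1, 136.65 / n2
ss1, ss2 = 166.9860, 218.7656          # within-sample sums of squares

s = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))           # pooled sd with 21 df
d = abs(ybar1 - ybar2) / (s * math.sqrt(1 / n1 + 1 / n2))
print(round(s, 3), round(ybar1 - ybar2, 4), round(d, 3))  # 4.286 -3.1857 1.781
```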
(d) [2] Although there is weak evidence of a difference between the mean CO emissions
for the two manufacturers it is difficult to draw much of a conclusion. The sample sizes
n1 = 11 and n2 = 12 are small. We also don't know whether the engines were chosen
at random from the two manufacturers on the day, week, or month. In other words we
don't know if the samples are representative of all light-duty engines produced by these
manufacturers.
5: [9]
(a) [2] This is an observational study because no explanatory variates were manipulated
by the researcher.
(b) [5] Denote the frequencies as F1, F2, F3, F4 with observed values f1 = 77, f2 = 404,
f3 = 16 and f4 = 122. Denote the expected frequencies as E1, E2, E3, E4. If the hypothesis
of no relationship (independence) between the two variates, gender and whether or not the
driver drank alcohol in the last 2 hours, is true then the expected frequency for the outcome
"male and drank alcohol in the last 2 hours" for the given data is

e1 = (481)(93)/619 = 72.27.

The other expected frequencies e2, e3, e4 can be obtained by subtraction from the appropriate row or column total. The expected frequencies are given in brackets in the table
below.

                      Drank Alcohol in last 2 hours
Gender of Driver        Yes             No              Totals
Male                    77 (72.27)      404 (408.73)     481
Female                  16 (20.73)      122 (117.27)     138
Totals                  93              526              619

To test the hypothesis of no relationship we use the discrepancy measure (a random variable)

D = Σ_{i=1}^{4} (Fi − Ei)²/Ei

with observed value

d = Σ_{i=1}^{4} (fi − ei)²/ei
  = (77 − 72.27)²/72.27 + (404 − 408.73)²/408.73 + (16 − 20.73)²/20.73 + (122 − 117.27)²/117.27
  = 1.6366.
Since the expected frequencies are all greater than 5, D has approximately a χ²(1)
distribution. Thus

p-value = P(D ≥ d; H0) = P(D ≥ 1.6366; H0)
        = 2[1 − P(Z ≤ √1.6366)] = 2[1 − P(Z ≤ 1.28)]
        = 2(1 − 0.8997)
        = 0.2006.

Since the p-value = 0.2006 > 0.1 we would conclude that there is no evidence against the
hypothesis of no relationship between the two variates, gender and whether or not
the driver drank alcohol in the last 2 hours.
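The expected frequencies and discrepancy can be reproduced in Python (an added check; small rounding differences from the hand calculation are expected):

```python
from statistics import NormalDist

obs = [[77, 404], [16, 122]]                    # observed frequencies
row = [sum(r) for r in obs]                     # row totals [481, 138]
col = [sum(c) for c in zip(*obs)]               # column totals [93, 526]
total = sum(row)                                # 619

exp = [[r * c / total for c in col] for r in row]
d = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
        for i in range(2) for j in range(2))
p = 2 * (1 - NormalDist().cdf(d ** 0.5))        # chi-squared(1) tail area
print(round(exp[0][0], 2), round(d, 3), round(p, 3))  # 72.27 1.637 0.201
```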
(c) [2] Although, based on the data, there is no evidence against the hypothesis of no
relationship between the two variates, gender and whether or not the driver drank alcohol
in the last 2 hours, we cannot conclude there is no relationship since this is an observational
study. Whether a causal relationship exists or not cannot be determined by an observational
study alone. A decision to strike down the law based on these data alone is unwise.
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates university students' motivation, study habits, and attitudes toward university. At a
small university college 19 students are selected at random and given the SSHA test. Their
scores are:

10  14  10  15  11  15  12  15  13  16  13  16  13  17  14  18  14  20  14

For these data Σ_{i=1}^{19} yi = 270 and Σ_{i=1}^{19} yi² = 3956.

For these data calculate the mean, median, mode, sample variance, range, and interquartile
range.

mean = 14.21, median = 14, mode = 14, sample variance = 6.62,
range = 20 − 10 = 10, IQR = 16 − 13 = 3
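The summary statistics can be verified with Python's statistics module (an added check; quantiles with the default settings reproduces the quartiles used above):

```python
import statistics as st

scores = [10, 14, 10, 15, 11, 15, 12, 15, 13, 16,
          13, 16, 13, 17, 14, 18, 14, 20, 14]

mean = st.mean(scores)
med = st.median(scores)
mode = st.mode(scores)
var = st.variance(scores)                      # sample variance, n - 1 divisor
rng = max(scores) - min(scores)
q1, _, q3 = st.quantiles(scores)               # default n=4, exclusive method
print(round(mean, 2), med, mode, round(var, 2), rng, q3 - q1)  # 14.21 14 14 6.62 10 3.0
```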
7: A dataset consisting of six columns of data was collected by interviewing 100 students
on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or
          (iv) strongly agreed with the statement "The University of Waterloo is the best
          university in Ontario".

(a) For this dataset give an example of each of the following types of data:
discrete: number of courses failed
continuous: weight or age
categorical: faculty or sex
binary: sex
ordinal: degree of agreement with statement

(b) Two ways to graphically represent categorical data are pie charts and bar charts.

(c) A graphical way to examine the relationship between heights and weights is a
scatterplot.

(d) If the sample correlation between heights and weights was 0.4 you would conclude
that there is a positive linear relationship between heights and weights.
APPENDIX C: DATA
Here we list the data for Example 1.5.2. In the file ch1example152.txt, there are three
columns labelled hour, machine and volume. The data are (H = hour, M = machine, V = volume):
H  M  V        H  M  V        H  M  V        H  M  V
1  1  357.8    11 1  357      21 1  356.5    31 1  357.7
1  2  358.7    11 2  359.6    21 2  357.3    31 2  357
2  1  356.6    12 1  357.1    22 1  356.9    32 1  356.3
2  2  358.5    12 2  357.6    22 2  356.7    32 2  357.8
3  1  357.1    13 1  356.3    23 1  357.5    33 1  356.6
3  2  357.9    13 2  358.1    23 2  356.9    33 2  357.5
4  1  357.3    14 1  356.3    24 1  356.9    34 1  356.7
4  2  358.2    14 2  356.9    24 2  357.1    34 2  356.5
5  1  356.7    15 1  356      25 1  356.9    35 1  356.8
5  2  358      15 2  356.4    25 2  356.4    35 2  357.6
6  1  356.8    16 1  357      26 1  356.4    36 1  356.6
6  2  359.1    16 2  357.5    26 2  357.5    36 2  357.2
7  1  357      17 1  357.5    27 1  356.5    37 1  356.6
7  2  357.5    17 2  357.2    27 2  357      37 2  357.6
8  1  356      18 1  355.9    28 1  356.5    38 1  356.7
8  2  356.4    18 2  357.1    28 2  358.1    38 2  356.9
9  1  355.9    19 1  356.5    29 1  357.6    39 1  356.8
9  2  357.9    19 2  358.2    29 2  357.6    39 2  357.2
10 1  357.8    20 1  355.8    30 1  357.5    40 1  356.1
10 2  358.5    20 2  359      30 2  356.4    40 2  356.4
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
1.76
1.77
1.91
1.80
1.81
1.93
1.79
1.66
1.66
1.82
1.76
1.79
1.77
1.72
1.73
1.81
1.77
1.56
1.71
1.80
1.68
1.75
1.81
1.69
1.74
1.73
1.74
1.80
1.75
1.81
1.72
1.74
1.74
1.78
1.75
1.68
1.78
1.68
63.81
89.60
88.65
74.84
97.30
106.90
108.94
74.68
92.31
92.08
93.86
88.11
80.52
75.14
64.95
89.11
96.49
53.78
76.61
82.62
80.44
93.10
71.09
71.12
80.84
75.12
96.88
73.22
81.77
83.87
55.91
68.73
75.39
94.10
80.54
70.84
100.76
51.65
20.6
28.6
24.3
23.1
29.7
28.7
34.0
27.1
33.5
27.8
30.3
27.5
25.7
25.4
21.7
27.2
30.8
22.1
26.2
25.5
28.5
30.4
21.7
24.9
26.7
25.1
32.0
22.6
26.7
25.6
18.9
22.7
24.9
29.7
26.3
25.1
31.8
18.3
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
1.60
1.60
1.51
1.60
1.67
1.55
1.61
1.56
1.60
1.58
1.56
1.67
1.64
1.67
1.53
1.60
1.67
1.79
1.54
1.65
1.61
1.76
1.52
1.58
1.69
1.57
1.64
1.70
1.60
1.59
1.64
1.57
1.59
1.53
1.64
1.73
1.57
1.61
59.90
48.38
77.98
54.53
79.20
87.45
53.66
64.00
67.58
70.65
51.59
56.89
54.60
63.31
52.67
48.64
69.72
65.04
67.35
65.34
80.87
85.80
87.56
59.16
94.82
60.39
63.47
62.13
63.49
64.21
72.89
74.19
82.67
59.93
79.61
69.14
81.59
63.51
23.4
18.9
34.2
21.3
28.4
36.4
20.7
26.3
26.4
28.3
21.2
20.4
20.3
22.7
22.5
19.0
25.0
20.3
28.4
24.0
31.2
27.7
37.9
23.7
33.2
24.5
23.6
21.5
24.8
25.4
27.1
30.1
32.7
25.6
29.6
23.1
33.1
24.5
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
1.75
1.71
1.73
1.71
1.87
1.69
1.73
1.71
1.86
1.73
1.64
1.59
1.78
1.73
1.76
1.80
1.71
1.69
1.80
1.78
1.73
1.71
1.78
1.74
1.69
1.82
1.63
1.74
1.74
1.69
1.79
1.79
1.86
1.70
1.87
1.65
1.72
1.74
1.69
1.57
1.74
84.83
70.47
112.23
72.23
105.26
69.97
102.36
81.58
80.61
76.62
71.27
60.17
92.20
78.41
90.76
92.34
68.72
76.54
90.72
70.66
76.32
88.02
87.76
84.77
67.40
83.14
69.08
72.36
69.03
81.68
89.39
75.30
90.30
102.59
94.42
89.03
78.40
93.55
68.26
53.73
91.13
27.7
24.1
37.5
24.7
30.1
24.5
34.2
27.9
23.3
25.6
26.5
23.8
29.1
26.2
29.3
28.5
23.5
26.8
28.0
22.3
25.5
30.1
27.7
28.0
23.6
25.1
26.0
23.9
22.8
28.6
27.9
23.5
26.1
35.5
27.0
32.7
26.5
30.9
23.9
21.8
30.1
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
1.68
1.57
1.65
1.60
1.62
1.64
1.54
1.58
1.70
1.56
1.68
1.53
1.58
1.59
1.64
1.63
1.66
1.53
1.66
1.65
1.67
1.60
1.71
1.61
1.65
1.60
1.71
1.58
1.61
1.59
1.57
1.64
1.72
1.59
1.64
1.64
1.58
1.53
1.62
1.62
1.61
82.13
58.91
70.51
71.42
59.57
57.56
61.90
84.63
66.76
75.68
72.25
56.88
66.90
50.06
69.66
87.15
76.61
62.03
88.73
85.21
81.99
77.82
84.21
69.99
96.92
77.57
78.37
77.39
64.28
85.96
64.58
76.92
71.89
58.90
86.07
78.00
66.90
61.10
59.05
83.72
76.99
29.1
23.9
25.9
27.9
22.7
21.4
26.1
33.9
23.1
31.1
25.6
24.3
26.8
19.8
25.9
32.8
27.8
26.5
32.2
31.3
29.4
30.4
28.8
27.0
35.6
30.3
26.8
31.0
24.8
34.0
26.2
28.6
24.3
23.3
32.0
29.0
26.8
26.1
22.5
31.9
29.7
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
1.80
1.77
1.71
1.78
1.56
1.74
1.79
1.85
1.64
1.83
1.70
1.72
1.72
1.70
1.64
1.75
1.68
1.71
1.67
1.80
1.77
1.72
1.66
1.78
1.60
1.72
1.71
1.79
1.74
1.74
1.78
1.77
1.74
1.84
1.82
1.83
1.74
1.74
1.89
1.81
1.64
89.10
87.41
66.38
106.46
66.92
79.93
92.28
79.40
70.20
116.88
78.32
102.66
78.40
83.81
67.51
69.83
77.62
95.03
74.18
92.99
78.64
79.29
72.75
83.65
61.44
65.97
78.37
74.01
69.33
88.10
89.35
90.54
91.43
94.80
86.12
75.35
70.85
98.70
104.66
91.08
94.67
27.5
27.9
22.7
33.6
27.5
26.4
28.8
23.2
26.1
34.9
27.1
34.7
26.5
29.0
25.1
22.8
27.5
32.5
26.6
28.7
25.1
26.8
26.4
26.4
24.0
22.3
26.8
23.1
22.9
29.1
28.2
28.9
30.2
28.0
26.0
22.5
23.4
32.6
29.3
27.8
35.2
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
1.57
1.72
1.61
1.67
1.67
1.60
1.66
1.58
1.71
1.64
1.59
1.61
1.56
1.56
1.54
1.52
1.57
1.67
1.57
1.57
1.68
1.72
1.68
1.77
1.65
1.41
1.54
1.67
1.72
1.72
1.61
1.52
1.61
1.55
1.57
1.51
1.69
1.69
1.58
1.48
1.66
61.62
107.09
45.36
89.80
77.25
82.94
82.12
74.64
79.54
61.32
60.17
95.91
62.79
48.19
69.73
89.64
57.68
75.02
40.42
53.00
101.61
110.94
65.48
73.00
71.60
46.72
73.99
79.48
60.06
63.01
81.65
85.95
54.95
78.56
64.58
76.84
81.11
78.54
72.65
65.49
60.07
25.0
36.2
17.5
32.2
27.7
32.4
29.8
29.9
27.2
22.8
23.8
37
25.8
19.8
29.4
38.8
23.4
26.9
16.4
21.5
36.0
37.5
23.2
23.3
26.3
23.5
31.2
28.5
20.3
21.3
31.5
37.2
21.2
32.7
26.2
33.7
28.4
27.5
29.1
29.9
21.8
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
1.77
1.73
1.82
1.73
1.77
1.82
1.80
1.77
1.80
1.70
1.70
1.77
1.77
1.62
1.74
1.68
1.64
1.75
1.66
1.86
1.72
1.69
1.72
1.77
1.66
1.78
1.82
1.84
1.75
1.75
80.20
73.92
84.80
90.39
74.25
107.32
80.03
105.58
110.48
93.64
68.49
77.70
97.12
70.86
82.96
72.25
73.16
92.49
66.69
106.21
88.75
73.97
81.95
82.40
85.42
76.04
78.50
98.86
85.44
65.23
25.6
24.7
25.6
30.2
23.7
32.4
24.7
33.7
34.1
32.4
23.7
24.8
31.0
27.0
27.4
25.6
27.2
30.2
24.2
30.7
30.0
25.9
27.7
26.3
31.0
24.0
23.7
29.2
27.9
21.3
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
1.47
1.63
1.71
1.59
1.56
1.62
1.53
1.70
1.60
1.52
1.61
1.58
1.71
1.58
1.65
1.65
1.70
1.70
1.66
1.67
1.64
1.68
1.54
1.58
1.68
1.64
1.65
1.66
1.60
1.65
61.37
71.20
66.38
70.79
73.49
70.07
61.57
74.27
45.06
67.93
53.66
64.66
66.67
72.65
79.22
74.32
85.83
67.63
77.98
85.90
67.51
60.96
64.03
61.41
75.64
64.82
59.62
76.05
61.70
76.50
28.4
26.8
22.7
28.0
30.2
26.7
26.3
25.7
17.6
29.4
20.7
25.9
22.8
29.1
29.1
27.3
29.7
23.4
28.3
30.8
25.1
21.6
27.0
24.6
26.8
24.1
21.9
27.6
24.1
28.1
44.6
31.7
34.7
79
90.1
140.5
64.9
52.7
23.4
31.7
15.8
11.7
78.7
26.5
38.8
49.3
117
100.8
94.9
53.1
19.3
37.7
94.1
9
58.7
88.7
29.5
21.8
125.4
75.1
70.3
18.8
61.2
43.9
50.5
90.4
61.8
26.4
50.2
59.7
21.1
108.4
44.8
61.2
67.3
18.2
22
41.4
28.1
17.5
73.9
24.2
37.6
19.2
68.5
21.4
110.4
31.9
32.8
38.1
27.2
43
40.3
138
14.5
16.3
71.1
62.3
33.1
85.1
96.5
29.5
54.3
69.9
38.3
14.5
53.5
2.6
72.7
36.9
59.5
48.2
40.4
10.9
42.6
42.5
74.9
113.4
102.3
30.6
70.2
13.7
29.6
36.1
30.7
36.3
53.4
17.4
39.9
71.8
44.3
25.3
82.3
31.5
38
40.1
115
6.1
10.1
100.9
19.3
25.5
6.5
167.2
88.4
39.3
47.6
14.2
169.3
90.3
26.5
80
23.4
5.8
8.3
20
66.4
31
21.6
31.2
136.3
108.2
48
26.9
32.8
27.6
103.2
9.2
35.5
42.3
36.3
11.5
0.9
32
47.2
18.8
49.5
40
8.3
44.4
10.6
28.1
59.3
44.5
43.4
17.8
44.5
121.8
8.8
45.1
66.2
27.1
11.1
25.4
46.1
42.3
55
24.2
74.5
18.7
33.6
61.6
53.5
105.1
55.8
Times (in minutes) between 300 eruptions of the Old Faithful geyser
between 1/08/85 and 15/08/85
78
71
87
61
49
87
92
80
89
80 71 57 80 75 77 60 86 77 56 81 50 89 54 90 73 60 83 65 82 84 54 85 58 79 57 88 68
74 85 75 65 76 58 91 50 87 48 93 54 86 53 78 52 83 60 87 49 80 60 92 43 89 60 84 69
108 50 77 57 80 61 82 48 81 73 62 79 54 80 73 81 62 81 71 79 81 74 59 81 66 87 53 80
51 82 58 81 49 92 50 88 62 93 56 89 51 79 58 82 52 88 52 78 69 75 77 53 80 55 87 53
93 54 76 80 81 59 86 78 71 77 76 94 75 50 83 82 72 77 75 65 79 72 78 77 79 75 78 64
88 54 85 51 96 50 80 78 81 72 75 78 87 69 55 83 49 82 57 84 57 84 73 78 57 79 57 90
78 52 98 48 78 79 65 84 50 83 60 80 50 88 50 84 74 76 65 89 49 88 51 78 85 65 75 77
68 87 61 81 55 93 53 84 70 73 93 50 87 77 74 72 82 74 80 49 91 53 86 49 79 89 87 76
89 45 93 72 71 54 79 74 65 78 57 87 72 84 47 84 57 87 68 86 75 73 53 82 93 77 54 96
63 84 76 62 83 50 85 78 78 81 78 76 74 81 66 84 48 93 47 87 51 78 54 87 52 85 58 88
76
74
50
85
80
62
69
59
48
79
Skinfold BodyDensity    Skinfold BodyDensity    Skinfold BodyDensity    Skinfold BodyDensity
1.6841   1.0613         1.9200   1.0338         1.5324   1.0696         2.0755   1.0355
1.9639   1.0478         1.6736   1.0560         1.7035   1.0449         1.4351   1.0693
1.0803   1.0854         1.7914   1.0487         1.8040   1.0411         1.7295   1.0518
1.7541   1.0629         1.7249   1.0496         1.8075   1.0426         1.5265   1.0837
1.6368   1.0652         1.5025   1.0824         1.3815   1.0715         1.7599   1.0328
1.2857   1.0813         1.6314   1.0526         1.5847   1.0602         1.4029   1.0933
1.4744   1.0683         1.3980   1.0707         1.3059   1.0807         1.2653   1.0860
1.6420   1.0575         1.7598   1.0459         1.3276   1.0536         1.2609   1.0919
2.3406   1.0126         1.3203   1.0697         1.5665   1.0602         1.6734   1.0433
2.1659   1.0264         1.3372   1.0770         1.8989   1.0536         1.5297   1.0614
1.2766   1.0829         1.3932   1.0727         1.4018   1.0655         1.5257   1.0643
2.2232   1.0296         0.9323   1.1171         1.6482   1.0668         1.8744   1.0482
1.7246   1.0670         1.8785   1.0423         1.5193   1.0700         1.6310   1.0459
1.5544   1.0688         1.6382   1.0506         1.8092   1.0485         1.6107   1.0653
1.7223   1.0525         1.4050   1.0878         1.3329   1.0804         1.9108   1.0321
1.5237   1.0721         1.8638   1.0557         1.5750   1.0503         1.3943   1.0755
1.5412   1.0672         1.1985   1.0854         1.6873   1.0557         1.7184   1.0600
1.8896   1.0350         1.5459   1.0527         1.8056   1.0625         1.7483   1.0554
1.8722   1.0528         1.5159   1.0635         1.9014   1.0438         1.5154   1.0765
1.8740   1.0473         1.6369   1.0583         1.5866   1.0632         1.6146   1.0696
1.7130   1.0560         1.6355   1.0621         1.2460   1.0782         1.3163   1.0744
1.3073   1.0848         1.3813   1.0736         1.4077   1.0739         1.3202   1.0818
1.7229   1.0564         1.5615   1.0682         1.3388   1.0805         1.5906   1.0546