CHAPTER 1
By William A. Prothero, Jr.
Kinds of Data
A measurement can come in many forms. It may be the color of a rock, the number
that comes up on a die, or a measurement from an instrument.
We define several types of measurement scales.
nominal : classification into named categories with no inherent order. Rock type (granite, basalt, schist) is a nominal measurement.
ordinal : categories that can be ranked, but with no fixed interval between ranks, such as Mohs hardness.
counting: whole-number counts of occurrences, such as the number of earthquakes recorded in a year.
interval: a numerical scale with equal intervals between values but an arbitrary zero, such as temperature in degrees Celsius.
ratio: The ratio scale is the same as the interval scale, but it has a true zero. Length,
mass, velocity and time are examples of measurements made on a ratio scale.
angular : directional measurements, in degrees or radians, which wrap around so that 360 degrees is the same direction as 0 degrees.
Each of the above data types may require variations in plotting strategies. The following
sections show how to construct histograms for these data types and later chapters show how to
plot these data when more than a single variable is associated with the data.
Version: December 19, 2001, University of California
University of California, 2001
Measurement Errors
If data were error free, it would not be necessary to read the remainder of this text.
Errors come from many sources. In fact, they are physically required through the Heisenberg
Uncertainty Principle, which states that there is an inherent uncertainty in any measurement
that can be made in a finite length of time. On a practical level, errors occur because the
instruments that we use have noise and because of naturally occurring variations in the earth.
For example, suppose you are measuring the composition of rocks sampled from a particular
region. You would expect composition to vary because of (hopefully) small variations in the
history of the rock, variations in chemical composition of the source, and varying contamination
from other sources (crustal rock contamination of igneous intrusive rocks, weathering, leaching,
etc). In seismology, the earthquake signals will vary from site to site because of varying surface
soil conditions, scattering of the seismic waves on the way from the source, and instrument errors.
However, one person's noise may become another person's signal. If the problem is to
determine the magnitude of the quake, the variations in signal due to scattering and surface
structure variations will be noise. But, if the problem is to study scattering or site response,
the variations due to these effects are signal to be studied and explained.
Accuracy and Precision:
Accuracy is the closeness of a measurement to the "true" value of the quantity being
measured. Precision is the repeatability of a measurement. If our measurements are very
precise, then all our values will cluster about the same value. To make the distinction between
these two terms clear, consider the data plotted in Figure 1.1. Five measurements of the
concentration of chemical X in a given water sample are made with each of four different
instruments. The "true" concentration of the sample is 50 mg/l.
Figure 1.1 Plot of concentration of chemical X (mg/l). The correct value of the concentration is 50 mg/l. A shows
a precise and accurate measurement, B is accurate, but not precise, C is precise, but not accurate, and D is
neither precise nor accurate.
Instrument A is both precise and accurate. Instrument B is accurate but not precise.
Instrument C is precise but not accurate. Instrument D is neither precise nor accurate.
Bias:
Bias will also be discussed in more detail in Chapter 7. Cases C and D in Figure 1.1
(above) demonstrate bias in the data. In this case, the bias is caused by an inaccurate
measurement device. A good example is when you measure your weight on the bathroom
scale. You may consistently get the same weight, but if the zero of the scale is not set properly,
the result will consistently be high (or low), or biased.
Significant figures and rounding:
Another important concern with respect to data measurement is the correct use of
significant figures. This has become a problem as hand calculators have come into universal
use. For example, suppose you measure the length of a fossil with a ruler and find it to be 5 cm.
Now suppose that you decide to divide that length by 3, for whatever reason. The answer is
1.666666666.... Obviously, since the ruler measurement is probably no more accurate than 0.1
cm, there is no reason to carry all of the sixes after the decimal point. Significant figures are the
accurate digits, not counting leading zeros, in a number. When the number of digits on the right
hand side of the decimal point of a number is reduced, that is called rounding off. If the
portion that is being eliminated is greater than 5, then you round up, but if it is less than 5, you
round down. So, 5.667 would round to 5.67, 5.7, or 6, while 5.462 would round to 5.46,
5.5, or 5. You set your own consistent convention for when the truncated digit is
exactly 5. Generally, it is rounded up, but sometimes it is alternately rounded up, then down.
Some conventions and rules exist regarding the number of significant figures to carry in your
answer. When a number is put into a formula, the answer need have no more precision than the
original number. Precision is also implied by how many digits are written. Writing 16.0
implies 16.0 ± 0.05, so that the number is known to within 0.1 accuracy, whereas writing
'16.000' would indicate that the number is known to within 0.001 accuracy. The number
41.653 has 5 significant figures, 32.0 has 3, 0.0005 has 1, and so forth.
In calculations involving addition and subtraction, the final result should have no more
significant figures after the decimal point than the number with the least number of significant
figures after the decimal point used in the calculation. For example, 6.03 + 7.2 = 13.2. In
calculations involving multiplication and division, the final result has no more significant figures
than the number with the least number of significant figures used in the calculation. For example,
1.53 × 10¹ * 7.21 = 1.10 × 10². Note that it may be clearer if you use scientific notation in
these calculations. Consider the number 1,000; 1.000 × 10³ has 4 significant figures whereas 1
× 10³ only has 1. If your calculation involves multiple operations, it is best to carry additional
significant figures until the final result so round-off errors don't accumulate.
It is extremely important to maintain precision when performing repetitive numeric
operations. Keep as much precision as possible during intermediate calculations, but
show the answer with the correct number of significant figures.
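Rounding to a fixed number of significant figures is easy to automate. The sketch below is illustrative Python (this text itself works in Excel); the helper name round_sig is a made-up name for this example. Note that Python's built-in round() rounds exact halves to the nearest even digit, which is one of the tie-breaking conventions mentioned above.

```python
import math

def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 0 for 5.667, 1 for 41.653
    exponent = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - exponent)

print(round_sig(5.667, 3))  # 5.67
print(round_sig(5.462, 2))  # 5.5
print(round_sig(5 / 3, 2))  # 1.7  (the ruler example: no trailing sixes)
```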
Data distributions
It is very difficult to extract meaning from a large compilation of numbers, but graphical
methods help us extract meaning from data because we see it visually. When the data values
consist of single numbers, such as porosity, density, composition, amplitude, etc., the histogram
is most convenient. Data are divided into ranges of values, called classes, and the number of
data points within each class are then plotted as a bar chart. This bar chart represents the data
distribution. The shape of the distribution can tell us about errors in the data and underlying
processes that influence it.
A histogram can be described as a plot of measurement values versus frequency of
occurrence and for that reason, histograms are sometimes referred to as frequency diagrams.
Some examples of histograms are shown in Figure 1.2(a-f).
The histogram in Figure 1.2 (b) is called a cumulative histogram or a cumulative
frequency diagram. Note that in this figure, the values for weight % always increase as φ
increases. The weight % at any point on this graph is the weight % for all values equal to or
less than the value of φ at that point. (φ is a measure of grain size equal to -log₂(grain diameter in
mm)). The histogram in Figure 1.2 (d) is called a circular histogram and is a more meaningful
way of displaying directional data than a simple x,y histogram.
Figure 1.2a and b: Grain size plots (weight % versus φ). The plot on the right is the cumulative histogram, which is just the
integral of the curve on the left.
[Figure 1.2 (continued): histograms of % map area for igneous, metamorphic, and sedimentary rocks (nominal data), and of % of the Earth's surface by altitude interval in km (interval data).]
Figure 1.2f: Example of a histogram of interval data (# students versus score range, 31-40 through 91-100). Each bar is the number of students achieving scores
between specific values.
Figure 1.3 Symmetric, bimodal, and skewed distributions. Here there is sufficient data in the sample so that
the histogram bars follow a relatively smooth curve.
The mean is the sum of all measurements divided by the total number of measurements. The
mean of N data values is:

m = (1/N) Σ_{i=1}^{N} x_i

This value is also referred to as the arithmetic mean. There is also a geometric mean,
defined as:

g = (x_1 x_2 x_3 ..... x_N)^(1/N)
The formula for the sample mean, m, given above, uses the Greek summation symbol, Σ.
It is interesting to note that the logarithm of the geometric mean is equal to the
arithmetic mean of the logs of the individual numbers: log[(abcd)^(1/4)] = {log(a) + log(b) +
log(c) + log(d)}/4.
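Both means, and the logarithm identity, can be checked with a few lines of Python (illustrative only; the values in data are made up for this example):

```python
import math

data = [2.0, 8.0, 4.0, 16.0]
N = len(data)

# Arithmetic mean: m = (1/N) * sum of the x_i
m = sum(data) / N

# Geometric mean: g = (x_1 * x_2 * ... * x_N) ** (1/N)
g = math.prod(data) ** (1 / N)

# The log of the geometric mean equals the arithmetic mean of the logs
mean_of_logs = sum(math.log10(x) for x in data) / N

print(m)                                          # 7.5
print(round(g, 6))                                # 5.656854
print(math.isclose(math.log10(g), mean_of_logs))  # True
```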
median: the value such that one-half of the measured values lie above it and one-half lie
below it.
For a symmetrical distribution, the mean, mode and median values are the same.
[Figure: symmetric and skewed distributions with the positions of the mode, median, and mean marked.]
The dispersion or variation of a data distribution can be described by the variance and
standard deviation. The standard deviation is defined below.
sample variance = (1/N) Σ_{i=1}^{N} (x_i − m)²

unbiased sample variance = (1/(N−1)) Σ_{i=1}^{N} (x_i − m)²

s_bx = sqrt[ (1/N) Σ_{i=1}^{N} (x_i − m)² ]      (biased standard deviation)

s_x = sqrt[ (1/(N−1)) Σ_{i=1}^{N} (x_i − m)² ]   (unbiased standard deviation)
where the mean, m, was defined previously. The larger the standard deviation, the larger the
spread of values about the mean. The variance is the square of the standard deviation. The
meaning of "unbiased", used above, will be discussed in Chapter 5.
Figure 1.5. This figure shows how the standard deviation is a measure of the spread of values about the
mean.
Symbols to be used throughout this book:
ith data value: x_i
number of data in the experiment: N
mean of data values: m, or x̄
unbiased standard deviation of the data: s
variance of the data: s²
A data distribution can also be described in terms of percentiles. The median value is
the 50th percentile. The value which 75% of the measured values lie below is the 75th
percentile. The value which 25% of the measured values lie below is the 25th percentile, and so
forth. The 25th, 50th and 75th percentiles are also known as the quartiles. Given the range of
data values and the quartile values for a distribution, we can tell if the distribution is symmetrical
or highly skewed.
To illustrate the definitions of the terms defined above, we consider the set of
measurements shown in Table 1.1.
 5.0   5.8   6.1   6.3   6.9
 7.4   7.5   7.5   7.6   7.8
 8.3   8.3   8.8   8.8   8.9
 9.0   9.1   9.2   9.4   9.4
 9.4   9.4   9.5   9.6  10.0
10.0  10.3  10.3  10.5  10.5
10.6  10.7  10.8  10.8  10.9
11.0  11.4  11.6  11.9  11.9
12.2  12.3  12.4  12.8  12.8
13.1  13.4  13.5  14.0  14.1
Table 1.1. Length of fossil A (cm)
There are 50 data points in this data set, ranging in value from 5.0 cm to 14.1 cm. The
mean, median and mode are 10.0 cm, 10.0 cm and 9.4 cm, respectively. The standard
deviation is 2.2 cm. The 25th and 75th percentiles are 8.8 cm and 11.9 cm, respectively. See
if you can find the median and mode by inspection.
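These summary statistics can be checked with a short script. The sketch below uses Python's statistics module on the Table 1.1 values (Python is not part of this text, which works in Excel; the code is purely illustrative):

```python
import statistics

# The 50 fossil lengths (cm) from Table 1.1
lengths = [
    5.0, 5.8, 6.1, 6.3, 6.9, 7.4, 7.5, 7.5, 7.6, 7.8,
    8.3, 8.3, 8.8, 8.8, 8.9, 9.0, 9.1, 9.2, 9.4, 9.4,
    9.4, 9.4, 9.5, 9.6, 10.0, 10.0, 10.3, 10.3, 10.5, 10.5,
    10.6, 10.7, 10.8, 10.8, 10.9, 11.0, 11.4, 11.6, 11.9, 11.9,
    12.2, 12.3, 12.4, 12.8, 12.8, 13.1, 13.4, 13.5, 14.0, 14.1,
]

print(round(statistics.mean(lengths), 1))   # 10.0
print(statistics.median(lengths))           # 10.0
print(statistics.mode(lengths))             # 9.4
print(round(statistics.stdev(lengths), 1))  # 2.2  (unbiased, N-1 form)
```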
1. Number:
Bar Height = number in class
This method may be used for any kind of data, but it is the only way that nominal data can be
plotted.
2. Frequency:
Bar Height = (number in class) / (width of class)
This method may only be used for interval, ratio, and a modified version of angular data. It
requires that the data can be expressed on an interval number scale (it also works for integers).
This scale is most common when a comparison with expected values is wanted. The number
of values within the class is then equal to the area of the bar (height*width). This normalization
has the advantage that the width of the bars may vary without the undesirable effect that wider
bars also get taller. Figure 1.6 demonstrates the appearance of the plot of a dice-tossing
experiment when bar heights are calculated according to number in class and according to
number in class/class width, when faces "5" and "6" are combined into a single class.
Dice tossing experiment: A die is tossed N times. The number of times each face is expected to come
up is N/6, since there are six faces and it is equally likely that any face comes up. When some number of
tosses are made, the number for each face will normally NOT be N/6. This is due to a natural
randomness that will be discussed further in later chapters.
[Figure 1.6 panels: # of times a die number comes up for 60 tosses, expected vs. observed, with faces 5 and 6 combined.]
Figure 1.6 Comparison of two methods of plotting a histogram when it is modified so that the numbers 5
and 6 are combined into a single class. The plot on the left shows the effect of simply adding the 5 and
6 values into a single bar. The plot on the right shows what happens when the bar height is the number
of 5 and 6 occurrences divided by the number of faces included in the bar, or class interval.
3. Probability:
Bar Height = (number in class) / [(width of class) × (number of data)]
Here, the bar height for option 2 is divided by the total number of data, N. In this case, the
histogram bars can be directly compared to the "probability density distribution", which will be
introduced in Chapter 5. This has the advantage, for the dice toss experiment, of allowing us to
plot any number of dice tosses without resetting the maximum Y value on the plot and makes it
easy to compare actual results to "expected" results.
Practice: It is important that you know how to process data by hand before entrusting it to the
computer. This exercise gives you some practice in this. The problem is to plot a histogram of
the data in the table below. You decide to sort the data into 5 equal width classes divided
between 0 and 10.
First, enter the upper and lower boundaries of each class into the table below. There should be
5 equally spaced intervals with the upper boundary of class n equal to the lower boundary of
class n+1.
Class #   Lower Boundary   Upper Boundary
1         0                ___
2         ___              ___
3         ___              ___
4         ___              ___
5         ___              10
Next, enter the class # for each of the numbers in the table of numbers below. You can do this
by inspection. Verify also that equation 4 gives the correct class number. This formula is
needed when the script is written to find the class.
Value   Class #
7.64    ___
6.22    ___
8.75    ___
1.61    ___
4.17    ___
6.91    ___
1.88    ___
1.96    ___
8.23    ___
4.31    ___
8.84    ___
5.59    ___
1.5     ___
5.94    ___
5.78    ___
2.66    ___
2.8     ___
8.34    ___
3.68    ___
4.91    ___
Now, count the number of data values in each class and enter them into column (1) of the
table below. Then enter the class frequency and class probability density values according to
normalizations 1, 2, and 3 given above.
Class #   (1) # in Class   (2) Class Frequency   (3) Class Probability Density   Area of Class
1
2
3
4
5
For (1) you should have 4,3,6,3,4 and for (2), you should have 2,1.5,3,1.5,2 and for (3) you
should have 0.1,0.075,0.15,0.075,0.1 and for the area of each class, you should have
0.2,0.15,0.3,0.15,0.2.
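The bookkeeping above is easy to script. The following Python sketch (illustrative only; the text itself asks you to do this by hand and then in Excel) reproduces the class counts and the three normalizations for the 20 practice values:

```python
values = [7.64, 6.22, 8.75, 1.61, 4.17, 6.91, 1.88, 1.96, 8.23, 4.31,
          8.84, 5.59, 1.5, 5.94, 5.78, 2.66, 2.8, 8.34, 3.68, 4.91]

n_classes, lo, hi = 5, 0.0, 10.0
width = (hi - lo) / n_classes   # class width = 2
N = len(values)                 # total number of data = 20

counts = [0] * n_classes
for v in values:
    k = int((v - lo) / width)   # 0-based class index (a value exactly at hi
    counts[k] += 1              # would need special handling; none is here)

frequency = [c / width for c in counts]          # normalization 2
probability = [c / (width * N) for c in counts]  # normalization 3
areas = [c / N for c in counts]                  # bar area = height * width

print(counts)                   # [4, 3, 6, 3, 4]
print(frequency)                # [2.0, 1.5, 3.0, 1.5, 2.0]
print(probability)              # [0.1, 0.075, 0.15, 0.075, 0.1]
print(round(sum(areas), 6))     # 1.0
```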
If the above data were sampled from a continuous uniform distribution, where every number
between 0 and 10 is equally likely, the probability of getting a number in any class would be 1/5 = 0.2.
Notice that the area of each class dithers around 0.2 and that the sum of all of the areas is 1.0.
Now, draw up the histogram on some scratch paper.
Circular Histograms
Circular histograms must also be constructed in a particular way. The important point is
that the area of the circular histogram element must be proportional to the number of data points
it contains. Figure 1.7 shows a portion of an angular histogram. The area of a circle of radius R
is πR². Since a complete circle represents 2π radians of angular rotation (360°), the area of a
pie-shaped segment of a circle that spans W radians is A_class = πR²(W/2π) = R²W/2. W is analogous to the
width of the histogram bar from before, but its units are radians. What remains is to normalize
the area. We can also say that the area of a segment divided by the total area of the histogram
should be given by the number of data in the class divided by the total number of data. That is,
the fractional area represented by a class should equal the fraction of the total that is contained
within the class. This gives us the relation: Aclass/A = f/N, where Aclass is the
area of the pie-shaped histogram segment, A is the total area of the histogram, f is the number of
data points in the class, and N is the total number of data points.
The equation is, then:

A_class / A = R²W / (2A) = f / N

and

R² = 2Af / (WN)
So, you first decide how many segments you want for the histogram. If you want 10 segments,
you have W = 2π/10 radians. All that is needed is to decide the total area of the histogram
based on how physically large you want the plot to be, determine the class frequencies (how
many are in each class), compute R, and make the plot.
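The radius formula can be applied directly in code. Below is an illustrative Python sketch; the function name segment_radius, the total area of 100, and the equal class counts are hypothetical choices for the example, not values from the text:

```python
import math

def segment_radius(A: float, f: int, W: float, N: int) -> float:
    """Radius of a pie segment whose area is proportional to its class count.

    A: total area of the circular histogram
    f: number of data points in the class
    W: angular width of the segment, in radians
    N: total number of data points
    """
    return math.sqrt(2 * A * f / (W * N))

# 10 equal segments, total area 100, 40 data points split evenly
W = 2 * math.pi / 10
counts = [4] * 10
radii = [segment_radius(100.0, f, W, 40) for f in counts]

# Sanity check: the segment areas (R^2 * W / 2) must sum back to A
total = sum(0.5 * r * r * W for r in radii)
print(round(total, 6))   # 100.0
```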
Thinking in "Statistics"
Some of the terms of statistics were defined for studies of a population of people. A
pollster wants to predict the outcome of an upcoming election. He/she can't poll every person. It
would be too expensive. So, a sample of individual members of the population is called up on
the telephone and asked their opinion. It is the responsibility of the pollster to sample the
population in such a way that the results will reflect the opinions of the population as a whole.
Data taken from a sample are used to infer the opinions of the population. In fact, this is the
central problem of statistics. We take a sample of measurements of our study area, then infer
the properties of the entire study area from that sample.
But, it isn't enough just to produce a single number that is the "answer". Some measure
of the accuracy of that number is required. This is usually expressed as a "confidence
limit". We ask: if this experiment were repeated many times, what are the upper and lower
limits within which 95% of the results would lie? If we wanted to be safe, we could specify 99%, or
some other percentage. But, the critical piece of information is the statement that yes, there are
errors, but there is an XX% chance that our result lies between the two specified limits.
The ideas of a sample and a population apply to geological statistics, as well as
opinion polling. The term population as it is used in statistics refers to the set of all possible
measurements of a particular quantity. For example, if we are interested in the nature of the
pebbles in a given conglomerate, the population will consist of all the pebbles in the
conglomerate. For a dice toss experiment, the population can be considered the infinity of
possible dice toss outcomes. When the experiment is repeated, the population of all possible
dice throws is being sampled repeatedly. In other words, we visualize an abstract collection of
all possible values of the quantity of interest. Then, when we make a measurement, toss the die,
throw the coin, whatever, we are sampling this abstract population. I like to think of it as a
giant grab bag full of small pieces of paper with a number written on each one. A sample is
taken by reaching in and grabbing out N pieces of paper and reading the numbers.
Suppose, for example, that you drill some cores of a rock to determine the orientation
of the remanent magnetism for a paleomagnetic study. The rock most likely does not have a
constant magnetic direction throughout its body, so a single core would give a very uncertain
result. The result can be improved by taking a number of cores and averaging the results. The
level of uncertainty will be affected by how many cores are averaged together and the amount of
variation of magnetic direction within the rock itself. The entire rock formation would be
considered the population and the collection of cores would be the sample. It is then the
statistician's task to infer the properties of the population from the measurements on the
sample.
Sampling Methods
Simply going out and making some measurements sounds easy. In practice, the process
is prone to errors and misjudgements. It can be very expensive to launch a field program, gather
data at great expense, then find in the analysis that there are not enough data or the data are
sampled incorrectly. You can usually improve the sampling strategy if you understand as much
as possible about the process or system you are sampling. When this is impossible, small test
experiments can be useful. The following paragraphs discuss some of the issues involved in
designing a sampling strategy.
It is important to take a truly random sample so that errors tend to average out. But,
getting truly random sampling is not always straightforward, especially in the earth sciences,
where values of interest may not be randomly accessible or where, for example, only certain
kinds of rock formations are exposed.
Suppose you want to sample the soil properties in a 1 km2 area. You might be
measuring a soil contaminant, nutrients, porosity, moisture, or any other property appropriate to
the study. You must first answer some very basic questions. Is the parameter you want to
measure distributed randomly within the area, or is there a systematic variation? You must also
determine the source of noise in the measurement. Is the noise due to error in the measurement
instrument, or is it due to natural variations in the properties of the soil? An example of a
systematic variation is a slow change of the parameter across the area you are sampling. For
example, if you are measuring nutrients but portions of the study area have large trees and some
have low plants, you would expect a dependence on this. If you want to study the properties
averaged over a large area, you may want to consider natural variations as noise to be averaged
out. On the other hand, variations of the nutrients that are caused by vegetation differences may
be of interest. It all depends on the goals of your study.
One sampling option is to adopt simple random sampling. The area is divided into a
grid and sampling takes place at randomly selected grid points. Grid points may be selected
using random number tables or a computer's random number generator. In the field, you could
toss a die. One method (Cheeney, 1983) is to divide the length into 6 intervals and select one
by tossing the die. The chosen interval is divided into 6 subintervals and one of these is chosen
by die toss. This subdivision can continue as far as needed.
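The die-toss subdivision can be sketched in a few lines of code. This is an illustrative Python sketch, not from the text; the function name die_toss_location and the three-level subdivision depth are assumptions made for the example:

```python
import random

def die_toss_location(length: float, depth: int = 3, rng=random) -> float:
    """Pick a sampling location along a line of the given length by repeated
    die tosses (after Cheeney, 1983): divide the line into 6 intervals,
    choose one with a die, subdivide the chosen interval into 6, and repeat.
    """
    lo, hi = 0.0, length
    for _ in range(depth):
        step = (hi - lo) / 6
        face = rng.randint(1, 6)       # the die toss
        lo = lo + (face - 1) * step
        hi = lo + step
    return (lo + hi) / 2               # center of the final subinterval

random.seed(1)
loc = die_toss_location(600.0)         # resolution after 3 tosses: 600/216 m
print(0.0 <= loc <= 600.0)             # True
```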
Another sampling method is stratified sampling. This method prevents the bunching
of data points that may occur with simple random sampling. For example, we might lay out a
grid of 10 x 10 squares and take a number of randomly located samples within each of the 100
squares. If you are measuring the magnetization direction of a rock outcrop by taking cores
and the random selection system bunched all of the samples in a small portion of the rock, it
would be wasteful to blindly take these data. However, if this selection of random data points
were rejected until a more satisfactory distribution was determined, the statistical assumptions
of randomness would be violated and conclusions based on statistics would be suspect. If you
will reject bunched data locations, stratified sampling is the method to choose. Methods of
identifying systematic variations will be discussed in later chapters, when correlation is
discussed.
Another approach would be a systematic sampling scheme in which you would pick a
location at random and distribute the data points at even intervals from the start point over the
remainder of the field area. You must be careful that this approach doesn't introduce any bias.
If there is any reason to believe that the property being measured varies systematically, this
approach may not work. This method generally reduces the number of data points needed in a
sample, but produces somewhat less precise results than other methods. A systematic sampling
method is used in point counting work in petrography.
How many samples should be taken? As we will see, the required sample size
depends on two major factors. The first is the precision required by the study. The more
precision you want in your results, the more samples you will need to take. The second is the
inherent variability in the population you are sampling. The greater this variability, the greater the
necessary sample size. Of course there are practical limits which must be considered. These
may include the availability of possible samples and the costs involved in sampling. More
complete discussions of sampling theory and problems are given in Chapters 5 and 6. Also, for a
discussion of sampling methods, see Cheeney (1983) and Cochran, W.G. (1977), Sampling
Techniques, 3rd ed., New York: Wiley.
Review
After reading this chapter, you should:
Be able to discuss the types of measurement scales discussed in this chapter: discrete,
continuous, nominal, ordinal, counting, interval, ratio and angular.
Understand the difference between accuracy and precision and know how to use significant
figures correctly.
Be able to describe a data distribution in terms of overall shape, central tendency and
dispersion. Know how to find the mean, mode, median, standard deviation and percentiles
for a data distribution.
Be able to construct a histogram of various kinds of data and compute the correct bar
heights.
Vocabulary:
sample
population
sample mean
sample variance
sample standard deviation
histogram
bias
Problems
1. Describe the following data as discrete, continuous, nominal, ordinal, counting, interval, ratio
and/or angular.
a.
b.
c.
d.
e.
f.
g.
h.
i.
2. Define the terms accuracy and precision by way of a dartboard analogy. That is, draw a
dartboard with 5 darts on it thrown by someone who is accurate but not precise, precise but not
accurate, precise and accurate, and neither precise nor accurate.
3. Give the answers to the following problems to the correct number of significant figures.
a. 13.67 + 4.2 =
b. 2.4 * 4.11 =
4. Using Excel, plot a regular histogram and a cumulative frequency histogram for the following
data set. Be sure to indicate the mean and standard deviation of the data on the plot of the
histogram. Note: you can plot a histogram with any desired bar heights by directly entering the
bar heights in the field labeled class frequencies and clicking on the Plot Data button.
43  52  56  59  64  65  67  69  72  78
47  53  57  60  64  65  68  70  72  78
48  54  57  61  64  65  68  70  73  79
49  55  58  62  65  65  69  70  74  79
49  55  58  63  65  65  69  70  74  83
5. In designing a water well, you need to select a screen slot size that will retain about 90% of
the filter pack material surrounding the well hole. Data from a sieve analysis of this filter pack
material are shown below. Construct a cumulative frequency diagram, using Excel, and
determine the necessary slot size by plotting the cumulative % caught on the sieves on the y-axis
and the sieve slot size on the x-axis.
sieve slot size   weight % caught on sieve
0.0               2
0.4               8
0.6               20
0.8               30
1.1               30
1.7               10

6.
Chapter 2
Plotting X - Y Data
Most of your X-Y data plots will be created in a charting program. A simple x-y chart created
in Microsoft Excel is shown below. The chart has a title, a label for each axis, and a legend that
describes the symbols that represent the two data sets that are plotted. The importance of making clear
data plots cannot be overemphasized. The reader should be able to understand the content of the plot
by looking at the plot and its caption.
Figure 2.1a. This is a sample data plot showing correct axis and data labeling. When plotting data, error
bars should also be shown on the plot.
Logarithmic Scaling
Often, it is useful to plot values on a logarithmic scale. The logarithm of the X values, the Y
values, or both is plotted. The most common reason for plotting on a logarithmic scale is that data
values span many orders of magnitude. This is true for the earthquake magnitude scale where ground
motion induced by quakes varies from sub-micron to meter amplitudes, a range of six decades or more.
Figure 2.1c. These two figures illustrate 2 ways of labelling the Y axis for a log plot. The left plot is the most
conventional and is what Excel produces. The right side is most useful for calculating the coefficients of the equation
of the line that fits the data. Notice, on the right, that the log of the y values is taken, then a linear y axis is used to
plot the values.
[Figure 2.2: a straight line with slope = 0.75/0.5 = 1.5 and the intercept marked.]
y = m x + b        (linear dependence)      (2-1)

y = A x^n + b      (power law dependence)   (2-2)

y = A e^(nx) + b   (exponential dependence) (2-3)
Equation 2-1 is the familiar equation of a straight line. It is characterized by its slope and intercept,
which is the value of y at x=0. A diagram is shown in figure 2.2. The slope, m, is 1.5 and the intercept,
b, is 2. It is possible to determine, using graphical methods, the unknown constants in equations 2-2
and 2-3. The following operations demonstrate how this is done.
Power Law Equation 2-2
Rearranging equation 2-2 slightly, it becomes:

y − b = A x^n
log(y − b) = n log(x) + log(A)

Letting Y_l = log(y − b) and X_l = log(x), this is

Y_l = n X_l + log(A)

We can plot Y_l vs X_l and the slope will be equal to n, the power of x in equation 2-2. The intercept will
be the value of log(A).
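This linearization is easy to verify numerically. The Python sketch below is illustrative: the constants A = 3 and n = 2 are made-up test values, and the data are noise-free with b = 0, so two points on the log-log line suffice to recover n and A.

```python
import math

# Hypothetical noise-free data following y = A * x**n with A = 3, n = 2, b = 0
xs = [1.0, 2.0, 5.0, 10.0]
ys = [3 * x**2 for x in xs]

# Take logs: log(y) = n*log(x) + log(A), a straight line in (X_l, Y_l)
Xl = [math.log10(x) for x in xs]
Yl = [math.log10(y) for y in ys]

# Slope of the straight line gives n; the intercept gives log(A)
n = (Yl[-1] - Yl[0]) / (Xl[-1] - Xl[0])
logA = Yl[0] - n * Xl[0]

print(round(n, 6))           # 2.0
print(round(10 ** logA, 6))  # 3.0
```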
Exponential Equation 2-3
Similarly, for equation 2-3, we have:
y − b = A e^(nx)
log(y − b) = log(A) + n x log(e)

We let:

Y_l = log(y − b)

So,

Y_l = n log(e) x + log(A)
We can see that the Y axis should be plotted on a logarithmic scale as log(y-b) and the X axis on a
linear scale. The slope will be the value of n log(e). There is a complication in this procedure for these
two functional forms. We do not know the value of the constant b. Often we expect that b = 0, as in
the case of radioactive decay. If it is strongly suspected that the data follow a power law with a nonzero b, then b could be varied in the plot until the best straight line is achieved.
So far, appeals to intuition are being made so that you obtain an understanding of the underlying
principles. However, the fitting of straight lines in the presence of noisy data is fraught with dangerous
traps in interpretation. Questions that must be asked of any data fit are:
a) what other values of the parameters produce an equally good fit?
b) do other functional dependencies produce an equally good fit?
A more quantitative definition of a "good fit" will be given when computer curve fitting is discussed
using Excel.
Helpful hints:
It is not necessary to have the X = 0 value plotted to determine the intercept, which produces the
b value in equations 2-1 to 2-3. Once it has been determined that the data follow a straight line
dependence, any X,Y value (from the straight line) may be used to solve for b. Just read an X,Y
value from the graph, substitute it into the equation (slope is known, but b is not), and solve for b.
x     y           log(y)
1     7.389056    0.868589
4     2980.958    3.474356
5     22026.47    4.342945
10    4.85E+08    8.68589
20    2.35E+17    17.37178
The first thing to notice about the numbers in the y column is that they range from 7.39 to 2.35 × 10¹⁷, an
extremely large range. An x-y plot of this data is shown in figure 2.3 below.
[Figure 2.3: linear-axis plot of y = exp(2*x).]
Notice that the extreme range of the data causes all of the data except the largest to be plotted on the x
axis. We suspect that we should make the y axis into a log axis. Figure 2.4 shows this.
Figure 2.4. The example data is plotted on a log y axis. Note: exp(x) means ex.
Notice that the data plot as a straight line. We can measure the slope and intercept of this straight line
and find the "A" and "n" coefficients of the equation, to make sure they agree with what we already
know. But first, note that the labels on the Y axis still reflect the original data values. If we use
these numbers to calculate the slope and intercept of the straight line, we will get the wrong answer. This
is because the Excel plot routine, as a convenience for those who want to read the original
numbers from the Y scale, did not actually label the log(y) values. The easiest way to get the log values is
to make a third column that is log(y), then do a new plot of x vs log(y). This plot is shown in figure 2.5
below.
We can see that the slope of this line is about 0.87. It can be measured from the plot itself, or calculated
from the table of numbers (don't do this with real data; it's best to do a least squares fit when data have
errors):

    slope = (17.37 - 4.34) / (20 - 5) = 0.87
See if you can get these numbers yourself. The plot above also shows the y intercept to be 0. So,
referring back to our equation:

    log(y) = n log(e) x + log(A)
2-7
Also, since b = 0 (the y intercept), log(A) = 0. Because log(1) = 0, our original value for A is
1. Likewise, the slope is n log(e) = 0.87, and log(e) = 0.4343, so n = 0.87/0.4343 = 2. So, we have
created some data artificially, pretended we didn't know where it came from, then
worked backwards to get our initial equation. This is the procedure for all of the other functional forms.
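The whole work-backwards procedure can be sketched in Python. This is an addition, not part of the original text; it regenerates the example data y = exp(2x), fits a straight line to x vs log10(y) by least squares, and recovers n and A:

```python
import math

# Recreate the chapter's example data y = A * exp(n*x) with A = 1, n = 2.
xs = [1, 4, 5, 10, 20]
ys = [math.exp(2 * x) for x in xs]
logy = [math.log10(y) for y in ys]

# Least-squares slope and intercept of log10(y) vs x.
N = len(xs)
xbar = sum(xs) / N
lbar = sum(logy) / N
slope = (sum((x - xbar) * (l - lbar) for x, l in zip(xs, logy))
         / sum((x - xbar) ** 2 for x in xs))
intercept = lbar - slope * xbar

n = slope / math.log10(math.e)  # slope = n * log(e)
A = 10 ** intercept             # intercept = log(A)
print(round(n, 6), round(A, 6))  # 2.0 1.0
```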
Complications:
If the values of one of the variables are negative, you cant take the log, because the logarithm of a
negative number has no meaning. But, you can make a substitution x = -x in the equations. This lets you
take the log(-x) for all values, so the log would become the log of a positive number. You need to adjust
the equation for the equation coefficients, though.
Also, most data have errors. We did the example with noise free data, so it worked out perfectly. In the
presence of errors, the coefficients that you solve for will have errors too. Also, sometimes you cannot
tell whether a log or linear axis gives the best fit. You have to use what you know about the process that
created the data and use your best judgement. It is never wise to blindly apply mathematical
techniques without knowing something about the processes that created the data.
Review
After reading this chapter, you should:
Know how to find the functional dependence of common forms and find the unknown
parameters in the function, from the plot. Be sure that you can derive the equations for slope
and intercept in all three cases.
Problems:
Problem 1: This problem is designed to support your understanding of the simple derivations of the
slopes and intercepts for the 3 functional forms of equations that have been discussed.
1a) If all values of x are negative, you cannot take the log of these numbers to test for a power law or
exponential dependence. Derive the equation for slope and intercept for power law and exponential
dependencies when all values of x are negative.
Problem 2: Do problem 1, but when all values of y are negative.
Problem 3. Seismologists have noticed that the relationship between the magnitude of earthquakes and
their frequency is described by the equation
y* = a-bx, where y* is the log of the number of earthquakes and x is the magnitude of the earthquakes.
For the following data, find 'b'. Use the mid-point of the range of values given for x.
magnitude of earthquakes    number of earthquakes
6.5-7.0                       2
6.0-6.5                       3
5.5-6.0                      10
5.0-5.5                      16
4.5-5.0                      72
4.0-4.5                     181
3.5-4.0                     483
3.0-3.5                     846
2.5-3.0                     302
2.0-2.5                      73
Problem 4.
Determine the half-life of chemical B based on the following experimental data. (The
half-life is the time at which one-half of the chemical remains.)

fraction of chemical left    time (days)
0.97                          0.2
0.92                          0.5
0.84                          1.0
0.71                          2.0
0.42                          5.0
0.18                         10.0
0.03                         20.0
0.006                        30.0
Problem 5.
The following data were collected during an experiment to determine the relationship
between temperature and vapor pressure for an organic chemical. From previous experience you know
that the general form of the equation that describes this relationship is given below.
    ln P = A/T + B

T (°K): 283, 298, 323, 343
Dataset 2
0,2
1,1.1
2,0.6
3,0.33
4,0.18
5,0.1
6,0.05
7,0.03
8,0.02
9,0.01
Dataset 3
0,0
1,1
2,5.66
3,15.59
4,32
5,55.9
6,88.18
7,129.64
8,181.02
9,243
Dataset 4
0,2.02
0.83,1.23
1.67,0.71
2.5,0.45
3.33,0.26
4.17,0.15
5,0.16
5.83,0.08
6.67,0.02
7.5,0.03
8.33,0.02
9.17,0.04
CHAPTER 3
Correlation and Regression
In this chapter, we discuss correlation and regression for two sets of data measured on a
continuous scale. We begin with a discussion of scatter diagrams.
Scatter Diagrams
A scatter diagram is simply an x,y plot of the two variables of concern. For example, figure 3.1
shows a scatter diagram of length and width of fossil A. These data are listed in Table 3.1.
Table 3.1
length    width
18.4      15.4
16.9      15.1
13.6      10.9
11.4       9.7
 7.8       7.4
 6.3       5.3

Figure 3.1. Scatter diagram of the length and width of fossil A.
A good example of the need to understand the calculation at more than a superficial level is the
computation of the correlation between two variables, x and y. An x-y scatter plot is always done first.
Then you can visually determine whether there might be a correlation and whether it is reasonable to
calculate a correlation coefficient. Some interesting misinterpretations of the correlation coefficient will
be illustrated in the following pages. Even though the computer is a great tool for doing extensive
computation, you should do the calculation by hand, at least once, to make sure you understand the
process.
Version: December 20, 2001
University of California, 2001
    var(x) = s_x^2 = [ Σ(x_i - x̄)^2 ] / (n - 1) = [ Σx_i^2 - (Σx_i)^2 / n ] / (n - 1)        (A)
Notice that the variance, in the above equation, is the standard deviation of the data squared. The
standard deviation was defined in chapter 1. The second form of the variance (right hand side of the
equation) is exactly equivalent to the standard definition, but is sometimes convenient to use when
calculating with a calculator, or when deriving equations. In general, the variance will increase as scatter
in the data values increases.
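As a quick numerical check of the two forms of equation A (a Python sketch added here, using the length data that appears later in Table 3.1):

```python
# Verify that the definition of the variance and its computational
# shortcut (the two forms of equation A) give the same number.
xs = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]  # length data from Table 3.1
n = len(xs)
xbar = sum(xs) / n

var_def = sum((x - xbar) ** 2 for x in xs) / (n - 1)
var_short = (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)
print(round(var_def, 3), round(var_short, 3))  # 23.412 23.412
```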
Another important quantity is the "covariance" between two variables, x and y. The formula for
the covariance is given below. It is very analogous to the variance, but includes both x and y values.
Notice the similarities between the two equations. Instead of squaring x, we have x times y values. This
keeps the dimensions the same.
    cov(x, y) = s_xy^2 = [ Σ(x_i - x̄)(y_i - ȳ) ] / (n - 1) = [ Σx_i y_i - (Σx_i)(Σy_i) / n ] / (n - 1)        (B)
The covariance is an expression of the relationship between the x and y data points. Notice that it is
similar to the standard deviation of a single variable squared, but instead of squaring values, x and y
values are multiplied.
Hints on understanding these formulas: It is very important to become familiar with the summation
notation. A few minutes spent on this notation will be very worth your while when you try to understand
more complex concepts and formulas later in this chapter. Suppose there are n values of x, and for simplicity
suppose these values are 1, 2 and 3, so n = 3. The Σ sign means to sum over all of the n values of x, so
Σx_i = 1 + 2 + 3 = 6. For this simple data set, work through formula A above. Use both forms of the formula
to convince yourself that they are equivalent. After you do this, assume a y dataset of 2, 3, and 4. Now do
the covariance formula (B) and see what you get. Do both forms. They should agree. If they do, you will
have mastered the summation notation.
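The exercise in the hint can be checked in a few lines of Python (a sketch added here, using the hint's x = 1, 2, 3 and y = 2, 3, 4):

```python
# Work the hint's exercise: both forms of formula A (variance) and both
# forms of formula B (covariance) for x = 1, 2, 3 and y = 2, 3, 4.
xs = [1, 2, 3]
ys = [2, 3, 4]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

var1 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
var2 = (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

cov1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
cov2 = (sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n) / (n - 1)
print(var1, var2, cov1, cov2)  # 1.0 1.0 1.0 1.0
```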
Using x̄ = (1/n) Σx_i, the second form of equation A above can be written as:

    s_x^2 = [ Σx_i^2 - n x̄^2 ] / (n - 1)
Correlation coefficients
The problem with the previous formula for covariance is that its actual value is not simply related
to the relationship between x and y. It would be more elegant if we had a scale where 0 implied no
relationship and 1 (or -1) implied the maximum relationship. This can be achieved by defining the
Pearson's correlation coefficient. The Pearson's correlation coefficient, denoted by rxy is a linear
correlation coefficient; that is, it is used to assess the linear relationship between two variables. It is
used for data that are random and normally distributed. The xy subscripts are used to emphasize the fact
that the correlation is between the variables, x and y. This coefficient is very important in the least
squares fit of a straight line to x and y data.
The value of rxy can vary from -1 to +1. When the two variables covary exactly in a linear
manner and one variable increases as the other increases, rxy =+1. When one variable increases as the
other decreases, rxy =-1. When there is no linear correlation between the two variables, rxy =0. Figure
3.2 shows some scatter diagrams for various values of rxy.
Figure 3.2. Scatter plots for different values of the Pearson's correlation coefficient (r = +1, r = -1, r = +0.8, and r = 0). Notice that the
plots show increasing scatter as the r value decreases toward 0.
    r_xy = { [ Σ(x_i - x̄)(y_i - ȳ) ] / (n - 1) } / ( s_x s_y ) = s_xy^2 / ( s_x s_y )        (3-1)

    r_xy = [ Σx_i y_i - (Σx_i)(Σy_i)/n ] / sqrt( [ Σx_i^2 - (Σx_i)^2/n ] [ Σy_i^2 - (Σy_i)^2/n ] )        (3-2)
where x is one variable, y is the second variable and n is the sample size. It doesn't matter which
variable we call 'x' and which we call 'y' in this case. Notice that all we had to do to convert from the
covariance was to divide by s_x s_y. This division "normalizes" the value of the covariance so that it varies
between -1 and +1.
For the data in Table 3.1 (to make sure you understand the calculation, see if you can
duplicate the numbers given below; they correspond to eq. 3-2):

    r = [ 888.48 - (74.4)(63.8)/6 ] / sqrt( [ 1039.62 - (74.4)^2/6 ] [ 760.92 - (63.8)^2/6 ] ) = 0.99
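The same calculation can be sketched in Python, applying equation 3-2 to the Table 3.1 data (an added check, not part of the original text):

```python
import math

# Pearson correlation coefficient (equation 3-2) for the fossil data
# of Table 3.1: length as x, width as y.
length = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]
width = [15.4, 15.1, 10.9, 9.7, 7.4, 5.3]
n = len(length)

sx, sy = sum(length), sum(width)                 # 74.4 and 63.8
sxy = sum(x * y for x, y in zip(length, width))  # 888.48
sxx = sum(x * x for x in length)                 # 1039.62
syy = sum(y * y for y in width)                  # 760.92

r = (sxy - sx * sy / n) / math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
print(round(r, 2))  # 0.99
```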
Extra help: When calculating the above values, it is very important to do the calculations in the correct order.
For example, suppose you have 3 numbers, say 3, 2, and 5. The sum of them is 10. In our summation notation, this
is:

    Σx_i = (3 + 2 + 5) = 10

If we sum the values first and then square the sum, we get:

    ( Σx_i )^2 = (3 + 2 + 5)^2 = 10^2 = 100

But, suppose we square the values before adding them together. This is indicated as:

    Σx_i^2 = (3)^2 + (2)^2 + (5)^2 = 9 + 4 + 25 = 38
Plug numbers into the examples in this book until you are comfortable and get answers that agree with the
book's.
We have calculated rxy, but as of yet, we have said nothing about the significance of the
correlation. The term "significance" has meaning both in real life and in statistics. An experimental result
may give us a number, but is that number "significant"? In statistics, we ask how probable that
number is, given the errors (or randomness) in the data. For example, suppose you are a psychic
studying psycho-kinesis, which is the use of the mind to influence matter. You concentrate on
"heads". A coin is tossed once and the side that comes up is "heads". Wow! Is this significant? Does it
mean anything, or could the side just as easily have been "tails"? The probability of heads coming up is
1/2. Most would agree that a 50-50 probability is pretty "insignificant", and psycho-kinetic powers
remain unproven. But suppose, after 100 tries, the coin toss favors heads 75% of the time. This result is
highly unlikely to arise from randomness alone. Therefore the "significance" of the result is much greater. This is an
important point, and applies to correlation as well. Intuitively, the significance of a correlation of 0.9 is
much greater when the sample size is very large than when the sample size is very small. For a small
sample size, the alignment of data points along a straight line may be fortuitous, but when many data
points lie along a straight line, the case becomes much more convincing, or "significant". The calculation
of "significance" will be discussed in greater depth in later chapters.
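The coin-toss significance argument above is easy to check by simulation. This sketch is an addition; the trial count and seed are arbitrary choices, not values from the text:

```python
import random

# How often does pure chance produce 75 or more heads in 100 tosses of
# a fair coin? Simulate many 100-toss experiments and count.
random.seed(1)  # arbitrary seed so the run is repeatable
trials = 20_000
extreme = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(100)) >= 75
)
# The fraction is essentially zero: 75 heads out of 100 is far too
# unlikely to attribute to randomness alone.
print(extreme / trials)
```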
Figure 3.3. Data which show a low correlation coefficient, yet are obviously correlated: (a) r = 0, (b) r = 0.8. These kinds of data illustrate
inappropriate applications of the Pearson correlation coefficient.
Obviously, these data are not randomly distributed, and a quick look at the scatter plot verifies this.
A problem also occurs when the data are acquired from a 'closed system'. A closed system is
one in which the values of x and y are not completely independent because the fixed total of all
measurements must add to 100% or some other fixed sum. Closed system data occur frequently in
geologic studies. For example, closed systems exist in measurements of percentage compositions in
studies of whole rock chemistry and in work with ternary plots of rock composition. Because the sum
of the various measurements must add to a fixed sum, an increase in the proportion of one variable can
only occur at the expense of one of the other variables. Therefore, negative correlations are artificially
induced.
One final point is a reminder that a significant correlation between two variables does not imply
a cause and effect relationship. We may notice that at the end of the month, our bank account balance is
at its lowest level. Does this mean our bank account is somehow linked to the calendar? No, it's because
our paycheck is deposited on the first of the month. The day of the month doesn't CAUSE our bank
account to go down; it is just a variable that varies in the same way.
Figure 3.4. A plot of depth vs temperature that will be used in the least squares fit example.
There are two situations that will affect our approach to the regression:
1. The error or variation is almost exclusively in one of the two variables. This situation would occur,
   for example, if one were measuring fault offset vs time. The time measurement would be very precise,
   but the offset measurement would be subject to measurement errors and natural variations in
   distance due to shifts in monuments.
2. The error or variation is inherent in both variables. In this case, we compute the reduced major axis
line.
depth    temperature (C)
0.25       25
0.5        35
1.0        60
2.0        80
3.0       105

Table 3.2
Figure 3.5. Plot of the temperature vs depth data with the fitted line y = ax + b. (x_i, y_i) are the coordinates
of the ith data point; e_i is the difference between the value
predicted by the straight line and the actual data value.
    y_i = a x_i + b + e_i        (3-3)

x_i and y_i are the values of the ith data point, a is the slope of the straight line, and b is its intercept. e_i
is the error, or misfit. In order to get the best values for a and b, we want the sum of squares of all
of the errors to be as small as possible, or:

    Rd = Σ ( y_i - ŷ_i )^2        (summed from i = 1 to n)

where ŷ_i is the predicted y (from the straight line) at point i:

    ŷ_i = b + a x_i        (equation for predicted y)
To find the best values for a and b, we can differentiate Rd with respect to a and set the result to zero,
then do the same for b. Then we solve the two equations for a and b. We do:

    ∂Rd/∂a = 0   and   ∂Rd/∂b = 0
When we do the operations indicated in the two equations, we have 2 equations and 2 unknowns, and
can then solve for a and b. The 2 equations are, after differentiating and simplifying:

    Σ x_i y_i = b Σ x_i + a Σ x_i^2        (3-4)

and

    Σ y_i = n b + a Σ x_i        (3-5)
Working on eq. 3-5, we divide each side by n and use x̄ = (1/n) Σ x_i. Then we get:

    ȳ = b + a x̄        (C)

Now, we can substitute the above equation into eq. 3-4, where we get:

    Σ x_i y_i = ( ȳ - a x̄ ) Σ x_i + a Σ x_i^2
              = n x̄ ȳ + a ( Σ x_i^2 - n x̄^2 )
              = n x̄ ȳ + a (n - 1) s_x^2
Ok, now rearrange the last line of the above equation:

    [ Σ x_i y_i - n x̄ ȳ ] / (n - 1) = a (n - 1) s_x^2 / (n - 1)

Notice that the left side of the equation is s_xy^2 and that the (n - 1) cancels on the right side. This leaves us
with the simple formula:

    s_xy^2 = a s_x^2
or:

    a = s_xy^2 / s_x^2

Remembering our definition of r_xy, we get equation 3-6 below. Putting this value for a into eq. (C)
above, we get equation 3-7 below.

    a = r_xy ( s_y / s_x )        (3-6)

    b = ȳ - r_xy ( s_y / s_x ) x̄        (3-7)
The sy and sx in the equation are the standard deviation of the y values and the standard deviation of the
x values. rxy is the Pearson correlation coefficient, which was defined earlier. Notice the relationship, in
equation 3-6 between the slope and the correlation coefficient. As r_xy gets larger, the slope a gets larger
also, and if r_xy = 0, then the slope of the best fit line is zero too.
Finally, we can write the equation for the best fit line as:

    y_i = ( ȳ - r_xy (s_y/s_x) x̄ ) + r_xy (s_y/s_x) x_i        (3-8)

Another useful form, easier to remember, for the above equation is:

    ( y_i - ȳ ) / s_y = r_xy ( x_i - x̄ ) / s_x        (3-9)
Discussion: It is important to remember that the best fit line will not go through each data point. From
algebra, we remember that we need two equations to solve exactly for two unknowns. A straight line
has only two unknowns, the slope and the intercept, so two data points determine a line exactly. When we
have more than two values for x and y, we have more equations than unknowns. In fact, if the data have
errors, the straight line slope and intercept will be different for each pair of data points. The problem is
that we have too many data points to exactly fit the line to all of them. This is called an over-determined
problem. In fact, it would be meaningless to try to exactly fit each data point, since there are errors in
real data. That is why we only try to find the "best fit" line for the data.
For the data in Table 3.2:

    slope = [ 558.75 - (6.75)(305)/5 ] / [ 14.3125 - (6.75)^2/5 ] = 28.27

    intercept = 61 - (28.27)(1.35) = 22.84

and y = 22.8 + 28.3x. As a check on the calculation, the best-fit line should pass through the point
( x̄, ȳ ).
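The worked example can be verified with a short Python sketch (an addition to the text) that applies the least squares formulas to the Table 3.2 data:

```python
# Least squares slope and intercept for the temperature/depth data of
# Table 3.2, reproducing the worked numbers 28.27 and 22.84.
depth = [0.25, 0.5, 1.0, 2.0, 3.0]
temp = [25, 35, 60, 80, 105]
n = len(depth)

sx, sy = sum(depth), sum(temp)                 # 6.75 and 305
sxy = sum(x * y for x, y in zip(depth, temp))  # 558.75
sxx = sum(x * x for x in depth)                # 14.3125

slope = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
intercept = sy / n - slope * sx / n
print(round(slope, 2), round(intercept, 2))  # 28.27 22.84
```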
In the least squares regression line, we assumed that we knew one of the variables much better
than the other. If this is not the case, then a least squares regression is not appropriate. A reduced
major axis line is another type of linear regression line. Here, the sum of the areas of the triangles between
the data points and the best-fit line, as shown in Figure 3.5, is minimized.
The equations for b and a for a reduced major axis line are:

    b = sqrt( [ Σ y_i^2 - (Σ y_i)^2 / n ] / [ Σ x_i^2 - (Σ x_i)^2 / n ] )   and   a = ȳ - b x̄

where b is the slope and a is the intercept.
For the fossil data of Table 3.1:

    b = sqrt( [ 760.92 - (63.8)^2/6 ] / [ 1039.62 - (74.4)^2/6 ] ) = 0.84

    a = 63.8/6 - (0.84)(74.4/6) = 0.22

so y = 0.22 + 0.84x.
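A Python sketch (added here) of the reduced major axis calculation for the same Table 3.1 data:

```python
import math

# Reduced major axis line for the fossil data of Table 3.1:
# slope b is the ratio of the spreads in y and x, intercept a from the means.
xs = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]  # length
ys = [15.4, 15.1, 10.9, 9.7, 7.4, 5.3]   # width
n = len(xs)

b = math.sqrt((sum(y * y for y in ys) - sum(ys) ** 2 / n)
              / (sum(x * x for x in xs) - sum(xs) ** 2 / n))
a = sum(ys) / n - b * sum(xs) / n
print(round(b, 2), round(a, 2))  # 0.84 0.22
```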
Review
After reading this chapter, you should:
Know what least squares regression and reduced major axis lines are and how to calculate
them.
Exercises
1. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following
data. Show all your work!

island age (million years)
 0          0
 0.5      200
 2.8      400
 7.8      800
11.2     1050
2. Determine the least squares regression line and 90% confidence interval for the data in Exercise
#1 above. Which variable should be called 'x' and which should be called 'y'? Does it matter?
Show all your work!
3. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following
data. Show all your work!
SiO 2 (weight %)
45
50
55
44
53
48
4. Determine the reduced major axis line for the data in Exercise #3 above. Which variable should
be called 'x' and which should be called 'y'? Does it matter? Why is a reduced major axis line
more appropriate than a least squares regression line, assuming the error in the analytical
techniques used for all analyses is the same. Show all your work!
5. During a Journal Club talk, a student states that the correlation between two variables is 98%.
Should you be impressed by this statistic or do you need more information? Explain.
6. List four pitfalls to watch out for when working with correlation and regression statistics.
CHAPTER 4
The Statistics of Discrete Value Random Variables
In the study of statistics, it is useful to first study variables having discrete values. Familiar examples are
coin and dice tosses. This gives us a chance to better understand beginning statistical principles and
leads naturally to the study of continuous variables and statistical inference.
Combinations
An understanding of combinations is the first step in learning about probability and statistical inference.
Let's begin with an analysis of the coin toss. When you toss a coin 10 times, how many heads and tails
do you expect? Right now, it would be a good idea for you to toss a coin 10 times and see how many
heads you get. Did you expect to get that number?
Simulating coin tosses using Excel:
The random number function, rand(), generates a random number between 0 and 1. You can use
Excel's IF statement to test whether the random number is greater or less than 0.5 to give it a two-state
value. To do this, make a column of 10 random numbers in B2 to B11 using =rand(). In C2, enter
=IF(B2<0.5,0,1). Extend the formula to C11. Notice that the value in column C is 0 or 1, depending
on whether the random number is < 0.5 or >= 0.5. You can sum up the number of heads by putting
=SUM(C2:C11) in cell C12. Press the Apple and = keys simultaneously (or Ctrl-= on a PC) to get new
simulated toss experimental results.
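The same simulation can be sketched outside Excel, for example in Python (an added illustration):

```python
import random

# Simulate 10 coin tosses: a draw >= 0.5 counts as heads (1),
# otherwise tails (0), mirroring the Excel IF formula above.
tosses = [1 if random.random() >= 0.5 else 0 for _ in range(10)]
print(tosses, "heads:", sum(tosses))
```

Run it several times; the head count jitters around 5, just as recalculating the spreadsheet does.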
You don't need to use Excel to simulate coin tossing. Go ahead and toss 4 coins right now. Do it
several times. You can either toss one coin four times, or 4 coins once. The statistics are the same.
When you toss a coin, you expect to get heads half the time and tails the other half. There are 2
possible outcomes in a single coin toss. These are a) heads and b) tails. There is only one outcome
that we are interested in (heads), so we define a heads as a success, and we have 1 of the outcomes
that is a success. To get the ratio of heads to tosses, you do:

    Ratio = (# successes) / (# possible outcomes) = 1/2

So, this shows how we find that half of the tosses are expected to be heads. Ratio is the probability
that a single toss will come out to be a head.
Suppose you are performing an experiment that consists of tossing a coin 4 times. You should get 4*P
= 4*(1/2) = 2 heads. Of course, sometimes you get 0 heads, 1 head, 3 heads, or 4 heads. But, if you
toss the coin many, many times, you expect the ratio of heads to total tosses to become closer and closer to
0.5.
For the 4 coin toss experiment, is it possible to predict the number of times we expect to get some
number of heads different from 2? It has already been shown that it is possible to predict the
probability of getting a head in a single toss using the number of possible outcomes of a toss. Let's
write down all of the possible outcomes when we toss 4 coins. Each outcome is equally likely. The
possibilities are shown below, with the first letter representing the outcome of the first toss, the second
letter representing the outcome of the second toss, and so on; TTHH would be tails for the first toss, tails for the
second, heads for the third, and heads for the fourth.
TTTT, TTTH, TTHT, TTHH, THTT, THTH, THHT, THHH
HTTT, HTTH, HTHT, HTHH, HHTT, HHTH, HHHT, HHHH
There are several facts to notice. The first is that the counting started with all Ts and progressed as if
counting in binary, where a T is a 0 and an H is a 1. There are other ways to do this, but
binary counting will come in handy later. The 4 tosses become analogous to a 4-bit number, which
has 2^4 = 16 possible values. Notice that there are 16 combinations of heads and tails that can occur for
the 4 coin toss sample we are discussing. So, how many outcomes are there with 0 heads? Count
them. The answer is 1. There are 4 outcomes with 1 head, 6 outcomes with 2 heads, 4 outcomes with
3 heads, and 1 outcome with 4 heads. Of course, the total of all the outcomes is 16, as it has to be.
We can't just apply the formula that we used before. The number of possible outcomes is 16, since
there are that many combinations of heads and tails when a coin is tossed 4 times. So, the probability
for a single outcome must be 1/16. That is the probability for any one of the above combinations
happening in the sample. But when we say we have a success whenever any one of several of the above
outcomes occurs, we add the probabilities for all successful outcomes. Said another way, if we
defined success such that half of the 16 combinations were successes, then the probability would have
to be 1/2, wouldn't it? So, the formula becomes:
    P(1 head) = (# successes) / (# outcomes) = 4/16
So, the probability of getting 1 head is 1/4, which means that if we conduct the 4 coin toss experiment 12
times, we expect to get 0 heads 12*(1/16) times, 1 head 12*(4/16) times, 2 heads 12*(6/16) times, 3
heads 12*(4/16) times, and 4 heads 12*(1/16) times.
When asked to determine the probability of a particular random combination occurring, you
can Brute Force the result by writing down all possible combinations, then counting the
number of combinations that you consider successes and dividing that number by the total
possible combinations.
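The brute-force approach is easy to automate. This added sketch enumerates all 16 outcomes of the 4-coin toss and tallies the head counts found above:

```python
from itertools import product

# Enumerate all 2**4 = 16 equally likely outcomes of 4 coin tosses and
# count how many outcomes contain each possible number of heads.
outcomes = list(product("TH", repeat=4))
counts = {k: sum(o.count("H") == k for o in outcomes) for k in range(5)}
print(len(outcomes), counts)      # 16 {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
print(counts[1] / len(outcomes))  # 0.25, the probability of exactly 1 head
```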
Rules of Probability
The probability of rolling a "5" with a single die is 1/6, but what is the probability of rolling a "5" or a
"6"? What is the probability of rolling two "6"s with two dice or that the sum of the faces of two dice
will add up to "5"? We know that the probability of any one coin toss being "heads" is 0.5, but what is
the probability that if we toss 4 coins, all will be "heads" or that 3 of the 4 coins will be "heads"?
There are two basic rules of probability which you need to know. Rule #1 is that the probability of
occurrence of more than one mutually exclusive event is the sum of the probability of the
separate events. Mutually exclusive means that only one possible event can occur at one time (e.g., a
coin toss is either "heads" or "tails", it cannot be both). The probability of rolling a "5" or a "6" with a
single throw of a die is 1/6 + 1/6 which equals 1/3. The probability of throwing either a "heads" or
"tails" is 1/2 + 1/2 which equals 1, which makes sense since there are no other choices.
Rule #2 is that the probability of the occurrence of a number of independent events is the
product of the separate probabilities. Independent means that the occurrence of one event does not
affect the probability of the occurrence of any other event. The probability of rolling two "6"s with two
dice is therefore 1/6 * 1/6 which equals 1/36 and the probability of tossing 4 "heads" in a row is 1/2 *
1/2 * 1/2 * 1/2 which equals 1/16.
In some problems, the probability of a particular event may not be constant. For example, consider the
following problem. Two cards are dealt from a deck of 52 cards. What is the probability that both
cards are aces? Since there are 4 aces in the deck and there is equal probability of receiving any card,
the probability of being dealt an ace with a single card is 4/52. The probability of being dealt a second
ace is 3/51 since one ace has been removed from the deck. So the probability of being dealt 2 aces
with 2 cards is 4/52 * 3/51 which equals 0.0045 or about 0.5%.
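The two rules, and the dealt-cards example, can be sketched with exact fractions (an added illustration):

```python
from fractions import Fraction

# Rule 1: mutually exclusive events add.
p_5_or_6 = Fraction(1, 6) + Fraction(1, 6)      # a 5 or a 6 on one die
# Rule 2: independent events multiply.
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)   # two 6s with two dice
p_four_heads = Fraction(1, 2) ** 4              # 4 heads in a row
# Dependent events: the second probability changes after the first card.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)  # two aces off the top
print(p_5_or_6, p_two_sixes, p_four_heads)  # 1/3 1/36 1/16
print(float(p_two_aces))  # about 0.0045, as in the text
```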
Probability distributions
A probability distribution is a plot of the expected frequency of an event occurring. Remember that
you have already been exposed to sample distributions, where the actual data sampled is being
plotted. Much of statistics involves the comparison of probability distributions with sample
distributions. Sometimes we use the term "expected frequency" rather than probability. There are
several important probability distributions in the field of statistics. In this chapter, we will be concerned
with the Binomial Distribution.
    E[x] = μ        (4-1)

where E[x] is the expectation value and is the average of an infinity of experiments (ensemble
average). μ is the population mean because it samples the entire population. Suppose we apply this
to the experimental result: the difference between the number of heads and the number of tails in a coin
toss experiment. We define:

    d = ( # of heads - # of tails ) / N
We know that for any particular experiment, d will generally not be zero. We express the average of d,
for an infinite number of experiments, as E[d]. We get:

    E[d] = E[ ( # of heads - # of tails ) / N ] = 0
The above equation states that, even though d may only rarely be exactly zero, when many coin toss
experiments are averaged, the positive and negative d's ultimately cancel and d asymptotically
approaches zero. It is exactly zero in the limit of the average of an infinity of experiments. There are
some general properties of E[] that are properties of the normal average also. Several useful relations
are:
    a) E[x] = μ
    b) E[ax] = aμ
    c) E[ax + b] = aμ + b
    d) E[ Σ( x_i - μ )^2 / N ] = σ^2
You can verify the above by considering the way a normal average behaves when it is multiplied by a
constant, or when a constant is added to it. The above formulae only change this conceptualization by using
an infinite number of terms in the average. Note that a new symbol was introduced, μ. This is the
population mean of x, which is the average of x in the limit of an infinite number of measurements. We
also introduced σ^2, the population variance of x, which is the variance averaged over an infinity of
experiments.
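The ensemble idea can be sketched numerically: average d over larger and larger ensembles of coin-toss experiments and watch the average approach E[d] = 0. This is an added illustration; the toss counts and seed are arbitrary choices:

```python
import random

# d = (heads - tails) / N for one experiment of N tosses; its ensemble
# average should approach E[d] = 0 as the ensemble grows.
random.seed(7)  # arbitrary seed so the run is repeatable

def one_experiment(n_tosses=100):
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return (heads - (n_tosses - heads)) / n_tosses

for size in (10, 1_000, 20_000):
    avg_d = sum(one_experiment() for _ in range(size)) / size
    print(size, round(avg_d, 4))  # drifts toward 0 for large ensembles
```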
So, for the coin toss case discussed previously, if c is a constant:

    E[cd] = 0        (4-2)

    (4-3)

A more rigorous and general derivation of the expectation value will be given in a later section.
Expectation values and Ensemble Averages: This seems to be difficult for students to grasp, yet it
is very simple in concept. An ensemble is defined as the results of a number of experiments. If the
experiment is a single coin toss, then the ensemble is the results of some number of coin tosses. An
infinite ensemble would be the result of an infinite number of coin tosses. The Expectation Value of a
quantity is just the average value of the ensemble results. So, we do an experiment and get a result, say
x. Imagine determining the value of x in an infinite number of repetitions of the experiment. We then can
take the expectation value of x, x^2, etc., as discussed above. Why is this useful? It's useful because it
gives us a way to think about random processes. If we had the perfect experiment, with an infinite amount
of data, we would expect to get the Expectation Value of whatever variable we were measuring. However,
if we have a less than perfect experiment, the result will be more or less close to the expectation value,
depending on the variance. If we can theoretically calculate the variance, we would use the expectation
value of the variance (above). If not, we have to estimate it from the data.
Binomial distribution
The discrete probability distribution for a measurement, observation or event which has two possible
outcomes is described by the binomial distribution. Examples of such observations include a coin
toss ("heads" or "tails"), a quality control laboratory test (defective or not defective) and the drilling of an
oil well (a strike or a dry well).
Another important example where the binomial theorem applies is fluctuations in the values of a
particular class in a histogram. The binomial theorem will be derived for this example. Figure 4.1
shows the continuous probability distribution where any value is equally likely between 0 and xMax.
The uniform distribution is used here for simplicity, but any distribution could be used.
Figure 4.1. Description of variables for the derivation of the binomial theorem. The uniform distribution has p(x) = 1/xMax between 0 and xMax; x_u marks the upper edge of the class interval.
Suppose an experiment is run where X is measured N times. The probability that X is inside the class
interval is:

    P = ( x_u - x_l ) / xMax
Note: For continuous data, the probability that any particular value will occur is zero. Why do you think this
is? Think about how many real numbers there are in any finite interval. Is it intuitive that because there are
so many possibilities, that the probability of getting any one of them is very small? To fix this problem, for
continuous numbers, it is necessary to consider the probability that a value will lie between two other
values. This is expressed as p(x l<x<x u), which is the probability that x lies between xl and xu. The probability
is the area under the curve (cross-hatched in figure 4.1 above). For discrete data (like the dice toss and the binomial
distribution), x values are not continuous, and there are only a finite number of possible values, so it is possible
to evaluate the probability of getting a particular value.
Continuous distributions will be discussed later in the chapter. The important element of this discussion is
that the data point can be either a "hit" (with probability P) or a "miss" with probability 1-P. For N data,
the number of "hits" will be NP. Conversely, the number of "misses" will be N(1-P). We define:
Q = 1 − P, so
(1 − P)N = QN
We now plot the histogram of the expected results of this experiment. Figure 4.2 shows the expected
number of "misses" (0% on the x axis) and "hits" (100% on the x axis). Let's put this in terms of a dice
toss. If we are watching the number "2", then a toss with "2" showing will be a "hit" and a toss where
any other number is showing will be a "miss". P will be 1/6, and Q will be 5/6. Our experiment has N=1,
because we toss the die once, then count the result as a "hit" or "miss".
[Figure: bar chart with P(R) on the y axis and "Outcomes: % of data within the Class" on the x axis; a bar of height Q at 0% and a bar of height P at 100%.]
Figure 4.2. The acquisition of a single data point is the experiment, the probabilities of 0% inside the class and 100%
inside the class are plotted.
When the first data point is taken, there are only 2 possible outcomes, with probabilities P and Q. These
are that 0 data are in the class (0%, or 0 "hits") and 1 datum is in the class (100%, or 1 "hit"). Suppose
that we now have 2 data points in our experiment. In our dice analogy, this would correspond to
tossing the die twice, then counting "hits" or "misses". But now we have 3 possibilities: 0 "hits", 1 "hit", or
2 "hits". Figure 4.3 illustrates the probabilities. Keep in mind that when the data is "in the class", we
count it as a "hit".
Examining figure 4.3, we can see that each of the first two outcomes generates 2 more possible
outcomes. This is analogous to the toss of 2 coins. If the first toss produces a head, the second toss
can produce a head, or a tail (HH or HT). If the first toss produces a tail, the second toss can produce
a head or a tail (TH or TT). So, there are 4 possible outcomes. For the coin toss example, P = Q =
0.5.
[Figure: tree diagram. The first value is "in" (probability P) or "out" (probability Q); each branch splits again on the second value, giving four outcomes: in-in (PP), in-out (PQ), out-in (QP), and out-out (QQ).]
Figure 4.3. Expected numbers of data points within each of the 4 different outcomes that are possible with 2 data
values in the experiment.
Figure 4.4 is a histogram of the expected outcomes when 2 data points are taken. There are 3 possible
values for the probabilities. None, half, or all of the 2 data values are between Xu and Xl. So the
probabilities of the 3 outcomes (2 in, 1 in, and 0 in) are P², 2PQ, and Q².
[Figure: bar chart with P(R) on the y axis; bars of height Q² at 0%, 2PQ at 50%, and P² at 100%.]
Figure 4.4. Histogram of outcomes when 2 data points have been taken. Notice that the R value is also shown along
the x axis.
The next step in the derivation is to take 3 data values. Notice how this is reminiscent of counting all of
the possible outcomes of coin tossing. The only difference is that we are using P and Q instead of H
and T. We are also counting the possible outcomes in a slightly different way. For 3 data values, the
outcomes are, counting the same way we did with the coin tosses:
PPP = P³
PPQ = P²Q
PQP = P²Q
PQQ = PQ²
QPP = P²Q
QPQ = PQ²
QQP = PQ²
QQQ = Q³
To get the probabilities for each outcome, we add the probabilities where the outcome is the same.
Thus, we have:
P³
3P²Q
3PQ²
Q³
[Figure: bar chart with P(R) on the y axis and "Outcomes: % of data within the Class" on the x axis; bars of height Q³ at 0%, 3PQ² at 33%, 3P²Q at 67%, and P³ at 100%.]
Figure 4.5. Probabilities of outcomes when 3 data values have been taken.
The following table summarizes the results so far. The probabilities of the individual classes are the
coefficients of the equations.
Number of data points:
N=1:  (Q + P)
N=2:  (Q + P)²
N=3:  (Q + P)³
By inference, we expect that for any value of N, the probabilities are the coefficients of the expansion of
(Q + P)^N. Using the binomial expansion, the coefficients for arbitrary N are:
Q^N;  NQ^(N−1)P;  N(N−1)Q^(N−2)P²/(1·2);  N(N−1)(N−2)Q^(N−3)P³/(1·2·3); . . . . .
. . . . . . . . . . N(N−1)(N−2) . . . 2·QP^(N−1)/(1·2·3 . . . (N−1));  P^N
P(R) = [N! / (R!(N − R)!)] P^R (1 − P)^(N−R)    (4-4)
Important information: Do you know what the "!" sign means? It is the "factorial" symbol. This is just a
shorthand that allows us to write down a sequence of numbers more concisely. Its definition is: N! = N·(N−1)·(N−2)· . . . ·1. So, 4! would be 4·3·2·1. There is an interesting property that isn't obvious. That is that 0! =
1. Odd, but you will see that this definition of 0! is the most useful for formulas like equation 4-4.
This is the answer. To apply this to an example, suppose that we have a 5-sided die and will throw it
10 times. P = 1/5 = 0.2 and N = 10. The expression for a particular side coming up R times is:
P(R) = [10! / (R!(10 − R)!)] (1/5)^R (4/5)^(10−R)
Figure 4.6 shows the histogram of the probabilities of getting a particular die value R times in 10 throws.
R = 2 times shows the greatest probability, since 10(1/5) is the expected value.
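Equation 4-4 is easy to check numerically. The following Python sketch (the helper name binom_pmf is ours, not from the text) evaluates the five-sided-die example and confirms that R = 2 "hits" is the most probable outcome:

```python
from math import comb

def binom_pmf(R, N, P):
    """Probability of exactly R "hits" in N trials (equation 4-4)."""
    return comb(N, R) * P**R * (1 - P)**(N - R)

# Five-sided die thrown N = 10 times, watching one face: P = 1/5.
probs = [binom_pmf(R, 10, 1/5) for R in range(11)]
print(probs.index(max(probs)))  # most probable number of "hits": 2
```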
Next, we consider a more interesting example. Suppose that the probability of striking oil with a wildcat
well is 10%. If 10 wells are to be drilled, what is the probability that all 10 will be dry? If we assume
that the probability of finding oil at any one well is independent of finding oil at any other and that a well
is either a strike or dry, nothing in between, then we can apply the binomial probability distribution to
this problem. Here, P = 0.1, R = 0 and N = 10, so
P = [10! / (0!(10 − 0)!)] (0.1)^0 (0.9)^10
which equals 0.35 or 35%. Note that we arrive at the same result if we simply follow the rules of
probability discussed in the beginning of this chapter. Since the probability of drilling a dry well is 0.9,
the probability of drilling 10 dry wells is 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 = (0.9)^10,
which equals ~0.35.
What is the probability of drilling one successful well? Now, P = 0.1, R= 1 and N = 10, so
P = [10! / (1!(9!))] (0.9)^9 (0.1)^1
which equals ~0.39 or ~39%. To find the probability of drilling 2 successful wells, let P = 0.1, R = 2
and N = 10. In this case, P(R) = 0.19 or 19%.
What if we want to know the probability of drilling at least 1 successful well? In this case, we must add
the probability of drilling 1 successful well to the probability of drilling 2, 3, 4 or more successful wells.
Note that often a problem such as
this can be simplified by rewording the original question. For example, to find the probability of at least
1 successful well, we can find the probability of 0 successful wells. The probability of more than 0
successful wells is just 1 minus this number. As an exercise, do these calculations. The sum of the
probabilities for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 successful wells should be 1.
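The wildcat-well calculations above can be verified with a short Python sketch (the function name p_successes is ours):

```python
from math import comb

P, N = 0.1, 10  # probability that one well strikes oil; number of wells

def p_successes(R):
    """Probability of exactly R strikes out of N wells (equation 4-4)."""
    return comb(N, R) * P**R * (1 - P)**(N - R)

print(round(p_successes(0), 2))      # all 10 dry: 0.35
print(round(p_successes(1), 2))      # exactly 1 strike: 0.39
print(round(1 - p_successes(0), 2))  # at least 1 strike: 0.65
print(round(sum(p_successes(R) for R in range(N + 1)), 6))  # sums to 1.0
```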
The binomial distribution for N = 7 is given in Figure 4.7 for different values of P. Note that the
distribution is symmetric for P = 0.5 and asymmetric for P ≠ 0.5. Important properties of the binomial
distribution are its approximate mean and variance, if 0.1 ≤ P ≤ 0.9 and NP > 5:
E[x] = NP
and
E[s²] = NP(1 − P)
From the above two equations, the ratio of the standard deviation of binomial distributed data to the
mean is given by:
f = √(NP(1 − P)) / (NP) = √((1 − P) / (PN))    (4-5)
So, as N increases, the width of the distribution, expressed as a fraction of the mean, decreases.
Suppose N=10 and P=0.2. Then:
f = √((1 − 0.2) / (0.2 × 10)) = 0.632
which means that the standard deviation of the distribution is a bit more than half of the mean. But if we
let N=1000, then
f = √((1 − 0.2) / (0.2 × 1000)) = 0.0632
and the standard deviation is only 6.3% of the mean value, a factor of 10 lower. The standard deviation
relative to the mean varies as:
f ∝ 1/√N
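This 1/√N scaling is easy to verify numerically; the sketch below (with our own helper name f_ratio) reproduces the two values of f computed above:

```python
def f_ratio(N, P):
    """Standard deviation of a binomial count relative to its mean (equation 4-5)."""
    return ((1 - P) / (P * N)) ** 0.5

print(round(f_ratio(10, 0.2), 3))    # 0.632
print(round(f_ratio(1000, 0.2), 4))  # 0.0632
```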
Stirling's approximation for large factorials is:
N! ≈ N^N e^(−N) √(2πN)
This is extremely useful when large factorials will eventually cancel, but the computer's number
range is too small to evaluate them prior to cancellation.
Problem 1:
Compute and plot the distribution of the number of heads in 100 coin tosses, using the
binomial theorem. To compute the large factorials, use Excel and Stirling's
approximation.
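Problem 1 asks for Excel and Stirling's approximation; an equivalent Python sketch uses log-factorials via math.lgamma, which serves the same purpose of keeping 100! from overflowing (the helper name binom_pmf is ours):

```python
from math import lgamma, log, exp

def binom_pmf(R, N, P):
    # ln(N!) = lgamma(N + 1); working in logs keeps 100! from overflowing
    log_p = (lgamma(N + 1) - lgamma(R + 1) - lgamma(N - R + 1)
             + R * log(P) + (N - R) * log(1 - P))
    return exp(log_p)

dist = [binom_pmf(R, 100, 0.5) for R in range(101)]
print(round(dist[50], 4))   # peak of the distribution, at 50 heads: 0.0796
print(round(sum(dist), 6))  # the probabilities sum to 1.0
```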
Review
After reading this chapter, you should:
Understand the definition of an "expectation value". What is it, in statistical terms? How can you
define it in words?
Know when the binomial distribution applies and how to use it to solve problems.
Exercises
1. If you select one card at random from a deck of 52 cards, what is the probability that the card will
be
a.
b.
c.
d.
e.
2. If you select two cards at random from a deck of 52 cards, putting the first card back before you
select the second card, what is the probability of
a.
b.
c.
d.
3. If you select two cards at random from a deck of 52 cards, and do not put the first card back
before you select the second card, what is the probability of
a. the cards both being aces?
b. the cards belonging to the same suit?
c. at least one of the cards being an ace?
4. Assume that when you drill a wildcat oil well, there are only two possible outcomes - either the well
is dry or it strikes oil. Assume that the chance of success at any one well is independent of the
success at any other. Suppose that 8 wells are to be drilled and P = 0.1. Using the equation that
describes the binomial probability distribution,
a. what is the probability that they will all be dry?
b. what is the probability that 1 out of the 8 will be a success?
c. what is the probability that 2 out of the 8 will be a success?
d. what is the probability that at least 1 out of the 8 wells will be a success?
5. Given the assumptions in Example 6 above, how many wells must be drilled to guarantee a 75%
chance of at least 1 success?
6. Write an Excel sheet to calculate the binomial probability, P, given input by the user for N, R and P.
7. Modify the program from exercise #6 to calculate the cumulative binomial probability, P. That is,
calculate the probability of at least R successes in N events.
8. Make a game based on probability. It could involve cards, dice or whatever. Use
your imagination!
CHAPTER 5
m = (5×1 + 2×3 + 6×4) / 13 = 2.7
If you don't believe this formula, work it out for yourself on paper.
[Figure: histogram with "Frequency of occurrence" on the y axis and X on the x axis; 8 bars of varying height spanning the class boundaries X1 through X9.]
Figure 5.1. Plot of a histogram of a continuous distribution whose area has been approximated by 8 bars.
Notice that we computed m by grouping equal values. Similarly, we compute the mean of the
distribution of figure 5.1 by making the approximation that all data values within each class are the
same, and are at the center of the class. The number of data within the i'th class is the area of the class,
which is Fi × (Xi+1 − Xi). So, the mean will be:
m = [F1(X2 − X1)(X1 + X2)/2 + F2(X3 − X2)(X2 + X3)/2 + . . . . + F8(X9 − X8)(X8 + X9)/2] / N
Fi are the heights of the individual bars, and the Xi's are the values of X at the class boundaries. If we
define:

Xci = (Xi + Xi+1) / 2

then

m = [F1(X2 − X1)Xc1 + F2(X3 − X2)Xc2 + . . . . + F8(X9 − X8)Xc8] / N
So, the mean is computed by multiplying the height of each bar by its center value of X, summing over
all bars, and then dividing by N. Now suppose that N gets very large and the bars also become very
narrow. Defining ΔX = Xi+1 − Xi, we have:

m = lim(N→∞) (1/N) Σ(i=1..N) Fi Xci ΔX = (1/N) ∫ x F(x) dx = ∫ x p(x) dx    (5-1)
where p(x) is given by F(x)/N. Note that E[x] = m from equation 4-1.
Notation: m will be used to indicate the mean of observed data;
μ will be used to indicate the expected mean, or population mean, obtained by averaging many
repeated experiments. So, E[m] = μ.
So, in the limit of infinite data, the distribution becomes continuous and we compute the mean as shown
in the previous equation. In general, the expectation value of an arbitrary function, f(x) is
E[f(x)] = ∫(−∞ to +∞) f(x) p(x) dx    (5-2)
The above equation is the general formula for computing an expectation value of a general function of a
random variable x, which is distributed according to the probability distribution p(x). It is now possible
to derive several extremely important algebraic properties of expectation values.
Multiplication of the function f(x) by a constant results in the following:
E[af(x)] = ∫ a f(x) p(x) dx = a ∫ f(x) p(x) dx

So,

E[af(x)] = aE[f(x)]    (5-3)
Similarly, the following relationships can be proven, where x and y are random variables and a and b
are constants.
E[af(x) + bf(y)] = aE[f(x)] + bE[f(y)]    (5-4)
μ = E[x] = ∫ x p(x) dx    (5-6)
The expectation value of the data variance is much more interesting. We compute
E[(x − μ)²] = E[x² − 2μx + μ²] = E[x²] − E[2μx] + E[μ²]

Since E[x] = μ,
σ² = E[(x − μ)²] = E[x²] − μ² = ∫ (x − μ)² p(x) dx    (5-7)
σ² is the variance of the continuous distribution. This will be called the population variance in the next
chapter.
It is important to note the basic difference between the continuous and discrete distributions when
data with continuous values are considered. When data values are continuous, the only time that
discrete histograms are used is in the processing of actual data. Continuous distributions apply only
when we are considering the limit of an infinity of data. Continuous distributions are used when
performing computations to find expected values. When working with real data, the observed
values may vary considerably from expected values, as has been demonstrated by the simulations in
the previous chapters.
The gaussian (normal) distribution is:

p(x) = [1/(σ√(2π))] e^(−(x − μ)²/(2σ²))    (5-8)
Figure 5.2. Plots of the gaussian distribution for μ = 0 and σ = 1 and σ = 2. The area between X = -2 and X = +2 for the
curve with σ = 2 is the same as the area between X = -1 and X = +1 for the curve with σ = 1. These areas are filled in
for each curve.
It can be proven mathematically that the continuous gaussian distribution describes the discrete
binomial distribution when N approaches infinity. The gaussian distribution stretches from -∞ to
+∞ and is described completely by two parameters, μ and σ. Figure 5.2 shows two examples of the
gaussian distribution for different values of μ and σ.
By definition, the total area under the normal curve is one. Therefore the area under
the curve between any two x values, x1 and x2, gives us the percentage of the total number of x values
that lie between x1 and x2. This means that simply by knowing μ and σ for a gaussian distribution, we
can determine the probability of the occurrence of an x value between any given x1 and x2 values.
Figure 5.3 divides the area under the curve into percentages. As you can see from this figure, for any
gaussian curve, ~68.3% of the x values lie between μ − 1σ and μ + 1σ and ~95.5% of the values lie between
μ − 2σ and μ + 2σ.
To illustrate how we use this information, we consider the following example. Assume that we
have a list of the mid-term exam scores for a class of 1,000 Geology 4 students. The mean of the exam
is 80 and the standard deviation is 5. Assume that the exam scores follow a perfect normal distribution.
Based on Figure 5.3, we know that the percentage of students who scored between 80 and 85 is
34.13% of the class, the percentage of students who scored between 75 and 85 is 68.26% and the
percentage of students who scored above 90 was 2.27%. We can see that 15.87% of the class scored
below 75 and that 2.27% scored below 70.
[Figure: the gaussian curve divided into segments: 34.13% between μ and μ + 1σ (and the same between μ − 1σ and μ), 13.60% between 1σ and 2σ on each side, and 2.27% beyond 2σ on each side.]
Figure 5.3. Areas beneath various segments of the Gaussian curve.
We can calculate the area under this curve for any two x values by numerical integration, but in
practice, we rely on existing tables such as Table A1 in the Appendix. Take a look at this table now.
This table is organized according to z-values, which describe the normal distribution, where
Z = (x − μ) / σ    (5-9)
The table gives the percentage of the area under the curve that lies above any z-value, which we can translate back to x
values.
Let us continue with our example of Geology 4 exam grades to illustrate the use of Table A1.
Suppose we are interested in finding the percentage of students who scored above 88. First we need to
determine the appropriate z-value. Here, z = (88-80)/5 = 1.6. We look up z =1.6 in Table A1 and
find that the area under the curve above 88 is 0.0548 or 5.48%. If we wanted to find the percentage of
students who scored below 88, we would look up the percentage of students who scored above 88 (z
= 1.6) and subtract that value from 1, since the total area under the curve must add to 1. So the
percentage of students who scored lower than 88 is 94.52%. If we wanted to find the percentage of
students who scored between 80 and 88, we would find the percentage of students who scored above
80 (z = 0), 0.50, and subtract from this value the percentage of students who scored above 88 (z =
1.6), 0.0548, leaving 0.4452. It is best to draw yourself a sketch of the area of the curve in which you are
interested.
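Instead of Table A1, the tail areas can also be computed with the error function; the Python sketch below (the helper name upper_tail is ours) reproduces the Geology 4 numbers for a score of 88:

```python
from math import erf, sqrt

def upper_tail(z):
    """Area under the standard normal curve above z (what Table A1 tabulates)."""
    return 0.5 * (1 - erf(z / sqrt(2)))

z = (88 - 80) / 5  # z = 1.6 for a score of 88, with mean 80 and sd 5
print(round(upper_tail(z), 4))        # scored above 88: 0.0548
print(round(1 - upper_tail(z), 4))    # scored below 88: 0.9452
print(round(0.5 - upper_tail(z), 4))  # scored between 80 and 88: 0.4452
```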
To test your understanding, find the area under the normal curve between the z-values listed
below and see if your answers agree with the ones given. Note that P(-Z) = 1-P(Z).
Z                   area        Z                   area
0.00 and ∞         0.500       1.96 and ∞         0.025
0.55 and ∞         0.2912      -∞ and -1.27       0.1020
1.23 and ∞         0.1093      -∞ and -0.88       0.1894
2.34 and ∞         0.00996     -0.70 and 0.70     0.516
1.00 and ∞         0.1587      -1.00 and 1.00     0.6826
You should also know how to use Table A1 to find the z-value that a given percent of the curve
lies above. For example, what z-value does 2.5% of the curve lie above? The answer is +1.96. Find
the z-values that 10%, 5%, 1% and 0.25% of the curve lies above. Make sure that your answers are
+1.285, +1.645, +2.325 and +2.81.
[Figure: the standard normal curve P(Z), with the region between Z = -1.96 and Z = +1.96 shaded; this region encloses 95% of the area under the curve.]
Figure 5.4. 95% of the area under the normal curve is contained between Z values of +1.96 and -1.96. This
corresponds to ±1.96σ in the general case.
Poisson distribution
The poisson distribution describes the probability distribution of discrete events occurring
within a given finite interval or object, such as time, length, area, volume, body of water, host specimen,
etc. For example, the poisson distribution may be used to describe radioactive decay, where the
number of decay particles is counted for a specified length of time. Conditions for a process obeying a
poisson distribution are:
The probability of a single occurrence of the event is proportional to the interval size.
The probability of 2 or more events occurring within a sufficiently small interval is negligible.
Events occur in non-overlapping intervals independently. That is, the occurrence of one event does
not influence the occurrence of the other event.
The form of the poisson distribution is:
p(y ; ) =
y!
for y = 0, 1, 2, 3, etc. y is the value of the random variable (# of occurrences within the interval) and λ is
the expected number of events. For example, suppose you are observing radioactive decay and expect
10 events/second. The probability of getting y counts during any particular second is:
p(y; 10) = e^(−10) 10^y / y!
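A quick numerical check of the decay example (the helper name poisson_pmf is ours, not from the text):

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """Probability of y events when lam events are expected in the interval."""
    return lam**y * exp(-lam) / factorial(y)

# Radioactive decay with 10 expected counts per second:
print(round(poisson_pmf(10, 10), 4))  # exactly 10 counts: 0.1251
print(round(sum(poisson_pmf(y, 10) for y in range(100)), 6))  # sums to 1.0
```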
Uniform distribution
In a uniform distribution, the probability of the occurrence of any x value is the same as the
probability of the occurrence of any other x value, on the interval between Xu and Xl . Its probability
density distribution is given by:
p(x) = 1 / (Xu − Xl)
The denominator maintains the normalization, which requires that the area of p(x) is equal to 1.
Von Mises distribution
The distribution of angular data clustered about a mean direction is described by the von Mises distribution:

p(θ) = [1 / (2π I0(k))] e^(k cos(θ − θ0))

where k (which is always > 0) is called the concentration parameter (analogous to 1/σ²), I0(k) is a modified Bessel function of the first kind, and θ0 is the mean of the distribution. Its values
are tabulated in math tables texts.
Log-normal distribution
In a log-normal distribution, the logarithms of a set of values form a normal distribution. For
example, grain sizes, trace element concentrations and the sizes of oil fields all follow a log-normal
distribution. If a distribution of data values is skewed toward the low end, try taking the log of each
data value and plotting the distribution. If it looks like a normal distribution, the data most likely follow
the log-normal distribution. The form of the log-normal distribution is:
p(x) = [1 / (x σn √(2π))] e^(−(1/2)((ln x − μn)/σn)²)
where μn and σn are the mean and standard deviation of the ln(x)'s. Note that p(x) is defined only for
x > 0.
Figure 5.5. This is a plot of the gaussian distribution for μ = 5 and σ = 1.25. The tails of the distribution are at ±1.96σ,
and we are defining any data point in the tails as a rare event.
xi − 1.96σ < μ < xi + 1.96σ    (5-3)
Now, here is where we have to be extremely careful in how we think about this result. Obviously, for a
single value of xi, if we repeated the experiment, the value of xi would jump around all over the place.
What we can say is that: if the value for μ is beyond the two limits of equation 5-3, then we will get
the sampled value for xi only 5% of the time, on the average. Suppose the true
Figure 5.6. Smallest and largest values of μ that are consistent with a particular sample xi if we require that at least
95% of repeated experiments produce an xi within this range. Notice that the left figure shows the
distribution offset to the left by the maximum amount and the right hand figure shows the distribution offset to the
right by the maximum amount.
value of μ is exactly xi. You would really get xi much more often than 5% of the time. However, that is
not the point. All you have is a single value of xi and you have to draw whatever conclusion that you
can from the information you have. Note also that you don't know σ from the data, but must know its
value independently. This presents a fatal complication if all you really know about the data is a single
value. In the absence of more information than that provided by a single data point, it would be
extremely optimistic to make any conclusion at all. The next section shows how to make inferences
about the population mean and variance when more than a single data point is available.
Sampling Distributions
Sampling distributions are probability distributions of a sample statistic (e.g. m and s²)
calculated for all possible samples of size N from a given population, as discussed in the previous
paragraph. For example, if our population consists of the numbers 5, 8, 10 and 6 and we are interested
in the sampling distribution of the sample mean, m, for a sample size of 2 (N=2), our sampling
distribution includes the values 6.5, 7.5, 5.5, 9, 7, and 8, where 6.5 is the mean of 5 and 8, 7.5 is the
mean of 5 and 10, and so forth. For this same population and a sample size of 3, our sampling
distribution is 7.67, 6.33, 8 and 7, where 7.67 is the mean of 5, 8 and 10, 6.33 is the mean of 5, 8 and 6,
and so forth.
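The enumeration of sample means above can be reproduced with a few lines of Python (our own sketch, not part of the text); it also confirms that the mean of the sample means equals the population mean:

```python
from itertools import combinations

population = [5, 8, 10, 6]

# Every possible sample of size 2 (without replacement) and its mean:
means = sorted(sum(pair) / 2 for pair in combinations(population, 2))
print(means)                              # [5.5, 6.5, 7.0, 7.5, 8.0, 9.0]
print(sum(means) / len(means))            # 7.25: mean of the sample means
print(sum(population) / len(population))  # 7.25: the population mean
```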
Consider the sampling distribution for the sample mean, m for a population with a normal
distribution. The distribution of the sample means is normally distributed. The mean of the sample
means is equal to the population mean,
μm = μ    (5-4)

and the standard deviation of the sample means is related to the standard deviation of the population by

σm = σ / √N    (5-5)
[Figure: a distribution of sample means, p(x), peaked at the population mean μ, with upper and lower confidence limits marked and the area between them shaded.]
Figure 5.7. This shows a possible distribution of sample means, m. The peak is at the population mean. The total
area under the distribution is 1, so if the shaded area is 0.95, 95% of the times the experiment is repeated (for a large
number of repeats), the sample mean will lie between the upper and lower confidence limits. The standard deviation
of the distribution of sample means is expected to be the standard deviation of the population divided by the
square root of the number of data points in the sample, N.
Z = (m − μ) / (σ/√N)    (5-6)
where we have used the relationship between the variance of the sample means and the population
variance that we discussed above. We can use the standard normal distribution to determine the sample
size necessary to estimate the population mean, , to a required confidence level. The confidence
level is the percentage of times the experiment is expected to produce the result lying between some
upper and lower bound (figure 5.3).
Referring to figure 5.2 and substituting the value of σm given by equation 5-5 for σ in equation 5-3, and m for xi (because now we are working with the sample mean) we have:
m − 1.96 σ/√N < μ < m + 1.96 σ/√N    (5-7)
Remember that the above equation means that if the experiment were repeated a large number of times,
and μ were beyond either one of the extremes shown, we would arrive at the measured value of m, for
a sample size of N, 5% (or less) of the time. Notice also that we must know σ. The next chapter will
discuss how we find the limits to μ when σ is not known.
For example, suppose an experimenter needs to measure the ratio of two isotopes to within
0.06 to a 95% confidence level. That is, if this experiment were repeated many times and the 'true'
isotopic ratio were known, then this 'true' value would lie within the specified range for 95% of the
experiments. Suppose that the errors in the measurements follow a normal distribution and the standard
deviation for the technique is known to be 0.1.
How many measurements should be made? Are 5 measurements enough? From Appendix
A1, we see that the z-value for which 2.5% of the curve lies above is 1.96. Since 2.5% of the curve
also lies below z = -1.96, that leaves 95% of the curve between z = +1.96 and -1.96. We refer to the
area above the positive z-value as the upper tail and the area below the negative z-value as the lower
tail. By using the above expression for z with z = 1.96, we can find the maximum difference between m
and μ that will allow us to be in the center 95% of the curve.
Subtracting m from each side of equation 5-7, we can determine the limits of m − μ as:

|m − μ| ≤ 1.96 σ/√N    (5-8)

or, for the z-value corresponding to the desired confidence level,

Δm = z σ/√N    (5-9)
For N = 5 measurements:

Δm = (1.96)(0.1)/√5 = 0.09

which is larger than our required value of 0.06, so 5 measurements are not enough. Trying N = 15:

Δm = (1.96)(0.1)/√15 = 0.05
which is lower than our required value of 0.06 for m − μ, so we may be taking more samples than
necessary. We can determine the minimum number of samples by solving for N in equation 5-9:

N = (z σ/Δm)²    (5-10)
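Equation 5-10 can be wrapped in a small helper (the name samples_needed is ours) to answer "how many measurements?" directly for the isotope example:

```python
from math import ceil

def samples_needed(z, sigma, delta_m):
    """Smallest N with z*sigma/sqrt(N) <= delta_m (from equation 5-10)."""
    return ceil((z * sigma / delta_m) ** 2)

# Isotope-ratio example: 95% confidence (z = 1.96), sigma = 0.1, target 0.06
print(samples_needed(1.96, 0.1, 0.06))  # 11
```

So 11 measurements suffice, confirming that 15 is more than necessary.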
Statistical Inference
We introduce the concepts of statistical inference and begin our discussion of hypothesis testing
and estimation here. We will present these concepts in more detail in the next chapter when we discuss
the t, F and χ² sampling distributions.
In general, statistical inference involves either the formulation of a hypothesis about a
population and the testing of that hypothesis or the estimation of a confidence interval for a population
parameter as given by equation 5-10. Both hypothesis testing and parameter estimation are based
on a sample and a sampling distribution (figure 5.3). For example, we might state as a hypothesis that
"the mean of this population is not significantly different than 10 at the 95% confidence level" or "the
means of these two populations are not statistically different at the 95% confidence level". If we are
interested in estimation, our question might be 'between what two values can we be 95% confident that
the mean of this population lies?'. We imagine the results of an experiment repeated many times. The
confidence intervals are the upper and lower values between which a particular statistic (e.g. sample
mean) lies some percentage of the time (often 95%).
In statistics, we never prove anything! We simply state the probability that a given
hypothesis is true or that a population parameter lies within a particular interval. There is nothing
magical about the value of 95%. We could have chosen to consider the 80% confidence level or the
99% confidence level, but 95% is a common value chosen.
m = (1/N) Σ(i=1..N) xi    (5-11)
where N is the sample size (the number of individuals in the sample). The sample mean, m, is the best,
unbiased estimator of the population mean, μ. By unbiased, we mean that if we conducted the same
experiment many times (infinity, in the limit) then m will tend towards μ exactly. This result was given in
equation 4-1, where it was stated that E[x] = m.
The other property of the population that is of interest is its standard deviation. This parameter
defines how much variation from the mean the individual population values take. Again, this must be
estimated from the sample. We define the sample variance, s²:

s² = Σ(i=1..N) (xi − m)² / (N − 1) = [Σ(i=1..N) xi² − (Σ(i=1..N) xi)²/N] / (N − 1)    (5-12)
where xi is the sample value and N is the sample size. Using the above equation for the sample
variance, s², we arrive at the best, unbiased estimator (see next section) of the population variance, σ².
The standard deviation is the square root of the variance. The two equations for variance given above
are equivalent. The first is a simpler expression; the second may be more convenient under some
circumstances.
When computing s² for large N, you must be careful as you accumulate the sum of squared data values,
so that the sum of the squared numbers does not overflow the capacity of the number system used by
the computer. The first form of the equation is superior in this respect, because the mean is
subtracted before the number is squared. When this is not enough, the sums must be computed
separately for sub-blocks of numbers, then divided by N − 1, then added together.
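The two forms of equation 5-12 can be compared directly in code; this Python sketch (our own function names) implements both:

```python
def var_two_pass(data):
    """s² from squared deviations about the mean (first form of equation 5-12)."""
    N = len(data)
    m = sum(data) / N
    return sum((x - m) ** 2 for x in data) / (N - 1)

def var_one_pass(data):
    """s² from running sums of x and x² (second form of equation 5-12);
    can lose precision when the values are large compared with their spread."""
    N = len(data)
    s1 = sum(data)
    s2 = sum(x * x for x in data)
    return (s2 - s1 * s1 / N) / (N - 1)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(var_two_pass(data))  # 2.5
print(var_one_pass(data))  # 2.5
```

The two-pass form subtracts the mean before squaring, which is why it is the numerically safer choice mentioned above.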
5-14
Problem 1.
Prove that equation 5-12 does produce an unbiased estimate of the population
variance. Do this by generating random numbers in Excel using the Data Analysis
selection from the Tools menu. Generate a large number of random numbers with a
normal distribution (with μ = 0 and σ² = 1). Then, repeatedly sample and average the
computed values for s². Show that dividing by N − 1 instead of N in the formula
produces the correct variance, in the limit of many samples. Hint: start with small values
of N.
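A Python sketch of the simulation requested in Problem 1, using random.gauss in place of Excel's random-number generator; averaging many values of s² shows that the N − 1 divisor is the unbiased one:

```python
import random

random.seed(1)
N, trials = 3, 20000
avg_nm1 = avg_n = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(N)]  # population: mu = 0, var = 1
    m = sum(xs) / N
    ss = sum((x - m) ** 2 for x in xs)
    avg_nm1 += ss / (N - 1) / trials  # divide by N - 1 (equation 5-12)
    avg_n += ss / N / trials          # divide by N (biased)
print(round(avg_nm1, 1))  # ~1.0: matches the population variance
print(round(avg_n, 1))    # ~0.7: biased low by the factor (N - 1)/N = 2/3
```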
There is another way to intuitively understand this biasing effect. It is caused by the use of m (the mean
computed from the data) rather than μ (the true population mean) in the variance formula. Consider the
computation of the variance of a sample consisting of a single value. The sample mean is exactly equal
to the sample value. In the variance formula, the numerator is zero, and the denominator is 0, so we get
0/0, which is undetermined. It's a good thing, because: how can we get a variance with only one
sample? There is not enough data! If we used N in the denominator for a sample size of 1, s² would be
0, which would be wrong. Undetermined is a better answer. However, if we somehow knew the
population mean, μ, we would have some information on the variance, even from one sample. When
the population mean is known, the correct denominator is N, which is 1. Now, if we take 2 data points
for our sample, we can get a somewhat better value for s². The denominator (N − 1) is 1. The sample
mean is computed as (x1 + x2)/2 and the variance is {(x1 − m)² + (x2 − m)²}/1. It is important to note
that m is halfway between x1 and x2. No matter what the population mean is (for 2 data points), m is
always halfway between x1 and x2. The values of (x1 − m)² and (x2 − m)² will be slightly reduced, on
the average, from those which would have been obtained using μ. We speak of this as the statistics of the
variance having N − 1 degrees of freedom.
When the population mean, μ, is known, the correct formula for variance (the unbiased estimator of
the population variance) is:

s² = Σ(i=1..N) (xi − μ)² / N
When only the sample mean, m, is known, the correct formula for variance (the unbiased estimator of
the population variance) is:

s² = Σ(i=1..N) (xi − m)² / (N − 1)
The formula for a property of a sample distribution that tends toward the same value of the property
of the population (when its values for a large number of samples are averaged) is called an Unbiased
Estimator. The above formula for s² is an Unbiased Estimator.
Review
After reading this chapter, you should:
Be able to calculate the sample mean and the sample variance. Know these formulas.
Understand what is meant when we say that the sample mean and the sample variance are the
'best, unbiased estimators' of the population mean and population variance.
Be able to describe the distribution of sample means for a normal population and the nature of
the mean and variance of this distribution and the relationship between these parameters and
sample size.
Be able to determine the sample size necessary to estimate the population mean from a sample
based on the distribution of the sample means. Assume that σ is known.
Be able to state and understand the Central Limit Theorem and appreciate why this theorem is
so important in statistics.
Be familiar in a general way with the concepts of statistical inference, parameter estimation and
hypothesis testing.
Understand what an unbiased estimator is, and why it is important to use unbiased estimators.
Exercises
1.
a. 1 3 5 2 3 5 2 4 3 4
b. 4 6 6 8 8 8 9 9 9 10 10 10 10 11 11 12 13 14 15 17
2.
Identify the population, the individual and a sample for each of the following problems. Note
that there is no one correct answer for many of these.
a. a study of the minerals and rock fragments in a thin section
b. a study of the porosity of a sandstone unit
c. a study of the opinions of professional geologists in California on offshore oil issues
d. a study of the occurrence of a particular fossil in the Jurassic
e. a study of the isotopic composition of lavas from hot spots
f. a study of the average number of hours spent by UCSB students doing homework per week
g. a survey of the extent of chemical contamination on a 1 km2 industrial lot
h. a study of fault orientation in a map area
i.
3.
Suggest appropriate sampling schemes for each of the problems outlined in exercise #2 above.
Note that there is no one correct answer for these questions.
4.
5.
6.
7.
Suppose you wish to measure the concentration of chemical X with an accuracy of 0.1 g, to a
95% confidence level. The measurement process you will be using has a known variance of
0.01 g². How many measurements must you make?
8.
9.
Use the Central Limit Theorem to make a program that will compute a gaussian distributed
random number using the rand() function. This function, as specified, returns a number with a
uniform distribution. Prove that this number is gaussian distributed by computing the number of
times the random number is outside of the ±1σ bounds. Of course, you will have to make a
pretty good computation of what σ is before you can do this (Hint: use your simulation to do
this).
10.
Assume that the final exams for a Geology 4 class follow perfectly a normal distribution. The
mean of the exam was 55 and the standard deviation is 20. Using Figure 4.10 from this
chapter, determine how many students received a score
a. above 75?
b. below 35?
c. between 35 and 55?
d. above 95?
e. between 35 and 95?
Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.
11. Use Table A1 to answer the following questions pertaining to the problem described in exercise
#10. How many students received a score
a. above 70?
b. between 70 and 90?
c. between 50 and 80?
d. above 85?
e. below 50?
Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.
12. For the problem described in exercise #10, between what two scores did 95% of the class score?
CHAPTER 6
z = (m − μ) / (σ/√N)   (6-1)
If σ is unknown, the z tables cannot be used. It is logical to substitute the sample standard deviation, s,
for the population standard deviation, σ. If we take all possible (infinity in the limit) samples of size N
from a normal distribution with mean μ and variance σ² and calculate a t-statistic defined as:

t = (m − μ) / (s/√N)   (6-2)

for each sample, we will have a Student's-t distribution, which we shall call the t distribution from
now on. For large N, the t-distribution approaches the normal distribution. For practical purposes,
when N>30, the t-distribution and the normal distribution are identical. For N<30 the t-distribution
curve will be symmetric as is the normal curve; however, as N decreases the curve will flatten. This
makes intuitive sense: with a smaller sample size, m is less likely to be close to μ than for a larger
sample size and therefore there will be fewer values for small t. Curves for a normal distribution and for
a t-distribution with N=4 are shown in Figure 6.1.
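The comparison in Figure 6.1 can be sketched numerically (Python with SciPy is used here for illustration): the t curve for N = 4 (3 degrees of freedom) is lower at the peak and heavier in the tails than the standard normal curve.

```python
from scipy.stats import norm, t

# Compare the t-distribution for N = 4 (3 degrees of freedom) with the
# standard normal, as in Figure 6.1.
df = 3
print(t.pdf(0, df), norm.pdf(0))    # t is flatter at the center
print(t.sf(2.5, df), norm.sf(2.5))  # t has more area in the tail beyond 2.5
```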
Parametric Statistics
Tables of t-values
In Chapter 5, we used tables with z-values for a
standard normal distribution to find the
proportion of the area under the normal distribution curve between a given z-value and z equal to
infinity. A table for t-values is contained in Appendix A2.
We are interested in finding the t-value that a certain proportion of the area under the t-distribution
curve lies above. In some cases we want to find a t-value such that 5% of the area under the curve is in
the upper tail. In other cases, we want 5% of the area under the curve to be contained in both tails, so
we want to find a t-value such that 2.5% of the area under the curve is in the upper tail. Appendix A2
gives the t-values for which 2.5% and 5% of the area under the curve is in the upper tail. These values
are given for various degrees of freedom. For a t-distribution, the number of degrees of freedom
is N−1, where N is the sample size.
A table of t-values is given as Table A2 in the Appendix. It is important that you feel comfortable with
reading this table and that you understand the relationship between the numbers on this table and the
t-distribution curve. For example, suppose we want to find the t-value above which 5% of the area
beneath the t-distribution curve lies for a sample of size 11. In this case, we have 10 degrees of
freedom. Since we are interested in the upper tail only, we find the t-value for which we are looking,
1.812, in the row corresponding to 10 degrees of freedom and the column for 0.05. Next, suppose we
want to know between what two t-values 95% of the distribution curve lies, for 10 degrees of freedom.
In this case we want to find the t-value for which 2.5% of the area is in the upper tail and 2.5% of the area
is in the lower tail. We look in the 0.025 column for 10 degrees of freedom and read a value of 2.228
from the table. This means that 95% of the area of the curve lies between t=−2.228 and t=+2.228.
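The same two look-ups can be reproduced in Python (a sketch using SciPy's inverse survival function in place of Table A2):

```python
from scipy.stats import t

# Reproduce the two Table A2 look-ups in the text for 10 degrees of
# freedom: 5% in the upper tail, and 2.5% in the upper tail (the
# two-tailed 95% case).
df = 10
upper_5   = t.isf(0.05, df)    # should match the 0.05 column: 1.812
upper_2p5 = t.isf(0.025, df)   # should match the 0.025 column: 2.228
print(round(upper_5, 3), round(upper_2p5, 3))
```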
Often we have a t-value and want to know whether that value lies in the tail or tails of the curve. For
example, suppose we calculate a t-value of 1.900 for a sample with a sample size of 16. Is this t-value
in the 5% of the area beneath the distribution curve contained in the two tails of the curve? We look in
the column 0.025 and the row for degrees of freedom equal to 15
Version: December 20, 2001
University of California, 2001
Figure 6.2. Curves showing the t distribution. The areas in black show the 5% region of unlikely events for a two-tailed
test (a) and for a single-tailed test (b).
and find the t-value 2.131. It helps a great deal to draw a sketch of the curve and at this stage you
should always do so. From Figure 6.2a, it is clear that our t-value of 1.900 is not contained in the two
tails of the curve. What if instead we asked whether this t-value was contained in the upper tail of the
curve that contained 5% of the area under the curve? The answer this time is yes it is, as Figure 6.2b
shows.
m − t·(s/√N) < μ < m + t·(s/√N)   (6-3)
The application of this equation is demonstrated in Figure 6.3. The figure demonstrates several important
properties. Given a particular value for t, the allowable range of μ will decrease as N increases. This is
because the standard
Figure 6.3. Largest and smallest population mean, μ, that is allowed by a specified t value. Note on the above plots
that the sample mean, m, is at the same place on the plot and the probability distribution curve is shifted to the right
and to the left to reflect its most extreme allowed positions. See Figure 5.6.
deviation of the distribution of the means decreases with increasing sample size (refer to equation 6-5
and the associated discussion). Also, the larger the value of t, the wider the limits on μ. The value of t
is selected from the t distribution table according to the number of degrees of freedom (N − 1) and the
desired confidence interval.
Suppose we want to find μ for the porosity of a sandstone unit based on a sample of 11 measurements
for which m is 6.4 and s is 3.1. We won't worry about units of porosity here. We assume that the
porosity of the sandstone follows a normal distribution. Specifically, we want to find a range of values
which we can be 95% confident contains μ. This means that if the experiment were repeated many
times, 95% of the sample means, m, would be within this range. Since our sample size is 11, our
t-statistic will have 10 degrees of freedom. We are interested in the t-distribution for 10 degrees of
freedom. Since we want the 95% confidence interval and we have no reason to be interested in only
the upper tail or only the lower tail of the t-curve, we use a two-tailed test. That is, we wish to find the
positive t-value such that 2.5% of the area under the curve lies above that value and the negative t-value
such that 2.5% of the area under the curve lies below that value. Since the curve is symmetric and most
tables give values only for positive t-values, we consider the upper tail only. We look up the value for
the t-statistic in column 0.025 for 10 degrees of freedom in Table A2. This value is +2.228. This
means that when samples of size 11 are taken from a normal population, only 5% of the samples will
have extreme values for m and s such that t falls outside the relation

−2.228 ≤ t ≤ +2.228

which holds for 95% of all possible samples of size 11.
From equation 6-2, we can express this relation as

−2.228 ≤ (m − μ)/(s/√N) ≤ +2.228   (6-3)

which rearranges to

m − 2.228·(s/√N) ≤ μ ≤ m + 2.228·(s/√N)   (6-4)

Substituting our sample values,

6.4 − 2.228(3.1)/√11 ≤ μ ≤ 6.4 + 2.228(3.1)/√11
In the example above we used a two-tailed approach. In other problems, we might be interested in
estimating μ so that we are 95% confident that μ exceeds a certain value or that μ is below a specific
value. In such cases, a one-tailed approach would be appropriate.
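The two-tailed sandstone calculation above can be written out as a short Python sketch (SciPy supplies the critical t value in place of Table A2):

```python
import math
from scipy.stats import t

# The sandstone porosity interval as code: m = 6.4, s = 3.1, N = 11,
# 95% two-tailed confidence limits on the population mean (equation 6-4).
m, s, N = 6.4, 3.1, 11
t_crit = t.isf(0.025, N - 1)          # 2.228 for 10 degrees of freedom
half = t_crit * s / math.sqrt(N)
low, high = m - half, m + half
print(round(low, 2), round(high, 2))  # roughly 4.32 and 8.48
```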
Using the Student's t-test:
1. Take the sample.
2. Compute the mean, m, and the sample standard deviation, s.
3. Calculate: t = (m − μ)/(s/√N) (equation 6-2).
4.
5.
6.
Single-tailed test: Consider again the problem of determining the mean porosity of a sandstone unit
discussed above. Suppose we want to know the value, μmax, which we could be 95% confident μ
exceeded. Restated, we could say: if we repeated this experiment an infinity of times, the value we
would get for the sample mean would exceed μmax in 95% of the experiments. We look up the value for the
t-statistic in column 0.05 for 10 degrees of freedom in Table A2. This value is +1.812. This means that
when samples of size 11 are taken from a normal population, only 5% of the samples will have extreme
values for m and s such that t will be greater than 1.812. Another way of saying this is that for 95% of
all possible samples of size 11, t will be less than 1.812. We write

t ≤ 1.812

which from equation 6-2 is

(m − μ)/(s/√N) ≤ 1.812
Ho), and also the alternative hypothesis that is true if the original hypothesis is refuted (the alternative
hypothesis, or Ha).
For example, we might want to ask the question: Can we state, at a 95% confidence level, that
the mean of the population from which this sample was taken is 15? In this case, we would pose the
null hypothesis, Ho, that the mean of our population is 15. Our alternative hypothesis, Ha, is that the
mean of our population is not 15. We state this formally as

Ho: μ1 = 15   and   Ha: μ1 ≠ 15

and we set our significance level at 95%. This means that we will reject Ho if there is less than a 5%
probability of Ho being true. We calculate a t-statistic based on the expression for t given above. Since
m is 9.4, s is 3.1, and our sample size is 11, our t-value is −6.0 as calculated below.
t = (m − 15)/(s/√N) = (9.4 − 15)/(3.1/√11) = −6.0
In this case, we have 10 degrees of freedom. A two-tailed test is appropriate here since we do
not care whether the porosity of our sandstone is greater than 15 or less than 15, only that it is
significantly different from 15. Therefore, we look up the critical value for t in a t-table with 10 degrees
of freedom at the 0.025 level. One way to remember the appropriate column of the t-table to look at is
to take the desired significance level for rejection (here 0.05) and divide this number by '1' if we have a
one-tailed test or '2' if we are using a two-tailed test. In this case, the critical t-value is 2.228.
Now we have all the information we need to either accept or reject our null hypothesis at the
95% significance level. Figure 6.3 shows the critical and calculated t-values for this problem. Since our
t-value of 6.0 lies in the curve's tail, we can say that there is less than a 5% chance that the mean
porosity of the sandstone is 15. Therefore, we reject Ho at the 95% confidence level and conclude that
the mean porosity of the sandstone is not 15.
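The same test can be sketched in Python (SciPy supplies the critical value; the summary values are those shown in the worked calculation above):

```python
import math
from scipy.stats import t

# One-sample test as code: m = 9.4, s = 3.1, N = 11, and H0: mu = 15,
# tested two-tailed at the 0.05 level.
m, s, N, mu0 = 9.4, 3.1, 11, 15.0
t_stat = (m - mu0) / (s / math.sqrt(N))   # about -6.0
t_crit = t.isf(0.025, N - 1)              # 2.228 for 10 degrees of freedom
reject_H0 = abs(t_stat) > t_crit          # |t| far in the tail: reject H0
print(round(t_stat, 1), reject_H0)
```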
Figure 6.3. t distribution showing 95% significance levels and example t value of 6.0.
We could also ask a question about the sample size necessary for a given study. In the last
chapter, when we asked this question, we knew σ². Usually, we only know s² and we must use the
t-distribution. If we have no idea what the value for s is in a particular case, we might perform a pilot
study to obtain this information before going ahead with the main study.
Discussion:
In general, the critical question is whether the data are consistent with the null hypothesis at the
required significance level. What is required is the sample distribution and confidence levels. An
important issue in this process is the selection of the correct sample distribution. In this text we
emphasize the gaussian distribution, where the Central Limit Theorem assures us that the distribution of
sample means will be close to gaussian.
Siegel (1956) suggests six steps that should be carried out in hypothesis testing. These are:
1. State the null and alternative hypotheses.
The null hypothesis is usually called H0. The alternative hypothesis is called Ha here. In chapter 5,
problem 6, where the problem was to decide whether a die was weighted or not, H0 might be stated as:
there is no difference between the probabilities of each die face showing, and Ha might be stated as:
there is a difference between the probabilities of the die faces showing.
2. Choose the statistical test.
There are many statistical tests to choose from. Each test requires that the data conform to certain
assumptions, for example that the population is normally distributed. Tests vary in their power to
discriminate. That is, the limits set on the parameters to be tested may vary from very wide when a test
has little power to narrow when a test has a high power. Tests with very few assumptions often
have little power relative to tests which have many assumptions. It is important to make efficient use of
the data. For example, if there is good reason to assume that the data are normally distributed, gaussian
statistics should be used. If the data are not normally distributed, less powerful tests will be required.
The power of a test will be discussed more later.
3. Choose the size of the sample, N, and the size of a small quantity, α.
The choice of α determines the confidence limit that you require for your data. α is the probability
that the result falls outside of the confidence limits. An α of 0.05 would correspond to a confidence
level of 95%. It is expected that as N increases, the result will be more accurate. This is reflected in the
limits set by equation 5-7 for gaussian distributions. It is important to choose neither too small nor too
large a value of N. Too large a value is a waste of effort and too small means that the results will not be
reliable.
4. Evaluate or determine the frequency distribution of the test statistic.
The test statistic is the value that we are going to test. In the dice toss problem, the test statistic
was the number of 3s (or the face you were testing). In that problem, the frequency distribution of
the number of times a particular die face occurred was given by the binomial distribution. The test
statistic might be the sample mean or sample variance. If the data are gaussian distributed, the correct
distribution might be the t, Chi-squared, or F distributions (coming later). If there is not enough
fundamental knowledge about the process that produces variations in the data, it may be necessary to
prove that the data follow a particular frequency distribution. If measured values are put into an
equation (e.g. to compute an age date), the distribution of the statistic may be influenced by operations in
that equation.
5. Define the critical region or region of rejection of the null hypothesis.
This is the probability that the result of the experiment is outside of the confidence limits that were set
in step 3. For example, if the frequency distribution of the test statistic is normal and its standard
deviation is 1, an experimental result greater than 1.96 or less than −1.96 would be within the region of
rejection at the α = 0.05 level. So, if the test statistic is outside of the critical region, we can accept
the null hypothesis at the specified confidence level, but if it is inside the critical region, we can reject
the null hypothesis at the specified confidence level.
6. Make the decision.
Suppose the test statistic is within the critical region. There are two possible conclusions. They are
either that H0 is false, or that an unlikely event has occurred and H0 is actually true.
t = (m1 − m2) / ( S·√(1/N1 + 1/N2) )   (6-5)

where m1 and m2 are the sample means and N1 and N2 are the sample sizes of the first and second
samples, respectively, and where
S = √[ ((N1 − 1)s1² + (N2 − 1)s2²) / (N1 + N2 − 2) ]   (6-6)

and S is called the combined (pooled) standard deviation. Because our t-distribution was determined by sampling from a
single population, one of the requirements for performing this t-test is that the variance estimates not be
significantly different. This can be tested using the F distribution discussed in the next section.
If we take all possible samples of size N1 and N2 from the same population and calculate a t-statistic
as defined here for each sample, the result will be a t-distribution with N1+N2−2 degrees of
freedom. We illustrate how we use this t-test with an example.
Consider the following problem concerning ore deposits. Data on the concentrations of ore on
opposite sides of a fault suggest that the ores are significantly different. If so, then it is likely that the fault
was formed before the ore was emplaced. If not, then it is likely that the fault formed after ore
deposition. Based on the data provided below, is there a difference, at the 95% confidence level,
between the concentration of ore on either side of the fault? In other words, is the probability less than
5% that the sample from north of the fault and the sample from south of the fault are from identical
populations? We assume that the distribution of ore concentrations follows a normal distribution and σ²
from the sample north of the fault is not statistically different from σ² from the sample south of the fault.
The data for this problem are given below in Table 6.1.
Table 6.1
                  Mean ore concentration   Variance   Number of data in sample
North of fault    33 mg/kg                 13         10
South of fault    23 mg/kg                 12         15
Our null hypothesis, Ho, is that the mean of the ore concentration north of the fault is not
statistically different from the mean of the ore concentration south of the fault, assuming that the
variances for the two populations are the same. We state this formally as
H0: μ1 = μ2 (given σ1² = σ2²)
where the subscript '1' refers here to the ore north of the fault and the subscript '2' refers to the ore
south of the fault. If Ho is true then we can state that our two samples come from identical populations
within the confidence level of the test.
Our alternative hypothesis, Ha, is that the mean of the ore concentration north of the fault is
statistically different from the mean of the ore concentration south of the fault. We state this formally as
Ha: μ1 ≠ μ2
Next we must state our desired significance level. Remember, with statistics we never prove
anything. We can only state that our null hypothesis is supported at a stated level of confidence or
significance. For this problem, we set the required significance level at 95%. Another way of stating this
is that we will reject Ho only if a result as extreme as ours would occur in fewer than 5% of
experiments when Ho is true.
Next, we calculate the t-statistic for the comparison of two means with equations 6-5 and 6-6.

S = √[ (9·13 + 14·12) / 23 ] = 3.5

t = (33 − 23) / ( 3.5·√(1/10 + 1/15) ) = 7.0
Here, our t-statistic is 7.0 and we have 23 degrees of freedom. A two-tailed test is appropriate here
since we do not care whether the concentration of ore on one side of the fault is higher or lower than the
other, only that one side is significantly different. Therefore, we look up the critical value for t for this
problem in a t-table with 23 degrees of freedom at the 0.025 level. Here, the critical t-value is 2.069.
Figure 6.4. t distribution for 23 degrees of freedom, with 95% confidence limits marked.
Now we have all the information we need to either accept or reject our null hypothesis at the
95% significance level, as shown in Figure 6.4. Since our t-value of 7.0 lies in the curve's tail, we can
say that there is less than a 5% chance that these two samples are from identical populations.
Therefore, we reject Ho at the 95% confidence level and conclude that we can be 95% confident that
the fault was formed before the ore was emplaced.
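The two-sample calculation can be sketched in Python (SciPy supplies the critical value; the numbers are those of Table 6.1):

```python
import math
from scipy.stats import t

# The ore example as code: pooled standard deviation (equation 6-6) and the
# two-sample t-statistic (equation 6-5), with the values from Table 6.1.
m1, v1, N1 = 33.0, 13.0, 10   # north of fault: mean, variance, sample size
m2, v2, N2 = 23.0, 12.0, 15   # south of fault

S = math.sqrt(((N1 - 1) * v1 + (N2 - 1) * v2) / (N1 + N2 - 2))   # about 3.5
t_stat = (m1 - m2) / (S * math.sqrt(1 / N1 + 1 / N2))            # about 7.0
t_crit = t.isf(0.025, N1 + N2 - 2)   # 23 degrees of freedom, two-tailed
print(round(S, 1), round(t_stat, 1), t_stat > t_crit)
```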
Figure 5.4. Illustration of type I and type II errors. Curve A
corresponds to the distribution specified in H0. Curve B is another
distribution, which might also produce the experiment's outcome.
Decision:    H0 true       H0 false
accept H0    1 − α         β (type II)
accept H1    α (type I)    1 − β

Table 5.1. Matrix of possible interpretations of an experiment based on the true population value and the sampled
outcome.
Example:
Figure 5.4 shows two distributions that might produce the same experimental result. Suppose
we define H0 as μ = 15 and Ha as μ ≠ 15. If X is greater than 17 (in the region defined by α),
we reject H0. The probability that a type I error occurred (we falsely rejected H0) is given by α (0.05
for a 95% confidence level). However, suppose that X is 15, well within the accept zone of H0. The
probability that μ is really 20, but X is within the accept zone of H0, is given by the area indicated by
β, which is the probability of a type II error. The type II error occurs when we erroneously accept
H0. It is simple to compute the probability, β, given the value of α chosen for the test and the μ of the
population.
Suppose that μA = 10 and μB = 20. Suppose also that σ = 4,
so that the upper reject region begins at 10 + 1.96·4 = 17.84. The mean of Curve B is 20, so this
boundary is 20 − 17.84 = 2.16 below it, which is 2.16/4 = 0.54 standard deviations from the mean of Curve B. From
Appendix A1, the area of the normal curve beyond Z = 0.54 is 0.295. Be sure to check this in Appendix
A1 to be sure you understand where this number came from. Use the graphic at the top of the figure to
be sure what area the table produces. The result means that the probability β of a type II error in this
case is 0.295. 29.5% of the time, under the given circumstances, if the distribution of Curve B was the
real distribution, one would falsely accept H0.
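The β computation can be written as a short Python sketch (σ = 4 is read from the 1.96·4 step above):

```python
from scipy.stats import norm

# Type II error computation as code: Curve A (H0) has mean 10, Curve B has
# mean 20, sigma = 4, and the upper reject region starts at 10 + 1.96*4.
mu_A, mu_B, sigma = 10.0, 20.0, 4.0
cutoff = mu_A + 1.96 * sigma                # 17.84
beta = norm.cdf((cutoff - mu_B) / sigma)    # area of Curve B below the cutoff
print(round(cutoff, 2), round(beta, 3))     # 17.84 and about 0.295
```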
The F-distribution
The F-distribution describes the distribution of the ratio of the variances of two independent
samples from the same normal population. If we take all possible samples of size N1 from one normal
population and size N2 from a second normal population, where both populations have the same
variance σ², and calculate an F-statistic defined as

F = s1² / s2²

for each pair of samples, we will have an F-distribution, where s1² is the variance of the first sample and s2²
is the variance of the second sample. The sample with the largest variance is always put on top in the
equation. The F-distribution has N1−1 and N2−1 degrees of freedom.
An example of an F-distribution is shown below in Figure 6.6. The choice of which sample
should be sample 1 and which should be sample 2 should be made so that F > 1 in order to use most
tables of F-values. Note that there are no negative values for F.
Figure 6.6. The F distribution, which is the ratio of variances of two samples from a Gaussian distributed population,
for 4 degrees of freedom for both samples.
A table of F-values is given as Table A3 in the Appendix. As with the standard normal and t
distributions, we are interested in finding the value above which a certain proportion of the area under
the distribution curve lies. It is necessary to specify the degrees of freedom for both the numerator and
the denominator to use a table of F-values. F-values corresponding to 5% probabilities are given in
Table A3. For example, where 5% of the area under the curve lies in the upper tail for 4 degrees of
freedom in both the numerator and the denominator, the F-value is 6.39.
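This table look-up can also be reproduced in Python (a sketch using SciPy in place of Table A3):

```python
from scipy.stats import f

# Reproduce the Table A3 look-up in the text: the F value with 5% of the
# area in the upper tail, for 4 degrees of freedom in both the numerator
# and the denominator.
F_crit = f.isf(0.05, 4, 4)
print(round(F_crit, 2))   # should match the table value, 6.39
```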
F-test
An F-test based on the F-distribution can be used to test the probability that two samples were
taken from populations with equal variances. For example, consider the problem of ore concentrations
across a fault that we discussed earlier. In performing a t-test, we assumed that the variances of the two
populations represented by the two samples were not statistically different. Let us now test whether or
not this was a good assumption.
Our null hypothesis is that the two variances are equal and our alternative hypothesis is that they
are not. We state
Ho: σ1² = σ2²   and   Ha: σ1² ≠ σ2²
and this time let us set our confidence level at 95%. We look up the F-value in this case where N1=10
and N2=15 corresponding to a 5% area in the upper tail. This value is 2.65. The variance of the first
sample is 13 and the variance of the second sample is 12, so F = 1.08. Our F-value is not in the tail of
the curve, so we accept our null hypothesis. We were justified in assuming that the two samples came
from populations with the same variances at the 95% confidence level.
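The F-test for the ore example can be sketched in Python (SciPy supplies the 5% critical value for 9 and 14 degrees of freedom):

```python
from scipy.stats import f

# The ore-variance F-test as code: variances 13 (N1 = 10) and 12 (N2 = 15),
# larger variance on top, so 9 and 14 degrees of freedom.
F_stat = 13.0 / 12.0                  # about 1.08
F_crit = f.isf(0.05, 10 - 1, 15 - 1)  # about 2.65
same_variance = F_stat < F_crit       # F not in the tail: accept H0
print(round(F_stat, 2), round(F_crit, 2), same_variance)
```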
DO THIS NOW! Practice reading the F Tables in Appendix A3
Verify that you can read the table to get the F value for the following situations:
1. N1 = 5, N2 = 8. Find the F value that is in the upper 5% of the range. Assume
sample 1 has the smallest variance. (ans: F = 6.09)
2. N1 = 10, N2 = 20. Find the F value that is in the upper 5% of the range. Assume sample
2 has the smallest variance. (ans: F = 2.42)
3. N1 = 12, N2 = 11. s1² = 4.5, s2² = 1.2. Are the variances from the same population, to
a 95% significance? (ans: No, F = 3.75, the F limit = 2.85)
4. N1 = 4, N2 = 6. s1² = 4.6, s2² = 2.0. Are the variances from the same population, to a
95% confidence level? (ans: Yes, F = 2.3, F limit = 5.41)
χ²-distribution
If gaussian distributed variables are squared, they follow the χ²-distribution. For example, if Y
is a single gaussian distributed variable, then

χ² = (Y − μ)² / σ²

and for N independent gaussian distributed variables,

χ² = Σ_{i=1}^{N} (Y_i − μ)² / σ²
Parametric Statistics
6-15
[Figure: frequency distribution of χ², plotted for values 0 to 12.]
Random variables with a gaussian distribution become χ² distributed when they are squared.
The mean of a χ² distributed variable with N degrees of freedom is E[χ²] = N.
The variance of a χ² distributed variable with N degrees of freedom is var[χ²] = 2N.
Table A4 in the Appendix gives the values of χ² which define the upper tail of the curve for
various degrees of freedom. Critical χ²-values are given corresponding to various areas under the curve
in the upper tail.
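The two properties above can be checked with a small simulation sketch (Python and NumPy are used here for illustration):

```python
import numpy as np

# Simulation sketch: sum N squared standard gaussian variables many times;
# the mean of the sums should approach N and their variance should
# approach 2N.
rng = np.random.default_rng(1)
N = 5
x2 = (rng.normal(size=(100_000, N)) ** 2).sum(axis=1)
print(x2.mean(), x2.var())   # near N = 5 and 2N = 10
```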
χ²-tests
A very useful application of the χ²-test is in testing whether a sample came from a Gaussian
distribution. To do this, we form a statistic which is related to the difference between the expected
and observed number of data values within each class. The χ²-statistic for this situation is:

x² = Σ_{i=1}^{c} (O_i − E_i)² / E_i
where Oi is the observed frequency in the ith class of the distribution and Ei is the expected
frequency in the ith class according to some probability distribution. The number of degrees of
freedom is c − k − 1, where c is the number of classes and k is the number of estimated parameters (k = 2
if m and s² are used as estimates for μ and σ²). So, if an analysis used 6 histogram bars, and μ was
estimated from the data and σ was also estimated from the data, the number of degrees of freedom
would be 6 − 2 − 1 = 3.
The χ²-distribution is important because it can be used in many parametric and non-parametric tests.
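A goodness-of-fit test along these lines can be sketched in Python (the data below are synthetic, generated only for illustration; with real data, replace the rng line):

```python
import numpy as np
from scipy.stats import chi2, norm

# Goodness-of-fit sketch: test data against a normal distribution using
# 6 equal-probability classes, with mu and sigma estimated from the data,
# so the degrees of freedom are 6 - 2 - 1 = 3.
rng = np.random.default_rng(2)
data = rng.normal(50.0, 10.0, size=300)   # synthetic sample for illustration
m, s = data.mean(), data.std(ddof=1)

# Interior class boundaries at the 1/6, 2/6, ... 5/6 quantiles of the
# fitted normal, so each class has expected frequency 300/6 = 50.
inner = norm.ppf(np.arange(1, 6) / 6, loc=m, scale=s)
edges = np.concatenate(([data.min() - 1.0], inner, [data.max() + 1.0]))
observed, _ = np.histogram(data, bins=edges)
expected = np.full(6, len(data) / 6)

x2 = ((observed - expected) ** 2 / expected).sum()
crit = chi2.isf(0.05, 6 - 2 - 1)   # critical value for 3 degrees of freedom
print(x2, crit)                    # accept normality if x2 < crit
```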
Concept Review
It is important to understand the similarities in how the various distributions are used to test a hypothesis.
All of the distributions discussed in this chapter are derived from Gaussian distributed data. When the
data are transformed in specific ways (e.g. we may be interested in a squared parameter: chi-squared, or
a ratio of variances: F test), a certain distribution results. This is the distribution of a gaussian distributed
variable that has been squared or ratioed, or had some other operation performed on it.
For example, the Normal distribution results if we transform the Gaussian distributed data according to:

Z = (x_i − μ) / σ

The t distribution results if we transform the Gaussian distributed data according to:

t = (x_i − m) / s
The t test is used for putting confidence limits on the distribution of sample means. It is important that the
sample means follow a normal distribution. Use the χ² test to prove it.
The χ² distribution results if we square Gaussian distributed variables. Use the χ² test to test the
confidence with which a distribution is normal (p 6-16).
The F distribution results if we compute the sample variances and divide the largest variance by the
smallest variance. It is used to test the confidence with which the sample variances of two samples are
from populations with equal population variances (p 6-14).
F = s1² / s2²
Of course, you should remember that the distribution comes from visualizing the repeating of the
experiment many times and plotting the histogram that is the average of all of the histograms, in the limit
where the class interval gets very small.
Reading each of the tables is similar. You figure out the degrees of freedom and the confidence limits,
read the value, then see if the computed sample statistic is within the accept or reject range.
Encouragement
While statistical thinking represents a radical departure from the way you normally think, it is
really not so hard if you concentrate on a few facts. When making statistical inferences, it is helpful to
remember the sampling paradigm discussed earlier. There exists a population of values and you have
taken a sample from that population. The test statistic follows some kind of distribution, based on the
population statistics (for Gaussian population distributions the mean and variance are enough). Once
you know that distribution, the confidence limits follow immediately by considering the area underneath
the distribution curve. After that, it is a simple matter to test whether your sample value falls within those
limits.
Review
After reading this chapter, you should:
Be able to read tables of t-values, F-values and χ²-values and understand the relationship between
these values and the t, F and χ²-distribution curves.
Be able to use the t-test to:
determine whether or not the mean of a population is different from (or higher or lower than) a
specified value; and
to compare two samples to test if they are from identical populations to a certain level of
confidence.
Be able to perform an F-test to determine whether two samples come from populations with equal
variances.
Understand Type I and Type II errors, and the power of a test, and be able to calculate the
probability of each.
Exercises
State the null hypothesis and alternative hypothesis for all problems where you are asked to perform a
statistical test involving hypothesis testing.
1.
2.
3a.
For the purpose of using a t-test to estimate the population mean from a sample, how many
degrees of freedom are there?
3b.
For the purpose of using a t-test to compare two sample means, how many degrees of freedom
are there?
4.
5.
What does it mean to say "I am 95% confident that the population mean lies between 140 and
150"?
6.
7.
In using a t-distribution to estimate the population mean from a sample, does the size of the
range of values specified for the population mean increase or decrease with:
a.
b.
c.
8.
17, 15, 25, 21, 19, 23
9.
The recommended safe limit for chemical Y in drinking water is 10 mg/l. Water samples are
taken once a month to monitor this chemical. The data for the first 6 months of testing are given
below. Can we be 95% confident that the concentration of Y is less than 10 mg/l?
18, 25, 26, 11, 14, 23
10.
After reviewing some measurements made in the lab, the lab supervisor notices a seemingly
systematic bias in the data. The supervisor suspects that the two lab assistants who made
the measurements are using slightly different measurement techniques and that this is the root of
the problem. One day, both assistants are given the same 10 materials to measure. Based on
the following data, can we be 95% confident that the techniques of the two assistants are different?
Assistant A    Assistant B
52             58
57             70
65             57
59             65
68             60
11.
For 10 degrees of freedom in the numerator and 10 degrees of freedom in the denominator,
what is the F-value above which 5% of the area beneath the F-distribution curve lies?
12.
The variances of errors in measurements made by two different labs are given below. Are these
differences in variances statistically significant at the 95% confidence level?

         sample size    standard deviation
Lab A    11             66
Lab B    21             40

13.
This very important problem demonstrates the use of the chi-squared (χ²) distribution to test
whether a sample could have come from a Gaussian distributed population. 20 random data
points are taken. m = 2.995, s = 1.049. The data were plotted on a histogram consisting of 10
equal classes beginning at a value of 0 and ending at a value of 6. The number of data within the
classes is: 0, 1, 1, 4, 4, 5, 2, 2, 1, 0.
a) Assuming that the data are sampled from a Gaussian distribution, compute the expected
number of data in each class. Approximate μ and σ with m and s.
b) Compute the χ² statistic for these data.
c) Within a 95% confidence level, could you reject the null hypothesis that the data are sampled
from a Gaussian distribution with μ = 2.995 and σ = 1.049?
Chapter 7: Propagation of Errors

Suppose a total length is measured by adding together N individual length measurements, each with its own error:

L + ε = Σ_{i=1}^N (l_i + ε_i)

The total length, L, is the sum of all the individual lengths, l_i. The error in the i-th length is given by ε_i.
This results in an overall length error of ε. The individual errors would be expected to both add and
cancel randomly, so it would be incorrect to simply add the errors. Since the total length will be a
random variable, we compute its "expected" value. Since the mean value of the individual errors is
zero, we have:
E[L + ε] = E[ Σ_{i=1}^N (l_i + ε_i) ] = Σ_{i=1}^N ( E[l_i] + E[ε_i] ) = Σ_{i=1}^N (l_i + 0) = L
So, the expectation value of L is just L, which is equal to the sum of all of the individual distances, which
agrees with our intuition. This only tells us that for repeated experiments, the errors average to zero.
Propagation of Errors
7-1
But, for an individual experiment, we need the standard deviation of the error. We get this by
computing the expectation value of the variance using equation 7-9. We have:

σ_L² = E[(L_e − L)²] = E[( Σ_i (l_i + ε_i) − Σ_i l_i )²] = E[( Σ_i ε_i )²]

L is the error-free length and L_e is the length measured as a result of a single experiment. Notice that the
term on the right is the square of a sum of terms. Multiplying out some of the terms, this will look like:

σ_L² = E[ Σ_{i=1}^N Σ_{k=1}^N ε_i ε_k ] = Σ_i Σ_k E[ε_i ε_k]
There are terms that are sums over i ≠ k. If N = 2 we can multiply the terms by hand, resulting in
E[(ε_1 + ε_2)(ε_1 + ε_2)] = E[ε_1² + 2ε_1ε_2 + ε_2²]. The expectation of all ε_1ε_2 terms will be zero, since
ε_1 and ε_2 are independent and will average to zero. The ε_1² and ε_2² terms will not, since they are
squared and will never have negative numbers to cancel with the positive ones. So, we have:

σ_L² = Σ_i E[ε_i²]
In the general case, when i = k cancellation will not occur and E[ε_i ε_k] ≠ 0, but when i ≠ k,
E[ε_i ε_k] = 0. The expectation E[ε_i²] is the variance of the population (the population of errors of each of the
individual length measurements), which we will call σ_l². The final answer is:

σ_L² = E[( Σ_i (l_i + ε_i) − L )²] = Σ_i E[ε_i²] = N σ_l²
So, the variance of the total length is given by the sum of the variances of each of the individual length
measurements. If the variances of each of the terms in the sum are different, the individual variances are
summed to get the final answer as shown in equation 7-1 below.
σ_L² = Σ_{i=1}^N σ_i²     (7-1)
Interestingly, the above formula also applies to the case when measurements are subtracted. This is
because the minus sign is eliminated by the variance computation, which squares the error.
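Equation 7-1 can be checked by simulation. This is a minimal Python sketch (the text's own simulations use HyperCard button scripts); the values of N, sigma and the trial count are arbitrary choices.

```python
import random
import statistics

random.seed(1)
N = 10          # individual length measurements per experiment
sigma = 2.0     # standard deviation of each individual error
trials = 20000  # number of repeated experiments

total_errors = []
for _ in range(trials):
    # each experiment sums N Gaussian errors, one per length measurement
    total_errors.append(sum(random.gauss(0.0, sigma) for _ in range(N)))

var_total = statistics.pvariance(total_errors)
# equation 7-1 predicts var_total = N * sigma^2 = 40
```

The mean of the total error comes out near zero, while its variance comes out near N·σ², as equation 7-1 predicts.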
Problem 1:
Prove equation 7-1 for the case when 3 lengths are added to get the total length. Let
each of the individual length measurements have random errors with standard deviation
of σ_e.
Problem 2:
Suppose that measurement A has a Gaussian distributed error with variance σ_a² and
measurement B has a Gaussian distributed error with variance σ_b². Prove that the
variance of the difference, A − B, is given by σ_a² + σ_b².
Consider the density computed from measurements of mass and volume:

ρ = M / V

If the mass and volume each have errors, how will these combine to produce the error for the density?
To approximate the effect of a small change in M and V, we use the chain rule of differentiation,
which says:
df(x, y) = (∂f/∂x) dx + (∂f/∂y) dy
The above formula gives us a relationship that can be used to compute a small change in the function
f(x, y) caused by small changes in either x or y. It applies exactly only to infinitesimally small changes in
x and y. Here we don't need an exact result, so we can extend it to larger changes (we say the result is
accurate to first order). We indicate that the changes are finite by using the notation Δx and Δy
instead of dx and dy. So, the chain rule takes the form:
Δf(x, y) = (∂f/∂x) Δx + (∂f/∂y) Δy + small error
This equation is the first order term of the Taylor expansion for a function of two variables. The small
error will become important when bias is treated. For the density formula, the change in density due
to a small change in mass and volume is given by:

Δρ(M, V) = (∂ρ/∂M) ΔM + (∂ρ/∂V) ΔV
and since:

∂ρ/∂M = 1/V ;   ∂ρ/∂V = −M/V²

Then:

Δρ = (1/V) ΔM − (M/V²) ΔV
Expressing the above equation as a fraction of the total density (note that we are dropping the
"small error" term, so we must remember that the equations are only accurate to first order):

Δρ/ρ = ΔM/M − ΔV/V

Taking the expectation of the square of each side (the cross term averages to zero for independent errors):

E[(Δρ/ρ)²] = E[(ΔM/M)²] + E[(ΔV/V)²]
Note that once the chain rule is used, the results follow those derived for sums and differences of
random variables. If we define c as the ratio of the standard deviation of the parameter to the value of
the parameter, according to the above equation, we have:
c_ρ² = c_M² + c_V²

where

c_ρ² = σ_ρ²/ρ² ;   c_V² = σ_V²/V² ;   c_M² = σ_M²/M²
We can then write a general law of propagation of errors, which states that if:

f(x, y, z, ..., p, q, r, ...) = (x · y · z · ...) / (p · q · r · ...)

then the total error, expressed in terms of the fractional variation defined above, is:

c_f² = c_x² + c_y² + c_z² + ... + c_p² + c_q² + c_r² + ...     (7-2)
So, equation 7-1 expresses the total variance of the result when data are summed and equation 7-2
above expresses the total variance of the results when data are multiplied and divided.
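Equation 7-2 can likewise be checked by simulation. Below is a minimal Python sketch for a quotient f = x/y (the text's simulations use HyperCard button scripts); the true values and error sizes are arbitrary choices, kept small because the law is only a first-order result.

```python
import random
import statistics

random.seed(2)
x0, y0 = 50.0, 20.0     # true values (arbitrary)
sx, sy = 0.5, 0.4       # absolute errors, chosen so fractional errors are small
trials = 20000

quotients = []
for _ in range(trials):
    x = random.gauss(x0, sx)
    y = random.gauss(y0, sy)
    quotients.append(x / y)   # f = x / y, a case covered by equation 7-2

f0 = x0 / y0
c_f2 = statistics.pvariance(quotients) / f0**2    # measured fractional variance
c_pred = (sx / x0)**2 + (sy / y0)**2              # equation 7-2 prediction
```

The measured fractional variance of f agrees with the sum of the fractional variances of x and y to within sampling error.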
Version December 20, 2001
University of California, 2001
Induced Bias
Mathematical operations on noisy data can affect the result in unexpected ways. A simple case
occurs when noisy data values are squared. The randomness, which previously averaged to zero
because of cancellation of positive and negative values, will no longer average to zero because all of the
squared numbers have a positive sign. There will be a non-zero average, or bias, added by this effect.
For example, suppose data follow the form of equation 7-7, where Y = y + a·Y_noise. Y_noise is a
Gaussian distributed random quantity with mean μ_noise = 0 and standard deviation σ_noise. Suppose Y is
squared. We have:

Y² = y² + a²Y_noise² + 2 y a Y_noise

Now taking the expectation of each side of the above equation and using equation 7-9, we have:

E[Y²] = E[y²] + E[a²Y_noise²] + E[2 y a Y_noise]
      = y² + a²(σ_noise² + μ_noise²) + 0     (7-3)
So, when Y is squared, its average value (which is the expectation) is biased by the variance of the
noise. If the mean of the noise is zero, as it has been defined here, then μ_noise = 0. So, if Gaussian
distributed data will be used in a formula which squares the values, it is much better to find the average
of the values in the sample prior to squaring, as opposed to squaring each sample value and then
taking the average.
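A quick simulation illustrates the squaring bias. This is a Python sketch with arbitrary values y = 5 and a = 2, writing the noise directly as a Gaussian of standard deviation a (so σ_noise = 1 in the notation above); equation 7-3 then predicts E[Y²] = y² + a² = 29.

```python
import random
import statistics

random.seed(3)
y = 5.0        # exact value (arbitrary)
a = 2.0        # standard deviation of the added Gaussian noise
trials = 50000

# square each noisy value, then average
mean_sq = statistics.mean((y + random.gauss(0.0, a))**2 for _ in range(trials))
# equation 7-3 predicts E[Y^2] = y^2 + a^2 = 29, not y^2 = 25
```

The average of the squared values lands near 29, not 25: the variance of the noise has leaked into the answer as bias.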
Figure 7.1. Plot of the result of squaring noisy data (equation 7-1). The dotted line shows how the expected value of
Y without noise is increased by the bias, which is a²σ², where σ is the standard deviation of the noise. In this case,
aσ = 4. This would lead the experimenter to estimate too high a value for the quantity represented by Y.
It is very common to put noisy data values into a formula, so it is important to understand the
effect that the formula will have on the answer. Will the noise bias the answer? Is the distribution of the
answer Gaussian if the data are Gaussian? It is important to answer these questions if we are to apply
statistical tests based on the assumption that errors are Gaussian distributed. Are the statistical tests
applied to the data first, or should they be applied to the answer? This section will give guidance on this
question and follow with an example in age dating.
Assume that the data, x, will be entered into a general formula, given by:

Y = f(x)     (7-4)

Y is the value computed from the data. Generally, x will also have a variation due to noise. The
experimenter would hope that this variation would be small relative to the data value (high signal-to-noise
ratio). This variation can be expressed as:

Y = f(x + ε)     (7-5)
We are interested in the case where ε/x is small (relatively high signal to noise), so we use a Taylor
expansion of f(x), which is given by:

f(x + ε) = f(x) + ε (∂f/∂x) + (ε²/2!) (∂²f/∂x²) + ... + (ε^(n−1)/(n−1)!) (∂^(n−1)f/∂x^(n−1)) + error     (7-6)
The Taylor series expansion for several functional forms is given below. The expansion is carried only
to second order. This is good enough to show the effect of bias. If the bias in a result is large, one
should also look at the higher order terms or take a different approach to the noise analysis.
If the function has an exponential dependence,

Y = f(x) = A e^(nx) + b

∂f/∂x = A n e^(nx) ;   ∂²f/∂x² = A n² e^(nx)

So:

Y = f(x) + (A n e^(nx)) ε + (A n² e^(nx)/2) ε²     (7-7)
Here, x is the value of the data and ε is the random variation or noise in the data. The (A n e^(nx)) ε term
is the first order randomness in Y (the result of the calculation) which is caused by randomness in x (the
data). The last term is also random and causes the bias in Y, since it will not average to zero. To get
the expected bias in Y, we take the expectation value of Y:

E[Y] = E[f(x)] + E[A n e^(nx) ε] + E[(A n² e^(nx)/2) ε²]
     = f(x) + A n e^(nx) E[ε] + (A n² e^(nx)/2) E[ε²]
The following derivations all assume that the random variable being input to the equation is
Gaussian distributed.
Now, E[ε] = 0, since the average of the noise is taken to be zero, and E[ε²] = σ_noise². Remember that
we are evaluating the noise effect at a particular value of x, so f(x) is unvarying in the above derivation,
and E[f(x)] = f(x). So, the result is:

E[Y] = f(x) + (A n² e^(nx)/2) σ_noise²     (7-8)
(7-8)
The second term is the bias effect, which gets larger as the square of n. The ratio of the bias to the
actual value is given by:
R=
bias
f ( x)
2
An 2 e nx noise
2
= n 2 noise
Ae nx
(7-9)
So, the bias (relative to the signal) gets larger as n and σ increase.
Practice: Using the above techniques, prove that the expansion to second order and the bias ratio, R,
are correct for the following useful functional forms:
1. Linear:

f(x) = m x + b
Y = f(x + ε) = m x + b + m ε     (7-10)
E[Y] = m x + b + 0
R = 0
2. Variable in denominator:

f(x) = A / x
Y = f(x + ε) = A/x − (A/x²) ε + (A/x³) ε² − ...     (7-11)
E[Y] ≈ A/x + (A/x³) σ²
R ≈ σ² / x²
3. Power:

f(x) = A x^n
Y = f(x + ε) = A x^n + A n x^(n−1) ε + (A n(n−1) x^(n−2)/2) ε² + ...     (7-12)
E[Y] ≈ A x^n + (A n(n−1) x^(n−2)/2) σ²
R ≈ n(n−1) σ² / (2 x²)
4. Logarithmic:

f(x) = A ln(x)
Y = f(x + ε) = A ln(x) + (A/x) ε − (A/(2x²)) ε² + ...     (7-13)
E[Y] ≈ A ln(x) − (A/(2x²)) σ²
R ≈ −σ² / (2 x² ln(x))

5. Exponential:

f(x) = A e^(bx + c)
Y = f(x + ε) = A e^(b(x + ε) + c) = A e^(bx + c) (1 + bε + (bε)²/2! + ...)     (7-14)
E[Y] ≈ A e^(bx + c) (1 + b²σ²/2)
R ≈ b²σ²/2

Problem 3:
Write and implement a button script that shows that equation 7-13 is true by
repeatedly adding random values to x and computing the running average of the value of
f(x), as was done in chapter 5 for coin tossing. Show that the value computed from the
equation for R agrees with the value found from the simulation.
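For readers without the HyperCard stack, the same kind of simulation can be sketched in Python for the logarithmic case of equation 7-13; the constants A, x and σ below are arbitrary choices.

```python
import math
import random

random.seed(4)
A, x, sigma = 2.0, 3.0, 0.3   # arbitrary constants for f(x) = A ln(x)
trials = 200000

total = 0.0
for _ in range(trials):
    eps = random.gauss(0.0, sigma)
    total += A * math.log(x + eps)     # noisy evaluation of f

mean_Y = total / trials                # running average, as in the button script
exact = A * math.log(x)
bias_pred = -A * sigma**2 / (2 * x**2) # second-order bias from equation 7-13
```

The simulated average falls below A ln(x) by very nearly the predicted amount, confirming the negative bias of the logarithm.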
Age dating based on radioactive decay relies on the fact that radioactive elements decay at a
known rate depending on time. In general, we can represent the concentration of the radiogenic
element by:
A = A₀ e^(−λt)     (7-14)

where A₀ is the original concentration of the parent element at time t = 0 and λ is the decay constant.
The time at which A is equal to half of the original concentration is called the half-life and is equal to:

A₀/2 = A₀ e^(−λ t_(1/2))

or

T_(1/2) = (log_e 2)/λ = 0.693/λ
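As a quick check of the half-life formula, the decay constant of 238U quoted later in this chapter (λ = 1.55 × 10⁻¹⁰ yr⁻¹) gives the familiar half-life of about 4.5 billion years:

```python
import math

lam_238 = 1.55e-10                 # decay constant of 238U, per year (from the text)
t_half = math.log(2) / lam_238     # T_1/2 = ln(2) / lambda, about 4.47e9 years
```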
If the parent element decays to the daughter element, after a time t the concentration of daughter
atoms will be:

D = A₀ − A = A₀ − A₀ e^(−λt) = A₀ (1 − e^(−λt))     (7-15)
If we take the ratio of D/N and solve for t, the result is:
t =
log e 1 +
D
N
So, if it is known that the daughter atoms were the result only of radioactive decay of the parent
atom, the age can be computed. But, it is often the case that there is an initial concentration of the
daughter element. When more than one age dating method is used, the results (if they agree) are said
to be concordant.
For this case study, we treat the 207Pb/206Pb isotope system. 238U decays to 206Pb and 235U
decays to 207Pb. The decay equations (from equation 7-15) are:

[206Pb]_now = [238U]_now (e^(λ238 t) − 1)

and

[207Pb]_now = [235U]_now (e^(λ235 t) − 1)
[207Pb]_now / [206Pb]_now = ([235U]_now / [238U]_now) · (e^(λ235 t) − 1)/(e^(λ238 t) − 1)
                          = (1/137.88) · (e^(λ235 t) − 1)/(e^(λ238 t) − 1)     (7-16)
[207Pb]/[206Pb] is the measured present-day lead isotope ratio, and the present-day [235U]/[238U]
ratio is 1/137.88, assumed to be a constant that does not depend on the age and history of the
sample. So, it is possible to compute an age from a single analysis. The best mineral for use of this
system is zircon, because it retains uranium and its decay products, crystallizes with almost no lead, and
is widely distributed.
Equation 7-16 cannot be solved explicitly for age (t). The Simulations stack included with
this book provides a button whose script solves this equation numerically. An important question to
ask is: how sensitive is the age determination to errors in the various constants in the
equation? Currently, the best available measurement accuracy in the 207Pb/206Pb ratios causes about
one-fifth as much uncertainty in age as the uncertainties in the decay constants do. The decay constants of
the uranium-to-lead systems are 9.85 × 10⁻¹⁰ yr⁻¹ ± 0.10% for 235U and 1.55 × 10⁻¹⁰ yr⁻¹ ± 0.08% for
238U and have been defined by international convention. The 235U/238U ratio is also uncertain to
about ±0.15%.
The measurement of the 207 Pb/206 Pb ratios requires complex instrumentation and precise
analytical techniques. This ratio can be measured to accuracies as great as 0.1% to 0.03%. Another
important source of error is the correction for common lead, which is lead that is present in the sample
from sources other than decay of the parent isotopes of uranium. The sources of common lead are
original lead in the sample as it crystallized, lead introduced by exchange with external sources, and lead
added during handling prior to analysis. A complete analysis of common-lead errors is beyond the
scope of this text.
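Since equation 7-16 cannot be inverted analytically, a numerical root-finder is needed. The book's Simulations stack does this with a HyperCard button script; the sketch below is a Python stand-in using bisection (the ratio grows monotonically with age, so a simple bracket-and-halve works).

```python
import math

LAM_235 = 9.85e-10   # decay constant of 235U, per year (from the text)
LAM_238 = 1.55e-10   # decay constant of 238U, per year (from the text)

def pb_ratio(t):
    """Present-day 207Pb/206Pb ratio predicted by equation 7-16 for age t in years."""
    return math.expm1(LAM_235 * t) / (137.88 * math.expm1(LAM_238 * t))

def age_from_ratio(r, lo=1.0, hi=5.0e9, steps=200):
    """Invert equation 7-16 by bisection; pb_ratio increases monotonically with t."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if pb_ratio(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Problems 4 through 6 can then be attacked by perturbing the measured ratio or the constants by the stated percentages and re-solving for the age.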
Problem 4:
Determine the error of the age determination of a zircon when the 207Pb/206Pb ratio
changes by 0.1% for ages of 100 Ma, 200 Ma, 300 Ma, and 500 Ma (1 Ma = 1 × 10⁶
years; at 100 Ma and a ratio error of 1%, the age error is 23.7%). Draw a graph of the
age error vs. age of the sample.
Problem 5:
Make graphs of the error in the age determination vs age of the sample caused by the
errors in each of the two decay constants.
Problem 6:
Determine the error of the age determination due to the error in the 235 U/238 U ratio.
Problem 7:
Study the problem of bias in the result due to random errors in the
measured 207Pb/206Pb ratio. Because the age equation cannot be solved analytically,
this simulation will need to be implemented using the computer. Repeatedly compute
the age, each time with a random error {e xRandom ("g", -1) } to the 235U/238U
ratio. Keep a running average of the age determination and put the current average
value into a card field, as was done in Chapter 5. Bias may show up as a higher age, on
the average, than the actual age. See if you can think of any way to determine bias
without doing the repeated sampling simulation.
Let's look at the expansion of the equation ρ = M/V to higher orders. The density, ρ, has a small
change, Δρ, caused by small changes in M and V. We can write this below as:

ρ + Δρ = (M + ΔM) / (V + ΔV)

Rewrite this as:

ρ + Δρ = ((M + ΔM)/V) · 1/(1 + ΔV/V)

and expand the last factor as a geometric series:

1/(1 + ΔV/V) = 1 − ΔV/V + (ΔV/V)² − (ΔV/V)³ + ...
So, after subtracting out the ρ on the left and the M/V on the right, we can rearrange the density
equation as:

Δρ = ΔM/V − ((M + ΔM)/V) [ ΔV/V − (ΔV/V)² + (ΔV/V)³ − ... ]
Multiply out so we can better see the small terms multiplied together:

Δρ = ΔM/V − M ΔV/V² − ΔM ΔV/V² + M (ΔV)²/V³ + ΔM (ΔV)²/V³ − M (ΔV)³/V⁴ + ...
The first and second terms have only one Δ variable; this is why they are called "first order". The
third and fourth terms have two Δ variables multiplied together and are called "second order". Terms 5 and 6
have three Δ variables multiplied together and are called "third order", etc. First order terms are linear in the Gaussian
distributed random errors for mass and volume (ΔM and ΔV), so their distributions will be Gaussian.
However, second order terms are squared. The (ΔV)² term is χ² distributed, since the χ² distribution is
the one that describes the distribution of squared Gaussian variables (Ch 9).
But, what is the distribution of the ΔM ΔV term? We know that (ΔV)² will always be positive.
However, in this case it is possible that ΔM will be positive while ΔV is negative. So, right off, we know
that it will not have the same distribution as (ΔV)². If ΔM and ΔV are completely independent
of each other, the product will average to zero and the contribution of this term to the standard deviation
of Δρ will be the product of the standard deviations of the volume and mass errors. The concept of
"independence" will be discussed further in a later chapter. It is sufficient to say, for now, that when two
random variables are independent of one another, the standard deviation of their product is just the
product of the standard deviations of each of the individual random variables. The third order terms
begin to get even more complicated. Term 5 is manageable, because it has (ΔV)² and ΔM: the (ΔV)² portion will
be χ² distributed, as before, and ΔM will be Gaussian distributed, so we will have the product of a χ²
distributed variable and a Gaussian distributed variable. The term with (ΔV)³ will have another unique
distribution. This is best modeled on the computer using a simulation. It is rarely necessary to go
beyond second order.
It is the second order terms in the error expansion that produce "bias" in the result. This bias
cannot be eliminated by increasing the sample size. It exists even for an infinite number of data. This
can be easily seen by taking the expectation of Δρ, as follows:

E[Δρ] = E[ ΔM/V − M ΔV/V² − ΔM ΔV/V² + M (ΔV)²/V³ + ... ]
Simplifying,

E[Δρ] = (1/V) E[ΔM] − (M/V²) E[ΔV] − (1/V²) E[ΔM ΔV] + (M/V³) E[(ΔV)²] + ...
Since the averages of ΔM, ΔV and ΔM ΔV will be zero over many repetitions of the experiment, the
only term that will remain is the (ΔV)² term, which is the cause of the bias. So, the second order bias in
ρ is given by:

E[Δρ] = (M/V³) E[(ΔV)²] = M σ_V² / V³

where σ_V is the standard deviation of the volume measurement. It is interesting that the bias is
controlled by errors in the volume alone. The mass is in the numerator of the equation, so its effect is
linear and will average to zero at all orders.
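The second-order bias formula can be checked by simulation. This Python sketch uses the M and V of problem 8 but a smaller fractional volume error (σ_V/V = 0.1, an arbitrary choice) so that the second-order expansion remains a good approximation.

```python
import random
import statistics

random.seed(5)
M0, V0 = 2.0, 10.0    # mass (kg) and volume (m^3), as in problem 8
sV = 1.0              # volume error standard deviation (sigma_V / V = 0.1 here)
trials = 200000

# compute a density from each noisy volume measurement, then average
densities = [M0 / (V0 + random.gauss(0.0, sV)) for _ in range(trials)]

bias = statistics.mean(densities) - M0 / V0
bias_pred = M0 * sV**2 / V0**3        # second-order result: M * sigma_V^2 / V^3
```

The simulated average density sits above the true value M/V by very nearly Mσ_V²/V³, and the bias is positive even though the volume errors themselves average to zero.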
Problem 8.
Suppose σ_V/V = 0.5 and σ_M/M = 0.5. Find the distribution(s) of the relative error in
density, Δρ/ρ, up to second order. Find the expected mean value of ρ and its standard
deviation if V = 10 m³ and M = 2 kg.
Problem 9:
Create a button that simulates problem 8. Show that the values for standard deviation
of each of the "orders" of the error expansion that result from your simulation agree with
the values you expect from problem 8.
Suppose that N repeated measurements of the mass and the volume are each averaged before being
put into the density formula. The individual errors are drawn from the
population distribution of the volume and mass measurements. After the averaging, the standard
deviation of the errors is reduced by a factor of √N. When this is put in the M/V formula, the bias caused by the
second order error terms is reduced by 1/N. The first order error terms are reduced by 1/√N, as
expected. The other option for performing the analysis is to compute the value of ρ for each of the M
and V values. Then, after all of the values of ρ are computed, take the average of the ρ's. This has the
extreme disadvantage of increasing the size of the second order error terms, which cause bias in the final
result.
Problem 10:
Write a button script to simulate the effect described in the above paragraph and show
quantitatively that the results of the simulation agree with the results of the above
analysis. Generate N random values of the mass and volume using the xRandom
function. Use the parameters of problem 8, where V/V=0.5 and M/M=0.5. Then
compare the results (std deviation and bias of the answer) when the mass and volume
values are averaged first to the results when the densities are computed first and
averaged to get a final density.
where y is the exact value and Y_noise is the random error. Here we will consider Y_noise to have an average
of zero and a standard deviation of 1. The constant a is the standard deviation of the added noise. If
the average value of the noise were not zero, we would say it was biased. As discussed in Chapter 6,
two important cases are 1) when the amplitude of the noise is a constant and 2) when the amplitude of
the noise is proportional to the signal, y. Below are the two forms:

Noise constant:

Y = y + a Y_noise     (7-7)

Noise proportional to y:

Y = y + a Y_noise y = y (1 + a Y_noise)     (7-8)
The distinction between these two cases is shown in the two log-log plots of figure 7.5. Data are
generated according to each of equations 7-7 and 7-8.

Figure 7.5. The left plot is a log-log plot of y = x² + Y_noise, where the noise is constant. The right plot is a log-log plot of
y = x² + 0.2 x² Y_noise, where the noise is proportional to the noise-free signal, y.
The left hand plot in figure 7.5 shows a log-log plot with signal (y) plus constant noise. The right hand
plot shows signal (y) plus noise proportional to the value of y. The important feature here is that the
randomness in the left plot decreases at larger x and y, and in the right plot the randomness remains
relatively constant. This has important consequences when fitting straight lines to log-log plotted data.
Obviously, in the first case, one would not want to fit a straight line to the lower values of the data where
the noise is high. In the second case, the noise is relatively uniform over the range and a fit will be forced
to take into account the full range of the data.
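The two noise models can be compared numerically as well as graphically. This Python sketch (arbitrary noise amplitude a = 0.2 and signal y = x², matching figure 7.5) measures the scatter relative to the noise-free signal at a small and a large x:

```python
import random
import statistics

random.seed(6)
a = 0.2   # noise amplitude (arbitrary, as in figure 7.5)

def rel_scatter(x, proportional, trials=5000):
    """Scatter of y = x^2 plus noise, relative to the noise-free signal x^2."""
    ys = []
    for _ in range(trials):
        noise = random.gauss(0.0, 1.0)
        ys.append(x**2 + (a * x**2 * noise if proportional else a * noise))
    return statistics.stdev(ys) / x**2

const_small = rel_scatter(2.0, False)    # constant noise, small x
const_large = rel_scatter(20.0, False)   # constant noise, large x
prop_small = rel_scatter(2.0, True)      # proportional noise, small x
prop_large = rel_scatter(20.0, True)     # proportional noise, large x
```

The relative scatter of the constant-noise model collapses at large x, while the proportional-noise model stays near a everywhere, which is exactly the behavior seen in figure 7.5.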
Review:
After reading this chapter and working the problems, you should:
Understand the relationship between the mean and variance of the data and the mean and variance of
the answer, after putting data into an equation.
Understand what bias is and how to compute it analytically for simple functional forms and how to
model it on the computer using simulations.
Be able to determine the distribution of errors in the answer that is a function of the equation used to
get the answer and the distribution of the data.