
The term inference refers to a key concept in statistics in which we draw a conclusion
from available evidence.


The purpose of descriptive statistics is to summarize or display data so we can quickly
obtain an overview. Inferential statistics allows us to make claims or conclusions about
a population based on a sample of data from that population. A population represents all
possible outcomes or measurements of interest. A sample is a subset of a population.
We use the term population in statistics to represent all possible measurements or
outcomes that are of interest to us in a particular study. The term sample refers to a
portion of the population that is representative of the population from which it was
selected.
Data is simply defined as the value assigned to a specific observation or measurement.
Data that is used to describe something of interest about a population is called a
parameter.
For instance, let's say that the population of interest is my wife's three-year-old preschool
class and my measurement of interest is how many times the little urchins use the
bathroom in a day.
If we average the number of trips per child, this figure would be considered a parameter
because the entire population was measured. However, if we want to make a statement
about the average number of bathroom trips per day per three-year-old in the country,
then Debbie's class could be our sample. We can consider the average that we observe
from her class a statistic if we assume it could be used to estimate the average for all
three-year-olds in the country.
Data that describes a characteristic about a population is known as a parameter.
Data that describes a characteristic about a sample is known as a statistic.
Information is data that is transformed into useful facts that can be used for a specific
purpose, such as making a decision.
We classify the sources of data into two broad categories: primary and secondary.
You can obtain primary data in many ways, such as direct observation, surveys, and
experiments.
Direct observation: Focus groups are a direct observational technique where the
subjects are aware that data is being collected. Businesses use focus groups to gather
information in a group setting controlled by a moderator. The subjects are usually paid
for their time and are asked to comment on specific topics.
Experiments: This method is more direct than observation because the subjects will
participate in an experiment designed to determine the effectiveness of a treatment. An
example of a treatment could be the use of a new medical drug. Two groups would be
established. The first is the experimental group who receive the new drug, and the second
is the control group who think they are getting the new drug but are in fact getting no
medication. The reactions from each group are measured and compared to determine
whether the new drug was effective.
The benefit of experiments is that they allow the statistician to control factors that could
influence the results, such as gender, age, and education of the participants. The concern
about collecting data through experiments is that the response of the subjects might be
influenced by the fact that they are participating in a study. The design of experiments for
a statistical study is a very complex topic and goes beyond the scope of this book.
Surveys: This technique of data collection involves directly asking the subject a series of
questions.
The questionnaire needs to be carefully designed to avoid any bias or confusion for those
participating. Concerns also exist about the influence the survey will have on the
participant's responses. Research has shown that the manner in which the questions are
asked can affect the responses a person provides on a questionnaire. A question posed in
a positive tone will tend to invoke a more positive response and vice versa. A good
strategy is to test your questionnaire with a small group of people before releasing it to
the general public.
Another way to classify data is by one of two types: quantitative or qualitative.
Types of measurement scales:
A nominal level of measurement deals strictly with qualitative data. Observations are
simply assigned to predetermined categories. One example is gender of the respondent,
with the categories being male and female. This data type does not allow us to perform
any mathematical operations, such as adding or multiplying. We also cannot rank-order
this list in any way from highest to lowest. This type is considered the lowest level of data
and, as a result, is the most restrictive when choosing a statistical technique to use for the
analysis.
You can use numbers at the nominal level of measurement. Even in this case, the rules of
the nominal scale still remain. An example would be zip codes or telephone numbers,
which can't be added or placed in a meaningful order of greater than or less than. Even
though the data appears to be numbers, it's handled just like qualitative data.
On the food chain of data, ordinal is the next level up. It has all the properties of nominal
data with the added feature that we can rank-order the values from highest to lowest. An
example is if you were to have a lawnmower race. Let's say the finishing order was Scott,
Tom, and Bob. We still can't perform mathematical operations on
this data, but we can say that Scott's lawnmower was faster than Bob's. However, we
cannot say how much faster. Ordinal data does not allow us to make measurements
between the categories and to say, for instance, that Scott's lawnmower is twice as good
as Bob's (it's not).
Ordinal data can be either qualitative or quantitative. An example of quantitative data is
rating movies with 1, 2, 3, or 4 stars. However, we still may not claim that a 4-star movie
is 4 times as good as a 1-star movie.
Moving up the scale of data, we find ourselves at the interval level, which is strictly
quantitative data. Now we can get to work with the mathematical operations of addition
and subtraction when comparing values. For this data, we can measure the difference
between the different categories with actual numbers and also provide meaningful
information. Temperature measurement in degrees Fahrenheit is a common example here.
For instance, 70 degrees is 5 degrees warmer than 65 degrees.
However, multiplication and division can't be performed on this data. Why not? Simply
because we cannot argue that 100 degrees is twice as warm as 50 degrees.
The king of data types is the ratio level. Now we can perform all four mathematical
operations to compare values with absolutely no feelings of guilt. Examples of this type of
data are age, weight, height, and salary. Ratio data has all the features of interval data
with the added benefit of a true 0 point. The term true zero point means that a 0 data
value indicates the absence of the object being measured. For instance, 0 salary indicates
the absence of any salary.
The distinction between interval and ratio data is a fine line.
To help identify the proper scale, use the "twice as much" rule. If the phrase "twice as
much" accurately describes the relationship between two values that differ by a multiple
of 2, then the data can be considered ratio level.
Interval data does not have a true 0 point. For example, 0 degrees Fahrenheit does not
represent the absence of temperature, even though it may feel like it.
A frequency distribution is simply a table that organizes the number of data values into
intervals.
The intervals in a frequency distribution are officially known as classes, and the number
of observations in each class is known as the class frequency.
Constructing a frequency distribution:
- form classes of equal size.
- make classes mutually exclusive, or in other words, prevent classes from overlapping.
- try to have no fewer than 5 classes and no more than 15 classes.
- avoid open-ended classes, if possible (for instance, a highest class of 15-over).
- include all data values from the original table in a class. In other words, the classes
should be exhaustive.
Relative Frequency Distribution
Rather than display the number of observations in each class, this method calculates the
percentage of observations in each class by dividing the frequency of each class by the
total number of observations.
Cumulative Frequency Distribution
Cumulative frequency distributions indicate the percentage of observations that are less
than or equal to the current class. It totals the percentages of each class as you move
down the column. For example, John used his phone 8 times or less on 84 percent of the
days in the month.
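As a rough sketch of how the frequency, relative frequency, and cumulative columns relate, the Python below builds all three from raw values. The data and the name frequency_table are made up for illustration (the original call log is not reproduced in these notes), and each distinct value is treated as its own class.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative, cumulative), sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        # relative = share of all observations; cumulative = share at or below value
        rows.append((value, counts[value], counts[value] / total, running / total))
    return rows

# Hypothetical data: calls made per day over 10 days (illustration only).
calls_per_day = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8]

for value, freq, rel, cum in frequency_table(calls_per_day):
    print(f"{value:>5} {freq:>4} {rel:>8.0%} {cum:>8.0%}")
```

The last cumulative entry always reaches 100 percent, which is a quick check that the classes are exhaustive.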
Graphing a Frequency Distribution, the Histogram
A histogram is simply a bar graph showing the number of observations in each class as
the height of each bar.
- the first thing we need to do is open Excel to a blank sheet and enter our data in Column
A starting in Cell A1.
- next enter the upper limits to each class in Column B starting in Cell B1.
- go to the Tools menu at the top of the Excel window and select Data Analysis.
- the Chart Wizard allows more control over the final appearance.
Statistical Flower Power, the Stem and Leaf Display
The major benefit of this approach is that all the original data points are visible on the
display.

The stem in the display is the first column of numbers, which represents the first digit of
the golf scores. The leaf in the display is the second digit of the golf scores, with 1 digit
for each score. Because there were 5 scores in the 70s, there are 5 digits to the right of 7.
Here, the stem labeled 7 (5) stores all the scores between 75 and 79. The stem 8 (0)
stores all the scores between 80 and 84.
Charting a Frequency Distribution
Bar Charts
Bar charts are a useful graphical tool when you are plotting individual data values next
to each other.
The histogram that we visited earlier in the chapter is actually a special type of bar
chart that plots frequencies rather than actual data values.
How do I choose between a pie chart and a bar chart? If your objective is to compare
the relative size of each class to one another, use a pie chart. Bar charts are more useful
when you want to highlight the actual data values.
Line Charts
Line charts are used to help identify patterns between two sets of data. They prove very
useful when you are interested in exploring patterns between two different types of data.
They are also helpful when you have many data points and want to show all of them on
one graph.
Because the line connecting the data points seems to have an overall upward trend, my
suspicions hold true. It seems the more showers our waterlogged darlings take, the higher
the utility bill.
Measures of Central Tendency
There exist two broad categories of descriptive statistics that are commonly used. The
first, measures of central tendency, describes the center point of our data set with a
single value. It's a valuable tool to help us summarize many pieces of data with one
number. The second category, measures of dispersion, describes how far individual
data values have strayed from the mean.
The mean or average is the most common measure of central tendency and is
calculated by adding all the values in our data set and then dividing this result by
the number of observations.
A weighted mean allows you to assign more weight to certain values and less
weight to others.
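A minimal sketch of the two calculations, assuming made-up course scores and weights (none of these numbers come from the notes): the plain mean treats every value equally, while the weighted mean multiplies each value by its weight and divides by the sum of the weights.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by the number of observations."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted mean: sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical course scores and weights (illustration only).
scores = [90, 80, 70]       # exam, project, homework
weights = [0.5, 0.3, 0.2]   # the exam counts most

print(mean(scores))                    # every score counts equally
print(weighted_mean(scores, weights))  # the exam pulls the result upward
```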
Mean of Grouped Data from a Frequency Distribution - example:
According to the mean of this frequency distribution, John averages 4.6 calls per day on
his cell phone.
The mean of a frequency distribution where data is grouped into classes is only an
approximation to the mean of the original data set from which it was derived.
This is true because we make the assumption that the original data values are at the
midpoint of each class, which is not necessarily the case. The true mean of the 30
original data values in the cell phone example is only 4.5 calls per day rather than 4.6.
The median is the value in the data set for which half the observations are higher
and half the observations are lower. We find the median by arranging the data
values in ascending order and identifying the halfway point.
When there is an even number of data points, the median will be the average of the two
center points.
Using our example with the video games, we rearrange our data set in ascending order:
3 4 4 4 5 6 7 7 9 17
Because we have an even number of data points (10), the median is the average of the
two center points. In this case, that will be the values 5 and 6, resulting in a median of
5.5 hours of video games per week. Notice that there are four data values to the left
(3, 4, 4, and 4) of these center points and four data values to the right (7, 7, 9, and 17).
The mode is simply the observation in the data set that occurs the most
frequently.
If you think all the data in your data set is relevant, then the mean is your best choice.
This measurement is affected by both the number and magnitude of your values. However,
very small or very large values can have a significant impact on the mean, especially if
the size of the sample is small. If this is a concern, perhaps you should consider using the
median. The median is not as sensitive to a very large or small value.
Consider the following data set from the original video game example: 3 4 4 4 5 6 7 7 9 17
The number 17 is rather large when compared to the rest of the data. The mean of this
sample was 6.6, whereas the median was 5.5. If you think 17 is not a typical value that
you would expect in this data set, the median would be your best choice for central
tendency.
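The comparison above can be checked with Python's standard statistics module. This is only a sketch using the ten video game values quoted in the text; dropping the largest value to see how each measure reacts is an added illustration, not a step from the notes.

```python
import statistics

# Hours of video games per week, from the example in the text.
hours = [3, 4, 4, 4, 5, 6, 7, 7, 9, 17]

mean_hours = statistics.mean(hours)      # sensitive to the value 17
median_hours = statistics.median(hours)  # average of the two center values
mode_hours = statistics.mode(hours)      # most frequent value

# Illustration: remove the largest value and see how far each measure moves.
trimmed = sorted(hours)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_hours, median_hours, mode_hours)
print(mean_trimmed, median_trimmed)
```

The mean shifts by more than a full hour when 17 is removed, while the median moves by only half an hour, which is the sensitivity the text describes.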
The poor lonely mode has limited applications. It is primarily used to describe data at
the nominal scale, that is, data that is grouped in descriptive categories such as gender.
If 60 percent of our survey respondents were male, then the mode of our data would be
male.
In Excel, Data Analysis - Descriptive Statistics reports the mean, median, and mode.
Measures of Dispersion
Range is the simplest measure of dispersion and is calculated by finding the
difference between the highest value and the lowest value in the data set.
Example data set: 7 9 8 11 4
Range = 11 - 4 = 7
However, the limitation is that it only relies on two data points to describe the variation
in the sample. No other values between the highest and lowest points are part of the
range calculation.
Variance summarizes the squared deviation of each data value from the mean.
The variance is a measure of dispersion that describes the relative distance between the
data points in the set and the mean of the data set. This measure is widely used in
inferential statistics.

The first step in calculating the variance is to determine the mean of the data set. The
rest of the calculations can be facilitated by the following table. The final sample
variance calculation becomes this:
$s^2 = 26.8 / (5 - 1) = 6.7$
Using the Raw Score Method is a more efficient way to calculate the variance of a
data set.
$s^2 = \dfrac{\sum x^2 - (\sum x)^2 / n}{n - 1}$
That is, take the sum of each data value after it has been squared, subtract the square of
the sum of all the data values divided by n, and divide the result by n - 1.
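Both routes to the sample variance can be compared side by side. The sketch below assumes the data set is the five values from the range example (7, 9, 8, 11, 4), which is consistent with the sum of squared deviations of 26.8 quoted above; the variable names are illustrative.

```python
import statistics

data = [7, 9, 8, 11, 4]   # assumed to be the data set behind the 26.8 figure
n = len(data)

# Definitional method: squared deviations from the mean.
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)
var_definition = sum_sq_dev / (n - 1)

# Raw score method: only the sum of x and the sum of x squared are needed.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
var_raw = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(sum_sq_dev)                 # about 26.8
print(var_definition, var_raw)    # both about 6.7
print(statistics.variance(data))  # library check (sample variance, n - 1)
print(statistics.stdev(data))     # sample standard deviation
```

The raw score method avoids computing each deviation separately, which is why it is the quicker route by hand.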
The Variance of a Population
Standard deviation is simply the square root of the variance. Just as with the
variance, there is a standard deviation for both the sample and population. To calculate
the standard deviation, you must first calculate the variance and then take the square
root of the result.
The standard deviation is actually a more useful measure than the variance because the
standard deviation is in the units of the original data set.
Calculating the Standard Deviation of Grouped Data
The Empirical Rule: Working with Standard Deviation
The values of many large data sets tend to cluster around the mean or median so that
the data distribution in the histogram resembles a bell-shaped, symmetrical curve. When
this is the case, the empirical rule tells us that approximately 68 percent of the data
values will be within one standard deviation from the mean.
For example, suppose that the average exam score for my large statistics class is 88
points and the standard deviation is 4.0 points and that the distribution of grades is
bell-shaped around the mean. Because one standard deviation above the mean would be 92
(88 + 4) and one standard deviation below the mean would be 84 (88 - 4), the empirical
rule tells me that approximately 68 percent of the exam scores will fall between 84 and
92 points.
According to the empirical rule, if a distribution follows a bell shape (a symmetrical
curve centered around the mean), we would expect approximately 68, 95, and 99.7
percent of the values to fall within one, two, and three standard deviations around the
mean respectively.
In general, we can use the following equation to express the range of values within k
standard deviations around the mean: $\mu \pm k\sigma$.
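The range within k standard deviations is simple enough to tabulate. This sketch uses the exam example from the text (mean 88, standard deviation 4.0); the names interval and EMPIRICAL_COVERAGE are illustrative, and the percentages only apply if the distribution really is bell-shaped.

```python
# Approximate coverage for bell-shaped data, as stated by the empirical rule.
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}

def interval(mu, sigma, k):
    """Return the (low, high) range within k standard deviations of the mean."""
    return (mu - k * sigma, mu + k * sigma)

for k, percent in EMPIRICAL_COVERAGE.items():
    low, high = interval(88, 4.0, k)
    print(f"k={k}: about {percent}% of scores between {low} and {high}")
```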
Chebyshev's Theorem
Chebyshev's theorem is a mathematical rule similar to the empirical rule except that it
applies to any distribution rather than just bell-shaped, symmetrical distributions.
Chebyshev's theorem states that for any number k greater than 1, at least
$(1 - 1/k^2) \times 100$ percent of the values will fall within k standard deviations
from the mean. Using this equation, we can state the following:
- at least 75 percent of the data values will fall within two standard deviations from the
mean by setting k = 2 into Chebyshev's equation.
- at least 88.9 percent of the data values will fall within three standard deviations from
the mean by setting k = 3 into the equation.
- at least 93.75 percent of the data values will fall within four standard deviations from
the mean by setting k = 4 into the equation.
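The three bounds listed above follow directly from the formula. A small sketch (the function name chebyshev_min_percent is illustrative):

```python
def chebyshev_min_percent(k):
    """Minimum percent of values within k standard deviations, for any distribution."""
    if not k > 1:
        raise ValueError("Chebyshev's theorem requires k greater than 1")
    return (1 - 1 / k ** 2) * 100

for k in (2, 3, 4):
    print(f"k={k}: at least {chebyshev_min_percent(k):.2f} percent")
```

Because the theorem makes no assumption about shape, these are lower bounds; a bell-shaped data set will usually have far more of its values inside each range.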
Example:
This table supports Chebyshev's theorem, which predicts that at least 75 percent of the
values will fall within two standard deviations from the mean. From the data set, we can
observe that 95 percent actually fall between 20.3 and 49.1 home runs (38 out of 40).
The same explanation holds true for three and four standard deviations around the
mean.
Measures of Relative Position describe the percentage of the data below a certain
point.
Quartiles divide the data set into four equal segments after it has been arranged
in ascending order.
Approximately 25 percent of the data points will fall below the first quartile, Q1.
Approximately 50 percent of the data points will fall below the second quartile, Q2. And,
you guessed it, 75 percent should fall below the third quartile, Q3.
1) Arrange your data in ascending order.
2) Find the median of the data set. This is Q2.
3) Find the median of the lower half of the data set. This is Q1.
4) Find the median of the upper half of the data set. This is Q3.
Interquartile range - the IQR measures the spread of the center half of our data
set. It is simply the difference between the third and first quartiles, as follows:
IQR = Q3 - Q1. The interquartile range is used to identify outliers, which are the black
sheep of our data set. These are extreme values whose accuracy is questioned and can
cause unwanted distortions in statistical results. Any values that are more than
Q3 + 1.5 IQR or less than Q1 - 1.5 IQR should be discarded.
Example: 10 42 45 46 51 52 58 73
Since there are eight data values, Q1 will be the median of the first four values (the
midpoint between the second and third values). Q1 = (42 + 45)/2 = 43.5
Likewise, Q3 will be the median of the last four values (the midpoint between the sixth
and seventh values).
Q3 = (52 + 58)/2 = 55. IQR = Q3 - Q1 = 55 - 43.5 = 11.5
Any value greater than Q3 + 1.5 IQR = 72.25 or less than Q1 - 1.5 IQR = 26.25 should be
considered an outlier; therefore the values 10 and 73 would both be outliers in this data
set.
The values for variance and standard deviation reported by Excel are for a sample. If
your data set represents a population, you need to recalculate the results using N in the
denominator rather than n - 1.
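The quartile steps and the 1.5 IQR outlier fences described above can be sketched as follows. The function names are illustrative, and leaving the middle value out of both halves when the count is odd is one common convention assumed here, not something the notes specify.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def outliers(values):
    """Return the values lying beyond 1.5 IQR from the first or third quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x > high_fence or low_fence > x]

sample = [10, 42, 45, 46, 51, 52, 58, 73]
print(quartiles(sample))
print(outliers(sample))
```

For the eight-value example this gives Q1 = 43.5 and Q3 = 55, so the fences sit at 26.25 and 72.25. Library routines often use other quartile conventions and can return slightly different values for the same data.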
Probability topics
Experiment. The process of measuring or observing an activity for the purpose of
collecting data. An example is rolling a pair of dice.
Outcome. A particular result of an experiment. An example is rolling a pair of threes
with the dice.
Sample space. All the possible outcomes of the experiment. The sample space for our
experiment is the numbers {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12}. Statistics people like
to put {} around the sample space values.
Event. One or more outcomes that are of interest for the experiment and which is/are a
subset of the sample space. An example is rolling a total of 2, 3, 4, or 5 with two dice.
Classical Probability refers to a situation when we know the number of possible
outcomes of the event of interest and can calculate the probability of that event with the
following equation:
P[A] = Number of possible outcomes in which Event A occurs / Total number of possible
outcomes in the sample space.
Empirical Probability - used when we don't know enough about the underlying process to
determine the number of outcomes associated with an event. This type of probability
observes the number of occurrences of an event through an experiment and calculates the
probability from a relative frequency distribution.
P[A] = Frequency in which Event A occurs / Total number of observations.
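As an illustration of the classical formula, the sketch below counts outcomes for the two-dice events mentioned earlier. It assumes two fair six-sided dice and counts the 36 ordered pairs rather than the 11 possible totals, because the pairs are equally likely and the totals are not. The helper name classical_probability is illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumption: two fair dice, so all 36 ordered pairs are equally likely.
outcomes = list(product(range(1, 7), repeat=2))

def classical_probability(event):
    """Favourable outcomes divided by total outcomes in the sample space."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

# Event: rolling a total of 2, 3, 4, or 5.
p_low_total = classical_probability(lambda roll: sum(roll) in {2, 3, 4, 5})
# Outcome: rolling a pair of threes.
p_pair_of_threes = classical_probability(lambda roll: roll == (3, 3))

print(p_low_total, float(p_low_total))
print(p_pair_of_threes, float(p_pair_of_threes))
```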
One example of empirical probability is to answer the age-old question "What is the
probability that John will get out of bed in the morning for school after his first wake-up
call?"
Based on these observations, if Event A = John getting out of bed on the first wake-up
call, then P[A] = 0.15
Using the previous table, we can also examine the probability of other events. Let's say
Event B = John requiring more than 2 wake-up calls to get out of bed; then
P[B] = 0.40 + 0.25 = 0.65.
If I choose to run another 20-day experiment of John's waking behavior, I would most
likely see different results than those in the previous table. However, if I were to
observe 100 days of this data, the relative frequencies would approach the true or
classical probabilities of the underlying process. This pattern is known as the law of
large numbers.
The law of large numbers states that when an experiment is conducted a large number
of times, the empirical probabilities of the process will converge to the classical
probabilities.
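The law of large numbers can be watched in a small simulation. This sketch does not model John's wake-up calls (that table is not available here); instead it assumes two fair dice and tracks the event "total of 2, 3, 4, or 5", whose classical probability is 10/36. The seed is fixed only to make the run repeatable.

```python
import random

def relative_frequency(trials, rng):
    """Empirical probability of rolling a total of 2 to 5 with two dice."""
    hits = 0
    for _ in range(trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total in (2, 3, 4, 5):
            hits += 1
    return hits / trials

classical = 10 / 36          # 10 favourable ordered pairs out of 36
rng = random.Random(42)      # fixed seed: an arbitrary choice for repeatability

for trials in (20, 100, 10_000, 100_000):
    estimate = relative_frequency(trials, rng)
    print(f"{trials:>7} rolls: empirical {estimate:.4f} vs classical {classical:.4f}")
```

Short runs can land well away from 0.2778, while the long runs settle close to it.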
Subjective probability
We use subjective probability when classical and empirical probabilities are not
available.
Under these circumstances, we rely on experience and intuition to estimate the
probabilities.
Basic Properties of Probability - one event
If P[A] = 1, then Event A must occur with certainty.
If P[A] = 0, then Event A will certainly not occur.
The probability of Event A must be between 0 and 1.
The sum of all the probabilities for the events in the sample space must be equal to 1.
The complement to Event A is defined as all the outcomes in the sample space that are
not part of Event A and is denoted as A'. Using this definition, we can state the
following: P[A] + P[A'] = 1 or P[A] = 1 - P[A'].
The Intersection of Events
Example:
Now that my children are older and living away from home, I cherish those moments
when the phone rings and I see one of their numbers appear on my caller ID.
Experience has taught me that I can categorize these calls as either "crisis," involving
such things as a computer, a car, an ATM card, or a cell phone; or "noncrisis," when
they call just to see if I'm alive and well enough to help with their next crisis.
The following table, called a contingency table, categorizes the last 50 phone calls by
child and type of call.
Contingency tables show the actual or relative frequency of two types of data at the
same time. In this case, the data types are child and type of call.
Event A = the next phone call will come from Christin.
Event B = the next phone call will involve a crisis.
P[A] = 20/50 = 0.4
What about the probability that the next phone call will come from Christin and will
involve a crisis?
This event is known as the intersection of Events A and B and is described by A ∩ B. The
number of phone calls from our contingency table that meet both criteria is 14, so:
P[A and B] = P[A ∩ B] = 14/50 = 0.28
A contingency table indicates the number of observations that are classified according
to two variables. The intersection of Events A and B represents the number of instances
where Events A and B occur at the same time (that is, the same phone call is both from
Christin and a crisis). The probability of the intersection of two events is known as a
joint probability.
The union of Events A and B represents the number of instances where either Event A
or B occur (that is, the number of calls that were either from Christin or were a crisis).
P[A or B] = P[A ∪ B] = 34/50 = 0.68
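The contingency table itself is missing from these notes, so the counts below are a reconstruction from the figures quoted in the text: 50 calls, 20 from Christin, 14 that are both from Christin and a crisis, and 34 in the union. How the remaining calls split among the other children is unknown, so they are lumped into a hypothetical "other" row. With that caveat, the sketch shows how intersection and union are read off a table.

```python
# Reconstructed counts (see assumptions above); "other" stands for all other children.
calls = {
    ("Christin", "crisis"): 14,
    ("Christin", "noncrisis"): 6,
    ("other", "crisis"): 14,
    ("other", "noncrisis"): 16,
}
total = sum(calls.values())

def prob(predicate):
    """Relative frequency of the table cells that satisfy the predicate."""
    return sum(n for key, n in calls.items() if predicate(*key)) / total

p_a = prob(lambda child, kind: child == "Christin")                         # P[A]
p_b = prob(lambda child, kind: kind == "crisis")                            # P[B]
p_a_and_b = prob(lambda child, kind: child == "Christin" and kind == "crisis")
p_a_or_b = prob(lambda child, kind: child == "Christin" or kind == "crisis")

print(p_a, p_b, p_a_and_b, p_a_or_b)
```

Note that the union counts the 14 Christin-and-crisis calls once, not twice, which is why it comes to 34 calls rather than 20 + 28.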
Classical probability requires knowledge of the underlying process in order to count the
number of possible outcomes of the event of interest.
Empirical probability relies on historical data from a frequency distribution to calculate
the likelihood that an event will occur.
The law of large numbers states that when an experiment is conducted a large number
of times, the empirical probabilities of the process will converge to the classical
probabilities.
The intersection of Events A and B represents the number of instances where Events A
and B occur at the same time.
The union of Events A and B represents the number of instances where either Event A
or B occur.
Conditional Probability
We define conditional probability as the probability of Event A knowing that Event B has
already occurred.
Example: the following table shows the outcomes of our last 20 matches, along with the
type of warm-up before we started keeping score.
Let Event A = Debbie wins the match and Event B = the warm-up is short (less than 10
minutes). Without any additional information, the simple probability of each of these
events is as follows: P[A] = 9/20 = 0.45, P[B] = 13/20 = 0.65, P[A'] = 11/20 = 0.55,
P[B'] = 7/20 = 0.35
Simple or prior probabilities are always based on the total number of observations. In
the previous example, it is 20 matches.
Suppose we know that the warm-up was short. Knowing this piece of info, what is the
probability that Debbie will win the match? This is the conditional probability of Event A
given that Event B has occurred. Looking at the previous table, we can see that Event B
has occurred 13 times. Because Debbie has won 4 of those matches (A), the probability
of A given B is calculated as follows: P[A|B] = 4/13 = 0.31
We can also calculate the probability that Debbie will win given that Event B has not
occurred: P[A|B'] = 5/7 = 0.71
Conditional probabilities are also known as posterior probabilities. Conditional
probabilities are very useful for determining the probabilities of compound events as
you will see in the following sections.
Independent versus Dependent Events
Events A and B are said to be independent of each other if the occurrence of Event B
has no effect on the probability of Event A. Using conditional probability, Events A and B
are independent of one another if:
P[A|B] = P[A]
If Events A and B are not independent of one another, then they are said to be
dependent events.
In the tennis example, Events A and B are dependent because the probability of Debbie
winning depends on whether the warm-up is more or less than 10 minutes. We can also
demonstrate this by observing that: P[A] = 9/20 = 0.45 and P[A|B] = 4/13 = 0.31
These probabilities tell us that overall, Debbie wins 45 percent of the matches.
However, when there is a short warm-up, she only wins 31 percent of the time. Because
these probabilities are not equal, Events A and B are dependent.
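The tennis figures can be re-derived from the counts quoted in the text. This sketch uses exact fractions so the independence test is not affected by rounding; the variable names are illustrative, and the reading of Event B as "short warm-up" is taken from the 31 percent sentence above.

```python
from fractions import Fraction

# Counts quoted in the tennis example (20 matches in total).
matches = 20
short_warmup = 13   # Event B occurred
wins_short = 4      # A and B together
long_warmup = 7     # Event B' occurred
wins_long = 5       # A and B' together

p_a = Fraction(wins_short + wins_long, matches)      # simple probability of a win
p_b = Fraction(short_warmup, matches)
p_a_given_b = Fraction(wins_short, short_warmup)     # restrict to short warm-ups
p_a_given_not_b = Fraction(wins_long, long_warmup)   # restrict to long warm-ups

# Independence would require P[A|B] == P[A].
independent = p_a_given_b == p_a

print(float(p_a), round(float(p_a_given_b), 2), round(float(p_a_given_not_b), 2))
print("independent" if independent else "dependent")
```

The conditional probabilities use the 13 or 7 relevant matches as the denominator, whereas the simple probabilities always use all 20.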
Multiplication Rule of Probabilities
- used to calculate the joint probability of two events. In other words, we are calculating
the probability of these events occurring at the same time.
For two independent events, the multiplication rule states the following:
P[A and B] = P[A] × P[B]
If the two events are dependent, the multiplication rule becomes:
P[A and B] = P[A|B] × P[B]
Mutually Exclusive Events
Two events are considered to be mutually exclusive if they cannot occur at the same
time during the experiment.
Addition Rule of Probabilities
We use the addition rule of probabilities to calculate the probability of the union of
events, that is, the probability that either Event A or Event B will occur. For two events
that are mutually exclusive, the addition rule states the following:
P[A or B] = P[A] + P[B]
If the events are not mutually exclusive, the addition rule becomes
P[A or B] = P[A] + P[B] - P[A and B].
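One way to convince yourself of the multiplication and addition rules is to check them by brute-force counting. The events below are hypothetical choices for two fair dice (they do not come from the notes): "first die is 6" and "second die is 6" are independent but not mutually exclusive, while "total is 2" and "total is 12" are mutually exclusive.

```python
from fractions import Fraction
from itertools import product

# Assumption: two fair dice; every ordered pair (first, second) is equally likely.
rolls = list(product(range(1, 7), repeat=2))

def p(event):
    """Probability of an event by counting favourable ordered pairs."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def first_is_six(r):
    return r[0] == 6

def second_is_six(r):
    return r[1] == 6

def total_is_two(r):
    return sum(r) == 2

def total_is_twelve(r):
    return sum(r) == 12

# Multiplication rule for independent events: P[A and B] = P[A] x P[B].
p_both = p(lambda r: first_is_six(r) and second_is_six(r))

# Addition rule, not mutually exclusive: P[A or B] = P[A] + P[B] - P[A and B].
p_either = p(lambda r: first_is_six(r) or second_is_six(r))

# Addition rule, mutually exclusive: P[C or D] = P[C] + P[D].
p_extreme = p(lambda r: total_is_two(r) or total_is_twelve(r))

print(p_both, p_either, p_extreme)
```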
When converting frequencies to relative frequencies in a contingency table, always
divide each number in the table by the total number of observations.
Bayes' Theorem

The Fundamental Counting Principles
According to the fundamental counting principle, if one event can occur in m ways and a
second event can occur in n ways, the total number of ways both events can occur
together is m × n ways. And we can extend this principle to more than two events.
Permutations are the number of different ways in which objects can be arranged in
order. In a permutation, each item appears only once. The number of permutations of n
distinct objects is n! (expressed as n factorial).
Combinations are similar to permutations, except that the order of the objects is not
important. The number of combinations of n objects taken r at a time can be found as
follows:
$_nC_r = \dfrac{n!}{(n - r)!\, r!}$
Example:
Now that we know the total number of five-card combinations from a 52-card deck, we
can calculate the probability of a flush, which is any five cards that are all the same suit
(spades, clubs, hearts, or diamonds). For you poker veterans, I am including a royal
flush and a straight flush in this calculation. First, we need to count the number of
five-card flushes of one suit, let's say diamonds. Because there are 13 diamonds in the
deck, the number of combinations of these 13 diamonds, taken five at a time, is as follows:
$_{13}C_5 = \dfrac{13!}{5!\,(13 - 5)!} = 1287$
Because there are four suits in the deck, the total number of five-card flushes from any
suit is 1287 × 4 = 5,148. Therefore, the probability of being dealt a flush, including royal
and straight, in a five-card hand is P[flush] = 5,148 / 2,598,960 = 0.002
Excel functions:
=PERMUT(n, r)
=COMBIN(n, r)
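For readers working outside Excel, Python's standard math module offers the same counts (math.perm and math.comb, available from Python 3.8). The sketch below repeats the flush calculation; the shirts-and-ties numbers are a made-up illustration of the counting principle.

```python
import math

# Fundamental counting principle: hypothetical 3 shirts and 4 ties.
outfits = 3 * 4

# Permutations: orderings of 5 distinct objects, and of 2 chosen from 5.
orderings_of_five = math.factorial(5)
perm_5_2 = math.perm(5, 2)

# Combinations: the flush example from the text.
flushes_one_suit = math.comb(13, 5)   # five cards chosen from 13 diamonds
all_flushes = 4 * flushes_one_suit    # four suits
all_hands = math.comb(52, 5)          # every five-card hand
p_flush = all_flushes / all_hands

print(outfits, orderings_of_five, perm_5_2)
print(flushes_one_suit, all_flushes, all_hands, round(p_flush, 5))
```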
Random Variables
A random variable is an outcome that takes on a numerical value as a result of an
experiment. The value of the random variable, which is not known with certainty before
the experiment, is often denoted by x.
All random variables are not created equal. The first type are known as continuous
random variables, which are the result of a measurement on a continuous number
scale. The second type of random variable is discrete. Discrete random variables are
the result of counting outcomes rather than measuring them. Discrete random variables
can only take on a certain number of integer values within an interval.
A random variable is continuous if it can assume any numerical value within an interval
as a result of measuring the outcome of an experiment. A random variable is discrete if
it is limited to assuming only specific integer values as a result of counting the outcome
of an experiment.
Discrete probability distributions
A listing of all the possible outcomes of an experiment for a discrete random variable
along with the relative frequency or probability of each outcome is called a discrete
probability distribution.

If we define the random variable x = the place Christin finished in a race, the previous
table would be the discrete probability distribution for the variable x. From this table,
we can state the probability that Christin will finish first as follows: P[x = 1] = 0.54
Or we can state the probability that Christin will finish either first or second as follows:
P[x = 1 or x = 2] = 0.54 + 0.24 = 0.78
Any discrete probability distribution needs to meet the following requirements:
- each outcome in the distribution needs to be mutually exclusive, that is, the value of
the random variable cannot fall into more than one of the frequency distribution classes.
For example, it is not possible for Christin to take first and second place in the same
race.
- the probability of each outcome, P[x], must be between 0 and 1; that is,
0 ≤ P[x] ≤ 1 for all values of x. In the previous example, P[x = 3] = 0.14, which falls
between 0 and 1.
- the sum of the probabilities for all the outcomes in the distribution needs to add up to
1.
The Mean of a Discrete Probability Distribution
The mean of a discrete probability distribution is simply a weighted average calculated
using the following formula:
$\mu = \sum_{i=1}^{n} x_i \, P[x_i]$, where μ = the mean of the discrete probability
distribution, x_i = the value of the random variable for the ith outcome, P[x_i] = the
probability that the ith outcome will occur, n = the number of outcomes in the distribution
- it represents the average finish of many races. The mean of a discrete probability
distribution does not have to equal one of the values of the random variable.
- another term for describing the mean of a probability distribution is the expected
value, E[x].
The Variance and Standard Deviation of a Discrete Probability Distribution
$\sigma^2 = \sum_{i=1}^{n} (x_i - \mu)^2 \, P[x_i]$
$\sigma = \sqrt{\sigma^2}$
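To see the mean and variance formulas at work, here is a minimal Python sketch for the race example. The probabilities for first, second, and third place come from the text; the 0.08 for fourth place is my assumption, added only so the probabilities sum to 1, and the function names are illustrative.

```python
import math

# Finishing place -> probability. The first three probabilities come from the
# text; the 0.08 for fourth place is an assumption so that the total equals 1.
distribution = {1: 0.54, 2: 0.24, 3: 0.14, 4: 0.08}

def discrete_mean(dist):
    """Weighted average: sum of x_i * P[x_i]."""
    return sum(x * p for x, p in dist.items())

def discrete_variance(dist):
    """Sum of (x_i - mu)^2 * P[x_i]."""
    mu = discrete_mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

mean = discrete_mean(distribution)
variance = discrete_variance(distribution)
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 4), round(std_dev, 4))
```

Notice that the mean does not have to be one of the possible finishing places, which matches the remark about the expected value above.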
Characteristics of a Binomial Experiment
A binomial experiment has the following characteristics:
(1) the experiment consists of a fixed number of trials denoted by n;
(2) each trial has only two possible outcomes, a success or a failure;
(3) the probability of success and the probability of failure are constant throughout the
experiment;
(4) each trial is independent of any other trial in the experiment.
The Binomial Probability Distribution
- allows us to calculate the probability of a specific number of successes for a certain
number of trials. Therefore, the random variable for this distribution would be the
number of successes that were observed.
Let's say on any particular day there is a 30 percent probability that Kaylee will
bring back one stolen paper and a 70 percent chance that she won't. We will assume
that she will not bring back more than one paper a day. This scenario represents a
binomial experiment, with each day being a Bernoulli trial with p = 0.30 (the
probability of a success) and q = 0.70 (the probability of a failure). We can calculate
the probability of r successes in n trials using the binomial distribution, as follows:
$P[r, n] = \frac{n!}{(n-r)! \, r!} \, p^r q^{n-r}$
With this equation, we can calculate the probability that Kaylee will bring back three
papers over the next five days. n = 5, r = number of papers

From this figure, we can see that the most likely number of papers that Kaylee will show
up with over 5 days is 1.
Finally, we can calculate the probability of multiple events for this distribution. For
instance, the probability that Kaylee will steal at least three papers over the next five
days is this:
P[r ≥ 3] = P[3, 5] + P[4, 5] + P[5, 5]
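The Kaylee calculations can be checked with a short Python sketch. The values n = 5 and p = 0.30 come from the text; the function name binomial_pmf is illustrative, and the formula used is the standard binomial probability formula.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 5, 0.30  # five days, 30 percent chance of a paper each day (from the text)

# Probability of 0, 1, ..., 5 papers over the five days.
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

# The number of papers with the highest probability.
most_likely = max(range(n + 1), key=lambda r: pmf[r])

# P[r >= 3] = P[3, 5] + P[4, 5] + P[5, 5]
at_least_three = sum(pmf[3:])

print(most_likely, round(pmf[3], 4), round(at_least_three, 4))
```

Summing the individual probabilities is exactly what the "at least three" formula above does; the code only saves the arithmetic.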
- an easier way to arrive at these probabilities is to use a binomial probability table
- the probability table is organized by values of n, the total number of trials. The number
of successes, r, are the rows of each section, whereas the probability of success, p, are
the columns. Notice that the sum of each block of probabilities for a particular value of
p adds to 1.0.
BINOMDIST(r, n, p, cumulative)
cumulative = FALSE if you want the probability of exactly r successes
cumulative = TRUE if you want the probability of r or fewer successes
The Mean and Standard Deviation for the Binomial Distribution
μ = np, n = the number of trials, p = the probability of a success
You can calculate the standard deviation for a binomial probability distribution using the
following equation: $\sigma = \sqrt{npq}$, q = the probability of a failure.
The Poisson Process
A Poisson process counts the number of occurrences of an event over a period of time,
area, distance, or any other type of measurement. Rather than being limited to only two
outcomes, the Poisson process can have any number of outcomes over the unit of
measurement. The random variable for the Poisson distribution would be the actual
number of occurrences.
The mean for a Poisson distribution is the average number of occurrences that would be
expected over the unit of measurement. For a Poisson process, the mean has to be the
same for each interval of measurement. For instance, if the average number of
customers walking into the store each hour is 11, this average needs to apply to every
one-hour increment.
The last characteristic of a Poisson process is that the number of occurrences during
one interval is independent of the number of occurrences in other intervals. In other
words, if six customers walk into the store during the first hour of business, this would
have no effect on the number of customers arriving during the second hour.

Example of a distribution
Some statistics books use the symbol lambda, λ, to denote the mean of a Poisson
probability distribution.
However, regardless of the notation, it's still the same equation.
P[x ≤ 2] = P[x = 0] + P[x = 1] + P[x = 2] (the cumulative probability)
- the variance of the distribution is the same as the mean: $\sigma^2 = \mu$
- just like the binomial distribution, the Poisson probability distribution has a table that
allows you to look up the probabilities for certain mean values.
- the probability table is organized by values of μ, the average number of occurrences.
Notice that the sum of each block of probabilities for a particular value of μ adds to 1.
As with the binomial tables, one limitation of using the Poisson tables is that you are
restricted to using only the values of μ that are shown in the table.
Technically, with a Poisson distribution, there is no upper limit to the number of
occurrences during the interval. You'll notice from the Poisson tables that the
probability of a large number of occurrences is practically zero. Because we cannot add
all the probabilities of an infinite number of occurrences (if you can, you're a much
better statistician than I am!), we use the complement instead:
P[x > 3] = 1 - P[x ≤ 3], because P[x = 0] + P[x = 1] + P[x = 2] + P[x = 3] + ... = 1.0
POISSON(x, μ, cumulative) where:
cumulative = FALSE if you want the probability of exactly x occurrences
cumulative = TRUE if you want the probability of x or fewer occurrences
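If you don't have a table or a spreadsheet handy, the same two quantities can be sketched in Python. This assumes the standard Poisson formula, P[x] = μ^x e^(-μ) / x!; the mean of 2.5 is a hypothetical value chosen for illustration, and the function names are mine.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean is mu."""
    return mu**x * exp(-mu) / factorial(x)

def poisson_cdf(x, mu):
    """Probability of x or fewer occurrences."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

mu = 2.5  # hypothetical mean number of occurrences, not from the text

p_at_most_3 = poisson_cdf(3, mu)
# The complement avoids adding an infinite number of terms.
p_more_than_3 = 1 - p_at_most_3

print(round(p_at_most_3, 4), round(p_more_than_3, 4))
```

Here poisson_pmf plays the role of cumulative = FALSE and poisson_cdf the role of cumulative = TRUE.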
Using the Poisson Distribution as an Approximation to the Binomial
Distribution
We can use the Poisson distribution to calculate binomial probabilities under the
following conditions:
- when the number of trials, n, is greater than or equal to 20 and
- when the probability of a success, p, is less than or equal to 0.05
- in that case, we use μ = np as the Poisson mean, where n = the number of trials and
p = the probability of a success
A Poisson process counts the number of occurrences of an event over a period of time,
area, distance, or any other type of measurement.
- the mean for a Poisson distribution is the average number of occurrences that would
be expected over the unit of measurement and has to be the same for each interval of
measurement.
- the number of occurrences during one interval of a Poisson process is independent of
the number of occurrences in other intervals.
- if the number of binomial trials is greater than or equal to 20 and the probability of a
success is less than or equal to 0.05, you can use the equation for the Poisson
distribution to approximate the binomial probabilities.
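A quick way to convince yourself that the approximation is reasonable is to compare the two distributions side by side. The sketch below uses hypothetical values, n = 25 and p = 0.04, which satisfy both conditions; the function names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Exact binomial probability of r successes in n trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def poisson_pmf(x, mu):
    """Poisson probability of x occurrences with mean mu."""
    return mu**x * exp(-mu) / factorial(x)

n, p = 25, 0.04  # hypothetical values with n >= 20 and p <= 0.05
mu = n * p       # the Poisson mean that replaces n and p

for r in range(4):
    print(r, round(binomial_pmf(r, n, p), 4), round(poisson_pmf(r, mu), 4))

# Largest difference between the two distributions over all possible r.
max_gap = max(abs(binomial_pmf(r, n, p) - poisson_pmf(r, mu)) for r in range(n + 1))
```

For these values the two columns differ by less than one percentage point.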
The Normal Probability Distribution
Now let's take on a new challenge, continuous random variables and a continuous
probability distribution known as the normal distribution. Remember that we defined a
continuous random variable as one that can assume any numerical value within an
interval as a result of measuring the outcome of an experiment. Some examples of
continuous random variables are weight, distance, speed, or time.
Characteristics of the normal probability distribution
- the mean, median, and mode are the same value
- the distribution is bell-shaped and symmetrical around the mean.
- the total area under the curve is equal to 1.
- the left and right sides of the normal probability distribution extend indefinitely, never
quite touching the horizontal axis.
- the mean and standard deviation describe the shape of the distribution
- example: Figure 3.5 shows the impact of changing the mean of the distribution to 5.0
inches, leaving the standard deviation at 0.8 inches.
- a smaller standard deviation results in a skinnier curve that's tighter and taller
around the mean. A larger σ (standard deviation) makes for a fatter curve that's more
spread out and not as tall.
Calculating Probabilities for the Normal Distribution
Calculating the standard z-score:
z = (x - μ) / σ, where: x = the normally distributed random variable of interest
μ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
z = the number of standard deviations between x and μ, otherwise known as the
standard z-score.
- then we look up the z-score in the Standard Normal Table, which gives the area under
the curve to the left of that z-score
- the probability that the standard z-score will be less than or equal to that value is this
area (multiply by 100 to express it as a percent).
- with continuous random variables, we cannot determine the probability of using
exactly 64.3 ounces of spray because this would be an infinitely small probability. This is
because I can use an infinite number of quantities in any given year. One year, I could
use 61.757 ounces and another year, 53.472 ounces. That's why with continuous
random variables we can only calculate the probabilities of certain intervals, like less
than 64.3 ounces or between 50.5 and 58.1 ounces. Compare this to discrete random
variables from previous chapters. Because those variables could only take on specific
integer values, we could calculate the probability of exactly x occurrences or r
successes.
- a negative z-score indicates that we are to the left of the distribution mean. Notice
that the standard normal table only shows positive z values. But this is no problem
because the distribution is symmetric.
- example: we can determine the area to the right of -1.2 standard deviations as
follows: P[z > -1.2] = 1 - P[z ≤ -1.2] = 1 - 0.1151 = 0.8849. Because P[x > 54] =
P[z > -1.2] = 0.8849, there is an 88.49 percent chance I will use more than 54 ounces
of spray. Figure 11.10: The shaded area is the probability that x will be more than 54
ounces.
NORMDIST(x, mean, std deviation, cumulative), where cumulative = FALSE if
you want the probability density function (we don't) or cumulative = TRUE if you
want the cumulative probability (we do)
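The spray example can be reproduced in Python with the standard library's NormalDist class. This section does not state the mean and standard deviation, so the values below (μ = 60 ounces, σ = 5 ounces) are assumptions, chosen only because they give z = -1.2 for x = 54 as in the example.

```python
from statistics import NormalDist

# Assumed parameters (not stated in this section); they give z = -1.2 for x = 54.
mu, sigma = 60.0, 5.0
x = 54.0

# Standard z-score: number of standard deviations between x and the mean.
z = (x - mu) / sigma

# P[x > 54] = P[z > -1.2] = 1 - P[z <= -1.2], using the standard normal curve.
p_more = 1 - NormalDist().cdf(z)

# The same probability without standardizing first, like cumulative = TRUE.
p_more_direct = 1 - NormalDist(mu, sigma).cdf(x)

print(round(z, 2), round(p_more, 4))
```

Both routes give the same answer; the z-score step only exists so that a single printed table can serve every normal distribution.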
Using the Normal Distribution as an Approximation to the Binomial Distribution
- the binomial equation will calculate the probability of r successes in n trials with p =
the probability of a success for each trial and q = the probability of a failure. If np ≥ 5
and nq ≥ 5, we can use the normal distribution to approximate the binomial.
- as an example, suppose my statistics class is composed of 60 percent females. If I
select 15 students at random, what is the probability that this group will include 8, 9,
10, or 11 female students? For this example, n = 15; p = 0.6; q = 0.4; and r = 8, 9, 10,
and 11. We can use the normal approximation because np = (15)(0.6) = 9 and
nq = (15)(0.4) = 6.
- when calculating with the normal approximation to the binomial distribution, adding or
subtracting 0.5 is known as the continuity correction factor. For larger values of n, like
100 or more, you can ignore this correction factor.
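Here is a small Python sketch of the class example that compares the exact binomial answer with the normal approximation. The values n = 15 and p = 0.6 come from the text; the continuity correction widens the interval 8 to 11 into 7.5 to 11.5, and the mean and standard deviation are the binomial ones, np and sqrt(npq).

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 15, 0.6  # from the text
q = 1 - p

# Exact binomial probability of 8, 9, 10, or 11 female students.
exact = sum(comb(n, r) * p**r * q ** (n - r) for r in range(8, 12))

# Normal approximation with mean np and standard deviation sqrt(npq).
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * q))

# Continuity correction: subtract 0.5 from the lower bound, add 0.5 to the upper.
approx = approx_dist.cdf(11.5) - approx_dist.cdf(7.5)

print(round(exact, 4), round(approx, 4))
```

The two results agree to about two decimal places, which is the kind of closeness the np ≥ 5 and nq ≥ 5 rule is meant to guarantee.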
Inferential Statistics
Inferential statistics enables us to make statements about a general population using
the results of a random sample from that population.
For instance, using inferential statistics, the winner of a political election can be
accurately predicted very early in the polling process based on the results of a relatively
small random sample that is properly chosen.
The term random sampling refers to a sampling procedure where every member in the
population has a chance of being selected.
- we have to ensure that the final sample to be measured is representative of the
population from which it was taken. If this is not the case, then we have a biased
sample, which can lead to misleading results.
- there are four different ways to gather a random sample: simple random, systematic,
cluster, and stratified.
A simple random sample is a sample in which every member of the population has an
equal chance of being chosen. I could randomly choose patients using a random number
table.
- random numbers can also be generated with Excel using the RAND function: cell A1
contains the formula =RAND(), which provides a random number between 0 and 1. This
random number would result in student 357 being chosen for the sample.
One way to avoid a personal bias when selecting people at random is to use systematic
sampling. This technique results in selecting every kth member of the population to be
in your sample. The value of k will depend on the size of the sample and the size of the
population.
In general, if N = the size of the population and n = the size of the sample, the value of
k would be approximately N/n.
The benefit of systematic sampling is that it's easier to conduct than a simple random
sample, often resulting in less time and money. The downside is the danger of selecting
a biased sample if there is a pattern in the population that is consistent with the value of
k.
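Simple random and systematic sampling can both be sketched in a few lines of Python. The roster of 500 numbered students and the sample size of 25 are hypothetical, the function names are mine, and starting the systematic sample at a random position within the first k members is a common convention that I am assuming here rather than something the text specifies.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every kth member, with k approximately N / n."""
    rng = random.Random(seed)
    k = len(population) // n
    start = rng.randrange(k)  # assumed convention: random starting point
    return population[start::k][:n]

students = list(range(1, 501))  # hypothetical roster, students numbered 1 to 500

simple = simple_random_sample(students, 25, seed=1)
systematic = systematic_sample(students, 25, seed=1)

print(sorted(simple))
print(systematic)
```

With N = 500 and n = 25, k works out to 20, so the systematic sample picks every 20th student on the roster.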
Cluster sampling
If we can divide the population into groups, or clusters, then we can select a simple
random sample of these clusters to form the final sample. Each member of the
chosen clusters would be part of the final sample.
In stratified sampling, we divide the population into mutually exclusive groups, or
strata, and randomly sample from each of these groups (like men and women). Other
examples of criteria that we can use to divide the population into strata are age,
income, or occupation. Stratified sampling is helpful when it is important that the final
sample has certain characteristics of the overall population.
Sampling errors
By relying on a sample, we expose ourselves to errors that can lead to inaccurate
conclusions about the population. The type of error that a statistician is most concerned
about is called sampling error, which occurs when the sample measurement is different
from the population measurement. Because the population is rarely measured in its
entirety, the sampling error cannot be directly calculated.
One way to reduce the sampling error of a statistical study is to increase the size of the
sample. In general, the larger the sample size, the smaller the sampling error. If you
increase the sample size until it reaches the size of the population, then the sampling
error will be reduced to zero. But in doing so, you forfeit the benefits of sampling.
- online surveys: the respondents are self-selected, which means the sample is not
randomly chosen. The results of these surveys are most likely biased because the
respondents would not be representative of the population at large. For example, people
without Internet access would not be part of the sample and might respond differently
than people with access to the Internet.
The sampling distribution of the mean: the mean of each sample is the
measurement of interest.
- a discrete uniform probability distribution is a distribution that assigns the same
probability to each discrete event (a distribution is discrete if its outcomes are
countable).

According to the central limit theorem, as the sample size, n, gets larger, the sample
means tend to follow a normal probability distribution. This holds true regardless of the
distribution of the population from which the sample was drawn.
The standard deviation of the sample means is formally known as the standard error
of the mean.
Students often confuse σ and $\sigma_{\bar{x}}$. The symbol σ, the standard deviation of the
population, measures the variation within the population. The symbol $\sigma_{\bar{x}}$, the standard
error, measures the variation of the sample means and will decrease as the sample size
increases. The theoretical sampling distribution of the mean displays all the possible
sample means along with their classical probabilities.
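A small simulation makes the difference between σ and the standard error concrete. The sketch below assumes the standard relationship $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ and uses a hypothetical population, rolls of a fair six-sided die, which follows a discrete uniform distribution. The sample size of 30 and the 5,000 repetitions are arbitrary choices.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)      # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die

sigma = statistics.pstdev(faces)  # standard deviation within the population
n = 30                            # size of each sample

# Draw many samples of size n and record the mean of each one.
sample_means = [statistics.fmean(rng.choices(faces, k=n)) for _ in range(5000)]

observed_se = statistics.stdev(sample_means)  # variation of the sample means
theoretical_se = sigma / sqrt(n)              # sigma divided by sqrt(n)

print(round(sigma, 4), round(observed_se, 4), round(theoretical_se, 4))
```

Even though a single die roll is uniformly distributed, the sample means cluster around 3.5, and their spread is much smaller than σ, as the central limit theorem suggests.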
Sampling Distribution of Proportion
My measurement of interest is the proportion of teenagers in my sample of size n, who
will agree with the statement "My parents are an excellent resource when I'm looking
for advice on an important matter in my life." The sample proportion, ps, is calculated
by: ps = (number of successes in the sample) / n
Because I don't know the population proportion, p, who would agree with the statement,
I need to collect data from samples and approximate the population proportion. With
proportion data, I want the sample size to be large enough so I can use the normal
probability distribution to approximate the binomial distribution. If np ≥ 5 and nq ≥ 5,
we can use the normal distribution to approximate the binomial (q = 1 - p, the
probability of a failure).
I'm hopeful that p will be at least 5 percent (at least a few teenagers might listen to
their parents), so if I choose n = 150, then: np = (150)(0.05) = 7.5 and
nq = (150)(0.95) = 142.5
Suppose I choose 10 samples, each of size 150, and record the number of agreements
(successes) in each sample in the table that follows.
Next I average the sample proportions to approximate the population proportion:
average of ps = (sum of sample proportions) / (number of samples) = 0.164
Now, we calculate the standard error of the proportion:
$\sigma_p = \sqrt{p(1-p)/n} = 0.030$
Now I'm ready to answer the age-old question, "What is the probability that from my
next sample of 150 teenagers, 20 percent or less will agree with the statement: 'My
parents are an excellent resource when I'm looking for advice on an important matter in
my life'?"
Because our sample size allows us to use the normal approximation to the binomial
distribution, we now calculate the z-score for the proportion using the following
equation: $z = (p_s - p) / \sigma_p = +1.20$, and using the standard z-table,
P[ps ≤ 0.20] = 0.8849
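The teenager example can be traced step by step in Python. The inputs (p = 0.164, n = 150, and the 20 percent threshold) come from the text; the variable names are mine. Because the code does not round the standard error to 0.030 before dividing, its z-score and probability differ slightly from the table-based values of +1.20 and 0.8849.

```python
from math import sqrt
from statistics import NormalDist

p = 0.164   # average of the sample proportions (from the text)
n = 150     # size of the next sample
ps = 0.20   # threshold: 20 percent or less agree

# Standard error of the proportion: sqrt(p(1 - p) / n)
sigma_p = sqrt(p * (1 - p) / n)

# z-score for the proportion: (ps - p) / sigma_p
z = (ps - p) / sigma_p

# Probability that the sample proportion is 20 percent or less.
prob = NormalDist().cdf(z)

print(round(sigma_p, 4), round(z, 2), round(prob, 4))
```

The small gap between the two answers is purely a rounding effect, not a different method.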
Confidence Intervals
The simplest estimate of a population is the point estimate, the most common being
the sample mean. A point estimate is a single value that best describes the population
of interest. The advantage of a point estimate is that it is easy to calculate and easy to
understand. The disadvantage, however, is that I have no clue as to how accurate this
estimate really is. To deal with this uncertainty, we can use an interval estimate, which
provides a range of values that best describes the population.
A confidence level is the probability that the interval estimate will include the
population parameter. A parameter is defined as a numerical description of a
population characteristic, such as the mean.
In general, we can construct a confidence interval around our sample mean using the
following equations:
upper limit = $\bar{x} + z_c \sigma_{\bar{x}}$ and lower limit = $\bar{x} - z_c \sigma_{\bar{x}}$, where $\bar{x}$ = the sample
mean, $z_c$ = the critical z-score for the confidence level, and $\sigma_{\bar{x}}$ = the standard error of
the mean
As described earlier, a confidence interval is a range of values used to estimate a
population parameter and is associated with a specific confidence level. A confidence
interval needs to be described in the context of several samples. If we select 10 samples
from our home shopping population and construct 90 percent confidence intervals
around each of the sample means, then theoretically 9 of the 10 intervals will contain
the true population mean, which remains unknown.
It is easy to misinterpret the definition of a confidence interval. For example, it is not
correct to state that there is a 90 percent probability that the true population mean is
within the interval ($67.38, $89.12). Rather, a correct statement would be that there
is a 90 percent probability that any given confidence interval from a random
sample will contain the true population mean.
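As a rough illustration of how such an interval is put together, here is a Python sketch that takes the sample mean plus or minus the critical z-score times the standard error, σ / sqrt(n). The inputs (a sample mean of $78.25, σ = $37.50, n = 32) are assumptions that do not appear in this section; I picked them only because they give an interval close to the one quoted above. The critical z-score of 1.64 is the 90 percent table value used in these notes.

```python
from math import sqrt

def confidence_interval(sample_mean, sigma, n, z_c):
    """Interval: sample mean plus or minus z_c times the standard error."""
    standard_error = sigma / sqrt(n)
    margin = z_c * standard_error
    return sample_mean - margin, sample_mean + margin

# Assumed inputs for illustration only; z_c = 1.64 is the 90 percent table value.
lower, upper = confidence_interval(sample_mean=78.25, sigma=37.50, n=32, z_c=1.64)

print(round(lower, 2), round(upper, 2))
```

Rerunning this with a different sample mean moves the whole interval, which is why the 90 percent statement applies to the procedure rather than to any single interval.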
Because there is a 90 percent probability that any given confidence interval will contain
the true population mean in the previous example, we have a 10 percent chance that it
won't. This 10 percent value is known as the level of significance, α, which is
represented by the total white area in both tails.
The probability for the confidence interval is a complement to the significance level. For
example, the significance level for a 95 percent confidence interval is 5 percent, the
significance level for a 99 percent confidence interval is 1 percent, and so on. In
general, a (1 - α) confidence interval has a significance level equal to α.
The level of significance (α) is the probability of making a Type I error.
Notice that in Figure 14.1, 5 percent of the area under the curve lies to the right of
+1.64 and 95 percent of the area under the curve lies to the left. That's why you see
0.9495 (close enough to 0.95) corresponding to a z-score of 1.64 in Table 3 of
Appendix B. Remember, however, that z = 1.64 corresponds to a 90 percent confidence
interval, the shaded region in the figure.
The effect of changing confidence levels: increasing the confidence level makes our
interval estimate of the true population mean wider and less precise. If we
want more certainty that our confidence interval will contain the true population mean,
that confidence interval will become wider.
There is one way, however, to reduce the width of our confidence interval while
maintaining the same confidence level. We can do this by increasing the sample size.
Determining Sample Size for Mean
We can also calculate a minimum sample size that would be needed to provide a specific
margin of error.
$E = z_c \sigma_{\bar{x}} = z_c \sigma / \sqrt{n}$, so solving for n gives $n = (z_c \sigma / E)^2$
Calculating a Confidence Interval When σ is Unknown: as long as n ≥ 30, we can
substitute s, the sample standard deviation, for σ, the population standard deviation,
and follow the same procedure as before.
CONFIDENCE(alpha, standard_dev, size)
Confidence Intervals for the Mean with Small Samples
With a small sample size, we lose the use of our faithful friend, the central limit
theorem, and we need to assume that the population is normally (or approximately
normally) distributed for all cases. The first case that we'll examine is when we know σ,
the population standard deviation.
- when σ is known, the procedure reverts back to the large sample size case. We can do
this because we are now assuming the population is normally distributed.
- when σ is unknown, we make an adjustment similar to the one we made earlier and
substitute s, the sample standard deviation, for σ, the population standard deviation.
However, because of the small sample size, this substitution forces us to use a new
probability distribution known as the Student's t-distribution.
The t-distribution is a continuous probability distribution with the following properties:
- it is bell-shaped and symmetrical around the mean.
- the shape of the curve depends on the degrees of freedom (d.f.) which, when dealing
with the sample mean, would be equal to n - 1.
- the area under the curve is equal to 1.0.
- the t-distribution is flatter than the normal distribution. As the number of degrees of
freedom increases, the shape of the t-distribution becomes similar to the normal
distribution as seen in Figure 14.6. With more than 30 degrees of freedom (a sample
size of 30 or more), the two distributions are practically identical.
The degrees of freedom are the number of values that are free to be varied given that
information, such as the sample mean, is known.
For example, if I know that my sample of size 3 has a mean of 10, I can only vary two
values (n - 1). After I set those two values, I have no control over the third value
because my sample average must be 10. For this sample, I have 2 degrees of freedom.
We can now set up our confidence intervals for the mean using a small sample:
upper limit = $\bar{x} + t_c \, s / \sqrt{n}$ and lower limit = $\bar{x} - t_c \, s / \sqrt{n}$
tc = critical t-value (can be found in Table 4 in Appendix B)
We can use the t-distribution when all of the following conditions have been met:
- the population follows the normal (or approximately normal) distribution.
- the sample size is less than 30.
- the population standard deviation, σ, is unknown and must be approximated by s, the
sample standard deviation.
Confidence Intervals for the Proportion with Large Samples
- we can also estimate the proportion of a population by constructing a confidence
interval from a sample.
- proportion data follow the binomial distribution that can be approximated by the
normal distribution under the following conditions: np ≥ 5 and nq ≥ 5.
- suppose I want to estimate the proportion of home shopping customers who are
female based on the results of a sample: we calculate ps
- the confidence interval around the sample proportion can be calculated by:
upper limit = $p_s + z_c \sigma_p$ and lower limit = $p_s - z_c \sigma_p$
Our challenge is that we are trying to estimate p, the population proportion, but we
need a value for p to set up the confidence interval. Our solution: estimate the
standard error by using the sample proportion as an approximation for the population
proportion.
ps = 0.629
$\sigma_p$ = 0.0365
We are now ready to construct a 90 percent confidence interval around our sample
proportion (zc = 1.64):
- upper limit = 0.689
- lower limit = 0.569
Our 90 percent confidence interval for the proportion of female home shopping
customers is (0.569, 0.689).
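The limits above are easy to verify in Python. The sketch assumes the interval is the sample proportion plus or minus the critical z-score times the standard error, and it takes ps = 0.629, the standard error of 0.0365, and zc = 1.64 directly from the example; the function name is illustrative.

```python
def proportion_interval(ps, sigma_p, z_c):
    """Interval: sample proportion plus or minus z_c times the standard error."""
    margin = z_c * sigma_p
    return ps - margin, ps + margin

# Values taken from the home shopping example in the text.
lower, upper = proportion_interval(ps=0.629, sigma_p=0.0365, z_c=1.64)

print(round(lower, 3), round(upper, 3))
```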
Determining Sample Size for the Proportion: $n = pq \left( \frac{z_c}{E} \right)^2$
- therefore, to obtain a 99 percent confidence interval that provides a margin of error
no more than 6 percent would require a sample size of 459 home shoppers.
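Here is the sample size formula as a short Python sketch. The margin of error of 6 percent comes from the text, but the values p = 0.5 and zc = 2.57 (a table value for 99 percent confidence) are my assumptions; they are not stated in this section, although they do reproduce the sample size of 459.

```python
from math import ceil

def sample_size_for_proportion(p, z_c, margin_of_error):
    """n = p * q * (z_c / E)^2, rounded up to a whole number."""
    q = 1 - p
    return ceil(p * q * (z_c / margin_of_error) ** 2)

# Assumed inputs: p = 0.5 and z_c = 2.57 are not stated in this section.
n = sample_size_for_proportion(p=0.5, z_c=2.57, margin_of_error=0.06)

print(n)
```

Using p = 0.5 is the most conservative choice because it makes the product pq as large as it can be.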
Introduction to Hypothesis Testing
One thing statisticians like to do is to make a statement about a population parameter,
collect a sample from that population, measure the sample, and declare, in a scholarly
manner, whether or not the sample supports the original statement. This, in a nutshell,
is what hypothesis testing is all about.
In the statistical world, a hypothesis is an assumption about a population parameter.
In each case, we have made a statement about the population that may or may not be
true. The purpose of hypothesis testing is to make a statistical conclusion about
accepting or not accepting such statements.
Let's say that my hypothesis is that it will take an average of six days to capture a loose
snake in a house. In other words, I would like to test my belief that the population
mean, μ, is equal to six days. I do this by gathering a sample of people who have had a
loose snake in their home and calculate the average number of days required to capture
it. Suppose the sample average is 6.1 days. The hypothesis test will then tell me
whether or not 6.1 days is significantly different from 6.0 days or if the difference is
merely due to chance.
The Null and Alternative Hypothesis
Every hypothesis test has both a null hypothesis and an alternative hypothesis. The null
hypothesis, denoted by H0, represents the status quo and involves stating the belief
that the mean of the population is ≤, =, or ≥ a specific value. The null hypothesis is
believed to be true unless there is overwhelming evidence to the contrary. In this
example, my null hypothesis would be stated as: H0: μ = 6.0 days
The alternative hypothesis, denoted by H1, represents the opposite of the null
hypothesis and holds true if the null hypothesis is found to be false. The alternative
hypothesis always states the mean of the population is <, ≠, or > a specific value. In this
example, my alternative hypothesis would be stated as: H1: μ ≠ 6.0 days.
The following table shows the three valid combinations of the null and alternative
hypothesis.
H0: μ = 6.0    H1: μ ≠ 6.0
H0: μ ≥ 6.0    H1: μ < 6.0
H0: μ ≤ 6.0    H1: μ > 6.0
Note that the alternative hypothesis is never associated with ≥, =, or ≤.
You need to be careful how you state the null and alternative hypothesis. Your choice
will depend on the nature of the test and the motivation of the person conducting it.
If the purpose is to test that the population mean is equal to a specific value, such as
our snake example, assign this statement as the null hypothesis, which results in the
following: H0: μ = 6.0 days and H1: μ ≠ 6.0 days.
Often hypothesis testing is performed by researchers who want to prove that their
discovery is an improvement over current products or procedures. For example, if I
invented a golf ball that I claimed would increase your distance off the tee by more than
20 yards, I would set up my hypothesis as follows: H0: μ ≤ 20 yards and H1: μ > 20
yards. Note that I used the alternative hypothesis to represent the claim that I
want to prove statistically so that I can make a fortune selling these balls to desperate
golfers such as myself. Because of this, the alternative hypothesis is also known as the
research hypothesis because it represents the position that the researcher wants to
establish.
A Two-Tail Hypothesis Test is used whenever the alternative hypothesis is
expressed as ≠. Our snake example would involve a two-tail test because the
alternative hypothesis is stated as H1: μ ≠ 6.0. This test is shown graphically in Figure
15.1 which, as you can see, is considered a two-tail hypothesis test.
The curve in the figure represents the sampling distribution of the mean for the number
of days to catch a snake. The mean of the population, assumed to be 6.0 days according
to the null hypothesis, is the mean of the sampling distribution and is designated by
$\mu_{H_0}$.
The procedure is as follows:
- collect a sample of size n, and calculate the test statistic, which in this case is the
sample mean.
- plot the sample mean on the x-axis of the sampling distribution curve.
- if the sample mean falls within the white region, we do not reject H0. That is, we do
not have enough evidence to support H1, the alternative hypothesis, which states that
the population mean is not equal to 6.0 days.
- if the sample mean falls in either shaded region, otherwise known as the rejection
region, we reject H0. That is, we have enough evidence to support H1, which results in
our belief that the true population mean is not equal to 6.0 days.
Because there are two rejection regions in this figure, we have a two-tail hypothesis
test.
Because our conclusions are based on a sample, we will never have enough evidence to
accept the null hypothesis. It's a much safer statement to say that we do not have
enough evidence to reject H0. We can use the analogy of the legal system to explain. If
a jury finds a defendant "not guilty," they are not saying the defendant is
innocent. Rather, they are saying that there is not enough evidence to prove guilt.
A One-Tail Hypothesis Test involves the alternative hypothesis being stated as < or >.
My golf ball example results in a one-tail test because the alternative hypothesis is
being expressed as H1: μ > 20.
Here, there is only one rejection region, which is the shaded area on the right tail of the
distribution. We follow the same procedure outlined for the two-tail test and plot the
sample mean, which represents the average increase in distance from the tee with my
new golf ball. Two possible scenarios exist.
- If the sample mean falls within the white region, we do not reject H0. That is, we do
not have enough evidence to support H1, the alternative hypothesis, which states that
my golf ball increased distance off the tee by more than 20 yards. There goes my
fortune down the drain!
- If the sample mean falls in the rejection region, we reject H0. That is, we have enough
evidence to support H1, which confirms my claim that my new golf ball will increase
distance off the tee by more than 20 yards.
Errors occurring during sampling: Type I and II errors
Remember that the purpose of the hypothesis test is to verify the validity of a claim
about a population based on a single sample. Because we are relying on a sample, we
expose ourselves to the risk that our conclusions about the population will be wrong.
Using the golf ball example, suppose that my sample falls within the "Reject H0" region
of the last figure. That is, according to the sample, my golf ball increases distance off
the tee by more than 20 yards. But what if the true population mean is actually much
less than 20 yards? This can occur primarily because of sampling error.
This type of error, when we reject H0 when in reality it's true, is known as a Type I
error. The probability of making a Type I error is known as α, the level of significance.
We also can experience another type of error with hypothesis testing. Let's say the golf
ball sample fell within the "Do Not Reject H0" region of the last figure. That is,
according to the sample, my golf ball does not increase the distance off the tee by more
than 20 yards. But what if the true population mean is actually much more than
20 yards? This type of error, when we do not reject H0 when in reality it's false, is
known as a Type II error. The probability of making a Type II error is known as β.
Normally, with hypothesis testing, we decide on a value for α that is somewhere
between 0.01 and 0.10 before we collect the sample.
Example of a Two-Tail Hypothesis Test
I stated the hypotheses for the snake example as: H0: μ = 6.0 days and H1: μ ≠ 6.0
days.
Where μ = the mean number of days to catch a loose snake in a home.
Let's say that I know that the standard deviation of the population, σ, is 0.5 days, and
my sample size to test the hypothesis, n, is 30 homes.
We'll also set α = 0.05, which means I'm willing to accept a 5 percent chance of
committing a Type I error. Our first step is to calculate the standard error of the mean,
$\sigma_{\bar{x}} = \sigma / \sqrt{n} = 0.0913$ days.
Let's assume the sample mean from the 30 homes is 6.1 days. What is our conclusion
about our estimate of the population mean, μ? To answer this, we next have to
determine the critical z-score, which corresponds to α = 0.05. Because this is a two-tail
test, this area needs to be evenly divided between both tails, with each tail receiving
α/2 = 0.025. According to Figure 15.3, we need to find the critical z-score that
corresponds to the area 0.950 + 0.025 = 0.975. As you can see, the 0.950 area is
derived from 1 - α.
Using Table 3 in Appendix B, we look for the closest value to 0.9750 in the body of the
table. We can find this value by looking across row 1.9 and down column 0.06 to arrive
at the z-score of +1.96 for the right tail and -1.96 for the left tail.
Using the Scale of the Original Variable
Now let's determine the rejection region using the scale of the original variable, which
in this case is the number of days. To calculate the upper and lower limits of the
rejection region, we use the following equations.
- we use the z-scores from the standard normal distribution when n ≥ 30 and σ is
known.
- limits of rejection region = μH0 ± zc σx̄
- where μH0 = the population mean assumed by the null hypothesis
- for our snake example: upper limit = 6.0 + 1.96(0.0913) = 6.18 days, lower limit =
6.0 − 1.96(0.0913) = 5.82 days
- because our sample mean is 6.1 days, this falls within the Do Not Reject H0 region.
Our conclusion is that the difference between 6.1 days and 6.0 days is merely due to
chance variation, and we do not have enough evidence to reject the claim that the
population mean is 6.0 days.
Using the Standardized Normal Scale
We can arrive at the same conclusion by setting up the boundaries for the rejection
region using the standardized normal scale. We do this by calculating the z-score that
corresponds to the sample mean as follows:
z = (x̄ − μH0) / σx̄ = (6.1 − 6.0) / 0.0913 = +1.09
Be sure to distinguish between the calculated z-score and the critical z-score. The
calculated z-score, z, represents the number of standard errors between the sample
mean and μH0, the population mean according to the null hypothesis. The critical
z-score, zc, is based on the significance level, α, and determines the boundary for the
rejection region. Because the calculated z-score of +1.09 is within the Do Not Reject
H0 region (between −1.96 and +1.96), the conclusions of both techniques are consistent.
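The two-tail snake example can be checked numerically. The sketch below is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` stands in for the z-table in Appendix B. The function name `two_tail_z_test` and the layout of its result are illustrative choices, not part of the original text.

```python
from math import sqrt
from statistics import NormalDist


def two_tail_z_test(sample_mean, mu_h0, sigma, n, alpha):
    """Two-tail z-test for a mean when sigma is known (illustrative helper)."""
    std_error = sigma / sqrt(n)                    # standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # area 1 - alpha/2 to the left
    lower = mu_h0 - z_crit * std_error             # limits on the original scale
    upper = mu_h0 + z_crit * std_error
    z_calc = (sample_mean - mu_h0) / std_error     # calculated z-score
    return {
        "std_error": std_error,
        "z_crit": z_crit,
        "lower": lower,
        "upper": upper,
        "z_calc": z_calc,
        "reject_h0": abs(z_calc) > z_crit,
    }


# Snake example: sample mean 6.1 days, H0 mean 6.0 days, sigma 0.5, n = 30, alpha = 0.05
result = two_tail_z_test(6.1, 6.0, 0.5, 30, 0.05)
for name, value in result.items():
    print(name, value)
```

Both views of the test give the same decision: the sample mean lies between the two limits, and the calculated z-score (about 1.095 before rounding, reported as +1.09 in the text) lies between −1.96 and +1.96.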
Example of a One-Tail Hypothesis Test
Because I formulated the alternative hypothesis for the golf ball example as μ > 20, this
becomes a one-tail test. The hypothesis for this example is stated as: H0: μ ≤ 20 yards
and H1: μ > 20 yards, where μ = the mean increase in yards off the tee using my new
golf ball.
Let's say that I know that the standard deviation of the population, σ, is 5.3 yards and
my sample size to test the hypothesis, n, is 40 golfers. For this example, we'll set α =
0.01. The standard error of the mean, σx̄, will now be equal to σ / √n = 5.3 / √40 =
0.838 yards.
Let's assume the sample mean from the 40 golfers is 22.5 yards. What is our conclusion
about our estimate of the population mean, μ?
Once again, we next have to determine the critical z-score, which corresponds to α =
0.01. Because this is a one-tail test, this entire area needs to be in one rejection region
on the right side of the distribution. We need to find the z-score that corresponds to the
area 0.99 or 1 − α. To calculate the limit for this rejection region using the scale of the
original variable, we use:
Limit = μH0 + zc σx̄ = 20 + 2.33 × 0.838 = 21.95 yards
Because our sample mean is 22.5 yards, this falls within the Reject H0 region. Our
conclusion is that we have enough evidence to support the hypothesis that the mean
increase in distance off the tee with my new balls exceeds 20 yards.
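The one-tail golf ball test can be sketched the same way, and the same limit also lets us put a number on a Type II error. In the sketch below, `statistics.NormalDist` replaces the z-table, the helper names are illustrative, and the "true" population mean of 23 yards is purely a hypothetical value chosen to show how β would be computed; it does not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_limit(mu_h0, sigma, n, alpha):
    """Rejection limit on the original scale for H1: mu > mu_h0."""
    std_error = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # whole alpha area in the right tail
    return mu_h0 + z_crit * std_error, std_error


def type_ii_error_probability(limit, std_error, true_mean):
    """beta = P(sample mean falls below the limit) if the population mean is true_mean."""
    return NormalDist(true_mean, std_error).cdf(limit)


# Golf ball example: H0 mean 20 yards, sigma 5.3 yards, n = 40 golfers, alpha = 0.01
limit, std_error = upper_tail_limit(20, 5.3, 40, 0.01)
reject_h0 = 22.5 > limit                       # sample mean from the text

# HYPOTHETICAL true mean of 23 yards, used only to illustrate beta
beta = type_ii_error_probability(limit, std_error, true_mean=23.0)

print(f"limit = {limit:.2f} yards, reject H0: {reject_h0}")
print(f"beta if the true mean were 23 yards: {beta:.3f}")
```

Under that assumed true mean, roughly one sample in ten would still land in the Do Not Reject H0 region, which is exactly the Type II error described earlier.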
Advanced Inferential Statistics
- we can determine whether two categorical variables are related (chi-square),
compare three or more populations (analysis of variance), and describe the strength
and direction of the relationship between two variables (simple regression).
The Chi-Square Probability Distribution
- we can confirm whether a set of data follows a specific probability distribution, such
as the binomial or Poisson.
- we can determine whether two variables are statistically independent.
Earlier we discussed the different types of data measurement scales, which were nominal,
ordinal, interval, and ratio. Here is a brief refresher of each:
- nominal level of measurement deals strictly with qualitative data.
Observations are simply assigned to predetermined categories. One example is gender
of the respondent with the categories being male and female.
- ordinal measurement is the next level up. It has all the properties of
nominal data with the added feature that we can rank order the values from highest to
lowest. An example would be ranking a movie as great, good, fair, or poor.
- interval level of measurement involves strictly quantitative data. Here we
can use the mathematical operations of addition and subtraction when comparing
values. For this data, the difference between the different categories can be measured
with actual numbers and also provides meaningful information. Temperature
measurement in degrees Fahrenheit is a common example here.
- ratio level is the highest measurement scale. Now we can perform all four
mathematical operations to compare values. Examples of this type of data are age,
weight, height, and salary. Ratio data has all the features of interval data with the
added benefit of a true zero point, meaning that a zero data value indicates the
absence of the object being measured.
- the chi-square distribution in this chapter will allow us to perform hypothesis
testing on nominal and ordinal data.
The two major techniques that we will learn about are using the chi-square distribution
to perform a goodness-of-fit test and to test for the independence of two variables.
Goodness-of-fit test: uses a sample to test whether a frequency distribution fits
the predicted distribution.

Can we conclude that the expected movie ratings are true based on the observed
ratings of 400 people?
- stating the Null and Alternative Hypothesis:
H0: the sample of observed frequencies supports the claim about the expected
frequencies.
H1: there is no support for the claim pertaining to the expected frequencies.
- the total number of expected frequencies (E) must be equal to the total number of
observed frequencies (O).
- for our movie example, the observed frequencies are simply the number of
observations collected for each category of our sample. The expected frequencies are
the expected number of observations for each category and are calculated in the
following table.
- observed frequencies are the number of actual observations noted for each category
of a frequency distribution with chi-squared analysis. Expected frequencies are the
number of observations that would be expected for each category of a frequency
distribution assuming the null hypothesis is true with chi-squared analysis.

- calculating the Chi-Square Statistic: χ² = Σ (O − E)² / E, summed over all categories.
- determining the Critical Chi-Square Score, which depends on the number of degrees
of freedom, d.f. = k − 1, where k = number of categories in the frequency distribution.
For our example, k = 5. The Critical Chi-Square Score is read from the table. For
α = 0.10 and d.f. = 4, it's 7.779.
- the calculated chi-square score of 9.95 is within the Reject H0 region, which leads
us to the conclusion that the actual movie-rating frequency distribution differs from the
expected distribution. We will always reject H0 as long as χ²c ≤ χ², where χ²c is the
critical score and χ² is the calculated score.
- also, because the calculated chi-square score for the goodness-of-fit test can only be
positive, the hypothesis test will always be a one-tail with the rejection region on the
right side.
- in Excel, the critical chi-square score can be found with =CHIINV(probability, deg_freedom).
- the chi-square distribution is not symmetrical but rather has a positive skew. The
shape of the distribution will change with the number of degrees of freedom. As the
number of degrees of freedom increases, the shape of the chi-square distribution
becomes more symmetrical.
A goodness-of-fit test with the binomial distribution:
- suppose that a certain major league baseball player claims the probability that he will
get a hit at any given time is 30 percent. The following table is a frequency distribution
of the number of hits per game over the last 100 games. Assume he has come to bat
four times in each of the games.
- in other words, in 26 games he had 0 hits, in 34 games he had 1 hit, etc. Test the claim
that this distribution follows a binomial distribution with p = 0.30 using α = 0.05.
The hypothesis statement would look like the following:
H0: The distribution of hits by the baseball player can be described with the binomial
probability distribution using p = 0.30.
H1: The distribution differs from the binomial probability distribution using p = 0.30.
Our first step is to calculate the frequency distribution for the expected number of hits
per game. To do this, we need to look up the binomial probabilities in the binomial table
for n = 4 (the number of trials per game) and p = 0.30 (the probability of a success).
- before continuing, we need to make one adjustment to the expected frequencies.
When using the chi-square test, we need at least five observations in each of the
expected frequency categories. If there are fewer than five, we need to combine
categories. In the previous table, we will combine 3 and 4 hits per game into one
category to meet this requirement.
According to Figure 18.4, the calculated chi-square score of 2.20 is within the Do Not
Reject H0 region, which leads us to the conclusion that the baseball player's hitting
distribution can be described with the binomial distribution using p = 0.30.
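The baseball example can be reproduced with a short sketch. The expected frequencies follow directly from n = 4, p = 0.30, and 100 games. The observed table is not reproduced here, so only the counts 26 and 34 come from the text; the remaining counts (30 games with 2 hits, 10 games with 3 or 4 hits) are an assumption, chosen because they total 100 games and give the reported score of 2.20. The critical value 7.815 is the standard chi-square table entry for α = 0.05 with d.f. = 3.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


N_GAMES, AT_BATS, P_HIT = 100, 4, 0.30

# Expected number of games with 0, 1, 2, 3, and 4 hits under the binomial claim
expected_all = [N_GAMES * binomial_pmf(k, AT_BATS, P_HIT) for k in range(AT_BATS + 1)]

# Combine "3 hits" and "4 hits" so every expected frequency is at least 5
expected = expected_all[:3] + [expected_all[3] + expected_all[4]]

# 26 and 34 are from the text; 30 and 10 are ASSUMED values (see the note above)
observed = [26, 34, 30, 10]

chi_sq = chi_square_statistic(observed, expected)
CRITICAL = 7.815  # table value for alpha = 0.05, d.f. = 4 categories - 1 = 3

print("expected:", [round(e, 2) for e in expected])
print(f"chi-square = {chi_sq:.2f}, reject H0: {chi_sq > CRITICAL}")
```

Because 2.20 is well below the critical value, the sketch reaches the same Do Not Reject H0 conclusion as the text.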
Chi-Square Test for Independence
This is known as a contingency table, which shows the observed frequencies of two
variables. In this case, the variables are warm-up time and tennis player. The table is
organized into r rows and c columns. For our table, r = 2 and c = 3. An intersection of a
row and column is known as a cell. A contingency table has r × c cells, which in our
case, would be 6.
The chi-square test of independence will determine whether the proportion of times
that Debbie wins is the same for all three warm-up periods. If the outcome of the
hypothesis test is that the proportions are not the same, we conclude that the length of
warm-up does impact the performance of the players.
First we state the hypotheses as:
H0: Warm-up time is independent of performance
H1: Warm-up time affects performance
The chi-square test of independence only investigates whether a relationship exists
between two variables. It does not conclude anything about the direction of the
relationship. In other words, from a statistical perspective, Debbie cannot claim that
she is disadvantaged by the short warm-up time. She can only claim that warm-up time
has some effect on her performance.
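A test of independence on an r × c contingency table can be sketched as follows. The original table is not reproduced here, so the counts below are hypothetical, as is the function name. The sketch assumes the usual rule that each expected cell frequency equals (row total × column total) / grand total and that d.f. = (r − 1)(c − 1); neither formula is stated in the text above.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (observed - expected) ** 2 / expected

    d_f = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, d_f


# HYPOTHETICAL counts: rows = match result (Debbie wins, opponent wins),
# columns = three warm-up periods
observed_table = [
    [4, 10, 9],
    [14, 9, 4],
]
chi_sq, d_f = chi_square_independence(observed_table)
CRITICAL = 5.991  # table value for an assumed alpha = 0.05 with d.f. = 2

print(f"chi-square = {chi_sq:.2f}, d.f. = {d_f}, reject H0: {chi_sq > CRITICAL}")
```

With these made-up counts the statistic exceeds the critical value, so H0 would be rejected. As the note above stresses, that only says warm-up time and performance are related, not in which direction.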
Analysis of Variance
1) One-Way Analysis of Variance
If you want to compare the means for three or more populations, ANOVA is the test for
you. Let's say I'm interested in determining whether there is a difference in consumer
satisfaction ratings between three fast-food chains. I would collect a sample of
satisfaction ratings from each chain and test to see whether there is a significant
difference between the sample means.
Essentially, I'm testing to see whether the variations in customer ratings from the
previous table are due to the fast-food chains or whether the variations are purely
random. In other words, do customers perceive any differences in satisfaction between
the three chains? If I reject the null hypothesis, however, my only conclusion is that a
difference does exist. Analysis of variance does not allow me to compare population
means to one another to determine which is greater. That task requires further
analysis.
To use one-way ANOVA, the following conditions must be present:
- the populations of interest must be normally distributed.
- the samples must be independent of each other.
- each population must have the same variance.
A factor in ANOVA describes the cause of the variation in the data. In the previous
example, the factor would be the fast-food chain. This would be considered a one-way
ANOVA because we are considering only one factor.
A level in ANOVA describes the number of categories within the factor of interest. For
our example, we have three levels based on the three different fast-food chains being
examined.
2) Completely Randomized ANOVA
The simplest type of ANOVA is known as completely randomized one-way ANOVA,
which involves an independent random selection of observations for each level of one
factor.
Example:
I'm interested in comparing the effectiveness of three lawn fertilizers. Suppose I select
18 random patches of my precious lawn and apply either Fertilizer 1, 2, or 3 to each of
them. After a week, I mow the patches and weigh the grass clippings.
The factor in this example is fertilizer. There are three levels, representing the three
types of fertilizer we are testing. The table that follows indicates the weight of the
clippings in pounds from each patch. The mean and variance of each level are also
shown.
We'll refer to the data for each type of fertilizer as a sample. From the previous table,
we have three samples, each consisting of six observations. The hypotheses statement
can be stated as:
H0: μ1 = μ2 = μ3 and H1: not all μ's are equal, where μ1, μ2, μ3 are the true population
means for the pounds of grass clippings for each type of fertilizer.
3) Partitioning the Sum of Squares
The hypothesis test for ANOVA compares two types of variations from the samples. We
first need to recognize that the total variation in the data from our samples can be
divided, or as statisticians like to say, partitioned, into two parts. The first part is the
variation within each sample, which is officially known as the sum of squares within
(SSW). This can be found using the following equation:
SSW = Σ (ni − 1) si², summed over i = 1, ..., k,
where ni and si² are the size and variance of sample i and k = the number of samples
(or levels). For the fertilizer example, k = 3 and SSW = 18.35.
The second part is the variation between the samples, known as the sum of squares
between (SSB): SSB = Σ ni (x̄i − x̿)², where x̄i is the mean of sample i and x̿ is the
grand mean of all observations. For the fertilizer example, SSB = 10.86.
ANOVA does not require that all the sample sizes are equal, as they are in the fertilizer
example.
Finally, the total variation of all the observations is known as the total sum of squares
(SST) and can be found by: SST = Σ Σ (xij − x̿)², summed over every observation j in
every sample i.
This equation may look nasty, but it is just the difference between each observation and
the grand mean squared and then totaled over all of the observations. This is clarified
more in the following table.
Note that we can determine the variance of the original 18 observations, s², by:
s² = SST / (N − 1) = 29.21 / 17 = 1.72, where N = the total number of observations.
This result can be confirmed by using the variance equation or by using Excel.
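The partition of the total variation can be verified with a few lines of code. The fertilizer table is not reproduced here, so the clipping weights below are hypothetical and will not match the 18.35, 10.86, and 29.21 quoted in the text. What the sketch is meant to show is the identity SST = SSW + SSB and the link between SST and the overall sample variance.

```python
from statistics import mean, variance


def partition_sum_of_squares(samples):
    """Return (SSW, SSB, SST) for a list of samples (one list per level)."""
    all_obs = [x for sample in samples for x in sample]
    grand_mean = mean(all_obs)
    # Within: (n_i - 1) * s_i^2 summed over the samples
    ssw = sum((len(s) - 1) * variance(s) for s in samples)
    # Between: n_i * (sample mean - grand mean)^2 summed over the samples
    ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Total: squared distance of every observation from the grand mean
    sst = sum((x - grand_mean) ** 2 for x in all_obs)
    return ssw, ssb, sst


# HYPOTHETICAL pounds of grass clippings, six patches per fertilizer
samples = [
    [10.2, 8.5, 8.4, 10.5, 9.0, 8.1],
    [11.6, 12.8, 10.9, 10.9, 9.6, 10.0],
    [8.1, 8.9, 10.8, 9.5, 8.4, 10.2],
]
ssw, ssb, sst = partition_sum_of_squares(samples)
n_total = sum(len(s) for s in samples)
total_variance = sst / (n_total - 1)

print(f"SSW = {ssw:.2f}, SSB = {ssb:.2f}, SST = {sst:.2f}")
print(f"SSW + SSB = {ssw + ssb:.2f}, overall variance = {total_variance:.2f}")
```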
4) Determining the Calculated F-Statistic
To test the hypothesis for ANOVA, we need to compare the calculated test statistic to a
critical test statistic using the F-distribution. The calculated F-statistic can be found
using the equation: F = MSB / MSW,
where MSB is the mean square between, found by MSB = SSB / (k − 1), and
MSW is the mean square within, found by MSW = SSW / (N − k).
Example: MSB = 10.86 / (3 − 1) = 5.43
MSW = 18.35 / (18 − 3) = 1.22
F = 5.43 / 1.22 = 4.45
If the variation between the samples (MSB) is much greater than the variation within
the samples (MSW), we will tend to reject the null hypothesis and conclude that there is
a difference between population means. To complete our test for this hypothesis, we
need to introduce the F-distribution.
The mean square between (MSB) is a measure of variation between the sample means.
The mean square within (MSW) is a measure of variation within each sample. A large
MSB variation, relative to the MSW variation, indicates that the sample means are not
very close to one another. This condition will result in a large value of F, the calculated
F-statistic. The larger the value of F, the more likely it will exceed the critical F-statistic
(to be determined shortly), leading us to conclude there is a difference between
population means.
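The arithmetic for the calculated F-statistic is easy to script. The sketch below uses the sums of squares quoted in the example (SSB = 10.86, SSW = 18.35, k = 3, N = 18); the function name is an illustrative choice. Carrying full precision gives F of about 4.44, while the text's 4.45 comes from rounding MSW to 1.22 before dividing.

```python
def one_way_anova_f(ssb, ssw, k, n_total):
    """Mean squares and F-statistic for a one-way ANOVA."""
    msb = ssb / (k - 1)          # mean square between
    msw = ssw / (n_total - k)    # mean square within
    return msb, msw, msb / msw


# Fertilizer example values from the text
msb, msw, f_stat = one_way_anova_f(ssb=10.86, ssw=18.35, k=3, n_total=18)
print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_stat:.2f}")
```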
5) Determining the Critical F-Statistic
We use the F-distribution to determine the critical F-statistic, which is compared to the
calculated F-statistic for the ANOVA hypothesis test. The critical F-statistic,
F(α, k − 1, N − k), depends on two different degrees of freedom, which are determined
by: v1 = k − 1 and v2 = N − k.
For our fertilizer example: v1 = 3 − 1 = 2 and v2 = 18 − 3 = 15.
The critical F-statistic is read from the F-distribution table. For α = 0.05, the table
value is F(0.05, 2, 15) = 3.682. Because the calculated F-statistic of 4.45 exceeds this
critical value, we reject H0.
Even though we have rejected H0 and concluded that the population means are not all
equal, ANOVA does not allow us to make comparisons between means. In other words,
we do not have enough evidence to conclude that Fertilizer 2 produces more grass
clippings than Fertilizer 1. This requires another test known as pairwise comparisons,
which we'll address later in this chapter.
6) Using Excel to Perform One-Way ANOVA
1. Start by placing the fertilizer data in Columns A, B, and C in a blank sheet.
2. Go to the Tools menu and select Data Analysis. (Refer to the section Installing the
Data Analysis Add-in from Chapter 2 if you don't see the Data Analysis command on
the Tools menu.)
3. From the Data Analysis dialog box, select Anova: Single Factor as shown in Figure
19.2 and click OK.
4. Set up the Anova: Single Factor dialog box according to Figure 19.3.
5. Click OK. Figure 19.4 shows the final ANOVA results.
Notice that the p-value = 0.0305 for this test, meaning we can reject H0, because this
p-value < α. If you remember, we had set α = 0.05 when we stated the hypothesis test.

7) Pairwise Comparisons
8) Completely Randomized Block ANOVA
One concern in this scenario is that the variations in the lawns will account for some of
the variation in the three fertilizers, which may interfere with our hypothesis test. We
can control for this possibility by using a completely randomized block ANOVA, which is
used in the previous table. The type of fertilizer is still the factor, and the lawns are
called blocks.
There are two hypotheses for the completely randomized block ANOVA. The first
(primary) hypothesis tests the equality of the population means, just like we did earlier
with one-way ANOVA: H0: μ1 = μ2 = μ3 and H1: not all μ's are equal, where μ1, μ2, μ3
are the true population means for the pounds of grass clippings for each type of
fertilizer.
The secondary hypothesis tests the effectiveness of the blocking variable as follows:
H0': the block means are all equal, and H1': the block means are not all equal.
The blocking variable would be an effective contributor to our ANOVA model if we can
reject H0' and claim that the block means are not equal to each other.
9) Partitioning the Sum of the Squares
For the completely randomized block ANOVA, the sum of squares total is partitioned
into three parts according to the following equation: SST = SSW + SSB + SSBL, where:
SSW = sum of squares within, SSB = sum of squares between, SSBL = sum of squares
for the blocking variable (lawns).
Fortunately for us, the calculations for SST and SSB are identical to the one-way
ANOVA procedure that we've already discussed, so those values remain unchanged
(SST = 29.21 and SSB = 10.86). We can find the sum of squares block (SSBL) by using
the equation: SSBL = k Σ (x̄j − x̿)², summed over the b blocks, where x̄j is the mean of
block j and x̿ is the grand mean. For the fertilizer example, SSBL = 0.72, which leaves
SSW = SST − SSB − SSBL = 29.21 − 10.86 − 0.72 = 17.63.
10) Determining the Calculated F-Statistic
Since we have two hypothesis tests for the completely randomized block ANOVA, we
have two calculated F-statistics. The F-statistic to test the equality of the population
means (the original hypothesis) is found using: F = MSB / MSW, where MSB is the mean
square between, found by: MSB = SSB / (k − 1), and MSW = SSW / ((k − 1)(b − 1)),
where b = the number of blocks.
Example: MSB = 10.86 / (3 − 1) = 5.43
MSW = 17.63 / ((3 − 1)(6 − 1)) = 1.76
F = 5.43 / 1.76 = 3.09
The second F-statistic will test the significance of the blocking variable (the second
hypothesis) and will be denoted F'. We will determine this statistic using: F' = MSBL /
MSW, where MSBL is the mean square blocking, found by MSBL = SSBL / (b − 1).
Example: MSBL = 0.72 / (6 − 1) = 0.14
F' = MSBL / MSW = 0.14 / 1.76 = 0.08
1) First, we examine the primary hypothesis, H0, that all population means are equal
using α = 0.05. The degrees of freedom for this critical F-statistic would be: v1 = k − 1 =
3 − 1 = 2 and v2 = (k − 1)(b − 1) = (3 − 1)(6 − 1) = 10.
2) The critical F-statistic from tables is F(0.05, 2, 10) = 4.103.
3) Since the calculated F-statistic equals 3.09 and is less than this critical F-statistic,
we fail to reject H0 and cannot conclude that the fertilizer means are different.
4) We examine the secondary hypothesis H0', concerning the effectiveness of the
blocking variable, also using α = 0.05. The degrees of freedom for this critical F-statistic
would be: v1' = b − 1 = 5 and v2' = (k − 1)(b − 1) = 2 × 5 = 10.
The critical F-statistic from the table is F = 3.326. Since the calculated F-statistic F'
equals 0.08 and it's less than this critical F-statistic, we fail to reject H0' and cannot
conclude that the block means are different.
5) What does this mean? Since we failed to reject H0', the hypothesis that states the
blocking means are equal, the blocking variable (lawns) proved not to be effective and
should not be included in the model. Including an ineffective blocking variable in the
ANOVA increases the chance of a Type II error in the primary hypothesis, H0. The
conclusion of the primary hypothesis in this example would be more precise without the
blocking variable. In fact, this is what essentially happened when we included the
blocking variable with the randomized block design. With the blocking variable present
in the model, we failed to discover a difference in the population means. Now go
back to the beginning of the chapter. When we tested the population means using
one-way ANOVA (without a blocking variable), we concluded that the population means
were indeed different.
In summary (it's about time!), if you feel there is a variable present in your model that
could contribute undesirable variation, such as taking samples from different lawns,
use the randomized block ANOVA. First test H0', the blocking hypothesis.
If you reject H0', the blocking procedure was effective. Proceed to test H0, the primary
hypothesis concerning the population means, and draw your conclusions.
If you fail to reject H0', the blocking procedure was not effective. Redo the analysis
using one-way ANOVA (without blocking) and draw your conclusions.
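The two F-statistics for the randomized block design, and the decision procedure just summarized, can be sketched from the sums of squares quoted in the text (SST = 29.21, SSB = 10.86, SSBL = 0.72, k = 3 fertilizers, b = 6 lawns). The function name and result layout are illustrative. The critical values 4.103 and 3.326 are the table values given above for α = 0.05.

```python
def randomized_block_f(sst, ssb, ssbl, k, b):
    """F-statistics for the treatment (F) and blocking (F') hypotheses."""
    ssw = sst - ssb - ssbl                 # what is left after both partitions
    msb = ssb / (k - 1)                    # mean square between
    msbl = ssbl / (b - 1)                  # mean square blocking
    msw = ssw / ((k - 1) * (b - 1))        # mean square within
    return {"SSW": ssw, "F": msb / msw, "F_block": msbl / msw}


stats = randomized_block_f(sst=29.21, ssb=10.86, ssbl=0.72, k=3, b=6)

F_CRIT_PRIMARY = 4.103   # F(0.05, 2, 10) from the text
F_CRIT_BLOCK = 3.326     # F(0.05, 5, 10) from the text

blocking_effective = stats["F_block"] > F_CRIT_BLOCK
reject_primary = stats["F"] > F_CRIT_PRIMARY

print(f"SSW = {stats['SSW']:.2f}, F = {stats['F']:.2f}, F' = {stats['F_block']:.2f}")
if blocking_effective:
    print("Blocking was effective; reject primary H0:", reject_primary)
else:
    print("Blocking was not effective; redo the analysis with one-way ANOVA.")
```

Unrounded, F is about 3.08 rather than the 3.09 obtained from the rounded mean squares; the decisions are the same either way.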
Summary:
Analysis of variance, also known as ANOVA, compares the means of three or more
populations.
A factor in ANOVA describes the cause of the variation in the data. When only one
factor is being considered, the procedure is known as one-way ANOVA.
A level in ANOVA describes the number of categories within the factor of interest.
The simplest type of ANOVA is known as completely randomized one-way ANOVA,
which involves an independent random selection of observations for each level of one
factor.
Completely randomized block ANOVA controls for variations from other sources than
the factors of interest. This is accomplished by grouping the samples using a blocking
variable.
After rejecting H0 using ANOVA, we can determine which of the sample means are
different using the Scheffé test.
Correlation and Simple Regression
- how two variables relate to one another
- determine whether a relationship does indeed exist between the variables
- describe the nature of this relationship in mathematical terms
Independent versus Dependent Variables
Example: investigate the relationship between the number of hours that a student
studies for a statistics exam and the grade for that exam.

Obviously, we would expect the number of hours studying to affect the grade. The
Hours Studied variable is considered the independent variable (x) because it causes the
observed variation in the Exam Grade, which is considered the dependent variable (y).
The data from the previous table is considered ordered pairs of (x, y) values, such as
(3, 86) and (5, 95).
This causal relationship between independent and dependent variables only exists in
one direction, as shown here:
Independent variable (x) -> Dependent variable (y)
This relationship does not work in reverse. For instance, we would not expect that the
exam grade variable would cause the student to study a certain number of hours in our
previous example.
Other examples of independent and dependent variables are shown in the following
table.
Correlation
Correlation measures both the strength and direction of the relationship between x and
y. Figure 20.1 illustrates the different types of correlation in a series of scatter plots,
which graphs each ordered pair of (x, y). The convention is to place the x variable on the
horizontal axis and the y variable on the vertical axis.
Graph A in Figure 20.1 shows an example of positive linear correlation where, as x
increases, y also tends to increase in a linear (straight line) fashion.
Graph B shows a negative linear correlation where, as x increases, y tends to decrease
linearly.
Graph C indicates no correlation between x and y. This set of variables appears to have
no impact on each other.
And finally, Graph D is an example of a nonlinear relationship between variables. As x
increases, y decreases at first and then changes direction and increases.
Correlation Coefficient
The correlation coefficient, r, provides us with both the strength and direction of the
relationship between the independent and dependent variables. Values of r range
between −1.0 and +1.0.
When r is positive, the relationship between x and y is positive (Graph A from Figure
20.1), and when r is negative, the relationship is negative (Graph B).
A correlation coefficient close to 0 is evidence that there is no linear relationship
between x and y (Graph C).
The strength of the relationship between x and y is measured by how close the
correlation coefficient is to +1.0 or −1.0 and can be viewed in Figure 20.2.
Graph A illustrates a perfect positive correlation between x and y with r = +1.0.
Graph B shows a perfect negative correlation between x and y with r = −1.0.
Graphs C and D are examples of weaker relationships between the independent and
dependent variables.
We can calculate the actual correlation coefficient using the following equation:
r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
Example:
Using these values along with n = 6, the number of ordered pairs, we have:
Testing the Significance of the Correlation Coefficient
We can perform a hypothesis test to determine whether the population correlation
coefficient, ρ, is significantly different from 0 based on the value of the calculated
correlation coefficient, r. We can state the hypotheses as:
H0: ρ ≤ 0 and H1: ρ > 0
This statement tests whether a positive correlation exists between x and y. I could also
choose a two-tail test that would investigate whether any correlation exists (either
positive or negative) by setting H0: ρ = 0 and H1: ρ ≠ 0.
The test statistic for the correlation coefficient uses the Student's t-distribution as
follows, where r = the calculated correlation coefficient from the ordered pairs and n =
the number of ordered pairs: t = r / √((1 − r²) / (n − 2))
For the exam grade example, the calculated t-statistic becomes:

The critical t-statistic is based on d.f. = n − 2. If we choose α = 0.05, tc = 2.132 from
Table 4 in Appendix B for a one-tail test. Because t > tc, we reject H0 and conclude that
there is indeed a positive correlation between hours of study and the exam grade.
Using Excel to Calculate Correlation Coefficients
=CORREL(array1, array2), where array1 = the range of data for the first variable and
array2 = the range of data for the second variable.
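Outside Excel, the correlation coefficient and its t-statistic take only a few lines. In the sketch below, the pairs (3, 86) and (5, 95) come from the text, but the other four pairs are hypothetical because the original table is not reproduced here, so the resulting r and t are illustrative only. The function names are also illustrative.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson r from the sums of x, y, xy, x^2, and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_t_statistic(r, n):
    """t-statistic for testing whether the population correlation is 0."""
    return r / sqrt((1 - r ** 2) / (n - 2))


# (3, 86) and (5, 95) are from the text; the remaining pairs are HYPOTHETICAL
hours = [3, 5, 4, 4, 2, 3]
grades = [86, 95, 92, 83, 78, 82]

r = correlation_coefficient(hours, grades)
t = correlation_t_statistic(r, len(hours))
T_CRITICAL = 2.132  # one-tail, alpha = 0.05, d.f. = 6 - 2 = 4 (from the text)

print(f"r = {r:.3f}, t = {t:.2f}, reject H0: {t > T_CRITICAL}")
```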
Simple Regression
The technique of simple regression enables us to describe a straight line that best fits a
series of ordered pairs (x, y). The equation for a straight line, known as a linear
equation, takes the form: ŷ = a + bx, where ŷ = the predicted value of y, given a value
of x, x = the independent variable, a = the y-intercept for the straight line, b = the
slope of the straight line.
The y-intercept is the point where the line crosses the y-axis, which in this case is a = 2.
The slope of the line, b, is shown as the ratio of the rise of the line over the run of the
line, shown as b = 0.5. A positive slope indicates the line is rising from left to right. A
negative slope, you guessed it, moves lower from left to right. If b = 0, the line is
horizontal, which means there is no relationship between the independent and
dependent variables. In other words, a change in the value of x has no effect on the
value of y.
Figure 20.5 shows six ordered pairs and a line that appears to fit the data described by
the equation ŷ = 2 + 0.5x.
Figure 20.5 shows a data point that corresponds to the ordered pair x = 2 and y = 4.
Notice that the predicted value of y according to the line at x = 2 is ŷ = 3. We can
verify this using the equation as follows: ŷ = 2 + 0.5x = 2 + 0.5(2) = 3.
The value of y represents an actual data point, while the value of ŷ is the predicted
value of y using the linear equation, given a value for x. Our next step is to find the
linear equation that best fits a set of ordered pairs.
The Least Squares Method
The least squares method is a mathematical procedure to identify the linear equation
that best fits a set of ordered pairs by finding values for a, the y-intercept; and b, the
slope. The goal of the least squares method is to minimize the total squared error
between the values of y and ŷ. If we define the error as y − ŷ for each data point, the
least squares method will minimize Σ (yi − ŷi)², summed over i = 1, ..., n, where n is
the number of ordered pairs around the line that best fits the data.
According to Figure 20.6, the line that best fits the data, the regression line, will
minimize the total squared error of the four data points. I'll demonstrate how to
determine this regression equation using the least squares method through the
following example.
Because my goal is to investigate whether the number of items is increasing over time,
Month will be the independent variable and Number of Items will be the dependent
variable.
The least squares method finds the linear equation that best fits the data by
determining the value for a, the y-intercept; and b, the slope, using the following
equations:
b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and a = ȳ − b x̄
The regression line for the bathroom counter example would be: ŷ = 5.13 + 0.976x
Because the slope of this equation is a positive 0.976, I have evidence that the number
of items on the counter is increasing over time at an average rate of nearly one per
month. Figure 20.7 shows the regression line with the ordered pairs. My prediction for
the number of items on the counter in another six months (Month 16 from my data) will
be:
ŷ = 5.13 + 0.976x = 5.13 + 0.976(16) = 20.7 ≈ 21 items
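The least squares calculation can be sketched in a few lines. The bathroom counter table is not reproduced here, so the item counts below are an assumption: illustrative values for Months 1 through 10, chosen so that the fitted line matches the reported ŷ = 5.13 + 0.976x and Month 8 has 11 items. The function name is also illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x that minimizes squared error."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    a = sum_y / n - b * sum_x / n                                  # y-intercept
    return a, b


months = list(range(1, 11))
items = [8, 6, 10, 6, 10, 13, 9, 11, 15, 17]   # ASSUMED counts, see the note above

a, b = least_squares_line(months, items)
prediction = a + b * 16                         # Month 16, six months past the data

print(f"y_hat = {a:.2f} + {b:.3f}x")
print(f"predicted items in Month 16: {prediction:.1f}")
```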
Con<dence Interval for t!e Re"ression 'ine
Cust how accurate is my estimate for the number of items on the counter for a
particular monthA To answer this# we need to determine the standard error of the
estimate# se # usin the followin formula(
The standard error of the estimate measures the amount of dispersion of the observed
data around the reression line. &f the data points are very close to the line# the
standard error of the estimate is relatively low and vice versa. "or our bathroom
e*ample(
We are now ready to calculate a confdence interval for the mean of & around a
particular value of x. "or ;onth D 5x H D6 in the data# !ebbie has 77 items 5& H 776 on
the counter. The reression line predicted she would have(
&h H ?.79J >.E=@x H ?.79J >.E=@ 5D6H 78.E items.
where(
tc H the critical t%statistic from the 4tudents$
t%distribution
se H the standard error of the mean
n H the number of ordered pairs
4uppose we would like a E? percent confdence interval around the mean of & for
;onth D. To fnd our critical t%statistic# we look to Table : in Appendi* ,. This
procedure has n B 8 H 7> B 8 H D derees of freedom# resultin in tc H 8.9>@ from Table
: in Appendi* ,. 1ur confdence interval is then(
This interval is shown graphically on Figure 20.8.
Our 95 percent confidence interval for the number of items on the counter in Month 8 is
between 10.74 and 15.06 items.
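The whole procedure (fit the line, compute the standard error of the estimate, then build the interval) can be sketched in Python. The function names are hypothetical, and the item counts in the demo are made-up illustration data, not the bathroom counter data, so the printed interval will not match the 10.74 to 15.06 result above. The critical t-statistic is passed in by hand, as if read from a t-table.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept a and slope b for the ordered pairs (x, y)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


def standard_error_of_estimate(xs, ys, a, b):
    """Dispersion of the observed y values around the fitted line."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def mean_response_interval(xs, ys, x0, t_crit):
    """Confidence interval for the mean of y at x = x0."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    se = standard_error_of_estimate(xs, ys, a, b)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # equals sum(x^2) - (sum x)^2 / n
    half_width = t_crit * se * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    y_hat = a + b * x0
    return y_hat - half_width, y_hat + half_width


if __name__ == "__main__":
    months = list(range(1, 11))
    # HYPOTHETICAL counts for illustration only; not the data from the text.
    items = [8, 6, 10, 6, 10, 13, 9, 11, 15, 17]
    low, high = mean_response_interval(months, items, x0=8, t_crit=2.306)
    print(f"95% interval for the mean of y at Month 8: {low:.2f} to {high:.2f}")
```

The interval is narrowest when x0 equals the mean of the x values and widens as x0 moves away from it, which is why predictions far from the observed months are less precise.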
Testing the Slope of the Regression Line
Recall that if the slope of the regression line, b, is equal to 0, then there is no
relationship between x and y. In our bathroom counter example, we found the slope of
the regression line to be 0.976. However, because this result was based on a sample of
observations, we need to test to see whether 0.976 is far enough away from 0 to claim a
relationship really does exist between the variables. If β is the slope of the true
population, then our hypotheses statement would be: H0: β = 0 and H1: β ≠ 0.
If we reject the null hypothesis, we conclude that a relationship does exist between the
independent and dependent variables based on our sample. We'll test this using α =
0.01.
This hypothesis test requires the standard error of the slope, sb, which is found with
the following equation:
where se is the standard error of the estimate that we calculated earlier.
For our bathroom example:
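In standard notation, the standard error of the slope is usually written as follows.
This is a sketch of the common textbook form; the barred x, denoting the mean of the x
values, is a symbol assumed here.

```latex
s_b = \frac{s_e}{\sqrt{\sum x^{2} - n\,\bar{x}^{2}}}
```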
The test statistic for this hypothesis is t = (b − β0) / sb, where β0 is the value of the
population slope according to the null hypothesis.
For this example, our calculated t-statistic is:
The critical t-statistic is taken from the Student's t-distribution with n − 2 = 10 − 2 = 8
degrees of freedom. With a two-tail test and α = 0.01, tc = 3.355 according to Table 4 in
Appendix B. Because t > tc, we reject the null hypothesis and conclude there is a
relationship between the month and the number of items on the bathroom countertop.
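The slope test can be sketched in Python as follows. The function names are hypothetical, the demo counts are made-up illustration data rather than the bathroom counter data, and the critical value is supplied by hand as if read from a t-table. The comparison uses the absolute value of t so the same check works for a negative slope in a two-tail test.

```python
import math


def slope_t_statistic(xs, ys, beta_null=0.0):
    """t-statistic for testing the regression slope against beta_null."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # equals sum(x^2) - n * x_bar^2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                            # least squares slope
    a = y_bar - b * x_bar                    # least squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))            # standard error of the estimate
    sb = se / math.sqrt(sxx)                 # standard error of the slope
    return (b - beta_null) / sb


def rejects_null(t_stat, t_crit):
    """Two-tail decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > t_crit


if __name__ == "__main__":
    months = list(range(1, 11))
    # HYPOTHETICAL counts for illustration only; not the data from the text.
    items = [8, 6, 10, 6, 10, 13, 9, 11, 15, 17]
    t_stat = slope_t_statistic(months, items)
    print(f"t = {t_stat:.3f}, reject H0 at alpha 0.01: {rejects_null(t_stat, 3.355)}")
```

Note that perfectly linear data would make the standard error of the estimate zero, so this sketch would divide by zero in that case.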
The Coefficient of Determination
Another way of measuring the strength of a relationship is with the coefficient of
determination, r². This represents the percentage of the variation in y that is explained
by the regression line. We find this value by simply squaring r, the correlation
coefficient. For the bathroom example, the correlation coefficient is:
The coefficient of determination becomes:
In other words, 66.3 percent of the variation in the number of items on the counter is
explained by the Month variable. If r² = 1, all of the variation in y is explained by the
variable x. If r² = 0, none of the variation in y is explained by the variable x.
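Squaring the correlation coefficient is easy to illustrate in code. The sketch below uses the usual computational formula for r; the function names are hypothetical, and the two small data sets are invented so that the points fall exactly on a line, which is the r² = 1 case described above.

```python
import math


def correlation_coefficient(xs, ys):
    """Correlation coefficient r for the ordered pairs (x, y)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def coefficient_of_determination(xs, ys):
    """Share of the variation in y explained by the regression line."""
    return correlation_coefficient(xs, ys) ** 2


if __name__ == "__main__":
    # HYPOTHETICAL data lying exactly on a line, for illustration only.
    xs = [1, 2, 3, 4]
    rising = [3, 5, 7, 9]   # positive relationship
    falling = [9, 7, 5, 3]  # negative relationship
    for label, ys in (("rising", rising), ("falling", falling)):
        r = correlation_coefficient(xs, ys)
        r2 = coefficient_of_determination(xs, ys)
        print(f"{label}: r = {r:+.3f}, r squared = {r2:.3f}")
```

Squaring discards the sign of r, so r² reports how much variation is explained but not whether the relationship is positive or negative.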
A Simple Regression Example with Negative Correlation
Both of these past examples have involved a positive relationship between x and y. Now
this example will summarize performing simple regression with a negative relationship.
Recently, I had the opportunity to bond with my son Brian as we shopped for his first
car when he turned 16. Brian had visions of Mercedes and BMWs dancing in his
head, whereas I was thinking more along the line of Hondas and Toyotas. After many
discussions on the matter, we compromised on looking for 1999 Volkswagen Jettas.
However, Brian had two requirements:
• It had to be black.
• It had to be the new body style.
Apparently, somebody at Volkswagen had the brilliant idea back in 1999 to subtly
change the design of the Jetta halfway through the production year. Personally, I would
never have noticed the difference. Brian, on the other hand, wouldn't be caught dead
driving the original version, essentially eliminating half the used 1999 Volkswagen
Jettas on the market.
Anyway, what follows is a table showing the mileage of eight cars with the new body
style and their asking price. The remainder of this chapter demonstrates the correlation
and regression technique using this data.
The correlation coefficient can be found using:
The negative correlation indicates that as mileage (x) increases, the price (y) decreases,
as we would expect. The coefficient of determination becomes:
Approximately 57 percent of the variation in price is explained by the variation in
mileage. The regression line is determined using:
We can describe the regression line by the equation:
What would the predicted price be for a car with 45,000 miles?
The regression line would predict that a car with 45,000 miles would be priced at
$13,041. What would be the 90 percent confidence interval at x = 45,000? The
standard error of the estimate would be:
The critical t-statistic for n − 2 = 8 − 2 = 6 degrees of freedom and a 90 percent
confidence interval is tc = 1.943 from Table 4 in Appendix B. Our confidence interval is
then:
The 90 percent confidence interval for a car with 45,000 miles is between $11,589 and
$14,493. Is the relationship between mileage and price statistically significant at the α
= 0.10 level? Our hypotheses statement is: H0: β = 0 and H1: β ≠ 0. The standard error
of the slope, sb, is found using:
The calculated test statistic for this hypothesis is:
The critical t-statistic is taken from the Student's t-distribution with n − 2 = 8 − 2 = 6
degrees of freedom. With a two-tail test and α = 0.10, tc = 1.943 according to Table
4 in Appendix B. Because |t| > |tc|, we reject the null hypothesis and conclude there is a
relationship between the mileage and price variables. We use the absolute values
because the calculated t-statistic is in the left tail of the t-distribution with a two-tail
hypothesis test.
Assumptions for Simple Regression
For all these results to be valid, we need to make sure that the underlying assumptions
of simple regression are not violated. These assumptions are as follows:
• Individual differences between the data and the regression line, (yi − ŷi), are
independent of one another.
• The observed values of y are normally distributed around the predicted value, ŷ.
• The variation of y around the regression line is equal for all values of x.
Simple Versus Multiple Regression
Simple regression is limited to examining the relationship between a dependent
variable and only one independent variable. If more than one independent variable is
involved in the relationship, then we need to graduate to multiple regression. The
regression equation for this method looks like this:
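In its general form, a multiple regression equation is commonly written as below. The
subscripted symbols are assumptions for illustration: x1 through xk stand for the k
independent variables, and b1 through bk are their coefficients.

```latex
\hat{y} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
```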
• The independent variable (x) causes variation in the dependent variable (y).
• The correlation coefficient, r, indicates both the strength and direction of the
relationship between the independent and dependent variables.
• The technique of simple regression enables us to describe a straight line that best fits
a series of ordered pairs (x, y).
• The least squares method is a mathematical procedure to identify the linear equation
that best fits a set of ordered pairs by finding values for a, the y-intercept, and b, the
slope.
• The standard error of the estimate, se, measures the amount of dispersion of the
observed data around the regression line.
• The coefficient of determination, r², represents the percentage of the variation in y
that is explained by the regression line.
