The term inference refers to a key concept in statistics in which we draw a conclusion
from available evidence.
The purpose of descriptive statistics is to summarize or display data so we can quickly obtain an overview. Inferential statistics allows us to make claims or conclusions about a population based on a sample of data from that population.

A population represents all possible outcomes or measurements of interest. A sample is a subset of a population. We use the term population in statistics to represent all possible measurements or outcomes that are of interest to us in a particular study. The term sample refers to a portion of the population that is representative of the population from which it was selected.

Data is simply defined as the value assigned to a specific observation or measurement. Data that is used to describe something of interest about a population is called a parameter. For instance, let's say that the population of interest is my wife's three-year-old preschool class and my measurement of interest is how many times the little urchins use the bathroom in a day. If we average the number of trips per child, this figure would be considered a parameter because the entire population was measured. However, if we want to make a statement about the average number of bathroom trips per day per three-year-old in the country, then Debbie's class could be our sample. We can consider the average that we observe from her class a statistic if we assume it could be used to estimate all three-year-olds in the country.

Data that describes a characteristic about a population is known as a parameter. Data that describes a characteristic about a sample is known as a statistic.

Information is data that is transformed into useful facts that can be used for a specific purpose, such as making a decision.

We classify the sources of data into two broad categories: primary and secondary. You can obtain primary data in many ways, such as direct observation, surveys, and experiments.
Direct observation: Focus groups are a direct observational technique where the subjects are aware that data is being collected. Businesses use focus groups to gather information in a group setting controlled by a moderator. The subjects are usually paid for their time and are asked to comment on specific topics.

Experiments: This method is more direct than observation because the subjects will participate in an experiment designed to determine the effectiveness of a treatment. An example of a treatment could be the use of a new medical drug. Two groups would be established. The first is the experimental group who receive the new drug, and the second is the control group who think they are getting the new drug but are in fact getting no medication. The reactions from each group are measured and compared to determine whether the new drug was effective. The benefit of experiments is that they allow the statistician to control factors that could influence the results, such as gender, age, and education of the participants. The concern about collecting data through experiments is that the response of the subjects might be influenced by the fact that they are participating in a study. The design of experiments for a statistical study is a very complex topic and goes beyond the scope of this book.

Surveys: This technique of data collection involves directly asking the subject a series of questions. The questionnaire needs to be carefully designed to avoid any bias or confusion for those participating. Concerns also exist about the influence the survey will have on the participant's responses. Research has shown that the manner in which the questions are asked can affect the responses a person provides on a questionnaire. A question posed in a positive tone will tend to invoke a more positive response and vice versa. A good strategy is to test your questionnaire with a small group of people before releasing it to the general public.

Another way to classify data is by one of two types: quantitative or qualitative.
Types of measurement scales:

A nominal level of measurement deals strictly with qualitative data. Observations are simply assigned to predetermined categories. One example is gender of the respondent, with the categories being male and female. This data type does not allow us to perform any mathematical operations, such as adding or multiplying. We also cannot rank-order this list in any way from highest to lowest. This type is considered the lowest level of data and, as a result, is the most restrictive when choosing a statistical technique to use for the analysis. You can use numbers at the nominal level of measurement. Even in this case, the rules of the nominal scale still remain. An example would be zip codes or telephone numbers, which can't be added or placed in a meaningful order of greater than or less than. Even though the data appears to be numbers, it's handled just like qualitative data.

On the food chain of data, ordinal is the next level up. It has all the properties of nominal data with the added feature that we can rank-order the values from highest to lowest. An example is if you were to have a lawnmower race. Let's say the finishing order was Scott, Tom, and Bob. We still can't perform mathematical operations on this data, but we can say that Scott's lawnmower was faster than Bob's. However, we cannot say how much faster. Ordinal data does not allow us to make measurements between the categories and to say, for instance, that Scott's lawnmower is twice as good as Bob's (it's not). Ordinal data can be either qualitative or quantitative. An example of quantitative data is rating movies with 1, 2, 3, or 4 stars. However, we still may not claim that a 4-star movie is 4 times as good as a 1-star movie.

Moving up the scale of data, we find ourselves at the interval level, which is strictly quantitative data. Now we can get to work with the mathematical operations of addition and subtraction when comparing values. For this data, we can measure the difference between the different categories with actual numbers and also provide meaningful information. Temperature measurement in degrees Fahrenheit is a common example here. For instance, 70 degrees is 5 degrees warmer than 65 degrees. However, multiplication and division can't be performed on this data. Why not? Simply because we cannot argue that 100 degrees is twice as warm as 50 degrees.

The king of data types is the ratio level. Now we can perform all four mathematical operations to compare values with absolutely no feelings of guilt. Examples of this type of data are age, weight, height, and salary. Ratio data has all the features of interval data with the added benefit of a true 0 point. The term true zero point means that a 0 data value indicates the absence of the object being measured. For instance, 0 salary indicates the absence of any salary.

The distinction between interval and ratio data is a fine line. To help identify the proper scale, use the "twice as much" rule. If the phrase "twice as much" accurately describes the relationship between two values that differ by a multiple of 2, then the data can be considered ratio level. Interval data does not have a true 0 point. For example, 0 degrees Fahrenheit does not represent the absence of temperature, even though it may feel like it.

A frequency distribution is simply a table that organizes the number of data values into intervals. The intervals in a frequency distribution are officially known as classes, and the number of observations in each class is known as the class frequency.

Constructing a frequency distribution:
- form classes of equal size.
- make classes mutually exclusive, or in other words, prevent classes from overlapping.
- try to have no fewer than 5 classes and no more than 15 classes.
- avoid open-ended classes, if possible (for instance, a highest class of 15-over).
- include all data values from the original table in a class. In other words, the classes should be exhaustive.
Relative Frequency Distribution
Rather than display the number of observations in each class, this method calculates the percentage of observations in each class by dividing the frequency of each class by the total number of observations.

Cumulative Frequency Distribution
Cumulative frequency distributions indicate the percentage of observations that are less than or equal to the current class. It totals the percentages of each class as you move down the column. John used his phone 8 times or less on 84 percent of the days in the month.

Graphing a Frequency Distribution: the Histogram
A histogram is simply a bar graph showing the number of observations in each class as the height of each bar.
- the first thing we need to do is open Excel to a blank sheet and enter our data in Column A starting in Cell A1.
- next enter the upper limits to each class in Column B starting in Cell B1.
- go to the Tools menu at the top of the Excel window and select Data Analysis.
- the Chart Wizard allows me more control over the final appearance.

Statistical Flower Power: the Stem and Leaf Display
The major benefit of this approach is that all the original data points are visible on the display.
The stem in the display is the first column of numbers, which represents the first digit of the golf scores. The leaf in the display is the second digit of the golf scores, with 1 digit for each score. Because there were 5 scores in the 70s, there are 5 digits to the right of 7. Here, the stem labeled 7 (5) stores all the scores between 75 and 79. The stem 8 (0) stores all the scores between 80 and 84.

Charting a Frequency Distribution

Bar Charts
Bar charts are a useful graphical tool when you are plotting individual data values next to each other. The histogram that we visited earlier in the chapter is actually a special type of bar chart that plots frequencies rather than actual data values. How do I choose between a pie chart and a bar chart? If your objective is to compare the relative size of each class to one another, use a pie chart. Bar charts are more useful when you want to highlight the actual data values.

Line Charts
A line chart is used to help identify patterns between two sets of data. Line charts prove very useful when you are interested in exploring patterns between two different types of data. They are also helpful when you have many data points and want to show all of them on one graph. Because the line connecting the data points seems to have an overall upward trend, my suspicions hold true. It seems the more showers our waterlogged darlings take, the higher the utility bill.

Measures of Central Tendency
There exist two broad categories of descriptive statistics that are commonly used. The first, measures of central tendency, describes the center point of our data set with a single value. It's a valuable tool to help us summarize many pieces of data with one number. The second category, measures of dispersion, describes how far individual data values have strayed from the mean.

The mean or average is the most common measure of central tendency and is calculated by adding all the values in our data set and then dividing this result by the number of observations.
A weighted mean allows you to assign more weight to certain values and less weight to others.

Mean of Grouped Data from a Frequency Distribution - example:
The mean of a frequency distribution where data is grouped into classes is only an approximation to the mean of the original data set from which it was derived. This is true because we make the assumption that the original data values are at the midpoint of each class, which is not necessarily the case. According to the mean of this frequency distribution, John averages 4.6 calls per day on his cell phone. The true mean of the 30 original data values in the cell phone example is only 4.5 calls per day rather than 4.6.

The median is the value in the data set for which half the observations are higher and half the observations are lower. We find the median by arranging the data values in ascending order and identifying the halfway point. When there is an even number of data points, the median will be the average of the two center points. Using our example with the video games, we rearrange our data set in ascending order:
3 4 4 4 5 6 7 7 9 17
Because we have an even number of data points (10), the median is the average of the two center points. In this case, that will be the values 5 and 6, resulting in a median of 5.5 hours of video games per week. Notice that there are four data values to the left (3, 4, 4, and 4) of these center points and four data values to the right (7, 7, 9, and 17).

The mode is simply the observation in the data set that occurs the most frequently.

If you think all the data in your data set is relevant, then the mean is your best choice. This measurement is affected by both the number and magnitude of your values. However, very small or very large values can have a significant impact on the mean, especially if the size of the sample is small. If this is a concern, perhaps you should consider using the median. The median is not as sensitive to a very large or small value. Consider the following data set from the original video game example:
3 4 4 4 5 6 7 7 9 17
The number 17 is rather large when compared to the rest of the data.
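To see how much one large value pulls each measure, here is a small sketch using Python's standard `statistics` module on the ten video game values above. The choice of Python and the variable names are ours, not part of the original notes.

```python
from statistics import mean, median, mode

# Hours of video games per week, from the example above.
hours = [3, 4, 4, 4, 5, 6, 7, 7, 9, 17]

print("mean:  ", mean(hours))
print("median:", median(hours))
print("mode:  ", mode(hours))

# Drop the largest value (17) to see which measure moves the most.
without_largest = hours[:-1]
print("mean without 17:  ", round(mean(without_largest), 2))
print("median without 17:", median(without_largest))
```

Removing the 17 shifts the mean by more than a full hour, while the median moves by only half an hour, which is the sensitivity the text describes.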
The mean of this sample was 6.6, whereas the median was 5.5. If you think 17 is not a typical value that you would expect in this data set, the median would be your best choice for central tendency.

The poor lonely mode has limited applications. It is primarily used to describe data at the nominal scale, that is, data that is grouped in descriptive categories such as gender. If 60 percent of our survey respondents were male, then the mode of our data would be male.

From Data Analysis - Descriptive Statistics: mean, median, mode.

Measures of Dispersion

Range is the simplest measure of dispersion and is calculated by finding the difference between the highest value and the lowest value in the data set.
7 9 8 11 4 - range = 11 - 4 = 7
However, the limitation is that it only relies on two data points to describe the variation in the sample. No other values between the highest and lowest points are part of the range calculation.

Variance summarizes the squared deviation of each data value from the mean. The variance is a measure of dispersion that describes the relative distance between the data points in the set and the mean of the data set. This measure is widely used in inferential statistics.
The first step in calculating the variance is to determine the mean of the data set. The rest of the calculations can be facilitated by the following table. The final sample variance calculation becomes this: s² = 26.8 / (5 - 1) = 6.7.

Using the Raw Score Method is a more efficient way to calculate the variance of a data set:
s² = (Σx² - (Σx)² / n) / (n - 1)
that is, the sum of each data value after it has been squared, minus the square of the sum of all the data values divided by n, all divided by n - 1.

The Variance of a Population

Standard deviation is simply the square root of the variance. Just as with the variance, there is a standard deviation for both the sample and population. To calculate the standard deviation, you must first calculate the variance and then take the square root of the result. The standard deviation is actually a more useful measure than the variance because the standard deviation is in the units of the original data set.

Calculating the Standard Deviation of Grouped Data

The Empirical Rule: Working with Standard Deviation
The values of many large data sets tend to cluster around the mean or median so that the data distribution in the histogram resembles a bell-shaped, symmetrical curve. When this is the case, the empirical rule tells us that approximately 68 percent of the data values will be within one standard deviation from the mean. For example, suppose that the average exam score for my large statistics class is 88 points and the standard deviation is 4.0 points and that the distribution of grades is bell-shaped around the mean. Because one standard deviation above the mean would be 92 (88 + 4) and one standard deviation below the mean would be 84 (88 - 4), the empirical rule tells me that approximately 68 percent of the exam scores will fall between 84 and 92 points.
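As a check on the variance and standard deviation calculations described above, the sketch below computes the sample variance both ways. It assumes the five values from the range example (7, 9, 8, 11, 4) are the data behind the 26.8 figure; their squared deviations do sum to 26.8, but the original calculation table is not reproduced here, so treat that link as an assumption.

```python
from math import sqrt

data = [7, 9, 8, 11, 4]  # assumed to be the data behind the 26.8 figure
n = len(data)

# Definitional method: sum the squared deviations from the mean.
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)
var_definitional = sum_sq_dev / (n - 1)

# Raw score method: needs only the sum of x and the sum of x squared.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
var_raw = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = sqrt(var_raw)

print("sum of squared deviations:", round(sum_sq_dev, 4))
print("variance (definitional):  ", round(var_definitional, 4))
print("variance (raw score):     ", round(var_raw, 4))
print("standard deviation:       ", round(std_dev, 4))
```

Both methods give the same variance; the raw score version simply avoids computing each deviation from the mean.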
According to the empirical rule, if a distribution follows a bell shape (a symmetrical curve centered around the mean), we would expect approximately 68, 95, and 99.7 percent of the values to fall within one, two, and three standard deviations around the mean respectively. In general, we can use the following equation to express the range of values within k standard deviations around the mean: μ ± kσ.

Chebyshev's Theorem
Chebyshev's theorem is a mathematical rule similar to the empirical rule except that it applies to any distribution rather than just bell-shaped, symmetrical distributions. Chebyshev's theorem states that for any number k greater than 1, at least (1 - 1/k²) × 100 percent of the values will fall within k standard deviations from the mean. Using this equation, we can state the following:
- at least 75 percent of the data values will fall within two standard deviations from the mean by setting k = 2 in Chebyshev's equation.
- at least 88.9 percent of the data values will fall within three standard deviations from the mean by setting k = 3 in the equation.
- at least 93.75 percent of the data values will fall within four standard deviations from the mean by setting k = 4 in the equation.

Example: This table supports Chebyshev's theorem, which predicts that at least 75 percent of the values will fall within two standard deviations from the mean. From the data set, we can observe that 95 percent actually fall between 20.3 and 49.1 home runs (38 out of 40). The same explanation holds true for three and four standard deviations around the mean.

Measures of Relative Position describe the percentage of the data below a certain point.

Quartiles divide the data set into four equal segments after it has been arranged in ascending order. Approximately 25 percent of the data points will fall below the first quartile, Q1. Approximately 50 percent of the data points will fall below the second quartile, Q2. And, you guessed it, 75
percent should fall below the third quartile, Q3.
1) Step 1: Arrange your data in ascending order.
2) Step 2: Find the median of the data set. This is Q2.
3) Step 3: Find the median of the lower half of the data set (in parentheses). This is Q1.
4) Step 4: Find the median of the upper half of the data set (in parentheses). This is Q3.

Interquartile range - the IQR measures the spread of the center half of our data set. It is simply the difference between the third and first quartiles, as follows: IQR = Q3 - Q1. The interquartile range is used to identify outliers, which are the black sheep of our data set. These are extreme values whose accuracy is questioned and can cause unwanted distortions in statistical results. Any values that are more than Q3 + 1.5 IQR or less than Q1 - 1.5 IQR should be discarded.

Example: 10 42 45 46 51 52 58 73
Since there are eight data values, Q1 will be the median of the first four values (the midpoint between the second and third values): Q1 = (42 + 45) / 2 = 43.5
Likewise, Q3 will be the median of the last four values (the midpoint between the sixth and seventh values): Q3 = (52 + 58) / 2 = 55
IQR = Q3 - Q1 = 55 - 43.5 = 11.5
Any value greater than Q3 + 1.5 IQR = 72.25 or less than Q1 - 1.5 IQR = 26.25 should be considered an outlier; therefore the values 10 and 73 are both outliers in this data set.

The values for variance and standard deviation reported by Excel are for a sample. If your data set represents a population, you need to recalculate the results using N in the denominator rather than n - 1.

Probability topics

Experiment. The process of measuring or observing an activity for the purpose of collecting data. An example is rolling a pair of dice.
Outcome. A particular result of an experiment. An example is rolling a pair of threes with the dice.
Sample space. All the possible outcomes of the experiment. The sample space for our experiment is the numbers {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12}.
Statistics people like to put {} around the sample space values.
Event. One or more outcomes that are of interest for the experiment and which is/are a subset of the sample space. An example is rolling a total of 2, 3, 4, or 5 with two dice.

Classical Probability refers to a situation when we know the number of possible outcomes of the event of interest and can calculate the probability of that event with the following equation:
P[A] = Number of possible outcomes in which Event A occurs / Total number of possible outcomes in the sample space.

Empirical Probability - when we don't know enough about the underlying process to determine the number of outcomes associated with an event. This type of probability observes the number of occurrences of an event through an experiment and calculates the probability from a relative frequency distribution.
P[A] = Frequency in which Event A occurs / Total number of observations.

One example of empirical probability is to answer the age-old question: What is the probability that John will get out of bed in the morning for school after his first wake-up call? Based on these observations, if Event A = John getting out of bed on the first wake-up call, then P[A] = 0.15. Using the previous table, we can also examine the probability of other events. Let's say Event B = John requiring more than 2 wake-up calls to get out of bed; then P[B] = 0.40 + 0.25 = 0.65.

If I choose to run another 20-day experiment of John's waking behavior, I would most likely see different results than those in the previous table. However, if I were to observe 100 days of this data, the relative frequencies would approach the true or classical probabilities of the underlying process. This pattern is known as the law of large numbers. The law of large numbers states that when an experiment is conducted a large number of times, the empirical probabilities of the process will converge to the classical probabilities.
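The law of large numbers can be illustrated with a short simulation of the two-dice event mentioned above (a total of 2, 3, 4, or 5). This is a sketch under stated assumptions: the classical probability is counted over the 36 equally likely ordered rolls rather than the 11 possible totals (which are not equally likely), and the seed and trial counts are arbitrary choices made only for a repeatable demonstration.

```python
import random
from fractions import Fraction
from itertools import product

event = {2, 3, 4, 5}  # Event: the two dice total 2, 3, 4, or 5

# Classical probability: favourable rolls / all 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))
classical = Fraction(sum(1 for a, b in rolls if a + b in event), len(rolls))


def empirical(trials, rng):
    """Relative frequency of the event over `trials` simulated rolls."""
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) in event)
    return hits / trials


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
print("classical:", classical, "=", round(float(classical), 4))
for trials in (20, 100, 10_000, 100_000):
    print(f"{trials:>7} trials: {empirical(trials, rng):.4f}")
```

With only 20 trials the relative frequency can land well away from 10/36; as the number of trials grows, it settles close to the classical value.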
Subjective probability
We use subjective probability when classical and empirical probabilities are not available. Under these circumstances, we rely on experience and intuition to estimate the probabilities.

Basic Properties of Probability - one event
If P[A] = 1, then Event A must occur with certainty.
If P[A] = 0, then Event A will certainly not occur.
The probability of Event A must be between 0 and 1.
The sum of all the probabilities for the events in the sample space must be equal to 1.
The complement to Event A is defined as all the outcomes in the sample space that are not part of Event A and is denoted as A'. Using this definition, we can state the following: P[A] + P[A'] = 1 or P[A] = 1 - P[A'].

The Intersection of Events
Example: Now that my children are older and living away from home, I cherish those moments when the phone rings and I see one of their numbers appear on my caller ID. Experience has taught me that I can categorize these calls as either crisis, involving such things as a computer, a car, an ATM card, or a cell phone; or noncrisis, when they call just to see if I'm alive and well enough to help with their next crisis. The following table, called a contingency table, categorizes the last 50 phone calls by child and type of call. Contingency tables show the actual or relative frequency of two types of data at the same time. In this case, the data types are child and type of call.

Event A = the next phone call will come from Christin.
Event B = the next phone call will involve a crisis.
P[A] = 20/50 = 0.4

What about the probability that the next phone call will come from Christin and will involve a crisis? This event is known as the intersection of Events A and B and is described by A ∩ B. The number of phone calls from our contingency table that meet both criteria is 14, so:
P[A and B] = P[A ∩ B] = 14/50 = 0.28

A contingency table indicates the number of observations that are classified according to two variables.
The intersection of Events A and B represents the number of instances where Events A and B occur at the same time (that is, the same phone call is both from Christin and a crisis). The probability of the intersection of two events is known as a joint probability.

The union of Events A and B represents the number of instances where either Event A or B occurs (that is, the number of calls that were either from Christin or were a crisis).
P[A or B] = P[A ∪ B] = 34/50 = 0.68

Classical probability requires knowledge of the underlying process in order to count the number of possible outcomes of the event of interest. Empirical probability relies on historical data from a frequency distribution to calculate the likelihood that an event will occur.

Conditional Probability
We define conditional probability as the probability of Event A knowing that Event B has already occurred.

Example: the following table shows the outcomes of our last 20 matches, along with the type of warm-up before we started keeping score. Here Event A is Debbie winning the match and Event B is a warm-up shorter than 10 minutes. Without any additional information, the simple probability of each of these events is as follows:
P[A] = 9/20 = 0.45, P[B] = 13/20 = 0.65, P[A'] = 11/20 = 0.55, P[B'] = 7/20 = 0.35

Simple or prior probabilities are always based on the total number of observations. In the previous example, it is 20 matches.

Suppose we know that the warm-up was short. Knowing this piece of info, what is the probability that Debbie will win the match? This is the conditional probability of Event A given that Event B has occurred. Looking at the previous table, we can see that Event B has occurred 13 times.
Because Debbie has won 4 of those matches (A), the probability of A given B is calculated as follows:
P[A|B] = 4/13 = 0.31
We can also calculate the probability that Debbie will win given that Event B has not occurred:
P[A|B'] = 5/7 = 0.71
Conditional probabilities are also known as posterior probabilities. Conditional probabilities are very useful for determining the probabilities of compound events as you will see in the following sections.

Independent versus Dependent Events
Events A and B are said to be independent of each other if the occurrence of Event B has no effect on the probability of Event A. Using conditional probability, Events A and B are independent of one another if:
P[A|B] = P[A]
If Events A and B are not independent of one another, then they are said to be dependent events. In the tennis example, Events A and B are dependent because the probability of Debbie winning depends on whether the warm-up is more or less than 10 minutes. We can also demonstrate this by observing that:
P[A] = 9/20 = 0.45 and P[A|B] = 4/13 = 0.31
These probabilities tell us that overall, Debbie wins 45 percent of the matches. However, when there is a short warm-up, she only wins 31 percent of the time. Because these probabilities are not equal, Events A and B are dependent.

Multiplication Rule of Probabilities - to calculate the joint probability of two events. In other words, we are calculating the probability of these events occurring at the same time. For two independent events, the multiplication rule states the following:
P[A and B] = P[A] × P[B]
If the two events are dependent, the multiplication rule becomes:
P[A and B] = P[A|B] × P[B]

Mutually Exclusive Events
Two events are considered to be mutually exclusive if they cannot occur at the same time during the experiment.

Addition Rule of Probabilities
We use the addition rule of probabilities to calculate the probability of the union of events, that is, the probability that either Event A or Event B will occur. For two events that are mutually exclusive, the addition rule states the following:
P[A or B] = P[A] + P[B]
If the events are not mutually exclusive, the addition rule becomes:
P[A or B] = P[A] + P[B] - P[A and B]

When converting frequencies to relative frequencies in a contingency table, always divide each number in the table by the total number of observations.

Bayes' Theorem
The Fundamental Counting Principles
According to the fundamental counting principle, if one event can occur in m ways and a second event can occur in n ways, the total number of ways both events can occur together is m × n ways. And we can extend this principle to more than two events.

Permutations are the number of different ways in which objects can be arranged in order. In a permutation, each item appears only once. The number of permutations of n distinct objects is n! (expressed as n factorial).

Combinations are similar to permutations, except that the order of the objects is not important. The number of combinations of n objects taken r at a time can be found as follows:
nCr = n! / ((n - r)! r!)

Example: Now that we know the total number of five-card combinations from a 52-card deck, we can calculate the probability of a flush, which is any five cards that are all the same suit (spades, clubs, hearts, or diamonds). For you poker veterans, I am including a royal flush and a straight flush in this calculation. First, we need to count the number of five-card flushes of one suit, let's say diamonds. Because there are 13 diamonds in the deck, the number of combinations of these 13 diamonds, taken five at a time, is as follows:
13C5 = 13! / (5! (13 - 5)!) = 1,287
Because there are four suits in the deck, the total number of five-card flushes from any suit is 1,287 × 4 = 5,148. Therefore, the probability of being dealt a flush, including royal and straight, in a five-card hand is:
P[flush] = 5,148 / 2,598,960 = 0.002

In Excel: =PERMUT(n, r) and =COMBIN(n, r)

Random Variables
A random variable is an outcome that takes on a numerical value as a result of an experiment. The value of the random variable, which is not known with certainty before the experiment, is often denoted by x. All random variables are not created equal. The first type are known as continuous random variables, which are the result of a measurement on a continuous number scale. The second type of random variable is discrete.
Discrete random variables are the result of counting outcomes rather than measuring them. Discrete random variables can only take on a certain number of integer values within an interval.

A random variable is continuous if it can assume any numerical value within an interval as a result of measuring the outcome of an experiment. A random variable is discrete if it is limited to assuming only specific integer values as a result of counting the outcome of an experiment.

Discrete probability distributions
A listing of all the possible outcomes of an experiment for a discrete random variable along with the relative frequency or probability of each outcome is called a discrete probability distribution.
If we define the random variable x = the place Christin finished in a race, the previous table would be the discrete probability distribution for the variable x. From this table, we can state the probability that Christin will finish first as follows:
P[x = 1] = 0.54
Or we can state the probability that Christin will finish either first or second as follows:
P[x = 1 or x = 2] = 0.54 + 0.24 = 0.78

Any discrete probability distribution needs to meet the following requirements:
- each outcome in the distribution needs to be mutually exclusive, that is, the value of the random variable cannot fall into more than one of the frequency distribution classes. For example, it is not possible for Christin to take first and second place in the same race.
- the probability of each outcome, P[x], must be between 0 and 1; that is, 0 ≤ P[x] ≤ 1 for all values of x. In the previous example, P[x = 3] = 0.14, which falls between 0 and 1.
- the sum of the probabilities for all the outcomes in the distribution needs to add up to 1.

The Mean of a Discrete Probability Distribution
The mean of a discrete probability distribution is simply a weighted average calculated using the following formula:
μ = Σ xi × P[xi]
where μ = the mean of the discrete probability distribution, xi = the value of the random variable for the ith outcome, P[xi] = the probability that the ith outcome will occur, and the sum runs over the n outcomes in the distribution.
- it represents the average finish of many races. The mean of a discrete probability distribution does not have to equal one of the values of the random variable.
- another term for describing the mean of a probability distribution is the expected value, E[x].
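The requirements and the mean formula above can be sketched in code. Because the race table is not reproduced in these notes, the example uses a fair six-sided die as a stand-in distribution (an assumption made purely for illustration); the function names `is_valid_distribution` and `expected_value` are also ours.

```python
from fractions import Fraction

# Stand-in distribution (NOT the race table): a fair six-sided die.
# Keys are values of the random variable x; values are P[x].
die = {x: Fraction(1, 6) for x in range(1, 7)}


def is_valid_distribution(dist):
    """Check that each P[x] is between 0 and 1 and that they sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in dist.values())
    return each_in_range and sum(dist.values()) == 1


def expected_value(dist):
    """Mean of a discrete distribution: the sum of x * P[x]."""
    return sum(x * p for x, p in dist.items())


print("valid distribution:", is_valid_distribution(die))
print("expected value:", expected_value(die))
```

The expected value of the die is 7/2, or 3.5, a value the die itself can never show, which illustrates the point that the mean does not have to equal one of the values of the random variable.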
The Variance and Standard Deviation of a Discrete Probability Distribution

$\sigma^2 = \sum_{i=1}^{n} (x_i - \mu)^2 \, P[x_i]$

$\sigma = \sqrt{\sigma^2}$

Characteristics of a Binomial Experiment

A binomial experiment has the following characteristics: (1) the experiment consists of a fixed number of trials, denoted by n; (2) each trial has only two possible outcomes, a success or a failure; (3) the probability of success and the probability of failure are constant throughout the experiment; (4) each trial is independent of any other trial in the experiment.

The Binomial Probability Distribution

- allows us to calculate the probability of a specific number of successes for a certain number of trials. Therefore, the random variable for this distribution is the number of successes observed.

Let's say on any particular day there is a 30 percent probability that Kaylee will bring back one stolen paper and a 70 percent chance that she won't. We will assume that she will not bring back more than one paper a day. This scenario represents a binomial experiment, with each day being a Bernoulli trial with p = 0.30 (the probability of a success) and q = 0.70 (the probability of a failure). We can calculate the probability of r successes in n trials using the binomial distribution, as follows:

$P[r, n] = \frac{n!}{(n - r)! \, r!} \, p^r q^{n - r}$

With this equation, we can calculate the probability that Kaylee will bring back three papers over the next five days, with n = 5 and r = the number of papers.

From this figure, we can see that the most likely number of papers that Kaylee will show up with over 5 days is 1. Finally, we can calculate the probability of multiple events for this distribution. For instance, the probability that Kaylee will steal at least three papers over the next five days is:

P[r ≥ 3] = P[3, 5] + P[4, 5] + P[5, 5]

- an easier way to arrive at these probabilities is to use a binomial probability table.
- the probability table is organized by values of n, the total number of trials. The numbers of successes, r, are the rows of each section, whereas the probabilities of success, p, are the columns. Notice that each block of probabilities for a particular value of p adds up to 1.0.

BINOMDIST(r, n, p, cumulative), where:
cumulative = FALSE if you want the probability of exactly r successes
cumulative = TRUE if you want the probability of r or fewer successes

The Mean and Standard Deviation for the Binomial Distribution

μ = np, where n = the number of trials and p = the probability of a success.

You can calculate the standard deviation for a binomial probability distribution using the following equation: $\sigma = \sqrt{npq}$, where q = the probability of a failure.

The Poisson Process

A Poisson process counts the number of occurrences of an event over a period of time, area, distance, or any other type of measurement. Rather than being limited to only two outcomes, the Poisson process can have any number of outcomes over the unit of measurement. The random variable for the Poisson distribution is the actual number of occurrences. The mean for a Poisson distribution is the average number of occurrences that would be expected over the unit of measurement. For a Poisson process, the mean has to be the same for each interval of measurement. For instance, if the average number of customers walking into the store each hour is 11, this average needs to apply to every one-hour increment.
The last characteristic of a Poisson process is that the number of occurrences during one interval is independent of the number of occurrences in other intervals. In other words, if six customers walk into the store during the first hour of business, this would have no effect on the number of customers arriving during the second hour.
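As a rough check on the two discrete distributions discussed so far, the sketch below evaluates the binomial formula for Kaylee's papers (n = 5, p = 0.30) and the standard Poisson formula, P[x] = μ^x e^(-μ) / x!, for the store example with μ = 11 customers per hour. The function names are made up for illustration, and the Poisson question asked here (six customers in an hour) is an assumed example rather than one worked in the text.

```python
from math import comb, exp, factorial, sqrt


def binomial_pmf(r, n, p):
    """P[r, n]: probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean is mu."""
    return mu**x * exp(-mu) / factorial(x)


def poisson_cdf(x, mu):
    """Probability of x or fewer occurrences (the cumulative = TRUE case)."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))


# Binomial: Kaylee's papers, values taken from the text.
n, p = 5, 0.30
p_three = binomial_pmf(3, n, p)                                   # P[3, 5]
p_at_least_three = sum(binomial_pmf(r, n, p) for r in (3, 4, 5))  # P[r >= 3]
most_likely = max(range(n + 1), key=lambda r: binomial_pmf(r, n, p))
binomial_mean = n * p                                             # mu = np
binomial_sd = sqrt(n * p * (1.0 - p))                             # sqrt(npq)

# Poisson: mu = 11 customers per hour comes from the text;
# asking about six customers is an assumed example.
mu = 11
p_exactly_six = poisson_pmf(6, mu)
p_more_than_six = 1.0 - poisson_cdf(6, mu)                        # complement

print(f"P[3, 5] = {p_three:.4f}, P[r >= 3] = {p_at_least_three:.4f}")
print(f"most likely r = {most_likely}, mean = {binomial_mean}, sd = {binomial_sd:.3f}")
print(f"P[x = 6] = {p_exactly_six:.4f}, P[x > 6] = {p_more_than_six:.4f}")
```

The binomial part should agree with the reading that the most likely number of papers over five days is 1. The Poisson part uses the complement, 1 minus the cumulative probability, because the individual probabilities cannot be summed all the way to infinity.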
[Example of a distribution]

Some statistics books use the symbol λ (lambda) to denote the mean of a Poisson probability distribution. However, regardless of the notation, it's still the same equation.

P[x ≤ 2] = P[x = 0] + P[x = 1] + P[x = 2] (the cumulative probability)

- the variance of the distribution is the same as the mean: σ² = μ
- just like the binomial distribution, the Poisson probability distribution has a table that allows you to look up the probabilities for certain mean values.
- the probability table is organized by values of μ, the average number of occurrences. Notice that each block of probabilities for a particular value of μ adds up to 1. As with the binomial tables, one limitation of using the Poisson tables is that you are restricted to the values of μ that are shown in the table.

Technically, with a Poisson distribution, there is no upper limit to the number of occurrences during the interval. You'll notice from the Poisson tables that the probability of a large number of occurrences is practically zero. Because we cannot add all the probabilities of an infinite number of occurrences (if you can, you're a much better statistician than I am!), we use the complement of P[x ≤ 3], that is:

P[x > 3] = 1 - P[x ≤ 3]

because:

P[x = 0] + P[x = 1] + P[x = 2] + P[x = 3] + ... + P[x = ∞] = 1.0

POISSON(x; μ; cumulative), where:
cumulative = FALSE if you want the probability of exactly x occurrences
cumulative = TRUE if you want the probability of x or fewer occurrences

Using the Poisson Distribution as an Approximation to the Binomial Distribution

We can use the Poisson distribution to calculate binomial probabilities under the following conditions:
- when the number of trials, n, is greater than or equal to 20, and
- when the probability of a success, p, is less than or equal to 0.05
- in that case we use μ = np, where n = the number of trials and p = the probability of a success.

A Poisson process counts the number of occurrences of an event over a period of time, area, distance, or any other type of measurement.
- the mean for a Poisson distribution is the average number of occurrences that would be expected over the unit of measurement, and it has to be the same for each interval of measurement.
- the number of occurrences during one interval of a Poisson process is independent of the number of occurrences in other intervals.
- if the number of binomial trials is greater than or equal to 20 and the probability of a success is less than or equal to 0.05, you can use the equation for the Poisson distribution to approximate the binomial probabilities.

The Normal Probability Distribution

Now let's take on a new challenge: continuous random variables and a continuous probability distribution known as the normal distribution. Remember that we defined a continuous random variable as one that can assume any numerical value within an interval as a result of measuring the outcome of an experiment. Some examples of continuous random variables are weight, distance, speed, and time.

Characteristics of the Normal Probability Distribution
- the mean, median, and mode are the same value.
- the distribution is bell-shaped and symmetrical around the mean.
- the total area under the curve is equal to 1.
- the left and right sides of the normal probability distribution extend indefinitely, never quite touching the horizontal axis.
- the mean and standard deviation describe the shape of the distribution.
- example: Figure 3.5 shows the impact of changing the mean of the distribution to 5.0 inches, leaving the standard deviation at 0.8 inches.
- a smaller standard deviation results in a skinnier curve that's tighter and taller around the mean. A larger σ (standard deviation) makes for a fatter curve that's more spread out and not as tall.
Calculating Probabilities for the Normal Distribution

Calculating the standard z-score:

$z = \frac{x - \mu}{\sigma}$

where:
x = the normally distributed random variable of interest
μ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
z = the number of standard deviations between x and μ, otherwise known as the standard z-score

- we then look up z in the Standard Normal Table to find the area under the curve to the left of z.
- the probability that the random variable will be less than or equal to x is that area; multiply it by 100 to express it as a percentage.
- with continuous random variables, we cannot determine the probability of using exactly 64.3 ounces of spray because this would be an infinitely small probability. This is because I can use any of an infinite number of quantities in a given year. One year, I could use 61.757 ounces and another year, 53.472 ounces. That's why with continuous random variables we can only calculate the probabilities of certain intervals, like less than 64.3 ounces or between 50.5 and 58.1 ounces. Compare this to discrete random variables from previous chapters. Because there were only a finite number of values for these variables, we could calculate the probability of exactly x occurrences or r successes.
- a negative z-score indicates that we are to the left of the distribution mean. Notice that the standard normal table only shows positive z values. But this is no problem because the distribution is symmetric.
- example: we can determine the area to the right of -1.2 standard deviations as follows: P[z > -1.2] = 1 - P[z ≤ -1.2] = 1 - 0.1151 = 0.8849. Because P[x > 54] = P[z > -1.2] = 0.8849, there is an 88.49 percent chance I will use more than 54 ounces of spray.

Fig. 11.10: The shaded area is the probability that x will be more than 54 ounces.
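The z-score lookup can also be done without a table. The sketch below is one possible way to do it in Python, using the error function from the standard library to get the area to the left of z. The mean of 60 ounces and standard deviation of 5 ounces are assumptions chosen only so that x = 54 gives z = -1.2; the text does not state the actual values.

```python
from math import erf, sqrt


def z_score(x, mu, sigma):
    """Number of standard deviations between x and the mean."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Assumed parameters (not given in the text), chosen so that z = -1.2.
mu, sigma = 60.0, 5.0

z = z_score(54.0, mu, sigma)
p_at_most_54 = standard_normal_cdf(z)    # P[z <= -1.2]
p_more_than_54 = 1.0 - p_at_most_54      # P[z > -1.2]

print(f"z = {z:.1f}, P[x > 54] = {p_more_than_54:.4f}")
```

Because the function handles negative z directly, there is no need to rely on symmetry the way a positive-only table requires.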
NORMDIST(x; mean; std deviation; cumulative), where cumulative = FALSE if you want the probability density function (we don't) or cumulative = TRUE if you want the cumulative probability (we do).

Using the Normal Distribution as an Approximation to the Binomial Distribution
- the binomial equation will calculate the probability of r successes in n trials with p = the probability of a success for each trial and q = the probability of a failure. If np ≥ 5 and nq ≥ 5, we can use the normal distribution to approximate the binomial.
- as an example, suppose my statistics class is composed of 60 percent females. If I select 15 students at random, what is the probability that this group will include 8, 9, 10, or 11 female students? For this example, n = 15; p = 0.6; q = 0.4; and r = 8, 9, 10, and 11. We can use the normal approximation because np = (15)(0.6) = 9 and nq = (15)(0.4) = 6.
- when calculating with the normal approximation to the binomial distribution, adding or subtracting 0.5 is known as the continuity correction factor. For larger values of n, like 100 or more, you can ignore this correction factor.

Inferential Statistics

Inferential statistics enables us to make statements about a general population using the results of a random sample from that population. For instance, using inferential statistics, the winner of a political election can be accurately predicted very early in the polling process based on the results of a relatively small random sample that is properly chosen. The term random sampling refers to a sampling procedure where every member in the population has a chance of being selected.
- we have to ensure that the final sample to be measured is representative of the population from which it was taken. If this is not the case, then we have a biased sample, which can lead to misleading results.
- there are four different ways to gather a random sample: simple random, systematic, cluster, and stratified.
A simple random sample is a sample in which every member of the population has an equal chance of being chosen. I could randomly choose patients using a random number table.
- random numbers can also be generated with Excel using the RAND function: cell A1 contains the formula =RAND(), which provides a random number between 0 and 1. This random number would result in student 357 being chosen for the sample.

One way to avoid a personal bias when selecting people at random is to use systematic sampling. This technique results in selecting every kth member of the population to be in your sample. The value of k will depend on the size of the sample and the size of the population. In general, if N = the size of the population and n = the size of the sample, the value of k would be approximately N/n. The benefit of systematic sampling is that it's easier to conduct than a simple random sample, often requiring less time and money. The downside is the danger of selecting a biased sample if there is a pattern in the population that is consistent with the value of k.

Cluster sampling. If we can divide the population into groups, or clusters, then we can select a simple random sample of these clusters to form the final sample. Each member of the chosen clusters would be part of the final sample.

In stratified sampling, we divide the population into mutually exclusive groups, or strata, and randomly sample from each of these groups (like men and women). Other examples of criteria that we can use to divide the population into strata are age, income, or occupation. Stratified sampling is helpful when it is important that the final sample has certain characteristics of the overall population.

Sampling Errors

By relying on a sample, we expose ourselves to errors that can lead to inaccurate conclusions about the population. The type of error that a statistician is most concerned about is called sampling error, which occurs when the sample measurement is different from the population measurement.
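The first two sampling methods described above can be sketched in a few lines of Python. This is only an illustration: the population of 500 numbered students is hypothetical, the function names are invented, and starting the systematic sample at a random offset is a common convention that I am assuming here rather than something the text specifies.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being chosen (no repeats)."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=None):
    """Select every k-th member, with k approximately N / n."""
    if not 0 < n <= len(population):
        raise ValueError("n must be between 1 and the population size")
    k = len(population) // n
    rng = random.Random(seed)
    start = rng.randrange(k)          # assumed: random starting offset
    return population[start::k][:n]


students = list(range(1, 501))        # hypothetical student numbers 1..500

srs = simple_random_sample(students, 10, seed=1)
every_kth = systematic_sample(students, 10, seed=1)

print(sorted(srs))
print(every_kth)
```

With N = 500 and n = 10 the systematic sample picks every 50th student, which is exactly where a hidden pattern of period 50 in the list would bias the result.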
Because the population is rarely measured in its entirety, the sampling error cannot be directly calculated. One way to reduce the sampling error of a statistical study is to increase the size of the sample. In general, the larger the sample size, the smaller the sampling error. If you increase the sample size until it reaches the size of the population, then the sampling error will be reduced to zero. But in doing so, you forfeit the benefits of sampling.
- online surveys: the respondents are self-selected, which means the sample is not randomly chosen. The results of these surveys are most likely biased because the respondents would not be representative of the population at large. For example, people without Internet access would not be part of the sample and might respond differently than people with access to the Internet.

The Sampling Distribution of the Mean

Here the mean of each sample is the measurement of interest.
- we have a discrete uniform probability distribution when each event has the same probability.
- a discrete uniform probability distribution is a distribution that assigns the same probability to each discrete event (a distribution is discrete if its events are countable).
According to the central limit theorem, as the sample size, n, gets larger, the sample means tend to follow a normal probability distribution. This holds true regardless of the distribution of the population from which the sample was drawn. The standard deviation of the sample means is formally known as the standard error of the mean. Students often confuse σ and $\sigma_{\bar{x}}$. The symbol σ, the standard deviation of the population, measures the variation within the population. The symbol $\sigma_{\bar{x}}$, the standard error, measures the variation of the sample means and will decrease as the sample size increases. The theoretical sampling distribution of the mean displays all the possible sample means along with their classical probabilities.

Sampling Distribution of the Proportion

My measurement of interest is the proportion of teenagers in my sample of size n who will agree with the statement "My parents are an excellent resource when I'm looking for advice on an important matter in my life." The sample proportion, $p_s$, is calculated by:

$p_s = \frac{\text{number of successes in the sample}}{n}$

Because I don't know p, the proportion of the population who would agree with the statement, I need to collect data from samples and approximate the population proportion. With proportion data, I want the sample size to be large enough so I can use the normal probability distribution to approximate the binomial distribution. If np ≥ 5 and nq ≥ 5, we can use the normal distribution to approximate the binomial (q = 1 - p, the probability of a failure). I'm hopeful that p will be at least 5 percent (at least a few teenagers might listen to their parents), so if I choose n = 150, then:

np = (150)(0.05) = 7.5 and nq = (150)(0.95) = 142.5

Suppose I choose 10 samples, each of size 150, and record the number of agreements (successes) in each sample in the table that follows. Next I average the sample proportions to approximate the population proportion:

average $p_s$ = (sum of sample proportions) / (number of samples) = 0.164

Now, we calculate the standard error of the proportion:

$\sigma_p = \sqrt{\frac{p(1 - p)}{n}} = 0.030$

Now I'm ready to answer the age-old question: what is the probability that, from my next sample of 150 teenagers, 20 percent or less will agree with the statement "My parents are an excellent resource when I'm looking for advice on an important matter in my life"? Because our sample size allows us to use the normal approximation to the binomial distribution, we now calculate the z-score for the proportion using the following equation:

$z = \frac{p_s - p}{\sigma_p} = \frac{0.20 - 0.164}{0.030} = +1.20$

and, using the standard z-table, P[$p_s$ ≤ 0.20] = 0.8849.

Confidence Intervals

The simplest estimate of a population parameter is the point estimate, the most common being the sample mean. A point estimate is a single value that best describes the population of interest. The advantage of a point estimate is that it is easy to calculate and easy to understand. The disadvantage, however, is that I have no clue as to how accurate this estimate really is. To deal with this uncertainty, we can use an interval estimate, which provides a range of values that best describes the population. A confidence level is the probability that the interval estimate will include the population parameter. A parameter is defined as a numerical description of a population characteristic, such as the mean. In general, we can construct a confidence interval around our sample mean using the following equations:

upper limit = $\bar{x} + z_c \sigma_{\bar{x}}$
lower limit = $\bar{x} - z_c \sigma_{\bar{x}}$

where $\bar{x}$ = the sample mean, $z_c$ = the critical z-score for the chosen confidence level, and $\sigma_{\bar{x}}$ = the standard error of the mean.

As described earlier, a confidence interval is a range of values used to estimate a population parameter and is associated with a specific confidence level. A confidence interval needs to be described in the context of several samples. If we select 10 samples from our home shopping population and construct 90 percent confidence intervals around each of the sample means, then we would expect, on average, 9 of the 10 intervals to contain the true population mean, which remains unknown.
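The "9 out of 10 intervals" idea can be checked with a small simulation. The sketch below is a rough illustration only: the population mean of 75, standard deviation of 20, and sample size of 32 are invented values (the text does not give the home shopping figures), the population is assumed to be normal with known σ, and the critical z-score comes from Python's `statistics.NormalDist` rather than a printed table.

```python
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean when sigma is known."""
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # critical z-score
    margin = z_c * sigma / len(sample) ** 0.5            # z_c * standard error
    centre = mean(sample)
    return centre - margin, centre + margin


# All of these values are hypothetical.
true_mean, sigma, n, trials = 75.0, 20.0, 32, 2000

rng = random.Random(42)
hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    low, high = z_interval(sample, sigma, 0.90)
    if low <= true_mean <= high:
        hits += 1

coverage = hits / trials
print(f"share of 90% intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.90, which is the sense in which the confidence level describes the procedure across many samples rather than any single interval.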
It is easy to misinterpret the definition of a confidence interval. For example, it is not correct to state that there is a 90 percent probability that the true population mean is within the interval ($67.38, $89.12). Rather, a correct statement would be that there is a 90 percent probability that any given confidence interval from a random sample will contain the true population mean. Because there is a 90 percent probability that any given confidence interval will contain the true population mean in the previous example, we have a 10 percent chance that it won't. This 10 percent value is known as the level of significance, α, which is represented by the total white area in both tails. The probability for the confidence interval is the complement of the significance level. For example, the significance level for a 95 percent confidence interval is 5 percent, the significance level for a 99 percent confidence interval is 1 percent, and so on. In general, a 100(1 - α) percent confidence interval has a significance level equal to α. The level of significance (α) is the probability of making a Type I error.

Notice that in Figure 14.1, 5 percent of the area under the curve lies to the right of +1.64 and 95 percent of the area under the curve lies to the left. That's why you see 0.9495 (close enough to 0.95) corresponding to a z-score of 1.64 in Table 3 of Appendix B. Remember, however, that z = 1.64 corresponds to a 90 percent confidence interval, the shaded region in the figure, because 5 percent of the area lies in each tail.

The Effect of Changing Confidence Levels

As we increase the confidence level, our interval estimate of the true population mean becomes wider and less precise. If we want more certainty that our confidence interval will contain the true population mean, that confidence interval will become wider. There is one way, however, to reduce the width of our confidence interval while maintaining the same confidence level. We can do this by increasing the sample size.
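Both effects on the interval width can be seen numerically. This is a minimal sketch with an assumed population standard deviation of 20 and assumed sample sizes; none of these numbers come from the text. Note that the computed critical value for 90 percent is about 1.645, slightly more precise than the table value of 1.64.

```python
from statistics import NormalDist


def interval_width(sigma, n, confidence):
    """Full width of a z-based confidence interval for the mean."""
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z_c * sigma / n ** 0.5


sigma = 20.0                                        # hypothetical value

width_90 = interval_width(sigma, 32, 0.90)
width_99 = interval_width(sigma, 32, 0.99)          # higher confidence level
width_90_large = interval_width(sigma, 128, 0.90)   # four times the sample size

print(f"90%, n = 32:  {width_90:.2f}")
print(f"99%, n = 32:  {width_99:.2f}")
print(f"90%, n = 128: {width_90_large:.2f}")
```

Raising the confidence level widens the interval, while quadrupling the sample size halves it, because the standard error shrinks with the square root of n.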
Determining Sample Size for the Mean

We can also calculate a minimum sample size that would be needed to provide a specific margin of error, E:

$E = z_c \sigma_{\bar{x}}$

Because $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, solving for n gives $n = (z_c \sigma / E)^2$.

Calculating a Confidence Interval When σ Is Unknown

As long as n ≥ 30, we can substitute s, the sample standard deviation, for σ, the population standard deviation, and follow the same procedure as before.

CONFIDENCE(alpha; standard_dev; size)

Confidence Intervals for the Mean with Small Samples

With a small sample size, we lose the use of our faithful friend, the central limit theorem, and we need to assume that the population is normally (or approximately normally) distributed for all cases. The first case that we'll examine is when we know σ, the population standard deviation.
- when σ is known, the procedure reverts back to the large sample size case. We can do this because we are now assuming the population is normally distributed.
- when σ is unknown, we make an adjustment similar to the one we made earlier and substitute s, the sample standard deviation, for σ, the population standard deviation. However, because of the small sample size, this substitution forces us to use a new probability distribution known as the Student's t-distribution.

The t-distribution is a continuous probability distribution with the following properties:
- it is bell-shaped and symmetrical around the mean.
- the shape of the curve depends on the degrees of freedom (d.f.) which, when dealing with the sample mean, would be equal to n - 1.
- the area under the curve is equal to 1.0.
- the t-distribution is flatter than the normal distribution. As the number of degrees of freedom increases, the shape of the t-distribution becomes similar to the normal distribution, as seen in Figure 14.6. With more than 30 degrees of freedom (that is, a sample size of roughly 30 or more), the two distributions are practically identical.

The degrees of freedom are the number of values that are free to vary once a piece of information, such as the sample mean, is known. For example, if I know that my sample of size 3 has a mean of 10, I can only vary two values (n - 1). After I set those two values, I have no control over the third value because my sample average must be 10. For this sample, I have 2 degrees of freedom.

We can now set up our confidence intervals for the mean using a small sample:
upper limit = $\bar{x} + t_c \frac{s}{\sqrt{n}}$
lower limit = $\bar{x} - t_c \frac{s}{\sqrt{n}}$

where $t_c$ = the critical t-value (can be found in Table 4 in Appendix B).

We can use the t-distribution when all of the following conditions have been met:
- the population follows the normal (or approximately normal) distribution.
- the sample size is less than 30.
- the population standard deviation, σ, is unknown and must be approximated by s, the sample standard deviation.

Confidence Intervals for the Proportion with Large Samples
- we can also estimate the proportion of a population by constructing a confidence interval from a sample.
- proportion data follow the binomial distribution, which can be approximated by the normal distribution under the following conditions: np ≥ 5 and nq ≥ 5.
- suppose I want to estimate the proportion of home shopping customers who are female based on the results of a sample; we first calculate $p_s$.
- the confidence interval around the sample proportion can be calculated by: $p_s \pm z_c \sigma_p$, where $\sigma_p = \sqrt{p(1 - p)/n}$.

Our challenge is that we are trying to estimate p, the population proportion, but we need a value for p to set up the confidence interval. Our solution: estimate the standard error by using the sample proportion as an approximation for the population proportion.

$p_s$ = 0.629
$\sigma_p$ = 0.0365

We are now ready to construct a 90 percent confidence interval around our sample proportion ($z_c$ = 1.64):
- upper limit = 0.629 + (1.64)(0.0365) = 0.689
- lower limit = 0.629 - (1.64)(0.0365) = 0.569

Our 90 percent confidence interval for the proportion of female home shopping customers is (0.569, 0.689).

Determining Sample Size for the Proportion

$n = pq \left( \frac{z_c}{E} \right)^2$

- therefore, obtaining a 99 percent confidence interval with a margin of error of no more than 6 percent would require a sample of 459 home shoppers.

Introduction to Hypothesis Testing

One thing statisticians like to do is to make a statement about a population parameter, collect a sample from that population, measure the sample, and declare, in a scholarly manner, whether or not the sample supports the original statement. This, in a nutshell, is what hypothesis testing is all about.
In the statistical world, a hypothesis is an assumption about a population parameter. Any such hypothesis is a statement about the population that may or may not be true. The purpose of hypothesis testing is to reach a statistical conclusion about whether or not to reject such statements.

Let's say that my hypothesis is that it will take an average of six days to capture a loose snake in a house. In other words, I would like to test my belief that the population mean, μ, is equal to six days. I do this by gathering a sample of people who have had a loose snake in their home and calculating the average number of days required to capture it. Suppose the sample average is 6.1 days. The hypothesis test will then tell me whether 6.1 days is significantly different from 6.0 days or whether the difference is merely due to chance.

The Null and Alternative Hypothesis

Every hypothesis test has both a null hypothesis and an alternative hypothesis. The null hypothesis, denoted by H0, represents the status quo and involves stating the belief that the mean of the population is ≤, =, or ≥ a specific value. The null hypothesis is believed to be true unless there is overwhelming evidence to the contrary. In this example, my null hypothesis would be stated as:

H0: μ = 6.0 days

The alternative hypothesis, denoted by H1, represents the opposite of the null hypothesis and holds true if the null hypothesis is found to be false. The alternative hypothesis always states that the mean of the population is <, ≠, or > a specific value. In this example, my alternative hypothesis would be stated as:

H1: μ ≠ 6.0 days

The following table shows the three valid combinations of the null and alternative hypothesis (μ0 stands for the specific value being tested).

Null hypothesis      Alternative hypothesis
H0: μ = μ0           H1: μ ≠ μ0
H0: μ ≥ μ0           H1: μ < μ0
H0: μ ≤ μ0           H1: μ > μ0

Note that the alternative hypothesis is never associated with ≥, =, or ≤. You need to be careful how you state the null and alternative hypothesis. Your choice will depend on the nature of the test and the motivation of the person conducting it.
If the purpose is to test that the population mean is equal to a specific value, as in our snake example, assign this statement as the null hypothesis, which results in the following: H0: μ = 6.0 days and H1: μ ≠ 6.0 days.

Often hypothesis testing is performed by researchers who want to prove that their discovery is an improvement over current products or procedures. For example, if I invented a golf ball that I claimed would increase your distance off the tee by more than 20 yards, I would set up my hypothesis as follows: H0: μ ≤ 20 yards and H1: μ > 20 yards. Note that I used the alternative hypothesis to represent the claim that I want to prove statistically so that I can make a fortune selling these balls to desperate golfers such as myself. Because of this, the alternative hypothesis is also known as the research hypothesis: it represents the position that the researcher wants to establish.

Two-Tail Hypothesis Test

A two-tail hypothesis test is used whenever the alternative hypothesis is expressed as ≠. Our snake example would involve a two-tail test because the alternative hypothesis is stated as H1: μ ≠ 6.0. This test is shown graphically in Figure 15.1 which, as you can see, is considered a two-tail hypothesis test. The curve in the figure represents the sampling distribution of the mean for the number of days to catch a snake. The mean of the population, assumed to be 6.0 days according to the null hypothesis, is the mean of the sampling distribution and is designated by $\mu_{H_0}$. The procedure is as follows:
- collect a sample of size n, and calculate the test statistic, which in this case is the sample mean.
- plot the sample mean on the x-axis of the sampling distribution curve.
- if the sample mean falls within the white region, we do not reject H0. That is, we do not have enough evidence to support H1, the alternative hypothesis, which states that the population mean is not equal to 6.0 days.
- if the sample mean falls in either shaded region, otherwise known as the rejection region, we reject H0. That is, we have enough evidence to support H1, which results in our belief that the true population mean is not equal to 6.0 days.

Because there are two rejection regions in this figure, we have a two-tail hypothesis test. Because our conclusions are based on a sample, we will never have enough evidence to accept the null hypothesis. It's a much safer statement to say that we do not have enough evidence to reject H0. We can use the analogy of the legal system to explain. If a jury finds a defendant not guilty, they are not saying the defendant is innocent. Rather, they are saying that there is not enough evidence to prove guilt.

One-Tail Hypothesis Test

A one-tail hypothesis test involves the alternative hypothesis being stated as < or >. My golf ball example results in a one-tail test because the alternative hypothesis is expressed as H1: μ > 20. Here, there is only one rejection region, which is the shaded area on the right tail of the distribution. We follow the same procedure outlined for the two-tail test and plot the sample mean, which represents the average increase in distance from the tee with my new golf ball. Two possible scenarios exist.
- if the sample mean falls within the white region, we do not reject H0. That is, we do not have enough evidence to support H1, the alternative hypothesis, which states that my golf ball increased distance off the tee by more than 20 yards. There goes my fortune down the drain!
- if the sample mean falls in the rejection region, we reject H0. That is, we have enough evidence to support H1, which confirms my claim that my new golf ball will increase distance off the tee by more than 20 yards.

Errors Occurring During Sampling: Type I and Type II Errors

Remember that the purpose of the hypothesis test is to verify the validity of a claim about a population based on a single sample.
Because we are relying on a sample, we expose ourselves to the risk that our conclusions about the population will be wrong. Using the golf ball example, suppose that my sample falls within the Reject H0 region of the last figure. That is, according to the sample, my golf ball increases distance off the tee by more than 20 yards. But what if the true population mean is actually much less than 20 yards? This can occur primarily because of sampling error. This type of error, when we reject H0 when in reality it's true, is known as a Type I error. The probability of making a Type I error is known as α, the level of significance.

We also can experience another type of error with hypothesis testing. Let's say the golf ball sample fell within the Do Not Reject H0 region of the last figure. That is, according to the sample, my golf ball does not increase the distance off the tee by more than 20 yards. But what if the true population mean is actually much more than 20 yards? This type of error, when we do not reject H0 when in reality it's false, is known as a Type II error. The probability of making a Type II error is known as β. Normally, with hypothesis testing, we decide on a value for α that is somewhere between 0.01 and 0.10 before we collect the sample.

Example of a Two-Tail Hypothesis Test

I stated the hypotheses for the snake example as H0: μ = 6.0 days and H1: μ ≠ 6.0 days, where μ = the mean number of days to catch a loose snake in a home. Let's say that I know that the standard deviation of the population, σ, is 0.5 days, and my sample size to test the hypothesis, n, is 30 homes. We'll also set α = 0.05, which means I'm willing to accept a 5 percent chance of committing a Type I error. Our first step is to calculate the standard error of the mean, $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 0.5 / \sqrt{30} = 0.0913$ days. Let's assume the sample mean from the 30 homes is 6.1 days.
What is our conclusion about our estimate of the population mean, μ? To answer this, we next have to determine the critical z-score, which corresponds to α = 0.05. Because this is a two-tail test, this area needs to be evenly divided between both tails, with each tail receiving α/2 = 0.025. According to Figure 15.3, we need to find the critical z-score that corresponds to the area 0.950 + 0.025 = 0.975. As you can see, the 0.950 area is derived from 1 - α. Using Table 3 in Appendix B, we look for the closest value to 0.9750 in the body of the table. We can find this value by looking across row 1.9 and down column 0.06 to arrive at the z-score of +1.96 for the right tail and -1.96 for the left tail.

Using the Scale of the Original Variable

Now let's determine the rejection region using the scale of the original variable, which in this case is the number of days. To calculate the upper and lower limits of the rejection region, we use the following equations.
- we use the z-scores from the standard normal distribution when n ≥ 30 and σ is known.
- limits of the rejection region = $\mu_{H_0} \pm z_c \sigma_{\bar{x}}$
- where $\mu_{H_0}$ = the population mean assumed by the null hypothesis
- for our snake example: upper limit = 6.0 + (1.96)(0.0913) = 6.18 days, lower limit = 6.0 - (1.96)(0.0913) = 5.82 days
- because our sample mean is 6.1 days, this falls within the Do Not Reject H0 region. Our conclusion is that the difference between 6.1 days and 6.0 days is merely due to chance variation, and we do not have enough evidence to conclude that the population mean differs from 6 days.

Using the Standardized Normal Scale

We can arrive at the same conclusion by setting up the boundaries for the rejection region using the standardized normal scale. We do this by calculating the z-score that corresponds to the sample mean as follows:

$z = \frac{\bar{x} - \mu_{H_0}}{\sigma_{\bar{x}}} = +1.09$

Be sure to distinguish between the calculated z-score and the critical z-score. The calculated z-score, z, represents the number of standard deviations between the sample mean and $\mu_{H_0}$, the population mean according to the null hypothesis.
The critical z-score, z_c, is based on the significance level, α, and determines the boundary for the rejection region. Because the calculated z-score of +1.09 is within the Do Not Reject H0 region, the conclusions of both techniques are consistent.

Example of a One-Tail Hypothesis Test

Because I formulated the alternative hypothesis for the golf ball example as > 20, this becomes a one-tail test. The hypothesis for this example is stated as: H0: μ ≤ 20 yards and H1: μ > 20 yards, where μ = the mean increase in yards off the tee using my new golf ball. Let's say that I know that the standard deviation of the population, σ, is 5.3 yards and my sample size to test the hypothesis, n, is 40 golfers. For this example, we'll set α = 0.01. The standard error of the mean, σ_x̄, will now be equal to σ / sqrt(n) = 0.838 yards. Let's assume the sample mean from the 40 golfers is 22.5 yards.

What is our conclusion about our estimate of the population mean, μ? Once again, we next have to determine the critical z-score, which corresponds to α = 0.01. Because this is a one-tail test, this entire area needs to be in one rejection region on the right side of the distribution. We need to find the z-score that corresponds to the area 0.99, or 1 - α, which is z_c = 2.33. To calculate the limit for this rejection region using the scale of the original variable, we use: Limit = μ_H0 + z_c σ_x̄ = 20 + 2.33(0.838) = 21.95 yards. Because our sample mean is 22.5 yards, this falls within the Reject H0 region. Our conclusion is that we have enough evidence to support the hypothesis that the mean increase in distance off the tee with my new balls exceeds 20 yards.

Advanced Inferential Statistics

- We can determine whether two categorical variables are related (chi-square), compare three or more populations (analysis of variance), and describe the strength and direction of the relationship between two variables (simple regression).
The Chi-Square Probability Distribution

- We can confirm whether a set of data follows a specific probability distribution, such as the binomial or Poisson.
- We can determine whether two variables are statistically independent.

We discussed the different types of data measurement scales, which were nominal, ordinal, interval, and ratio. Here is a brief refresher of each:
- Nominal level of measurement deals strictly with qualitative data. Observations are simply assigned to predetermined categories. One example is gender of the respondent, with the categories being male and female.
- Ordinal measurement is the next level up. It has all the properties of nominal data with the added feature that we can rank order the values from highest to lowest. An example would be ranking a movie as great, good, fair, or poor.
- Interval level of measurement involves strictly quantitative data. Here we can use the mathematical operations of addition and subtraction when comparing values. For this data, the difference between the different categories can be measured with actual numbers and also provides meaningful information. Temperature measurement in degrees Fahrenheit is a common example here.
- Ratio level is the highest measurement scale. Now we can perform all four mathematical operations to compare values. Examples of this type of data are age, weight, height, and salary. Ratio data has all the features of interval data with the added benefit of a true zero point, meaning that a zero data value indicates the absence of the object being measured.
- The chi-square distribution in this chapter will allow us to perform hypothesis testing on nominal and ordinal data. The two major techniques that we will learn about are using the chi-square distribution to perform a goodness-of-fit test and to test for the independence of two variables.

A) Goodness-of-fit test: uses a sample to test whether a frequency distribution fits the predicted distribution.
Can we conclude that the expected movie ratings are true based on the observed ratings of 400 people?
- Stating the null and alternative hypotheses: H0: the sample of observed frequencies supports the claim about the expected frequencies. H1: there is no support for the claim pertaining to the expected frequencies.
- The total number of expected frequencies (E) must be equal to the total number of observed frequencies (O).
- For our movie example, the observed frequencies are simply the number of observations collected for each category of our sample. The expected frequencies are the expected number of observations for each category and are calculated in the following table.
- Observed frequencies are the number of actual observations noted for each category of a frequency distribution with chi-square analysis. Expected frequencies are the number of observations that would be expected for each category of a frequency distribution, assuming the null hypothesis is true, with chi-square analysis.
- Calculating the chi-square statistic: χ² = Σ (O - E)² / E, summed over all categories.
- Determining the critical chi-square score, which depends on the number of degrees of freedom, d.f. = k - 1, where k = the number of categories in the frequency distribution. For our example, k = 5. The critical chi-square score is read from the table. For α = 0.10 and d.f. = 4, it's 7.779.
- The calculated chi-square score of 9.95 is within the Reject H0 region, which leads us to the conclusion that the actual movie-rating frequency distribution differs from the expected distribution. We will always reject H0 whenever the critical score is less than or equal to the calculated score, χ²_c ≤ χ².
- Also, because the calculated chi-square score for the goodness-of-fit test can only be positive, the hypothesis test will always be a one-tail test with the rejection region on the right side.
- In Excel, the critical score is given by =CHIINV(probability, deg_freedom).
- The chi-square distribution is not symmetrical but rather has a positive skew. The shape of the distribution will change with the number of degrees of freedom. As the number of degrees of freedom increases, the shape of the chi-square distribution becomes more symmetrical.

A Goodness-of-Fit Test with the Binomial Distribution

- Suppose that a certain major league baseball player claims the probability that he will get a hit at any given time is 30 percent. The following table is a frequency distribution of the number of hits per game over the last 100 games. Assume he has come to bat four times in each of the games.
- In other words, in 26 games he had 0 hits, in 34 games he had 1 hit, etc. Test the claim that this distribution follows a binomial distribution with p = 0.30 using α = 0.05.

The hypothesis statement would look like the following: H0: The distribution of hits by the baseball player can be described with the binomial probability distribution using p = 0.30. H1: The distribution differs from the binomial probability distribution using p = 0.30. Our first step is to calculate the frequency distribution for the expected number of hits per game.
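The expected frequencies for the baseball example can also be computed directly from the binomial formula, as in the minimal sketch below. Several things here are assumptions rather than facts from the text: the helper names are hypothetical, the observed counts for 2 hits (30) and for 3 or 4 hits combined (10) are illustrative values chosen only to be consistent with the 100-game total, and the last two categories are pooled so that every expected count is at least five, a common rule of thumb for this test. The critical value is the standard table value for α = 0.05 with 3 degrees of freedom.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

games, at_bats, p_hit = 100, 4, 0.30   # values stated in the text

# Expected number of games with 0, 1, 2, 3, 4 hits.
expected = [games * binomial_pmf(k, at_bats, p_hit) for k in range(at_bats + 1)]

# Pool 3 and 4 hits so each expected frequency is at least five.
expected_pooled = expected[:3] + [sum(expected[3:])]

# 26 and 34 come from the text; 30 and 10 are ASSUMED illustrative counts.
observed_pooled = [26, 34, 30, 10]

chi2 = chi_square_statistic(observed_pooled, expected_pooled)
CHI2_CRITICAL = 7.815   # table value: alpha = 0.05, d.f. = 4 - 1 = 3

print([round(e, 2) for e in expected])
print(f"chi-square = {chi2:.2f}")
print("reject H0" if chi2 >= CHI2_CRITICAL else "do not reject H0")
```

Computing the probabilities in code avoids the table lookup, but the decision rule is unchanged: the calculated score is compared with the critical score in the right tail.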
To do this, we need to look up the binomial probabilities in the binomial table for n = 4 (the number of trials per game) and p = 0.30 (the probability of a success).
- Before continuing, we need to make one adjustment to the expected frequencies. When using the chi-square test, we need at least five observations in each of the expected frequency categories. If there are fewer than five, we need to combine categories. In the previous table, we will combine 3 and 4 hits per game into one category to meet this requirement.

According to Figure 18.4, the calculated chi-square score of 2.20 is within the Do Not Reject H0 region, which leads us to the conclusion that the baseball player's hitting distribution can be described with the binomial distribution using p = 0.30.

C) Chi-Square Test for Independence

This is known as a contingency table, which shows the observed frequencies of two variables. In this case, the variables are warm-up time and tennis player. The table is organized into r rows and c columns. For our table, r = 2 and c = 3. An intersection of a row and column is known as a cell. A contingency table has r x c cells, which in our case would be 6. The chi-square test of independence will determine whether the proportion of times that Debbie wins is the same for all three warm-up periods. If the outcome of the hypothesis test is that the proportions are not the same, we conclude that the length of warm-up does impact the performance of the players. First we state the hypotheses as: H0: Warm-up time is independent of performance. H1: Warm-up time affects performance.

The chi-square test of independence only investigates whether a relationship exists between two variables. It does not conclude anything about the direction of the relationship. In other words, from a statistical perspective, Debbie cannot claim that she is disadvantaged by the short warm-up time. She can only claim that warm-up time has some effect on her performance.

Analysis of Variance
One-Way Analysis of Variance

If you want to compare the means for three or more populations, ANOVA is the test for you. Let's say I'm interested in determining whether there is a difference in consumer satisfaction ratings between three fast-food chains. I would collect a sample of satisfaction ratings from each chain and test to see whether there is a significant difference between the sample means. Essentially, I'm testing to see whether the variations in customer ratings from the previous table are due to the fast-food chains or whether the variations are purely random. In other words, do customers perceive any differences in satisfaction between the three chains? If I reject the null hypothesis, however, my only conclusion is that a difference does exist. Analysis of variance does not allow me to compare population means to one another to determine which is greater. That task requires further analysis.

To use one-way ANOVA, the following conditions must be present:
- The populations of interest must be normally distributed.
- The samples must be independent of each other.
- Each population must have the same variance.

A factor in ANOVA describes the cause of the variation in the data. In the previous example, the factor would be the fast-food chain. When only one factor is being considered, as it is here, the procedure is known as one-way ANOVA. A level in ANOVA describes the number of categories within the factor of interest. For our example, we have three levels based on the three different fast-food chains being examined.

2) Completely Randomized ANOVA

The simplest type of ANOVA is known as completely randomized one-way ANOVA, which involves an independent random selection of observations for each level of one factor. Example: I'm interested in comparing the effectiveness of three lawn fertilizers.
Suppose I select 18 random patches of my precious lawn and apply either Fertilizer 1, 2, or 3 to each of them. After a week, I mow the patches and weigh the grass clippings. The factor in this example is fertilizer. There are three levels, representing the three types of fertilizer we are testing. The table that follows indicates the weight of the clippings in pounds from each patch. The mean and variance of each level are also shown. We'll refer to the data for each type of fertilizer as a sample. From the previous table, we have three samples, each consisting of six observations. The hypotheses can be stated as: H0: μ1 = μ2 = μ3 and H1: not all μ's are equal, where μ1, μ2, and μ3 are the true population means for the pounds of grass clippings for each type of fertilizer.

3) Partitioning the Sum of Squares

The hypothesis test for ANOVA compares two types of variations from the samples. We first need to recognize that the total variation in the data from our samples can be divided, or as statisticians like to say, partitioned, into two parts. The first part is the variation within each sample, which is officially known as the sum of squares within (SSW). This can be found using the following equation: SSW = Σ (n_i - 1) s_i², summed over i = 1 to k, where k = the number of samples (or levels), n_i = the size of sample i, and s_i² = the variance of sample i. For the fertilizer example, k = 3 and SSW = 18.35. ANOVA does not require that all the sample sizes are equal, as they are in the fertilizer example. The second part is the variation between the samples, known as the sum of squares between (SSB).

Finally, the total variation of all the observations is known as the total sum of squares (SST) and can be found by: SST = Σ (x - grand mean)², summed over all N observations. This equation may look nasty, but it is just the difference between each observation and the grand mean, squared and then totaled over all of the observations. This is clarified more in the following table. Note that we can determine the variance of the original 18 observations, s², by: s² = SST / (N - 1). This result can be confirmed by using the variance function in Excel.
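The partition can be verified numerically with a short sketch. The 18 observations below are an assumption: the data table is not reproduced in these notes, so the values are illustrative ones chosen to be consistent with the sums of squares quoted for the fertilizer example (SSW about 18.35, SSB about 10.86). The function name is hypothetical.

```python
from statistics import mean, variance

def partition_sum_of_squares(samples):
    """Return (SSW, SSB, SST) for a list of samples. Hypothetical helper."""
    all_obs = [x for sample in samples for x in sample]
    grand_mean = mean(all_obs)
    # Within: (n_i - 1) * s_i^2 for each sample.
    ssw = sum((len(s) - 1) * variance(s) for s in samples)
    # Between: n_i * (sample mean - grand mean)^2 for each sample.
    ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Total: squared distance of every observation from the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_obs)
    return ssw, ssb, sst

# ASSUMED illustrative data: pounds of clippings, six patches per fertilizer.
samples = [
    [10.2, 8.5, 8.4, 10.5, 9.0, 8.1],    # Fertilizer 1
    [11.6, 12.0, 9.2, 10.3, 9.9, 12.5],  # Fertilizer 2
    [8.1, 9.0, 10.7, 9.1, 10.5, 9.5],    # Fertilizer 3
]

ssw, ssb, sst = partition_sum_of_squares(samples)
n_total = sum(len(s) for s in samples)
print(f"SSW = {ssw:.2f}, SSB = {ssb:.2f}, SST = {sst:.2f}")
print(f"SSW + SSB = {ssw + ssb:.2f}")
print(f"variance of all observations = {sst / (n_total - 1):.2f}")
```

The key property to notice is that SSW and SSB add up to SST, whatever data are used. Small differences in the last decimal place against hand calculations come from rounding intermediate values.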
4) Determining the Calculated F-Statistic

To test the hypothesis for ANOVA, we need to compare the calculated test statistic to a critical test statistic using the F-distribution. The calculated F-statistic can be found using the equation: F = MSB / MSW, where MSB is the mean square between, found by MSB = SSB / (k - 1), and MSW is the mean square within, found by MSW = SSW / (N - k), with N = the total number of observations.

Example: MSB = 10.86 / (3 - 1) = 5.43, MSW = 18.35 / (18 - 3) = 1.22, F = 5.43 / 1.22 = 4.45.

If the variation between the samples (MSB) is much greater than the variation within the samples (MSW), we will tend to reject the null hypothesis and conclude that there is a difference between population means. To complete our test for this hypothesis, we need to introduce the F-distribution. The mean square between (MSB) is a measure of variation between the sample means. The mean square within (MSW) is a measure of variation within each sample. A large MSB variation, relative to the MSW variation, indicates that the sample means are not very close to one another. This condition will result in a large value of F, the calculated F-statistic. The larger the value of F, the more likely it will exceed the critical F-statistic (to be determined shortly), leading us to conclude there is a difference between population means.

5) Determining the Critical F-Statistic

We use the F-distribution to determine the critical F-statistic, which is compared to the calculated F-statistic for the ANOVA hypothesis test. The critical F-statistic, F(α, k - 1, N - k), depends on two different degrees of freedom, which are determined by: v1 = k - 1 and v2 = N - k. For our fertilizer example: v1 = 3 - 1 = 2 and v2 = 18 - 3 = 15. The critical F-statistic is read from the F-distribution table. For α = 0.05, the table gives F(0.05, 2, 15) = 3.682. Because the calculated F-statistic of 4.45 exceeds this value, we reject H0.

Even though we have rejected H0 and concluded that the population means are not all equal, ANOVA does not allow us to make comparisons between means. In other words, we do not have enough evidence to conclude that Fertilizer 2 produces more grass clippings than Fertilizer 1.
This requires another test known as pairwise comparisons, which we'll address later in this chapter.

6) Using Excel to Perform One-Way ANOVA

1. Start by placing the fertilizer data in Columns A, B, and C in a blank sheet.
2. Go to the Tools menu and select Data Analysis. (Refer to the section Installing the Data Analysis Add-in from Chapter 2 if you don't see the Data Analysis command on the Tools menu.)
3. From the Data Analysis dialog box, select Anova: Single Factor as shown in Figure 19.2 and click OK.
4. Set up the Anova: Single Factor dialog box according to Figure 19.3.
5. Click OK.

Figure 19.4 shows the final ANOVA results. Notice that the p-value = 0.0305 for this test, meaning we can reject H0, because this p-value < α. If you remember, we had set α = 0.05 when we stated the hypothesis test.
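The arithmetic behind the one-way ANOVA table can be retraced from the sums of squares alone. The sketch below is a minimal illustration with a hypothetical function name. The critical value 3.682 is the standard F-table entry for α = 0.05 with 2 and 15 degrees of freedom, hard-coded here because the Python standard library has no F-distribution.

```python
def one_way_anova_f(ssb, ssw, k, n_total):
    """Return (MSB, MSW, F) for one-way ANOVA. Hypothetical helper."""
    msb = ssb / (k - 1)          # mean square between, v1 = k - 1
    msw = ssw / (n_total - k)    # mean square within, v2 = N - k
    return msb, msw, msb / msw

# Fertilizer example: sums of squares as quoted in the text.
msb, msw, f_calc = one_way_anova_f(ssb=10.86, ssw=18.35, k=3, n_total=18)

F_CRITICAL = 3.682   # table value for alpha = 0.05, v1 = 2, v2 = 15

print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_calc:.2f}")
print("reject H0" if f_calc > F_CRITICAL else "do not reject H0")
```

Without rounding MSW to two decimals first, F comes out near 4.44 rather than the 4.45 shown in the hand calculation. The conclusion is the same.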
7) Pairwise Comparisons

8) Completely Randomized Block ANOVA

One concern in this scenario is that the variations in the lawns will account for some of the variation in the three fertilizers, which may interfere with our hypothesis test. We can control for this possibility by using a completely randomized block ANOVA, which is used in the previous table. The type of fertilizer is still the factor, and the lawns are called blocks. There are two hypotheses for the completely randomized block ANOVA. The first (primary) hypothesis tests the equality of the population means, just like we did earlier with one-way ANOVA: H0: μ1 = μ2 = μ3 and H1: not all μ's are equal, where μ1, μ2, and μ3 are the true population means for the pounds of grass clippings for each type of fertilizer. The secondary hypothesis tests the effectiveness of the blocking variable as follows: H0': the block means are all equal, and H1': the block means are not all equal. The blocking variable would be an effective contributor to our ANOVA model if we can reject H0' and claim that the block means are not equal to each other.

9) Partitioning the Sum of Squares

For the completely randomized block ANOVA, the sum of squares total is partitioned into three parts according to the following equation: SST = SSW + SSB + SSBL, where SSW = sum of squares within, SSB = sum of squares between, and SSBL = sum of squares for the blocking variable (lawns). Fortunately for us, the calculations for SST and SSB are identical to the one-way ANOVA procedure that we've already discussed, so those values remain unchanged (SST = 29.21 and SSB = 10.86). We can find the sum of squares block (SSBL) by using the equation: SSBL = k Σ (x̄_j - grand mean)², summed over the b blocks, where x̄_j = the mean of block j, k = the number of levels, and b = the number of blocks. SSW is then whatever remains: SSW = SST - SSB - SSBL.

10) Determining the Calculated F-Statistic

Since we have two hypothesis tests for the completely randomized block ANOVA, we have two calculated F-statistics.
The F-statistic to test the equality of the population means (the original hypothesis) is found using: F = MSB / MSW, where MSB is the mean square between, found by MSB = SSB / (k - 1), and MSW = SSW / ((k - 1)(b - 1)).

Example: MSB = 10.86 / (3 - 1) = 5.43, MSW = 17.63 / ((3 - 1)(6 - 1)) = 1.76, F = 5.43 / 1.76 = 3.09.

The second F-statistic will test the significance of the blocking variable (the second hypothesis) and will be denoted F'. We will determine this statistic using: F' = MSBL / MSW, where MSBL is the mean square blocking, found by MSBL = SSBL / (b - 1).

Example: MSBL = 0.72 / (6 - 1) = 0.14, F' = MSBL / MSW = 0.14 / 1.76 = 0.08.

1) First, we examine the primary hypothesis, H0, that all population means are equal, using α = 0.05. The degrees of freedom for this critical F-statistic would be: v1 = k - 1 = 3 - 1 = 2 and v2 = (k - 1)(b - 1) = (3 - 1)(6 - 1) = 10.
2) The critical F-statistic from the tables is F(0.05, 2, 10) = 4.103.
3) Since the calculated F-statistic equals 3.09 and is less than this critical F-statistic, we fail to reject H0 and cannot conclude that the fertilizer means are different.
4) We examine the secondary hypothesis, H0', concerning the effectiveness of the blocking variable, also using α = 0.05. The degrees of freedom for this critical F-statistic would be: v1' = b - 1 = 5 and v2' = (k - 1)(b - 1) = 2 x 5 = 10. The critical F-statistic from the table is F = 3.326. Since the calculated F-statistic F' equals 0.08 and is less than this critical F-statistic, we fail to reject H0' and cannot conclude that the block means are different.
5) What does this mean? Since we failed to reject H0', the hypothesis that states the blocking means are equal, the blocking variable (lawns) proved not to be effective and should not be included in the model. Including an ineffective blocking variable in the ANOVA increases the chance of a Type II error in the primary hypothesis, H0.
The conclusion of the primary hypothesis in this example would be more precise without the blocking variable. In fact, this is what essentially happened when we included the blocking variable with the randomized block design. With the blocking variable present in the model, we failed to discover a difference in the population means. Now go back to the beginning of the chapter. When we tested the population means using one-way ANOVA (without a blocking variable), we concluded that the population means were indeed different.

In summary (it's about time!), if you feel there is a variable present in your model that could contribute undesirable variation, such as taking samples from different lawns, use the randomized block ANOVA. First test H0', the blocking hypothesis. If you reject H0', the blocking procedure was effective. Proceed to test H0, the primary hypothesis concerning the population means, and draw your conclusions. If you fail to reject H0', the blocking procedure was not effective. Redo the analysis using one-way ANOVA (without blocking) and draw your conclusions.

Summary:
- Analysis of variance, also known as ANOVA, compares the means of three or more populations.
- A factor in ANOVA describes the cause of the variation in the data. When only one factor is being considered, the procedure is known as one-way ANOVA.
- A level in ANOVA describes the number of categories within the factor of interest.
- The simplest type of ANOVA is known as completely randomized one-way ANOVA, which involves an independent random selection of observations for each level of one factor.
- Completely randomized block ANOVA controls for variations from sources other than the factors of interest. This is accomplished by grouping the samples using a blocking variable.
- After rejecting H0 using ANOVA, we can determine which of the sample means are different using the Scheffé test.
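The two F-statistics of the randomized block example can be retraced from the quoted sums of squares. This is a minimal sketch with a hypothetical function name. The critical values 4.103 and 3.326 are the table values cited in the text for α = 0.05.

```python
def block_anova_f(sst, ssb, ssbl, k, b):
    """Return (SSW, F, F') for a randomized block ANOVA. Hypothetical helper."""
    ssw = sst - ssb - ssbl                 # what is left after both effects
    msw = ssw / ((k - 1) * (b - 1))        # mean square within
    msb = ssb / (k - 1)                    # mean square between (fertilizers)
    msbl = ssbl / (b - 1)                  # mean square blocking (lawns)
    return ssw, msb / msw, msbl / msw

# Values quoted in the text: 3 fertilizers (k), 6 lawns (b).
ssw, f_primary, f_block = block_anova_f(sst=29.21, ssb=10.86, ssbl=0.72, k=3, b=6)

F_CRIT_PRIMARY = 4.103   # F(0.05, 2, 10), from the text
F_CRIT_BLOCK = 3.326     # F(0.05, 5, 10), from the text

print(f"SSW = {ssw:.2f}")
print(f"F  = {f_primary:.2f} -> reject H0:  {f_primary > F_CRIT_PRIMARY}")
print(f"F' = {f_block:.2f} -> reject H0': {f_block > F_CRIT_BLOCK}")
```

Carrying full precision gives F near 3.08 instead of the hand-rounded 3.09. Neither statistic exceeds its critical value, which matches the conclusion that the blocking variable was not effective here.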
Correlation and Simple Regression

- How two variables relate to one another.
- Determine whether a relationship does indeed exist between the variables.
- Describe the nature of this relationship in mathematical terms.

Independent versus Dependent Variables

Example: investigate the relationship between the number of hours that a student studies for a statistics exam and the grade for that exam.
Obviously, we would expect the number of hours studying to affect the grade. The Hours Studied variable is considered the independent variable (x) because it causes the observed variation in the Exam Grade, which is considered the dependent variable (y). The data from the previous table is considered ordered pairs of (x, y) values, such as (3, 86) and (5, 95). This causal relationship between independent and dependent variables only exists in one direction, as shown here: Independent variable (x) -> Dependent variable (y). This relationship does not work in reverse. For instance, we would not expect that the exam grade variable would cause the student to study a certain number of hours in our previous example. Other examples of independent and dependent variables are shown in the following table.

Correlation

Correlation measures both the strength and direction of the relationship between x and y. Figure 20.1 illustrates the different types of correlation in a series of scatter plots, which graph each ordered pair of (x, y). The convention is to place the x variable on the horizontal axis and the y variable on the vertical axis. Graph A in Figure 20.1 shows an example of positive linear correlation where, as x increases, y also tends to increase in a linear (straight line) fashion. Graph B shows a negative linear correlation where, as x increases, y tends to decrease linearly. Graph C indicates no correlation between x and y. This set of variables appears to have no impact on each other. And finally, Graph D is an example of a nonlinear relationship between variables. As x increases, y decreases at first and then changes direction and increases.

Correlation Coefficient

The correlation coefficient, r, provides us with both the strength and direction of the relationship between the independent and dependent variables. Values of r range between -1.0 and +1.0.
When r is positive, the relationship between x and y is positive (Graph A from Figure 20.1), and when r is negative, the relationship is negative (Graph B). A correlation coefficient close to 0 is evidence that there is no linear relationship between x and y (Graph C). The strength of the relationship between x and y is measured by how close the correlation coefficient is to +1.0 or -1.0 and can be viewed in Figure 20.2. Graph A illustrates a perfect positive correlation between x and y with r = +1.0. Graph B shows a perfect negative correlation between x and y with r = -1.0. Graphs C and D are examples of weaker relationships between the independent and dependent variables. We can calculate the actual correlation coefficient using the following equation: r = [nΣxy - (Σx)(Σy)] / sqrt( [nΣx² - (Σx)²] [nΣy² - (Σy)²] ). For the exam grade example, the sums are taken over n = 6 ordered pairs.

Testing the Significance of the Correlation Coefficient

We can perform a hypothesis test to determine whether the population correlation coefficient, ρ, is significantly different from 0 based on the value of the calculated correlation coefficient, r. We can state the hypotheses as: H0: ρ ≤ 0 and H1: ρ > 0. This statement tests whether a positive correlation exists between x and y. I could also choose a two-tail test that would investigate whether any correlation exists (either positive or negative) by setting H0: ρ = 0 and H1: ρ ≠ 0. The test statistic for the correlation coefficient uses the Student's t-distribution as follows: t = r / sqrt( (1 - r²) / (n - 2) ), where r = the calculated correlation coefficient from the ordered pairs and n = the number of ordered pairs. The calculated t-statistic for the exam grade example is then compared with the critical t-statistic.
The critical t-statistic is based on d.f. = n - 2. If we choose α = 0.05, tc = 2.132 from Table 4 in Appendix B for a one-tail test. Because t > tc, we reject H0 and conclude that there is indeed a positive correlation between hours of study and the exam grade.

Using Excel to Calculate Correlation Coefficients

- =CORREL(array1, array2), where array1 = the range of data for the first variable and array2 = the range of data for the second variable.

Simple Regression

The technique of simple regression enables us to describe a straight line that best fits a series of ordered pairs (x, y). The equation for a straight line, known as a linear equation, takes the form: ŷ = a + bx, where ŷ = the predicted value of y, given a value of x, x = the independent variable, a = the y-intercept for the straight line, and b = the slope of the straight line. The y-intercept is the point where the line crosses the y-axis, which in this case is a = 2. The slope of the line, b, is shown as the ratio of the rise of the line over the run of the line, shown as b = 0.5. A positive slope indicates the line is rising from left to right. A negative slope, you guessed it, moves lower from left to right. If b = 0, the line is horizontal, which means there is no relationship between the independent and dependent variables. In other words, a change in the value of x has no effect on the value of y.

Figure 20.5 shows six ordered pairs and a line that appears to fit the data, described by the equation ŷ = 2 + 0.5x. Figure 20.5 also shows a data point that corresponds to the ordered pair x = 2 and y = 4. Notice that the predicted value of y according to the line at x = 2 is ŷ = 3. We can verify this using the equation as follows: ŷ = 2 + 0.5x = 2 + 0.5(2) = 3. The value of y represents an actual data point, while the value of ŷ is the predicted value of y using the linear equation, given a value for x. Our next step is to find the linear equation that best fits a set of ordered pairs.
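Before moving on, the correlation coefficient that CORREL returns, and the t-statistic used to test it, can be reproduced by hand with a short sketch. The helper names are hypothetical. Only the pairs (3, 86) and (5, 95) come from the text; the other four pairs are assumed illustrative values, since the exam grade table is not reproduced in these notes.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r from raw sums. Hypothetical helper."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

def correlation_t(r, n):
    """t-statistic for testing the correlation, with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))

# ASSUMED illustrative data, except (3, 86) and (5, 95), which the text cites.
hours = [3, 5, 4, 4, 2, 3]
grades = [86, 95, 92, 83, 78, 82]

r = correlation(hours, grades)
t = correlation_t(r, len(hours))
T_CRITICAL = 2.132   # one-tail, alpha = 0.05, d.f. = 4, from the text

print(f"r = {r:.3f}, t = {t:.2f}")
print("reject H0" if t > T_CRITICAL else "do not reject H0")
```

The same r could be obtained in Excel with CORREL on the two columns. Writing out the sums makes it easier to see where each term of the equation comes from.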
The Least Squares Method

The least squares method is a mathematical procedure to identify the linear equation that best fits a set of ordered pairs by finding values for a, the y-intercept, and b, the slope. The goal of the least squares method is to minimize the total squared error between the values of y and ŷ. If we define the error as y - ŷ for each data point, the least squares method will minimize Σ (y_i - ŷ_i)², summed over i = 1 to n, where n is the number of ordered pairs around the line that best fits the data. According to Figure 20.6, the line that best fits the data, the regression line, will minimize the total squared error of the four data points.

I'll demonstrate how to determine this regression equation using the least squares method through the following example. Because my goal is to investigate whether the number of items is increasing over time, Month will be the independent variable and Number of Items will be the dependent variable. The least squares method finds the linear equation that best fits the data by determining the value for a, the y-intercept, and b, the slope, using the following equations: b = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²] and a = Σy/n - b(Σx/n).
The regression line for the bathroom counter example would be: ŷ = 5.13 + 0.976x. Because the slope of this equation is a positive 0.976, I have evidence that the number of items on the counter is increasing over time at an average rate of nearly one per month. Figure 20.7 shows the regression line with the ordered pairs. My prediction for the number of items on the counter in another six months (Month 16 from my data) will be: ŷ = 5.13 + 0.976x = 5.13 + 0.976(16) = 20.7, or about 21 items.

Confidence Interval for the Regression Line

Just how accurate is my estimate for the number of items on the counter for a particular month? To answer this, we need to determine the standard error of the estimate, se, using the following formula: se = sqrt( Σ(y - ŷ)² / (n - 2) ). The standard error of the estimate measures the amount of dispersion of the observed data around the regression line. If the data points are very close to the line, the standard error of the estimate is relatively low, and vice versa.

We are now ready to calculate a confidence interval for the mean of y around a particular value of x. For Month 8 (x = 8) in the data, Debbie has 11 items (y = 11) on the counter. The regression line predicted she would have: ŷ = 5.13 + 0.976x = 5.13 + 0.976(8) = 12.9 items. The confidence interval is: CI = ŷ ± tc se sqrt( 1/n + (x - x̄)² / (Σx² - (Σx)²/n) ), where: tc = the critical t-statistic from the Student's t-distribution, se = the standard error of the estimate, and n = the number of ordered pairs.

Suppose we would like a 95 percent confidence interval around the mean of y for Month 8. This procedure has n - 2 = 10 - 2 = 8 degrees of freedom, resulting in tc = 2.306 from Table 4 in Appendix B. Our confidence interval is then 12.9 ± 2.16. This interval is shown graphically in Figure 20.8. Our 95 percent confidence interval for the number of items on the counter in Month 8 is between 10.74 and 15.06 items.

Testing the Slope of the Regression Line

Recall that if the slope of the regression line, b, is equal to 0, then there is no relationship between x and y.
In our bathroom counter example, we found the slope of the regression line to be 0.976. However, because this result was based on a sample of observations, we need to test to see whether 0.976 is far enough away from 0 to claim a relationship really does exist between the variables. If β is the slope for the true population, then our hypotheses statement would be: H0: β = 0 and H1: β ≠ 0. If we reject the null hypothesis, we conclude that a relationship does exist between the independent and dependent variables based on our sample. We'll test this using α = 0.01. This hypothesis test requires the standard error of the slope, sb, which is found with the following equation: sb = se / sqrt( Σx² - n x̄² ), where se is the standard error of the estimate that we calculated earlier. The test statistic for this hypothesis is t = (b - β_H0) / sb, where β_H0 is the value of the population slope according to the null hypothesis. The critical t-statistic is taken from the Student's t-distribution with n - 2 = 10 - 2 = 8 degrees of freedom. With a two-tail test and α = 0.01, tc = 3.355 according to Table 4 in Appendix B. Because t > tc, we reject the null hypothesis and conclude there is a relationship between the month and the number of items on the bathroom countertop.

The Coefficient of Determination

Another way of measuring the strength of a relationship is with the coefficient of determination, r². This represents the percentage of the variation in y that is explained by the regression line. We find this value by simply squaring r, the correlation coefficient. For the bathroom example, squaring the correlation coefficient gives r² = 0.663. In other words, 66.3 percent of the variation in the number of items on the counter is explained by the Month variable. If r² = 1, all of the variation in y is explained by the variable x. If r² = 0, none of the variation in y is explained by the variable x.

A Simple Regression Example with Negative Correlation

Both of these past examples have involved a positive relationship between x and y. Now this example will summarize performing simple regression with a negative relationship. Recently, I had the opportunity to bond with my son Brian as we shopped for his first car when he turned 16. Brian had visions of Mercedes and BMWs dancing in his head, whereas I was thinking more along the line of Hondas and Toyotas. After many discussions on the matter, we compromised on looking for 1999 Volkswagen Jettas. However, Brian had two requirements:
- It had to be black.
- It had to be the new body style.

Apparently, somebody at Volkswagen had the brilliant idea back in 1999 to subtly change the design of the Jetta halfway through the production year. Personally, I would never have noticed the difference. Brian, on the other hand, wouldn't be caught dead driving the original version, essentially eliminating half the used 1999 Volkswagen Jettas on the market. Anyway, what follows is a table showing the mileage of eight cars with the new body style and their asking price. The remainder of this chapter demonstrates the correlation and regression technique using this data.

The correlation coefficient is found with the same equation as before, and here it is negative. The negative correlation indicates that as mileage (x) increases, the price (y) decreases, as we would expect. Squaring it gives the coefficient of determination: approximately 57 percent of the variation in price is explained by the variation in mileage. The regression line is determined with the least squares equations for a and b.

What would the predicted price be for a car with 45,000 miles? The regression line would predict that a car with 45,000 miles would be priced at $13,041. What would be the 90 percent confidence interval at x = 45,000? We first need the standard error of the estimate, se. The critical t-statistic for n - 2 = 8 - 2 = 6 degrees of freedom and a 90 percent confidence interval is tc = 1.943 from Table 4 in Appendix B.
Our confidence interval is then:

[equation omitted]

The 90 percent confidence interval for a car with 45,000 miles is between $11,589 and $14,493, that is, $13,041 ± $1,452.

Is the relationship between mileage and price statistically significant at the α = 0.10 level? Our hypothesis statement is: H0: β = 0 and H1: β ≠ 0. The standard error of the slope, sb, is found using:

[equation omitted]

The calculated test statistic for this hypothesis is:

[equation omitted]

The critical t-statistic is taken from the Student's t-distribution with n − 2 = 8 − 2 = 6 degrees of freedom. With a two-tail test and α = 0.10, tc = 1.943 according to Table 4 in Appendix B. Because |t| > |tc|, we reject the null hypothesis and conclude there is a relationship between the mileage and price variables. We use the absolute values because the calculated t-statistic is in the left tail of the t-distribution with a two-tail hypothesis test.

Assumptions for Simple Regression

For all these results to be valid, we need to make sure that the underlying assumptions of simple regression are not violated. These assumptions are as follows:

• Individual differences between the data and the regression line, (yi − ŷi), are independent of one another.
• The observed values of y are normally distributed around the predicted value, ŷ.
• The variation of y around the regression line is equal for all values of x.

Simple Versus Multiple Regression

Simple regression is limited to examining the relationship between a dependent variable and only one independent variable. If more than one independent variable is involved in the relationship, then we need to graduate to multiple regression. The regression equation for this method looks like this:

ŷ = a + b1x1 + b2x2 + ... + bkxk

where x1 through xk are the independent variables and b1 through bk are their coefficients.

Key points from this chapter:

• The independent variable (x) causes variation in the dependent variable (y).
• The correlation coefficient, r, indicates both the strength and direction of the relationship between the independent and dependent variables.
• The technique of simple regression enables us to describe a straight line that best fits a series of ordered pairs (x, y).
• The least squares method is a mathematical procedure to identify the linear equation that best fits a set of ordered pairs by finding values for a, the y-intercept, and b, the slope.
• The standard error of the estimate, se, measures the amount of dispersion of the observed data around the regression line.
• The coefficient of determination, r², represents the percentage of the variation in y that is explained by the regression line.
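
The calculations summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the chapter's own worked example: the function names `simple_regression` and `mean_confidence_interval` are made up for this illustration, the five data points are invented, and the confidence interval is assumed to be the interval for the mean of y at a given x. The critical t-statistic is not computed; it is assumed to be looked up in a t-table such as Table 4 in Appendix B.

```python
import math

def simple_regression(x, y):
    """Fit y = a + b*x by least squares and return the summary statistics."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Sums of squares and cross-products about the means.
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_y = sum(v * v for v in y) - sum_y ** 2 / n
    ss_xy = sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n

    b = ss_xy / ss_x                     # slope
    a = sum_y / n - b * sum_x / n        # y-intercept
    r = ss_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient
    sse = ss_y - b * ss_xy               # variation not explained by the line
    se = math.sqrt(sse / (n - 2))        # standard error of the estimate
    sb = se / math.sqrt(ss_x)            # standard error of the slope
    t = b / sb                           # test statistic for H0: beta = 0
    return {"n": n, "a": a, "b": b, "r": r, "r2": r * r, "se": se,
            "sb": sb, "t": t, "x_mean": sum_x / n, "ss_x": ss_x}

def mean_confidence_interval(fit, x0, t_crit):
    """Interval for the mean of y at x0 (assumed form), given a table t value."""
    y_hat = fit["a"] + fit["b"] * x0
    half_width = t_crit * fit["se"] * math.sqrt(
        1 / fit["n"] + (x0 - fit["x_mean"]) ** 2 / fit["ss_x"])
    return y_hat - half_width, y_hat + half_width

# Hypothetical data, invented for illustration only (not from the chapter).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = simple_regression(x, y)

# Two-tail critical value for alpha = 0.10 and n - 2 = 3 degrees of freedom,
# taken from a standard t-table rather than computed here.
T_CRIT = 2.353
significant = abs(fit["t"]) > T_CRIT   # compare |t| with |tc|, as in the text
low, high = mean_confidence_interval(fit, x0=3, t_crit=T_CRIT)
```

With these made-up numbers the slope is 0.6 and r² is 0.6, but |t| is about 2.12, which falls short of 2.353, so the null hypothesis would not be rejected. That contrast with the chapter's examples is deliberate: a fairly large r² from only five points is not enough, on its own, to establish a significant slope.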