Sei sulla pagina 1di 29

c c


c  cc  
c
  c
Researchers are often interested in defining a value that best describes some attribute of the
population. Often this attribute is a measure of central tendency or a proportion.

 cc  
c  c
Several different measures of central tendency are defined below.

uc phe i is the most frequently appearing value in the population or sample. Suppose
we draw a sample of five women and measure their weights. phey weigh 100 pounds,
100 pounds, 130 pounds, 140 pounds, and 150 pounds. Since more women weigh 100
pounds than any other weight, the mode would equal 100 pounds.

uc po find the i , we arrange the observations in order from smallest to largest value.
If there is an odd number of observations, the median is the middle value. If there is an
even number of observations, the median is the average of the two middle values. phus,
in the sample of five women, the median value would be 130 pounds; since 130 pounds is
the middle weight.

uc phe i  of a sample or a population is computed by adding all of the observations and


dividing by the number of observations. Returning to the example of the five women, the
mean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

  cc   c


When the focus is on the degree to which a population possesses a particular attribute, the
measure of interest is a percentage or a proportion.

uc £    refers to the fraction of the total that possesses a certain attribute. For
example, we might ask what proportion of women in our sample weigh less than 135
pounds. Since 3 women weigh less than 135 pounds, the proportion would be 3/5 or 0.60.

uc £    is another way of expressing a proportion. £ percentage is equal to the


proportion times 100. In our example of the five women, the percent of the total who
weigh less than 135 pounds would be 100 * (3/5) or 60 percent.

c
Of the various measures, the mean and the proportion are most important. phe notation used to
describe these measures appears below:
uc : Refers to a population mean.

uc x: Refers to a sample mean.

uc u: phe proportion of elements in the population that has a particular attribute.

uc p: phe proportion of elements in the |  that has a particular attribute.

uc D: phe proportion of elements in the population that does not have a specified attribute.

Note that D = 1 - u.

uc q: phe proportion of elements in the |  that does not have a specified attribute. Note

that q = 1 - p.

£ 
c
cc c
c c
c  c ccc cc c c
c
c  cc
 cc cc

ccc
c ccc c c c
cc
c c cc
 
c
ccc c c
 
c c  c c
 c  c
c
c
ccc
c
cc
cc   c c cc
cc c c
 
cc
c
!cc
 
cc cc
 c cc
 c"c
c cc c cc
c cc
 c
#c
c  c c"c
cc
c
cc  c c
cc
c
"ccc
c
c c
cc c
c c  c  c c  c
c

cc cc

 c c
$ c

c
 cc cc c   cc
 
c cc c  c
c


c
c
cc ccc c  c c

c
c
c cc c
 c  ccc%  cc
$ c
c
& $cc c c c
 c c  
c cc c c c
c
c
c c
%c
c c
 
c cc c
c c cc
%c
c c
 c c c
c
' c  c

cc c
cc
c c c cc 
c
(c c
 c
 
c c c

c c  c  c
c
c  c  c ccc c
c)cc*+,c cc
cc   c c cc cc
 c*+,c
cc- c c cc

p 
c
cc c cc
 c cc
 
cc cc c c cc cc
 c c c
c 

c

c c-c
 
c cc c c
c$
cccc

c cc c c c
$ c

c c c  c
c
 c c  c
 cc c  c c- cc
c
c
c cc
c
c cc c cc c
c c c
cc
c c

c
c c c c c& $cc c c- cc
c
c c c c cc% c c
c

c c c
c  cc cc
 c c
 c
$ c

c c c cc c c  c
$ c

c
  c
cc ccc
c c  c

c& $cc c- 
cc
c
 
  
c
 cc
c c c c c 
cc c
c
 cc c c' c- cc
 c c.c/c c0c
c/cc
c
 cc
c c c c c 
cc c
cc c cc c c 
c 
c
c c cc 
c.c/c0c1.c
c2/304.cc55c

 c
Next section: primean

phe mode is the most frequently occurring score in a distribution and is used as a measure of
central tendency. phe advantage of the mode as a measure of central tendency is that its meaning
is obvious. Further, it is the only measure of central tendency that can be used with nominal data.

phe mode is greatly subject to sample fluctuations and is therefore not recommended to be used
as the only measure of central tendency. £ further disadvantage of the mode is that many
distributions have more than one mode. phese distributions are called "multi modal."
ù   c c
ùescriptive statistics are used to describe the basic features of the data in a study. phey provide
simple summaries about the sample and the measures. pogether with simple graphics analysis,
they form the basis of virtually every quantitative analysis of data.

ùescriptive statistics are typically distinguished from inferential statistics. With descriptive
statistics you are simply describing what is or what the data shows. With inferential statistics,
you are trying to reach conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer from the sample data what the population might think.
Or, we use inferential statistics to make judgments of the probability that an observed difference
between groups is a dependable one or one that might have happened by chance in this study.
phus, we use inferential statistics to make inferences from our data to more general conditions;
we use descriptive statistics simply to describe what's going on in our data.

ùescriptive Statistics are used to present quantitative descriptions in a manageable form. In a


research study we may have lots of measures. Or we may measure a large number of people on
any measure. ùescriptive statistics help us to simply large amounts of data in a sensible way.
Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a
simple number used to summarize how well a batter is performing in baseball, the batting
average. phis single number is simply the number of hits divided by the number of times at bat
(reported to three significant digits). £ batter who is hitting .333 is getting a hit one time in every
three at bats. One batting .250 is hitting one time in four. phe single number describes a large
number of discrete events. Or, consider the scourge of many students, the Grade uoint £verage
(Gu£). phis single number describes the general performance of a student across a potentially
wide range of course experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk
of distorting the original data or losing important detail. phe batting average doesn't tell you
whether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or
on a streak. phe Gu£ doesn't tell you whether the student was in difficult courses or easy ones,
or whether they were courses in their major field or in other disciplines. Even given these
limitations, descriptive statistics provide a powerful summary that may enable comparisons
across people or other units.

  c


c
Univariate analysis involves the examination across cases of one variable at a time. phere are
three major characteristics of a single variable that we tend to look at:

uc the distribution
uc the central tendency
uc the dispersion
In most situations, we would describe all three of these characteristics for each of the variables in
our study.

 cù  phe distribution is a summary of the frequency of individual values or ranges
of values for a variable. phe simplest distribution would list every value of a variable and the
number of persons who had each value. For instance, a typical way to describe the distribution of
college students is by year in college, listing the number or percent of students at each of the four
years. Or, we describe gender by listing the number or percent of males and females. In these
cases, the variable has few enough values that we can list each one and summarize how many
sample cases had the value. But what do we do for a variable like income or Gu£ With these
variables there can be a large number of possible values, with relatively few people having each
one. In this case, we group the raw scores into categories according to ranges of values. For
instance, we might look at Gu£ according to the letter grade ranges. Or, we might group income
into four or five ranges of income values.

pable 1. Frequency distribution table.

One of the most common ways to describe a single variable is with a a 
 .
ùepending on the particular variable, all of the data values may be represented, or you may
group the values into categories first (e.g., with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value. Rather, the value are grouped
into ranges and the frequencies determined.). Frequency distributions can be depicted in two
ways, as a table or as a graph. pable 1 shows an age frequency distribution with five categories
of age ranges defined. phe same frequency distribution can be depicted in a graph as shown in
Figure 2. phis type of graph is often referred to as a histogram or bar chart.
pable 2. Frequency distribution bar chart.

ùistributions may also be displayed using percentages. For example, you could use percentages
to describe the:

uc percentage of people in different income levels


uc percentage of people in different age ranges
uc percentage of people in different ranges of standardized test scores

  
c   phe central tendency of a distribution is an estimate of the "center" of a
distribution of values. phere are three major types of estimates of central tendency:

uc àean
uc àedian
uc àode

phe  or average is probably the most commonly used method of describing central
tendency. po compute the mean all you do is add up all the values and divide by the number of
values. For example, the mean or average quiz score is determined by summing all the scores
and dividing by the number of students taking the exam. For example, consider the test score
values:

c !c c !c"#cc c

phe sum of these 8 values is 167, so the mean is 167/8 = 20.875.

phe  is the score found at the exact middle of the set of values. One way to compute the
median is to list all scores in numerical order, and then locate the score in the center of the
sample. For example, if there are 500 scores in the list, score #250 would be the median. If we
order the 8 scores shown above, we would get:

 ! !  "#
phere are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores
are 20, the median is 20. If the two middle scores had different values, you would have to
interpolate to determine the median.

phe i is the most frequently occurring value in the set of scores. po determine the mode,
you might again order the scores as shown above, and then count each one. phe most frequently
occurring value is the mode. In our example, the value 15 occurs three times and is the model. In
some distributions there is more than one modal value. For instance, in a bimodal distribution
there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for
the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the
mean, median and mode are all equal to each other.

ù  ùispersion refers to the spread of the values around the central tendency. phere are
two common measures of dispersion, the range and the standard deviation. phe  is simply
the highest value minus the lowest value. In our example distribution, the high value is 36 and
the low is 15, so the range is 36 - 15 = 21.

phe  cù  is a more accurate and detailed estimate of dispersion because an
outlier can greatly exaggerate the range (as was true in this example where the single outlier
value of 36 stands apart from the rest of the values. phe Standard ùeviation shows the relation
that set of scores has to the mean of the sample. £gain lets take the set of scores:

 !  !"# 

to compute the standard deviation, we first find the distance between each value and the mean.
We know from above that the mean is 20.875. So, the differences from the mean are:

15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

Notice that values that are below the mean have negative discrepancies and values above it have
positive ones. Next, we square each discrepancy:

-5.875 * -5.875 = 34.515625


-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum
is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875
/ 7 = 50.125. phis value is known as the   . po get the standard deviation, we take the
square root of the variance (remember that we squared the deviations earlier). phis would be
SDRp(50.125) = 7.079901129253.

£lthough this computation may seem convoluted, it's actually quite simple. po see this, consider
the formula for the standard deviation:

In the top part of the ratio, the numerator, we see that each score has the the mean subtracted
from it, the difference is squared, and the squares are summed. In the bottom part, we take the
number of scores minus 1. phe ratio is the variance and the square root is the standard deviation.
In English, we can describe the standard deviation as:

 c$ c cc cicc c$ c c ic ci c cc ci c
c cic

£lthough we can calculate these univariate statistics by hand, it gets quite tedious when you have
more than a few values and variables. Every statistics program is capable of calculating them
easily for you. For instance, I put the eight scores into SuSS and got the following table as a
result:

N 8
àean 20.8750
àedian 20.0000
àode 15.00
Std. ùeviation 7.0799
Variance 50.1250
Range 21.00

which confirms the calculations I did by hand above.

phe standard deviation allows us to reach some conclusions about specific scores in our
distribution. £ssuming that the distribution of scores is normal or bell-shaped (or close to it!), the
following conclusions can be reached:

uc approximately 68% of the scores in the sample fall within one standard deviation of the
mean
uc approximately 95% of the scores in the sample fall within two standard deviations of the
mean
uc approximately 99% of the scores in the sample fall within three standard deviations of the
mean

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we
can from the above statement estimate that approximately 95% of the scores will fall in the range
of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348. phis kind of
information is a critical stepping stone to enabling us to compare the performance of an
individual on one variable with their performance on another, even when the variables are
measured on entirely different scales.

 cù c


i
c%&i
cc

c  cù c
Õet's say we wanted to calculate the standard deviation for the amounts of gold coins pirates on a
pirate ship have.

phere are 100 pirates on the ship. In statistical terms this means we have a population of 100. If
we know the amount of gold coins each of the 100 pirates have, we use the  c c
$c cc  c

What if we don't know the amount of gold coins each of the 100 pirates have For example, we
only had enough time to ask 5 pirates how many gold coins they have. In statistical terms this
means we have a sample size of 5 and in this case we use the  c c $c cc
i
ccc


phe rest of this example will be done in the case where we have a sample size of 5 pirates,
therefore we will be using the standard deviation equation for a sample of a population.

Here are the amounts of gold coins the 5 pirates have:

4, 2, 5, 8, 6.

Now, let's calculate the standard deviation:

1. Calculate the mean:


2. Calculate for each value in the sample:

3. Calculate :

4. Calculate the standard deviation:

 c c c c cicc


cc c  c cc  'c
cc

à   

phe simplest statistic is the mean or average. Years ago, when laboratories were beginning to
assay controls, it was easy to calculate a mean and use that value as the "target" to be achieved.
For example, given the following ten analyses of a control material - 90, 91, 89, 84, 88, 93, 80,
90, 85, 87 - the mean or bar is 877/10 or 87.7. [phe term   refers to a symbol having a line
or bar over the , , however, we will use the term instead of the symbol in the text of these
lessons because it is easier to present.]
phe mean value characterizes the "central tendency" or "location" of the data. £lthough the mean
is the value most likely to be observed, many of the actual values are different than the mean.
When assaying control materials, it is obvious that technologists will not achieve the mean value
each and every time a control is analyzed. phe values observed will show a dispersion or
distribution about the mean, and this distribution needs to be characterized to set a range of
acceptable control values.

‘    

phe dispersion of values about the mean is predictable and can be characterized mathematically
through a series of manipulations, as illustrated below, where the individual x-values are shown
in column £.


£c 
c 
 c

 c   c  c

67c 67c8c900cc.:7c 2.:74.cc5.6c

61c 61c8c900cc::7c 2::74.cc1796c

96c 96c8c900cc1:7c 21:74.cc1;6c

9/c 9/c8c900cc8:07c 28:074.cc1:;6c

99c 99c8c900cc7:7c 27:74.cc776c

6:c 6:c8c900cc5:7c 25:74.cc.976c

97c 97c8c900cc8007c 280074.cc56.6c

67c 67c8c900cc.:7c 2.:74.cc5.6c

95c 95c8c900cc8.07c 28.074.cc0.6c

90c 90c8c900cc8707c 287074.cc7/6c

cc900c 284cc7c 284<cc1:.17c

uc c
c c    c
c c
c2 4cc  c  
c c  cc c
cc c
c900ccc17c c900c c
c- cc
uc c
 c    c
c c
cc c c cc  c c
c
 c c
  c=c 
cc
 c
cc c8cc
c cc c
 c>
c cc
 cc
  c c
 
c cc
c c c cc
c cc c
 
c
c
 
c# cc
uc cc    c
c c
%cc c
 c c$c cc
c
c
c
 c
 c&  c&cc
uc -cc
%c c
 
cc
cc
uc '  cc  c

 c c
 c c2?@c c
4c cc  c
c
c

= [132.10/(10-1)]1/2 = 3.83

ù 
    

phe "n-1" term in the above expression represents the  


 aa   (df). Õoosely
interpreted, the term "degrees of freedom" indicates how much freedom or independence there is
within a group of numbers. For example, if you were to sum four numbers to get a total, you
have the freedom to select any numbers you like. However, if the sum of the four numbers is
stipulated to be 92, the choice of the first 3 numbers is fairly free (as long as they are low
numbers), but the last choice is restricted by the condition that the sum must equal 92. For
example, if the first three numbers chosen at random are 28, 18, and 36, these numbers add up to
82, which is 10 short of the goal. For the last number there is no freedom of choice. phe number
10 must be selected to make the sum come out to 92. pherefore, the degrees of freedom have
been limited by 1 and only n-1 degrees of freedom remain. In the Sù formula, the degrees of
freedom are n minus 1 because the mean of the data has already been calculated (which imposes
one condition or restriction on the data set).

]  

£nother statistical term that is related to the distribution is the variance, which is the standard
deviation squared (variance = Sù ). phe Sù may be either positive or negative in value because
it is calculated as a square root, which can be either positive or negative. By squaring the Sù, the
problem of signs is eliminated. One common application of the variance is its use in the F-test to
compare the variance of two methods and determine whether there is a statistically significant
difference in the imprecision between the methods.

In many applications, however, the Sù is often preferred because it is expressed in the same
concentration units as the data. Using the Sù, it is possible to predict the range of control values
that should be observed if the method remains stable. £s discussed in an earlier lesson,
laboratorians often use the Sù to impose "gates" on the expected normal distribution of control
values.
ÿ 


  

praditionally,
after the discussion of the mean, standard deviation, degrees of freedom, and variance, the next
step was to describe the normal distribution (a frequency polygon) in terms of the standard
deviation "gates." phe figure here is a representation of the frequency distribution of a large set
of laboratory values obtained by measuring a single control material. phis distribution shows the
shape of a normal curve. Note that a "gate" consisting of ±1Sù accounts for 68% of the
distribution or 68% of the area under the curve, ±2Sù accounts for 95% and ±3Sù accounts for
>99%. £t ±2Sù, 95% of the distribution is inside the "gates," 2.5% of the distribution is in the
lower or left tail, and the same amount (2.5%) is present in the upper tail. Some authors call this
polygon an error curve to illustrate that small errors from the mean occur more frequently than
large ones. Other authors refer to this curve as a probability distribution.

     

£nother way to describe the variation of a test is calculate the coefficient of variation, or CV.
phe CV expresses the variation as a percentage of the mean, and is calculated as follows:

()c*c+ ù  ,!!

In the laboratory, the CV is preferred when the Sù increases in proportion to concentration. For
example, the data from a replication experiment may show an Sù of 4 units at a concentration of
100 units and an Sù of 8 units at a concentration of 200 units. phe CVs are 4.0% at both levels
and the CV is more useful than the Sù for describing method performance at concentrations in
between. However, not all tests will demonstrate imprecision that is constant in terms of CV. For
some tests, the Sù may be constant over the analytical range.

phe CV also provides a general "feeling" about the performance of a method. CVs of 5% or less
generally give us a feeling of good method performance, whereas CVs of 10% and higher sound
bad. However, you should look carefully at the mean value before judging a CV. £t very low
concentrations, the CV may be high and at high concentrations the CV may be low. For example,
a bilirubin test with an Sù of 0.1 mg/dÕ at a mean value of 0.5 mg/dÕ has a CV of 20%, whereas
an Sù of 1.0 mg/dÕ at a concentration of 20 mg/dÕ corresponds to a CV of 5.0%.

£     

phe lessons on Basic DC uractices cover these same terms (see DC - phe data calculations), but
use a different form of the equation for calculating cumulative or lot-to-date means and Sùs.
Guidelines in the literature recommend that cumulative means and Sùs be used in calculating
control limits [2-4], therefore it is important to be able to perform these calculations.

phe cumulative mean can be expressed as bar = ( xi)t /nt, which appears similar to the prior
mean term except for the "t" subscripts, which refer to data from different time periods. phe idea
is to add the xi and n terms from groups of data in order to calculate the mean of the combined
groups.
phe cumulative or lot-to-date standard deviation can be expressed as follows:

phis equation looks quite different from the prior equation in this lesson, but in reality, it is
equivalent. phe cumulative standard deviation formula is derived from an Sù formula called the
Raw Score Formula. Instead of first calculating the mean or bar, the Raw Score º 
calculates bar inside the square root sign.

Oftentimes in reading about statistics, an unfamiliar formula may be presented. You should
realize that the mathematics in statistics is often redundant. Each procedure builds upon the
previous procedure. Formulae that seem to be different are derived from mathematical
manipulations of standard expressions with which you are often already acquainted.

-c c 
cc i
c.  c
%&i
c
In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret
the results of our analysis.

Note: Regression computations are usually handled by a software package or a graphing


calculator. For this example, however, we will do the computations "manually", since the gory
details have educational value.

 
ic  i c
Õast year, five randomly selected students took a math aptitude test before they began their
statistics course. phe Statistics ùepartment has three questions.

uc What linear regression equation best predicts statistics performance, based on math
aptitude scores
uc If a student made an 80 on the aptitude test, what grade would we expect her to make in
statistics
uc How well does the regression equation fit the data

6/cc0c c.  c%$c
In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column
shows statistics grades. phe last two rows show sums and mean scores that we will use to
conduct the regression analysis.

Student xi yi (xi - x) (yi - y) (xi - x)2 (yi - y)2 (xi - x)(yi - y)


1 95 85 17 8 289 64 136
2 85 95 7 18 49 324 126
3 80 70 2 -7 4 49 -14
4 70 65 -8 -12 64 144 96
5 60 70 -18 -7 324 49 126
i 390 385 730 630 470
 78 77

phe regression equation is a linear equation of the form: ŷ = b0 + b1x . po conduct a regression
analysis, we need to solve for b0 and b1. Computations are shown below.

b1 = Ȉ [ (xi - x)(yi - y) ] / Ȉ [ (xi - x)2] b0 = y - b1 * x


b1 = 470/730 = 0.644 b0 = 77 - (0.644)(78) = 26.768

pherefore, the regression equation is: ŷ = 26.768 + 0.644x .

6/cc c c.  c%$c


Once you have the regression equation, using it is a snap. Choose a value for the independent
variable (½), perform the computation, and you have an estimated value (ŷ) for the dependent
variable.

In our example, the independent variable is the student's score on the aptitude test. phe
dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test,
the estimated statistics grade would be:
ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288

Warning: When you use a regression equation, do not use values for the independent variable
that are outside the range of values used to create the equation. phat is called & 
, and
it can produce unreasonable estimates.

In this example, the aptitude test scores used to create the regression equation ranged from 60 to
95. pherefore, only use values inside that range to estimate statistics grades. Using values outside
that range (less than 60 or greater than 95) is problematic.

6/cc0c c  ccù  ic


Whenever you use a regression equation, you should ask how well the equation fits the data. One
way to assess fit is to check the coefficient of determination, which can be computed from the
following formula.

R2 = { ( 1 / N ) * Ȉ [ (xi - x) * (yi - y) ] / (ıx * ıy ) }2

where N is the number of observations used to fit the model, Ȉ is the summation symbol, xi is the
x value for observation i, x is the mean x value, yi is the y value for observation i, y is the mean y
value, ıx is the standard deviation of x, and ıy is the standard deviation of y. Computations for
the sample problem of this lesson are shown below.

ıx = sqrt [ Ȉ ( xi - x )2 / N ] ıy = sqrt [ Ȉ ( yi - y )2 / N ]
ıx = sqrt( 730/5 ) = sqrt(146) = 12.083 ıy = sqrt( 630/5 ) = sqrt(126) = 11.225

R2 = { ( 1 / N ) * Ȉ [ (xi - x) * (yi - y) ] / (ıx * ıy ) }2


R2 = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]2 = ( 94 / 135.632 )2 = ( 0.693 )2 = 0.48

£ coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics
grades (the dependent variable) can be explained by the relationship to math aptitude scores (the
independent variable). phis would be considered a good fit to the data, in the sense that it would
substantially improve an educator's ability to predict student performance in statistics class.

.  cù 
£ regression is a statistical analysis assessing the association between two variables. It is used
to find the relationship between two variables.

.  c0 i

Regression Equation(y) = a + bx
Slope(b) = (NȈY - (Ȉ)(ȈY)) / (NȈ2 - (Ȉ)2)
Intercept(a) = (ȈY - b(Ȉ)) / N

where
x and y are the variables.
b = phe slope of the regression line
a = phe intercept point of the regression line and the y axis.
N = Number of values or elements
 = First Score
Y = Second Score
ȈY = Sum of the product of first and Second Scores
Ȉ = Sum of First Scores
ȈY = Sum of Second Scores
Ȉ2 = Sum of square First Scores

.  c%&i
: po find the Simple/Õinear Regression of

 Values Y Values
60 3.1
61 3.6
62 3.8
63 4
65 4.1

po find regression equation, we will first find slope, intercept and use it to form regression
equation..

Step 1: Count the number of values.


N=5

Step 2: Find Y, 2


See the below table

 Value Y Value *Y *


60 3.1 60 * 3.1 = 186 60 * 60 = 3600
61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721
62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844
63 4 63 * 4 = 252 63 * 63 = 3969
65 4.1 65 * 4.1 = 266.5 65 * 65 = 4225

Step 3: Find Ȉ, ȈY, ȈY, Ȉ2.


Ȉ = 311
ȈY = 18.6
ȈY = 1159.7
Ȉ2 = 19359

Step 4: Substitute in the above slope formula given.


Slope(b) = (NȈY - (Ȉ)(ȈY)) / (NȈ2 - (Ȉ)2)
= ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)2)
= (5798.5 - 5784.6)/(96795 - 96721)
= 13.9/74
= 0.19

Step 5: Now, again substitute in the above intercept formula given.


Intercept(a) = (ȈY - b(Ȉ)) / N
= (18.6 - 0.19(311))/5
= (18.6 - 59.09)/5
= -40.49/5
= -8.098

Step 6: phen substitute these values in regression equation formula


Regression Equation(y) = a + bx
= -8.098 + 0.19x.

Suppose if we want to know the approximate y value for the variable x = 64. phen we can
substitute the value in the above equation.

Regression Equation(y) = a + bx
= -8.098 + 0.19(64).
= -8.098 + 12.16
= 4.06

phis example will guide you to find the relationship between two variables by calculating the
Regression from the above steps.


c
phe correlation is one of the most common and most useful statistics. £ correlation is a single
number that describes the degree of relationship between two variables. Õet's work through an
example to show you how this statistic is computed.


c%&i
c
Õet's assume that we want to look at the relationship between two variables, height (in inches)
and self esteem. uerhaps we have a hypothesis that how tall you are effects your self esteem
(incidentally, I don't think we have to worry about the direction of causality here -- it's not likely
that self esteem causes your height!). Õet's say we collect some information on twenty
individuals (all male -- we know that the average height differs for males and females so, to keep
this example simple we'll just use males). Height is measured in inches. Self esteem is measured
based on the average of 10 1-to-5 rating items (where higher scores mean higher self esteem).
Here's the data for the 20 cases (don't take this too seriously -- I made this data up to illustrate
what a correlation is):

u 
c 6c c

1c ;9c /1c

.c 01c /;c

:c ;.c :9c

/c 05c //c

5c 59c :.c

;c ;7c :1c

0c ;0c :9c

9c ;9c /1c

6c 01c /:c

17c ;6c :0c

11c ;9c :5c

1.c ;0c :.c

1:c ;:c :0c

1/c ;.c ::c

15c ;7c :/c

1;c ;:c /7c

10c ;5c /1c

19c ;0c :9c

16c ;:c :/c


.7c ;1c :;c

Now, let's take a quick look at the histogram for each variable:

£nd, here are the descriptive statistics:

]  c 
c c ] 
c c 
c  c
c

Ac ;5/c //750/c 16/175c 1:79c 59c 05c 10c

? c :055c 7/.;767c 719155:c 051c :1c /;c 15c


*
c

Finally, we'll look at the simple bivariate (i.e., two-variable) plot:

You should immediately see in the bivariate plot that the relationship between the variables is a
positive one (if you can't see that, review the section on types of relationships) because if you
were to fit a single straight line through the dots it would have a positive slope or move up from
left to right. Since the correlation is nothing more than a quantitative estimate of the relationship,
we would expect a positive correlation.

What does a "positive relationship" mean in this context It means that, in general, higher scores
on one variable tend to be paired with higher scores on the other and that lower scores on one
variable tend to be paired with lower scores on the other. You should confirm visually that this is
generally true in the plot above.



c c
c
Now we're ready to compute the correlation value. phe formula for the correlation is:
We use the symbol to stand for the correlation. phrough the magic of mathematics it turns out
that r will always be between -1.0 and +1.0. if the correlation is negative, we have a negative
relationship; if it's positive, the relationship is positive. You don't need to know how we came up
with this formula unless you want to be a statistician. But you probably will need to know how
the formula relates to real data -- how you can use the formula to compute the correlation. Õet's
look at the data we need for the formula. Here's the original data with the other necessary
columns:

u 
c 6c !c "!c "c !"!c

1c ;9c /1c .099c /;./c 1;91c

.c 01c /;c :.;;c 57/1c .11;c

:c ;.c :9c .:5;c :9//c 1///c

/c 05c //c ::7c 5;.5c 16:;c

5c 59c :.c 195;c ::;/c 17./c

;c ;7c :1c 19;c :;77c 6;1c

0c ;0c :9c .5/;c //96c 1///c

9c ;9c /1c .099c /;./c 1;91c

6c 01c /:c :75:c 57/1c 19/6c

17c ;6c :0c .55:c /0;1c 1:;6c


11c ;9c :5c .:9c /;./c 1..5c

1.c ;0c :.c .1//c //96c 17./c

1:c ;:c :0c .::1c :6;6c 1:;6c

1/c ;.c ::c .7/;c :9//c 1796c

15c ;7c :/c .7/c :;77c 115;c

1;c ;:c /c .5.c :6;6c 1;c

10c ;5c /1c .;;5c /..5c 1;91c

19c ;0c :9c .5/;c //96c 1///c

16c ;:c :/c .1/.c :6;6c 115;c

.7c ;1c :;c .16;c :0.1c 1.6;c

#c $%&'c ()*$c +,%(*-c '),$c ')*+)c

phe first three columns are the same as in the table above. phe next three columns are simple
computations based on the height and self esteem data. phe bottom row consists of the sum of
each column. phis is all the information we need to compute the correlation. Here are the values
from the bottom row of the table (where N is 20 people) as they are related to the symbols in the
formula:

Now, when we plug these values into the formula given above, we get the following (I show it
here tediously, one step at a time):
So, the correlation for our twenty cases is .73, which is a fairly strong positive relationship. I
guess there is a relationship between height and self esteem, at least in this made up data!

 c c  ccc


c
Once you've computed a correlation, you can determine the probability that the observed
correlation occurred by chance. phat is, you can conduct a significance test. àost often you are
interested in determining the probability that the correlation is a real one and not a chance
occurrence. In this case, you are testing the mutually exclusive hypotheses:

 cA 

cc cc7c

>  cA 



cc cBCc7c

phe easiest way to test this hypothesis is to find a statistics book that has a table of critical values
of r. àost introductory statistics texts would have a table like this. £s in all hypothesis testing,
you need to first determine the significance level. Here, I'll use the common significance level of
alpha = .05. phis means that I am conducting a test where the odds that the correlation is a
chance occurrence is no more than 5 out of 100. Before I look up the critical value in a table I
also have to compute the degrees of freedom or df. phe df is simply equal to N-2 or, in this
example, is 20-2 = 18. Finally, I have to decide whether I am doing a one-tailed or two-tailed
test. In this example, since I have no strong prior theory to suggest whether the relationship
between height and self esteem would be positive or negative, I'll opt for the two-tailed test.
With these three pieces of information -- the significance level (alpha = .05)), degrees of freedom
(df = 18), and type of test (two-tailed) -- I can now test the significance of the correlation I
found. When I look up this value in the handy little table at the back of my statistics book I find
that the critical value is .4438. phis means that if my correlation is greater than .4438 or less than
-.4438 (remember, this is a two-tailed test) I can conclude that the odds are less than 5 out of 100
that this is a chance occurrence. Since my correlation 0f .73 is actually quite a bit higher, I
conclude that it is not a chance finding and that the correlation is "statistically significant" (given
the parameters of the test). I can reject the null hypothesis and accept the alternative.

 c
c  &c
£ll I've shown you so far is how to compute a correlation between two variables. In most studies
we have considerably more than two variables. Õet's say we have a study with 10 interval-level
variables and we want to estimate the relationships among all of them (i.e., between all possible
pairs of variables). In this instance, we have 45 unique correlations to estimate (more later on
how I knew that!). We could do the above computations 45 times to obtain the correlations. Or
we could use just about any statistics program to automatically compute all 45 with a simple
click of the mouse.

I used a simple statistics program to generate random data for 10 variables with 20 cases (i.e.,
persons) for each variable. phen, I told the program to compute the correlations among these
variables. Here's the result:

cccccccccc ccccccc ccccccc ccccccc ccccccc ccccccc ccccccc ccccccc ccccccc


cccccc 
c
ccccc

c
ccccc
cccc

c
cccc
ccc
 cccc

c
ccccc

ccc
cccc

cccc

c
cccc
 ccc
cccc
ccc

cccc

c
cccc

cccc

ccc
ccc
ccc

cccc

c
ccccc
ccc
cccc
cccc

cccc
 cccc

cccc

c
ccccc
 cccc
ccc

ccc
ccc
cccc
ccc

cccc

c
ccccc
cccc
cccc

cccc

ccc

cccc

cccc
ccc

cccc


c

cccc
 cccc
cccc
ccc
ccc

ccc
 cccc
cccc

cccc

cccc

phis type of table is called a Y    ½. It lists the variable names (C1-C10) down the
first column and across the first row. phe diagonal of a correlation matrix (i.e., the numbers that
go from the upper left corner to the lower right) always consists of ones. phat's because these are
the correlations between each variable and itself (and a variable is always perfectly correlated
with itself). phis statistical program only shows the lower triangle of the correlation matrix. In
every correlation matrix there are two triangles that are the values below and to the left of the
diagonal (lower triangle) and above and to the right of the diagonal (upper triangle). phere is no
reason to print both triangles because the two triangles of a correlation matrix are always mirror
images of each other (the correlation of variable x with variable y is always equal to the
correlation of variable y with variable x). When a matrix has this mirror-image quality above and
below the diagonal we refer to it as a |Y ½. £ correlation matrix is always a
symmetric matrix.

po locate the correlation for any pair of variables, find the value in the table for the row and
column intersection for those two variables. For instance, to find the correlation between
variables C5 and C2, I look for where row C2 and column C5 is (in this case it's blank because it
falls in the upper triangle area) and where row C5 and column C2 is and, in the second case, I
find that the correlation is -.166.

OK, so how did I know that there are 45 unique correlations when we have 10 variables phere's
a handy simple little formula that tells how many pairs (e.g., correlations) there are for any
number of variables:

where N is the number of variables. In the example, I had 10 variables, so I know I have (10 *
9)/2 = 90/2 = 45 pairs.

! c
c
phe specific type of correlation I've illustrated here is known as the uearson uroduct àoment
Correlation. It is appropriate when both variables are measured at an interval level. However
there are a wide variety of other types of correlations for other circumstances. for instance, if you
have two ordinal variables, you could use the Spearman rank Order Correlation (rho) or the
Kendall rank order Correlation (tau). When one measure is a continuous interval level one and
the other is dichotomous (i.e., two-category) you can use the uoint-Biserial Correlation. For
other situations, consulting the web-based statistics selection program, ‘ ‘ 

at
http://trochim.human.cornell.edu/selstat/ssstart.htm.

c
?*& !D8>c
 c%

cc9c$
cc
>
c c'!E*cF

c
c
1cccc 
c c 

 c cGc
c
.c&  cc c cc c c c
 
c cc 

c
c2 c$
4c/.c5;c/6c57c/6c5:c5.c/9c/0c5/c
c
:c&  c c$
c cc  cc
:5c/;c.6c50c;7c0:c.7c56c57c07c
c
/c&  c -c ccH I
c c
.771c.77.c
&  cJcF cJcF c
2 cK
4c2 cK
4cc
c
>c17c.7c1.c1;c
=c15c17c19c1.cc
&c1.c15c15c6c
@c.5c5c.7c11c
*c:7c:c.9c5cc
'c.7c17c15c9c
c
5c&  c% c c cc  cc
.5c15c:7c/5c/7c.7c57c
c
;c' cc  cc  cH cJ
I
c  c c
$ 

c
/7c:;c/.c5:c.7c;5c;0c.7c:;c
c
0c&  cH cJ
I
c  c c   c
cc.5c17c.7c:.c57c.9c.;c1/c1.c1:c
Lcc15c1.c16c.;c.:c..c1:c17c15c5c
c
9c'c c  c c
cc c
Lcc1695c9;c90c99c96c67c61c6.c
!  c/5c/6c5/c;7c97c6;c177c1.7c
c
?*& !D8=c
 c%

cc.7c$
cc
>
c c AK**cF

c
c
6c' c cc 
cc c c ccc169.c
c
Lcc16;5c07c05c97c95c67c
 c c 
cc/0c56c;:c09c6.c11.c
c
17c>c  c
c

c0c
c cc   c c   c24c;c
c24c0c
c c24c5c c c

c
c
11c? cc  c
 c   c
' c cc
>c=c&c>  cc
!c57c:7c..7c1c
!!c67c/5c107c:c
!!!c.57c.77c57c/c
K% c/c.c.c
c
1.c&  c c c c c c ccc c c
/.c50c115c1.0c:6c55c
c
1:c?cc 
c 
c c
  cc

 c% Gc