© Attribution Non-Commercial (BY-NC)


Researchers are often interested in defining a value that best describes some attribute of the

population. Often this attribute is a measure of central tendency or a proportion.


Several different measures of central tendency are defined below.

- The mode is the most frequently appearing value in the population or sample. Suppose we draw a sample of five women and measure their weights. They weigh 100 pounds, 100 pounds, 130 pounds, 140 pounds, and 150 pounds. Since more women weigh 100 pounds than any other weight, the mode would equal 100 pounds.

- To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. Thus, in the sample of five women, the median value would be 130 pounds, since 130 pounds is the middle weight.

- The mean is found by summing all of the observations and dividing by the number of observations. Returning to the example of the five women, the mean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

When the focus is on the degree to which a population possesses a particular attribute, the

measure of interest is a percentage or a proportion.

- A proportion refers to the fraction of the total that possesses a certain attribute. For example, we might ask what proportion of women in our sample weigh less than 135 pounds. Since 3 women weigh less than 135 pounds, the proportion would be 3/5 or 0.60.

- A percentage is the proportion times 100. In our example of the five women, the percent of the total who weigh less than 135 pounds would be 100 * (3/5) or 60 percent.
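The four measures above can be reproduced with a short Python sketch, using the standard-library statistics module and the five weights from the example:

```python
import statistics

# The five sampled weights from the example above.
weights = [100, 100, 130, 140, 150]

mode = statistics.mode(weights)      # most frequent value: 100
median = statistics.median(weights)  # middle of the ordered values: 130
mean = statistics.mean(weights)      # 620 / 5 = 124

# Proportion (and percent) of women weighing less than 135 pounds.
proportion = sum(1 for w in weights if w < 135) / len(weights)  # 3/5 = 0.6
percent = 100 * proportion
```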


Of the various measures, the mean and the proportion are most important. The notation used to describe these measures appears below:

- μ: Refers to a population mean.
- Q: The proportion of elements in the population that does not have a specified attribute. Note that Q = 1 - P.
- q: The proportion of elements in the sample that does not have a specified attribute. Note that q = 1 - p.


The mode is the most frequently occurring score in a distribution and is used as a measure of central tendency. The advantage of the mode as a measure of central tendency is that its meaning is obvious. Further, it is the only measure of central tendency that can be used with nominal data.

The mode is greatly subject to sample fluctuations and is therefore not recommended as the only measure of central tendency. A further disadvantage of the mode is that many distributions have more than one mode. These distributions are called "multimodal."

Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

Descriptive statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way.

Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is simply the number of hits divided by the number of times at bat (reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in four. The single number describes a large number of discrete events. Or, consider the scourge of many students, the Grade Point Average (GPA). This single number describes the general performance of a student across a potentially wide range of course experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing important detail. The batting average doesn't tell you whether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or on a streak. The GPA doesn't tell you whether the student was in difficult courses or easy ones, or whether they were courses in their major field or in other disciplines. Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

- the distribution
- the central tendency
- the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in

our study.

The Distribution. The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value. For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value. But what do we do for a variable like income or GPA? With these variables there can be a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.

One of the most common ways to describe a single variable is with a frequency distribution.

Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first (e.g., with age, price, or temperature variables, it would usually not be sensible to determine the frequencies for each value. Rather, the values are grouped into ranges and the frequencies determined.). Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1 shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is often referred to as a histogram or bar chart.

Figure 2. Frequency distribution bar chart.

Distributions may also be displayed using percentages. For example, you could use percentages to describe the:

- percentage of people in different age ranges
- percentage of people in different ranges of standardized test scores
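As a sketch, a grouped frequency distribution like the one described can be built with Python's collections.Counter. The ages and the five category ranges below are made up for illustration; Table 1's actual values are not reproduced in this text.

```python
from collections import Counter

# Hypothetical ages and category ranges (both bounds inclusive), for illustration only.
ages = [19, 22, 24, 31, 33, 38, 42, 47, 55, 61, 64, 70]
bins = [(15, 24), (25, 34), (35, 44), (45, 54), (55, 74)]

freq = Counter()
for age in ages:
    for low, high in bins:
        if low <= age <= high:
            freq[(low, high)] += 1
            break

# Display the distribution as counts and percentages.
for (low, high), count in sorted(freq.items()):
    print(f"{low}-{high}: {count} ({100 * count / len(ages):.1f}%)")
```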

Central Tendency. The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:

- Mean
- Median
- Mode

The mean or average is probably the most commonly used method of describing central tendency. To compute the mean all you do is add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam. For example, consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15

The median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample. For example, if there are 500 scores in the list, score #250 would be the median. If we order the 8 scores shown above, we would get:

15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median.

The mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the mean, median and mode are all equal to each other.
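The three estimates can be checked directly with Python's statistics module, a short sketch using the eight scores from this example:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # 167 / 8 = 20.875
median = statistics.median(scores)  # average of the two middle scores (20 and 20)
mode = statistics.mode(scores)      # 15 occurs three times
```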

Dispersion. Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.

The standard deviation is a more accurate and detailed estimate of dispersion because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation that set of scores has to the mean of the sample. Again let's take the set of scores:

15, 20, 21, 20, 36, 15, 25, 15

To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 20.875. So, the differences from the mean are:

15 - 20.875 = -5.875

20 - 20.875 = -0.875

21 - 20.875 = +0.125

20 - 20.875 = -0.875

36 - 20.875 = 15.125

15 - 20.875 = -5.875

25 - 20.875 = +4.125

15 - 20.875 = -5.875

Notice that values that are below the mean have negative discrepancies and values above it have positive ones. Next, we square each discrepancy:

-5.875 * -5.875 = 34.515625

-0.875 * -0.875 = 0.765625

+0.125 * +0.125 = 0.015625

-0.875 * -0.875 = 0.765625

15.125 * 15.125 = 228.765625

-5.875 * -5.875 = 34.515625

+4.125 * +4.125 = 17.015625

-5.875 * -5.875 = 34.515625

Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This value is known as the variance. To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier). This would be SQRT(50.125) = 7.079901129253.
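The same walkthrough, sum of squares, then variance, then standard deviation, can be sketched in a few lines of Python:

```python
scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = sum(scores) / len(scores)            # 20.875
# Square each deviation from the mean and sum them: the Sum of Squares.
ss = sum((x - mean) ** 2 for x in scores)   # 350.875
variance = ss / (len(scores) - 1)           # 350.875 / 7 = 50.125
sd = variance ** 0.5                        # about 7.0799
```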

Although this computation may seem convoluted, it's actually quite simple. To see this, consider the formula for the standard deviation:

SD = sqrt [ Σ ( X - Xbar )2 / (n - 1) ]

In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1. The ratio is the variance and the square root is the standard deviation.

In English, we can describe the standard deviation as: the square root of the sum of the squared deviations from the mean, divided by the number of scores minus one.

Although we can calculate these univariate statistics by hand, it gets quite tedious when you have more than a few values and variables. Every statistics program is capable of calculating them easily for you. For instance, I put the eight scores into SPSS and got the following table as a result:

N 8
Mean 20.8750
Median 20.0000
Mode 15.00
Std. Deviation 7.0799
Variance 50.1250
Range 21.00

The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:

- approximately 68% of the scores in the sample fall within one standard deviation of the mean
- approximately 95% of the scores in the sample fall within two standard deviations of the mean
- approximately 99% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875 - (2*7.0799) to 20.875 + (2*7.0799), or between 6.7152 and 35.0348. This kind of information is a critical stepping stone to enabling us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales.

Standard Deviation Example

Let's say we wanted to calculate the standard deviation for the amounts of gold coins pirates on a pirate ship have.

There are 100 pirates on the ship. In statistical terms this means we have a population of 100. If we know the amount of gold coins each of the 100 pirates have, we use the standard deviation equation for an entire population.

What if we don't know the amount of gold coins each of the 100 pirates have? For example, we only had enough time to ask 5 pirates how many gold coins they have. In statistical terms this means we have a sample size of 5, and in this case we use the standard deviation equation for a sample of a population.


The rest of this example will be done in the case where we have a sample size of 5 pirates, therefore we will be using the standard deviation equation for a sample of a population.

Here are the amounts of gold coins the 5 pirates have:

4, 2, 5, 8, 6.

1. Calculate the mean: (4 + 2 + 5 + 8 + 6)/5 = 5.

2. Calculate ( xi - x )2 for each value in the sample: 1, 9, 0, 9, 1.

3. Calculate s: the sum of the squared differences is 20, so s = [20/(5 - 1)]^(1/2) = sqrt(5), or about 2.24.
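A minimal Python check of the pirate sample, using statistics.stdev, which applies the n - 1 (sample) form of the equation:

```python
import statistics

coins = [4, 2, 5, 8, 6]        # gold coins held by the 5 sampled pirates

mean = statistics.mean(coins)   # 25 / 5 = 5
sd = statistics.stdev(coins)    # sample SD: sqrt(20 / 4) = sqrt(5)
```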

Mean

The simplest statistic is the mean or average. Years ago, when laboratories were beginning to assay controls, it was easy to calculate a mean and use that value as the "target" to be achieved. For example, given the following ten analyses of a control material - 90, 91, 89, 84, 88, 93, 80, 90, 85, 87 - the mean or Xbar is 877/10 or 87.7. [The term Xbar refers to a symbol having a line or bar over the X; however, we will use the term Xbar instead of the symbol in the text of these lessons because it is easier to present.]

The mean value characterizes the "central tendency" or "location" of the data. Although the mean is the value most likely to be observed, many of the actual values are different than the mean. When assaying control materials, it is obvious that technologists will not achieve the mean value each and every time a control is analyzed. The values observed will show a dispersion or distribution about the mean, and this distribution needs to be characterized to set a range of acceptable control values.

The dispersion of values about the mean is predictable and can be characterized mathematically through a series of manipulations, as illustrated below, where the individual x-values are shown in column A.

Column A: the individual x-values: 90, 91, 89, 84, 88, 93, 80, 90, 85, 87.

- Calculate the mean: sum the x-values (877) and divide by the number of measurements (10) to give the mean, 87.7.
- Calculate the difference between each x-value and the mean (column B). Values below the mean give negative differences; values above it give positive differences.
- Square each difference (column C), then sum the squared differences to give the sum of squares, 132.10.
- Divide the sum of squares by n - 1 and take the square root to give the standard deviation:

SD = [132.10/(10 - 1)]^(1/2) = 3.83
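The steps above can be verified with a short Python sketch of the same calculation:

```python
values = [90, 91, 89, 84, 88, 93, 80, 90, 85, 87]

mean = sum(values) / len(values)            # 877 / 10 = 87.7
ss = sum((x - mean) ** 2 for x in values)   # sum of squares: 132.10
sd = (ss / (len(values) - 1)) ** 0.5        # [132.10 / 9]^(1/2), about 3.83
```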

Degrees of Freedom

The "n - 1" term in the SD expression represents the degrees of freedom (df). Loosely interpreted, the term "degrees of freedom" indicates how much freedom or independence there is within a group of numbers. For example, if you were to sum four numbers to get a total, you have the freedom to select any numbers you like. However, if the sum of the four numbers is stipulated to be 92, the choice of the first 3 numbers is fairly free (as long as they are low numbers), but the last choice is restricted by the condition that the sum must equal 92. For example, if the first three numbers chosen at random are 28, 18, and 36, these numbers add up to 82, which is 10 short of the goal. For the last number there is no freedom of choice. The number 10 must be selected to make the sum come out to 92. Therefore, the degrees of freedom have been limited by 1 and only n-1 degrees of freedom remain. In the SD formula, the degrees of freedom are n minus 1 because the mean of the data has already been calculated (which imposes one condition or restriction on the data set).

Variance

Another statistical term that is related to the distribution is the variance, which is the standard deviation squared (variance = SD²). The SD may be either positive or negative in value because it is calculated as a square root, which can be either positive or negative. By squaring the SD, the problem of signs is eliminated. One common application of the variance is its use in the F-test to compare the variance of two methods and determine whether there is a statistically significant difference in the imprecision between the methods.

In many applications, however, the SD is often preferred because it is expressed in the same concentration units as the data. Using the SD, it is possible to predict the range of control values that should be observed if the method remains stable. As discussed in an earlier lesson, laboratorians often use the SD to impose "gates" on the expected normal distribution of control values.

Normal distribution

Traditionally, after the discussion of the mean, standard deviation, degrees of freedom, and variance, the next step was to describe the normal distribution (a frequency polygon) in terms of the standard deviation "gates." The figure here is a representation of the frequency distribution of a large set of laboratory values obtained by measuring a single control material. This distribution shows the shape of a normal curve. Note that a "gate" consisting of ±1SD accounts for 68% of the distribution or 68% of the area under the curve, ±2SD accounts for 95% and ±3SD accounts for >99%. At ±2SD, 95% of the distribution is inside the "gates," 2.5% of the distribution is in the lower or left tail, and the same amount (2.5%) is present in the upper tail. Some authors call this polygon an error curve to illustrate that small errors from the mean occur more frequently than large ones. Other authors refer to this curve as a probability distribution.

Another way to describe the variation of a test is to calculate the coefficient of variation, or CV. The CV expresses the variation as a percentage of the mean, and is calculated as follows:

CV% = (SD/Xbar)100

In the laboratory, the CV is preferred when the SD increases in proportion to concentration. For example, the data from a replication experiment may show an SD of 4 units at a concentration of 100 units and an SD of 8 units at a concentration of 200 units. The CVs are 4.0% at both levels and the CV is more useful than the SD for describing method performance at concentrations in between. However, not all tests will demonstrate imprecision that is constant in terms of CV. For some tests, the SD may be constant over the analytical range.

The CV also provides a general "feeling" about the performance of a method. CVs of 5% or less generally give us a feeling of good method performance, whereas CVs of 10% and higher sound bad. However, you should look carefully at the mean value before judging a CV. At very low concentrations, the CV may be high and at high concentrations the CV may be low. For example, a bilirubin test with an SD of 0.1 mg/dL at a mean value of 0.5 mg/dL has a CV of 20%, whereas an SD of 1.0 mg/dL at a concentration of 20 mg/dL corresponds to a CV of 5.0%.
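A sketch of the CV calculation, checked against the bilirubin figures above:

```python
def cv_percent(sd, mean):
    """Coefficient of variation: the SD expressed as a percentage of the mean."""
    return 100 * sd / mean

low_conc = cv_percent(0.1, 0.5)    # SD 0.1 mg/dL at mean 0.5 mg/dL -> 20%
high_conc = cv_percent(1.0, 20.0)  # SD 1.0 mg/dL at mean 20 mg/dL -> 5%
```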

Alternate formulae

The lessons on Basic QC Practices cover these same terms (see QC - The data calculations), but use a different form of the equation for calculating cumulative or lot-to-date means and SDs. Guidelines in the literature recommend that cumulative means and SDs be used in calculating control limits [2-4], therefore it is important to be able to perform these calculations.

The cumulative mean can be expressed as Xbar = (Σxi)t /nt, which appears similar to the prior mean term except for the "t" subscripts, which refer to data from different time periods. The idea is to add the Σxi and n terms from groups of data in order to calculate the mean of the combined groups.
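As a sketch (with made-up control values split into two hypothetical time periods), the cumulative mean just pools the Σxi and n terms before dividing:

```python
# Two hypothetical groups of control results from different time periods.
period1 = [90, 91, 89, 84]
period2 = [88, 93, 80, 90, 85, 87]

sum_t = sum(period1) + sum(period2)   # pooled (sum of xi)t
n_t = len(period1) + len(period2)     # pooled nt
cumulative_mean = sum_t / n_t         # 877 / 10 = 87.7
```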

The cumulative or lot-to-date standard deviation can be expressed as follows:

This equation looks quite different from the prior equation in this lesson, but in reality, it is equivalent. The cumulative standard deviation formula is derived from an SD formula called the Raw Score Formula. Instead of first calculating the mean or Xbar, the Raw Score Formula calculates Xbar inside the square root sign.

Oftentimes in reading about statistics, an unfamiliar formula may be presented. You should realize that the mathematics in statistics is often redundant. Each procedure builds upon the previous procedure. Formulae that seem to be different are derived from mathematical manipulations of standard expressions with which you are often already acquainted.

Regression Example

In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret the results of our analysis.

Regression computations are usually handled by a software package or a graphing calculator. For this example, however, we will do the computations "manually", since the gory details have educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.

- What linear regression equation best predicts statistics performance, based on math aptitude scores?
- If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
- How well does the regression equation fit the data?

How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics grades. The last two rows show sums and mean scores that we will use to conduct the regression analysis.

Student  xi   yi   (xi - x)  (yi - y)  (xi - x)2  (yi - y)2  (xi - x)(yi - y)
1        95   85      17        8        289         64           136
2        85   95       7       18         49        324           126
3        80   70       2       -7          4         49           -14
4        70   65      -8      -12         64        144            96
5        60   70     -18       -7        324         49           126
Sum     390  385                         730        630           470
Mean     78   77

The regression equation is a linear equation of the form: ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1. Computations are shown below.

b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)2 ] = 470/730 = 0.644

b0 = y - b1 * x = 77 - (0.644)(78) = 26.768

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.

In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
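The whole computation can be sketched in Python using the deviation sums from the table. Note that carrying b1 at full precision (470/730 = 0.6438...) gives an intercept slightly different from the text's rounded 26.768, though the prediction at x = 80 still lands near 78.29:

```python
x = [95, 85, 80, 70, 60]   # aptitude scores
y = [85, 95, 70, 65, 70]   # statistics grades

n = len(x)
mean_x = sum(x) / n        # 78
mean_y = sum(y) / n        # 77

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 470
sxx = sum((xi - mean_x) ** 2 for xi in x)                          # 730

b1 = sxy / sxx              # 470 / 730, about 0.644
b0 = mean_y - b1 * mean_x   # about 26.78 at full precision

predicted = b0 + b1 * 80    # about 78.29
```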

Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.

In this example, the aptitude test scores used to create the regression equation ranged from 60 to

95. Therefore, only use values inside that range to estimate statistics grades. Using values outside

that range (less than 60 or greater than 95) is problematic.

Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.

R2 = { ( 1 / N ) * Σ [ ( xi - x ) * ( yi - y ) ] / ( σx * σy ) }2

where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x is the mean x value, yi is the y value for observation i, y is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y. Computations for the sample problem of this lesson are shown below.

σx = sqrt [ Σ ( xi - x )2 / N ] = sqrt( 730/5 ) = sqrt(146) = 12.083

σy = sqrt [ Σ ( yi - y )2 / N ] = sqrt( 630/5 ) = sqrt(126) = 11.225

R2 = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]2 = ( 94 / 135.632 )2 = ( 0.693 )2 = 0.48
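These computations can be sketched in Python; the population-style standard deviations (dividing by N) match the formula used here:

```python
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sigma_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5   # sqrt(730/5), about 12.083
sigma_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5   # sqrt(630/5), about 11.225
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 470

r = (sxy / n) / (sigma_x * sigma_y)   # about 0.693
r_squared = r ** 2                    # about 0.48
```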

A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.

Regression Definition

A regression is a statistical analysis assessing the association between two variables. It is used to find the relationship between two variables.

Regression Formula

Regression Equation(y) = a + bx

Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX2 - (ΣX)2)

Intercept(a) = (ΣY - b(ΣX)) / N

where

x and y are the variables.
b = The slope of the regression line
a = The intercept point of the regression line and the y axis.
N = Number of values or elements
X = First Score
Y = Second Score
ΣXY = Sum of the product of First and Second Scores
ΣX = Sum of First Scores
ΣY = Sum of Second Scores
ΣX2 = Sum of squares of First Scores

Regression Example: To find the Simple/Linear Regression of

X Values   Y Values
60         3.1
61         3.6
62         3.8
63         4
65         4.1

To find the regression equation, we will first find the slope and intercept and use them to form the regression equation.

N = 5

See the below table

X    Y     X*Y                X*X
60   3.1   60 * 3.1 = 186     60 * 60 = 3600
61   3.6   61 * 3.6 = 219.6   61 * 61 = 3721
62   3.8   62 * 3.8 = 235.6   62 * 62 = 3844
63   4     63 * 4 = 252       63 * 63 = 3969
65   4.1   65 * 4.1 = 266.5   65 * 65 = 4225

ΣX = 311

ΣY = 18.6

ΣXY = 1159.7

ΣX2 = 19359

Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX2 - (ΣX)2)

= ((5)*(1159.7) - (311)*(18.6)) / ((5)*(19359) - (311)2)

= (5798.5 - 5784.6)/(96795 - 96721)

= 13.9/74

= 0.19

Intercept(a) = (ΣY - b(ΣX)) / N

= (18.6 - 0.19(311))/5

= (18.6 - 59.09)/5

= -40.49/5

= -8.098

Regression Equation(y) = a + bx

= -8.098 + 0.19x.

Suppose we want to know the approximate y value for the variable x = 64. Then we can substitute the value in the above equation.

Regression Equation(y) = a + bx

= -8.098 + 0.19(64)

= -8.098 + 12.16

= 4.06

This example will guide you to find the relationship between two variables by calculating the Regression from the above steps.
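The same raw-score computation in Python. Keeping the slope at full precision (13.9/74 = 0.1878...) rather than rounding to 0.19 shifts the intercept slightly from the text's -8.098, but the prediction at x = 64 still comes out near 4.06:

```python
x = [60, 61, 62, 63, 65]
y = [3.1, 3.6, 3.8, 4.0, 4.1]
n = len(x)

sum_x = sum(x)                              # 311
sum_y = sum(y)                              # 18.6
sum_xy = sum(a * b for a, b in zip(x, y))   # 1159.7
sum_x2 = sum(a * a for a in x)              # 19359

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # 13.9 / 74
a = (sum_y - b * sum_x) / n

y_at_64 = a + b * 64   # about 4.06
```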

Correlation

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. Let's work through an example to show you how this statistic is computed.

Correlation Example

Let's assume that we want to look at the relationship between two variables, height (in inches) and self esteem. Perhaps we have a hypothesis that how tall you are affects your self esteem (incidentally, I don't think we have to worry about the direction of causality here -- it's not likely that self esteem causes your height!). Let's say we collect some information on twenty individuals (all male -- we know that the average height differs for males and females so, to keep this example simple we'll just use males). Height is measured in inches. Self esteem is measured based on the average of 10 1-to-5 rating items (where higher scores mean higher self esteem). Here's the data for the 20 cases (don't take this too seriously -- I made this data up to illustrate what a correlation is):

Person  Height  Self Esteem
1       68      4.1
2       71      4.6
3       62      3.8
4       75      4.4
5       58      3.2
6       60      3.1
7       67      3.8
8       68      4.1
9       71      4.3
...
20      61      3.6

Now, let's take a quick look at the histogram for each variable:

[Figures: histograms of the height and self esteem variables, followed by a bivariate plot of the two variables.]

You should immediately see in the bivariate plot that the relationship between the variables is a

positive one (if you can't see that, review the section on types of relationships) because if you

were to fit a single straight line through the dots it would have a positive slope or move up from

left to right. Since the correlation is nothing more than a quantitative estimate of the relationship,

we would expect a positive correlation.

What does a "positive relationship" mean in this context? It means that, in general, higher scores

on one variable tend to be paired with higher scores on the other and that lower scores on one

variable tend to be paired with lower scores on the other. You should confirm visually that this is

generally true in the plot above.

Calculating the Correlation

Now we're ready to compute the correlation value. The formula for the correlation is:

r = ( NΣxy - Σx Σy ) / sqrt { [ NΣx2 - (Σx)2 ] * [ NΣy2 - (Σy)2 ] }

We use the symbol r to stand for the correlation. Through the magic of mathematics it turns out that r will always be between -1.0 and +1.0. If the correlation is negative, we have a negative relationship; if it's positive, the relationship is positive. You don't need to know how we came up with this formula unless you want to be a statistician. But you probably will need to know how the formula relates to real data -- how you can use the formula to compute the correlation. Let's look at the data we need for the formula. Here's the original data with the other necessary columns:

Person  Height (x)  Self Esteem (y)  x*y   x*x   y*y
...
11      68          3.5              238   4624  12.25
...

The first three columns are the same as in the table above. The next three columns are simple computations based on the height and self esteem data. The bottom row consists of the sum of each column. This is all the information we need to compute the correlation. Here are the values from the bottom row of the table (where N is 20 people) as they are related to the symbols in the formula:

Now, when we plug these values into the formula given above, we get the following (I show it

here tediously, one step at a time):

So, the correlation for our twenty cases is .73, which is a fairly strong positive relationship. I

guess there is a relationship between height and self esteem, at least in this made up data!

Testing the Significance of a Correlation

Once you've computed a correlation, you can determine the probability that the observed correlation occurred by chance. That is, you can conduct a significance test. Most often you are interested in determining the probability that the correlation is a real one and not a chance occurrence. In this case, you are testing the mutually exclusive hypotheses:

Null Hypothesis: r = 0

Alternative Hypothesis: r <> 0

The easiest way to test this hypothesis is to find a statistics book that has a table of critical values of r. Most introductory statistics texts would have a table like this. As in all hypothesis testing, you need to first determine the significance level. Here, I'll use the common significance level of alpha = .05. This means that I am conducting a test where the odds that the correlation is a chance occurrence is no more than 5 out of 100. Before I look up the critical value in a table I also have to compute the degrees of freedom or df. The df is simply equal to N - 2 or, in this example, is 20 - 2 = 18. Finally, I have to decide whether I am doing a one-tailed or two-tailed test. In this example, since I have no strong prior theory to suggest whether the relationship between height and self esteem would be positive or negative, I'll opt for the two-tailed test.

With these three pieces of information -- the significance level (alpha = .05)), degrees of freedom

(df = 18), and type of test (two-tailed) -- I can now test the significance of the correlation I

found. When I look up this value in the handy little table at the back of my statistics book I find

that the critical value is .4438. phis means that if my correlation is greater than .4438 or less than

-.4438 (remember, this is a two-tailed test) I can conclude that the odds are less than 5 out of 100

that this is a chance occurrence. Since my correlation 0f .73 is actually quite a bit higher, I

conclude that it is not a chance finding and that the correlation is "statistically significant" (given

the parameters of the test). I can reject the null hypothesis and accept the alternative.
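The table lookup can be reproduced numerically. A sketch, using the standard t transformation of r (the critical t of 2.101 for alpha = .05, df = 18 comes from an ordinary t table); note that converting that critical t back to the r scale recovers the .4438 figure:

```python
import math

r = 0.73    # the correlation computed above
n = 20      # number of cases
df = n - 2  # degrees of freedom = 18

# Convert r to a t statistic (a standard transformation for testing r)
t = r * math.sqrt(df / (1 - r ** 2))

# Two-tailed critical t for alpha = .05 and df = 18, from a standard t table
t_crit = 2.101

# Converting the critical t back to the r scale reproduces the critical
# value of r quoted in the text
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)

print(round(r_crit, 4))  # 0.4438
print(t > t_crit)        # True -- the correlation is significant at .05
```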

The Correlation Matrix

All I've shown you so far is how to compute a correlation between two variables. In most studies
we have considerably more than two variables. Let's say we have a study with 10 interval-level
variables and we want to estimate the relationships among all of them (i.e., between all possible
pairs of variables). In this instance, we have 45 unique correlations to estimate (more later on
how I knew that!). We could do the above computations 45 times to obtain the correlations. Or
we could use just about any statistics program to automatically compute all 45 with a simple
click of the mouse.

I used a simple statistics program to generate random data for 10 variables with 20 cases (i.e.,
persons) for each variable. Then, I told the program to compute the correlations among these
variables. Here's the result:

[10 x 10 correlation matrix for variables C1-C10, lower triangle only; the numeric values are not recoverable from this copy.]

This type of table is called a correlation matrix. It lists the variable names (C1-C10) down the
first column and across the first row. The diagonal of a correlation matrix (i.e., the numbers that
go from the upper left corner to the lower right) always consists of ones. That's because these are
the correlations between each variable and itself (and a variable is always perfectly correlated
with itself). This statistical program only shows the lower triangle of the correlation matrix. Every
correlation matrix has two triangles: the values below and to the left of the diagonal (the lower
triangle) and those above and to the right of the diagonal (the upper triangle). There is no
reason to print both triangles because the two triangles of a correlation matrix are always mirror
images of each other (the correlation of variable x with variable y is always equal to the
correlation of variable y with variable x). When a matrix has this mirror-image quality above and
below the diagonal we refer to it as a symmetric matrix. A correlation matrix is always a
symmetric matrix.
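The same properties are easy to see in code. A sketch that mirrors the setup in the text -- random data for 10 variables with 20 cases -- using NumPy's `corrcoef` in place of the unnamed statistics program:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 10))     # 20 cases on 10 variables, random as in the text

m = np.corrcoef(data, rowvar=False)  # variables are the columns

print(m.shape)                       # (10, 10)
print(np.allclose(np.diag(m), 1.0))  # True: each variable correlates 1 with itself
print(np.allclose(m, m.T))           # True: the lower triangle mirrors the upper
```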

To locate the correlation for any pair of variables, find the value in the table at the intersection of
the row for one variable and the column for the other. For instance, to find the correlation between
variables C5 and C2, I first check the cell at row C2, column C5 (in this case it's blank because it
falls in the upper triangle area), and then the cell at row C5, column C2, where I
find that the correlation is -.166.

OK, so how did I know that there are 45 unique correlations when we have 10 variables? There's
a handy, simple little formula that tells how many pairs (e.g., correlations) there are for any
number of variables:

Number of pairs = N(N-1)/2

where N is the number of variables. In the example, I had 10 variables, so I know I have (10 *
9)/2 = 90/2 = 45 pairs.
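The pair count is just the binomial coefficient "N choose 2", and a couple of lines confirm the arithmetic:

```python
from math import comb

n_vars = 10
pairs = n_vars * (n_vars - 1) // 2  # the N(N-1)/2 formula from the text
print(pairs)                        # 45
print(pairs == comb(n_vars, 2))     # True: same as "N choose 2"
```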

Other Correlations

The specific type of correlation I've illustrated here is known as the Pearson Product Moment
Correlation. It is appropriate when both variables are measured at an interval level. However,
there are a wide variety of other types of correlations for other circumstances. For instance, if you
have two ordinal variables, you could use the Spearman Rank Order Correlation (rho) or the
Kendall Rank Order Correlation (tau). When one measure is a continuous interval-level one and
the other is dichotomous (i.e., two-category) you can use the Point-Biserial Correlation. For
other situations, consult the web-based statistics selection program at
http://trochim.human.cornell.edu/selstat/ssstart.htm.
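To make the Spearman idea concrete: rho is simply the Pearson correlation computed on the ranks of the data rather than the raw values. A minimal, dependency-free sketch (the example values are made up, and the helper names are my own):

```python
def rank(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks
    return pearson(rank(x), rank(y))

# Made-up ordinal data
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]
print(round(spearman(a, b), 3))  # 0.829
```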

[Exercise Set A and Exercise Set B follow in the original document: practice problems with accompanying data sets. The problem text and data values are not recoverable from this copy.]
