
CORRELATION ANALYSIS

Chapter Objectives

Express quantitatively the degree and direction of the co-variation or association between two variables. Determine the validity and reliability of the covariation or association between two variables.

Introduction
Examining the statistical relationship between two or more variables comes down to the following questions, each of which calls for statistical methods to answer it:
Is there an association between two or more variables? If yes, what is the form and degree of that relationship?
Is the relationship strong or significant enough to be useful to arrive at a desirable conclusion?
Can the relationship be used for predictive purposes, that is, to predict the most likely value of a dependent variable corresponding to the given value of the independent variable or variables?

Example
Family income & expenditure on luxury items.
Yield of a crop & quantity of fertilizer used.
Sales revenue & expenses incurred on advertising.
Weight & height of individual.

Correlation analysis is a statistical technique used to analyze the strength and direction of the relationship between two quantitative variables. A few definitions of correlation analysis are:
"An analysis of the relationship of two or more variables is usually called correlation." (A. M. Tuttle)
"When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation." (Croxton and Cowden)

The coefficient of correlation is a number that indicates the strength (magnitude) and direction of the statistical relationship between two variables. The strength of the relationship is determined by the closeness of the points to a straight line when pairs of values of the two variables are plotted on a graph. A straight line is used as the frame of reference for evaluating the relationship. The direction is determined by whether one variable generally increases or decreases when the other variable increases.

Correlation and Causation


Chance coincidence: A correlation coefficient may not reach any statistical significance, that is, it may represent a nonsense (spurious) or chance association. For example, (i) a positive correlation between growth in population and wheat production in the country has no statistical significance, because each of the two events might have entirely different, unrelated causes. (ii) While estimating the correlation between sales revenue and expenditure on advertisements over a period of time, the investigator must be certain that the outcome is not due to biased sampling or sampling error. That is, the investigator needs to show that the correlation coefficient is statistically significant and not just due to random sampling error.



Influence of third variable: Even if the correlation coefficient does not establish any relationship, it can be used as a source for testing null and alternative hypotheses about a population. For example, it has been proved that smoking causes lung damage. However, there are often multiple reasons for health problems, so stress cannot be ruled out as a contributing cause. Similarly, there is a positive correlation between the yield of rice and tea because both crops are influenced by the amount of rainfall, but the yield of either one is not influenced by the other.



Mutual influence: There may be a high degree of relationship between two variables, but it is difficult to say which variable is influencing the other.

For example, variables like price, supply, and demand of a commodity are mutually correlated. According to the principles of economics, as the price of a commodity increases, its demand decreases, so price influences the demand level. But if demand for a commodity increases due to growth in population, then its price also increases. In this case, increased demand has an effect on the price. However, the amount of export of a commodity is influenced by an increase or decrease in customs duties, but the reverse is normally not true.

Types of Correlations
There are three broad types of correlations:
Positive and negative,
Linear and non-linear,
Simple, partial, and multiple.

4/14/2012


Positive and Negative Correlation


A positive (or direct) correlation refers to the same direction of change in the values of variables. In other words, if values of variables are varying (i.e., increasing or decreasing) in the same direction, then such correlation is referred to as positive correlation.

Example:
X : 5 8 10 15 17
Y : 10 12 16 18 20

X : 17 15 10 8 5
Y : 20 18 16 12 10

A negative (or inverse) correlation refers to a change in the values of variables in opposite directions.

Example:
X : 17 15 10 8 5
Y : 10 12 16 18 20
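The direction of the two examples can be checked numerically. The sketch below is only an illustration: it computes the correlation coefficient from its usual deviation-from-mean definition (the helper name `pearson_r` is not from the slides) for the first positive example and for the negative example, and looks at the sign of the result.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Positive example: X and Y both increase.
r_positive = pearson_r([5, 8, 10, 15, 17], [10, 12, 16, 18, 20])
# Negative example: X decreases while Y increases.
r_negative = pearson_r([17, 15, 10, 8, 5], [10, 12, 16, 18, 20])

print(f"positive example: r = {r_positive:.3f}")
print(f"negative example: r = {r_negative:.3f}")
```

A positive sign corresponds to values moving in the same direction, a negative sign to values moving in opposite directions.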

Linear Correlation
A linear correlation implies a constant change in one of the variable values with respect to a change in the corresponding values of another variable. In other words, a correlation is referred to as linear correlation when variations in the values of two variables have a constant ratio. Example:

X : 10 20 30 40 50
Y : 40 60 80 100 120
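As a quick check of the "constant ratio" idea, the following sketch divides each change in Y by the corresponding change in X for the example data. The variable names are illustrative only.

```python
xs = [10, 20, 30, 40, 50]
ys = [40, 60, 80, 100, 120]

# Ratio of the change in Y to the change in X between consecutive pairs.
ratios = [
    (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    for i in range(len(xs) - 1)
]

# The relationship is linear when every ratio is the same.
is_linear = all(ratio == ratios[0] for ratio in ratios)
print(ratios, is_linear)
```

Every step of 10 in X goes with a step of 20 in Y, so the ratio stays at 2 throughout.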

Non-Linear Correlation
A non-linear correlation implies an absolute change in one of the variable values with respect to a change in the values of another variable. In other words, a correlation is non-linear when the amount of change in the values of one variable does not bear a constant ratio to the amount of change in the corresponding values of another variable. Example:

x : 8 9 9 10 10 28 29 30
y : 80 130 170 150 230 560 460 600

Different Values of the Correlation Coefficient

Simple, Partial, and Multiple Correlation


The distinction between simple, partial, and multiple correlation is based upon the number of variables involved in the correlation analysis.

If only two variables are chosen to study the correlation between them, then such a correlation is referred to as simple correlation. Example: A study of the yield of a crop with respect to only the amount of fertilizer, or of sales revenue with respect to the amount of money spent on advertisement, is an example of simple correlation.



Partial correlation: Two variables are chosen to study the correlation between them, but the effect of other influencing variables is kept constant.

Example:
(i) Yield of a crop is influenced by the amount of fertilizer applied, rainfall, quality of seed, type of soil, and pesticides.
(ii) Sales revenue from a product is influenced by the level of advertising expenditure, quality of the product, price, competitors, distribution, and so on.
In each case, partial correlation examines two of these variables, such as yield and fertilizer, while the remaining factors are held constant.
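The slides do not give a formula for partial correlation. As a hedged illustration, the sketch below uses the standard first-order partial correlation coefficient, which removes the influence of one third variable z from the correlation between x and y. The function name and the three pairwise correlations are hypothetical values chosen only for the demonstration.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant.

    Standard textbook formula, not taken from the slides.
    """
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical pairwise correlations: x = yield, y = fertilizer, z = rainfall.
r_xy_given_z = partial_correlation(r_xy=0.8, r_xz=0.6, r_yz=0.5)
print(f"partial correlation of x and y, holding z constant: {r_xy_given_z:.4f}")
```

With these made-up inputs the partial coefficient is smaller than the simple coefficient of 0.8, because part of the association is shared with z.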



Multiple correlation: The relationship between three or more variables is considered simultaneously for study.

Example: The employer-employee relationship in any organization may be examined with reference to training and development facilities; medical, housing, and children's education facilities; salary structure; grievance handling system; and so on.

Methods of Correlation Analysis


The following methods are used to find the correlation coefficient between two variables x and y:
Scatter Diagram method
Karl Pearson's Coefficient of Correlation method
Spearman's Rank Correlation method
Method of Least-squares



The figure shows how the strength of the association between two variables is represented by the coefficient of correlation.

[Figure: scale of the correlation coefficient from -1.00 to +1.00. -1.00: perfect negative correlation; between -1.00 and -0.50: strong negative correlation; -0.50: moderate negative correlation; between -0.50 and 0: weak negative correlation; 0: no correlation; between 0 and +0.50: weak positive correlation; +0.50: moderate positive correlation; between +0.50 and +1.00: strong positive correlation; +1.00: perfect positive correlation.]

Scatter Diagram Method


The scatter diagram method is a quick, at-a-glance method of determining whether there is an apparent relationship between two variables.

A scatter diagram (or a graph) can be obtained on a graph paper by plotting observed (or known) pairs of values of variables x and y, taking the independent variable values on the x-axis and the dependent variable values on the y-axis.
Scatter diagram: A graph of pairs of values of two variables that is plotted to indicate a visual display of the pattern of their relationship.

Interpretation of Correlation Coefficients


While interpreting the correlation coefficient r, the following points should be taken into account:
A low value of r does not indicate that the variables are unrelated; it indicates that the relationship is poorly described by a straight line. A non-linear relationship may also exist.
Correlation does not imply a cause-and-effect relationship; it is merely an observed association.

Types of Correlation Coefficient


Coefficient    Conditions applied to use
Phi (φ)        Both x and y variables are measured on a nominal scale
Rho (ρ)        Both x and y variables are measured on, or changed to, ordinal scales (rank data)
r              Both x and y variables are measured on an interval or ratio scale (numerical data)
Eta (η)        It quantifies non-linear relationships

Karl Pearson's Correlation Coefficient

Step Deviation Method for Ungrouped Data


When the actual mean values $\bar{x}$ and $\bar{y}$ are fractional, the calculation of Pearson's correlation coefficient can be simplified by taking deviations of x and y values from their assumed means A and B, respectively. That is, $d_x = x - A$ and $d_y = y - B$, where A and B are assumed means of x and y values.

$$
r = \frac{n \sum d_x d_y - \left(\sum d_x\right)\left(\sum d_y\right)}{\sqrt{n \sum d_x^2 - \left(\sum d_x\right)^2}\;\sqrt{n \sum d_y^2 - \left(\sum d_y\right)^2}}
$$
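The assumed-mean calculation can be sketched in a few lines. In this illustration the data are the first positive-correlation example from earlier in the slides, and the assumed means A = 10 and B = 16 are arbitrary choices made here, not values from the source.

```python
from math import sqrt


def step_deviation_r(xs, ys, assumed_mean_x, assumed_mean_y):
    """Correlation coefficient from deviations about assumed means A and B."""
    n = len(xs)
    dx = [x - assumed_mean_x for x in xs]
    dy = [y - assumed_mean_y for y in ys]
    sum_dx = sum(dx)
    sum_dy = sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = sqrt(n * sum_dx2 - sum_dx ** 2) * sqrt(n * sum_dy2 - sum_dy ** 2)
    return numerator / denominator


xs = [5, 8, 10, 15, 17]
ys = [10, 12, 16, 18, 20]

r_assumed = step_deviation_r(xs, ys, assumed_mean_x=10, assumed_mean_y=16)
r_other = step_deviation_r(xs, ys, assumed_mean_x=0, assumed_mean_y=0)
print(f"r with A=10, B=16: {r_assumed:.4f}")
print(f"r with A=0,  B=0 : {r_other:.4f}")
```

Both calls give the same coefficient, which illustrates why the assumed means can be picked purely for arithmetic convenience.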

Step Deviation Method for Grouped Data


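When data on x and y values are classified or grouped into a frequency distribution, each deviation term is weighted by its frequency f, and n = Σf is the total number of paired observations:

$$
r = \frac{n \sum f d_x d_y - \left(\sum f d_x\right)\left(\sum f d_y\right)}{\sqrt{n \sum f d_x^2 - \left(\sum f d_x\right)^2}\;\sqrt{n \sum f d_y^2 - \left(\sum f d_y\right)^2}}
$$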
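A minimal sketch of the grouped-data calculation follows. The class mid-points, the two-way frequency table, and the assumed means are all hypothetical, invented only to show the mechanics; n is taken as the total frequency.

```python
from math import sqrt

# Hypothetical class mid-points and two-way frequency table.
x_mids = [10, 20, 30]
y_mids = [5, 15, 25]
freq = [  # freq[i][j] = number of pairs with x = x_mids[i] and y = y_mids[j]
    [4, 1, 0],
    [1, 3, 1],
    [0, 1, 4],
]
A, B = 20, 15  # assumed means, chosen arbitrarily

n = sum(sum(row) for row in freq)
sum_fdx = sum_fdy = sum_fdx2 = sum_fdy2 = sum_fdxdy = 0
for i, x in enumerate(x_mids):
    for j, y in enumerate(y_mids):
        f = freq[i][j]
        dx, dy = x - A, y - B
        sum_fdx += f * dx
        sum_fdy += f * dy
        sum_fdx2 += f * dx * dx
        sum_fdy2 += f * dy * dy
        sum_fdxdy += f * dx * dy

numerator = n * sum_fdxdy - sum_fdx * sum_fdy
denominator = sqrt(n * sum_fdx2 - sum_fdx ** 2) * sqrt(n * sum_fdy2 - sum_fdy ** 2)
r_grouped = numerator / denominator
print(f"n = {n}, r = {r_grouped:.2f}")
```

Because most of the frequency in this made-up table lies on the diagonal, the coefficient comes out strongly positive.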

Assumptions of Using Pearson's Correlation Coefficient


Pearson's correlation coefficient is appropriate to calculate when both variables x and y are measured on an interval or a ratio scale.
Both variables x and y are normally distributed, and there is a linear relationship between these variables.
The correlation coefficient is largely affected by truncation of the range of values in one or both of the variables. This occurs when the distributions of both the variables greatly deviate from the normal shape.
There is a cause and effect relationship between two variables that influences the distributions of both the variables. Otherwise the correlation coefficient might either be extremely low or even zero.

Probable Error and Standard Error of Coefficient of Correlation


The probable error (PE) of the coefficient of correlation indicates the extent to which its value depends on the condition of random sampling. If r is the calculated value of the correlation coefficient in a sample of n pairs of observations, then the standard error $SE_r$ of the correlation coefficient r is given by

$$
SE_r = \frac{1 - r^2}{\sqrt{n}}
$$

The probable error of the coefficient of correlation is calculated by the expression:

$$
PE_r = 0.6745\, SE_r = 0.6745\, \frac{1 - r^2}{\sqrt{n}}
$$

Thus, with the help of $PE_r$ we can determine the range within which the population coefficient of correlation is expected to fall, using the following formula:

$$
\rho = r \pm PE_r
$$

where $\rho$ (rho) represents the population coefficient of correlation.

Remarks
If $r < PE_r$, then the value of r is not significant, that is, there is no relationship between the two variables of interest.
If $r > 6\,PE_r$, then the value of r is significant, that is, there exists a relationship between the two variables.
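The standard error, probable error, and the two rules of thumb can be combined in a short sketch. Two things here are assumptions rather than statements from the slides: the magnitude of r is compared with the probable error so that negative coefficients are handled, and values between PE and 6 PE are labelled "inconclusive" because the remarks say nothing about that band. The sample values r = 0.8 and n = 25 are hypothetical.

```python
from math import sqrt


def standard_error(r, n):
    """Standard error of r for a sample of n pairs."""
    return (1 - r ** 2) / sqrt(n)


def probable_error(r, n):
    """Probable error of r: 0.6745 times the standard error."""
    return 0.6745 * standard_error(r, n)


def assess(r, n):
    """Return the probable error, the range r +/- PE, and a verdict."""
    pe = probable_error(r, n)
    if abs(r) < pe:  # assumption: compare the magnitude of r
        verdict = "not significant"
    elif abs(r) > 6 * pe:
        verdict = "significant"
    else:  # assumption: the slides do not classify this band
        verdict = "inconclusive"
    return pe, (r - pe, r + pe), verdict


pe, rho_range, verdict = assess(r=0.8, n=25)
print(f"PE = {pe:.4f}")
print(f"rho expected between {rho_range[0]:.4f} and {rho_range[1]:.4f}")
print(f"verdict: {verdict}")
```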

The Coefficient of Determination


The coefficient of determination, $r^2$, represents the proportion (or percentage) of the total variability of the dependent variable y that is accounted for or explained by the independent variable x. The proportion (or percentage) of variation in y that x can explain determines more precisely the extent or strength of association between the two variables x and y.

Interpretation of Coefficient of Determination


The coefficient of determination is preferred for interpreting the strength of association between two variables because it is easier to interpret a percentage.
If $r^2 = 0$, then no variation in y can be explained by the variable x, and x is of no value in predicting the value of y. There is no association between x and y.
If $r^2 = 1$, then values of y are completely explained by x. There is perfect association between x and y.



If $0 \le r^2 \le 1$, the degree of explained variation in y as a result of variation in values of x depends on the value of $r^2$. A value of $r^2$ closer to 0 shows a low proportion of variation in y explained by x. On the other hand, a value of $r^2$ closer to 1 shows that variable x can predict the actual value of the variable y.



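[Figure: strength of association between variables x and y on a scale of $r^2$ from 0 to 1.00, alongside the proportion (percentage) of explained variation in y. 0 (0%): none; between 0 and 0.50: weak; 0.50 (50%): moderate; between 0.50 and 1.00: strong; 1.00 (100%): perfect.]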

$$
r^2 = 1 - \frac{\text{Unexplained variability in } y}{\text{Total variability in } y}
$$

$$
r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{\sum y^2 - a \sum y - b \sum xy}{\sum y^2 - n(\bar{y})^2}
$$

where $\hat{y} = a + bx$ is the estimated value of y for given values of x. One minus the ratio between these two variations is referred to as the coefficient of determination.
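To make the formula concrete, the sketch below fits the least-squares line y = a + bx to the first positive-correlation example from earlier in the slides and computes the coefficient of determination in two ways: from the residuals directly, and from the summation shortcut. The least-squares expressions for a and b are the standard ones and are not spelled out in the slides.

```python
xs = [5, 8, 10, 15, 17]
ys = [10, 12, 16, 18, 20]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Standard least-squares slope b and intercept a of y = a + bx.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

mean_y = sum_y / n
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
total = sum((y - mean_y) ** 2 for y in ys)

# Definition: one minus unexplained over total variability.
r2_residuals = 1 - unexplained / total
# Shortcut using sums only.
r2_shortcut = 1 - (sum_y2 - a * sum_y - b * sum_xy) / (sum_y2 - n * mean_y ** 2)

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"r^2 from residuals: {r2_residuals:.4f}")
print(f"r^2 from shortcut : {r2_shortcut:.4f}")
```

The two values agree, and about 95 percent of the variation in y is explained by x for this data set.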

Spearman's Rank Correlation Coefficient


This method is applied to measure the association between two variables when only ordinal (or rank) data are available. It is applied in situations in which a quantitative measure of certain qualitative factors, such as judgment, brand personalities, TV programmes, leadership, colour, or taste, cannot be fixed, but individual observations can be arranged in a definite order (also called rank).

The ranking is decided by using a set of ordinal rank numbers, with 1 for the individual observation ranked first, either in terms of quantity or quality, and n for the individual observation ranked last in a group of n pairs of observations. Mathematically, Spearman's rank correlation coefficient is defined as:

$$
R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}
$$

where
R = rank correlation coefficient
$R_1$ = rank of observations with respect to the first variable
$R_2$ = rank of observations with respect to the second variable
$d = R_1 - R_2$, difference in a pair of ranks
n = number of paired observations or individuals being ranked

The number 6 is placed in the formula as a scaling device; it ensures that the possible range of R is from -1 to 1. While using this method we may come across three types of cases.

Advantages
This method is easy to understand and its application is simpler than Pearson's method.
This method is useful for correlation analysis when variables are expressed in qualitative terms like beauty, intelligence, honesty, efficiency, and so on.
This method is appropriate to measure the association between two variables if the data type is at least ordinal scaled (ranked).
The sample data of values of two variables is converted into ranks, either in ascending or descending order, for calculating the degree of correlation between the two variables.

Disadvantages
Values of both variables are assumed to be normally distributed and to describe a linear relationship rather than a non-linear relationship.
A large computational time is required when the number of pairs of values of two variables exceeds 30.
This method cannot be applied to measure the association between two variables in grouped data.

Case 1: When Ranks are Given


When observations in a data set are already arranged in a particular order (rank), take the difference in each pair of ranks to determine d. Square these differences and obtain the total $\sum d^2$.
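For ranks that are already given and contain no ties, the calculation reduces to a few sums. The sketch below uses hypothetical ranks from two judges for five items and assumes the usual Spearman formula R = 1 - 6Σd² / (n(n² - 1)).

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Rank correlation from two lists of untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks given by two judges to the same five items.
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]

rank_corr = spearman_from_ranks(judge_1, judge_2)
print(f"R = {rank_corr:.2f}")
```

Here the differences d are -1, 1, -1, 1, 0, so Σd² = 4 and the judges agree closely.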

Case 2: When Ranks are not Given


When pairs of observations in the data set are not ranked as in Case 1, the ranks are assigned by taking either the highest value or the lowest value as 1 for the values of both variables.

Case 3: When Ranks are Equal


We may come across a situation in which more than one observation is of equal size. In such a case, the rank to be assigned to each of these observations is the average of the ranks which they would have received had they differed from each other.

For example, if two observations are ranked equal at third place, then the average rank of (3 + 4)/2 = 3.5 is assigned to these two observations.
Similarly, if three observations are ranked equal at third place, then the average rank of (3 + 4 + 5)/3 = 4 is assigned to these three observations.
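The averaging rule can be sketched as a small ranking helper. The function name and the score lists are hypothetical; the highest value is given rank 1 by default, and the two calls reproduce the two situations described above (two values tied at third place, then three values tied at third place).

```python
def average_ranks(values, highest_first=True):
    """Assign ranks, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=highest_first)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end correspond to ranks pos+1..end+1.
        average = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


two_tied = average_ranks([90, 85, 70, 70, 60])
three_tied = average_ranks([90, 85, 70, 70, 70, 60])
print(two_tied)
print(three_tied)
```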



When equal ranks are assigned to a few observations in the data set, an adjustment is made in the Spearman rank correlation coefficient formula as given below:

$$
R = 1 - \frac{6\left\{\sum d^2 + \frac{1}{12}\left(m_1^3 - m_1\right) + \frac{1}{12}\left(m_2^3 - m_2\right) + \cdots\right\}}{n(n^2 - 1)}
$$

where $m_i$ (i = 1, 2, 3, . . .) stands for the number of times an observation is repeated in the data set for both variables.
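A hedged sketch of the adjusted formula: for every group of tied ranks in either variable, (m³ - m)/12 is added to Σd² before the usual scaling. The rank lists are hypothetical, with one pair tied at rank 3.5 in the first variable.

```python
from collections import Counter


def tie_correction(ranks):
    """Sum of (m^3 - m) / 12 over every group of m tied ranks."""
    return sum((m ** 3 - m) / 12 for m in Counter(ranks).values() if m > 1)


def spearman_with_ties(ranks_1, ranks_2):
    """Rank correlation with the adjustment for equal ranks."""
    n = len(ranks_1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    adjusted = sum_d2 + tie_correction(ranks_1) + tie_correction(ranks_2)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))


# Hypothetical ranks; the first variable has two observations tied at 3.5.
ranks_1 = [1, 2, 3.5, 3.5, 5]
ranks_2 = [1, 3, 2, 4, 5]

correction = tie_correction(ranks_1)
rank_corr_ties = spearman_with_ties(ranks_1, ranks_2)
print(f"tie correction = {correction}, R = {rank_corr_ties:.2f}")
```

In this example Σd² = 3.5 and the single tied pair contributes (2³ - 2)/12 = 0.5, so the adjusted sum is 4.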
