
CORRELATION AND REGRESSION ANALYSES

I. SCATTER DIAGRAM
This is a graphical display of the relationship between two variables (a dependent and an independent variable). It is also known as a scatter plot.
In a scatter diagram, each individual observation is represented by a point whose horizontal position corresponds to the independent variable and whose vertical position corresponds to the dependent variable.
Example 1
Plot a scatter plot for the data given below
X 2 3 4 5 6 7 8 9
Y 3 4 8 9 10 11 13 14

Fig. 1 Scatter diagram for Example 1
Example 2
Given the following data, plot a scatter diagram for the data.
X 4 6 8 10 11 14 16 17 18
Y 12 11 9 6 4 5 2 6 5
Example 3
In AUN, MAT 110 is a prerequisite for STA 101. A sample of 15 students was drawn and
their scores in the two courses were recorded below;
MAT 110 65 58 93 68 74 81 58 85 88 75 63 79 80 54 72
STA 101 45 52 64 51 48 65 23 33 49 28 39 41 50 28 36
Draw a scatter diagram for the data.
Example 4
A real estate agent wanted to determine the extent to which the selling price of a home is related to its size. To acquire this information he took a random sample of twelve homes that were sold recently in a city. The data obtained are shown below:
Size (100km2) 23 18 26 20 22 14 33 28 23 20 27 18
Price (N ‘000) 315 229 355 261 234 216 308 306 289 204 265 195

Use a graphical technique to describe the relationship between the size and price of homes.
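As a rough illustration of the idea (not a substitute for a proper graph), a scatter diagram can even be sketched in plain text. The Python sketch below, with a function name of my choosing, places Example 1's data on a character grid; in practice a plotting library would be used instead.

```python
def ascii_scatter(xs, ys):
    """Return a text grid with '*' at each (x, y) observation.
    The top row corresponds to the largest y value."""
    points = set(zip(xs, ys))
    grid = []
    for y in range(max(ys), min(ys) - 1, -1):
        row = "".join("*" if (x, y) in points else "."
                      for x in range(min(xs), max(xs) + 1))
        grid.append(row)
    return grid

# Example 1's data
x = [2, 3, 4, 5, 6, 7, 8, 9]
y = [3, 4, 8, 9, 10, 11, 13, 14]
for line in ascii_scatter(x, y):
    print(line)
```

The upward drift of the points from left to right is exactly the positive relationship a scatter diagram is meant to reveal.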

II. CORRELATION
This is a technique used to investigate the relationship between two variables (say X and Y). Does a change in one of these variables bring about a change in the other? For instance, can an increase in temperature change the rate of a chemical reaction? Can an increase in the price of a commodity affect the quantity demanded of that commodity?
We shall consider the most popular correlation coefficients in this course. These are:
1. Pearson's product-moment correlation coefficient (simply, the product-moment correlation coefficient), which is used for interval and ratio data.
2. Spearman's correlation coefficient (simply, the rank-order correlation coefficient), which is used for ordinal data.
Properties of correlation coefficient
i. The value of the correlation coefficient ranges from negative one to positive one, i.e. –1 ≤ r ≤ +1.
ii. If the correlation coefficient is –1, it indicates that there is a perfect negative
relationship between the variables such that as one increases, the other decreases and
vice versa.
iii. If the value of correlation coefficient is +1, it shows that there is a perfect positive
relationship between the variables so that as one increases, the other increases and
vice versa.
iv. If the value of correlation coefficient is zero, then there is an indication that there is
no linear relationship between the variables.
PEARSON’S PRODUCT-MOMENT CORRELATION COEFFICIENT
r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}

Where r is the correlation coefficient,


𝒏 is the number of data pairs, ∑ 𝒙𝒚 is sum of products of the pairs,
∑ 𝒙 is sum of 𝒙 values, ∑ 𝒙𝟐 is sum of squares of 𝒙 values,
∑ 𝒚 is sum of 𝒚 values, and ∑ 𝒚𝟐 is sum of squares of 𝒚 values.
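The formula translates directly into code. The minimal Python sketch below (the function name is mine) computes r from the six sums above, and applies it to Example 8's data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient,
    computed from the six sums in the raw-score formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))   # sum of products of the pairs
    sxx = sum(a * a for a in x)              # sum of squares of x values
    syy = sum(b * b for b in y)              # sum of squares of y values
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Example 8's data: age versus weekly hours of exercise
age = [18, 26, 32, 38, 52, 59]
hours = [10, 5, 2, 3, 1.5, 1]
print(round(pearson_r(age, hours), 2))   # -0.83: a strong inverse relationship
```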
Example 5: The scores of STA 101 students in tests I and II are shown below. Compute
the correlation coefficient for the scores
No 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
T1 10 8 5 5 9 14 9 4 0 4 12 7 10 4 3 8 8 7
T2 12 15 5 9 8 11 6 0 3 10 11 6 8 0 4 7 10 11

Example 6: Find the correlation coefficient for the data below


X 2 7 7 11 4 2 3 5 11 4
Y 10 4 9 2 4 5 12 5 3 7
Example 7: A company wanted to know whether a relationship exists between the number of salespersons employed and the total sales of the company. The data collected for six months are shown below.
Month May Jun Jul Aug Sep Oct
Salespersons 43 48 56 61 67 70
Total sales (N ’00,000) 128 120 135 143 141 152
Compute the value of correlation coefficient for the data.
Example 8: A researcher wishes to determine if a person’s age is related to the number of
hours he / she exercises per week. The data obtained by the researcher are given below.
Calculate the correlation coefficient for the data.
Age (X) 18 26 32 38 52 59
Hours (Y) 10 5 2 3 1.5 1

Example 9: A group of 10 students carried out an experiment to measure the height and
weight of one another. The objective of the experiment is to find out whether there is a
linear relationship between the height and the weight of a student. The data obtained from
the experiment are shown below:
Height (inch) 70 72 68 69 66 67 65 66 59 69
Weight (kg) 76 74 68 90 80 80 71 59 61 67
Example 10: A professor is reviewing the examination results of a class of 20 students.
There are two sets of marks for each student. The first set of scores, denoted by X, is the
mark obtained by each student from the continuous assessment, where 70 is the maximum
obtainable mark. The second set of scores, denoted by Y, is the mark obtained from
examination, and 30 is the maximum possible mark. The results are shown in the following
table. Calculate the Pearson’s product-moment correlation coefficient for the data.
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
X 27 24 31 49 35 47 35 31 26 50 37 48 31 56 27 42 35 33 49 23
Y 21 16 20 16 9 21 18 9 15 15 24 23 20 15 18 20 9 12 14 17

SPEARMAN’S CORRELATION COEFFICIENT


The Spearman’s correlation coefficient is used to measure the relationship between two variables that are both measured on an ordinal scale. The basic idea is that instead of using each observation directly in the computation, we use its rank. The Spearman’s correlation coefficient is also known as the Spearman’s rank-order correlation coefficient.
ρ = 1 − 6∑D² / [n(n² − 1)]

Where ρ is the rank-order correlation coefficient,


D is the difference between the pair ranks, and
n is number of pairs.
Example 11: Compute the Spearman’s correlation coefficient for the data below
X Y RX RY D D2
23 45
21 46
19 54
25 44
22 48
30 59
34 56
31 58
27 51
28 65
RANKING TIED SCORES
When converting scores into ranks for the Spearman’s correlation, one may encounter two or more identical scores. These identical scores are ranked using the procedure below:
1. Assign a rank (first, second, third, etc.) to each and every score in the set, including the tied ones.
2. Compute the mean of the ranks of the tied scores and assign this mean value as the final rank of each of them.
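The tie procedure and the ρ formula can be sketched together in Python (function names are mine). Ranking here runs from the smallest score upward, which is equivalent to ranking from the largest, since only the differences D matter. Note that when ties are present, the 6∑D² formula is the common classroom approximation.

```python
def average_ranks(scores):
    """Rank the scores (1 = smallest), giving tied scores the mean of
    the ranks they jointly occupy, as in the tie procedure above."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                            # extend the block of tied scores
        mean_rank = (i + 1 + j + 1) / 2       # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank-order correlation coefficient."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# The X column of Example 12 has ties at 3 and at 6:
print(average_ranks([3, 6, 12, 6, 3, 5, 6]))   # [1.5, 5.0, 7.0, 5.0, 1.5, 3.0, 5.0]
```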
Example 12: Compute the Spearman’s correlation coefficient for the data below.
X Y RX RY D D2
3 22
6 30
12 27
6 18
3 27
5 19
6 14
Total

Interpretation of the correlation coefficient

The extreme values of r, that is, when r = ±1, indicate that there is perfect (positive or
negative) correlation between X and Y. However, if r is 0, we say that there is no or zero
correlation.

Note

When r = 0, we may not assert that there is no correlation at all between X and Y.
Pearson’s correlation coefficient is meant to measure linear relationship only. It should not
be used in the case of non-linear relationships since it will obviously lead to an erroneous
interpretation.
The remaining values, falling in subintervals of [–1, 1], describe the relationship in terms of its strength. Fig. 2 below may be used as a guideline for choosing an adjective to describe the strength of a calculated value of r.

|r|           Strength of correlation
0             no or zero
0.0 to 0.2    very poor or very weak
0.2 to 0.4    poor or weak
0.4 to 0.6    fair or moderate
0.6 to 0.8    strong or high
0.8 to 1.0    very strong or very high
1             perfect

A positive value of r indicates a positive or direct correlation; a negative value indicates a negative or inverse correlation.

Fig. 2 Interpretation of the correlation coefficient

Note that Fig 2 is only to be used as a guideline. There are no set values that
demarcate, for example, moderate from strong correlation.
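The guideline can be written as a small helper. The sketch below (my own function, with cut-off values taken from the Fig. 2 guideline; the boundaries are approximate, as the note above stresses) attaches the adjectives to a value of r:

```python
def describe_r(r):
    """Attach the Fig. 2 adjectives to a correlation coefficient.
    The cut-off values are only a rough guideline, not fixed rules."""
    a = abs(r)
    if a > 1:
        raise ValueError("r must lie in [-1, 1]")
    if a == 0:
        strength = "no or zero"
    elif a == 1:
        strength = "perfect"
    elif a < 0.2:
        strength = "very poor or very weak"
    elif a < 0.4:
        strength = "poor or weak"
    elif a < 0.6:
        strength = "fair or moderate"
    elif a < 0.8:
        strength = "strong or high"
    else:
        strength = "very strong or very high"
    direction = "direct" if r > 0 else "inverse" if r < 0 else ""
    return " ".join(p for p in (strength, direction, "correlation") if p)

print(describe_r(0.85))    # very strong or very high direct correlation
print(describe_r(-0.5))    # fair or moderate inverse correlation
```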

We observe that the strength of the relationship between X and Y is the same whether r =
0.85 or – 0.85. The only difference is that there is direct correlation in the first case and
inverse correlation in the second. We should bear in mind that r is the linear correlation
coefficient and that, as mentioned earlier, its value can be wrongly interpreted whenever
the relationship between X and Y is non-linear. That is the reason why we should have a
look at a scatter diagram of points (x, y) and verify whether the relationship is, for
example, of quadratic, logarithmic, exponential or trigonometric (briefly, non-linear)
nature.
If r = 0, we should not jump to the conclusion that there is no correlation at all between X and Y. Consider the case where there is perfect (but unsuspected) non-linear correlation between the two variables, say, related by the equation Y = X² (see Fig. 3 below). Taking the set of points (–3, 9), (–2, 4), (–1, 1), (0, 0), (1, 1), (2, 4) and (3, 9), one may easily verify that both ∑x and ∑xy are equal to zero, so the numerator n∑xy − (∑x)(∑y) vanishes and, consequently, r = 0 (check the formula for r above). We deduce that the linear product-moment correlation coefficient cannot be used to interpret the strength of a non-linear relationship.

[Scatter diagram: the points above lie exactly on the parabola Y = X².]

Fig. 3 Perfect non-linear relationship
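The Y = X² example can be checked numerically. The short Python sketch below (reusing the raw-score formula for r) confirms that the linear coefficient comes out exactly zero despite the perfect non-linear relationship:

```python
from math import sqrt

def pearson_r(x, y):
    """Same raw-score formula for r as given earlier in the note."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]          # perfect relationship Y = X^2
print(pearson_r(xs, ys))           # 0.0: the linear r misses it entirely
```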

With practice and experience, it is even possible to know approximately the value of r
by inspection of a scatter diagram. The location (amount of scattering) of the points with
respect to the least-squares regression line indicates the strength of the relationship
between the variables. The more scattered the points are, the weaker is the relationship and
the closer is the value of r to zero.

The sign of r is always the same as that of the gradient β in the regression equation
𝑌̂ = 𝛼 + 𝛽𝑋

Fig. 4 below shows how we can deduce the value of r to a certain degree of accuracy from a scatter diagram.

Note: If the variables were ordinal in nature, then it would be advisable to use a non-parametric measure of correlation, namely the Spearman’s rank-order correlation coefficient discussed above.

[Six scatter diagrams: r = 1 (points lying exactly on a rising straight line); r = –0.8 (points close to a falling line); r = 0.6 (points loosely following a rising line); r = 0 (points scattered with no pattern); r = 0 where Y is independent of X, that is, Y assumes roughly the same value irrespective of X; and r = 0 where X and Y have a non-linear relationship.]

Fig. 4 Using scatter diagrams to determine r approximately


Causality

Causality, also known as causation, is defined as a cause-effect relationship between two


variables. A significant correlation does not necessarily indicate causality but rather a common linkage
in a sequence of events. One type of significant correlation situation is when both variables are
influenced by a common cause and therefore are correlated with each other.

For example, individuals with a higher level of income have both higher levels of savings and spending.
We might find that there is a positive correlation between level of savings and level of spending but this
does not mean that one variable causes the other.

Spurious correlation

Spurious correlation occurs between two variables that are supposed to be mutually independent. It
must be conceded that the correlation coefficient can be readily calculated for any given set of paired
data, the names of the variables being totally irrelevant.

We may think about the variables X and Y being respectively given by

X = “number of bags of potatoes sold daily at the Yola market”


Y = “number of accidents daily in Nigeria”

With a given set of data, we may well find that the computed correlation between X and Y is high (for example, 0.89). But does that mean that these two variables are genuinely related? Certainly not, not even through the longest and most complex cause-effect chain. That is what spurious correlation is all about.

III. REGRESSION ANALYSIS


Regression analysis is concerned with describing and evaluating the relationship between a given
variable (usually called dependent variable) and one or more other variables (often called independent
variable(s)). We shall denote the dependent variable by 𝒚 and the independent variables by
𝒙𝟏 , 𝒙𝟐 , … . , 𝒙𝒌
Regression analysis can be simple or multiple. It is simple if only one independent variable is involved, and multiple if more than one independent variable is involved. We shall limit ourselves to simple regression analysis in this course.
Classification of Variables in Regression Analysis
𝒚 𝒙𝟏 , 𝒙𝟐 , … . , 𝒙𝒌

Dependent variable Independent variables


Explained variable Explanatory variables
Endogenous variable Exogenous variables
Regressand Regressors
Predictand Predictors
Effect variable Causal variables
Target variable Control variables

EQUATION OF THE REGRESSION LINE


The equation of the regression line is given by

y = α + βx + e

where y and x are the dependent and independent variables, α is the y-intercept, β is the slope of the line, and e is the error term (residual). α and β are also referred to as the regression coefficients (parameters), and their estimates are obtained from

β = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]

and

α = 𝑦̅ − βx̅
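The two estimation formulas can be sketched in Python (the function name is mine). Applied to Example 13's data, the sketch also answers that example's prediction question:

```python
def fit_line(x, y):
    """Least-squares estimates of alpha and beta,
    using the formulas given above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = (sy - beta * sx) / n        # same as y-bar minus beta times x-bar
    return alpha, beta

# Example 13's data: advert expenditure (x) and sales revenue (y)
alpha, beta = fit_line([1, 2, 3, 4, 5], [3, 4, 2, 6, 8])
print(alpha, beta)               # 1.0 1.2, i.e. the fitted line y-hat = 1.0 + 1.2x
print(alpha + beta * 8)          # predicted sales when x = 8 (N800,000): 10.6
```

So, per Example 13's question, spending N800,000 on advert (x = 8) predicts sales revenue of about N10.6 million.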
Example 13: Consider the data on advertising expenditures (x) and sales revenue (y) for De United
foods (maker of Indomie instant noodles) for a period of five months. The observations are as follows:
Month   Advert expenditure in hundreds of thousands (x)   Sales revenue in millions (y)
1 1 3
2 2 4
3 3 2
4 4 6
5 5 8
Total

Fit a regression model of y on x for the data. Hence, predict sales revenue when N800, 000 is expended
on advert.
Example 14: The data below were obtained for the amount of rainfall (in inches) and the yield of
cowpea (in kg) per acre for a selected sample. Fit a regression line to the data. Hence, predict the yield
when there are 5 inches of rain.
Rainfall (x) Yield of cowpea (y)
2 38
3 42
7 41
8 45
10 46
11 44
13 48

Example 15
Fit a regression model to the data given below
X 1 2 3 4 5 6
Y 20 14 18 12 10 8

INTERPRETATION OF REGRESSION COEFFICIENTS


y = α + βx
α is interpreted as the value of y when x is zero, and β is the amount by which y changes when x increases by one unit.
Example 16: Suppose a study was conducted to compare the length of time, x (in months) couples

have been in a relationship to the amount of money, y (in thousand naira) that is spent when they go out.
The equation of the regression line was found to be y = 7 – 0.5x. This implies that α = 7 and β = – 0.5
The y-intercept (α) tells us that at the beginning of the relationship, the average date costs N7000. The
slope (β) shows that as the relationship lasts an additional one month, the average date costs N500 less
than the previous date. The regression equation can be used to predict the amount of money that a date
costs when the relationship has lasted, for example 8 months.
That is, y₈ = 7 − 0.5(8) = 3, meaning that the average cost of a date after 8 months of the relationship will be N3000.
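The prediction can be checked with a one-line function (x in months, y in thousands of naira, as in Example 16's regression equation):

```python
# Regression line from Example 16: y = 7 - 0.5x,
# with x in months and y in thousands of naira.
def predicted_cost(months):
    return 7 - 0.5 * months

print(predicted_cost(0))   # 7.0, i.e. N7000 at the start of the relationship
print(predicted_cost(8))   # 3.0, i.e. N3000 after 8 months
```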
Exercise:
1. A store manager selling TV sets observed the following sales on 10 different days. Fit a
regression equation of y on x, where y is the number of TV sets sold and x is the number of sales
representatives he employed. Hence, interpret your results.
X 3 6 10 5 10 12 5 10 10 8
Y 7 12 12 20 27 28 25 30 29 26

2. The annual bonuses (N ‘000) of 6 randomly selected employees and their years of service were
recorded and shown below. Fit a regression model to the data and interpret your results. Predict
the bonus of an employee who has served for ten years.
Years (x) 1 2 3 4 5 6
Bonus (y) 6 10 19 25 27 30
3. Attempting to analyze the relationship between advertising and sales, the owner of a furniture
store recorded the monthly advertising budget, x (in thousands of naira) and the sales, y (in
millions of naira) for a sample of 12 months. The data are shown below:
X 23 46 60 54 28 33 25 31 36 88 90 99
Y 9.6 11.3 12.8 9.8 8.9 12.5 12.0 11.4 12.6 13.7 14.4 15.9
Fit a regression model to the data and interpret the coefficients.

Coefficient of Determination
The coefficient of determination is the ratio of the explained variation to the total variation
and is denoted by 𝑟 2 . That is,
𝐸𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛
𝑟2 =
𝑇𝑜𝑡𝑎𝑙 𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛

The term 𝑟 2 is usually expressed as a percentage. That is, percentage of the total variation explained by
the regression line using the independent variable.

Another way to arrive at the value of 𝑟² is to square the correlation coefficient; this gives the same result as the variation ratio above.

Therefore, the coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable.
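Both routes to r² (the explained-to-total variation ratio, and squaring the correlation coefficient) can be compared numerically. The sketch below, using Example 13's data and helper names of my choosing, shows that they coincide:

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation coefficient (raw-score formula)."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def r_squared(x, y):
    """Coefficient of determination as explained / total variation."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = (sy - beta * sx) / n
    y_bar = sy / n
    explained = sum((alpha + beta * a - y_bar) ** 2 for a in x)  # variation of y-hat
    total = sum((b - y_bar) ** 2 for b in y)                     # variation of y
    return explained / total

x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 6, 8]
print(round(r_squared(x, y), 4))         # 0.6207
print(round(pearson_r(x, y) ** 2, 4))    # 0.6207: the two routes agree
```

Here about 62% of the total variation in sales revenue is explained by the regression line using advert expenditure.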

