I. SCATTER DIAGRAM
This is a graphical display of the relationship between two variables (the dependent and
independent variables). It is also known as a scatter plot.
In the scatter diagram, each individual / observation is represented by a point so that the
horizontal position corresponds to the independent variable while the vertical position
corresponds to the dependent variable.
Example 1
Plot a scatter plot for the data given below
X 2 3 4 5 6 7 8 9
Y 3 4 8 9 10 11 13 14
[Fig 1: scatter diagram of Y against X]
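As a rough illustration of the plotting rule described above (independent variable on the horizontal axis, dependent variable on the vertical axis), the following Python sketch prints a text-only scatter diagram of the Example 1 data. The helper name `text_scatter` is hypothetical and not part of these notes, and the sketch assumes integer data so that each observation falls on a grid position.

```python
# Data from Example 1
x = [2, 3, 4, 5, 6, 7, 8, 9]
y = [3, 4, 8, 9, 10, 11, 13, 14]

def text_scatter(xs, ys):
    """Return a crude text scatter diagram with one '*' per (x, y) observation.

    Hypothetical helper for illustration; assumes integer x and y values.
    """
    points = set(zip(xs, ys))
    columns = range(min(xs), max(xs) + 1)
    lines = []
    # Vertical axis: y values, printed from the largest down to the smallest.
    for level in range(max(ys), min(ys) - 1, -1):
        row = "".join(" * " if (col, level) in points else "   " for col in columns)
        lines.append(f"{level:>3} |{row}")
    # Horizontal axis: x values.
    axis = "".join(f"{col:^3}" for col in columns)
    lines.append("    +" + "-" * len(axis))
    lines.append("     " + axis)
    return "\n".join(lines)

diagram = text_scatter(x, y)
print(diagram)
```

The points rise from the lower left to the upper right, which suggests a positive relationship between X and Y.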
Example 2
Given the following data, plot a scatter diagram for the data.
X 4 6 8 10 11 14 16 17 18
Y 12 11 9 6 4 5 2 6 5
Example 3
In AUN, MAT 110 is a prerequisite for STA 101. A sample of 15 students was drawn and
their scores in the two courses are recorded below:
MAT 110 65 58 93 68 74 81 58 85 88 75 63 79 80 54 72
STA 101 45 52 64 51 48 65 23 33 49 28 39 41 50 28 36
Draw a scatter diagram for the data.
Example 4
A real estate agent wanted to determine the extent to which the selling price of a home is
related to its size. To acquire this information he took a random sample of twelve homes that
were sold recently in a city. The data obtained are shown below:
Size (100km2) 23 18 26 20 22 14 33 28 23 20 27 18
Price (N ‘000) 315 229 355 261 234 216 308 306 289 204 265 195
Use a graphical technique to describe the relationship between the size and price of homes.
II. CORRELATION
This is a technique used to investigate the relationship between two variables (say X and Y).
Does a change in one of these variables bring about changes in the other? For instance, can
an increase in temperature cause changes in the rate of a chemical reaction? Can an increase
in the price of a commodity affect the quantity demanded of the commodity?
We shall consider the most popular correlation coefficients in this course. These are:
1. Pearson’s product-moment correlation coefficient (simply product-moment
correlation coefficient). This is used for interval and ratio data.
2. Spearman’s correlation coefficient (simply rank order correlation coefficient). This
is used for ordinal data.
Properties of correlation coefficient
i. The value of the correlation coefficient ranges from negative one to positive one,
i.e. –1 ≤ r ≤ +1.
ii. If the correlation coefficient is –1, it indicates that there is a perfect negative
relationship between the variables such that as one increases, the other decreases and
vice versa.
iii. If the value of correlation coefficient is +1, it shows that there is a perfect positive
relationship between the variables so that as one increases, the other increases and
vice versa.
iv. If the value of correlation coefficient is zero, then there is an indication that there is
no linear relationship between the variables.
PEARSON’S PRODUCT-MOMENT CORRELATION COEFFICIENT
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
Example 9: A group of 10 students carried out an experiment to measure the height and
weight of one another. The objective of the experiment is to find out whether there is a
linear relationship between the height and the weight of a student. The data obtained from
the experiment are shown below:
Height (inch) 70 72 68 69 66 67 65 66 59 69
Weight (kg) 76 74 68 90 80 80 71 59 61 67
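The calculation for Example 9 can be sketched in Python as follows. This is a minimal sketch that applies the raw-sums formula for r stated above; the function name `pearson_r` and the variable names are illustrative choices, not part of the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from Example 9
height = [70, 72, 68, 69, 66, 67, 65, 66, 59, 69]   # inches
weight = [76, 74, 68, 90, 80, 80, 71, 59, 61, 67]   # kg

r = pearson_r(height, weight)
print(round(r, 2))
```

For these data the intermediate sums are ∑x = 671, ∑y = 726, ∑xy = 48853, ∑x² = 45137 and ∑y² = 53508, which gives r of about 0.46, a fair or moderate positive linear relationship between height and weight.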
Example 10: A professor is reviewing the examination results of a class of 20 students.
There are two sets of marks for each student. The first set of scores, denoted by X, is the
mark obtained by each student from the continuous assessment, where 70 is the maximum
obtainable mark. The second set of scores, denoted by Y, is the mark obtained from
examination, and 30 is the maximum possible mark. The results are shown in the following
table. Calculate the Pearson’s product-moment correlation coefficient for the data.
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
X 27 24 31 49 35 47 35 31 26 50 37 48 31 56 27 42 35 33 49 23
Y 21 16 20 16 9 21 18 9 15 15 24 23 20 15 18 20 9 12 14 17
The extreme values of r, that is, when r = ±1, indicate that there is perfect (positive or
negative) correlation between X and Y. However, if r is 0, we say that there is no or zero
correlation.
Note
When r = 0, we may not assert that there is no correlation at all between X and Y.
Pearson’s correlation coefficient is meant to measure linear relationship only. It should not
be used in the case of non-linear relationships since it will obviously lead to an erroneous
interpretation.
The remaining values, falling in subintervals of [–1, 1], describe the relationship in terms
of its strength. Fig. 2 below may be used as a guideline as to what adjective should be
used for the values of r obtained after calculation to describe the relationship.
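[Fig. 2: guideline scale for describing r, marked at ±0.2, ±0.3, ±0.4, ±0.5, ±0.6, ±0.8 and ±1.
Surviving labels: Poor or weak (around ±0.2 to ±0.3), Fair or moderate (around ±0.4 to ±0.5),
Perfect (±1).]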
Note that Fig 2 is only to be used as a guideline. There are no set values that
demarcate, for example, moderate from strong correlation.
We observe that the strength of the relationship between X and Y is the same whether r =
0.85 or – 0.85. The only difference is that there is direct correlation in the first case and
inverse correlation in the second. We should bear in mind that r is the linear correlation
coefficient and that, as mentioned earlier, its value can be wrongly interpreted whenever
the relationship between X and Y is non-linear. That is the reason why we should have a
look at a scatter diagram of points (x, y) and verify whether the relationship is, for
example, of quadratic, logarithmic, exponential or trigonometric (briefly, non-linear)
nature.
If r = 0, we should not jump to the conclusion that there is no correlation at all between
X and Y. Consider the case where there is perfect (but unsuspected) non-linear
correlation between the two variables, say, related by the equation Y = X²
(see the figure below). Taking an initial set of points (–3, 9), (–2, 4), (–1, 1), (0, 0), (1, 1),
(2, 4) and (3, 9), one may easily verify that both ∑x and ∑xy are equal to zero.
Consequently, the numerator of the formula for r is zero and r = 0 (check the formula for r
above). We deduce that the linear product-moment correlation coefficient cannot be used
to interpret the strength of a non-linear relationship.
[Figure: points lying on the curve Y = X², plotted against X with origin O]
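The claim that r = 0 for the seven points on Y = X² can be checked with a short Python sketch. It applies the same raw-sums formula for r; the variable names are illustrative.

```python
from math import sqrt

# Seven points on the curve Y = X^2, as listed in the text
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Both sum_x and sum_xy vanish, so the numerator of r is zero.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_xy, r)
```

Even though Y is completely determined by X here, the linear coefficient comes out as zero, which is why the scatter diagram should be inspected before r is interpreted.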
With practice and experience, it is even possible to know approximately the value of r
by inspection of a scatter diagram. The location (amount of scattering) of the points with
respect to the least-squares regression line indicates the strength of the relationship
between the variables. The more scattered the points are, the weaker is the relationship and
the closer is the value of r to zero.
Note: If the variables are qualitative but can be ranked, that is, ordinal, then
it is advisable to use a non-parametric method of determining the correlation
coefficient, namely Spearman's (not included in this note).
[Figure: six scatter diagrams of Y against X, illustrating r = 1, r = –0.8, r = 0.6 and three
different patterns with r = 0]
For example, individuals with a higher level of income have both higher levels of savings and spending.
We might find that there is a positive correlation between level of savings and level of spending but this
does not mean that one variable causes the other.
Spurious correlation
Spurious correlation occurs between two variables that are supposed to be mutually independent. It
must be conceded that the correlation coefficient can be readily calculated for any given set of paired
data, the names of the variables being totally irrelevant.
With a given set of data, we may well find that the correlation between X and Y is very high
(for example, r = 0.89). But does that mean that these two variables are genuinely related?
Certainly not, not even through the longest and most complex cause-effect chain. That is what
spurious correlation is all about.
y = α + βx + e
where y and x are the dependent and independent variables respectively, α is the y-intercept and β is
the slope of the line. α and β are also referred to as the regression coefficients (parameters), and
their estimates are obtained by:
$$ \beta = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$
and
$$ \alpha = \bar{y} - \beta\bar{x} $$
where x̄ and ȳ are the sample means of x and y, and e is the error term (residual).
Example 13: Consider the data on advertising expenditures (x) and sales revenue (y) for De United
foods (maker of Indomie instant noodles) for a period of five months. The observations are as follows:
Month    Advert expenditure, x (hundreds of thousands)    Sales revenue, y (millions)
1        1                                                3
2        2                                                4
3        3                                                2
4        4                                                6
5        5                                                8
Total
Fit a regression model of y on x for the data. Hence, predict sales revenue when N800,000 is expended
on advert (that is, x = 8, since x is measured in hundreds of thousands).
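A minimal Python sketch of Example 13, using the estimates for α and β given above, might look like this. The function name `fit_line` is a hypothetical helper, and the sketch assumes that N800,000 corresponds to x = 8 because x is recorded in hundreds of thousands.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha = sum_y / n - beta * sum_x / n   # alpha = y-bar - beta * x-bar
    return alpha, beta

# Data from Example 13
advert = [1, 2, 3, 4, 5]   # x, in hundreds of thousands
sales = [3, 4, 2, 6, 8]    # y, in millions

alpha, beta = fit_line(advert, sales)

# Assumption: N800,000 of advertising corresponds to x = 8.
predicted_sales = alpha + beta * 8
print(round(alpha, 2), round(beta, 2), round(predicted_sales, 2))
```

Here ∑x = 15, ∑y = 23, ∑xy = 81 and ∑x² = 55, so β = 60/50 = 1.2 and α = 4.6 – 1.2(3) = 1.0. The fitted line is y = 1.0 + 1.2x, and the predicted sales revenue at x = 8 is about 10.6 million.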
Example 14: The data below were obtained for the amount of rainfall (in inches) and the yield of
cowpea (in kg) per acre for a selected sample. Fit a regression line to the data. Hence, predict the yield
when there are 5 inches of rain.
Rainfall (x) Yield of cowpea (y)
2 38
3 42
7 41
8 45
10 46
11 44
13 48
Example 15
Fit a regression model to the data given below
X 1 2 3 4 5 6
Y 20 14 18 12 10 8
have been in a relationship to the amount of money, y (in thousand naira) that is spent when they go out.
The equation of the regression line was found to be y = 7 – 0.5x. This implies that α = 7 and β = –0.5.
The y-intercept (α) tells us that at the beginning of the relationship, the average date costs N7000. The
slope (β) shows that for each additional month the relationship lasts, the average date costs N500 less.
The regression equation can be used to predict the amount of money that a date costs when the
relationship has lasted, for example, 8 months.
That is, y = 7 – 0.5(8) = 3 (in thousand naira), meaning that the average cost of a date after 8 months
of the relationship will be N3000.
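The unit handling in this prediction can be sketched in Python. Since y is measured in thousand naira, the line y = 7 – 0.5x is evaluated first and the result is then converted to naira. The function name `predicted_cost_naira` is a hypothetical helper used only for illustration.

```python
def predicted_cost_naira(months):
    """Predicted average cost of a date, in naira, from y = 7 - 0.5x.

    x is the number of months the relationship has lasted and
    y is in thousand naira, so the result is multiplied by 1000.
    """
    y_thousands = 7 - 0.5 * months
    return y_thousands * 1000

cost_at_start = predicted_cost_naira(0)      # the intercept, in naira
cost_after_8_months = predicted_cost_naira(8)
print(cost_at_start, cost_after_8_months)
```

The intercept corresponds to N7000 and the prediction after 8 months to N3000, with each additional month lowering the predicted cost by N500.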
Exercise:
1. A store manager selling TV sets observed the following sales on 10 different days. Fit a
regression equation of y on x, where y is the number of TV sets sold and x is the number of sales
representatives he employed. Hence, interpret your results.
X 3 6 10 5 10 12 5 10 10 8
Y 7 12 12 20 27 28 25 30 29 26
2. The annual bonuses (N ‘000) of 6 randomly selected employees and their years of service were
recorded and shown below. Fit a regression model to the data and interpret your results. Predict
the bonus of an employee who has served for ten years.
Years (x) 1 2 3 4 5 6
Bonus (y) 6 10 19 25 27 30
3. Attempting to analyze the relationship between advertising and sales, the owner of a furniture
store recorded the monthly advertising budget, x (in thousands of naira) and the sales, y (in
millions of naira) for a sample of 12 months. The data are shown below:
X 23 46 60 54 28 33 25 31 36 88 90 99
Y 9.6 11.3 12.8 9.8 8.9 12.5 12.0 11.4 12.6 13.7 14.4 15.9
Fit a regression model to the data and interpret the coefficients.
Coefficient of Determination
The coefficient of determination is the ratio of the explained variation to the total variation
and is denoted by r². That is,
$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$
The value of r² is usually expressed as a percentage, that is, the percentage of the total variation in
the dependent variable that is explained by the regression line using the independent variable.
Another way to arrive at the value of r² is to square the correlation coefficient r; for a simple linear
regression this gives the same value as the variation ratio.
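The agreement between the two routes to r² can be checked numerically. The Python sketch below uses the Example 13 data and assumes the usual definitions, which these notes do not spell out: explained variation is the sum of squared deviations of the fitted values from ȳ, and total variation is the sum of squared deviations of the observed y from ȳ.

```python
from math import sqrt

# Data from Example 13
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 6, 8]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Least-squares line y = alpha + beta * x
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
y_bar = sum_y / n
alpha = y_bar - beta * sum_x / n
fitted = [alpha + beta * xi for xi in x]

# Assumed definitions of the two kinds of variation
explained = sum((f - y_bar) ** 2 for f in fitted)
total = sum((yi - y_bar) ** 2 for yi in y)
r2_ratio = explained / total

# Pearson's r from the raw-sums formula, then squared
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r2_ratio, 4), round(r ** 2, 4))
```

For these data the explained variation is 14.4 and the total variation is 23.2, so r² is about 0.62 by either route: roughly 62% of the variation in sales revenue is explained by the regression on advertising expenditure.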