
Brief lecture notes

Correlation and Simple Linear Regression

Bivariate Data/Distribution
Bivariate data are statistical data consisting of simultaneous measurements of
two variables on the same items. Thus, for n items we have n pairs of
measurements or observations:
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n).

Example
The following bivariate data show the height (cm) and weight (kg) of 10
students.
Height (cm) 130 126 120 124 125 127 126 123 130 124
Weight (kg) 40 32 23 35 34 34 32 28 38 30

In the above data the height (cm) and weight (kg) of the first student are paired
as (130, 40), for the second student as (126, 32) and so on.

Scatter Diagram

A scatter diagram is a graphical representation of bivariate data.

For a bivariate distribution (x_i, y_i), i = 1, 2, ..., n, the scatter diagram is
obtained by plotting each pair as a dot in the xy-plane, with the values of x
along the x-axis and the values of y along the y-axis. From a scatter diagram
one can see at a glance whether the two variates appear to be correlated.

Table 1: Student scores on entrance examination and cumulative grade-point
averages at graduation.

Student                                       A     B     C     D     E     F     G     H
Entrance examination score (100 = maximum)    74    69    85    63    82    60    79    91
Cumulative GPA (4.0 = A)                      2.6   2.2   3.4   2.3   3.1   2.1   3.2   3.8

To begin, we should transfer the information in Table 1 to a graph. Because the
director wishes to use examination scores to predict success in college, we have
placed the cumulative GPA (the dependent variable) on the vertical or Y-axis
and the entrance examination score (the independent variable) on the horizontal
or X-axis. Figure-1 shows the completed scatter diagram.

Figure-1: Scatter diagram of student scores on entrance examinations plotted
against cumulative grade-point average. (Vertical axis: cumulative GPA, 2 to 4;
horizontal axis: entrance examination scores, 50 to 95.)

Independent and Dependent Variables


Regression and correlation analyses are based on the relationship, or
association, between two (or more) variables. The known variable (or variables)
is called the independent variable(s). The variable we are trying to predict is the
dependent variable.

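Example: When the price of a commodity rises, the demand for it goes down
(price is the independent variable, demand the dependent variable); with a
better monsoon, agricultural output increases (the monsoon is the independent
variable, output the dependent variable).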
Correlation Analysis
Correlation analysis is a group of statistical techniques used to measure the
strength of the relationship (correlation) between two variables. Its basic
purpose is to find out how strong that relationship is.

Measures of correlation

The coefficient of correlation


The degree of correlation between two variables X and Y is measured by the
coefficient of correlation, or correlation coefficient. Designated r, it is often
referred to as Pearson's r or the Pearson product-moment correlation
coefficient. The formula for r is

\[
r = \frac{\mathrm{S.P.}(x, y)}{\sqrt{\mathrm{S.S.}(x)\,\mathrm{S.S.}(y)}} \qquad \text{(i)}
\]

where

\[
\mathrm{S.P.}(x, y) = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad
\mathrm{S.S.}(x) = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad
\mathrm{S.S.}(y) = \sum y^2 - \frac{\left(\sum y\right)^2}{n}.
\]

The value of the correlation coefficient always lies in the range -1 to 1; that
is,
-1 ≤ ρ ≤ 1 and -1 ≤ r ≤ 1,
where ρ denotes the population correlation coefficient and r the sample
correlation coefficient.
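As an illustration of formula (i), the following minimal Python sketch computes r for the entrance examination scores and cumulative GPAs of Table 1. The function name `pearson_r` and the variable names are assumptions made for this example; they do not come from the notes.

```python
from math import sqrt

# Data from Table 1: entrance examination scores (x) and cumulative GPA (y).
scores = [74, 69, 85, 63, 82, 60, 79, 91]
gpas = [2.6, 2.2, 3.4, 2.3, 3.1, 2.1, 3.2, 3.8]


def pearson_r(x, y):
    """Pearson's r via formula (i): S.P.(x, y) / sqrt(S.S.(x) * S.S.(y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    # Sum of products and sums of squares, each corrected for the mean.
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return sp_xy / sqrt(ss_x * ss_y)


r_gpa = pearson_r(scores, gpas)
print(f"r = {r_gpa:.3f}")
```

For these eight students r comes out at roughly 0.96, which lies inside the range -1 ≤ r ≤ 1 and suggests a strong positive linear relationship between examination score and GPA.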

Figure_C1 Linear correlation between two variables: (a) perfect positive linear correlation, r = 1.
Figure_C2 Linear correlation between two variables: (b) perfect negative linear correlation, r = -1.
Figure_C3 Linear correlation between two variables: (c) no linear correlation, r ≈ 0.
Figure_C4 Linear correlation between variables: (a) strong positive linear correlation (r is close to 1).
Figure_C5 Linear correlation between variables: (b) weak positive linear correlation (r is positive but close to 0).
Figure_C6 Linear correlation between variables: (c) strong negative linear correlation (r is close to -1).
Figure_C7 Linear correlation between variables: (d) weak negative linear correlation (r is negative and close to 0).

Example
Rising Hills Manufacturing Inc. wishes to study the relationship between the
number of workers, X, and the number of tables, Y, produced per hour in its
Redwood Falls plant. It has obtained a random sample of 10 hours of production.
The following (x, y) pairs were obtained:
(12, 20) (30, 60) (15, 27) (24, 50) (14, 21)
(18, 30) (28, 61) (26, 54) (19, 32) (27, 57)
Compute the covariance and correlation coefficient. Discuss briefly the
relationship between the number of workers and the number of tables produced
per hour.

Solution
The computations are set out in the table below, where x̄ = 213/10 = 21.3 and
ȳ = 412/10 = 41.2.

x_i    y_i    (x_i - x̄)   (x_i - x̄)^2   (y_i - ȳ)   (y_i - ȳ)^2   (x_i - x̄)(y_i - ȳ)
12     20     -9.3         86.49          -21.2        449.44         197.16
30     60      8.7         75.69           18.8        353.44         163.56
15     27     -6.3         39.69          -14.2        201.64          89.46
24     50      2.7          7.29            8.8         77.44          23.76
14     21     -7.3         53.29          -20.2        408.04         147.46
18     30     -3.3         10.89          -11.2        125.44          36.96
28     61      6.7         44.89           19.8        392.04         132.66
26     54      4.7         22.09           12.8        163.84          60.16
19     32     -2.3          5.29           -9.2         84.64          21.16
27     57      5.7         32.49           15.8        249.64          90.06
Total
213    412                 378.1                       2505.6          962.4

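Thus

\[
\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{962.4}{9} = 106.93
\]

\[
s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{378.1}{9} = 42.01
\]

\[
s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{2505.6}{9} = 278.4
\]

And the correlation coefficient is

\[
r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{106.93}{\sqrt{42.01}\,\sqrt{278.4}} = 0.989
\]

Because r is close to 1, there is a strong positive linear relationship between
the number of workers and the number of tables produced per hour.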
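The arithmetic of this solution can be checked with a short Python sketch. It recomputes the column totals, the sample covariance, the two sample variances, and r from the ten (x, y) pairs; the variable names are illustrative assumptions, not part of the notes.

```python
from math import sqrt

# (workers, tables) pairs from the Rising Hills example.
pairs = [(12, 20), (30, 60), (15, 27), (24, 50), (14, 21),
         (18, 30), (28, 61), (26, 54), (19, 32), (27, 57)]

n = len(pairs)
x_bar = sum(x for x, _ in pairs) / n  # 213 / 10
y_bar = sum(y for _, y in pairs) / n  # 412 / 10

# Column totals of the table: squared deviations and cross products.
sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
syy = sum((y - y_bar) ** 2 for _, y in pairs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)

# Sample covariance and variances use the divisor n - 1.
cov_xy = sxy / (n - 1)
var_x = sxx / (n - 1)
var_y = syy / (n - 1)
r = cov_xy / (sqrt(var_x) * sqrt(var_y))

print(f"cov = {cov_xy:.2f}, s_x^2 = {var_x:.2f}, s_y^2 = {var_y:.2f}, r = {r:.3f}")
```

Note that the divisor n - 1 cancels in the ratio, so r could equally be computed directly from the three column totals.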

Problem: 1
Calculate the coefficient of correlation between the number of sales calls and
the number of units sold and comment on the result.
Sales Representative R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
No. of Sales Calls (x) 14 35 22 29 6 15 17 20 12 29
No. Units Sold (y) 28 66 38 70 22 27 28 47 14 68

Problem: 2
A department store gives in-service training to its salesmen, followed by a
test. It is considering whether it should terminate the services of any
salesman who does not do well in the test. The following data give the test
scores and the sales made by the salesmen during a certain period.
Test Scores 15 20 25 22 27 23 16 21 20
Sales (Thousand Tk) 32 37 49 38 51 46 33 41 39
Compute the correlation coefficient between the test scores and the sales.

Problem: 3
From the following data, find the association, or correlation, between the
values of two currencies, the German mark and the Japanese yen, from 1988 to 1997.
Exchange rates of the German mark and the Japanese yen per U.S. dollar
Year 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
G_Mark 1.76 1.88 1.62 1.66 1.56 1.65 1.62 1.50 1.54 1.80
J_Yen 128.1 138 145 134.6 126.8 111.2 102.2 103.4 115.9 130.4

Regression Analysis
It is often necessary to develop an equation that expresses the relationship
between two variables and to estimate the value of the dependent variable Y for
a selected value of the independent variable X. The technique used to develop
the equation of the straight line and make these predictions is called
regression analysis. So, regression is a statistical method for estimating (or
predicting) the unknown values of one variable (Y) for specified values of the
other variable (X).

Definition
A regression model is a mathematical equation that describes the relationship
between two or more variables. A simple regression model includes only two
variables: one independent and one dependent. The dependent variable is the
one being explained, and the independent variable is the one used to explain the
variation in the dependent variable.

A (simple) regression model that gives a straight-line relationship between two
variables is called a linear regression model.

Regression Lines or Regression Equations


In the regression model y = A + Bx + ε, A is called the y-intercept or constant
term, B is the slope, and ε is the random error term. The dependent and
independent variables are y and x, respectively.
In the model ŷ = a + bx, a and b, which are calculated using sample data, are
called the estimates of A and B.

Figure_R1 Relationship between food expenditure and income: (a) linear
relationship; (b) nonlinear relationship. (Vertical axis: food expenditure;
horizontal axis: income.)

Figure_R2 Plotting a linear equation: the line y = 50 + 5x, passing through the
points (x = 0, y = 50) and (x = 10, y = 100).

The Least Squares Line

For the least squares regression line ŷ = a + bx,

\[
b = \frac{SS_{xy}}{SS_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}
\]

where

\[
SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}
\qquad \text{and} \qquad
SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}
\]

and SS stands for "sum of squares". The least squares regression line
ŷ = a + bx is also called the regression of y on x.
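The notes state the formulas for a and b without derivation. As a sketch of where they come from, and assuming the usual least squares criterion (choose a and b to minimise the sum of squared vertical distances between the observed y values and the line), the standard argument runs as follows.

```latex
\[
\text{SSE}(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2
\]
% Set both partial derivatives to zero (the normal equations):
\[
\frac{\partial\,\text{SSE}}{\partial a} = -2 \sum (y_i - a - b x_i) = 0
\;\Longrightarrow\; a = \bar{y} - b\bar{x}
\]
\[
\frac{\partial\,\text{SSE}}{\partial b} = -2 \sum x_i (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum xy - a \sum x - b \sum x^2 = 0
\]
% Substitute a = \bar{y} - b\bar{x}, with \bar{x} = \sum x / n and \bar{y} = \sum y / n:
\[
\sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}
= b \left( \sum x^2 - \frac{\left(\sum x\right)^2}{n} \right)
\;\Longrightarrow\; b = \frac{SS_{xy}}{SS_{xx}}
\]
```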

Example
Find the least squares regression line for the data on incomes and food
expenditures of the seven households given in Table_R1. Use income as the
independent variable and food expenditure as the dependent variable.

Table_R1 Incomes (in hundreds of dollars) and Food Expenditures of Seven
Households

Income    Food Expenditure
35        9
49        15
21        7
39        11
15        5
28        8
25        9
Solution

Table_R2

Income (x)    Food Expenditure (y)    xy      x²
35            9                       315     1225
49            15                      735     2401
21            7                       147     441
39            11                      429     1521
15            5                       75      225
28            8                       224     784
25            9                       225     625
Σx = 212      Σy = 64                 Σxy = 2150    Σx² = 7222

\[
\sum x = 212, \qquad \sum y = 64
\]

\[
\bar{x} = \sum x / n = 212 / 7 = 30.2857, \qquad
\bar{y} = \sum y / n = 64 / 7 = 9.1429
\]

\[
SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}
= 2150 - \frac{(212)(64)}{7} = 211.7143
\]

\[
SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}
= 7222 - \frac{(212)^2}{7} = 801.4286
\]

\[
b = \frac{SS_{xy}}{SS_{xx}} = \frac{211.7143}{801.4286} = .2642
\]

\[
a = \bar{y} - b\bar{x} = 9.1429 - (.2642)(30.2857) = 1.1414
\]

Thus,
ŷ = 1.1414 + .2642x

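The same calculation can be sketched in Python using the formulas b = SS_xy / SS_xx and a = ȳ - b·x̄ on the Table_R1 data. The function name `least_squares_line` and the prediction at an income of 35 are assumptions added for illustration only.

```python
# Incomes (hundreds of dollars) and food expenditures from Table_R1.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]


def least_squares_line(x, y):
    """Return (a, b) for y-hat = a + b*x, with b = SS_xy / SS_xx."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    if ss_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b = ss_xy / ss_xx
    a = sum_y / n - b * sum_x / n  # a = y-bar - b * x-bar
    return a, b


a, b = least_squares_line(income, food)
print(f"y-hat = {a:.4f} + {b:.4f}x")

# Illustrative prediction: estimated food expenditure at an income of 35.
predicted = a + b * 35
print(f"predicted food expenditure at x = 35: {predicted:.2f}")
```

At full precision the intercept comes out near 1.142 rather than 1.1414, because the hand calculation rounds b to .2642 before computing a. The slope agrees to four decimal places.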
Problems
1. The Bradford Electric Illuminating Company is studying the relationship
between kilowatt-hours (thousands) and number of rooms in a private
single-family residence. A random sample of 10 homes yielded the
following.
Number of Rooms 12 9 14 6 10 8 10 10 5 7
Kilowatt-hours(thous) 9 7 10 5 8 6 8 10 4 7

(i) Determine the regression equations.
(ii) Determine the number of kilowatt-hours, in thousands, for a six-room
house.
2. Mr. James McWhinney, president of Daniel-James Financial Services,
believes there is a relationship between the number of client contacts and
the dollar amount of sales. To document this assertion, Mr. McWhinney
gathered the following sample information. The X column indicates the
number of client contacts last month, and the Y column shows the value
of sales ($ thousands) last month for each client sampled.
Number of Contacts 14 12 20 16 46 23 48 50 55 50
Sales ($ thousands) 24 14 28 30 80 30 90 85 120 110

(i) Determine the regression equation.
(ii) Determine the estimated sales if 40 contacts are made.

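The formula for the correlation coefficient can also be written as

\[
r = \frac{n \sum XY - \sum X \sum Y}
{\sqrt{\left[ n \sum X^2 - \left(\sum X\right)^2 \right]\left[ n \sum Y^2 - \left(\sum Y\right)^2 \right]}}
\]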
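This computational form agrees with formula (i) given earlier. A short check, multiplying the numerator and denominator of (i) by n and using the definitions of S.P. and S.S. from that formula:

```latex
\[
n\,\mathrm{S.P.}(x, y) = n \sum xy - \sum x \sum y, \qquad
n\,\mathrm{S.S.}(x) = n \sum x^2 - \left(\sum x\right)^2, \qquad
n\,\mathrm{S.S.}(y) = n \sum y^2 - \left(\sum y\right)^2
\]
\[
r = \frac{\mathrm{S.P.}(x, y)}{\sqrt{\mathrm{S.S.}(x)\,\mathrm{S.S.}(y)}}
  = \frac{n\,\mathrm{S.P.}(x, y)}{\sqrt{n\,\mathrm{S.S.}(x) \cdot n\,\mathrm{S.S.}(y)}}
  = \frac{n \sum xy - \sum x \sum y}
         {\sqrt{\left[ n \sum x^2 - \left(\sum x\right)^2 \right]
                \left[ n \sum y^2 - \left(\sum y\right)^2 \right]}}
\]
```

Both forms therefore give the same value of r; the second avoids dividing by n in each intermediate sum.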
