STAT3010: Lecture 11

CORRELATION AND REGRESSION


Correlation Analysis (Section 10.1, Page 466)
The goal of correlation analysis is to understand the nature and
strength of the relationship between x and y (bivariate data).
We must first understand the relationship between the 2 variables
by viewing the scatter plot of (x, y).
The following scatter plots display different types of relationships
between the x and y values:

To make precise statements about a data set, we must go
beyond just a scatter plot. For example, we know that
scatter plot (b) above shows a direct (positive) linear
relationship between x and y, but the question is, how positive is
it? This is where the population correlation coefficient comes
in: it tells us not only the nature of the correlation between
x and y but also its strength.

The population correlation coefficient, denoted by $\rho$ (rho), will
only take on values in the range ______________.
The sign of the correlation coefficient indicates the nature of
the relationship between x and y (direct or inverse), and the
magnitude of the correlation coefficient indicates the
strength of the linear association between the 2 variables.
Recall from STAT 2010/2020:

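THE SAMPLE CORRELATION COEFFICIENT r

Definition: The sample correlation coefficient is given by

\[ r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} \]

Where Var(x) and Var(y) are the sample variances of x and
y, respectively. Recall:

\[ \mathrm{Var}(x) = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{and} \quad \mathrm{Var}(y) = \frac{\sum (y_i - \bar{y})^2}{n - 1} \]

And Cov(x,y) is the covariance of x and y defined by:

\[ \mathrm{Cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

Computing formulas for the three summation quantities are

\[ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \]
\[ S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \]
\[ S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} \]
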
Standard deviation and variance only operate on 1 dimension,
so that you could only calculate the standard deviation for
each dimension of the data set independently of the other
dimensions. However, it is useful to have a similar measure to
find out how much the dimensions vary from the mean with
respect to each other.
Covariance is such a measure. Covariance is always
measured between 2 dimensions. If you calculate the
covariance between one dimension and itself, you get the
variance. So, if you had a 3-dimensional data set (x, y, z), then
you could measure the covariance between the x and y
dimensions, the y and z dimensions, and the x and z dimensions.
Measuring the covariance between x and x, or y and y, or z
and z would give you the variance of the x, y and z dimensions
respectively.
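
The claim that the covariance of a dimension with itself is its variance can be checked directly from the definitions. A short sketch, assuming the usual sample formulas with divisor n - 1:

```latex
\[
\mathrm{Cov}(x, x)
  = \frac{\sum (x_i - \bar{x})(x_i - \bar{x})}{n - 1}
  = \frac{\sum (x_i - \bar{x})^2}{n - 1}
  = \mathrm{Var}(x)
\]
```

For a 3-dimensional data set (x, y, z) this leaves three distinct covariances, Cov(x, y), Cov(y, z) and Cov(x, z), because Cov(x, y) = Cov(y, x).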

Example 10.1: Correlation Between Body Mass Index and
Systolic Blood Pressure

Consider the relationship between body
mass index and systolic blood pressure in males 50 years old. A
random sample of 10 males 50 years of age is selected and
their body mass index scores and systolic blood pressures are
recorded in the following table:

X = Body Mass Index    Y = Systolic Blood Pressure
       18.4                      120
       20.1                      110
       22.4                      120
       25.9                      135
       26.5                      140
       28.9                      115
       30.1                      150
       32.9                      165
       33.0                      160
       34.7                      180

Let's first view the scatter diagram:

[Scatter diagram: systolic blood pressure (vertical axis) against body mass index (horizontal axis)]

Calculate the sample correlation coefficient and explain:
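
One way to check a hand calculation is the following minimal Python sketch (Python is an assumption here; the course itself uses SAS). It applies the computing formulas for S_xx, S_yy and S_xy to the ten observations in the table; the function names are hypothetical.

```python
import math

# Data from the Example 10.1 table
bmi = [18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7]
sbp = [120, 110, 120, 135, 140, 115, 150, 165, 160, 180]


def corrected_sums(x, y):
    """Return (S_xx, S_yy, S_xy) using the computing formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


def sample_correlation(x, y):
    """r = S_xy / sqrt(S_xx * S_yy); the (n - 1) divisors cancel."""
    s_xx, s_yy, s_xy = corrected_sums(x, y)
    return s_xy / math.sqrt(s_xx * s_yy)


s_xx, s_yy, s_xy = corrected_sums(bmi, sbp)
r = sample_correlation(bmi, sbp)
print(f"S_xx = {s_xx:.3f}, S_yy = {s_yy:.3f}, S_xy = {s_xy:.3f}")
print(f"r = {r:.5f}")
```

The result should agree with the Pearson correlation of 0.85992 that SAS reports for these data: a strong, direct linear association between body mass index and systolic blood pressure.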

Now that was sample correlation, what about population
correlation? The sample correlation coefficient, r, is a point
estimate for the population correlation coefficient, $\rho$. Tests of
hypothesis concerning $\rho$ address whether there is a linear
association in the population.
To test the null hypothesis of NO linear relationship ($\rho = 0$) we
use:

\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} \]

with df = n-2 (using table B.3 to get a critical value)


Example 10.1.2: Statistical Inference Concerning $\rho$
Hypothesis:
Test Statistic:

Decision:

Conclusion:
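
As a rough numerical check on the four steps above, the sketch below evaluates the standard test statistic t = r sqrt((n - 2) / (1 - r^2)) with the value r = 0.85992 that SAS reports for these data. The significance level (alpha = 0.05, two-sided) and the corresponding critical value 2.306 for 8 degrees of freedom are assumptions made for illustration; the handout does not fix a level. The function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r, n = 0.85992, 10  # r as reported by PROC CORR, n = sample size
t_stat = correlation_t_statistic(r, n)

# Assumption: alpha = 0.05, two-sided test; 2.306 is the t critical value for 8 df.
T_CRIT_8DF = 2.306
reject_h0 = abs(t_stat) > T_CRIT_8DF

print(f"t = {t_stat:.3f} on {n - 2} df; reject H0 (rho = 0): {reject_h0}")
```

Under these assumptions the statistic is well beyond the critical value, which is consistent with the p-value of 0.0014 in the SAS output. Small differences in the last digit can arise because r is entered here already rounded to five decimals.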

SAS CODE:
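options ps=62 ls=80;   /* page size 62 lines, line size 80 columns */
data correlation;
  input bmi sbp;       /* body mass index, systolic blood pressure */
  cards;
18.4 120
20.1 110
22.4 120
25.9 135
26.5 140
28.9 115
30.1 150
32.9 165
33.0 160
34.7 180
;
run;
/* Scatter plot of sbp (vertical axis) against bmi (horizontal axis) */
proc plot;
  plot sbp*bmi;
run;
/* Pearson correlation; the COV option also prints the covariance matrix */
proc corr cov;
  var bmi sbp;
run;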

SAS OUTPUT:
The SAS System

Plot of sbp*bmi.   Legend: A = 1 obs, B = 2 obs, etc.

[Line-printer scatter plot of sbp (vertical axis) against bmi (horizontal axis, 17.5 to 35.0)]

The SAS System
The CORR Procedure

2 Variables:    bmi    sbp

Covariance Matrix, DF = 9

                  bmi            sbp
bmi        31.8521111    115.2166667
sbp       115.2166667    563.6111111

Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
bmi         10    27.29000     5.64377   272.90000    18.40000    34.70000
sbp         10   139.50000    23.74050        1395   110.00000   180.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0

              bmi        sbp
bmi       1.00000    0.85992
                      0.0014
sbp       0.85992    1.00000
           0.0014


Simple Linear Regression (Section 10.2, Page 477)


Regression analysis is used to develop the mathematical
equation that best describes the relationship between two
variables, x and y. In correlation analysis, it is not necessary to
specify which of the two variables is the independent one and
which is the dependent one. In regression analysis, they must
be specified.

Remember our correlation plot from example 10.1 (above):

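We now want to find the equation of the line that best fits these data.
This equation of the line relating y to x is called the simple linear
regression equation and is given by:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

Where Y is the dependent variable
X is the independent variable
$\beta_0$ is the Y-intercept (the value of Y, when X=0)
$\beta_1$ is the slope (the expected change in Y relative to a one
unit change in X)
$\varepsilon$ is the error
The parameters $\beta_0$ and $\beta_1$ in the least squares regression line
are estimated in such a way that the sum of squared errors is as small as possible:

\[ \sum (y_i - \beta_0 - \beta_1 x_i)^2 \ \text{is a minimum} \]

Let estimates of $\beta_0$ and $\beta_1$ be respectively denoted by
$\hat{\beta}_0$ and $\hat{\beta}_1$. These estimators are the solutions of the following
equations:

\[ \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

and

\[ \sum x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]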
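
The algebra that leads from the two estimating equations to closed-form estimates can be sketched as follows. This is a standard derivation, assuming the equations are the usual least squares normal equations $\sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$ and $\sum x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$, with $S_{xx}$ and $S_{xy}$ denoting the corrected sums of squares and cross-products.

```latex
\begin{align*}
\sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
  &\;\Longrightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\[4pt]
\sum x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
  &\;\Longrightarrow\; \sum x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) \sum x_i - \hat{\beta}_1 \sum x_i^2 = 0 \\[4pt]
  &\;\Longrightarrow\; \sum x_i y_i - n \bar{x} \bar{y}
       = \hat{\beta}_1 \left( \sum x_i^2 - n \bar{x}^2 \right) \\[4pt]
  &\;\Longrightarrow\; \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
\end{align*}
```

The first line uses $\sum x_i = n\bar{x}$ and $\sum y_i = n\bar{y}$; the last uses $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$ and $S_{xx} = \sum x_i^2 - n\bar{x}^2$.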

We have now obtained:

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

These estimates are called the least squares estimates of the
slope and intercept. The estimate of the simple linear
regression equation is given by substituting the least squares
estimates in the simple linear regression equation:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

Where $\hat{Y}$ is the expected value of Y for a given value of X.

Back to Example 10.1:
Find the least squares regression equation for the
data given in Example 10.1.
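
A hand calculation here can be checked with the minimal Python sketch below (Python is an assumption; the course uses SAS). It uses the standard least squares solutions, slope = S_xy / S_xx and intercept = y-bar minus slope times x-bar, on the Example 10.1 data; the function name is hypothetical.

```python
# Data from the Example 10.1 table
bmi = [18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7]
sbp = [120, 110, 120, 135, 140, 115, 150, 165, 160, 180]


def least_squares_fit(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares_fit(bmi, sbp)
print(f"sbp_hat = {b0:.3f} + {b1:.3f} * bmi")
```

The estimates should match the SAS parameter estimates for these data (intercept 40.78558, slope 3.61724): each additional unit of body mass index is associated with an expected increase of about 3.6 units in systolic blood pressure.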

To compute the regression estimates ($\hat{\beta}_0$ and $\hat{\beta}_1$) within SAS,
place the following code after the SAS code introduced above:
proc reg;
model sbp=bmi;
run;

The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: sbp

Number of Observations Read    10
Number of Observations Used    10

Analysis of Variance

                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               1  3750.89494    3750.89494      22.71    0.0014
Error               8  1321.60506     165.20063
Corrected Total     9  5072.50000

Root MSE           12.85304    R-Square    0.7395
Dependent Mean    139.50000    Adj R-Sq    0.7069
Coeff Var           9.21365

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|
Intercept     1     40.78558     21.11158       1.93      0.0895
bmi           1      3.61724      0.75913       4.76      0.0014
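
The entries of the Analysis of Variance table can be traced back to the summation quantities S_xx, S_yy and S_xy. The sketch below is a Python cross-check under the usual simple linear regression identities (model sum of squares = slope times S_xy, total sum of squares = S_yy); it is not how PROC REG computes its output internally, and the variable names are hypothetical.

```python
# Data from the Example 10.1 table
bmi = [18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7]
sbp = [120, 110, 120, 135, 140, 115, 150, 165, 160, 180]

n = len(bmi)
x_bar, y_bar = sum(bmi) / n, sum(sbp) / n
s_xx = sum((x - x_bar) ** 2 for x in bmi)
s_yy = sum((y - y_bar) ** 2 for y in sbp)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bmi, sbp))

slope = s_xy / s_xx
ss_total = s_yy                   # "Corrected Total" sum of squares
ss_model = slope * s_xy           # variation explained by the line
ss_error = ss_total - ss_model    # residual variation
ms_error = ss_error / (n - 2)     # error mean square on n - 2 df
r_square = ss_model / ss_total    # equals r squared in simple regression
f_value = ss_model / ms_error     # equals the square of the slope's t value

print(f"SS model = {ss_model:.5f}, SS error = {ss_error:.5f}")
print(f"R-Square = {r_square:.4f}, F = {f_value:.2f}, Root MSE = {ms_error ** 0.5:.5f}")
```

Two links to the correlation analysis are worth noting: R-Square is the square of the sample correlation (0.85992 squared is about 0.7395), and the p-value for the slope, 0.0014, is the same as the p-value for the test of $\rho = 0$.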
