Lec 11

STAT3010: Lecture 11
CORRELATION AND REGRESSION

Correlation Analysis (Section 10.1, Page 466)
The goal of correlation analysis is to understand the nature and
strength of the relationship between x and y (bivariate data).
We must first understand the relationship between the 2 variables
by viewing a scatter plot of (x, y).
The following scatter plots display different types of relationships
between the x and y values:
To make precise statements about a data set, we must go
beyond just a scatter plot. For example, we know that
scatter plot (b) above shows a direct (positive) linear
relationship between x and y, but the question is, how positive is
it? This is where the population correlation coefficient comes
in. It tells us not only the nature of the correlation between
x and y but also its strength, i.e., how strong the linear
association is.
The population correlation coefficient, denoted by ρ (rho), will
only take on the values in the range of ______________.
The sign of the correlation coefficient indicates the nature of
the relationship between x and y, and the magnitude of the
correlation coefficient indicates the strength of the linear
association between the 2 variables.
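Recall from STAT 2010/2020:
THE SAMPLE CORRELATION COEFFICIENT r

Definition: The sample correlation coefficient r is given by
$$r = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
where Var(x) and Var(y) are the sample variances of x and
y, respectively. Recall:
$$\mathrm{Var}(x) = \frac{\sum (x_i - \bar{x})^2}{n-1} \quad \text{and} \quad \mathrm{Var}(y) = \frac{\sum (y_i - \bar{y})^2}{n-1}$$
and Cov(x,y) is the covariance of x and y, defined by:
$$\mathrm{Cov}(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
Computing formulas for the three summation quantities are
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$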
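As a short worked step (a sketch that follows directly from the definitions of the sample variances and the covariance), each of Var(x), Var(y) and Cov(x,y) is a sum of squares or cross-products divided by n-1, so the n-1 factors cancel in r and the coefficient can be written in terms of the summation quantities alone:

```latex
r = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}
  = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})/(n-1)}
         {\sqrt{\dfrac{\sum (x_i-\bar{x})^2}{n-1}\cdot\dfrac{\sum (y_i-\bar{y})^2}{n-1}}}
  = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum (x_i-\bar{x})^2 \, \sum (y_i-\bar{y})^2}}
```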
Standard deviation and variance only operate on 1 dimension,
so you can only calculate the standard deviation for
each dimension of the data set independently of the other
dimensions. However, it is useful to have a similar measure of
how much the dimensions vary from their means with
respect to each other.
Covariance is such a measure. Covariance is always
measured between 2 dimensions. If you calculate the
covariance between one dimension and itself, you get the
variance. So, if you had a 3-dimensional data set (x, y, z), then
you could measure the covariance between the x and y
dimensions, the y and z dimensions, and the x and z dimensions.
Measuring the covariance between x and x, or y and y, or z
and z would give you the variance of the x, y and z dimensions,
respectively.
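The point about covariance and variance can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the three-dimensional data values and the helper name `sample_cov` are hypothetical and chosen only for illustration. It builds every pairwise covariance and confirms that the diagonal entries are the sample variances.

```python
from statistics import mean, variance


def sample_cov(a, b):
    """Sample covariance of two equal-length lists, using the n - 1 divisor."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


# Hypothetical three-dimensional data set (illustrative values only).
data = {
    "x": [1.0, 2.0, 4.0, 7.0],
    "y": [2.0, 3.0, 7.0, 8.0],
    "z": [9.0, 6.0, 5.0, 1.0],
}

# Covariance between every pair of dimensions, including each with itself.
cov_matrix = {(p, q): sample_cov(data[p], data[q]) for p in data for q in data}

for p in data:
    print(p, [round(cov_matrix[(p, q)], 4) for q in data])
```

The matrix is symmetric, since Cov(x,y) = Cov(y,x), so a 3-dimensional data set has only three distinct off-diagonal covariances.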
Example 10.1: Correlation Between Body Mass Index and
Systolic Blood Pressure
Consider the relationship between body
mass index and systolic blood pressure in males 50 years old. A
random sample of 10 males 50 years of age is selected and
their body mass index scores and systolic blood pressures are
recorded in the following table:
X = Body Mass Index    Y = Systolic Blood Pressure
18.4                   120
20.1                   110
22.4                   120
25.9                   135
26.5                   140
28.9                   115
30.1                   150
32.9                   165
33.0                   160
34.7                   180
First, view the scatter diagram:

[Scatter diagram: systolic blood pressure (vertical axis) against
body mass index (horizontal axis).]
Calculate the sample correlation coefficient and explain:
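One way to carry out this calculation is sketched below in Python (the notes themselves use SAS; Python is used here only as an independent cross-check). The variable names are illustrative assumptions. The sketch uses the shortcut sums of squares and cross-products, divides by n - 1 to get the sample variances and covariance, and then forms r.

```python
from math import sqrt

# Data from Example 10.1.
bmi = [18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7]
sbp = [120, 110, 120, 135, 140, 115, 150, 165, 160, 180]
n = len(bmi)

# Shortcut (computing) formulas for the sums of squares and cross-products.
s_xx = sum(x * x for x in bmi) - sum(bmi) ** 2 / n
s_yy = sum(y * y for y in sbp) - sum(sbp) ** 2 / n
s_xy = sum(x * y for x, y in zip(bmi, sbp)) - sum(bmi) * sum(sbp) / n

# Sample variances and covariance use the n - 1 divisor.
var_x = s_xx / (n - 1)
var_y = s_yy / (n - 1)
cov_xy = s_xy / (n - 1)

r = cov_xy / sqrt(var_x * var_y)
print(round(var_x, 4), round(var_y, 4), round(cov_xy, 4), round(r, 4))
```

A value of r close to +1 indicates a strong direct (positive) linear association between body mass index and systolic blood pressure in this sample.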
Now that was sample correlation; what about population
correlation? The sample correlation coefficient, r, is a point
estimate for the population correlation coefficient, ρ. Tests of
hypothesis concerning ρ address whether there is a linear
association in the population.
To test the null hypothesis of NO linear relationship (H0: ρ = 0) we
use:
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
with df = n-2 (using table B.3 to get a critical value).

Example 10.1.2: Statistical Inference Concerning ρ
Hypothesis:
Test Statistic:
Decision:
Conclusion:
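The four steps can be sketched numerically as follows. This is a hedged illustration rather than the worked solution from the lecture: it assumes a two-sided test at the 5% level (the notes do not state a significance level), takes r = 0.85992 from the PROC CORR output, and uses the standard tabulated critical value 2.306 for a t distribution with 8 degrees of freedom. Variable names are illustrative.

```python
from math import sqrt

r = 0.85992      # sample correlation for Example 10.1 (from PROC CORR)
n = 10           # sample size
df = n - 2       # degrees of freedom for the test of H0: rho = 0

# Test statistic: t = r * sqrt((n - 2) / (1 - r^2)).
t_stat = r * sqrt((n - 2) / (1 - r ** 2))

# Assumption: two-sided test at alpha = 0.05, so the critical value is
# t(0.975, 8) = 2.306 from a standard t table.
t_crit = 2.306
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions the statistic exceeds the critical value, so H0: ρ = 0 is rejected, in agreement with the p-value of 0.0014 that SAS reports for this correlation.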
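SAS CODE:
options ps=62 ls=80;   /* page size and line size of the listing */
data correlation;
   input bmi sbp;      /* x = body mass index, y = systolic blood pressure */
   cards;
18.4 120
20.1 110
22.4 120
25.9 135
26.5 140
28.9 115
30.1 150
32.9 165
33.0 160
34.7 180
;
run;
proc plot;             /* scatter plot of sbp (vertical) against bmi */
   plot sbp*bmi;
run;
proc corr cov;         /* Pearson correlations plus the covariance matrix */
   var bmi sbp;
run;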
SAS OUTPUT:
The SAS System

Plot of sbp*bmi.   Legend: A = 1 obs, B = 2 obs, etc.

[Line-printer scatter plot of sbp (vertical axis) against bmi
(horizontal axis, 17.5 to 35.0).]
The SAS System
The CORR Procedure

2 Variables:   bmi   sbp

Covariance Matrix, DF = 9
             bmi            sbp
bmi     31.8521111    115.2166667
sbp    115.2166667    563.6111111

Simple Statistics
Variable    N        Mean     Std Dev         Sum      Minimum      Maximum
bmi        10    27.29000     5.64377   272.90000     18.40000     34.70000
sbp        10   139.50000    23.74050        1395    110.00000    180.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0
            bmi        sbp
bmi     1.00000    0.85992
                    0.0014
sbp     0.85992    1.00000
         0.0014
Simple Linear Regression (Section 10.2, Page 477)

Regression analysis is used to develop the mathematical
equation that best describes the relationship between two
variables, x and y. In correlation analysis, it is not necessary to
specify which of the two variables is the independent one and
which is the dependent one. In regression analysis, they must
be specified.
Remember our correlation plot from example 10.1 (above):
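We now want to find the equation of the line that best fits these data.
The equation of the line relating y to x is called the simple linear
regression equation and is given by:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where Y is the dependent variable,
X is the independent variable,
$\beta_0$ is the Y-intercept (the value of Y when X = 0),
$\beta_1$ is the slope (the expected change in Y relative to a one-
unit change in X),
$\varepsilon$ is the error.
The parameters $\beta_0$ and $\beta_1$ in the least squares regression line
are estimated in such a way that the sum of squared errors
$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
is as small as possible.
Let the estimates of $\beta_0$ and $\beta_1$ be denoted by $\hat{\beta}_0$ and
$\hat{\beta}_1$, respectively. These estimators are the solutions of the following
equations:
$$\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} X_i\,(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$
We have now obtained:
$$\hat{\beta}_1 = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
These estimates are called the least squares estimates of the
slope and intercept. The estimate of the simple linear
regression equation is given by substituting the least squares
estimates into the simple linear regression equation:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$
where $\hat{Y}$ is the expected value of Y for a given value of X.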
Back to Example 10.1:

Compute the least squares regression equation for the
data given in Example 10.1.
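A minimal Python sketch of this computation follows (Python is used only as a cross-check of the hand calculation; the names `b0`, `b1` and `predict` are illustrative assumptions, not notation from the notes). The slope is the ratio of the sum of cross-products to the sum of squares of x, which equals Cov(x,y)/Var(x), and the intercept follows from the two sample means.

```python
# Data from Example 10.1.
bmi = [18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7]
sbp = [120, 110, 120, 135, 140, 115, 150, 165, 160, 180]
n = len(bmi)

x_bar = sum(bmi) / n
y_bar = sum(sbp) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in bmi)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bmi, sbp))

b1 = s_xy / s_xx           # least squares slope
b0 = y_bar - b1 * x_bar    # least squares intercept


def predict(x):
    """Estimated mean systolic blood pressure at body mass index x."""
    return b0 + b1 * x


print(f"estimated line: sbp = {b0:.3f} + {b1:.3f} * bmi")
```

Because the intercept is defined as the mean of Y minus the slope times the mean of X, the fitted line always passes through the point of means, so `predict(x_bar)` returns `y_bar` up to rounding error.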
To compute the regression estimates ($\hat{\beta}_0$ and $\hat{\beta}_1$) within SAS,
place the following code after the SAS code introduced above:
proc reg;
model sbp=bmi;
run;
The SAS System
The CORR Procedure

2 Variables:   bmi   sbp

Covariance Matrix, DF = 9
             bmi            sbp
bmi     31.8521111    115.2166667
sbp    115.2166667    563.6111111

Simple Statistics
Variable    N        Mean     Std Dev         Sum      Minimum      Maximum
bmi        10    27.29000     5.64377   272.90000     18.40000     34.70000
sbp        10   139.50000    23.74050        1395    110.00000    180.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0
            bmi        sbp
bmi     1.00000    0.85992
                    0.0014
sbp     0.85992    1.00000
         0.0014
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: sbp

Number of Observations Read    10
Number of Observations Used    10

Analysis of Variance
                          Sum of          Mean
Source             DF    Squares        Square    F Value    Pr > F
Model               1    3750.89494    3750.89494    22.71    0.0014
Error               8    1321.60506     165.20063
Corrected Total     9    5072.50000

Root MSE           12.85304    R-Square    0.7395
Dependent Mean    139.50000    Adj R-Sq    0.7069
Coeff Var           9.21365

Parameter Estimates
                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|
Intercept     1     40.78558     21.11158       1.93      0.0895
bmi           1      3.61724      0.75913       4.76      0.0014
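The summary figures in the PROC REG listing are tied together by a few standard identities, which the following Python sketch checks using the sums of squares and estimates reported above. The variable names are illustrative assumptions; the identities themselves (R-Square = SS Model / SS Total, F = MS Model / MS Error, Root MSE = square root of MS Error, and, in simple linear regression, F equal to the square of the slope's t value and R-Square equal to r squared) are standard results.

```python
from math import sqrt

# Values copied from the PROC REG and PROC CORR listings.
ss_model, df_model = 3750.89494, 1
ss_error, df_error = 1321.60506, 8
ss_total, df_total = 5072.50000, 9
slope, slope_se = 3.61724, 0.75913
r = 0.85992

ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_square = ss_model / ss_total                      # share of variation explained
adj_r_square = 1 - ms_error / (ss_total / df_total)
f_value = ms_model / ms_error
root_mse = sqrt(ms_error)
t_slope = slope / slope_se                          # t value for the bmi slope

print(round(r_square, 4), round(adj_r_square, 4), round(root_mse, 3))
print(round(f_value, 1), round(t_slope ** 2, 1), round(r ** 2, 3))
```

This is why the p-value for the slope, the p-value for the overall F test and the p-value for the correlation all appear as 0.0014: with one predictor they test the same hypothesis.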
