Linear Regression
Lesson 15 Outline
Review correlation analysis
Dependent and Independent variables
Least Squares Regression line
Calculating the slope
Calculating the Intercept
Residuals and Residual Plots
Identifying significant relationship: t-test of the slope
R²: coefficient of determination
Using the regression line for Prediction of Y from X
Relationship between correlation coefficient and linear regression
1. Plot the data using a scatter plot to get a
visual idea of the relationship
2. Calculate the correlation coefficient
1. Use Pearson's correlation coefficient if both
variables are continuous
2. Use Spearman rank correlation coefficient if
both variables are ordinal or one is ordinal
and the other continuous.
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$
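As a worked check of the formula for r, the following Python sketch computes Pearson's correlation coefficient directly from the sums of squares and cross-products. The eight (body weight, plasma volume) pairs are an assumption: they are not listed in this lesson text, and were chosen because they reproduce the summary values reported later in the lesson (means 66.875 and 3.0025, Multiple R 0.759). The function name `pearson_r` is illustrative only.

```python
import math

# Assumed data (not listed in the lesson text): body weight (kg), plasma volume (L)
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def pearson_r(x, y):
    """Sum of cross-products divided by the root of the product of the sums of squares."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(weight, plasma)
print(round(r, 4))  # 0.7591 for the assumed data
```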
When mid-parents are taller than mediocrity, their children tend to be shorter than they, and when mid-parents are shorter than mediocrity, their children tend to be taller than they
[Figures: scatter plots of Plasma Volume (liters) against Body Weight (kg)]
$\beta_0$ is the y-intercept of the line
$\beta_1$ is the slope of the regression line
$\varepsilon$ is the error term - the difference between the observed Y and the regression line
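Estimated regression line: $\hat{Y} = a + bX$
[Figure: the line $\hat{Y} = a + bX$; a is the intercept (the value of the line at x = 0) and b is the slope (the change in $\hat{Y}$ for a one-unit change in X)]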
PubH 6414 Lesson 15 23
Interpretation of predicted values of Y
The predicted value of y is the expected y-value
Since not all observed data points are exactly on the regression line, there is a range of possible y-values (a distribution) for each x-value. In regression analysis the distribution of y-values for each x-value is assumed to be a normal distribution.
The predicted values of y represent the mean values of the
distributions of y for each specified value of x.
The following slide illustrates this for 3 values of X: notice that the mean of each distribution is on the regression line (the predicted value of y) and that the distributions of y-values are normal distributions.
[Figure: regression lines with slope b = 0 and slope b < 0]
Calculating the Slope of the Regression Line
The formula to calculate the slope of the least
squares regression line is given below
$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$
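The slope formula can be sketched in a few lines of Python. The data values are an assumption: the eight (body weight, plasma volume) pairs are not listed in this lesson text and were chosen because they reproduce the means and slope reported in the lesson (66.875, 3.0025 and 0.043615). The function name `least_squares_slope` is illustrative.

```python
# Assumed data (not listed in the lesson text): body weight (kg), plasma volume (L)
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def least_squares_slope(x, y):
    """Sum of cross-products of deviations divided by the sum of squared x deviations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator

b = least_squares_slope(weight, plasma)
print(round(b, 6))  # 0.043615 for the assumed data
```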
$$ a = \bar{Y} - b\bar{X} $$
$\bar{X} = 66.875$
$\bar{Y} = 3.0025$
$b = 0.043615$
$$ a = 3.0025 - 0.043615 \times 66.875 = 0.0857 $$
The intercept is the estimated expected value of Y when
X = 0. Intercepts do not always have realistic interpretations.
In this example, plasma volume is predicted to be 0.0857 liters when body weight = 0 kg, which is not a possibility.
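The intercept arithmetic can be checked with a short Python sketch that uses only the summary values quoted in this lesson. The variable names are illustrative. The final check shows why the formula has this form: the fitted line always passes through the point of means.

```python
x_bar = 66.875   # mean body weight (kg), from the lesson
y_bar = 3.0025   # mean plasma volume (L), from the lesson
b = 0.043615     # slope, from the lesson

# Intercept: mean of Y minus slope times mean of X
a = y_bar - b * x_bar
print(round(a, 4))  # 0.0857

# The fitted line passes through (x_bar, y_bar)
print(abs((a + b * x_bar) - y_bar) < 1e-12)  # True
```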
Regression Line Equation
Once the slope and the intercept have been calculated, the regression equation can be constructed:
$$ \hat{Y} = a + bX $$
$$ \hat{Y} = 0.0857 + 0.0436X $$
This is the equation that will be used to predict plasma volume (L) from body weight (kg).
The regression equation calculated from sample data is
an estimate of the true population regression equation.
Which point is closest to the regression line? (74, 3.37) has the smallest residual
Which point is furthest from the regression line? (70.5, 3.49) has the largest residual
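A residual is the observed Y minus the value predicted by the regression line. The Python sketch below finds the closest and furthest points. The coefficients 0.0857 and 0.0436 are the rounded values from the lesson's regression equation. The eight data pairs are an assumption: only the points (74, 3.37) and (70.5, 3.49) are quoted in the lesson, and the remaining values were chosen to match the reported means and slope.

```python
# Assumed data (only two of these points are quoted in the lesson text)
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

a, b = 0.0857, 0.0436  # rounded intercept and slope from the lesson

# Residual = observed y minus predicted y
residuals = [(x, y, y - (a + b * x)) for x, y in zip(weight, plasma)]

closest = min(residuals, key=lambda row: abs(row[2]))
furthest = max(residuals, key=lambda row: abs(row[2]))

print(closest[:2])   # (74.0, 3.37)
print(furthest[:2])  # (70.5, 3.49)
```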
Regression Line and Residuals
[Figure: Plasma Volume (L) against Body Weight (kg) with the regression line; the largest and smallest residuals are marked]
[Figure: residual plot, Residuals (from -0.4 to 0.4) against body weight (kg)]
No evidence of nonlinearity. The points are equally distributed around the value 0 with no evident positive or negative slope
(X, Y) Scatterplot for a nonlinear (or curvilinear) relationship
[Figure: scatterplot of Y against X showing a curvilinear pattern]
[Figure: residual plot, Residuals against X]
3. Significance level = 0.05

Regression Statistics
Multiple R          0.759126577
R Square            0.576273159
Adjusted R Square   0.505652019
Standard Error      0.218809511
Observations        8

ANOVA
             df   SS            MS            F          Significance F
Regression   1    0.390684388   0.390684388   8.160066   0.028930913
Residual     6    0.287265612   0.047877602
Total        7    0.67795
P-value for t-test = 0.029 so reject the null hypothesis and conclude that there is a significant relationship between weight and plasma volume
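The quantities in the regression output are linked by simple arithmetic, which the following Python sketch reproduces from the ANOVA sums of squares quoted in the lesson. The variable names are illustrative. It relies on the standard fact that, with a single predictor, F equals the square of the t-statistic for the slope.

```python
import math

# Sums of squares and degrees of freedom from the lesson's ANOVA table
ss_regression, df_regression = 0.390684388, 1
ss_residual, df_residual = 0.287265612, 6
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total          # R Square
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual          # F
std_error = math.sqrt(ms_residual)            # Standard Error of the regression

# With one predictor, F is the square of the slope's t-statistic.
# The square root gives |t|; the slope here is positive, so t is positive.
t_slope = math.sqrt(f_stat)

print(round(r_squared, 4), round(f_stat, 4), round(t_slope, 4), round(std_error, 4))
```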
Regression Analysis in Excel
In Excel Module 15 use the Data Analysis Tool to obtain the Regression Analysis results
Select Regression under the Data Analysis Tool.
Enter the plasma volume data for Y-range and the weight data for X-range
Check labels if you highlight the column headers
Also check Residuals and Residual Plot
Identify the t-statistic and the p-value for the t-test of the slope.
Also identify the slope and the intercept on the output table
These are under the Coefficients column
95% confidence intervals for the coefficients are also
provided if the Confidence Level box is checked
T-test of the Intercept
The Data Analysis Tool also provides results of a t-test of the Intercept.
The Null hypothesis of this test is that the intercept = 0: $\beta_0 = 0$
The Alternative hypothesis of this test is that the intercept ≠ 0: $\beta_0 \neq 0$
Usually there is not much interest in the t-test of the intercept because testing whether the intercept = 0 does not provide information about the relationship between the two variables.
From the Regression Table, you can see that the null hypothesis for the intercept = 0 is not rejected because the p-value = 0.936. This result does not affect the significant result of the t-test of the slope.
Linear Regression Procedure
Look at a scatter plot of the data
Plot Y on the y-axis and X on the x-axis
Add the trend line to the plot
Estimate the regression line equation
Find the slope and intercept of the regression line
Is the relationship statistically significant?
Use a t-test of the slope to determine significance
How well does the estimated regression line equation fit the data?
Calculate R² - the coefficient of determination
Use the estimated regression line equation to predict values of the dependent variable (Y) for specified values of the independent variable (X).
[Figure: regression line of plasma volume against Body Weight (kg); the predicted value 2.7 at x = 60 is marked]
The predicted plasma volume for weight = 60 kg is the point on the regression
line corresponding to x = 60. This point is 2.7 liters.
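The prediction for a 60 kg individual is a direct substitution into the regression equation, sketched here in Python with the rounded coefficients from the lesson. The function name `predict_plasma_volume` is illustrative.

```python
a, b = 0.0857, 0.0436  # rounded intercept and slope from the lesson

def predict_plasma_volume(weight_kg):
    """Predicted plasma volume (L) from body weight (kg): a + b * x."""
    return a + b * weight_kg

print(round(predict_plasma_volume(60), 1))  # 2.7
```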
Appropriate Applications of the Regression Line Equation
Predictions using regression line equations are only valid within the range of x-values in the collected data.
For the example data, the range of weight is from 58 to 74 kg.
It would not be appropriate to use this regression line
equation to predict plasma volume for an individual
weighing 100 kg or an individual weighing 25 kg.
There may be a different relationship between weight and
plasma volume beyond the values of the collected data so
the relationship identified by the regression line equation
should not be extrapolated much beyond the range of the X
values.
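One way to guard against accidental extrapolation is to make the prediction function refuse inputs outside the observed weights. The Python sketch below does this with the rounded coefficients and the 58 to 74 kg range from the lesson. Rejecting every value outside the range is a deliberately strict design choice made for this illustration; the lesson only cautions against extrapolating much beyond the range. All names are illustrative.

```python
A, B = 0.0857, 0.0436                    # rounded intercept and slope from the lesson
MIN_WEIGHT_KG, MAX_WEIGHT_KG = 58.0, 74.0  # range of weights in the example data

def predict_within_range(weight_kg):
    """Predict plasma volume (L), refusing weights outside the observed range."""
    if not MIN_WEIGHT_KG <= weight_kg <= MAX_WEIGHT_KG:
        raise ValueError(
            f"{weight_kg} kg is outside the observed range "
            f"{MIN_WEIGHT_KG}-{MAX_WEIGHT_KG} kg; prediction would be an extrapolation"
        )
    return A + B * weight_kg

print(round(predict_within_range(60), 1))  # 2.7

for w in (25, 100):
    try:
        predict_within_range(w)
    except ValueError as err:
        print(err)
```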
Hypothesis Test of population correlation coefficient: $\rho$
We can set up a hypothesis test of independence for the population correlation:
Null Hypothesis: $\rho = 0$
no significant linear association between the variables
Alternative Hypothesis: $\rho \neq 0$
significant linear association between the variables
The test statistic is a t-statistic with n-2 df
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
After finding the t-statistic, you can use EXCEL to find the p-value = TDIST(t, n-2, 2)
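The same calculation can be sketched outside Excel. The Python code below computes t from r and n, then approximates the two-tailed p-value (what TDIST returns when its third argument is 2) by integrating the Student t density with Simpson's rule. It uses r = 0.759126577 and n = 8 from the lesson's regression output. The function names and the choice of numerical integration are assumptions made for this illustration, not the method Excel uses internally.

```python
import math

def t_from_r(r, n):
    """t-statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # P(-|t| <= T <= |t|)
    return 1 - central

r, n = 0.759126577, 8  # Multiple R and Observations from the lesson's output
t_stat = t_from_r(r, n)
p_value = two_tailed_p(t_stat, n - 2)
print(round(t_stat, 5), round(p_value, 5))
```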
T-test of the correlation coefficient
For a given set of sample data, the t-test for $\rho$ and the t-test for the slope, $\beta_1$, will have the same t-statistic and p-value.
For the plasma volume data, the t-statistic for the test of the population correlation coefficient = 2.85658, which is the same as the t-statistic for the slope of the regression line
You can work through the equation in EXCEL to confirm this
P-value = TDIST(2.85658, 6, 2) = 0.02893
The same conclusion is reached from either hypothesis test: there is a significant relationship between the two variables
The p-value < 0.05 so the null hypothesis of independence is rejected at significance level 0.05
Linear Regression and
Correlation: which to use?
Both Linear Regression and Correlation Analysis can be
used to explore the linear relationship between two
continuous (quantitative) random variables
Use Correlation analysis when the interest is primarily in identifying whether a relationship exists.
Use the t-test of the correlation coefficient to determine if the relationship is significant.
Use Regression Analysis to identify a relationship AND to predict the value of one variable given a value of the other variable.
Use the t-test of the slope to determine if the relationship is significant
Regression analysis is most useful when there is an identified interest in predicting one variable from the other(s). If prediction doesn't make sense, use correlation analysis.
Readings and Assignments
Reading
Chapter 8 pgs. 192-194, 202-212
Complete the Lesson 15 Practice Exercises
Lesson 15 Excel Modules
Excel Module 15: Plasma Volume works through the example in this Lesson
Excel Module 15: BMI works through the example in the text (pages 205-206, 208-209)
Complete OPTIONAL Homework 11: Use the
Data Analysis Tool for the Linear Regression
problems