
Linear correlation and linear regression + summary of tests


Recall: Covariance
$$\operatorname{cov}(x,y)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not guarantee independence)


Interpreting Covariance
Correlation coefficient
Pearson's correlation coefficient is standardized covariance (unitless):

$$r=\frac{\operatorname{cov}(x,y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}$$
Recall dice problem
Var(x) = 2.91666
Var(y) = 5.83333
Cov(x,y) = 2.91666

$$r=\frac{2.91666}{\sqrt{2.91666\times 5.83333}}=\frac{1}{\sqrt{2}}=.707$$

$$R^2=.707^2=.5$$
Interpretation of R²: 50% of the total variation in the sum of the two dice is explained by the roll on the first die.
Makes perfect intuitive sense!
R² = coefficient of determination = SS_explained/TSS
Correlation
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various Correlation Coefficients

[Figure: scatter plots of Y vs. X illustrating r = -1, r = -.6, r = 0, r = +.3, and r = +1]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation

[Figure: scatter plots of Y vs. X contrasting linear relationships with curvilinear relationships]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation

[Figure: scatter plots of Y vs. X contrasting strong relationships with weak relationships]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation

[Figure: scatter plots of Y vs. X showing no relationship]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Some calculation formulas
$$r=\frac{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$$

Note: easier computation formulas:

$$SS_{xy}=\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y},\qquad SS_x=\sum_{i=1}^{n}x_i^2-n\bar{x}^2,\qquad SS_y=\sum_{i=1}^{n}y_i^2-n\bar{y}^2$$
Sampling distribution of correlation coefficient:

$$SE(\hat{r})=\sqrt{\frac{1-r^2}{n-2}}$$

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself; substitute in the estimated r.

The sample correlation coefficient follows a T-distribution with n-2 degrees of freedom (since you have to estimate the standard error).
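For concreteness, a quick plug-in using the dice example's r = .707 and a hypothetical sample size of n = 25 (the n is not from the original example):

$$SE(\hat{r})=\sqrt{\frac{1-.707^2}{25-2}}=\sqrt{\frac{.5}{23}}\approx .147,\qquad T_{23}=\frac{.707}{.147}\approx 4.8$$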
Sample size requirements for r:

$$n=\frac{(1-r^2)(Z_{\alpha/2}+Z_{power})^2}{r^2}+2$$
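Taking the approximation above at face value, a hypothetical plug-in: to detect r = .5 with two-sided α = .05 (Z ≈ 1.96) and 80% power (Z ≈ 0.84),

$$n=\frac{(1-.25)(1.96+0.84)^2}{.25}+2\approx 23.5+2\approx 26$$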
Correlation in SAS
/*To get correlations between variables 1 and 2, 1 and 3, and 2
and 3:*/
PROC CORR data=yourdata;
var variable1 variable2 variable3;
run;

/*To get correlations between variables 3 and 1 and 3 and 2:*/
PROC CORR data=yourdata;
var variable1 variable2;
with variable3;
run;

Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
What is Linear?
Remember this:
Y=mX+B?
(m = slope; B = intercept)
What's Slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.


Simple linear regression
The linear regression model:
Love of Math = 1.076268 + 0.010203*math SAT score
(intercept = 1.076268; slope = 0.010203; P = .22 for the slope, not significant)
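A minimal SAS sketch of how a model like this could be fit with PROC REG; the dataset and variable names (survey, loveofmath, mathsat) are hypothetical stand-ins:

/*Simple linear regression: outcome = intercept + slope*predictor (hypothetical names)*/
PROC REG data=survey;
  model loveofmath = mathsat;
run;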
Prediction
If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities?)

EXAMPLE
The distribution of baby weights at Stanford
~ N(3400, 360000)
Your best guess at a random baby's weight, given no information about the baby, is what?
3400 grams
But, what if you have relevant information? Can
you make a better guess?
Predictor variable
X=gestation time

Assume that babies that gestate for longer
are born heavier, all other things being equal.
Pretend (at least for the purposes of this
example) that this relationship is linear.
Example: suppose a one-week increase in
gestation, on average, leads to a 100-gram
increase in birth-weight
Y depends on X
[Figure: scatter of Y = birth weight (g) vs. X = gestation time (weeks) with the best-fit line drawn through the points]

The best-fit line is chosen such that the sum of the squared (why squared?) distances of the points (the Y_i's) from the line is minimized.

Or mathematically (remember maxima and minima from calculus):

$$\frac{d}{dm}\sum_i\left(Y_i-(mX_i+b)\right)^2=0$$
Prediction
A new baby is born that had gestated
for just 30 weeks. What's your best guess at the birth-weight?
Are you still best off guessing 3400?
NO!
[Figure: scatter of Y = birth weight (g) vs. X = gestation time (weeks); at X = 30 weeks the regression line passes through roughly (x, y) = (30, 3000)]
At 30 weeks
The babies that gestate for 30 weeks
appear to center around a weight of
3000 grams.

In Math-Speak
E(Y/X=30 weeks)=3000 grams

Note the conditional
expectation
But...
Note that not every Y-value (Y_i) sits on the line. There's variability.

Y_i = 3000 + random error_i

In fact, babies that gestate for 30 weeks have birth-weights that center at 3000 grams, but vary around 3000 with some variance σ².

Approximately what distribution do birth-weights follow? Normal. Y/X=30 weeks ~ N(3000, σ²)


And, if X = 20, 30, or 40...

[Figure: normal curves of Y = baby weight (g) centered on the regression line at gestation times X = 20, 30, and 40 weeks]

Y/X=40 weeks ~ N(4000, σ²)
Y/X=30 weeks ~ N(3000, σ²)
Y/X=20 weeks ~ N(2000, σ²)

Mean values fall on the line:

E(Y/X=40 weeks) = 4000
E(Y/X=30 weeks) = 3000
E(Y/X=20 weeks) = 2000

E(Y/X) = μ_{Y/X} = 100 grams/week * X weeks

Linear Regression Model
Y's are modeled:

Y_i = 100*X_i + random error_i

(the term 100*X_i is fixed exactly on the line; the random error_i follows a normal distribution)
Assumptions (or the fine print)
Linear regression assumes that
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the
same (homogeneity of variances)

Why? The math requires it; the mathematical process is called "least squares" because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).
Non-homogeneous variance

[Figure: scatter of Y = birth weight (100g) vs. X = gestation time (weeks) in which the variance of Y is not constant across values of X]
Least squares estimation

A little calculus...
What are we trying to estimate? β, the slope, from y = α + βx.
What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values ŷ; the differences y_i - ŷ_i are also called the residuals, or left-over unexplained variability.

Difference_i = y_i - (βx_i + α)        Difference_i² = (y_i - (βx_i + α))²

Find the β that gives the minimum sum of the squared differences. How do you find a maximum or minimum of a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus...

$$\frac{d}{d\beta}\sum_{i=1}^{n}\left(y_i-(\alpha+\beta x_i)\right)^2=\sum_{i=1}^{n}2\left(y_i-(\alpha+\beta x_i)\right)(-x_i)=-2\sum_{i=1}^{n}\left(x_iy_i-\alpha x_i-\beta x_i^2\right)=0\;\ldots

From here it takes a little math trickery to solve for β̂.
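Sketching how that algebra can be carried through (a standard derivation in the same notation): setting the derivatives with respect to both α and β to zero gives the normal equations, and solving them yields the estimates shown in "Results of least squares" below.

$$\sum_{i=1}^{n}\left(y_i-\hat{\alpha}-\hat{\beta}x_i\right)=0,\qquad \sum_{i=1}^{n}\left(y_i-\hat{\alpha}-\hat{\beta}x_i\right)x_i=0 \;\;\Rightarrow\;\; \hat{\beta}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\qquad \hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}$$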
The Regression Picture

[Figure: scatter of the data with the fitted line ŷ_i = α + βx_i and the naïve mean ȳ; for one observation y_i, the labeled distances are A (from y_i to ȳ), B (from ŷ_i to ȳ), and C (from y_i to ŷ_i)]

*Least squares estimation gave us the line (β) that minimized C².

A² → SS_total: total squared distance of observations from the naïve mean of y (total variation)
B² → SS_reg: distance from the regression line to the naïve mean of y (variability due to x, the regression)
C² → SS_residual: variance around the regression line (additional variability not explained by x; what the least squares method aims to minimize)

$$\sum_{i=1}^{n}(y_i-\bar{y})^2=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2+\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

R² = SS_reg/SS_total
Results of least squares

Slope (beta coefficient):

$$\hat{\beta}=\frac{SS_{xy}}{SS_x},\qquad\text{where } SS_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\ \text{ and }\ SS_x=\sum_{i=1}^{n}(x_i-\bar{x})^2$$

Intercept: calculate

$$\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}$$

The regression line always goes through the point (x̄, ȳ).
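A tiny made-up dataset to see these formulas in action (three points, purely illustrative): (x, y) = (1, 2), (2, 4), (3, 5), so x̄ = 2 and ȳ = 11/3.

$$SS_x=(1-2)^2+(2-2)^2+(3-2)^2=2,\qquad SS_{xy}=(-1)\left(2-\tfrac{11}{3}\right)+0+(1)\left(5-\tfrac{11}{3}\right)=3$$

$$\hat{\beta}=\frac{3}{2}=1.5,\qquad \hat{\alpha}=\frac{11}{3}-1.5\times 2=\frac{2}{3}\approx .67$$

and indeed the fitted line .67 + 1.5x passes through (x̄, ȳ) = (2, 11/3).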

Relationship with correlation

$$r=\hat{\beta}\sqrt{\frac{SS_x}{SS_y}}$$

where

$$SS_y=\sum_{i=1}^{n}(y_i-\bar{y})^2=\sum_{i=1}^{n}y_i^2-n\bar{y}^2\qquad\text{and}\qquad SS_x=\sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}x_i^2-n\bar{x}^2$$

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
Expected value of y

Expected value of y at a given level of x_i:

$$\hat{y}_i=\hat{\alpha}+\hat{\beta}x_i$$

Residual:

$$e_i=y_i-\hat{y}_i=y_i-(\hat{\alpha}+\hat{\beta}x_i)$$

We fit the regression coefficients such that the sum of the squared residuals is minimized (least squares regression).
Residual

Residual = observed value - predicted value

Example: at 33.5 weeks gestation, the predicted baby weight is 3350 grams. This baby was actually 3380 grams, so his residual is +30 grams.
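If the babies data were in SAS, one way to obtain the predicted values and residuals is the OUTPUT statement of PROC REG (the dataset and variable names below are hypothetical):

/*Fit birth weight on gestation time and save fitted values and residuals*/
PROC REG data=babies;
  model birthweight = gestweeks;
  output out=preds p=predicted r=residual;
run;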
Standard error of y/x

Recall the ordinary sample variance of y:

$$s_y^2=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}=\frac{SS_y}{n-1}$$

The variance around the regression line is:

$$s_{y/x}^2=\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}$$

S²_{y/x} = average squared residual (what we've tried to minimize); it is equivalent to the MSE (=SSW/df) in ANOVA.
[Figure: scatter of Y = baby weights (g) vs. X = gestation times (weeks), with the spread S_{y/x} around the regression line marked at X = 20, 30, and 40]

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
Standard error of beta

$$s_{\hat{\beta}}=\sqrt{\frac{s_{y/x}^2}{SS_x}}=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2/(n-2)}{SS_x}}$$

where SS_x = Σ(x_i - x̄)² = Σx_i² - nx̄² and ŷ_i = α̂ + β̂x_i.
Comparing Standard Errors of the Slope

[Figure: two scatter plots with fitted regression lines, one illustrating a small S_β̂ and one a large S_β̂]

S_β̂ is a measure of the variation in the slope of regression lines from different possible samples.
Sampling distribution of beta

Sampling distribution of the slope: ~ T_{n-2}(β, s.e.(β̂))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

$$T_{n-2}=\frac{\hat{\beta}-0}{s.e.(\hat{\beta})},\qquad\text{where } s.e.(\hat{\beta})=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2/(n-2)}{SS_x}}$$
Standard error of the intercept

Recall: α̂ = ȳ - β̂x̄

$$stderror(\hat{\alpha})=s_{y/x}\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{SS_x}},\qquad\text{where } s_{y/x}^2=\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}$$
Residual Analysis: check
assumptions
The residual for observation i, e_i, is the difference between its observed and predicted values.
Check the assumptions of regression by examining the
residuals
Examine for linearity assumption
Examine for constant variance for all levels of X
(homoscedasticity)
Evaluate normal distribution assumption
Evaluate independence assumption
Graphical Analysis of Residuals
Can plot residuals vs. X:

$$e_i=Y_i-\hat{Y}_i$$
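A possible SAS sketch for such a plot, reusing the hypothetical babies example: save the residuals with an OUTPUT statement, then plot them against X.

/*Residuals vs. X to eyeball linearity and constant variance (hypothetical names)*/
PROC REG data=babies;
  model birthweight = gestweeks;
  output out=resids r=residual;
run;

PROC SGPLOT data=resids;
  scatter x=gestweeks y=residual;
  refline 0 / axis=y;   /* residuals should scatter randomly around zero */
run;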
Residual Analysis for Linearity

[Figure: Y vs. x plots and residuals vs. x plots for a not-linear relationship and for a linear relationship]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Homoscedasticity

[Figure: Y vs. x plots and residuals vs. x plots for non-constant variance and for constant variance]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Independence

[Figure: residuals vs. X plots for not-independent and for independent observations]

Slide from: Statistics for Managers Using Microsoft Excel 4th Edition, 2004 Prentice-Hall
A t-test is linear regression!
In our class, the average English SAT score was 739 (sd: 48) in those born on odd days (n=12) and 657 (sd: 107) in those born on even days (n=13).
We can evaluate these data with a t-test or with a linear regression...
$$T_{23}=\frac{739-657}{\sqrt{\dfrac{82^2}{12}+\dfrac{82^2}{13}}}\approx 2.49;\qquad p=.02$$

(The pooled standard deviation is about 82, but the variances violate the assumption of homogeneity of variances! Using the exact means, the difference is 81.7.)
As a linear regression

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 657.5000000 23.66105065 27.79 <.0001
OddDay 81.7307692 32.81197359 2.49 0.0204
The intercept represents the mean value in the even-day group. It is significantly different from 0, so the average English SAT score is not 0.

The slope represents the difference in means between the odd- and even-day groups. The difference is significant.
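A sketch of a SAS call that could produce output of this form (the dataset name is hypothetical; OddDay is assumed to be coded 1 for odd-day birthdays and 0 for even):

/*t-test expressed as a regression on a 0/1 indicator (hypothetical dataset name)*/
PROC REG data=class_sat;
  model EngSAT = OddDay;
run;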
Multiple Linear Regression
More than one predictor

$$\hat{y}=\alpha+\beta_1 X+\beta_2 W+\beta_3 Z$$

Each regression coefficient is the amount of
change in the outcome variable that would be
expected per one-unit change of the
predictor, if all other variables in the model
were held constant.
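A minimal SAS sketch of a model with several predictors, as described above (the outcome and predictor names y, x, w, z mirror the symbols and are placeholders):

/*Multiple linear regression: one continuous outcome, several predictors*/
PROC REG data=yourdata;
  model y = x w z;
run;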


ANOVA is linear regression!
A categorical variable with more than two
groups:
E.g.: groups 1, 2, and 3 (mutually exclusive)

$$\hat{y}=\alpha\ (\text{= value for group 1})+\beta_1\cdot(1\text{ if in group 2})+\beta_2\cdot(1\text{ if in group 3})$$

This is called dummy coding, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
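A sketch of this dummy coding in a SAS data step (dataset, group, and outcome names are hypothetical); group 1 is the reference, so its mean is absorbed into the intercept. PROC GLM with a CLASS statement would create equivalent coding automatically.

/*Create 0/1 indicators for groups 2 and 3; group 1 is the reference category*/
data coded;
  set yourdata;
  group2 = (group = 2);   /* 1 if in group 2, else 0 */
  group3 = (group = 3);   /* 1 if in group 3, else 0 */
run;

PROC REG data=coded;
  model outcome = group2 group3;
run;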
Functions of multivariate
analysis:
Control for confounders
Test for interactions between predictors
(effect modification)
Improve predictions
Linear Regression Coefficient (z Score)

Variable SBP DBP
Model 1
Total protein, % kcal -0.0346 (-1.10) -0.0568 (-3.17)
Cholesterol, mg/1000 kcal 0.0039 (2.46) 0.0032 (3.51)
Saturated fatty acids, % kcal 0.0755 (1.45) 0.0848 (2.86)
Polyunsaturated fatty acids, % kcal 0.0100 (0.24) -0.0284 (-1.22)
Starch, % kcal 0.1366 (4.98) 0.0675 (4.34)
Other simple carbohydrates, % kcal 0.0327 (1.35) 0.0006 (0.04)
Model 2
Total protein, % kcal -0.0344 (-1.10) -0.0489 (-2.77)
Cholesterol, mg/1000 kcal 0.0034 (2.14) 0.0029 (3.19)
Saturated fatty acids, % kcal 0.0786 (1.73) 0.1051 (4.08)
Polyunsaturated fatty acids, % kcal 0.0029 (0.08) -0.0230 (-1.07)
Starch, % kcal 0.1149 (4.65) 0.0608 (4.35)
Models controlled for baseline age, race (black, nonblack), education, smoking, serum
cholesterol.
Table 3. Relationship of Combinations of Macronutrients to BP (SBP and DBP) for 11 342
Men, Years 1 Through 6 of MRFIT: Multiple Linear Regression Analyses
Circulation. 1996 Nov 15;94(10):2417-23.
Linear Regression Coefficient (z Score)
Variable: Total protein, % kcal. SBP: -0.0346 (-1.10); DBP: -0.0568 (-3.17)
Translation: controlled for the other variables in the model (as well as baseline age, race, etc.), every 1% increase in the percent of calories coming from protein correlates with a .0346 mmHg decrease in systolic BP. (NS)

In math terms: SBP = α - .0346*(% protein) + β_age*(Age) + ...

Also (from a separate model), every 1% increase in the percent of calories coming from protein correlates with a .0568 mmHg decrease in diastolic BP. (significant)

DBP = α - .0568*(% protein) + β_age*(Age) + ...
Multivariate regression pitfalls
Multi-collinearity
Residual confounding
Overfitting
Multicollinearity
Multicollinearity arises when two variables that
measure the same thing or similar things (e.g.,
weight and BMI) are both included in a multiple
regression model; they will, in effect, cancel each
other out and generally destroy your model.

Model building and diagnostics are tricky
business!

Residual confounding
You cannot completely wipe out confounding simply by
adjusting for variables in multiple regression unless variables
are measured with zero error (which is usually impossible).
Residual confounding can lead to significant adjusted odds
ratios (ORs) as high as 1.5 to 2.0 if measurement error is high.
Hypothetical Example: In a case-control study of lung cancer,
researchers identified a link between alcohol drinking and
cancer in smokers only. The OR was 1.3 for 1-2 drinks per day
(compared with none) and 1.5 for 3+ drinks per day. Though
the authors adjusted for number of cigarettes smoked per day
in multivariate regression, we cannot rule out residual
confounding by level of smoking (which may be tightly linked to
alcohol drinking).


Overfitting
In multivariate modeling, you can get highly significant but
meaningless results if you put too many predictors in the model.
The model is fit perfectly to the quirks of your particular sample, but
has no predictive ability in a new sample.
Example (hypothetical): In a randomized trial of an intervention to
speed bone healing after fracture, researchers built a multivariate
regression model to predict time to recovery in a subset of women
(n=12). An automatic selection procedure came up with a model
containing age, weight, use of oral contraceptives, and treatment
status; the predictors were all highly significant and the model had a
nearly perfect R-square of 99.5%.
This is likely an example of overfitting. The researchers have fit a
model to exactly their particular sample of data, but it will likely have
no predictive ability in a new sample.
Rule of thumb: You need at least 10 subjects for each additional
predictor variable in the multivariate regression model.
Overfitting
Pure noise variables still produce good R² values if the model is overfitted.

[Figure: the distribution of R² values from a series of simulated regression models containing only noise variables. (Figure 1 from: Babyak MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)]

Other types of multivariate
regression
Multiple linear regression is for normally
distributed outcomes

Logistic regression is for binary outcomes

Cox proportional hazards regression is used when
time-to-event is the outcome

Overview of statistical tests
The following table gives the appropriate
choice of a statistical test or measure of
association for various types of data (outcome
variables and predictor variables) by study
design.

Example: continuous outcome with binary and continuous predictors,
e.g., blood pressure = pounds + age + treatment (1/0)
Types of variables to be analyzed, and the corresponding statistical procedure or measure of association:

Cross-sectional/case-control studies
Predictor variable/s | Outcome variable | Statistical procedure or measure of association
Categorical (>2 groups) | Continuous | ANOVA
Continuous | Continuous | Simple linear regression
Multivariate (categorical and continuous) | Continuous | Multiple linear regression
Categorical | Categorical | Chi-square test (or Fisher's exact)
Binary | Binary | Odds ratio, risk ratio
Multivariate | Binary | Logistic regression

Cohort studies/clinical trials
Predictor variable/s | Outcome variable | Statistical procedure or measure of association
Binary | Binary | Risk ratio
Categorical | Time-to-event | Kaplan-Meier/log-rank test
Multivariate | Time-to-event | Cox proportional hazards regression, hazard ratio
Binary (two groups) | Continuous | T-test
Binary | Ranks/ordinal | Wilcoxon rank-sum test
Categorical | Continuous | Repeated measures ANOVA
Multivariate | Continuous | Mixed models; GEE modeling
Alternative summary: statistics for various types of outcome data

Continuous outcome (e.g. pain scale, cognitive function)
  Independent observations: t-test, ANOVA, linear correlation, linear regression
  Correlated observations: paired t-test, repeated-measures ANOVA, mixed models/GEE modeling
  Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Binary or categorical outcome (e.g. fracture yes/no)
  Independent observations: difference in proportions, relative risks, chi-square test, logistic regression
  Correlated observations: McNemar's test, conditional logistic regression, GEE modeling
  Assumptions: chi-square test assumes sufficient numbers in each cell (>=5)

Time-to-event outcome (e.g. time to fracture)
  Independent observations: Kaplan-Meier statistics, Cox regression
  Correlated observations: n/a
  Assumptions: Cox regression assumes proportional hazards between groups
Continuous outcome (means); HRP 259/HRP 262

Outcome variable: continuous (e.g. pain scale, cognitive function)

Independent observations:
  T-test: compares means between two independent groups
  ANOVA: compares means between more than two independent groups
  Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
  Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
  Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  Wilcoxon rank-sum test (=Mann-Whitney U test): non-parametric alternative to the t-test
  Kruskal-Wallis test: non-parametric alternative to ANOVA
  Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Binary or categorical outcomes (proportions); HRP 259/HRP 261

Outcome variable: binary or categorical (e.g. fracture, yes/no)

Independent observations:
  Chi-square test: compares proportions between more than two groups
  Relative risks: odds ratios or risk ratios
  Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
  McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
  Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
  GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
  Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
  McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Time-to-event outcome (survival data); HRP 262

Outcome variable: time-to-event (e.g., time to fracture)

Independent observations:
  Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
  Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)