Fall 2016
Huaizhen Qin
Figure 0: Outcome $y$, predictor $x$, and random error $e$.
The ideas of regression were first elucidated by the English scientist Sir Francis Galton (1822-1911)
in reports of his research on heredity, first in sweet peas and later in human stature. He
described a tendency of adult offspring, having either short or tall parents, to revert toward
the average height of the general population. He first used the word reversion, and later
regression, to refer to this phenomenon.
In simple linear regression, we will study how to relate an outcome variable $y$ to ONE predictor
variable $x$, where $x$ can be accurately measured/observed, but $y$ may also depend on unobserved
error (Figure 0).
Fig. 1: Data from the Greene-Touchstone study relating birthweight and estriol level in pregnant
women near term. Source: Rosner, Bernard. Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury,
Brooks/Cole, Cengage Learning. Page 428.
Table 1: Sample data from the Greene-Touchstone study relating birthweight (y) and estriol level (x)
in pregnant women near term.
 i   Estriol      Birthweight  |   i   Estriol      Birthweight  |   i   Estriol      Birthweight
     (mg/24 hr)   (g/100)      |       (mg/24 hr)   (g/100)      |       (mg/24 hr)   (g/100)
 1    7           25           |  12   19           31           |  22   15           35
 2    9           25           |  13   21           30           |  23   16           35
 3    9           25           |  14   24           28           |  24   19           34
 4   12           27           |  15   15           32           |  25   18           35
 5   14           27           |  16   16           32           |  26   17           36
 6   16           27           |  17   17           32           |  27   18           37
 7   16           24           |  18   25           32           |  28   20           38
 8   14           30           |  19   27           34           |  29   22           40
 9   16           30           |  20   15           34           |  30   25           39
10   16           31           |  21   15           34           |  31   24           43
11   17           30           |                                 |
Source: From Rosner, Bernard. Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole,
Cengage Learning. Page 429.
Example 1 (Obstetrics): The estriol level of pregnant women near term has been found to be
related to infant birthweight. An estriol test can therefore provide indirect evidence of an
abnormally small fetus.
The relationship between estriol level ($x$) and birthweight ($y$) can be quantified by fitting a
regression line that relates the two variables. Greene and Touchstone conducted a study to relate
birthweight ($y$) and estriol level ($x$) in pregnant women, using the 31 data points (Table 1).
Postulated relationship
As can be seen from the plot (Figure 1), there appears to be a relationship between estriol level
($x$) and birthweight ($y$), although this relationship is not consistent and considerable scatter
exists throughout the plot. We can postulate a linear relationship between $x$ and $y$ that is of the
following form:
$$E(y \mid x) = \beta_0 + \beta_1 x,$$
where $E(\cdot \mid \cdot)$ denotes the conditional mean. That is, for all mothers with the same estriol level $x$, the
average birthweight of their newborns is $E(y \mid x) = \beta_0 + \beta_1 x$.
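To make the idea of a conditional mean concrete, the following minimal sketch averages the birthweights of the sampled mothers in Table 1 who share one estriol level. The sample average is only an estimate of $E(y \mid x)$, and the function name `sample_conditional_mean` is a hypothetical helper introduced here for illustration.

```python
# Table 1 data: estriol in mg/24 hr, birthweight in g/100 (31 mother-baby pairs).
estriol = [7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17,
           19, 21, 24, 15, 16, 17, 25, 27, 15, 15,
           15, 16, 19, 18, 17, 18, 20, 22, 25, 24]
birthweight = [25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30,
               31, 30, 28, 32, 32, 32, 32, 34, 34, 34,
               35, 35, 34, 35, 36, 37, 38, 40, 39, 43]


def sample_conditional_mean(x_value):
    """Average birthweight among sampled mothers whose estriol level equals x_value."""
    ys = [y for x, y in zip(estriol, birthweight) if x == x_value]
    return sum(ys) / len(ys)


# Six mothers in the sample have estriol level 16; their babies' birthweights differ.
mean_at_16 = sample_conditional_mean(16)
print(f"Sample mean birthweight at x = 16: {mean_at_16:.2f} (g/100)")
```

With only a handful of mothers at each estriol level, such group averages are noisy, which is one motivation for pooling all 31 points through a single line.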
Regression line
Definition 1: The line $y = \beta_0 + \beta_1 x$ is called the regression line, where $\beta_0$ is called the
intercept and $\beta_1$ is called the slope of the line.
The regression line $y = \beta_0 + \beta_1 x$ is NOT expected to hold exactly for every woman and baby
pair. For example, NOT all women with a given estriol level (e.g., $x = 16$) have babies with
identical birthweights.
Thus an error term $e$, which represents the variability of birthweights among all babies of women
with a given estriol level $x$, is introduced into the model. Assume $e \sim N(0, \sigma^2)$, the normal distribution
with mean 0 and variance $\sigma^2$. The full linear regression model then takes the following
form: $y = \beta_0 + \beta_1 x + e$, where $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown parameters.
Definition 2: For the full linear regression model $y = \beta_0 + \beta_1 x + e$, $y$ is called the dependent
variable and $x$ is called the independent variable, because we are trying to predict $y$ as a
function of $x$.
Example 2 (Obstetrics): For the regression line in Example 1, birthweight y is the dependent
variable and estriol level x is the independent variable, because estriol levels are being used to
predict birthweights.
Interpretation of the regression line
One interpretation of the regression line is that for a given value $x$ (i.e., estriol level), the
corresponding $y$ (i.e., birthweight) will be normally distributed with mean $\beta_0 + \beta_1 x$ and variance
$\sigma^2$. In notation, $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$. If $\sigma^2$ were 0, then every point would fall exactly on the
regression line, whereas the larger $\sigma^2$ is, the more scatter occurs about the regression line (see
Figure 2).
Figure 2: The effect of $\sigma^2$ on the goodness of fit of a regression line: (a) perfect fit; (b) imperfect fit.
Source: Rosner, Bernard. Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole, Cengage Learning. Page 431.
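The role of $\sigma^2$ can also be seen in a small simulation. The sketch below draws outcomes from $y = \beta_0 + \beta_1 x + e$ for three error standard deviations and reports how far the points stray from the line. The parameter values (21.5 and 0.6), the grid of $x$ values, and the helper name `simulate_outcomes` are illustrative assumptions, not quantities taken from the study.

```python
import random


def simulate_outcomes(beta0, beta1, sigma, xs, seed=0):
    """Draw y = beta0 + beta1 * x + e, with e ~ N(0, sigma^2), for each x in xs."""
    rng = random.Random(seed)  # fixed seed so every sigma uses the same underlying draws
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


beta0, beta1 = 21.5, 0.6   # assumed, illustrative parameter values
xs = list(range(5, 30))    # assumed predictor values

max_deviation = {}
for sigma in (0.0, 1.0, 4.0):
    ys = simulate_outcomes(beta0, beta1, sigma, xs)
    # Largest vertical distance between a simulated point and the true line.
    max_deviation[sigma] = max(abs(y - (beta0 + beta1 * x)) for x, y in zip(xs, ys))
    print(f"sigma = {sigma}: largest vertical deviation = {max_deviation[sigma]:.3f}")
```

With `sigma = 0.0` every simulated point lies exactly on the line, and the deviations grow as `sigma` increases.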
Figure 4: Interpretation of the regression line for different values of $\beta_1$. Source: Rosner, Bernard.
Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole, Cengage Learning. Page 431.
Fig. 5: Least-squares criterion for judging the fit of a regression line.
The distance of a typical sample point $(x_i, y_i)$ from the estimated line could be measured along
a direction parallel to the y-axis. Let $(x_i, \hat{y}_i) = (x_i, b_0 + b_1 x_i)$ be the point on the estimated
regression line at $x_i$; then this distance is given by
$$d_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i).$$
For both theoretical reasons and ease of derivation, the following least-squares criterion is
commonly used.
Principle of Least Squares
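A good-fitting line would make the sum of the squared distances as small as possible. For arbitrary
$b_0$ and $b_1$, one can predict $y_i$ by $\hat{y}_i(b_0, b_1) = b_0 + b_1 x_i$, but has to suffer residuals
$d_i(b_0, b_1) = y_i - b_0 - b_1 x_i$. The least-squares estimates $(\hat\beta_0, \hat\beta_1)$ minimize
$$S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$
That is,
$$S(\hat\beta_0, \hat\beta_1) = \min_{b_0, b_1} S(b_0, b_1).$$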
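The criterion can be explored numerically before any formula is derived. The sketch below evaluates the sum of squared residuals over a coarse grid of candidate lines and keeps the best one. The four data points are made-up toy values, and a brute-force grid search is used only for illustration. It is not how least-squares estimates are normally computed.

```python
def sum_squared_residuals(b0, b1, xs, ys):
    """Sum over i of (y_i - b0 - b1 * x_i)^2 for a candidate line."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Made-up toy data (an assumption for illustration, not from the study).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Candidate intercepts from -1.00 to 1.00 and slopes from 1.00 to 3.00, in steps of 0.01.
best = None
for i in range(-100, 101):
    for j in range(100, 301):
        b0, b1 = i / 100, j / 100
        s = sum_squared_residuals(b0, b1, xs, ys)
        if best is None or s < best[0]:
            best = (s, b0, b1)

best_s, best_b0, best_b1 = best
print(f"best grid line: b0 = {best_b0}, b1 = {best_b1}, criterion = {best_s:.4f}")
```

Any other line on the grid, for example `b0 = 0, b1 = 2`, gives a strictly larger value of the criterion.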
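Let
$$L_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},$$
$$L_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n},$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
Solution:
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \quad \text{and} \quad \hat\beta_1 = L_{xy} / L_{xx}.$$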
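The Least Squares Line is given by $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, where $\hat\beta_0$ and $\hat\beta_1$ are called the Least Squares
Estimates of $\beta_0$ and $\beta_1$, respectively. One unbiased estimator of $\sigma^2$ is given by
$$\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$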
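As a minimal sketch, the closed-form solution and the variance estimator can be written as one short function. The function name `least_squares_fit` and the toy data are assumptions introduced for illustration. The formulas are the centered definitions of $L_{xx}$ and $L_{xy}$ and the divisor $n - 2$ stated in these notes.

```python
def least_squares_fit(xs, ys):
    """Return (b0_hat, b1_hat, sigma2_hat) for the simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Corrected sum of squares and corrected sum of cross products.
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1_hat = l_xy / l_xx
    b0_hat = y_bar - b1_hat * x_bar
    # Residual sum of squares divided by n - 2.
    sigma2_hat = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return b0_hat, b1_hat, sigma2_hat


# Made-up toy data (an assumption for illustration, not from the study).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

b0_hat, b1_hat, sigma2_hat = least_squares_fit(xs, ys)
print(f"intercept = {b0_hat:.2f}, slope = {b1_hat:.2f}, sigma^2 estimate = {sigma2_hat:.3f}")
```

The function needs at least three points with at least two distinct $x$ values; otherwise $L_{xx}$ or $n - 2$ is zero and the division fails.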
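For the data in Table 1 ($n = 31$):
$$\sum_{i=1}^{31} x_i = 534, \quad \sum_{i=1}^{31} y_i = 992, \quad \sum_{i=1}^{31} x_i y_i = 17500,$$
$$\bar{x} = \frac{534}{31} = 17.22581, \quad \bar{y} = \frac{992}{31} = 32,$$
$$\sum_{i=1}^{31} y_i^2 = 32418, \quad \sum_{i=1}^{31} x_i^2 = 9876.$$
$$L_{xx} = \sum_{i=1}^{31} x_i^2 - n\bar{x}^2 = 9876 - 31\left(\frac{534}{31}\right)^2 = 677.41935,$$
$$L_{xy} = \sum_{i=1}^{31} x_i y_i - n\bar{x}\bar{y} = 17500 - 31\left(\frac{534}{31}\right)\left(\frac{992}{31}\right) = 412.$$
$$\hat\beta_1 = L_{xy} / L_{xx} = \frac{412}{677.41935} = 0.60819,$$
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 32 - (0.60819)(17.2258) = 21.5234.$$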
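The arithmetic above can be checked with a few lines of code. This sketch recomputes the raw sums from the 31 pairs in Table 1 and applies the shortcut forms of $L_{xx}$ and $L_{xy}$. The variable names are chosen here for illustration.

```python
# Table 1 data: estriol in mg/24 hr, birthweight in g/100 (31 mother-baby pairs).
estriol = [7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17,
           19, 21, 24, 15, 16, 17, 25, 27, 15, 15,
           15, 16, 19, 18, 17, 18, 20, 22, 25, 24]
birthweight = [25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30,
               31, 30, 28, 32, 32, 32, 32, 34, 34, 34,
               35, 35, 34, 35, 36, 37, 38, 40, 39, 43]

n = len(estriol)
sum_x = sum(estriol)
sum_y = sum(birthweight)
sum_xx = sum(x * x for x in estriol)
sum_xy = sum(x * y for x, y in zip(estriol, birthweight))

x_bar = sum_x / n
y_bar = sum_y / n

# Shortcut (raw-sum) forms of the corrected sums.
l_xx = sum_xx - sum_x ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n

slope = l_xy / l_xx
intercept = y_bar - slope * x_bar

print(f"L_xx = {l_xx:.5f}, L_xy = {l_xy:.1f}")
print(f"slope = {slope:.5f}, intercept = {intercept:.4f}")
```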
3. Normality: For each value of $x$, the subpopulation of $y$ values must be normally distributed to
ensure that the usual (finite-sample) inferential procedures (estimation, hypothesis testing) are valid.
4. Homoscedasticity: The variances of the subpopulations of $y$ are all equal and denoted by $\sigma^2$.
In other words, the variance is invariant to $x$ values.
5. Linearity: Variables must be in the proper scale so that the average value of $y$ for a given
value of $x$ is a linear function of $x$. In other words, the means of the subpopulations of $y$ all
lie on the same straight line, symbolically, $\mu_{y|x} = E(y \mid x) = \beta_0 + \beta_1 x$.
6. Independence: The y values for distinct subjects are statistically independent. In other words,
in drawing the sample, it is assumed that the values of y chosen at one value of x in no way
depend on the values of y chosen at another value of x.
1. Determine whether or not the assumptions underlying a linear relationship are met in the
data available for analysis.
2. Obtain the equation for the line that best fits the sample data.
3. Evaluate the equation to obtain some idea of the strength of the relationship and the
usefulness of the equation for predicting and estimating.
4. If the data appear to conform satisfactorily to the linear model, use the equation obtained
from the sample data to predict and to estimate.
Exercises 1
1. Read the lecture note and the materials on pages 417-424 of Daniel's book. Which
assumption(s) is (are) violated in the data analysis of the metabolic disease?
2. Try to derive the Least Squares Estimators $\hat\beta_0$ and $\hat\beta_1$ mathematically (this is
optional, but it is good for those who like theory).