
BIOS 6040-01: Simple Linear Regression

Fall 2016

Huaizhen Qin

Lecture 1: Simple Linear Regression


Regression analysis is helpful in assessing specific forms
of the relationship between variables. The ultimate
objective of this method is usually to predict (or
estimate) the value of one variable (y) corresponding to
a given value of another variable (x).

Figure 0: Outcome y, predictor x, and random error e.
The ideas of regression were first elucidated by the English scientist Sir Francis Galton (1822–1911)
in reports of his research on heredity, first in sweet peas and later in human stature. He
described a tendency of adult offspring, having either short or tall parents, to revert back toward
the average height of the general population. He first used the word "reversion," and later
"regression," to refer to this phenomenon.
In simple linear regression, we will study how to relate an outcome variable y to ONE predictor
variable x, where x can be accurately measured/observed, but y may also depend on an unobserved
error e (Figure 0).

1.1. General Concepts


Example 1 (Obstetrics): Obstetricians sometimes order tests to measure estriol levels from 24-hour urine specimens taken from pregnant women who are near term, because the level of estriol has

Fig. 1: Data from the Greene-Touchstone study relating birthweight and estriol level in pregnant
women near term. Source: Rosner, Bernard. Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury,
Brooks/Cole, Cengage Learning. Page 428.


Table 1: Sample data from the Greene-Touchstone study relating birthweight (y) and estriol level (x)
in pregnant women near term.

 i   Estriol     Birthweight     i   Estriol     Birthweight     i   Estriol     Birthweight
     (mg/24 hr)  (g/100)             (mg/24 hr)  (g/100)             (mg/24 hr)  (g/100)

 1    7          25              12   19          31              22   15          35
 2    9          25              13   21          30              23   16          35
 3    9          25              14   24          28              24   19          34
 4   12          27              15   15          32              25   18          35
 5   14          27              16   16          32              26   17          36
 6   16          27              17   17          32              27   18          37
 7   16          24              18   25          32              28   20          38
 8   14          30              19   27          34              29   22          40
 9   16          30              20   15          34              30   25          39
10   16          31              21   15          34              31   24          43
11   17          30

Source: From Rosner, Bernard. Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole,
Cengage Learning. Page 429.

been found to be related to infant birthweight. The test can provide indirect evidence of an
abnormally small fetus.
The relationship between estriol level (x) and birthweight (y) can be quantified by fitting a
regression line that relates the two variables. Greene and Touchstone conducted a study to relate
birthweight (y) and estriol level (x) in pregnant women, using the 31 data points in Table 1.

Postulated relationship
As can be seen from the plot (Figure 1), there appears to be a relationship between estriol level
(x) and birthweight (y), although this relationship is not consistent and considerable scatter
exists throughout the plot. We can postulate a linear relationship between x and y of the
following form:

E(y|x) = β₀ + β₁x,

where E(·|·) denotes the conditional mean. That is, for all mothers with the same estriol level x, the
average birthweight of their newborns is E(y|x) = β₀ + β₁x.
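The conditional-mean reading of the model can be illustrated with a small simulation (a sketch, not from the lecture; the parameter values below are made up for illustration): among many observations sharing the same x, the average y approaches β₀ + β₁x.

```python
import random

# Illustrative simulation: under the model E(y|x) = b0 + b1*x with
# normal errors, the average outcome among observations sharing the
# same x approaches b0 + b1*x. Parameter values are assumptions.
random.seed(0)
b0, b1, sigma = 21.5, 0.61, 3.0   # assumed parameters, not fitted values
x = 16                            # one fixed predictor value
ys = [b0 + b1 * x + random.gauss(0, sigma) for _ in range(100_000)]
mean_y = sum(ys) / len(ys)

print(round(b0 + b1 * x, 2))  # theoretical conditional mean: 31.26
print(round(mean_y, 1))       # simulated average, close to the above
```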


Regression line
Definition 1: The line y = β₀ + β₁x is called the regression line, where β₀ is called the
intercept and β₁ is called the slope of the line.
The regression line y = β₀ + β₁x is NOT expected to hold exactly for every woman and baby
pair. For example, NOT all women with a given estriol level (e.g., x = 16) have babies with
identical birthweights.
Thus an error term e, which represents the random variation of birthweights among all babies of women
with a given estriol level x, is introduced into the model. Let e ~ N(0, σ²), the normal distribution
with mean 0 and variance σ². The full linear regression model then takes the following
form: y = β₀ + β₁x + e, where β₀, β₁ and σ² are unknown parameters.

Definition 2: For the full linear regression model y = β₀ + β₁x + e, y is called the dependent
variable and x is called the independent variable, because we are trying to predict y as a
function of x.
Example 2 (Obstetrics): For the regression line in Example 1, birthweight y is the dependent
variable and estriol level x is the independent variable, because estriol levels are being used to
predict birthweights.
Interpretation of the regression line
One interpretation of the regression line is that for a given value x (i.e., estriol level), the
corresponding y (i.e., birthweight) will be normally distributed with mean β₀ + β₁x and variance
σ². In notation, y|x ~ N(β₀ + β₁x, σ²). If σ² were 0, then every point would fall exactly on the
regression line, whereas the larger σ² is, the more scatter occurs about the regression line (see
Figure 2).
(a) Perfect fit.
(b) Imperfect fit.

Figure 2: The effect of σ² on the goodness of fit of a regression line. Source: Rosner, Bernard.
Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole, Cengage Learning. Page 431.


Figure 4: Interpretation of the regression line for different values of β₁. Source: Rosner, Bernard.
Fundamentals of Biostatistics. 7th Edition, 2011 Duxbury, Brooks/Cole, Cengage Learning. Page 431.

Interpretation of the slope

If β₁ is greater than 0, then as x increases, the expected value of y, E(y) = β₀ + β₁x, will increase.
This situation appears to be the case in Figure 4a for birthweight (y) and estriol level (x) because
as estriol level increases, the average birthweight increases correspondingly.
If β₁ is less than 0, then as x increases, the expected value of y will decrease. This situation might
occur in pediatrics, for example, pulse rate (y) vs. age (x), as illustrated in Figure 4b, because
infants are born with rapid pulse rates that gradually slow with age.
If β₁ is equal to 0, then there is no linear relationship between x and y. This situation might occur
in a plot of birthweight vs. birthday, as shown in Figure 4c, because there is no relationship
between birthweight and birthday.


1.2. Fitting Regression Lines


Let {(xᵢ, yᵢ): i = 1, …, n} be observations of (x, y). Using the observed data points, we will
estimate the unknowns using the method of least squares (Figure 5).

Fig. 5: Least squares criterion for judging the fit of a regression line.

The distance of a typical sample point (xᵢ, yᵢ) from the estimated line can be measured along
a direction parallel to the y-axis. Let (xᵢ, ŷᵢ) = (xᵢ, b₀ + b₁xᵢ) be the point on the estimated
regression line at xᵢ; then this distance is given by

dᵢ = yᵢ − ŷᵢ = yᵢ − (b₀ + b₁xᵢ).
For both theoretical reasons and ease of derivation, the following least-squares criterion is
commonly used.
Principle of Least Squares
A good-fitting line should make the sum of the squared distances as small as possible. For arbitrary
b₀ and b₁, one can predict yᵢ by ŷᵢ(b₀, b₁) = b₀ + b₁xᵢ, incurring the residuals
eᵢ(b₀, b₁) = yᵢ − b₀ − b₁xᵢ. The least squares estimates (b̂₀, b̂₁) minimize

S(b₀, b₁) = Σᵢ (yᵢ − b₀ − b₁xᵢ)²,  the sum taken over i = 1, …, n.

That is,

S(b̂₀, b̂₁) = min over (b₀, b₁) of S(b₀, b₁).
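The least squares criterion can be explored numerically by evaluating S(b₀, b₁) over a coarse grid of candidate lines; a minimal sketch with made-up data (not the estriol data from Table 1):

```python
# Made-up illustrative data (not the estriol data from Table 1).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def S(b0, b1):
    """Sum of squared residuals for the candidate line y = b0 + b1*x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Grid search over intercepts in [-2, 2] and slopes in [0, 4], step 0.1.
best = min(
    ((i / 10, j / 10) for i in range(-20, 21) for j in range(0, 41)),
    key=lambda b: S(*b),
)
print(best)                 # (0.0, 2.0), near the exact least squares fit
print(round(S(*best), 2))   # 0.11, its sum of squared residuals
```

The grid winner approximates the exact minimizer (here about b₀ = 0.14, b₁ = 1.96); the closed-form solution in the next section locates it without any search.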


We let x̄ = Σᵢ xᵢ/n and ȳ = Σᵢ yᵢ/n denote the sample means (sums over i = 1, …, n). The least
squares estimates b̂₀, b̂₁, σ̂² are given as below.

Solution:

L_xx = Σᵢ (xᵢ − x̄)² = Σᵢ xᵢ² − (Σᵢ xᵢ)²/n,

L_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ xᵢyᵢ − (Σᵢ xᵢ)(Σᵢ yᵢ)/n,

b̂₁ = L_xy/L_xx,  and  b̂₀ = ȳ − b̂₁x̄.

The Least Squares Line is given by ŷ = b̂₀ + b̂₁x, where b̂₀ and b̂₁ are called the Least Squares
Estimates of β₀ and β₁, respectively. One unbiased estimator of σ² is given by

σ̂² = (1/(n − 2)) Σᵢ (yᵢ − b̂₀ − b̂₁xᵢ)².
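The closed-form solution translates directly into code; a minimal sketch (the function name `least_squares` is my own, not from the lecture):

```python
def least_squares(xs, ys):
    """Least squares estimates (b0, b1, s2) for the model y = b0 + b1*x + e,
    using the Lxx/Lxy formulas and the unbiased variance estimator."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    lxx = sum((x - xbar) ** 2 for x in xs)
    lxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = lxy / lxx
    b0 = ybar - b1 * xbar
    # Unbiased estimate of sigma^2 divides by n - 2, not n.
    s2 = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return b0, b1, s2

# On data lying exactly on y = 1 + 2x, the fit is exact and s2 is 0.
print(least_squares([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0, 0.0)
```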


Example 3: Compute the Least Squares Line of the Obstetrics data.


Step 1: Compute the moments (all sums run over i = 1, …, 31).

Σxᵢ = 534,  Σyᵢ = 992,  Σxᵢyᵢ = 17500,  Σxᵢ² = 9876,  Σyᵢ² = 32419,

x̄ = 534/31 = 17.2258,  ȳ = 992/31 = 32.

Step 2: Compute the Ls.

L_xx = Σxᵢ² − (Σxᵢ)²/n = 9876 − (534)²/31 = 677.41935,

L_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 17500 − (534)(992)/31 = 412.

Step 3: Compute the betas.

b̂₁ = L_xy/L_xx = 412/677.41935 = 0.60819,

b̂₀ = ȳ − b̂₁x̄ = 32 − (0.60819)(17.2258) = 21.5234.

Least Squares Line: ŷ = 21.5234 + 0.6082x. This regression line is shown in Figure 1.
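The arithmetic in Example 3 can be verified from the stated moments alone:

```python
# Summary statistics from Step 1 of Example 3.
n, sx, sy, sxy, sxx = 31, 534, 992, 17500, 9876

lxx = sxx - sx ** 2 / n    # L_xx
lxy = sxy - sx * sy / n    # L_xy
b1 = lxy / lxx
b0 = sy / n - b1 * sx / n

print(round(lxx, 5), lxy)          # 677.41935 412.0
print(round(b1, 5), round(b0, 4))  # 0.60819 21.5234
```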

1.3. Assumptions Underlying Simple Linear Regression


In classical simple linear regression analysis, the following assumptions are usually adopted by
default.
1. Non-randomness: Values of the independent variable x are non-random (fixed). In other
words, these values can be preselected or accurately measured by the investigator.
2. Error-free: The independent variable x can be measured accurately (without error) by the
investigator. Since no measuring procedure is perfect, this means that the measurement error must
be negligible.
3. Normality: For each value of x there is a subpopulation of y values. These subpopulations


must be normally distributed for the usual (finite-sample) inferential procedures
(estimation, hypothesis testing) to be valid.
4. Homoscedasticity: The variances of the subpopulations of y are all equal and denoted by σ².
In other words, the variance is invariant to the value of x.
5. Linearity: The variables must be on the proper scale so that the average value of y for a given
value of x is a linear function of x. In other words, the means of the subpopulations of y all
lie on the same straight line; symbolically, E(y|x) = β₀ + β₁x.
6. Independence: The y values for distinct subjects are statistically independent. In other words,
in drawing the sample, it is assumed that the values of y chosen at one value of x in no way
depend on the values of y chosen at another value of x.
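Some of these assumptions can be checked numerically once a line has been fitted; a sketch with made-up data (not from the lecture). Least squares residuals always sum to zero, and roughly equal residual spread across the range of x is consistent with homoscedasticity:

```python
# Made-up illustrative data, roughly linear with constant noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 13.8, 16.2]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
b0 = ybar - b1 * xbar
res = [y - b0 - b1 * x for x, y in zip(xs, ys)]

# Residual spread in the lower and upper halves of the x range; values of
# the same order of magnitude are consistent with homoscedasticity.
half = n // 2
spread_low = sum(r ** 2 for r in res[:half]) / half
spread_high = sum(r ** 2 for r in res[half:]) / half

print(round(abs(sum(res)), 10))                  # 0.0 (residuals sum to zero)
print(round(spread_low, 3), round(spread_high, 3))
```

A formal assessment of normality and homoscedasticity uses residual plots, as discussed in standard texts; this sketch only illustrates the idea.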

A graphical representation of the regression model is given in Figure 6.

Figure 6. Representation of the simple linear regression model.


1.4. Steps in Regression Analysis


In the absence of extensive information regarding the nature of the variables of interest, a
frequently employed strategy is to assume initially that they are linearly related. Subsequent
analysis then involves the following steps.

1. Determine whether or not the assumptions underlying a linear relationship are met in the
data available for analysis.

2. Obtain the equation for the line that best fits the sample data.

3. Evaluate the equation to obtain some idea of the strength of the relationship and the
usefulness of the equation for predicting and estimating.

4. If the data appear to conform satisfactorily to the linear model, use the equation obtained
from the sample data to predict and to estimate.

Exercises 1

1. Read the lecture note and the materials on pages 417–424 of Daniel's book. Which
assumption(s) is (are) violated in the data analysis of the metabolic disease?
2. Try to derive the Least Squares Estimators b̂₀ and b̂₁ mathematically. (This is
optional, but it is good for those who like theory.)
