
Linear Model for Biostatistics

Ernesto Ponsot Balaguer


PhD in Statistics, MSc in Applied Statistics, Systems Engineering
http://webdelprofesor.ula.ve/economia/ernesto
E-mail: eponsot@yachaytech.edu.ec

University of Experimental Technologies Research Yachay (Yachay Tech)


School of Mathematical Sciences and Information Technology
School of Biological Sciences and Engineering

Imbabura, Ecuador - March 2018


Content

1 Simple Regression Model

2 Multiple Linear Regression Model



The Model

If we have a quantitative variable (x) that we believe is associated
with the response of interest (also quantitative, y), then the
simplest statistical model that we can postulate is:

yi = β0 + β1 xi + εi ,   i = 1, 2, · · · , n     (1)

We call this the Simple Linear Regression Model: “simple”
because we have only one independent variable x, also called a
“regressor”, and “linear” because the relationship is linear in the
β’s.
In (1) we have n observations of the pair (y, x) and we propose
two unknown parameters β0 and β1 in R, which we must estimate
from the data; εi is an unobservable error that we assume is
present in each observation.


The Model

Remark
The word regression is curious: it means “going backwards”. In this
context we can interpret it as “returning to the origins”, in the sense
that x is called a regressor because it somehow gives origin to y.


Preliminary assumptions of the model

1  εi , i = 1, 2, · · · , n are random variables with E[εi] = 0 and
   V[εi] = σ² ∀i. Also Cov[εi, εj] = 0 ∀ i ≠ j.
2  xi , i = 1, 2, · · · , n are known constants.
3  Consequently yi , i = 1, 2, · · · , n, which results from adding a
   random variable to a constant, is itself a random variable, with
   E[yi] = E[β0 + β1 xi + εi] = β0 + β1 xi and
   V[yi] = V[β0 + β1 xi + εi] = V[εi] = σ² ∀i. As before,
   Cov[yi, yj] = 0 ∀ i ≠ j.


Assumptions of the model

The assumption of constant variance is known as the
homoscedasticity assumption.
The assumption that the expected value of the errors is null
implies that the expected value of the response depends only
on the explanatory variable.
The assumption that the errors are uncorrelated is known as the
no-autocorrelation (uncorrelated errors) assumption; it should not
be confused with non-collinearity, which concerns the regressors.


Parameters Estimation

The parameters of the model are β0, β1 and σ². We call β̂0,
β̂1 and σ̂² the estimators of the respective parameters obtained
from the observed sample.
We call ŷi = β̂0 + β̂1 xi the i-th predicted value and
ε̂i = yi − ŷi the i-th residual, that is, the difference between the
i-th observed value and the i-th value predicted by the model.
Note that the predicted value ŷi estimates E[yi] = β0 + β1 xi,
not yi. In addition, ŷi is an unbiased estimator of E[yi] if β̂0 and β̂1
are also unbiased.
Now, how do we find these estimators in a reasonable way?


The least squares method

A “reasonable” way (of course, not the only one) to estimate
the parameters of the model from the observed sample is to
find the values that minimize the sum of squares of the residuals.
This is called the least squares method:

min over (β̂0, β̂1) of  Σᵢ ε̂i²  =  min over (β̂0, β̂1) of  Σᵢ (yi − β̂0 − β̂1 xi)²

That is, the least squares method finds the β̂0 and β̂1 which
minimize the overall sum of squared differences between
observed and predicted values. Taking partial derivatives,
setting them equal to zero and solving:

β̂1 = (Σᵢ xi yi − n x̄ ȳ) / (Σᵢ xi² − n x̄²)   and   β̂0 = ȳ − β̂1 x̄
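As a quick numerical check, here is a minimal R sketch (simulated data; all
object names such as fit are illustrative) that computes β̂0 and β̂1 from these
formulas and compares them with lm():

set.seed(1)
n <- 30
x <- runif(n, 0, 10)                    # regressor (known constants)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)     # simulated responses

# Closed-form least squares estimates
b1 <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)                        # R's built-in fit
c(b0 = b0, b1 = b1)
coef(fit)                               # same estimates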


The least squares method

The expected values and variances are:

E[β̂1] = β1   and   E[β̂0] = β0

V[β̂1] = σ² / Σᵢ (xi − x̄)²   and   V[β̂0] = σ² [ 1/n + x̄² / Σᵢ (xi − x̄)² ]

Thus, β̂0 and β̂1 obtained by the least squares method are
unbiased estimators of β0 and β1 respectively.
Exercise 1
Calculate E[β̂0], E[β̂1] and, if you’re brave, the variances.


The least squares method


What about σ²?

The least squares method does not provide an estimator for σ².
However, we can find one intuitively based on the definition of
variance. Let's see: V[yi] = σ² = E[(yi − E[yi])²]. The
estimator of E[yi] is ŷi = β̂0 + β̂1 xi, and since a
good estimator of an expected value is an average, we propose:

s² = σ̂² = Σᵢ (yi − ŷi)² / (n − 2) = Σᵢ (yi − β̂0 − β̂1 xi)² / (n − 2)

Normally, for an average, we would divide by n. When we
estimate s² from a random sample we divide by n − 1 because
ȳ must also be estimated, which costs one degree of
freedom; here we must estimate β0 and β1, which
costs two degrees of freedom.
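Continuing the same illustrative simulation, a short check that dividing by
n − 2 reproduces the value reported by lm():

e  <- resid(fit)                   # residuals y_i - yhat_i
s2 <- sum(e^2) / (length(y) - 2)   # divide by n - 2, not n
s2
summary(fit)$sigma^2               # lm reports s, the square root of s2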

The least squares method


What about σ²?

Note that E[s²] = σ², so the proposed s² is an unbiased estimator
of σ². The quantity Σᵢ (yi − ŷi)² is known as the residual sum of
squares or, better, the sum of squares of error, and will be
abbreviated SSE.
Exercise 2
Calculate E[s²].


Hypothesis test and confidence intervals for β1

Our main concern is to determine whether or not there is a
relation between x and y; therefore, the principal hypothesis
to test is H0 : β1 = 0.
If we reject H0 we conclude that our model is valid and that
such a relation does exist; on the contrary, if we cannot reject
H0, we will have to accept that our model is not valid and that a
relation between x and y cannot be established in those terms.
To test this hypothesis we need some distributional theory:

β̂1 ∼ N( β1 , σ² / Σᵢ (xi − x̄)² )

(n − 2) s² / σ² ∼ χ²(n−2)

β̂1 and s² are independent.

Hypothesis test and confidence intervals for β1

With these assumptions,

t = β̂1 / [ s / √Σᵢ (xi − x̄)² ] ∼ t(n−2, δ)

where δ is known as the non-centrality parameter of the
Student's t distribution and is such that

δ = β1 / [ σ / √Σᵢ (xi − x̄)² ]

Note that if β1 = 0 then δ = 0, so under H0, t ∼ t(n−2); that is, t is
distributed as a central Student's t. This is the basis of the
hypothesis test.


Hypothesis test and confidence intervals for β1

Then the procedure is:

Assuming the alternative hypothesis Ha : β1 ≠ 0, calculate |t|.
Find (or compute) the theoretical tα/2,n−2, the upper point of
the central t distribution that leaves a probability of α/2 to its
right. Here α is the significance level of the test and
n − 2 is the degrees of freedom.
If |t| > tα/2,n−2 then we reject H0.
We can also reject H0 if p ≤ α (p is called the p-value). For a
two-sided test, the p-value is twice the probability that tn−2
exceeds |t|.
A 100(1 − α)% confidence interval for β1 is obtained as:

β̂1 ± tα/2,n−2 · s / √Σᵢ (xi − x̄)²
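A sketch of this test and interval in R, reusing the illustrative objects x, y,
n and fit from the earlier simulation:

xb  <- mean(x); Sxx <- sum((x - xb)^2)
b1  <- unname(coef(fit)["x"])
s   <- summary(fit)$sigma
tstat <- b1 / (s / sqrt(Sxx))                       # t statistic for H0: beta1 = 0
pval  <- 2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)
ci    <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * s / sqrt(Sxx)
c(t = tstat, p = pval); ci
summary(fit)$coefficients                           # same t and p-value for x
confint(fit, "x")                                   # same 95% CI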


Coefficient of Determination r²

We can decompose SST = SSR + SSE, where:

SST = Total sum of squares = Σᵢ (yi − ȳ)²
SSR = Sum of squares of regression = Σᵢ (ŷi − ȳ)²
SSE = Sum of squares of error = Σᵢ (yi − ŷi)²

Then

Σᵢ (yi − ȳ)² = Σᵢ (ŷi − ȳ)² + Σᵢ (yi − ŷi)²     (2)


Coefficient of Determination r²

SST represents the total variation of y (corrected by the mean of
the observations), SSR represents the variation of y explained
(accounted for) by the regression, and SSE represents the variation of y
due to error. Then

r² = SSR / SST = Σᵢ (ŷi − ȳ)² / Σᵢ (yi − ȳ)²

So r² gives the proportion of variation in y that is explained by the
model or, equivalently, accounted for by the regression on x. From (2)
we can conclude that 0 ≤ r² ≤ 1 and that r² close to 1 is “good”
for our model while r² close to 0 is “bad”, because it means that the
variation of y is largely explained by error.
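The decomposition (2) and r² can be verified numerically for the same
illustrative fit:

yhat <- fitted(fit)
SST  <- sum((y - mean(y))^2)
SSR  <- sum((yhat - mean(y))^2)
SSE  <- sum((y - yhat)^2)
c(SST = SST, SSR.plus.SSE = SSR + SSE)   # the decomposition (2)
SSR / SST                                # r^2
summary(fit)$r.squared                   # same value from lm
cor(x, y)^2                              # square of the sample correlation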

Coefficient of Determination r²

The coefficient of determination is a measure of the goodness of fit
of the model. It is known as r² precisely because it is the square of
the sample correlation coefficient r. Note, however, that for this
claim to make statistical sense, both Y and X must be considered
random variables, which is not the case here.
Exercise 3
Do exercises 17.1 to 17.3 from Zar (p. 329).


MLRM: Formulation

The Multiple Linear Regression Model is:

y = Xβ + ε     (3)

where

y = [y1 y2 · · · yn]′,  β = [β0 β1 · · · βk]′,  ε = [ε1 ε2 · · · εn]′,

and X is the n × (k + 1) matrix whose i-th row is
[1  xi1  xi2 · · · xik].

Of course, X can also be written as X = [ j  x1  x2 · · · xk ],
where j = [1 1 · · · 1]′ and xi = [x1i x2i · · · xni]′ for
i = 1, 2, · · · , k.


MLRM: Formulation

In (3) we postulate that the response variable y depends on
an intercept and k explanatory variables or regressors
x1, x2, · · · , xk.
We must also emphasize that we have n observations of the
set {y, x1, x2, · · · , xk}.
In (3) we propose k + 1 unknown parameters β0, β1, · · · , βk.
The linearity of the model refers to the relationship between
these parameters, not necessarily between the
variables. For example, the conceptual model (that is, without
reference to observations)
y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + β5 x1 x2 + ε
is linear, but
y = β0 + β1 x1 + β1² x2 + e^(β2) x3 + log(β3) x4 + ε
is not linear for us.

MLRM: Formulation

In (3), X is called the design matrix and is a matrix of known
constants, β is a vector of unknown constants whose elements
must be estimated from the data, ε is an unobservable random
vector, and therefore y is a random vector, since it is the sum of a
vector of constants and a random vector.
Although we can relax the assumptions later, we start by assuming
the following:

E[ε] = 0 ⇒ E[y] = µ = Xβ.
V[ε] = V[y] = Σ = σ²I.
n > k + 1 and r(X) = k + 1.


MLRM: Formulation

In (3) the elements of β are called the regression
coefficients, sometimes referred to as partial regression
coefficients. The word partial carries both a mathematical and
a statistical meaning.
Mathematically, the partial derivative of E[yi] with respect to
xi1, for example, is β1. Thus β1 indicates the change in E[yi]
with a unit increase in xi1 when xi2, xi3, · · · , xik are held
constant.
Statistically, β1 shows the effect of xi1 on E[yi] in the presence
of the other x's. This effect would typically be different from
the effect if the other x's were not present in the model
(except when the columns xi and xj are orthogonal for i ≠ j).


Estimation of the parameters

We must remember that:

ŷ = Ê[y] = Xβ̂  (the vector of predicted values).
ε̂ = y − ŷ  (the vector of residuals).
Also, we can show that Σᵢ (yi − ŷi)² = (y − Xβ̂)′(y − Xβ̂).

Using the least squares method, after an optimization process, the
normal equations are X′X β̂ = X′y.
Considering that (X′X)⁻¹ exists, the least squares
estimator of β is

β̂ = (X′X)⁻¹X′y,   with   E[β̂] = β   and   V[β̂] = σ²(X′X)⁻¹

Since the vector of predicted values is X β̂ = X(X′X)⁻¹X′y, we call
H = X(X′X)⁻¹X′ the hat matrix.
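A minimal R sketch of these matrix formulas on simulated data (all object
names are illustrative), compared with lm():

set.seed(2)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X    <- cbind(1, x1, x2)                  # design matrix with the intercept column j
bhat <- solve(t(X) %*% X, t(X) %*% y)     # solves the normal equations X'X b = X'y
H    <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
yhat <- H %*% y                           # predicted values

mfit <- lm(y ~ x1 + x2)
drop(bhat); coef(mfit)                    # same estimates
all.equal(drop(yhat), unname(fitted(mfit)))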

The least squares method


What about σ²? (again)

We postulate that s² is an unbiased estimator of σ², where

s² = (1/(n − k − 1)) Σᵢ (yi − ŷi)² = (y − Xβ̂)′(y − Xβ̂) / (n − k − 1)

Let's try it on the board, but for this we need to know that if A is a
matrix of constants and y is a random vector with E[y] = µ and
V[y] = Σ, then

E[Ay] = A E[y] = Aµ                 (expected value of a linear form)
V[Ay] = A V[y] A′ = AΣA′            (variance of a linear form)
E[y′Ay] = tr(AΣ) + µ′Aµ             (expected value of a quadratic form)
V[y′Ay] = 2 tr[(AΣ)²] + 4 µ′AΣAµ    (variance of a quadratic form)

The least squares method

According to the Gauss-Markov Theorem, if E[y] = Xβ and
V[y] = σ²I, the least squares estimator β̂ has minimum
variance among all linear unbiased estimators.
Sometimes the phrase “...has minimum variance among all
linear unbiased estimators” is expressed as “...is the best
linear unbiased estimator (BLUE)”.
As a corollary, the BLUE of a′β (a linear
function of the parameters) is a′β̂, where β̂ is the
least squares estimator.


Normal model

Suppose now that in (3) we add the assumption ε ∼ N(0, σ²I), or
equivalently y ∼ N(Xβ, σ²I). In general, if y ∼ N(µ, Σ) and Σ is
positive definite, then

f(y; µ, Σ) = (2π)^(−n/2) |Σ|^(−1/2) exp{ −(y − µ)′Σ⁻¹(y − µ) / 2 }

Then, with our assumption:

f(y; Xβ, σ²I) = (2π)^(−n/2) |σ²I|^(−1/2) exp{ −(y − Xβ)′(σ²I)⁻¹(y − Xβ) / 2 }
              = (2πσ²)^(−n/2) exp{ −(y − Xβ)′(y − Xβ) / (2σ²) }


Normal model

[Figure 1: density surface of N2(µ = 0, σ² = 1), plotted over axes X and Y.]

Normal model

The likelihood is L(β, σ²) = f(y; Xβ, σ²I), viewed as a function of
the parameters, and its logarithm is

ℓ = log[L(β, σ²)] = −(y − Xβ)′(y − Xβ)/(2σ²) − log[(2πσ²)^(n/2)]

Now we can find the maximum likelihood estimators of the
parameters by differentiating ℓ with respect to β and σ², setting the
derivatives equal to 0 and solving:

β̂ = (X′X)⁻¹X′y
σ̂² = (y − Xβ̂)′(y − Xβ̂) / n

Normal model

Note that the estimator σ̂² is biased, since the denominator is n
rather than n − k − 1. We will instead use the unbiased estimator s², given by

s² = (y − Xβ̂)′(y − Xβ̂) / (n − k − 1)

Now suppose that y ∼ N(Xβ, σ²I), with the assumptions made
before. Then β̂ and σ̂² have the following distributional properties:

1  β̂ ∼ N(β, σ²(X′X)⁻¹).
2  nσ̂²/σ² ∼ χ²(n−k−1), or equivalently (n − k − 1)s²/σ² ∼ χ²(n−k−1).
3  β̂ and σ̂² (or s²) are independent.


Sums of Squares

For a random variable y with n observations, the Total Sum of
Squares is Σᵢ yi². Now suppose we want to correct it by the
mean of the observations (that is, to center the observations). Then

Σᵢ (yi − ȳ)² = Σᵢ (yi² − 2 yi ȳ + ȳ²) = Σᵢ yi² − 2 ȳ Σᵢ yi + n ȳ²
            = Σᵢ yi² − 2 ȳ (n ȳ) + n ȳ² = Σᵢ yi² − 2n ȳ² + n ȳ²
            = Σᵢ yi² − n ȳ²


Sums of Squares

The following identities are useful:

Σᵢ yi² = y′y
ȳ = (1/n) Σᵢ yi = (1/n) j′y
n ȳ² = n[(1/n) j′y]² = n[(1/n) j′y][(1/n) j′y], but j′y = y′j, so
n ȳ² = (1/n) y′jj′y = (1/n) y′Jy

Then

Σᵢ (yi − ȳ)² = y′y − (1/n) y′Jy = y′[I − (1/n)J]y

Sums of Squares

Now suppose we want to partition the total sum of squares into
“something” that is due to the regression and “something” that is due
to error. We need to involve ŷi, and ŷ = Xβ̂ = X(X′X)⁻¹X′y
= Hy. So, what if we add and subtract H? Let's see:

y′[I − (1/n)J]y = y′[I − (1/n)J + H − H]y
               = y′[I − H]y + y′[H − (1/n)J]y

Then:

SST = y′[I − (1/n)J]y
SSE = y′[I − H]y
SSR = y′[H − (1/n)J]y

where H = X(X′X)⁻¹X′.



Coefficient of Determination R²

We have just shown that it is possible to partition the
(corrected) total sum of squares into two components: the sum of
squares of the error and the sum of squares due to regression.
That is, SST = SSR + SSE.
The definition of the Coefficient of Determination R² is the
same as before, so

R² = SSR / SST = y′[H − (1/n)J]y / y′[I − (1/n)J]y


Coefficient of Determination R²

0 ≤ R² ≤ 1. Thinking of the sums of squares as estimates
of the variability of the data, we can say that R² will be
close to zero when the variability explained by the regression
is very small relative to the total (which is to say that the response
depends mostly on the error) and, on the contrary, it will be
close to one when the variability explained by the regression
is large with respect to the total.
So R² near one speaks in favor of the model, and near zero
speaks against it.
However, R² can only grow (it never decreases) as you add explanatory
variables (why?), so an artificial way to improve the value of R² is to
add explanatory variables, regardless of whether or not they are
relevant. So this indicator should be viewed carefully.


Generalized Least Squares

Now consider the model:

y = Xβ + ε,   E[ε] = 0,   V[ε] = σ²V     (4)

where all previous assumptions hold, except that
V[ε] = V[y] = Σ = σ²V, with V a positive definite matrix of
constants. Note that:
The model formulated in (4) generalizes the one formulated in
(3) to the situation in which the variances of the errors can be
different from each other and there can be covariances
between them.


Generalized Least Squares

But there exists a non-singular matrix Q such that V = QQ′.
Then, multiplying by Q⁻¹,

Q⁻¹y = Q⁻¹Xβ + Q⁻¹ε
z = Wβ + δ     (5)

calling z = Q⁻¹y, W = Q⁻¹X and δ = Q⁻¹ε. But

E[δ] = E[Q⁻¹ε] = Q⁻¹E[ε] = 0
V[δ] = V[Q⁻¹ε] = Q⁻¹V[ε](Q⁻¹)′
     = Q⁻¹σ²V(Q′)⁻¹ = σ²Q⁻¹QQ′(Q′)⁻¹ = σ²I

So the transformed model (5) satisfies exactly the assumptions of the
original model (3).


Generalized Least Squares

And we know that

β̂ = (W′W)⁻¹W′z
V[β̂] = σ²(W′W)⁻¹
s² = z′[I − H_W]z / (n − k − 1),   with H_W = W(W′W)⁻¹W′

Exercise 4
Prove that

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y
V[β̂] = σ²(X′V⁻¹X)⁻¹
s² = y′[V⁻¹ − V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹]y / (n − k − 1)
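A sketch of this in R on simulated data (the matrix V here is an illustrative
diagonal choice, i.e. unequal error variances): the GLS estimator computed
directly coincides with ordinary least squares on the transformed model (5).

set.seed(3)
n <- 25
x <- runif(n)
V <- diag(1 + 4 * x)                    # illustrative V: heteroscedastic errors
y <- 1 + 3 * x + rnorm(n, sd = sqrt(diag(V)))

X  <- cbind(1, x)
Vi <- solve(V)
bg <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)   # (X'V^-1 X)^-1 X'V^-1 y

# The same estimate via the transformation z = Q^-1 y, W = Q^-1 X with V = QQ'
Q  <- t(chol(V))                        # chol() returns R with V = R'R, so Q = R'
z  <- solve(Q, y); W <- solve(Q, X)
bt <- solve(t(W) %*% W, t(W) %*% z)     # ordinary least squares on (5)
cbind(GLS = drop(bg), transformed = drop(bt))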

Some results about quadratic forms

Let z ∼ Nn(0, I); then z′z ∼ χ²(n).
Let y ∼ Nn(µ, I); then y′y ∼ χ²(n, λ), with λ = (1/2)µ′µ.
If the vi are independent random variables distributed as χ²(ni, λi) for
i = 1, 2, · · · , k, then Σᵢ vi ∼ χ²(Σᵢ ni, Σᵢ λi).
If u ∼ χ²(p) and v ∼ χ²(q) are independent, then

w = (u/p) / (v/q) ∼ F(p, q)

If u ∼ χ²(p, λ) and v ∼ χ²(q) are independent, then

w = (u/p) / (v/q) ∼ F(p, q, λ)

Some results about quadratic forms

Let z ∼ N(0, 1) and u ∼ χ²(p), with z, u independent; then

t = z / √(u/p) ∼ t(p)

Let y ∼ N(µ, 1) and u ∼ χ²(p), with y, u independent; then

t = y / √(u/p) ∼ t(p, µ)

Let y ∼ N(µ, σ²) and u ∼ χ²(p), with y, u independent; then

t = (y/σ) / √(u/p) ∼ t(p, µ/σ)


The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

We recall that the general linear model is

y = Xβ + ε,   E[ε] = 0 ⇒ E[y] = Xβ,   V[ε] = V[y] = σ²I

where y ∼ N(Xβ, σ²I) and r(X) = k + 1 (full column rank).
Note that β̂ = (X′X)⁻¹X′y is the least squares (and also
maximum likelihood) estimator of β. This β̂ is the value of β
that minimizes the error sum of squares ε′ε = (y − Xβ)′(y − Xβ).
Then Ê[y] = Xβ̂ = X(X′X)⁻¹X′y = Hy, naming
H = X(X′X)⁻¹X′ the hat matrix.


The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

Now, assuming that the null hypothesis is true, the model is
subject to a linear constraint that conditions it. So we can
write:

y = Xβ + ε   subject to   Cβ = 0

where C (of dimension q × (k + 1)) is a matrix of known constants such
that r(C) = q (full row rank).
Clearly, finding the value of β that minimizes the error sum of
squares is now a new optimization problem, which can be
solved using Lagrange multipliers; that is, finding the
minimum over β and λ of the function:

u(β, λ) = (y − Xβ)′(y − Xβ) + λ′Cβ


The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

Taking partial derivatives, setting them equal to zero and solving:

The constrained least squares estimator of β is now:

β̃ = (X′X)⁻¹X′y − (X′X)⁻¹C′[C(X′X)⁻¹C′]⁻¹C(X′X)⁻¹X′y

And now

Ẽ[y] = Xβ̃ = X(X′X)⁻¹X′y − X(X′X)⁻¹C′[C(X′X)⁻¹C′]⁻¹C(X′X)⁻¹X′y
     = (H − H*)y

naming H* = X(X′X)⁻¹C′[C(X′X)⁻¹C′]⁻¹C(X′X)⁻¹X′.

It can be shown that H* is symmetric and idempotent and that its
trace is q.

The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

Now, do not forget that we are looking for a way to test the
null hypothesis H0 : Cβ = 0, and we must find a test statistic
that serves this purpose. With the theory we have
developed so far, it is logical to think of two independent
χ² random variables. Note that

y′H*y = y′X(X′X)⁻¹C′[C(X′X)⁻¹C′]⁻¹C(X′X)⁻¹X′y
      = (Cβ̂)′[C(X′X)⁻¹C′]⁻¹Cβ̂

is a quadratic form clearly associated with Cβ. Moreover,
Cβ̂ ∼ Nq(Cβ, σ²C(X′X)⁻¹C′). Let's see: β̂ is normally
distributed and Cβ̂ is a linear form of β̂, so it is normally
distributed too. The dimension of Cβ̂ is q × 1, and

E[Cβ̂] = C E[β̂] = Cβ
V[Cβ̂] = C V[β̂] C′ = σ²C(X′X)⁻¹C′

The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

Then, after some verification, (1/σ²) y′H*y ∼ χ²(q, λ) with

λ = [1/(2σ²)] (Cβ)′[C(X′X)⁻¹C′]⁻¹Cβ

which is zero if and only if the null hypothesis is true.
We can also prove that y′H*y and SSE = y′(I − H)y are
independent.


The General Linear Hypothesis Test H0 : Cβ = 0 vs. Ha : Cβ ≠ 0

Finally,

F = [(y′H*y/σ²)/q] / [(y′(I − H)y/σ²)/(n − k − 1)]
  = (y′H*y/q) / (y′(I − H)y/(n − k − 1)) ∼ F(q, (n−k−1), λ)

with λ = [1/(2σ²)] (Cβ)′[C(X′X)⁻¹C′]⁻¹Cβ.

If the null hypothesis is true, F is central F distributed, so the
decision rule is: reject H0 if F ≥ Fα,q,(n−k−1), where α is
the significance level.
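A sketch of this F test in R on a small simulated data set (all names are
illustrative), testing H0 : β1 = β2 = 0 both with the matrix formula and as a
comparison of nested models:

set.seed(4)
n <- 30; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)              # in truth, x2 has no effect
X  <- cbind(1, x1, x2)

C <- rbind(c(0, 1, 0),                   # H0: beta1 = 0 and beta2 = 0
           c(0, 0, 1))
q <- nrow(C)
XtXi <- solve(t(X) %*% X)
bhat <- XtXi %*% t(X) %*% y
SSE  <- sum((y - X %*% bhat)^2)
Cb   <- C %*% bhat
Fobs <- drop(t(Cb) %*% solve(C %*% XtXi %*% t(C)) %*% Cb / q) /
        (SSE / (n - k - 1))
c(F = Fobs, p = pf(Fobs, q, n - k - 1, lower.tail = FALSE))

# The same test as a comparison of nested models
anova(lm(y ~ 1), lm(y ~ x1 + x2))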


The General Linear Hypothesis Test H0 : Cβ = t vs. Ha : Cβ ≠ t

For this generalization it can also be proved that:

F = [(Cβ̂ − t)′[C(X′X)⁻¹C′]⁻¹(Cβ̂ − t)/q] / [SSE/(n − k − 1)]
  ∼ F(q, (n−k−1), λ)

with λ = [1/(2σ²)] (Cβ − t)′[C(X′X)⁻¹C′]⁻¹(Cβ − t).

As before, if the null hypothesis is true, F is central F
distributed, so the decision rule is: reject H0 if
F ≥ Fα,q,(n−k−1), where α is the significance level.


The General Linear Hypothesis Test: Particularizations


When H0 : a′β = 0.

If a is a vector of known constants, it is clear that in this case
C = a′ and t = 0, so:

F = (a′β̂)′[a′(X′X)⁻¹a]⁻¹(a′β̂) / s² = (a′β̂)² / (s²[a′(X′X)⁻¹a])
  ∼ F(1, (n−k−1), λ),   with   λ = (a′β)² / (2σ² a′(X′X)⁻¹a)

Because q = 1, a′β̂, a′β and a′(X′X)⁻¹a are scalars. Then
we reject H0 if F ≥ Fα,1,(n−k−1).


The General Linear Hypothesis Test: Particularizations


Testing one βj. H0 : βj = 0.

Now a′ = [0 0 · · · 0 1 0 · · · 0 0], with the 1 in
position j. Then:

F = β̂j² / (s² gjj) ∼ F(1, (n−k−1), λ),   with   λ = βj² / (2σ² gjj)

where gjj is the (j, j) element of the matrix (X′X)⁻¹; of course,
this is the j-th element of its diagonal.
Since F has 1 and n − k − 1 degrees of freedom, we can
equivalently use the Student's t distribution:

t = β̂j / (s √gjj) ∼ t((n−k−1), λ)

Then we reject H0 if |t| ≥ tα/2,(n−k−1).



Confidence Interval for βj

Whatever the value of βj (not necessarily zero), β̂j − βj has
expected value equal to 0 and, using the same distributional
theory as in the t test, we have:

t = (β̂j − βj) / (s √gjj) ∼ t(n−k−1)

And we can write:

P[|t| ≥ tα/2,(n−k−1)] = α
P[|t| < tα/2,(n−k−1)] = 1 − α
P[−tα/2,(n−k−1) < t < tα/2,(n−k−1)] = 1 − α


Confidence Interval (CI) for βj

P[ −tα/2,(n−k−1) < (β̂j − βj)/(s √gjj) < tα/2,(n−k−1) ]
= P[ −tα/2,(n−k−1) s √gjj < β̂j − βj < tα/2,(n−k−1) s √gjj ]
= P[ −β̂j − tα/2,(n−k−1) s √gjj < −βj < −β̂j + tα/2,(n−k−1) s √gjj ]
= P[ β̂j + tα/2,(n−k−1) s √gjj > βj > β̂j − tα/2,(n−k−1) s √gjj ]
= P[ β̂j − tα/2,(n−k−1) s √gjj < βj < β̂j + tα/2,(n−k−1) s √gjj ]
= 1 − α

Then a 100(1 − α)% CI for βj is β̂j ± tα/2,(n−k−1) s √gjj.
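Reusing the matrix objects from the previous sketch (X, y, bhat, XtXi, SSE,
n, k, all illustrative), the interval for one coefficient computed by hand
agrees with confint():

s   <- sqrt(SSE / (n - k - 1))
j   <- 2                                    # coefficient of x1 (column 2 of X)
gjj <- XtXi[j, j]
tq  <- qt(0.975, df = n - k - 1)
bhat[j] + c(-1, 1) * tq * s * sqrt(gjj)     # 95% CI for beta_1 by hand
confint(lm(y ~ x1 + x2))["x1", ]            # same interval from lm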


Confidence Interval (CI) for σ²

We know that (n − k − 1)s²/σ² ∼ χ²(n−k−1). Then:

P[ χ²(1−α/2),(n−k−1) ≤ (n − k − 1)s²/σ² ≤ χ²(α/2),(n−k−1) ] = 1 − α

From the right-hand inequality:

(n − k − 1)s²/σ² ≤ χ²(α/2),(n−k−1)
(n − k − 1)s²/χ²(α/2),(n−k−1) ≤ σ²

And from the left-hand inequality:

χ²(1−α/2),(n−k−1) ≤ (n − k − 1)s²/σ²
σ² ≤ (n − k − 1)s²/χ²(1−α/2),(n−k−1)

Confidence Interval (CI) for σ²

So:

P[ (n − k − 1)s²/χ²(α/2),(n−k−1) ≤ σ² ≤ (n − k − 1)s²/χ²(1−α/2),(n−k−1) ] = 1 − α

Then a 100(1 − α)% CI for σ² is

(n − k − 1)s²/χ²(α/2),(n−k−1) ≤ σ² ≤ (n − k − 1)s²/χ²(1−α/2),(n−k−1)
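With the same illustrative objects, a 95% interval for σ² in R; note that
qchisq(0.975, df) is the upper point written χ²(α/2) above:

s2 <- SSE / (n - k - 1)
df <- n - k - 1
c(lower = df * s2 / qchisq(0.975, df),
  upper = df * s2 / qchisq(0.025, df))   # 95% CI for sigma^2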


Model Validation and Diagnostics

We consider various approaches to checking the model and
its attendant assumptions for adequacy and validity. We
discuss some properties of the residuals, the hat matrix,
outliers, the influence of individual observations, and leverage.
The residual vector is ε̂ = y − ŷ, that is, the difference
between the response vector and the vector of predictions:

ε̂ = y − ŷ = y − Xβ̂ = y − X(X′X)⁻¹X′y = (I − H)y

Also,

ε̂ = (I − H)y = (I − H)(Xβ + ε) = Xβ + ε − HXβ − Hε
  = Xβ + ε − Xβ − Hε = ε − Hε = (I − H)ε

because HX = X(X′X)⁻¹X′X = X.

Model Validation and Diagnostics

Note that ŷ = Ê[y] = Xβ̂ = X(X′X)⁻¹X′y = Hy. From this
comes the name “hat matrix”: H transforms y into y-hat.
Also note that

HX = X
H [ j  x1 · · · xk ] = [ j  x1 · · · xk ]  ⇒  Hj = j

And from ε̂ = (I − H)ε, note that the elements of H must be
close to 0 if ε̂ is to be used as a reasonable approximation of ε.


Model Validation and Diagnostics

Some useful results:

E[ε̂] = 0                                        (6)
V[ε̂] = σ²(I − H)                                (7)
Cov[ε̂, y] = σ²(I − H)                           (8)
Cov[ε̂, ŷ] = 0                                   (9)
ε̂′j/n = 0  (the residuals average to zero)      (10)
ε̂′y = y′(I − H)y                                (11)
ε̂′ŷ = 0                                         (12)
ε̂′X = 0′                                        (13)
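A quick numerical check of (10), (12) and (13) for the illustrative fit used
in the earlier sketches:

fit2 <- lm(y ~ x1 + x2)
e    <- resid(fit2)
mean(e)                  # (10): the residuals average to zero
sum(e * fitted(fit2))    # (12): residuals orthogonal to the predicted values
t(e) %*% X               # (13): residuals orthogonal to every column of X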


Model Validation and Diagnostics

If x and y are two random variables, the sample correlation
coefficient is defined as

r_xy = s_xy / √(s_x² s_y²)
     = [x′[I − (1/n)J]y/(n − 1)] / √{ [x′[I − (1/n)J]x/(n − 1)] [y′[I − (1/n)J]y/(n − 1)] }
     = x′[I − (1/n)J]y / √{ x′[I − (1/n)J]x · y′[I − (1/n)J]y }


Model Validation and Diagnostics

Examining this numerator for ε̂ and y:

ε̂′[I − (1/n)J]y = ε̂′y − (1/n)ε̂′Jy = ε̂′y

due to (10). Also, for ε̂ and ŷ:

ε̂′[I − (1/n)J]ŷ = ε̂′ŷ − (1/n)ε̂′Jŷ = 0

due to (10) and (12). And, due to (10) and (13), this
numerator is 0 for ε̂ and every column of the matrix X. Then

r(ε̂, ŷ) = 0
r(ε̂, xi) = 0,  i = 1, · · · , k


Model Validation and Diagnostics

If the model and its attendant assumptions are correct, then a
plot of the residuals versus the predicted values should show no
systematic pattern. The k plots of the residuals versus each of
X's columns should show only random variation.

Exercise 5
Using the Hematology Data in Table 10.1 of Rencher (p. 253),
described in Example 10.3, postulate a “purely additive” linear
model with all explanatory variables and study the residuals.
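A possible sketch of this exercise in R, assuming (as in the R code shown
later in these slides) that the data are in a data frame dat with response y
and regressors named x1, ..., x5, and that the fit is stored in mod1:

library(ggplot2)
# x = TRUE stores the design matrix in mod1$x, which R code 2 below relies on
mod1 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat, x = TRUE)

# Residuals versus predicted values: we hope to see no systematic pattern
ggplot(data.frame(fitted = fitted(mod1), resid = resid(mod1)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted values", y = "Residuals")

# Residuals versus each regressor, one panel per column of X
long <- data.frame(resid     = rep(resid(mod1), 5),
                   regressor = rep(paste0("x", 1:5), each = nrow(dat)),
                   x         = unlist(dat[paste0("x", 1:5)], use.names = FALSE))
ggplot(long, aes(x = x, y = resid)) +
  geom_point() +
  facet_wrap(~ regressor, scales = "free_x")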


Model Validation and Diagnostics


The role of the matrix H.

For H = {hij}, i, j = 1, 2, · · · , n, we have:

1/n ≤ hii ≤ 1
−0.5 ≤ hij ≤ 0.5, ∀ i ≠ j
tr(H) = Σᵢ hii = k + 1

What do you think about the lower bound of hii?


Model Validation and Diagnostics


Outliers.

An outlier is an atypical data value, one that appears very
far from the rest and is suspected of being a measurement
error, a transcription error, or something of the sort.
Since the fit of the linear model is based on means, and
means are very sensitive to extreme values, an outlier
can produce a substantial change in the estimates and can
even affect the hypothesis tests.
So the first impulse is to detect it and remove it from the data
set.
But what if you have very little data? How can we be sure
that it is an error and not a correct, plausible value?
Perhaps we should remove it from the set, but never before
thoroughly investigating its true meaning.

Model Validation and Diagnostics


Outliers.


We need to rescale the residuals, computing ε̂i/[σ√(1 − hii)] (the
standardized residuals) or the studentized residuals

ri = ε̂i / (s √(1 − hii))

Our approach to checking for outliers is to plot the studentized
residuals versus ŷi or versus i, the observation number.
There are other approaches to this analysis, but they are
beyond the scope of this course.
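A sketch of this plot in R for the (assumed) fit mod1 from Exercise 5;
rstandard() computes exactly ri = ε̂i/(s√(1 − hii)):

r <- rstandard(mod1)              # studentized residuals
plot(fitted(mod1), r,
     xlab = "Predicted values", ylab = "Studentized residuals")
abline(h = c(-2, 0, 2), lty = c(2, 1, 2))
which(abs(r) > 2)                 # observations worth a closer look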


Model Validation and Diagnostics


Influential Observations and Leverage.

An influential observation is an observation (yi, xi′) whose presence
makes a major difference in β̂ and Xβ̂.
A point (yi, xi′) is potentially influential if it is an outlier in the
y direction or if it is unusually far from the center of the
x's.
Let's look at the blood data for x4: neutrophil count.


Model Validation and Diagnostics


Influential Observations and Leverage.

[Figure 2: Scatter plot (y vs. x4), with the fitted regression line and two
potentially influential points circled.]



Model Validation and Diagnostics

R code 1
# Scatter plot of y vs. x4 with the fitted line; the two circled points
# are highlighted as potentially influential (requires ggplot2)
library(ggplot2)
ggplot(dat, aes(x = x4, y = y)) +
  geom_point(colour = 'darkblue', shape = 21, size = 2) +
  geom_smooth(method = 'lm', colour = 'red', se = TRUE) +
  geom_point(aes(x = 42, y = 34), data = dat, size = 10,
             shape = 1, color = 'green') +
  geom_segment(aes(x = 37, y = 34, xend = 41, yend = 34),
               lwd = 0.2, col = 'darkgreen',
               arrow = arrow(length = unit(0.03, "npc"))) +
  geom_point(aes(x = 17, y = 61), data = dat, size = 10,
             shape = 1, color = 'green') +
  geom_segment(aes(x = 17, y = 53, xend = 17, yend = 59),
               lwd = 0.2, col = 'darkgreen',
               arrow = arrow(length = unit(0.03, "npc"))) +
  ylab('y') +
  xlab('x4')

Model Validation and Diagnostics


Influential Observations and Leverage.

Again, ŷ = Hy; then, for each i = 1, 2, · · · , n:

ŷi = [hi1 hi2 · · · hin] [y1 y2 · · · yn]′ = Σⱼ hij yj = hii yi + Σ_{j≠i} hij yj

But it is also true that Hj = j, so Σⱼ hij = 1 ∀i.
Then, when hii is large (close to 1), the rest of the hij are
necessarily small and yi has a great influence on ŷi. Hence, hii
is called the leverage of yi (in Spanish, “apalancamiento”).


Model Validation and Diagnostics


Influential Observations and Leverage.

When an observation (yi, xi′) has a value of hii near 1, the
estimated regression equation will pass close to yi; that is,
ŷi − yi will be small, and the observation has high potential for
influencing the regression results.
We should examine any observation whose leverage is unusually
large relative to the other values. We know that
Σᵢ hii = k + 1, so some authors suggest that if we find an
hii whose value equals or exceeds 2(k + 1)/n (twice the
average) we should regard it as a potentially influential point.
High leverage points can be either good or bad: a point that lies along
the direction of the regression line can help decrease the
variance of the estimators, while one lying in a discordant direction
affects the fit too much.


Model Validation and Diagnostics


Influential Observations and Leverage.

The strategy is to compare two fits: one that includes the
suspicious point and one that excludes it (the latter indicated by
the (i) subscript). To do this we use Cook's distance, defined as:

Di = (β̂(i) − β̂)′ X′X (β̂(i) − β̂) / [(k + 1)s²]
   = (Xβ̂(i) − Xβ̂)′(Xβ̂(i) − Xβ̂) / [(k + 1)s²]
   = (ŷ(i) − ŷ)′(ŷ(i) − ŷ) / [(k + 1)s²]
   = [ri² / (k + 1)] [hii / (1 − hii)]


Model Validation and Diagnostics


Influential Observations and Leverage.

Di is proportional to the squared Euclidean distance between
ŷ(i) and ŷ. Thus, if Di is large, the observation (yi, xi′) has
substantial influence on both β̂ and ŷ = Xβ̂.
Revisiting the blood data, let's look at an interesting piece of R code.


Model Validation and Diagnostics

R code 2
# Leverages: mod1 must have been fitted with x = TRUE so that mod1$x is
# the design matrix X (an assumption about how mod1 was created)
X  <- mod1$x
H  <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
hv <- diag(H); hv
hatvalues(mod1)                             # same leverages, computed by R
lmax <- 2*ncol(mod1$x)/nrow(mod1$x); lmax   # threshold 2(k+1)/n, since ncol(X) = k+1
cd <- round(cooks.distance(mod1), 6); cd
# Observations ordered by x4 and by y, with their leverages and Cook's distances
sx4 <- as.data.frame(sort(dat$x4, decreasing=TRUE, index.return=TRUE))
sx4_y <- cbind(sx4, y=dat$y[sx4$ix], h=hv[sx4$ix], c=cd[sx4$ix]); sx4_y
sy <- as.data.frame(sort(dat$y, decreasing=TRUE, index.return=TRUE))
sy_x4 <- cbind(sy, x4=dat$x4[sy$ix], h=hv[sy$ix], c=cd[sy$ix]); sy_x4


Model Validation and Diagnostics


Influential Observations and Leverage.

So, what to do? Easy: detect high-influence points and
inquire about their origin; if they are due to measurement or
transcription errors, discard them; otherwise, do not discard them
(but keep them under observation...).
