1 A case in Medicine
EXAMPLE 1 The following data set records the plasma levels of total cholesterol
(in mg/ml) of 24 patients with hypercholesterolemia admitted to a hospital:
3.5 1.9 4.0 2.6 4.5 3.0 2.9 3.8 2.1 3.8 4.1 3.0
2.5 4.6 3.2 4.2 2.3 4.0 4.3 3.9 3.3 3.2 2.5 3.3
Question: Predict the cholesterol level of the next patient to be admitted to the
hospital with hypercholesterolemia.
Intuitive answer: Use the average of the 24 observations: 3.354 (horizontal
reference line in Fig. 1(a)). Observations scatter around the average but are
subject to considerable variation.
[The above is justifiable if, for example, the observations are i.i.d. In the
absence of further information, this seems to be the best we can do.]
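The intuitive answer can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are not part of the original notes.

```python
from statistics import mean

# Plasma total cholesterol (mg/ml) of the 24 patients, in the order listed above.
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

# With no other information, the sample mean serves as the prediction
# for the next patient.
prediction = mean(cholesterol)
print(f"{prediction:.3f}")  # prints 3.354
```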
Suppose the hospital has also collected data on the ages of the 24 patients:
46 20 52 30 57 25 28 36 22 43 57 33
22 63 40 48 28 49 52 58 29 34 24 50
[Fig. 1. Scatterplot (a): level of cholesterol against patient. Scatterplot (b): level of cholesterol against age.]
We can see a strong linear relationship between the two variables. As far
as prediction is concerned, it seems more reliable to assume a linear function
relating age and cholesterol level of hypercholesterolemia patients, and predict
the next patient's cholesterol level from his/her age.
Is cholesterol level = b + m × age? (functional relation)
No! The relationship is far from perfect/exact (it's a statistical relation)!
We can say that
Fig. 1(b) fits a sloped straight line to the scatterplot by the least squares
method (to be discussed later). This straight line summarizes the relationship
between cholesterol level and age and can be used for predicting future
patients' cholesterol levels. Compared to Fig. 1(a), fluctuations of the 24
observations around the sloped straight line are much smaller. A function
linear in age (the sloped straight line) can better account for the observed
variation in cholesterol level than a simple constant function (the horizontal
line).
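The comparison between the horizontal line and the sloped line can be made concrete with a short sketch. Two assumptions are made here: the ages and cholesterol levels are paired by their position in the two listings, and the line is fitted with the usual closed-form least squares formulas (the notes defer the method to a later section). All variable names are illustrative.

```python
# Ages and cholesterol levels (mg/ml) of the 24 patients.
# ASSUMPTION: the i-th age belongs to the i-th cholesterol value.
age = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(cholesterol) / n

# Closed-form least squares estimates for: cholesterol = b + m * age
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, cholesterol))
s_xx = sum((x - x_bar) ** 2 for x in age)
m = s_xy / s_xx
b = y_bar - m * x_bar

# Sum of squared deviations around the horizontal line (the mean) ...
sse_mean = sum((y - y_bar) ** 2 for y in cholesterol)
# ... and around the fitted sloped line.
sse_line = sum((y - (b + m * x)) ** 2 for x, y in zip(age, cholesterol))

print(f"slope m = {m:.4f}, intercept b = {b:.3f}")
print(f"SSE around mean = {sse_mean:.2f}, SSE around line = {sse_line:.2f}")
```

A smaller sum of squared deviations around the sloped line is the numerical counterpart of the smaller fluctuations seen in Fig. 1(b).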
EXAMPLE 2
(b) In a study of house prices, n recent transactions are sampled. Each
observation may have the form
(c) To study the effects of 2 types of medical treatments, a clinical trial is
conducted on a sample of n patients in a hospital, some of whom receive the
first treatment while the rest receive the second treatment. Each observation
may have the form
(d) A physicist brings a pendulum to a remote planet and studies the
gravitational force there. He/she varies the length of the pendulum and measures
its period on n separate occasions. Each observation may have the form
(e) An electrician wants to determine the resistance of an electrical circuit.
He/she passes several different known currents through the circuit and measures
the corresponding voltages. Each observation may have the form
(current, voltage).
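For (d) and (e), standard physics indicates the kind of relation one would look for in the data. The formulas below are background knowledge rather than part of these notes: for an ideal simple pendulum with small swings, the squared period is proportional to the length, and for an ideal resistor the voltage is proportional to the current. Because of measurement error, real observations follow these relations only approximately.

```latex
\begin{align*}
\text{(d)}\quad & T = 2\pi\sqrt{\frac{\ell}{g}}
  \;\Longrightarrow\; T^{2} = \frac{4\pi^{2}}{g}\,\ell
  && (T:\ \text{period},\ \ell:\ \text{length},\ g:\ \text{gravitational acceleration})\\
\text{(e)}\quad & V = R\,I
  && (V:\ \text{voltage},\ I:\ \text{current},\ R:\ \text{resistance})
\end{align*}
```

Under these idealizations, g and R could each be estimated from the slope of a straight line fitted to the measurements.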
DEFINITION 1 The variable which is of primary interest is called the
response variable (output variable, output, Y-variable or dependent variable),
whereas the remaining variables are called predictor variables (input variables,
inputs, X-variables, regressors or independent variables).
Example 2 (cont'd)
There are two types of variables: quantitative and qualitative (or categorical).
[In the earlier chapters we will focus only on quantitative variables. Qualitative
variables will be dealt with in later chapters.]
3 Steps in regression
Regression analysis uses a mathematical model to predict a variable y from values
of other predictor variables x1, x2, ..., xk. When using the model to predict y for
a particular set of values x1, x2, ..., xk, we will want a measure of the reliability of
our prediction. That is, we will want to know how large the error of prediction
might be.
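As a rough illustration with a single predictor, the sketch below fits a straight line to the cholesterol data of Section 1, predicts y for one value of x, and reports the residual standard error as a simple indication of how large prediction errors tend to be. The pairing of ages with cholesterol values by position, the helper `fit_line`, and the age 45 are assumptions made for this example. A complete treatment would also account for the uncertainty in the estimated coefficients.

```python
import math

# Data from Section 1. ASSUMPTION: values are paired by position.
age = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = b + m * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    m = s_xy / s_xx
    return y_bar - m * x_bar, m


b, m = fit_line(age, cholesterol)

# Residual standard error: typical size of a deviation from the fitted line.
# The divisor n - 2 reflects the two estimated coefficients (b and m).
residuals = [yi - (b + m * xi) for xi, yi in zip(age, cholesterol)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (len(age) - 2))

new_age = 45  # hypothetical age of the next patient
predicted = b + m * new_age
print(f"predicted level at age {new_age}: {predicted:.2f} mg/ml")
print(f"residual standard error: {s:.2f} mg/ml")
```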
      - how to calculate? best in the sense of what? minimizing something?
(v) model adequacy checking
      - ascertain the quality of fit
      - is the model adequate? should we modify our model to obtain a
        better fit?
      - we may need to iterate the above phases many times before coming
        up with a satisfactory model.
(vi) model simplification [optional]
      - can we simplify our model? perhaps by dropping some ineffective
        predictor variables, or by transforming the variables?
(vii) making inference
      - try to answer questions of our interest (depending on the context
        of the problem)
      - quantify confidence in our answers using statistical arguments
      - make suggestions, conclusions etc.
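To give a flavour of steps (v) and (vii), the sketch below revisits the cholesterol data of Section 1. It computes R², one common summary of the quality of fit, and the ratio of the estimated slope to its standard error, one common way to quantify confidence that age is an effective predictor. These particular measures are standard choices picked for illustration, not necessarily the ones the notes go on to use, and the pairing of ages with cholesterol values by position is an assumption.

```python
import math

# Data from Section 1. ASSUMPTION: values are paired by position.
age = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(cholesterol) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_yy = sum((y - y_bar) ** 2 for y in cholesterol)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, cholesterol))

# Least squares line: cholesterol = b + m * age
m = s_xy / s_xx
b = y_bar - m * x_bar

# (v) Quality of fit: share of the variation in y accounted for by the line.
sse = sum((y - (b + m * x)) ** 2 for x, y in zip(age, cholesterol))
r_squared = 1 - sse / s_yy

# (vii) Confidence in the slope: estimate divided by its standard error.
s = math.sqrt(sse / (n - 2))       # residual standard error
se_slope = s / math.sqrt(s_xx)     # standard error of the slope
t_ratio = m / se_slope

print(f"R^2 = {r_squared:.3f}")
print(f"slope = {m:.4f}, standard error = {se_slope:.4f}, ratio = {t_ratio:.1f}")
```

A ratio far from zero suggests the slope is not just noise, so dropping age in step (vi) would not be a sensible simplification for these data.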