Chap 1

MH3510 REGRESSION ANALYSIS
Division of Mathematical Sciences, NTU

Chapter 1 Introduction
Regression analysis is a statistical methodology that utilizes the relation
between two or more quantitative variables so that a response or outcome variable
can be predicted from the other, or the others. This methodology is widely used
in business, the social and behavioral sciences, the biological sciences, and many
other disciplines. A few examples of applications are listed below.
1 A case in Medicine
EXAMPLE 1 The following data set records the plasma levels of total cholesterol
(in mg/ml) of 24 patients with hypercholesterolemia admitted to a hospital:
3.5 1.9 4.0 2.6 4.5 3.0 2.9 3.8 2.1 3.8 4.1 3.0
2.5 4.6 3.2 4.2 2.3 4.0 4.3 3.9 3.3 3.2 2.5 3.3
Question: Predict the cholesterol level of the next patient to be admitted to the
hospital with hypercholesterolemia.
Intuitive answer: Use the average of the 24 observations: 3.354 (horizontal
reference line in Fig. 1(a)). Observations scatter around the average but are subject
to considerable fluctuation.
[The above is justifiable if, for example, the observations are i.i.d. In the
absence of further information, this seems to be the best we can do.]
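The intuitive answer can be reproduced with a few lines of Python. This is only a minimal sketch; the names `cholesterol`, `sample_mean` and `prediction` are chosen here for illustration and do not appear in the notes.

```python
# Plasma cholesterol levels (mg/ml) of the 24 patients in Example 1.
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]


def sample_mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# With no other information, the sample mean is used as the prediction
# for the next patient's cholesterol level.
prediction = sample_mean(cholesterol)
print(round(prediction, 3))  # 3.354
```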
Suppose the hospital has also collected data on the ages of the 24 patients:
46 20 52 30 57 25 28 36 22 43 57 33
22 63 40 48 28 49 52 58 29 34 24 50
Each observation (corresponding to each patient) consists of values of two
variables:
(X, Y ) = (age, cholesterol level).
[Figure 1, two panels, vertical axis "Level of cholesterol" in both.
Scatterplot (a): cholesterol level against patient index.
Scatterplot (b): cholesterol level against age.]
Figure 1: Plasma levels of total cholesterol (in mg/ml).
We can see a strong linear relationship between the two variables. As far
as prediction is concerned, it seems more reliable to assume a linear function
relating age and cholesterol level of hypercholesterolemia patients, and to predict
the next patient's cholesterol level from his/her age.
Is it true that
cholesterol level = b + m × age ? (functional relation)
No! The relationship is far from perfect/exact (it is a statistical relation)!
We can only say that
E(cholesterol level) = b + m × age.
That is, cholesterol level is a random variable, and the straight line describes
its expected value at a given age.
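One common way to make the statistical relation explicit is sketched below. It assumes an additive random error term, written here as ε, with mean zero; this symbol is not used in the notes up to this point and is introduced only for illustration. Y denotes the cholesterol level and X the age.

```latex
Y = b + mX + \varepsilon, \qquad E(\varepsilon) = 0
\quad\Longrightarrow\quad E(Y) = b + mX .
```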
Fig. 1(b) fits a sloped straight line to the scatterplot by the least squares
method (to be discussed later). This straight line summarizes the relationship
between cholesterol level and age and can be used for predicting future
patients' cholesterol levels. Compared to Fig. 1(a), fluctuations of the 24
observations around the sloped straight line are much smaller. A function
linear in age (the sloped straight line) can better account for the observed
variation in cholesterol level than a simple constant function (the horizontal
line).
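The comparison between the horizontal line and the sloped line can be checked numerically. The sketch below assumes the usual closed-form least-squares formulas for a single predictor (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The notes defer the method to a later chapter, so this is an illustration rather than the notes' own derivation, and the function and variable names are hypothetical.

```python
# Ages and cholesterol levels (mg/ml) of the 24 patients in Example 1.
age = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]


def fit_line(x, y):
    """Return (intercept b, slope m) of the least-squares line y = b + m*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return b, m


b, m = fit_line(age, cholesterol)

# Sum of squared deviations around the horizontal line (the sample mean).
y_bar = sum(cholesterol) / len(cholesterol)
sse_constant = sum((y - y_bar) ** 2 for y in cholesterol)

# Sum of squared deviations around the fitted sloped line.
sse_line = sum((y - (b + m * x)) ** 2 for x, y in zip(age, cholesterol))

print(f"intercept b = {b:.4f}, slope m = {m:.4f}")
print(f"squared deviations: constant = {sse_constant:.2f}, line = {sse_line:.2f}")
```

On these data the slope comes out at roughly 0.053 and the intercept at roughly 1.28, and the sum of squared deviations drops from about 13.9 around the mean to about 2.5 around the line, which is the reduction in fluctuation described above.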
The above example highlights the importance in data analysis of collecting
data on some other variables (e.g. age) relevant to the main variable of
interest (e.g. cholesterol level) in order to obtain a model which can better
explain the observed variation in the main variable.
2 General problem and terminology

Typical observational or experimental studies involve drawing a sample of n
observations from a population about which inference is to be made.
EXAMPLE 2
(a) n members of a society are sampled in a survey. Information is collected on
each member's opinion (towards some social issues) and on his/her other
characteristics like sex, age, educational level etc. Here each observation
can be represented in the form
(sex, age, educational level, ... , opinion),
(b) In a study of house prices, n recent transactions are sampled. Each
observation may have the form
(size, building age, price, facilities, location),
(c) To study the effects of 2 types of medical treatments, a clinical trial is
conducted on a sample of n patients in a hospital, some of whom receive the
first treatment and the rest the second treatment. Each observation
may have the form
(age, sex, past medical record, smoking behaviour, type of medical
treatment applied, response to treatment),
(d) A physicist brings a pendulum to a remote planet and studies the gravita-
tional force there. He/she varies the length of the pendulum and measures
its period on n separate occasions. Each observation may have the form
(length of pendulum, period).
(e) An electrician wants to determine the resistance of an electrical circuit. He/she
passes several different known currents through the circuit and measures
the corresponding voltages. Each observation may have the form
(current, voltage).
In general, each observation consists of measurements on a number of
variables related to an individual sampled from the population.
DEFINITION 1 The variable which is of primary interest is called the
response variable (output variable, output, Y-variable or dependent variable),
whereas the remaining variables are called predictor variables (input variables,
inputs, X-variables, regressors or independent variables).
EXAMPLE 2 (cont'd)
(a) Response: opinion
Predictor: sex, age, educational level, etc.
(b) Response: price
Predictor: size, building age, facilities, location.
(c) Response: response to treatment
Predictor: age, sex, past medical record, smoking behaviour, type of medical
treatment applied.
(d) Response: period
Predictor: length of pendulum.
(e) Response: voltage
Predictor: current.
There are two types of variables: quantitative and qualitative (or categorical).
DEFINITION 2 Quantitative variables can be measured in a numerical form: e.g.
age, income, time, temperature etc.
Qualitative variables are not numerical in nature: e.g. gender, categorized age,
education level, type of crime committed, style of cuisine served in a restaurant
etc.
[In the earlier chapters we will focus only on quantitative variables. Qualitative
variables will be dealt with in later chapters.]
Our objective in linear modelling is to study the relationship between
predictor and response variables based on the sample collected. Typical questions to
ask include:
- does a certain predictor variable significantly affect the response?
- is there a simple mathematical formula relating the response to the
  predictor variables?
- can we predict the response in a future observation based on the values of
  the predictor variables?
Corresponding to the above questions, regression analysis serves three major
purposes: (1) description of the relation between variables; (2) control of predictor
variables for a given value of a response variable; and (3) prediction of a response
based on predictor variables.
3 Steps in regression
Regression analysis uses a mathematical model to predict a variable y from values
of other predictor variables x1, x2, ..., xk. When using the model to predict y for
a particular set of values x1, x2, ..., xk, we will want a measure of the reliability of
our prediction. That is, we will want to know how large the error of prediction
might be.
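As a rough illustration with a single predictor (k = 1, age), the sketch below predicts the cholesterol level for a new patient from the Example 1 data and reports the residual standard error, sqrt(SSE / (n - 2)), as one simple indication of how large the prediction error might be. The choice of this measure, the age of 40 and all names in the code are assumptions made for illustration; the notes do not specify a reliability measure in this section.

```python
import math

# Ages and cholesterol levels (mg/ml) of the 24 patients in Example 1.
age = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(cholesterol) / n

# Least-squares slope and intercept for one predictor.
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, cholesterol))
m = sxy / sxx
b = y_bar - m * x_bar


def predict(new_age):
    """Predicted cholesterol level at the given age under the fitted line."""
    return b + m * new_age


# Residual standard error: typical size of a deviation from the fitted line.
residuals = [y - predict(x) for x, y in zip(age, cholesterol)]
residual_se = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

new_age = 40  # hypothetical age of the next patient
print(f"predicted level at age {new_age}: {predict(new_age):.2f} mg/ml")
print(f"residual standard error: {residual_se:.2f} mg/ml")
```

The residual standard error only summarizes the scatter of the observed data around the line; a proper prediction interval would also account for the uncertainty in the estimated intercept and slope.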
(a) Practical regression analysis consists of the following phases:
(i) collection of the sample
(ii) graphical display of observed data
  - scatterplot, scatterplot matrix, etc.
(iii) formulation of model
  - are we fitting a straight line, an exponential curve or anything else?
  - no useful model can explain the observed data perfectly; how are
    we accounting for the discrepancies between the model and the
    observations in our model formulation?
(iv) model fitting
  - calculate the best estimates of the parameters in the model
  - how to calculate? best in the sense of what? minimizing something?
(v) model adequacy checking
  - ascertain the quality of fit
  - is the model adequate? should we modify our model to obtain a
    better fit?
  - we may need to iterate the above phases many times before coming
    up with a satisfactory model.
(vi) model simplification [optional]
  - can we simplify our model? perhaps by dropping some ineffective
    predictor variables, or by transforming the variables?
(vii) making inference
  - try to answer questions of our interest (depending on the context
    of the problem)
  - quantify confidence in our answers using statistical arguments
  - make suggestions, conclusions etc.
