Regression

Regression Technique
Web support
Simple regression - a reminder
Multiple regression - an introduction
Reporting regression analyses
Choosing regressors (predictor variables)
Choosing a regression model
Model checking - residuals
Simple Regression
Establish equation for the best-fit line:

y = bx + a
The best-fit line is the same as the regression line
b is the regression coefficient for x
a is the constant (the value of y when x = 0)
x is the predictor or regressor variable for y
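The best-fit line can be worked out by hand. The following is a minimal sketch of the usual least-squares calculation for y = bx + a; the function name fit_line and the data are illustrative assumptions, not part of the lecture material.

```python
def fit_line(x, y):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = sum of cross-deviations divided by sum of squared x deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # The line always passes through the point (mean_x, mean_y)
    a = mean_y - b * mean_x
    return b, a


# Illustrative data only: these points lie exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b, a = fit_line(x, y)
print(f"y = {b:.2f}x + {a:.2f}")
```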
Multiple Regression
Establish equation for the best-fit line:

y = b1x1 + b2x2 + b3x3 + a
Where:
b1 = regression coefficient for variable x1 (likewise b2 for x2 and b3 for x3)
a = constant
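As a rough illustration of how the coefficients b1, b2, ... and the constant a are obtained, the sketch below solves the least-squares normal equations in plain Python. The function name fit_multiple_regression and the data are assumptions made for this example; statistics packages such as SPSS use more robust numerical routines.

```python
def fit_multiple_regression(rows, y):
    """Least-squares fit of y = b1*x1 + ... + bk*xk + a.

    rows: one [x1, ..., xk] list per case; y: observed DV values.
    Returns ([b1, ..., bk], a). Fails with a division by zero if the
    regressors are perfectly collinear.
    """
    # Design matrix with a trailing column of ones for the constant a
    X = [list(r) + [1.0] for r in rows]
    p = len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(len(X)))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    beta = [A[r][p] for r in range(p)]
    return beta[:-1], beta[-1]


# Illustrative data only, generated from y = 2*x1 + 3*x2 + 5
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7]]
y = [13, 12, 23, 22, 36]
coefs, constant = fit_multiple_regression(rows, y)
print(coefs, constant)
```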
Multiple Regression
R2 - Goodness of fit
Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .721 (a)   .520       .399                17.70134

a. Predictors: (Constant), AGE, GENDER, INCOME
For multiple regression, R2 will get larger (it can never decrease) every time another independent variable (regressor/predictor) is added to the model
Add work stress to the model?
A new regressor may provide only a tiny improvement in the amount of variance in the data explained by the model
Need to establish the added value of each additional regressor in predicting the DV
Multiple Regression
R2adj - adjusted R-square
Takes into account the number of regressors in the model
Calculated as:
R2adj = 1 - (1-R2)(N-1)/(N-n-1)
where:
N = number of data points
n = number of regressors
You don't need to memorise this equation, but
Note that R2adj will always be smaller than R2 (for any model with at least one regressor and R2 below 1)
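The formula can be checked against the Model Summary table. The sketch below assumes N = 16 data points, inferred from the total df of 15 in the ANOVA table for the same model; the function name is illustrative.

```python
def adjusted_r_squared(r2, n_points, n_regressors):
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)


# R2 = .520 with three regressors (AGE, GENDER, INCOME); N = 16 is an assumption
r2_adj = adjusted_r_squared(0.520, 16, 3)
print(round(r2_adj, 3))
```

The result is about 0.400, in line with the .399 reported by SPSS, which works from the unrounded R2.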
How well does a model explain the variation in the dependent variable?
Effectiveness vs Efficiency
Effectiveness:
maximises R2
i.e. maximises the proportion of variance explained by the model
Efficiency:
maximises the increase in R2adj upon adding another regressor
i.e. if a new regressor doesn't add much to the variance explained, it is not worth adding
How well does a model explain the variation in the dependent variable?
Effectiveness (R2 and R2adj)

0 - 25%     very poor and likely to be unacceptable
25 - 50%    poor, but may be acceptable
50 - 75%    good
75 - 90%    very good
90% +       likely that there is something wrong with your analysis
Are the regressors, taken together, significantly associated with the dependent variable?
ANOVA (b)

Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   4065.388         3    1355.129      4.325   .028 (a)
Residual     3760.050         12   313.337
Total        7825.438         15

a. Predictors: (Constant), AGE, GENDER, INCOME
b. Dependent Variable: DEPRESS
The Analysis of Variance test checks whether the model, as a whole, has a significant relationship with the DV
Part of the predictive value of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole (i.e. all regressors/IVs together)
Read off the ANOVA table in the SPSS output, and report as you did in the week 3/4 assignments
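To see where the numbers in the ANOVA table come from, the sketch below recomputes the mean squares, the F-ratio and R2 from the sums of squares and df shown in the SPSS output. The variable names are illustrative; the significance level is read from the Sig. column rather than computed here.

```python
# Values copied from the ANOVA table (Regression and Residual rows)
ss_regression, df_regression = 4065.388, 3
ss_residual, df_residual = 3760.050, 12

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-ratio = variance explained by the model relative to unexplained variance
f_ratio = ms_regression / ms_residual

# R2 = proportion of the total sum of squares explained by the model
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"F({df_regression}, {df_residual}) = {f_ratio:.3f}, R2 = {r_squared:.3f}")
```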
What relationship does each individual regressor have with the dependent variable?
Coefficients (a)

             Unstandardized Coefficients   Standardized Coefficients
Model 1      B           Std. Error        Beta                        t        Sig.
(Constant)   68.285      15.444                                        4.421    .001
INCOME       -9.34E-02   .029              -.682                       -3.178   .008
GENDER       3.306       8.942             .075                        .370     .718
AGE          -.162       .344              -.101                       -.470    .646

a. Dependent Variable: DEPRESS
SPSS output table entitled Coefficients
Column headed Unstandardised coefficients - B
Gives the regression coefficient for each regressor variable (IV), with all the other variables held constant
Units of the coefficient are units of the DV per unit of the regressor (IV)

e.g. dependent variable: score on video game (in points)
regressor: time of day (in hours)
B coefficient for time = 844.57
score = (B coefficient x time) + constant
score = (844.57 x time) - 4239.6
This means that for every increase of one hour in the variable time, we would predict that a person's score will increase by 844.57 points

dependent variable: score on video game
regressor: gender
Gender coded so that:
1 = male, 2 = female
Let B coefficient for gender = 100.00

So,
score = (100.00 x gender) + constant
Adding 1 to the variable gender means that we go from male to female
This means that females would be expected to score 100.00 points more than males
Remember that the B coefficient is calculated on the basis that 1 = male and 2 = female (different coding will give a different coefficient)
Which regressor has the most effect on the dependent variable?
Units for each regression coefficient are different, so we must standardise them if we want to compare one with another
Column headed Standardised coefficients - Beta
Can compare the Beta weights for each regressor variable to compare the effects of each on the dependent variable
A larger Beta weight (in absolute size) indicates a stronger effect of the regressor on values of the DV
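A Beta weight is the B coefficient rescaled by the spread of the regressor and of the DV: Beta = B x (standard deviation of the regressor / standard deviation of the DV). The sketch below applies that relation; the function name and the data are illustrative assumptions, not values from the SPSS output.

```python
from statistics import stdev


def standardised_beta(b, x_values, y_values):
    """Convert an unstandardised coefficient B into a Beta weight."""
    return b * stdev(x_values) / stdev(y_values)


# Illustrative data only: score rises by exactly 2 points per hour
hours = [1, 2, 3, 4]
score = [2, 4, 6, 8]
beta = standardised_beta(2.0, hours, score)
print(round(beta, 3))
```

Because the result no longer depends on the units of measurement, Beta weights for regressors measured in hours, pounds or years can be compared directly.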
Are the relationships of each regressor with the dependent variable statistically significant?
Assessed using a t-test
Check values in the columns headed t and Sig.
If the regression coefficient is negative, then the t-value will also be negative (the sign does not matter; it is the size of t that is important)
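Each t-value in the Coefficients table is the B coefficient divided by its standard error. The sketch below recomputes them from the printed values; the dictionary layout is an illustrative choice. The INCOME result differs slightly from the printed -3.178 because its standard error is shown rounded to .029.

```python
# (B, Std. Error) pairs copied from the Coefficients table
coefficients = {
    "(Constant)": (68.285, 15.444),
    "INCOME": (-9.34e-02, 0.029),
    "GENDER": (3.306, 8.942),
    "AGE": (-0.162, 0.344),
}

# t = B / standard error; judge significance on the size of t, not its sign
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<11} t = {t:7.3f}   |t| = {abs(t):.3f}")
```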
Reporting regression analyses
How should I report a regression analysis?
Reporting Regression analyses
Describe the characteristics of the model before you describe the significance of the relationship
So:
1. R2, R2adj - how well does the model fit the data?
2. F(m, n) - is the relationship significant? (m = regression df, n = residual df)
3. Regression equation - how to calculate values of the DV from known values of the IVs?
4. Describe results in plain English
Reporting Regression analyses

We want to predict IQ score using brain size (MRI), height and gender as regressors
Units:
IQ: IQ points
brain size (MRI): pixels
height: centimetres
gender: 0 = male, 1 = female
Reporting Regression analyses (1)
SPSS output tells us that:

R2 = 21.7%
R2adj = 14.6%
SPSS output tells us that:

F(3, 33) = 3.051, p < 0.05
Regression equation:
y = b1x1 + b2x2 + b3x3 + a
IQ = (1.824 x 10^-4 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
   = (0.0001824 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
   = (0.0002 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
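The regression equation can be used to calculate a predicted IQ for given values of the regressors. In the sketch below the coefficients are taken from the equation above, with height entering negatively as the written report states; the function name and the input values are hypothetical.

```python
def predict_iq(mri_pixels, height_cm, gender):
    """Predicted IQ from the fitted equation; gender: 0 = male, 1 = female."""
    return 1.824e-4 * mri_pixels - 0.316 * height_cm + 2.426 * gender - 6.411


# Hypothetical people with the same brain size and height
woman = predict_iq(900_000, 170, 1)
man = predict_iq(900_000, 170, 0)

print(round(woman, 3), round(man, 3), round(woman - man, 3))
```

With the other regressors held constant, the two predictions differ by exactly the gender coefficient.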
The regression was a poor fit, describing only 21.7% of the variance in IQ (R2adj = 14.6%), but the overall relationship was statistically significant (F(3, 33) = 3.05, p < 0.05).
With other variables held constant, IQ scores were negatively related to height, decreasing by 0.32 IQ points for every extra centimetre in height, and positively related to brain size, increasing by 0.0002 IQ points for every extra pixel of the scan. Women tended to have higher scores than men, by 2.43 IQ points. However, the effect of brain size (MRI) was the only significant effect (t(33) = 2.75, p = 0.01).
Break
Five minutes - please be back promptly
Selecting Regressors
What do we want of a regressor?
To have a significant effect on the dependent variable
Ability to discriminate between values of the dependent variable
How well do potential regressors predict the Dependent Variable?
Dichotomous variable (e.g. gender)
Compare using t-test
If significant, then the possible regressor predicts differences in the dependent variable
How well do potential regressors predict the Dependent Variable?
Continuous variable (e.g. height)
Compare using correlation
If significant, then the possible regressor predicts differences in the dependent variable
Some of the discriminatory value in a regressor may be accounted for by regressors already present in the model
gender, income, height
age, experience, value of property
In the presence of all regressors
Adding a regressor may not add as much to the model's predictive value as you might have anticipated
What makes the best model?
Same number of regressors
Choose model with highest value of R2adj
This gives best value per regressor
Will also have the highest value of R2 and F
Different number of regressors
Highest value of R2adj (more regressors)
Highest value of F (fewer regressors)
Efficiency vs Effectiveness
Effective: highest R2 (most complete)
will have more regressors
will be effective, but not efficient
Efficient: highest F-ratio (most significant)
will have fewer regressors
will be efficient, but not particularly effective
Compromise: largest increase in R2adj (best of both worlds)
will contain only the best regressors available
manageable number of regressors and reasonably effective
Minitab's BREG command
Tries every possible combination of available regressors (up to a maximum of 20)
e.g. 20 regressors give over 1,000,000 different models
Command:
Dependent variable is in column 10
Independent variables are in columns 1 to 6
BREG C10 C1-C6
You will not be required to carry out this type of analysis in the exam, but you need to be able to interpret the output
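The number of models that a best-subsets search has to consider can be counted directly: each regressor is either in or out of the model, and the empty model is excluded. The sketch below does the count; the function name is illustrative.

```python
from math import comb


def count_models(n_regressors):
    """Number of non-empty subsets of the available regressors."""
    return sum(comb(n_regressors, k) for k in range(1, n_regressors + 1))


print(count_models(6))    # six regressors, as in BREG C10 C1-C6
print(count_models(20))   # the maximum of 20 regressors
```

Six regressors give 63 candidate models; twenty give 1,048,575, which is the "over 1,000,000" quoted above.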
Sample of BREG output

MTB > BREG C13 C1-C12
Best Subsets Regression
Response is prodebt
304 cases used; 160 cases contain missing values.

Vars   R-Sq   Adj. R-Sq   C-p   s
7      19.3   17.4        7.3   0.65539
7      19.1   17.2        7.8   0.65602
8      19.9   17.7        6.9   0.65388
8      19.5   17.4        8.2   0.65536
9      20.2   17.8        7.8   0.65375
9      20.1   17.6        8.3   0.65434
10     20.4   17.6        9.3   0.65427

Regressor columns (an X marks a regressor included in that model): incomegp, house, children, singpar, agegp, bankacc, bsocacc, manage, ccarduse, cigbuy, xmasbuy, locintrn
incomegp, manage, ccarduse, xmasbuy and locintrn are marked X in all seven models shown; cigbuy is marked in six
BREG output
Best two models for each possible number of regressors are displayed in the output
Compare R2adj values directly
Select best model(s)
Run normal regression in SPSS for each selected model
Compare F-ratio values
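The comparison of R2adj values can be sketched as a simple ranking of the rows in the sample BREG output. The tuple layout and the choice of a three-model shortlist are assumptions made for this example.

```python
# (Vars, R-Sq, Adj. R-Sq, C-p, s) copied from the sample BREG output
models = [
    (7, 19.3, 17.4, 7.3, 0.65539),
    (7, 19.1, 17.2, 7.8, 0.65602),
    (8, 19.9, 17.7, 6.9, 0.65388),
    (8, 19.5, 17.4, 8.2, 0.65536),
    (9, 20.2, 17.8, 7.8, 0.65375),
    (9, 20.1, 17.6, 8.3, 0.65434),
    (10, 20.4, 17.6, 9.3, 0.65427),
]

# Rank by adjusted R-squared (third field), highest first
ranked = sorted(models, key=lambda m: m[2], reverse=True)

# Shortlist a few candidates to re-run as ordinary regressions in SPSS
shortlist = ranked[:3]
for n_vars, r_sq, adj_r_sq, c_p, s in shortlist:
    print(f"{n_vars} regressors: R-Sq = {r_sq}, Adj. R-Sq = {adj_r_sq}")
```

Note that the 10-regressor model has the highest R-Sq but not the highest Adj. R-Sq, which is the effectiveness versus efficiency trade-off in practice.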
Best Subset Regression model
Identify best subset of regressors from BREG output
Must run ordinary regression procedure
calculates F-ratio
calculates individual coefficients and significance
Highest R2adj values result in significant F-ratios
if F-ratio not significant, check data and procedure
BUT: Advisable to try two or three models, as the number of respondents contributing to each analysis may not be the same between Minitab and SPSS
Equivalent SPSS procedures
Choose procedure by selecting appropriate tab in drop-down menu
Enter procedure:
Adds all regressors to model simultaneously
Calculates F-ratio and R2adj for all regressors
Stepwise procedure:
Adds regressors one at a time
Calculates F-ratio and R2adj for each set of regressors
considers taking regressors out at each stage
Missing values
Frequently have values missing from data set
missed out questions
couldn't understand question
couldn't collect data for some reason
Must specify missing values in SPSS in the Define Variable window
Differences in R2adj or F-ratio values are most likely to be due to missing values
Leads to different n in each analysis
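The effect of missing values on n can be illustrated with a small sketch. It assumes cases are dropped whenever any variable in the model is missing; the data, variable names and function name are hypothetical.

```python
# Hypothetical survey rows; None marks a missing answer
rows = [
    {"age": 34, "income": 21000, "stress": 5, "depress": 40},
    {"age": 51, "income": None, "stress": 7, "depress": 62},
    {"age": 29, "income": 18000, "stress": None, "depress": 35},
    {"age": 45, "income": 30000, "stress": 3, "depress": 28},
]


def complete_cases(rows, variables):
    """Keep only the rows with a value for every variable in the model."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# The same data set contributes a different n to each model
n_model_a = len(complete_cases(rows, ["depress", "age", "income"]))
n_model_b = len(complete_cases(rows, ["depress", "age", "income", "stress"]))
print(n_model_a, n_model_b)
```

Adding the stress variable removes one more case, so the two models are fitted to different sets of respondents.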
Model checking
Residuals (general)
Unusual observations - outliers
Model checking - Residuals
Predicted value for y (dependent variable)

y = b1x1 + b2x2 + ... + a
Actual (observed) value for y
Residual = actual (observed) value minus predicted (calculated) value
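A minimal sketch of the residual calculation follows. The observed and predicted values are made up for illustration, and the function name is an assumption.

```python
def residuals(observed, predicted):
    """Residual = observed value minus predicted value, case by case."""
    return [obs - pred for obs, pred in zip(observed, predicted)]


# Illustrative values only
observed = [12.0, 15.0, 21.0]
predicted = [10.0, 16.0, 21.0]
res = residuals(observed, predicted)
print(res)
```

A positive residual means the model under-predicted that case, a negative residual means it over-predicted, and zero means the point lies on the fitted line.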

[Figure: two scatterplots of Symptom Index against dose in mg, one for Drug A and one for Drug B, captioned "Good fit: low residuals" and "Moderate fit: larger residuals"]

Residuals should be:
Normally distributed: some big, some small, most average-sized
Independent of one another: no constant covariation with one another
Almost identical in terms of variance, regardless of the values of the IVs or DVs
These things are easy to check with the SPSS plots option
Model checking - Unusual observations
Outliers
Linear regression would work quite well for this data, except for the presence of three outlier points
[Figure: scatterplot of EXAM (vertical axis) against ANXIETY (horizontal axis)]
Dealing with outliers
Run regression analysis
Plot data on a scattergram
Remove outliers by deleting the rows in SPSS
Run regression analysis again
Note any qualitative differences:
if there are qualitative differences, then check data. If no errors, report both analyses
if only quantitative differences, then leave outliers in analysis, noting their presence
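The steps above can be sketched for a single regressor as follows. The data are hypothetical, the outlier rows are assumed to have been identified from a scattergram, and a change in the sign of the slope is used here as one possible example of a qualitative difference.

```python
def slope(x, y):
    """Least-squares regression coefficient b for y = b*x + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data; the last three rows are the suspected outliers
anxiety = [2, 4, 6, 8, 10, 12, 18, 19, 20]
exam = [60, 55, 50, 45, 40, 35, 75, 78, 80]
outlier_rows = {6, 7, 8}

# Run the regression with all rows, then again without the outliers
slope_all = slope(anxiety, exam)
kept = [i for i in range(len(anxiety)) if i not in outlier_rows]
slope_trimmed = slope([anxiety[i] for i in kept], [exam[i] for i in kept])

# A change of direction is a qualitative difference: check data, report both
qualitative_difference = (slope_all > 0) != (slope_trimmed > 0)
print(round(slope_all, 3), round(slope_trimmed, 3), qualitative_difference)
```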
Justification
Removing outliers
Plotting data may indicate that some participants belong to a separate subsample, e.g. people with an exam phobia?
[Figure: scatterplot of EXAM (vertical axis) against ANXIETY (horizontal axis)]
Residuals
Differences between actual and predicted values (i.e. residual values) should show a normal distribution
Some large positive
Some large negative
But mostly small (positive or negative), or zero
i.e. Normally distributed
[Figure: DV vs IV, scatterplot of EXAM (vertical axis) against ANXIETY (horizontal axis)]
Residuals
If our best-fit line does not fit too well, this will be revealed in the distribution of the residuals
[Figure: DV vs IV, scatterplot of EXAM (vertical axis) against ANXIETY (horizontal axis)]
Questions?
Call Veera at 012-2313979
