
Review on Linear Regression Analysis

• Example: City X has recently conducted a travel survey of its 10 zones. The collected data are summarized in Table 1. Based on the given information:
  – How can we predict the total number of trips that will be produced by each zone in 10 years (assuming the zonal population in 10 years is known)?
  – How can we relate Y (total trips produced by a zone) to X1 (zonal population)?

Table 1 Observed Trips and Zonal Activity

Zone #   Pop. (X1)   Households (X2)   Trips (Y)
1        7212        2488              5126
2        4818        2188              3764
3        8789        2423              4030
4        5805        2141              3921
5        3054        1241              1644
6        10463       3451              4467
7        2735        1857              3407
8        1784        905               1143
9        4418        1695              3202
10       6657        1960              3900

[Figure: map of the 10 zones of City X]

Our First Attempt (informal): City X

• To answer the questions, we need to relate Y to the Xs!
• Plot the observed Y vs. X1 (scatter chart) and then add a trend line.

[Figure: scatter plot of Total Zonal Trips vs. Zonal Population, with Total Trips Produced by a Zone on the vertical axis and Zonal Population on the horizontal axis]

Outline: Simple Linear Regression

„ Simple Linear Regression Analysis „ Problem: consider two variables, X and Y, and we have n paired
observations on them (data): (x1,y1), (x2,y2), …, (xn,yn), what is the
„ Multiple Linear Regression relationship between Y and X?
„ Understanding the Regression Outputs „ Method: Assumed Y and X are linearly related, that is
„ Validating a Regression Model Y = b0 + b 1 X
where: X - independent variable (a factor that is considered to
have impact on Y!)
Y - dependent variable
b0, b1 - regression coefficients
„ Our Goals:
– Find the best value for b0 and b1 (but what do we mean by “best”?)
3 – Find out whether or not the identified relationship is good, i.e., 4
validate the model

Our Second Attempt (formal): City X

• Plot Y vs. X1 (scatter graph) and the straight line Y = b0 + b1 X1 with different intercepts (b0) and slopes (b1).

[Figure: scatter plot of Total Zonal Trips vs. Zonal Population with a candidate line Y = b0 + b1 X; for each observed point (xi, yi), the corresponding point on the line is (xi, b0 + b1 xi)]

Results of Our Second Attempt: City X

• b0 and b1 can be determined mathematically as follows: we shall find b0 and b1 so as to minimize the total estimation error (E), where the error for each observation is
      ei = yi − (b0 + b1 xi)
      E = Σ ei² = Σ [yi − (b0 + b1 xi)]²
  This is called the method of least squares.
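The least-squares principle above can be checked numerically: evaluate E(b0, b1) on the City X data of Table 1 and confirm that the fitted line reported later in these notes (Y = 1546.2 + 0.3435 X1) has a smaller total error than nearby alternative lines. A minimal sketch:

```python
# Evaluate the least-squares objective E(b0, b1) = sum of squared errors
# on the City X data (Table 1).
pop = [7212, 4818, 8789, 5805, 3054, 10463, 2735, 1784, 4418, 6657]
trips = [5126, 3764, 4030, 3921, 1644, 4467, 3407, 1143, 3202, 3900]

def sse(b0, b1):
    """Total estimation error E = sum_i [yi - (b0 + b1*xi)]^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(pop, trips))

e_fit = sse(1546.2, 0.3435)
# Perturbing either coefficient away from the least-squares values increases E.
assert e_fit < sse(1546.2, 0.40)
assert e_fit < sse(1200.0, 0.3435)
```

The value of e_fit here is the residual sum of squares (about 5.22 million), which reappears in the ANOVA table of the Excel output below.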

Estimating the Regression Coefficients

• Data: (x1,y1), (x2,y2), …, (xn,yn)
• From the data we can calculate the following statistics (sums over i = 1, …, n):
      x̄ = (1/n) Σ xi ,  ȳ = (1/n) Σ yi
      S_XX = Σ (xi − x̄)² ,  S_YY = Σ (yi − ȳ)²
      S_XY = Σ (xi − x̄)(yi − ȳ)
      S²_Y|X = (S_YY − S²_XY / S_XX) / (n − 2)
• b0 and b1 can then be determined by
      b1 = S_XY / S_XX ,  b0 = ȳ − b1 x̄

Our Third Attempt (using Excel):

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.780
R²                0.608
Adjusted R²       0.560
Standard Error    808.122
Observations      10

              Coefficients   Standard Error   t Stat
Intercept     1546.214       600.034          2.577
X Variable 1  0.343          0.097            3.526
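The closed-form expressions above can be applied directly to the Table 1 data to reproduce the Excel output. A minimal sketch (pure Python, no spreadsheet needed):

```python
# Compute the least-squares coefficients for the City X data (Table 1)
# from the summary statistics x̄, ȳ, S_XX, S_YY, S_XY defined above.
pop = [7212, 4818, 8789, 5805, 3054, 10463, 2735, 1784, 4418, 6657]
trips = [5126, 3764, 4030, 3921, 1644, 4467, 3407, 1143, 3202, 3900]
n = len(pop)

x_bar = sum(pop) / n                                   # x̄
y_bar = sum(trips) / n                                 # ȳ
s_xx = sum((x - x_bar) ** 2 for x in pop)              # S_XX
s_yy = sum((y - y_bar) ** 2 for y in trips)            # S_YY
s_xy = sum((x - x_bar) * (y - y_bar)
           for x, y in zip(pop, trips))                # S_XY

b1 = s_xy / s_xx          # slope, ≈ 0.343
b0 = y_bar - b1 * x_bar   # intercept, ≈ 1546.2

# S_Y|X, reported by Excel as "Standard Error" (≈ 808.12)
se_yx = ((s_yy - s_xy ** 2 / s_xx) / (n - 2)) ** 0.5
```

The three printed quantities match the "Coefficients" and "Standard Error" entries of the Excel summary output.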

Scatter Plot and Regression Line

• Example 1 (continued): the fitted line on the City X data is
      Y = 1546.2 + 0.343 X1
      R² = 0.608;  t1 = 3.526

[Figure: scatter plot of HBW productions (Trips by Zone, Y) vs. Zone Population, X1, with the fitted regression line]

Multiple Regression

• Example 1 (continued): it seems that the total number of trips produced by a zone should also be related to the households of that zone (in addition to population). Can we relate Y to both X1 (zonal population) and X2 (zonal households)?
• In general:
  – Data: (y1, x11, x12, …), (y2, x21, x22, …), …, (yn, xn1, xn2, …)
  – Assumed relationship:
        Y = b0 + b1 x1 + b2 x2 + … + br xr
  – Problem: how to find the coefficients b0, b1, b2, …
• The principle used to solve the problem is the same as for simple linear regression, that is, the method of least squares.

Multiple Regression Using Excel: City X

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.891
R²                0.794
Adjusted R²       0.735
Standard Error    627.121
Observations      10

              Coefficients   Standard Error   t Stat
Intercept     491.361        637.458          0.771
X Variable 1  0.804          0.575            1.398
X Variable 2  0.552          0.398            1.385

Resulting equation: Y = 491.361 + 0.804 X1 + 0.552 X2

Understanding Regression Output

• We obtained Y = b0 + b1 X, but
  – how good is the final regression equation?
  – how good are the estimates b0 and b1?
  – …
• Regression coefficients are estimates of the "true" population coefficients. If the true relationship is
      Y = β0 + β1 x
  then b0 and b1 are estimates of β0 and β1, and a different sample would result in different b0 and b1!
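The point that b0 and b1 are sample estimates can be illustrated by refitting the model on a slightly different sample. In this sketch we drop one zone from the City X data (zone 6 is a hypothetical choice for illustration) and observe that the coefficients change:

```python
# Refitting on a different sample gives different estimates of β0 and β1.
pop   = [7212, 4818, 8789, 5805, 3054, 10463, 2735, 1784, 4418, 6657]
trips = [5126, 3764, 4030, 3921, 1644, 4467, 3407, 1143, 3202, 3900]

def fit(xs, ys):
    """Least-squares intercept and slope for one sample."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

b0_all, b1_all = fit(pop, trips)
# Drop zone 6 (the highest-population zone) and refit.
b0_sub, b1_sub = fit(pop[:5] + pop[6:], trips[:5] + trips[6:])
assert abs(b1_sub - b1_all) > 0.01  # different sample, different estimates
```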

Regression Output from Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7801
R Square            0.6085
Adjusted R Square   0.5595
Standard Error      808.12
Observations        10

ANOVA
            df   SS            MS         F          Significance F
Regression  1    8119632.892   8119633    12.43318   0.007777
Residual    8    5224491.608   653061.5
Total       9    13344124.5

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     1546.2         600.0341482      2.576876   0.032775   162.5315    2929.896
X Variable 1  0.3435         0.097406453      3.526072   0.007777   0.118842    0.568082

Understand Regression Output (1): Regression Coefficients, b0, b1, ...

Example 1: Y = 1546.2 + 0.343 X1

b0 = 1546.2, interpreted as: the number of trips that would be produced if there were no population in the zone (X1 = 0)?!

b1 = 0.343, interpreted as: one additional person in a zone would produce an additional 0.343 trips!
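The interpretation of b1 as "additional trips per additional person" makes the fitted equation directly usable for forecasting. A small sketch, using a hypothetical zone whose 10-year population forecast is 5000:

```python
# Apply the fitted City X equation Y = 1546.2 + 0.343 X1 to predict
# zonal trip production (the population value 5000 is illustrative).
b0, b1 = 1546.2, 0.343

def predict_trips(population):
    """Predicted total trips produced by a zone of the given population."""
    return b0 + b1 * population

print(predict_trips(5000))  # 1546.2 + 0.343*5000 = 3261.2 trips
```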

Regression Output from Excel (annotated)

(Same summary output as above.) Annotations:
– The "Standard Error" entry (808.12) under Regression Statistics is the S_Y|X value.
– The "Standard Error" column entry for X Variable 1 (0.0974) is the standard error (deviation) of the regression coefficient b1.

Understand Regression Output (2): Confidence Interval and Hypothesis Testing for bi

• A (1−α)·100% confidence interval for the parameter β1 in the true regression line is
      b1 ± t(α/2, n−k−1) · s(b1)
  where s(b1) is the standard error of b1 and k is the number of independent variables.
• Test the hypothesis H0: β1 = 0. If this hypothesis is not rejected, then X1 is not a significant factor influencing Y and should not be included in the function!
  If |t| > t(α/2, n−k−1), reject H0 at the (1−α)·100% confidence level.
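These two calculations can be reproduced from the Excel output with a t-table value. A sketch for α = 0.05, where n − k − 1 = 10 − 1 − 1 = 8 degrees of freedom gives t(0.025, 8) = 2.306:

```python
# Significance test and confidence interval for b1, from the Excel output.
b1, se_b1 = 0.3435, 0.0974
t_crit = 2.306                 # t(0.025, 8) from a t-table

t_stat = b1 / se_b1            # ≈ 3.53, matching the "t Stat" column
assert abs(t_stat) > t_crit    # reject H0: β1 = 0 at the 95% level

# 95% confidence interval for β1: b1 ± t_crit * s(b1)
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(lo, 3), round(hi, 3))  # 0.119 0.568, as in the Excel output
```

So population is a significant factor: the interval [0.119, 0.568] does not contain zero, which is the same conclusion as |t| > t_crit.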

Regression Output from Excel (annotated)

(Same summary output as above.) Annotations:
– The "Lower 95%" and "Upper 95%" entries give the 95% confidence interval of b1.
– The "t Stat" entry is the t value.

Understand Regression Output (3): Coefficient of Determination, R²

• R² is a measure of the overall quality of the regression; specifically, it is the percentage of the total variation in Y that is accounted for by the sample regression line.
• A high value of R² means that most of the variation observed in Y (yi) can be attributed to the corresponding X values.
• There is, however, NO standard for how high an R² value is "good" enough; it depends on the application! In transportation planning, an R² value over 0.3 is often acceptable.
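The "percentage of total variation accounted for" reading of R² can be verified from the ANOVA table of the Excel output, since R² is the regression sum of squares divided by the total sum of squares:

```python
# Recover R² from the ANOVA table: R² = SS_regression / SS_total.
ss_regression = 8119632.892   # "Regression" SS row
ss_total = 13344124.5         # "Total" SS row

r_squared = ss_regression / ss_total
print(round(r_squared, 4))    # 0.6085, the "R Square" entry
```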

Regression Output from Excel (annotated)

(Same summary output as above.) Annotation: the "R Square" entry (0.6085) is the R² value.

Understand Regression Output (3): Coefficient of Determination, R² (cont.)

• R² is a measure of the overall quality of the regression. Examples:

[Figure: example scatter plots of Y vs. X illustrating different degrees of fit, i.e., different R² values]

Example 1 (continued)

• Regression output:

Regression Statistics
Multiple R        0.780
R²                0.608
Adjusted R²       0.560
Standard Error    808.122
Observations      10

              Coefficients   Standard Error   t Stat
Intercept     1546.214       600.034          2.577
X Variable 1  0.343          0.097            3.526

• Questions:
  – R²?
  – Is variable X1 (population) significant at the 95% confidence level?
  – What is the 95% confidence interval of the slope (β1)?

Validate Regression Model (pitfalls in regression analysis!)

• Important assumptions used in linear regression analysis:
  – Y is linearly related to x
  – For any given x, Y is normally distributed
  – The residual error has constant standard deviation (for all x)
  – The independent variables are NOT linearly related (multiple regression only!)
• How can we find out whether or not these conditions are met?

Check for Linearity

• Objective: is Y really linearly related to x?
• Method: check visually from the scatter plot.

[Figure: example scatter plots of Y vs. X — approximately linear patterns vs. a clearly nonlinear pattern]

Check for Normality

• Objective: is the residual (ei) normally distributed?
• Method: plot a histogram of the residuals, ei = yi − (b0 + b1 xi). (Note that a more formal procedure is available, but it will not be covered in this course!)

[Figure: example residual-frequency histograms — a bell-shaped one (normal) and a skewed one (not normal)]
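Both residual checks start from the same quantities. A sketch that computes the City X residuals ei = yi − (b0 + b1 xi), ready for a histogram or a residual-vs-X plot; it also checks one built-in property of least squares, namely that residuals of a model with an intercept sum to (numerically) zero:

```python
# Compute the residuals of the simple City X model for diagnostic plots.
pop   = [7212, 4818, 8789, 5805, 3054, 10463, 2735, 1784, 4418, 6657]
trips = [5126, 3764, 4030, 3921, 1644, 4467, 3407, 1143, 3202, 3900]
n = len(pop)

x_bar, y_bar = sum(pop) / n, sum(trips) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(pop, trips))
      / sum((x - x_bar) ** 2 for x in pop))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(pop, trips)]
assert abs(sum(residuals)) < 1e-6  # OLS residuals sum to zero
```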

Check for Homogeneous Variance

• Objective: is the standard deviation of the residual (e) constant across all X values?
• Method: check the scatter plot of the residuals vs. each X variable.

[Figure: two "X Variable 1 Residual Plot" examples — one with homogeneous variance (residuals scattered in a constant band around zero) and one with non-homogeneous variance (the spread changes with X)]

Check for Multicollinearity

• Objective: is independent variable Xi linearly correlated with independent variable Xj?
• Method: check the scatter plot of Xi vs. Xj.

[Figure: two example scatter plots — Zone Car Ownership (X3) vs. Population (X1), showing no evidence of multicollinearity, and Office Space (X3) vs. Zone Households (X2), showing evidence of multicollinearity]

Check for Multicollinearity (cont.)

• Sample correlation coefficient (r): a measure of the degree of linear relationship between two variables.
• −1 ≤ r ≤ 1:
  – if r = +1, X1 and X2 have a perfect positive correlation
  – if r = −1, X1 and X2 have a perfect negative correlation
  – if r = 0, X1 and X2 have no linear relationship
  – if |r| > 0.4 for two independent variables, a regression equation that includes both of them may run into the multicollinearity problem.
• The correlation coefficient can also be obtained directly using an Excel function (CORREL).
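The |r| > 0.4 screen can be applied to the City X model itself. A sketch computing r between X1 (population) and X2 (households) from Table 1:

```python
# Sample correlation coefficient between the two independent variables of
# the City X multiple regression: X1 (population) and X2 (households).
from math import sqrt

pop = [7212, 4818, 8789, 5805, 3054, 10463, 2735, 1784, 4418, 6657]
hh  = [2488, 2188, 2423, 2141, 1241, 3451, 1857, 905, 1695, 1960]
n = len(pop)

m1, m2 = sum(pop) / n, sum(hh) / n
s12 = sum((a - m1) * (b - m2) for a, b in zip(pop, hh))
s11 = sum((a - m1) ** 2 for a in pop)
s22 = sum((b - m2) ** 2 for b in hh)

r = s12 / sqrt(s11 * s22)   # well above the 0.4 screening threshold
```

Here |r| is far above 0.4, so including both X1 and X2 risks multicollinearity, which is consistent with neither t statistic being significant in the two-variable Excel fit shown earlier.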
