
LINEAR REGRESSION ANALYSIS AND THE LEAST SQUARES METHOD

CDR SUMEET SINGH
CDR SUNIL TYAGI
CDR LOVEKESH THAKUR
CDR ASHIM MAHAJAN

THE SCHEME
ORIGIN
SCATTER DIAGRAM AND REGRESSION
LEAST SQUARES METHOD
STANDARD ERROR OF ESTIMATE
CORRELATION ANALYSIS
EXAMPLES
LIMITATIONS, ERRORS AND CAVEATS

ORIGIN OF THE WORD REGRESSION

First used as a statistical term in 1877 by Sir Francis Galton.
A study by him showed that children born to tall parents tend to move back, or regress, towards the mean height of the population.
He designated the word regression as the name of a general process of predicting one variable from another.

WHY REGRESSION ANALYSIS

The new CEO of a pharmaceutical firm wants evidence that the firm's profit is related to the amount of spending on R&D.
Past data on R&D spending and the annual profit earned are available.
Using regression techniques on this past known data, an estimate of the future outcome can be made.

PRACTICAL APPLICATIONS OF REGRESSION ANALYSIS

Epidemiology - Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis.
Finance - The capital asset pricing model uses linear regression for analyzing and quantifying the systematic risk of an investment.
Economics - Linear regression is the predominant empirical tool in economics, e.g. to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand and labor supply.
Environmental Science - Linear regression is applied across a wide range of environmental studies.

INTRODUCTION TO REGRESSION ANALYSIS

Regression analysis determines the relationship between variables. It is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable

SIMPLE LINEAR REGRESSION MODEL

Only one independent variable, x
The relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x

TYPES OF RELATIONSHIPS

Direct Relationship: as the independent variable increases, the dependent variable also increases (positive linear relationship).

Inverse Relationship: the dependent variable decreases with an increase in the independent variable (negative linear relationship).

SCATTER PLOTS AND CORRELATION

A scatter plot (or scatter diagram) is used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables:
It is only concerned with the strength of the relationship
No causal effect is implied

SCATTER PLOT EXAMPLES

[Figure: scatter plots illustrating linear relationships and curvilinear relationships.]

SCATTER PLOT EXAMPLES

[Figure: scatter plots illustrating strong relationships and weak relationships.]

SCATTER PLOT EXAMPLES

[Figure: scatter plot illustrating no relationship.]

ESTIMATION USING THE REGRESSION LINE

[Figure: a fitted line through sample points, one of which is labelled (X2, Y2) = (4, 11).]

EQUATION FOR A STRAIGHT LINE

$$Y = a + bX$$

where Y is the dependent variable, X is the independent variable, a is the Y-intercept and b is the slope of the line.

LEAST SQUARES METHOD

LINEAR REGRESSION ASSUMPTIONS

Error values (e) are statistically independent
Error values are normally distributed for any given value of x
The probability distribution of the errors has constant variance
The underlying relationship between the x variable and the y variable is linear

LINEAR REGRESSION

[Figure: the fitted line Y = a + bx with intercept a and slope b. For a given xi, the gap between the observed value of y and the predicted value of y on the line is the random error for that x value.]

METHOD OF LEAST SQUARES

"Squares" here means the squares of the errors.
An error is the difference between an actual data point and the corresponding point on the estimated line.
Why least squares, and not the algebraic sum or the absolute sum of the errors?
Let's go step by step.

GOOD FIT

[Graph 1 - ALGEBRAIC SUM: errors 1, 2, -3 add to 1 + 2 - 3 = 0.]
[Graph 2 - ABSOLUTE SUM: errors 4, 2, 2 add to 4 + 2 + 2 = 8.]

GOOD FIT - LEAST SQUARES

[Graph 1: squared errors (1)^2 + (2)^2 + (-3)^2 = 14.]
[Graph 2: squared errors (4)^2 + (2)^2 + (2)^2 = 24.]
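To make the comparison concrete, here is a minimal Python sketch (error values taken from the two graphs above) that aggregates the same errors all three ways:

```python
# Errors of the two candidate lines, as read off the graphs above
errors_graph1 = [1, 2, -3]
errors_graph2 = [4, -2, -2]  # magnitudes 4, 2, 2

for name, errors in [("Graph 1", errors_graph1), ("Graph 2", errors_graph2)]:
    algebraic = sum(errors)                 # signs cancel: both lines give 0
    absolute = sum(abs(e) for e in errors)  # ignores relative magnitude
    squared = sum(e * e for e in errors)    # penalises large errors
    print(name, algebraic, absolute, squared)
```

The slides now walk through why the first two criteria fail.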

GOOD FIT - ALGEBRAIC SUM

Method 1: ALGEBRAIC SUM
Let us take a data sample of three points for ease: (4,8), (8,1), (12,6).
Graphs 1 and 2 show two lines that could describe the association between the points.
Basic understanding of good fit: a line is a good fit if it minimises the error between the estimated points on the line and the actual points.

GOOD FIT - ALGEBRAIC SUM

Two different lines, and each has a mean error of 0. The problem with adding the individual errors is the cancelling effect of the positive and negative values.

[Graph 1: errors 1, 2, -3, so 1 + 2 - 3 = 0.]
[Graph 2: errors 4, -2, -2, so 4 - 2 - 2 = 0.]

The individual random error terms ei have a mean of zero.

GOOD FIT - ABSOLUTE SUM

Method 2: ABSOLUTE SUM
Take the same data sample of three points: (4,8), (8,1), (12,6).
Graphs 1 and 2 show the same two candidate lines.
Let us now take the absolute values of the errors, without their signs - |e| - for the two lines.

GOOD FIT - ABSOLUTE SUM

Between the two lines, the absolute sum seems to represent the relation between the variables better.

[Graph 1: |1| + |2| + |3| = 6.]
[Graph 2: |4| + |2| + |2| = 8.]

GOOD FIT - ABSOLUTE SUM

But before we reach any conclusion, let us look at a peculiar situation.
Data set: {(2,4), (7,6), (10,2)}

[Graph 1: errors 0, 0, 3, so 0 + 0 + 3 = 3.]
[Graph 2: errors 1, 2, 1.5, so 1 + 2 + 1.5 = 4.5.]

Graph 1 ignores the middle point but still has the lower absolute error. Intuitively, Graph 2 should have given a better fit for the complete data. So what is the problem?

GOOD FIT - ABSOLUTE SUM

The problem with the absolute sum is that a line passing through the middle of the data, which is the better representative, may have a larger absolute error and hence get rejected.
The sum-of-absolute-errors method does not stress the magnitude of the errors with respect to the sample data.
A representative line should have several small errors rather than a few large errors.

GOOD FIT - LEAST SQUARES

For the same data set, now let us use the least squares method: we square the individual errors before adding them.
Data set: {(2,4), (7,6), (10,2)}

[Graph 1: 0 + 0 + (3)^2 = 9.]
[Graph 2: (1)^2 + (2)^2 + (1.5)^2 = 7.25.]

Graph 2, which intuitively gave the better fit of the data sample, now shows the lower error, and hence a better fit than Graph 1.

GOOD FIT - LEAST SQUARES

Squaring the errors has the following advantages:
It magnifies, or penalises, the larger errors.
It removes the cancelling effect of negative errors, since the square of a negative value is a positive number.
The estimating line that minimises the sum of the squares of the errors is called the least squares line.

LEAST SQUARES CRITERION

a and b are obtained by finding the values of a and b that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left( y - (a + bx) \right)^2$$

THE LEAST SQUARES EQUATION

The estimated line is

$$\hat{y} = a + bx$$

The formulas for b and a are:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$a = \bar{y} - b\bar{x}$$
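A minimal Python sketch of these two formulas (the name fit_line is illustrative, not from the slides):

```python
def fit_line(xs, ys):
    """Least squares estimates a and b for the line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar  # a = y_bar - b * x_bar
    return a, b
```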

INTERPRETATION OF THE SLOPE AND THE INTERCEPT

a is the estimated average value of y when the value of x is zero
b is the estimated change in the average value of y as a result of a one-unit change in x

ERRORS AND CORRELATION

ERRORS
How to check the accuracy of the estimated line
How to check the reliability of the estimated line

DRUNKEN DRIVING AND HOSPITAL EMERGENCY EXPENDITURE

[Table: number of drunken-driving checks (x) against hospital emergency expenditure in lakhs (y); the values 123, 130, 110, 10, 60, 15, 21 appear on the slide.]

Accuracy check:
The estimated line should follow the path of the data
Individual errors should cancel each other

CHECKING ACCURACY

[Figure: observed points scattered around the fitted line Y = a + bx.]

RELIABILITY

[Figure: two scatter plots - points tightly clustered around the line (more reliable) and points widely spread (less reliable).]

Reliability is measured as the deviation around the regression line.

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

$$S_e = \sqrt{\frac{SSE}{n - 2}}$$

where
SSE = sum of squared errors = Σ(Y − Ŷ)²
n = sample size
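As a sketch, the same estimate in Python (assuming a and b come from a least squares fit such as the illustrative fit_line above):

```python
import math

def standard_error(xs, ys, a, b):
    """Standard error of estimate: sqrt(SSE / (n - 2))."""
    # SSE = sum of squared residuals around the fitted line a + b*x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))
```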

INTERPRETING STANDARD ERROR

The smaller the Se, the better the reliability.
If Se = 0, all points lie on the regression line: 100% reliability.

INTERPRETING STANDARD ERROR

Assuming that the observed points are normally distributed around the regression line, and that the variance of the distribution around each possible value of Y is the same:

68.2% of observations lie within 1 x Se of the line
95.5% of observations lie within 2 x Se of the line

Interpreting SE

[Figure: the regression line Y = a + bx with parallel bands at Y = a + bx ± 1Se and Y = a + bx ± 2Se.]

Drunk Driving Checks

For the checks (x) versus expenditure in lakhs (y) data above, Se = 1.88 lakhs.

68.2% accuracy within 1.88 lakhs
95.5% accuracy within 3.76 lakhs
Excel function: STEYX

CORRELATION ANALYSIS

Describes the degree to which one variable is linearly related to another.
Used in conjunction with regression analysis to explain how well the regression line explains the variation of the dependent variable.
Two measures:
Coefficient of Determination
Coefficient of Correlation

COEFFICIENT OF DETERMINATION

Measures the strength of the association.
Developed from two variations of the Y values:
around the fitted regression line: Σ(Y − Ŷ)²
around their own mean: Σ(Y − Ȳ)²

$$R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$

R2 varies between 0 and 1.

COEFFICIENT OF CORRELATION

Another measure of association:

$$R = \pm\sqrt{R^2}$$

R varies between -1 and 1.
R = -0.9 indicates a negative relation between x and y.
R2 = 0.81 means that 81% of the variation in Y is explained by the regression line.
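Both measures in a short Python sketch (helper names are illustrative; r takes the sign of the slope b):

```python
def r_squared(xs, ys, a, b):
    """Coefficient of determination: 1 - SSE/SST."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # around the line
    sst = sum((y - y_bar) ** 2 for y in ys)                    # around the mean
    return 1 - sse / sst

def correlation(xs, ys, a, b):
    """Coefficient of correlation: signed square root of R^2."""
    r = r_squared(xs, ys, a, b) ** 0.5
    return r if b >= 0 else -r
```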

EXAMPLES AND USAGE OF REGRESSION

Simple Linear Regression Example

Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co., they have collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to predict future overhead.

Data provided

OVERHEADS | UNITS PRODUCED
191 | 40
170 | 42
272 | 53
155 | 35
280 | 56
173 | 39
234 | 48
116 | 30
153 | 37
178 | 40

Simple Linear Regression Example

Develop the regression equation
Predict overhead when 50 units are produced
Calculate the standard error of estimate

First, determine:
Dependent variable (y) = overhead
Independent variable (x) = units produced

Remember - Least Squares Equation

$$\hat{y} = a + bx$$

The formulas for b and a are:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$a = \bar{y} - b\bar{x}$$

Working out the problem

OVERHEAD (y) | UNITS (x) | y2 | x2 | xy
191 | 40 | 36481 | 1600 | 7640
170 | 42 | 28900 | 1764 | 7140
272 | 53 | 73984 | 2809 | 14416
155 | 35 | 24025 | 1225 | 5425
280 | 56 | 78400 | 3136 | 15680
173 | 39 | 29929 | 1521 | 6747
234 | 48 | 54756 | 2304 | 11232
116 | 30 | 13456 | 900 | 3480
153 | 37 | 23409 | 1369 | 5661
178 | 40 | 31684 | 1600 | 7120
Sums: 1922 | 420 | 395024 | 18228 | 84541
Means: 192.2 | 42

Substituting in formulae

Using the sums from the table: Σy = 1922, Σx = 420, Σy2 = 395024, Σx2 = 18228, Σxy = 84541, n = 10.

$$b = \frac{84541 - \frac{(420)(1922)}{10}}{18228 - \frac{(420)^2}{10}} = \frac{3817}{588} = 6.4915$$

$$a = \bar{y} - b\bar{x} = 192.2 - (6.4915)(42) = -80.4430$$

Regression Equation developed

$$\hat{y} = a + bx = -80.4430 + 6.4915x$$

Predict overhead when 50 units are produced:

$$\hat{y} = -80.4430 + 6.4915(50) = 244.1320$$

The predicted overhead for 50 units is 244.1320.
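The worked numbers can be sanity-checked with a few lines of Python, using the slide data and the algebraic-equivalent formulas:

```python
# Overhead example from the slides: x = units produced, y = overhead
units = [40, 42, 53, 35, 56, 39, 48, 30, 37, 40]
overhead = [191, 170, 272, 155, 280, 173, 234, 116, 153, 178]

n = len(units)
sum_x, sum_y = sum(units), sum(overhead)
sum_xy = sum(x * y for x, y in zip(units, overhead))
sum_x2 = sum(x * x for x in units)

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n
print(b, a)        # approx 6.4915 and -80.443, as on the slides
print(a + b * 50)  # predicted overhead for 50 units: approx 244.13
```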

Remember - Standard Error of Estimate

$$S_e = \sqrt{\frac{SSE}{n - 2}}$$

where SSE = sum of squared errors = Σ(Y − Ŷ)² and n = sample size.

However, easier for calculations is this algebraic equivalent:

$$S_e = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$$

Substituting in Formulae

Sums: Σy = 1922, Σx = 420, Σy2 = 395024, Σx2 = 18228, Σxy = 84541
a = -80.4430, b = 6.4915

$$S_e = \sqrt{\frac{395024 - (-80.4430)(1922) - 6.4915(84541)}{10 - 2}} = 10.2320$$
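The shortcut formula can be checked numerically with the slide sums (small differences arise from rounding a and b first):

```python
import math

# Se = sqrt((sum_y2 - a*sum_y - b*sum_xy) / (n - 2)), using the slide values
sum_y, sum_y2, sum_xy, n = 1922, 395024, 84541, 10
a, b = -80.4430, 6.4915
se = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(round(se, 4))  # approx 10.232
```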

GRAPHICAL PRESENTATION

[Figure: scatter plot of overhead against units produced, with the fitted regression line ŷ = -80.4430 + 6.4915x.]

Calculating the Correlation Coefficient

Sample correlation coefficient:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

or the algebraic equivalent:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable

Calculation Example

Tree Height (y) | Trunk Diameter (x) | xy | y2 | x2
35 | 8 | 280 | 1225 | 64
49 | 9 | 441 | 2401 | 81
27 | 7 | 189 | 729 | 49
33 | 6 | 198 | 1089 | 36
60 | 13 | 780 | 3600 | 169
21 | 7 | 147 | 441 | 49
45 | 11 | 495 | 2025 | 121
51 | 12 | 612 | 2601 | 144
Sums: 321 | 73 | 3142 | 14111 | 713

Calculation Example

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886$$

[Figure: scatter plot of tree height (y) against trunk diameter (x) with the fitted line.]

r = 0.886: a relatively strong positive linear association between x and y.
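A quick numeric check of this calculation from the table sums:

```python
import math

# r from the algebraic formula, using the tree-data sums
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 8, 3142, 73, 321, 713, 14111
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.886
```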

LIMITATIONS, ERRORS & CAVEATS

A regression equation holds only over the specific, limited range from which the sample was originally taken.
Regression and correlation analyses do not determine cause and effect.
Conditions change and can invalidate the regression equation: since we use past trends to estimate future trends, the values of variables change over time.

LIMITATIONS, ERRORS & CAVEATS

Misrepresenting the coefficients of correlation and determination:
The coefficient of correlation is often misinterpreted as a percentage.
It is the coefficient of determination that gives the proportion of the total variation explained by the regression line.

Use common sense:
Use knowledge of the inherent limitations of the tool.
Do not look for a statistical relationship between random samples with no common bond.

REFERENCES
Statistics for Management, Levin & Rubin
Statistics for Managers Using Microsoft Excel, 5e, 2008, Prentice-Hall, Inc.
Mba512 Simple Linear Regression Notes, uploaded by Wilkes University
Wikipedia
dss.princeton.edu - Online help - Analysis
resources.esri.com/help/9.3/.../com/.../regression_analysis_basics.htm
Linear Regression, uploaded by MBA CORNER By Babasab Patil
Linear Regression, Tech_MX
Multiple Linear Regression II, James Neill, 2013
Multiple PPTs on SlideShare
