
LINEAR REGRESSION ANALYSIS AND THE LEAST SQUARES METHOD

CDR SUMEET SINGH
CDR SUNIL TYAGI
CDR LOVEKESH THAKUR
CDR ASHIM MAHAJAN

THE SCHEME
ORIGIN
SCATTER DIAGRAM AND REGRESSION
LEAST SQUARES METHOD
STANDARD ERROR OF ESTIMATE
CORRELATION ANALYSIS
EXAMPLES
LIMITATIONS, ERRORS AND CAVEATS

ORIGIN OF THE WORD REGRESSION

First used as a statistical term in 1877 by Sir Francis Galton.
A study by him showed that children born to tall parents tend to move back, or regress, towards the mean height of the population.
He designated the word regression as the name of a general process of predicting one variable from another.

WHY REGRESSION ANALYSIS

The new CEO of a pharmaceutical firm wants evidence that the firm's profit is related to the amount of spending on R&D.
Past data on R&D spending and the annual profit earned are available.
Using regression techniques on this past known data, an estimate of the future outcome can be made.

PRACTICAL APPLICATIONS OF REGRESSION ANALYSIS

Epidemiology - Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis.
Finance - The capital asset pricing model uses linear regression for analyzing and quantifying the systematic risk of an investment.
Economics - Linear regression is the predominant empirical tool in economics, e.g. to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand and labor supply.
Environmental Science - Linear regression is applied across a wide range of environmental studies.

INTRODUCTION TO REGRESSION ANALYSIS

Regression analysis determines the relationship between variables. It is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable

SIMPLE LINEAR REGRESSION MODEL

Only one independent variable, x
The relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x

TYPES OF RELATIONSHIPS

Direct Relationship: as the independent variable increases, the dependent variable also increases (positive linear relationship).

Inverse Relationship: the dependent variable decreases with an increase in the independent variable (negative linear relationship).

SCATTER PLOTS AND CORRELATION

A scatter plot (or scatter diagram) is used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables:
It is only concerned with the strength of the relationship
No causal effect is implied

SCATTER PLOT EXAMPLES

[Figure: scatter plots illustrating linear relationships and curvilinear relationships.]

SCATTER PLOT EXAMPLES

[Figure: scatter plots illustrating strong relationships and weak relationships.]

SCATTER PLOT EXAMPLES

[Figure: scatter plot illustrating no relationship.]

ESTIMATION USING THE REGRESSION LINE

[Figure: a fitted line through sample points, one of which is labelled (X2, Y2) = (4, 11).]

EQUATION FOR A STRAIGHT LINE

$$Y = a + bX$$

where Y is the dependent variable, X is the independent variable, a is the Y-intercept and b is the slope of the line.

LEAST SQUARES METHOD

LINEAR REGRESSION ASSUMPTIONS

Error values (e) are statistically independent
Error values are normally distributed for any given value of x
The probability distribution of the errors has constant variance
The underlying relationship between the x variable and the y variable is linear

LINEAR REGRESSION

[Figure: the fitted line Y = a + bx with intercept a and slope b. For a given xi, the gap between the observed value of y and the predicted value of y on the line is the random error for that x value.]

METHOD OF LEAST SQUARES

"Squares" here means the squares of the errors.
An error is the difference between an actual data point and the corresponding point on the estimated line.
Why least squares, and not the algebraic sum or the absolute sum of the errors?
Let's go step by step.

GOOD FIT

[Graph 1 - ALGEBRAIC SUM: errors 1, 2, -3 add to 1 + 2 - 3 = 0.]
[Graph 2 - ABSOLUTE SUM: errors 4, 2, 2 add to 4 + 2 + 2 = 8.]

GOOD FIT - LEAST SQUARES

[Graph 1: squared errors (1)^2 + (2)^2 + (-3)^2 = 14.]
[Graph 2: squared errors (4)^2 + (2)^2 + (2)^2 = 24.]
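To make the comparison concrete, here is a minimal Python sketch (error values taken from the two graphs above) that aggregates the same errors all three ways:

```python
# Errors of the two candidate lines, as read off the graphs above
errors_graph1 = [1, 2, -3]
errors_graph2 = [4, -2, -2]  # magnitudes 4, 2, 2

for name, errors in [("Graph 1", errors_graph1), ("Graph 2", errors_graph2)]:
    algebraic = sum(errors)                 # signs cancel: both lines give 0
    absolute = sum(abs(e) for e in errors)  # ignores relative magnitude
    squared = sum(e * e for e in errors)    # penalises large errors
    print(name, algebraic, absolute, squared)
```

The slides now walk through why the first two criteria fail.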

GOOD FIT - ALGEBRAIC SUM

Method 1: ALGEBRAIC SUM
Let us take a data sample of three points for ease: (4,8), (8,1), (12,6).
Graphs 1 and 2 show two lines that could describe the association between the points.
Basic understanding of good fit: a line is a good fit if it minimises the error between the estimated points on the line and the actual points.

GOOD FIT - ALGEBRAIC SUM

Two different lines, and each has a mean error of 0. The problem with adding the individual errors is the cancelling effect of the positive and negative values.

[Graph 1: errors 1, 2, -3, so 1 + 2 - 3 = 0.]
[Graph 2: errors 4, -2, -2, so 4 - 2 - 2 = 0.]

The individual random error terms ei have a mean of zero.

GOOD FIT - ABSOLUTE SUM

Method 2: ABSOLUTE SUM
Take the same data sample of three points: (4,8), (8,1), (12,6).
Graphs 1 and 2 show the same two candidate lines.
Let us now take the absolute values of the errors, without their signs - |e| - for the two lines.

GOOD FIT - ABSOLUTE SUM

Between the two lines, the absolute sum seems to represent the relation between the variables better.

[Graph 1: |1| + |2| + |3| = 6.]
[Graph 2: |4| + |2| + |2| = 8.]

GOOD FIT - ABSOLUTE SUM

But before we reach any conclusion, let us look at a peculiar situation.
Data set: {(2,4), (7,6), (10,2)}

[Graph 1: errors 0, 0, 3, so 0 + 0 + 3 = 3.]
[Graph 2: errors 1, 2, 1.5, so 1 + 2 + 1.5 = 4.5.]

Graph 1 ignores the middle point but still has the lower absolute error. Intuitively, Graph 2 should have given a better fit for the complete data. So what is the problem?

GOOD FIT - ABSOLUTE SUM

The problem with the absolute sum is that a line passing through the middle of the data, which is the better representative, may have a larger absolute error and hence get rejected.
The sum-of-absolute-errors method does not stress the magnitude of the errors with respect to the sample data.
A representative line should have several small errors rather than a few large errors.

GOOD FIT - LEAST SQUARES

For the same data set, now let us use the least squares method: we square the individual errors before adding them.
Data set: {(2,4), (7,6), (10,2)}

[Graph 1: 0 + 0 + (3)^2 = 9.]
[Graph 2: (1)^2 + (2)^2 + (1.5)^2 = 7.25.]

Graph 2, which intuitively gave the better fit of the data sample, now shows the lower error, and hence a better fit than Graph 1.

GOOD FIT - LEAST SQUARES

Squaring the errors has the following advantages:
It magnifies, or penalises, the larger errors.
It removes the cancelling effect of negative errors, since the square of a negative value is a positive number.
The estimating line that minimises the sum of the squares of the errors is called the least squares line.

LEAST SQUARES CRITERION

a and b are obtained by finding the values of a and b that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left( y - (a + bx) \right)^2$$

THE LEAST SQUARES EQUATION

The estimated line is

$$\hat{y} = a + bx$$

The formulas for b and a are:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$a = \bar{y} - b\bar{x}$$
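A minimal Python sketch of these two formulas (the name fit_line is illustrative, not from the slides):

```python
def fit_line(xs, ys):
    """Least squares estimates a and b for the line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar  # a = y_bar - b * x_bar
    return a, b
```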

INTERPRETATION OF THE SLOPE AND THE INTERCEPT

a is the estimated average value of y when the value of x is zero
b is the estimated change in the average value of y as a result of a one-unit change in x

ERRORS AND CORRELATION

ERRORS
How to check the accuracy of the estimated line
How to check the reliability of the estimated line

DRUNKEN DRIVING AND HOSPITAL EMERGENCY EXPENDITURE

[Table: number of drunken-driving checks (x) against hospital emergency expenditure in lakhs (y); the values 123, 130, 110, 10, 60, 15, 21 appear on the slide.]

Accuracy check:
The estimated line should follow the path of the data
Individual errors should cancel each other

CHECKING ACCURACY

[Figure: observed points scattered around the fitted line Y = a + bx.]

RELIABILITY

[Figure: two scatter plots - points tightly clustered around the line (more reliable) and points widely spread (less reliable).]

Reliability is measured as the deviation around the regression line.

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

$$S_e = \sqrt{\frac{SSE}{n - 2}}$$

where
SSE = sum of squared errors = Σ(Y − Ŷ)²
n = sample size
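As a sketch, the same estimate in Python (assuming a and b come from a least squares fit such as the illustrative fit_line above):

```python
import math

def standard_error(xs, ys, a, b):
    """Standard error of estimate: sqrt(SSE / (n - 2))."""
    # SSE = sum of squared residuals around the fitted line a + b*x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))
```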

INTERPRETING STANDARD ERROR

The smaller the Se, the better the reliability.
If Se = 0, all points lie on the regression line: 100% reliability.

INTERPRETING STANDARD ERROR

Assuming that the observed points are normally distributed around the regression line, and that the variance of the distribution around each possible value of Y is the same:

68.2% of observations lie within 1 x Se of the line
95.5% of observations lie within 2 x Se of the line

Interpreting SE

[Figure: the regression line Y = a + bx with parallel bands at Y = a + bx ± 1Se and Y = a + bx ± 2Se.]

Drunk Driving Checks

For the checks (x) versus expenditure in lakhs (y) data above, Se = 1.88 lakhs.

68.2% accuracy within 1.88 lakhs
95.5% accuracy within 3.76 lakhs
Excel function: STEYX

CORRELATION ANALYSIS

Describes the degree to which one variable is linearly related to another.
Used in conjunction with regression analysis to explain how well the regression line explains the variation of the dependent variable.
Two measures:
Coefficient of Determination
Coefficient of Correlation

COEFFICIENT OF DETERMINATION

Measures the strength of the association.
Developed from two variations of the Y values:
around the fitted regression line: Σ(Y − Ŷ)²
around their own mean: Σ(Y − Ȳ)²

$$R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$

R2 varies between 0 and 1.

COEFFICIENT OF CORRELATION

Another measure of association:

$$R = \pm\sqrt{R^2}$$

R varies between -1 and 1.
R = -0.9 indicates a negative relation between x and y.
R2 = 0.81 means that 81% of the variation in Y is explained by the regression line.
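Both measures in a short Python sketch (helper names are illustrative; r takes the sign of the slope b):

```python
def r_squared(xs, ys, a, b):
    """Coefficient of determination: 1 - SSE/SST."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # around the line
    sst = sum((y - y_bar) ** 2 for y in ys)                    # around the mean
    return 1 - sse / sst

def correlation(xs, ys, a, b):
    """Coefficient of correlation: signed square root of R^2."""
    r = r_squared(xs, ys, a, b) ** 0.5
    return r if b >= 0 else -r
```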

EXAMPLES AND USAGE OF REGRESSION

Simple Linear Regression Example

Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co., they have collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to predict future overhead.

Data provided

OVERHEADS | UNITS PRODUCED
191 | 40
170 | 42
272 | 53
155 | 35
280 | 56
173 | 39
234 | 48
116 | 30
153 | 37
178 | 40

Simple Linear Regression Example

Develop the regression equation
Predict overhead when 50 units are produced
Calculate the standard error of estimate

First, determine:
Dependent variable (y) = overhead
Independent variable (x) = units produced

Remember - Least Squares Equation

$$\hat{y} = a + bx$$

The formulas for b and a are:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$a = \bar{y} - b\bar{x}$$

Working out the problem

OVERHEAD (y) | UNITS (x) | y2 | x2 | xy
191 | 40 | 36481 | 1600 | 7640
170 | 42 | 28900 | 1764 | 7140
272 | 53 | 73984 | 2809 | 14416
155 | 35 | 24025 | 1225 | 5425
280 | 56 | 78400 | 3136 | 15680
173 | 39 | 29929 | 1521 | 6747
234 | 48 | 54756 | 2304 | 11232
116 | 30 | 13456 | 900 | 3480
153 | 37 | 23409 | 1369 | 5661
178 | 40 | 31684 | 1600 | 7120
Sums: 1922 | 420 | 395024 | 18228 | 84541
Means: 192.2 | 42

Substituting in formulae

Using the sums from the table: Σy = 1922, Σx = 420, Σy2 = 395024, Σx2 = 18228, Σxy = 84541, n = 10.

$$b = \frac{84541 - \frac{(420)(1922)}{10}}{18228 - \frac{(420)^2}{10}} = \frac{3817}{588} = 6.4915$$

$$a = \bar{y} - b\bar{x} = 192.2 - (6.4915)(42) = -80.4430$$

Regression Equation developed

$$\hat{y} = a + bx = -80.4430 + 6.4915x$$

Predict overhead when 50 units are produced:

$$\hat{y} = -80.4430 + 6.4915(50) = 244.1320$$

The predicted overhead for 50 units is 244.1320.
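The worked numbers can be sanity-checked with a few lines of Python, using the slide data and the algebraic-equivalent formulas:

```python
# Overhead example from the slides: x = units produced, y = overhead
units = [40, 42, 53, 35, 56, 39, 48, 30, 37, 40]
overhead = [191, 170, 272, 155, 280, 173, 234, 116, 153, 178]

n = len(units)
sum_x, sum_y = sum(units), sum(overhead)
sum_xy = sum(x * y for x, y in zip(units, overhead))
sum_x2 = sum(x * x for x in units)

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n
print(b, a)        # approx 6.4915 and -80.443, as on the slides
print(a + b * 50)  # predicted overhead for 50 units: approx 244.13
```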

Remember - Standard Error of Estimate

$$S_e = \sqrt{\frac{SSE}{n - 2}}$$

where SSE = sum of squared errors = Σ(Y − Ŷ)² and n = sample size.

However, easier for calculations is this algebraic equivalent:

$$S_e = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$$

Substituting in Formulae

Sums: Σy = 1922, Σx = 420, Σy2 = 395024, Σx2 = 18228, Σxy = 84541
a = -80.4430, b = 6.4915

$$S_e = \sqrt{\frac{395024 - (-80.4430)(1922) - 6.4915(84541)}{10 - 2}} = 10.2320$$
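The shortcut formula can be checked numerically with the slide sums (small differences arise from rounding a and b first):

```python
import math

# Se = sqrt((sum_y2 - a*sum_y - b*sum_xy) / (n - 2)), using the slide values
sum_y, sum_y2, sum_xy, n = 1922, 395024, 84541, 10
a, b = -80.4430, 6.4915
se = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(round(se, 4))  # approx 10.232
```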

GRAPHICAL PRESENTATION

[Figure: scatter plot of overhead against units produced, with the fitted regression line ŷ = -80.4430 + 6.4915x.]

Calculating the Correlation Coefficient

Sample correlation coefficient:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

or the algebraic equivalent:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable

Calculation Example

Tree Height (y) | Trunk Diameter (x) | xy | y2 | x2
35 | 8 | 280 | 1225 | 64
49 | 9 | 441 | 2401 | 81
27 | 7 | 189 | 729 | 49
33 | 6 | 198 | 1089 | 36
60 | 13 | 780 | 3600 | 169
21 | 7 | 147 | 441 | 49
45 | 11 | 495 | 2025 | 121
51 | 12 | 612 | 2601 | 144
Sums: 321 | 73 | 3142 | 14111 | 713

Calculation Example

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886$$

[Figure: scatter plot of tree height (y) against trunk diameter (x) with the fitted line.]

r = 0.886: a relatively strong positive linear association between x and y.
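A quick numeric check of this calculation from the table sums:

```python
import math

# r from the algebraic formula, using the tree-data sums
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 8, 3142, 73, 321, 713, 14111
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.886
```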

LIMITATIONS, ERRORS & CAVEATS

A regression equation holds only over the specific, limited range from which the sample was originally taken.
Regression and correlation analyses do not determine cause and effect.
Conditions change and can invalidate the regression equation: since we use past trends to estimate future trends, the values of variables change over time.

LIMITATIONS, ERRORS & CAVEATS

Misrepresenting the coefficients of correlation and determination:
The coefficient of correlation is often misinterpreted as a percentage.
It is the coefficient of determination that gives the proportion of the total variation explained by the regression line.

Use common sense:
Use knowledge of the inherent limitations of the tool.
Do not look for a statistical relationship between random samples with no common bond.

REFERENCES
Statistics for Management, Levin & Rubin
Statistics for Managers Using Microsoft Excel, 5e, 2008, Prentice-Hall, Inc.
Mba512 Simple Linear Regression Notes, uploaded by Wilkes University
Wikipedia
dss.princeton.edu - Online help - Analysis
resources.esri.com/help/9.3/.../com/.../regression_analysis_basics.htm
Linear Regression, uploaded by MBA CORNER By Babasab Patil
Linear Regression, Tech_MX
Multiple Linear Regression II, James Neill, 2013
Multiple PPTs on SlideShare
