
Simple Linear Regression

Ms RU Cruz
Overview
1. Pearson's Correlation (r)
2. Simple Regression Model
3. Least Squares Method
4. Coefficient of Determination
5. Model Assumptions
6. Testing of Significance
7. Residual Analysis
Decisions, Decisions, Decisions
Managerial decisions are often based on the
relationship between two or more variables.
Advertising Expenditures relative to Sales
A marketing manager may want to predict sales for a given level of
advertising expenditure.
Sometimes a manager will rely on intuition to judge how
two variables are related.
Pattern recognition and data analysis should serve as the
fundamental framework for such judgments.
Correlation
A quantitative relationship between two interval or ratio level variables

Explanatory (Independent) Variable, x     Response (Dependent) Variable, y
Hours of Training                         Number of Accidents
Shoe Size                                 Height
Cigarettes Smoked per Day                 Lung Capacity
Score on SAT                              Grade Point Average
Height                                    IQ

What type of relationship exists between the two variables, and is the correlation significant?
Correlation

Correlation measures and describes the strength and direction of the relationship. Pearson's r:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
Correlation Coefficient r
A measure of the strength and direction of a linear relationship between two
variables

The range of r is from −1 to 1.

If r is close to −1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
If r is close to 1, there is a strong positive correlation.
Application
Number of absences (x) and final grade (y) for seven students:

x (Absences)      8    2    5    12   15   9    6
y (Final Grade)   78   92   90   58   43   74   81

[Scatter plot: Absences (0–16) on the horizontal axis vs. Final Grade (40–95) on the vertical axis]

Compute r.
Computation of r

 i    x    y    xy    x²    y²
 1    8    78   624   64    6084
 2    2    92   184   4     8464
 3    5    90   450   25    8100
 4    12   58   696   144   3364
 5    15   43   645   225   1849
 6    9    74   666   81    5476
 7    6    81   486   36    6561
 Σ    57   516  3751  579   39898

r = [7(3751) − (57)(516)] / √{[7(579) − 57²][7(39898) − 516²]} = −3155 / √[(804)(13030)] ≈ −0.975
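As a check on the hand computation, the same sums can be plugged into the formula in a few lines of plain Python. This is a minimal sketch; the variable names are illustrative and not part of the slides.

```python
# Pearson r from the raw sums, matching the table above.
from math import sqrt

absences = [8, 2, 5, 12, 15, 9, 6]       # x
grades   = [78, 92, 90, 58, 43, 74, 81]  # y

n = len(absences)
sum_x  = sum(absences)                                  # 57
sum_y  = sum(grades)                                    # 516
sum_xy = sum(x * y for x, y in zip(absences, grades))   # 3751
sum_x2 = sum(x * x for x in absences)                   # 579
sum_y2 = sum(y * y for y in grades)                     # 39898

numerator   = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 3))  # approximately -0.975
```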
r value
Guidelines for interpreting the value of r:

+.70 or higher    Very strong positive relationship
+.40 to +.69      Strong positive relationship
+.30 to +.39      Moderate positive relationship
+.20 to +.29      Weak positive relationship
+.01 to +.19      No or negligible relationship
 0                No relationship (zero-order correlation)
−.01 to −.19      No or negligible relationship
−.20 to −.29      Weak negative relationship
−.30 to −.39      Moderate negative relationship
−.40 to −.69      Strong negative relationship
−.70 or lower     Very strong negative relationship


Example #2
Number of Hours Spent Studying (x) and Grade Received (y):

x (Hours)   2    2    2    3    3    4    5    5    6    6
y (Grade)   57   63   70   72   69   75   73   84   82   89
Example 3
Let's begin by asking whether people tend to
marry other people of about the same age.
The sample data below are the ages of 10
married couples. Notice that husbands and
wives tend to be of about the same age,
with men having a tendency to be slightly
older than their wives. What we know of
statistics, however, tells us that what we see
is not always significant.
So let's apply the Pearson r formula and see
what happens.
Example 3 (continued)

Husband (x)   36   72   37   36   51   50   47   50   37   41
Wife (y)      35   67   33   35   50   46   47   42   36   41
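The husband/wife data can also be run through a library routine. The sketch below assumes SciPy is available and uses scipy.stats.pearsonr, which returns the sample correlation together with a two-sided p-value.

```python
# Pearson correlation for the married-couples example.
from scipy.stats import pearsonr

husband = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wife    = [35, 67, 33, 35, 50, 46, 47, 42, 36, 41]

r, p_value = pearsonr(husband, wife)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```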
Regression Terminologies
Regression analysis can be used to develop an
equation showing how the variables are related.
Dependent variable: the variable being predicted or explained under the regression model. Denoted by y.
Independent variable: the variable that does the predicting or explaining. Denoted by x.
Simple linear regression: involves one independent variable and one dependent variable, with the relationship approximated by a straight line.
Multiple regression: involves two or more independent variables.
Simple Linear Regression Model
Regression Model is the equation that
describes how y is related to x and an error
term.
Model:

y = β0 + β1x + ε

where β0 and β1 are the parameters of the model and ε is a random variable called the error term.
Simple Linear Regression Equation
E(y) = β0 + β1x
β0 is the y-intercept of the regression line.
β1 is the slope of the regression line.
E(y) is the expected value of y for a given x value.

[Figure: regression line with positive slope β1 and y-intercept β0; E(y) on the vertical axis, x on the horizontal axis]
Scatter Plots and Types of Correlation

x = Math SAT score, y = GPA

[Scatter plot: Math SAT (300–800) vs. GPA (1.50–4.00)]

Positive correlation: as x increases, y increases.


Simple Linear Regression Equation

Negative Linear Relationship

[Figure: regression line with negative slope β1 and y-intercept β0; E(y) on the vertical axis, x on the horizontal axis]
Scatter Plots and Types of Correlation

x = hours of training, y = number of accidents

[Scatter plot: Hours of Training (0–20) vs. Accidents (0–60)]

Negative correlation: as x increases, y decreases.


Simple Linear Regression Equation

No Relationship

[Figure: horizontal regression line with slope β1 = 0 and y-intercept β0; E(y) on the vertical axis, x on the horizontal axis]
Scatter Plots and Types of Correlation

x = height, y = IQ

[Scatter plot: Height (60–80) vs. IQ (80–160)]

No linear correlation.
Estimated Simple Linear Regression Equation

ŷ = b0 + b1x

The graph of this equation is called the estimated regression line.
b0 is the y-intercept of the line.
b1 is the slope of the line.
ŷ is the estimated value of y for a given x value.

The y-intercept for the estimated regression equation is b0 = ȳ − b1x̄.
Slope for the Estimated Regression Equation

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
Example
MCAS periodically has a special week-long sale. As part of the advertising campaign, MCAS runs one or more TV commercials during the weekend preceding the sale. Data from a sample of the previous 5 weeks of sales are shown:

Week   TV Ads   Cars Sold
1      1        14
2      3        24
3      2        18
4      1        17
5      3        27
Questions
1. Develop a scatter diagram for these data
2. What is the slope using the simple linear
regression model?
3. What is the y intercept?
4. What is the estimated regression equation?
Scatter Diagram

[Scatter plot: TV Ads (0–3.5) vs. Cars Sold (0–30)]
Computation Table for the Slope

Week   TV Ads (x)   Cars Sold (y)   xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1      1            14              −1       −6       6                  1
2      3            24              1        4        4                  1
3      2            18              0        −2       0                  0
4      1            17              −1       −3       3                  1
5      3            27              1        7        7                  1
       Σx = 10      Σy = 100                          20                 4
       x̄ = 2        ȳ = 20

2. Slope: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5
3. y-intercept: b0 = ȳ − b1x̄ = 20 − 5(2) = 10

4. Estimated regression equation:
ŷ = b0 + b1x
ŷ = 10 + 5x
(in slope-intercept form y = mx + b: ŷ = 5x + 10)
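The slope and intercept above can be verified with a short script. This is a minimal sketch of the deviations-about-the-means computation; the variable names are illustrative.

```python
# Least squares slope and intercept for the TV ads example.
tv_ads    = [1, 3, 2, 1, 3]       # x
cars_sold = [14, 24, 18, 17, 27]  # y

x_bar = sum(tv_ads) / len(tv_ads)         # 2
y_bar = sum(cars_sold) / len(cars_sold)   # 20

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tv_ads, cars_sold))  # 20
sxx = sum((x - x_bar) ** 2 for x in tv_ads)                              # 4

b1 = sxy / sxx            # slope = 5.0
b0 = y_bar - b1 * x_bar   # intercept = 10.0
print(f"y-hat = {b0} + {b1}x")
```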
SEATWORK: 1/2 CROSSWISE, BY PAIR
The grades of a sample of 9 students on a midterm
exam (X) and on the final exam (Y) are as follows:

x 96 81 71 72 50 94 77 99 67
y 99 47 78 34 66 85 82 99 68

a) Find the equation of the regression line.


b) Estimate the final exam grade of a student who
received a grade of 85 in the midterm exam but
was sick at the time the final exam was given.
Correlation and Regression: Example 2 (Seatwork Solution)

x   96   81   71   72   50   94   77   99   67
y   99   47   78   34   66   85   82   99   68

a) ŷ = 12.062 + 0.777x

b) ŷ(85) = 12.062 + 0.777(85) ≈ 78
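A hedged check of this answer, assuming NumPy is available: numpy.polyfit with degree 1 fits the same least squares line and returns the slope followed by the intercept.

```python
# Verifying the seatwork regression line and the prediction at x = 85.
import numpy as np

midterm = [96, 81, 71, 72, 50, 94, 77, 99, 67]  # x
final   = [99, 47, 78, 34, 66, 85, 82, 99, 68]  # y

slope, intercept = np.polyfit(midterm, final, 1)
print(f"y-hat = {intercept:.3f} + {slope:.3f}x")                      # about 12.062 + 0.777x
print(f"predicted final for x = 85: {intercept + slope * 85:.1f}")    # about 78
```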


Least Squares Method
The least squares method is the procedure used to develop the estimated regression equation.
Objective: minimize Σ(yi − ŷi)²
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
yi − ŷi = residual for observation i
Scatter diagram: a graph of the data in which the independent variable is on the horizontal axis and the dependent variable is on the vertical axis.
EXAMPLE 1
A company has recorded data on the daily demand for its product (y, in thousands of units) and the unit price (x, in hundreds of dollars). A sample of 15 days' demand and associated prices resulted in the following data:

Σx = 75,  Σx² = 437,  Σy = 180,  Σy² = 2266,  Σxy = 844

a. Using the above information, develop the least-squares estimated regression line and write the equation.

ANSWER
ŷ = 16.515 − 0.903x
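The answer can be reproduced directly from the summary statistics, without the raw data. This is a minimal sketch using b1 = (Σxy − n·x̄·ȳ)/(Σx² − n·x̄²) and b0 = ȳ − b1·x̄, which is algebraically equivalent to the deviation formula.

```python
# Least squares estimates from the given sums only.
n, sum_x, sum_x2, sum_y, sum_y2, sum_xy = 15, 75, 437, 180, 2266, 844

x_bar = sum_x / n  # 5.0
y_bar = sum_y / n  # 12.0

b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # about -0.903
b0 = y_bar - b1 * x_bar                                         # about 16.515
print(f"y-hat = {b0:.3f} {b1:+.3f}x")
```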



Coefficient of Determination
Coefficient of Determination is a measure of
the goodness of fit of the estimated regression
equation. It can be interpreted as the proportion
of the variability in the dependent variable y that
is explained by the estimated regression
equation.
The ith residual is the difference between the observed value of the dependent variable and the value predicted using the estimated regression equation. For the ith observation, the ith residual is yi − ŷi.
Coefficient of Determination
Relationship among SST, SSR, and SSE:
SST = SSR + SSE
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
Coefficient of determination: r² = SSR/SST
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Calculation of SSE

Week   TV Ads (x)   Cars Sold (y)   Predicted Sales ŷ = 10 + 5x   Residual yi − ŷi   (yi − ŷi)²
1      1            14              15                            −1                 1
2      3            24              25                            −1                 1
3      2            18              20                            −2                 4
4      1            17              15                            2                  4
5      3            27              25                            2                  4
       Σx = 10      Σy = 100                                                         SSE = 14
       x̄ = 2        ȳ = 20
Calculation of SST

Week   TV Ads (x)   Cars Sold (y)   yi − ȳ   (yi − ȳ)²
1      1            14              −6       36
2      3            24              4        16
3      2            18              −2       4
4      1            17              −3       9
5      3            27              7        49
       Σx = 10      Σy = 100                 SST = 114
       x̄ = 2        ȳ = 20

SSR = SST − SSE = 114 − 14 = 100
r² = SSR/SST = 100/114 = 0.8772

The regression relationship is very strong; 87.72% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
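A minimal sketch of the SSE, SST, and r² calculations for the TV ads data, reusing the fitted equation ŷ = 10 + 5x; the variable names are illustrative.

```python
# SSE, SST, SSR, and the coefficient of determination.
tv_ads    = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]

y_hat = [10 + 5 * x for x in tv_ads]
y_bar = sum(cars_sold) / len(cars_sold)

sse = sum((y - yh) ** 2 for y, yh in zip(cars_sold, y_hat))   # 14
sst = sum((y - y_bar) ** 2 for y in cars_sold)                # 114
ssr = sst - sse                                               # 100

r_squared = ssr / sst
print(f"SSE={sse}, SST={sst}, SSR={ssr}, r^2={r_squared:.4f}")  # r^2 about 0.8772
```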
Model Assumptions
1. The error term ε is a random variable with a mean or expected value of zero.
2. The variance of ε, denoted by σ², is the same for all values of x.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable.
Testing for Significance
To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero.
The two most commonly used tests are the F test and the t test.
Both tests require an estimate of σ², the variance of ε in the regression model.
The mean square error (MSE) provides the estimate of σ²; the notation s² is also used.
s² = MSE = SSE/(n − 2),  where  SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
Standard Error of the Estimate
To estimate σ, the standard deviation of ε, take the square root of s² (the MSE).
The resulting s is called the standard error of the estimate:

s = √MSE = √(SSE/(n − 2))
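For the TV ads example, the same quantities work out as in the short sketch below (SSE = 14 and n = 5 are carried over from the earlier tables).

```python
# MSE and the standard error of the estimate for the TV ads data.
from math import sqrt

sse, n = 14, 5
mse = sse / (n - 2)   # about 4.667
s = sqrt(mse)         # about 2.160
print(f"MSE = {mse:.3f}, s = {s:.3f}")
```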
Testing for Significance: t Test
Hypotheses:  H0: β1 = 0
             Ha: β1 ≠ 0
Test statistic:  t = b1 / s_b1,  where  s_b1 = s / √Σ(xi − x̄)²

Rejection rule:
p-value approach: reject H0 if p-value ≤ α
Critical value approach: reject H0 if t ≤ −t(α/2) or t ≥ t(α/2)
where t(α/2) is based on a t distribution with n − 2 degrees of freedom.
Steps of Significant t Tests
1. Determine the hypotheses
2. Specify the level of significance
3. Select the test statistic: t = b1 / s_b1
4. State the rejection rule
5. Compute the value of the test statistic
6. Determine conclusion via rejection rule
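A hedged sketch of the t test applied to the TV ads example; the p-value line assumes SciPy is available, and the numbers carried in (slope 5, SSE 14, Σ(xi − x̄)² = 4) come from the earlier calculations.

```python
# t test for the significance of the slope b1.
from math import sqrt
from scipy.stats import t as t_dist

b1, s, sxx, n = 5.0, sqrt(14 / 3), 4.0, 5   # slope, standard error of estimate, sum of squared deviations, n

s_b1 = s / sqrt(sxx)                 # about 1.08
t_stat = b1 / s_b1                   # about 4.63
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```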
Testing for Significance: F Test
Hypotheses:  H0: β1 = 0
             Ha: β1 ≠ 0
Test statistic:  F = MSR/MSE
Rejection rule:
p-value approach: reject H0 if p-value ≤ α
Critical value approach: reject H0 if F ≥ Fα
where Fα is based on an F distribution with 1 degree of freedom in the numerator and
n − 2 degrees of freedom in the denominator.
Steps of Significant F Tests
1. Determine the hypotheses
2. Specify the level of significance
3. Select the test statistic
F = MSR/MSE
4. State the rejection rule
5. Compute the value of the test statistic
6. Determine conclusion via rejection rule
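The corresponding F test for the same TV ads example, again assuming SciPy for the p-value; SSR = 100 and SSE = 14 are carried over from the r² calculation.

```python
# F test for the overall significance of the regression.
from scipy.stats import f as f_dist

ssr, sse, n = 100, 14, 5
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse                       # about 21.43
p_value = f_dist.sf(f_stat, dfn=1, dfd=n - 2)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
```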
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).
For a two-tailed test for significance:
H0: ρ = 0 (the correlation is not significant)
Ha: ρ ≠ 0 (the correlation is significant)

The sampling distribution for r is a t distribution with n − 2 d.f.

Standardized test statistic:
t = r / √[(1 − r²)/(n − 2)]
Test of Significance
The correlation between the number of times absent and the final grade is r = −0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.

1. Write the null and alternative hypotheses.
   H0: ρ = 0 (the correlation is not significant)
   Ha: ρ ≠ 0 (the correlation is significant)

2. State the level of significance.
   α = 0.01

3. Identify the sampling distribution.
   A t distribution with 5 degrees of freedom.
Rejection Regions

4. Find the critical values. From the t table, for a two-tailed test with α = 0.01 and 5 degrees of freedom, t0 = ±4.032.

5. Find the rejection regions: t < −4.032 or t > 4.032.

6. Find the test statistic.
   t = r / √[(1 − r²)/(n − 2)] = −0.975 / √[(1 − 0.9506)/5] ≈ −9.811

7. Make your decision.
   t ≈ −9.811 falls in the rejection region. Reject the null hypothesis.

8. Interpret your decision.
   There is a significant negative correlation between the number of times absent and final grades.
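A minimal sketch of the test statistic computed above, assuming r = −0.975 and n = 7.

```python
# t statistic for testing the significance of a correlation coefficient.
from math import sqrt

r, n = -0.975, 7
t_stat = r / sqrt((1 - r ** 2) / (n - 2))
print(f"t = {t_stat:.3f}")   # about -9.81; |t| > 4.032, so reject H0
```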
The Line of Regression

Regression indicates the degree to which the variation in one variable is related to, or can be explained by, the variation in another variable.
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.

The line of regression is: ŷ = mx + b

The slope m is: m = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]

The y-intercept is: b = ȳ − m x̄

(xi, yi) = a data point
(xi, ŷi) = a point on the line with the same x-value
yi − ŷi = a residual
Best-fitting straight line

[Figure: scatter plot of Ad $ (1.5–3.0) vs. revenue (180–260) with the best-fitting straight line drawn through the points]
Write the equation of the line of regression with x = number of absences and y = final grade. Calculate m and b using the computation table:

 i    x    y    xy    x²    y²
 1    8    78   624   64    6084
 2    2    92   184   4     8464
 3    5    90   450   25    8100
 4    12   58   696   144   3364
 5    15   43   645   225   1849
 6    9    74   666   81    5476
 7    6    81   486   36    6561
 Σ    57   516  3751  579   39898

m = [7(3751) − (57)(516)] / [7(579) − (57)²] = −3155/804 ≈ −3.924
b = ȳ − m x̄ = 73.714 − (−3.924)(8.143) ≈ 105.667

The line of regression is: ŷ = −3.924x + 105.667
The Line of Regression
m = −3.924 and b = 105.667
The line of regression is: ŷ = −3.924x + 105.667

[Scatter plot with regression line: Absences (0–16) vs. Final Grade (40–95)]

Note that the point (x̄, ȳ) = (8.143, 73.714) is on the line.
Predicting y Values
The regression line can be used to predict values of y for values of x
falling within the range of the data.

The regression equation for number of times absent and final grade is:
ŷ = −3.924x + 105.667
Use this equation to predict the expected grade for a student with
(a) 3 absences   (b) 12 absences

(a) ŷ = −3.924(3) + 105.667 = 93.895
(b) ŷ = −3.924(12) + 105.667 = 58.579
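A hedged sketch that fits the absences/grade line and reproduces both predictions, assuming NumPy is available; numpy.polyfit is one convenient way to get m and b.

```python
# Fit the line of regression and predict for 3 and 12 absences.
import numpy as np

absences = [8, 2, 5, 12, 15, 9, 6]
grades   = [78, 92, 90, 58, 43, 74, 81]

m, b = np.polyfit(absences, grades, 1)   # about -3.924 and 105.667
for x in (3, 12):
    print(f"{x} absences -> predicted grade {m * x + b:.3f}")
```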
Strength of the Association
The coefficient of determination, r2, measures the strength of the association and is the
ratio of explained variation in y to the total variation in y.

The correlation coefficient of number of times absent and final grade is r = −0.975.
The coefficient of determination is r² = (−0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained by the
number of times a student is absent. The other 5% is unexplained and can be due to
sampling error or other variables such as intelligence, amount of time studied, etc.
ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          F             p-value
Regression            SSR              1                    MSR = SSR/1          F = MSR/MSE
Error                 SSE              n − 2                MSE = SSE/(n − 2)
Total                 SST              n − 1
Residual Analysis
Residual analysis: the analysis of the residuals used to determine whether the assumptions made about the regression model appear to be valid. Residual analysis is also used to identify outliers and influential observations.
Outlier: a data point or observation that does not fit the trend shown by the remaining data.
Influential observation: an observation that has a strong influence or effect on the regression results.
Residual plot: a graphical representation of the residuals that can be used to determine whether the assumptions made about the regression model appear to be valid.
Residual Analysis
If the assumptions about the error term ε appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid. The residuals provide the best information about ε, and much of residual analysis is based on an examination of graphical plots.
Residual for observation i: yi − ŷi
If the assumption that the variance of ε is the same for all values of x is valid, and the assumed regression model is an adequate representation of the relationship between the variables, then the residual plot should give an overall impression of a horizontal band of points.
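A hedged sketch of a residual plot for the TV ads example, assuming Matplotlib is available; the labels and styling are illustrative choices, not from the slides.

```python
# Residual plot against x for the TV ads data, using y-hat = 10 + 5x.
import matplotlib.pyplot as plt

tv_ads    = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
residuals = [y - (10 + 5 * x) for x, y in zip(tv_ads, cars_sold)]

plt.scatter(tv_ads, residuals)
plt.axhline(0, linestyle="--")      # reference line at zero
plt.xlabel("TV Ads (x)")
plt.ylabel("Residual")
plt.title("Residual Plot Against x")
plt.show()
```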
Residual Plot Against x

[Residual plot: good pattern — residuals form a horizontal band around zero]

[Residual plot: nonconstant variance — the spread of the residuals changes with x]

[Residual plot: model form not adequate — the residuals show a curved pattern]
Using Excel to Produce a Residual Plot

The steps outlined earlier to obtain the regression
output are performed with one change.
When the Regression dialog box appears, we must
also select the Residual Plot option.
The output will include two new items:
A plot of the residuals against the
independent variable, and
A list of predicted values of y and the
corresponding residual values.
Residual Plot Against x

[TV Ads residual plot: residuals (−3 to 3) plotted against TV Ads (0–4)]
Standardized Residuals
Standardized residual: the value obtained by dividing a residual by its standard deviation.
Normal probability plot: a graph of the standardized residuals plotted against the values of the normal scores. This plot helps determine whether the assumption that the error term has a normal probability distribution appears to be valid.
High leverage points: observations with extreme values for the independent variables.
Standardized Residual
Standardized residual for observation i:
(yi − ŷi) / s_(yi − ŷi)

Standard deviation of the ith residual:
s_(yi − ŷi) = s √(1 − hi)

Leverage of observation i:
hi = 1/n + (xi − x̄)² / Σ(xi − x̄)²
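A minimal sketch of these formulas applied to the TV ads data, with s = √MSE carried over from earlier; the variable names are illustrative.

```python
# Leverage and standardized residuals for the TV ads example.
from math import sqrt

tv_ads    = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]

n = len(tv_ads)
x_bar = sum(tv_ads) / n
sxx = sum((x - x_bar) ** 2 for x in tv_ads)
s = sqrt(14 / (n - 2))                      # standard error of the estimate

for x, y in zip(tv_ads, cars_sold):
    h = 1 / n + (x - x_bar) ** 2 / sxx      # leverage h_i
    residual = y - (10 + 5 * x)             # y_i - y-hat_i
    std_resid = residual / (s * sqrt(1 - h))
    print(f"x={x}, residual={residual:+d}, leverage={h:.2f}, standardized={std_resid:+.2f}")
```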
