
10.0 Simple Linear Regression

One of the most important applications of statistics involves
estimating the mean value of a response variable Y, or predicting
some future value y, based on knowledge of a related variable X.
We refer to X as the independent variable that is used to predict
the dependent variable Y.

There are a good number of variables that we can easily
associate with one another, and have some sense of prediction:

 A person’s weight (Y, in kilograms) is related to his/her
height (X, in centimeters). If we know the person’s height,
we can predict his/her weight: taller people are heavier
than shorter ones, in most cases.
 The volume of water (X, in liters) in a kettle left to boil on a
stove determines how long (Y, in minutes) it takes before
all the water boils. A full kettle of tap water takes longer
to boil than a half-full one.
 The demand Y for a commodity, in units sold, is
inversely proportional to its price X. When department
stores go on “sale”, more people troop to buy things there
because the prices are cheaper. The same goes for new
cellphone models whose prices go down: more people buy
them.
 The amount of rainfall X is associated with the amount of
particulates Y removed from a surface exposed to rain.
More rain, more removed particulates.
 The density X of a particle board or plywood determines its
stiffness Y.
 The entrance exam final grade X of an applicant to DLSU-
Manila is a predictive indicator of his/her CGPA (variable Y)
upon graduation. Most universities have this predictive
model in use. (At least, DLSU-M and UP-Diliman do.)
 The advertising expense X incurred promoting a product
has a predictive relationship with its sales Y in pesos.
Within a certain range of advertising expense, the more spent on
advertising, the higher the total sales of the product. This
is like saying that if a TV advertisement for a product like
Kentucky Fried Chicken ran more often during times when
people are known to tune in to television, one could expect
that more people would eat at KFC.
 The number of sweet nothings that a guy does each day for
a girl he is courting would be a predictive indicator of how
long the courtship lasts before they go steady: more sweet
nothings per day, less time until steady. (OK, that’s a
stretch, and there are other variables to consider, such as
whether the girl already likes the guy in the first place,
whether he has pleasant features, whether he can really
make a girl laugh, or whether he is famous as an athlete or
an entertainment celebrity. But these factors are also
predictive independent variables for the dependent variable
called “courtship length.”)

How does one determine which variable is X and which variable
is Y? Practically speaking, X must be the variable that can be
easily measured, or else controlled by an experimenter, and Y
must be the variable of predictive interest that is thought to be
associated with X. That is, we want to predict Y because it is
desirable to know, and we use X to predict it because X is
hypothesized to be associated with Y.

Let’s use a fairly simple example: Height information can be


used to predict a person’s weight. We could collect a dozen males’
height and weight data:

Person  Height (in)  Weight (lbs)    Person  Height (in)  Weight (lbs)
1       60           105             7       65           127
2       61           110             8       66           134
3       63           115             9       67           145
4       64           120             10      67           138
5       64           118             11      68           150
6       65           124             12      72           136

We could reformat the table above so that we will have 12
pairs of x and y data points.

Let Xi be the height of the ith person, for i = 1, 2, ..., 12, and
Yi be the weight of the ith person.

Xi:  60  61  63  64  64  65  65  66  67  67  68  72
Yi: 105 110 115 120 118 124 127 134 145 138 150 136

Now, we choose height to be our random variable X
because height can be easily measured: with some practice, we
should be able to say with good accuracy what a person’s height
is just by sight. A person’s weight is harder to guess, but may be
thought to be associated with the varying value of height X. So
we will try to predict weight Y based on height X.
We could make an X-Y graph of these paired sets of data,
called a scatter plot, with height on the x-axis and weight on
the y-axis, as on the next page:

[Figure: scatter plot of Height and Weight; Height (inches, 59-73) on the x-axis, Weight (lbs, 100-155) on the y-axis]

From this scatter plot, we see that as a person’s height
increases, so does the weight; the two variables are directly
proportional.


We would now want to create an equation for weight Y as a
function of X:

Y = f(X)

And, based on the scatter plot shown, this equation can be
taken to be a line.

The equation for a line is of the form:

Y = a + bX

where: a = y-intercept (the value of Y when X is zero)
       b = slope of the line.

A linear regression model can be used to estimate the values
of a and b. The name “linear regression” means that the
collected data pairs are said to follow, or “regress back” to, the
predicted values given by the equation of a line. The linear
regression line is the line that minimizes the total squared
difference (or error) between the actual data and the prediction
equation. This linear regression line is thus also referred to as
the “least squares line.”

The equations for b and a in the least squares line Y = a + bX are as
follows:

b = [ n·Σxᵢyᵢ − (Σxᵢ)(Σyᵢ) ] / [ n·Σxᵢ² − (Σxᵢ)² ]

where each sum runs from i = 1 to n, and

a = (Σyᵢ)/n − b·(Σxᵢ)/n

Let’s use the height-weight data, reproduced here:

Xi:  60  61  63  64  64  65  65  66  67  67  68  72
Yi: 105 110 115 120 118 124 127 134 145 138 150 136

Some straightforward calculations may be performed on common
scientific calculators using the following keystroke guide:

Casio calculators: Mode LR or Mode REG>LIN
Sharp calculators: Mode StatXY

This puts your calculator into its linear regression mode.

Press Scl to clear the statistical memory. Verify that there are no
entries by pressing the keystrokes for n; the screen should show
zero (0). This varies across calculators, but your calculator should
have a “quick reference” card or info sheet stuck on its panel to
show you the keys.

To enter data, press Xi <comma> Yi DT, then repeat for each x-y
data pair. Some calculators use an XdYd key instead of the
comma (,) key.

Then press the keys for A and B to get the values for the least
squares line.

Some calculators make this fairly straightforward: Shift 7 and
Shift 8. Others use longer key sequences like Shift S-var, then
arrow keys until you see A and B (and r).

The final answers should be:

a = -111.2830
b = 3.6540

So the least squares line Y = a + bX is:

Y = -111.2830 + 3.6540 X

where Y = weight and X = height.
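The keystroke results can be cross-checked by evaluating the least squares formulas directly. Here is a minimal Python sketch of that computation (an illustration only; the lecture itself uses calculators):

```python
# Height-weight data from the table above
x = [60, 61, 63, 64, 64, 65, 65, 66, 67, 67, 68, 72]   # height (inches)
y = [105, 110, 115, 120, 118, 124, 127, 134, 145, 138, 150, 136]  # weight (lbs)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least squares slope and intercept from the formulas above
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(round(a, 4), round(b, 4))  # -111.283 3.654
```

This reproduces the calculator answers a = -111.2830 and b = 3.6540.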
[Figure: the linear regression line Y = -111.2830 + 3.6540X plotted against the scatter plot points; Height (inches) on the x-axis, Weight (lbs) on the y-axis]

We could now use this formula for Y to predict the weight of a
person who is 70 inches tall:

Y = f(70 inches)
Y = -111.2830 + 3.6540 (70)
  = 144.497 lbs        (On calculators, simply type 70 then ŷ.)

Or, equivalently, we could determine X from a value of Y. Say we
want to know the expected height of a person who weighs 130 lbs:

130 = -111.2830 + 3.6540 X        (keystrokes: 130 x̂)
X = 66.0326 inches, or about 5 ft 6 in
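Both directions of this calculation can be sketched as small Python helpers (the function names are illustrative, not from the notes):

```python
# Least squares line from the height-weight example above
a, b = -111.2830, 3.6540

def predict_weight(height):
    """Predict weight (lbs) from height (inches) using Y = a + bX."""
    return a + b * height

def predict_height(weight):
    """Invert the line to get height (inches) from weight (lbs)."""
    return (weight - a) / b

print(round(predict_weight(70), 3))   # 144.497
print(round(predict_height(130), 4))  # 66.0326
```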
Correlation: the strength of association between two
variables X and Y

To know whether X can be a good predictor of Y, we make use of
the correlation coefficient r as a measure of the relatedness of
the two variables.

Pearson’s correlation coefficient:

r = [ n·Σxᵢyᵢ − (Σxᵢ)(Σyᵢ) ] / sqrt( [ n·Σxᵢ² − (Σxᵢ)² ] · [ n·Σyᵢ² − (Σyᵢ)² ] )

with each sum running from i = 1 to n.

Range of values of r    Interpretation
0.7 < r < 1             Strong positive correlation
0 < r < 0.69            Weak positive correlation
-0.69 < r < 0           Weak negative correlation
-1 < r < -0.7           Strong negative correlation
+1 or -1                Perfect correlation
0                       No correlation

For our height-weight example:

r = 0.8383

indicating that the correlation between height and weight is
positive (directly proportional) and that this correlation is quite
strong (r > 0.7). This means that we can reliably predict weight Y
using height X and vice versa; this predictability works both ways.
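Recomputing r directly from the 12 data pairs, reusing the same sums that appear in the slope formula, can be sketched in Python as:

```python
import math

x = [60, 61, 63, 64, 64, 65, 65, 66, 67, 67, 68, 72]
y = [105, 110, 115, 120, 118, 124, 127, 134, 145, 138, 150, 136]

n = len(x)
# Numerator and the two denominator factors of Pearson's r
sxy = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
sxx = n * sum(xi ** 2 for xi in x) - sum(x) ** 2
syy = n * sum(yi ** 2 for yi in y) - sum(y) ** 2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # 0.8383
```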
[Figure: four example scatter plots illustrating weak positive, weak negative, strong positive, and strong negative correlation]

Practice Exercises:

1. An engineer wants to determine how much temperature affects
the average life of a component. He undertakes an experiment
using various temperatures and records the resulting lifetime of
the component:

Temp (°C):  28   29   30   32   33   35   38   42   46
Life (hrs): 1000 980  890  950  921  885  900  880  850

a. Fit a linear regression model to the data.
b. Determine the correlation coefficient between Temp X and Life
Y and state an interpretation of this number.
c. What would be the expected lifetime of the component if it is
kept at a constant 40 °C?
d. At what temperature would a component last a maximum of
900 hours?
2. A manufacturer of laundry detergent was interested in testing a
new product prior to market release. One area of concern was
the relationship between the height of the detergent suds in a
washing machine as a function of the amount of detergent added
in the wash cycle. For a standard size washing machine tub filled
to the full level, the manufacturer made random assignments of
amounts of detergent and tested them on the washing machine.
The data appear next:

Height Y (cm)    Amount X (spoonfuls)
28.1, 27.6       6
32.3, 33.2       7
34.8, 35.0       8
38.2, 39.4       9
43.5, 46.8       10

(Two suds-height observations were recorded at each amount of detergent.)

a. Fit a linear regression model to the data with repeated observations.


b. Determine the coefficient of correlation and state an interpretation
of this number.
c. If a standard washing machine filled to the full level has an
allowable suds height of only 38 cm from the full water level,
and the suds can rise an extra 10 cm before overflowing out of
the washing machine, how many spoonfuls of detergent should be
recommended as the maximum amount of detergent to be used?

3. An equal number of families from eight different cities of various sizes


were asked how much money they spend for food, clothing, and housing
per year. The city size and average family responses are summarized
below.

City size   30  50  75  100  150  200  175  120
Food        40  37  40   42   41   45   44   37
Clothing    10  20  20   15   16   12   14   10
Housing     15  20  19   23   26   28   26   24

City size in millions of people; all expenditures in thousands of US
dollars.
a. Fit a simple linear model relating city size and annual expenses
per family.
b. Using the correlation coefficient between city size and annual
expenses per family, state whether there is a strong or weak
correlation.
c. What would be the expected annual family expense in a city
of 65 million people?
d. Can city size predict food expenditure better than it predicts
annual family expenditure? Use the correlation coefficients
for your answer.
Other Linear Regression Models: Exponential Regression,
Power Regression.

Sometimes, the data that you have may not fit into a simple
linear model. However, if you transform the original data pair via
some function like finding its natural logarithm or its inverse, you can
transform data that is inherently not linear into linear values to fit into
our simple linear regression model.

Type of Regression   Estimation Model   Linear Format
Exponential Model    Y = a·e^(bX)       ln(Y) = ln(a) + bX
Power Regression     Y = a·X^b          ln(Y) = ln(a) + b·ln(X)
Inverse Regression   Y = a + b/X        Y = a + b·(1/X)

For each of these cases, the correlation coefficient should be applied to


the linearized data to measure the degree of association.
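To illustrate the transformation idea, here is a short Python sketch of fitting the exponential model Y = a·e^(bX) by regressing ln(Y) on X and transforming back. The data set is made up for demonstration (it is not from these notes):

```python
import math

# Illustrative data that grows roughly like e^x (hypothetical)
x = [1, 2, 3, 4, 5]
y = [2.7, 7.4, 20.1, 54.6, 148.4]

# Linearize: ln(Y) = ln(a) + bX, then apply the usual least squares formulas
ly = [math.log(v) for v in y]
n = len(x)
b = (n * sum(xi * li for xi, li in zip(x, ly)) - sum(x) * sum(ly)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
ln_a = sum(ly) / n - b * sum(x) / n
a = math.exp(ln_a)  # transform the intercept back to the original scale

print(round(a, 2), round(b, 2))  # close to 1.0 and 1.0 for this data
```

The correlation coefficient would then be computed on the pairs (x, ln y), as the text says.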

For some Casio™ calculators, these regression functions are
already built into the regression mode options.

Practice Problems (from Hayter, Probability and Statistics for
Engineers and Scientists, Duxbury, 2002, pp. 656-657):

1. Make a plot of the following data set. What intrinsically
linear function would provide a good model for this data set?
Fit a straight line to the transformed variables and write the
fitted model back in terms of the original variables. What is the
predicted value of the dependent variable y when x = 2.0?

X: -2.0  -0.4  1.5   2.4   2.7   3.5   4.6   5.3   5.8   6.4   6.8
Y:  5.3   8.8  13.3  17.9  18.9  24.4  28.3  34.0  44.0  55.1  72.2

2. A bioengineer measures the growth rate of a substance by
counting the number of cells N present at various times t, as
shown in the following data table:

t:  1   2   3   4   5    6    7    8
N: 12  31  42  75  119  221  327  546

Fit the model N = A·e^(Bt), exhibiting exponential growth, to the
data, and show how correlated the transformed N is with t.

3. In an experiment to investigate the suitability of using a
silicon tube to model the behavior of a human artery, the
following data set was collected, relating the pressure
differential P to the cross-sectional area X.

P:  2     4     7     11    13    21    32    48    64    91
X:  0.5   0.54  0.57  0.65  0.69  0.73  0.78  0.85  0.97  1.04

Show that a model P = A·X^b appears to provide a good fit to
the data set.
Multiple Linear Regression

In most research problems where regression analysis is applied,


more than one independent variable is needed in the regression
model. When this model is linear in the coefficients, it is called a
multiple linear regression (MLR) model. For the case of k independent
variables X1, X2 …Xk, the estimated response is obtained from the
sample regression equation:

ŷ = b0 + b1X1 + ... + bkXk

This model is simply an extension of the simple linear regression
(LR) formula y = a + bx, where the y-intercept term “b0” in the MLR
model corresponds to the y-intercept term “a” in the simple LR
model. Furthermore, the multiple LR model has several biXi terms
instead of just one, as in the simple LR model.

The computational effort to find the values of the bi coefficients
is considerable, requiring matrix algebra and the inversion of a
matrix. The reader is referred to common statistical texts such
as Introduction to Probability and Statistics: Principles and
Applications for Engineering and the Computing Sciences (fourth
edition) by J. Susan Milton and Jesse C. Arnold, McGraw-Hill
(www.mhhe.com), or Probability and Statistics for Engineers and
Scientists by Walpole, Myers, Myers and Ye.

What this lecture note shows instead is how to use Microsoft™
Excel worksheets to compute these coefficients, as well as how to
determine which subset of the available variables Xi should be
included in a multiple linear regression model.
Let’s say that a country’s GNP is thought to be predicted by
three indicator variables: total consumption in the capital city X1,
total investments made by the citizens X2, and the city’s
government expenditure X3.

The following table shows the values of each variable Xi and the
true GNP during that year.
X1 X2 X3 Y (GNP)
50 10 100 330
50 20 150 260
50 30 200 290
50 40 280 306
70 50 240 300
70 70 350 260
80 80 200 200
80 90 750 520

Procedure to get linear regression coefficients b0, b1, b2, ..., bk
using Excel:

1. Load Excel on your computer and type the data into the
worksheet.
2. Go to the menu item Tools > Data Analysis, and choose
Regression from the Data Analysis dialogue box.
3. Input the range of values for the Y column by highlighting
cells D1:D9, and input the range of values for the X columns
by highlighting cells A1:C9. Check the box called “Labels”.
Select Output Range and click a cell with enough empty rows
below it in the worksheet for the output. At this point your
filled-in dialogue box should look like the one shown (on the
next page).
4. Press OK and you should get the report shown after the
dialogue box on the next page:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.974163
R Square 0.948993
Adjusted R Square 0.910738
Standard Error 28.13851
Observations 8

ANOVA
df SS MS F Significance F
Regression 3 58924.4 19641.47 24.80685 0.004794
Residual 4 3167.103 791.7758
Total 7 62091.5

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 62.341 94.4772 0.659852 0.545407 -199.97 324.6523
X1 4.454482 2.239976 1.98863 0.117635 -1.7647 10.67366
X2 -4.63867 1.271412 -3.64844 0.021801 -8.16868 -1.10866
X3 0.682428 0.082945 8.227451 0.00119 0.452135 0.912721
The coefficients are shown in this part of the report.
Therefore, for this model that uses X1, X2 and X3 as input
variables, our prediction model for Y (GNP) can be written as:

Y = 62.341 + 4.454482 X1 - 4.63867 X2 + 0.682428 X3

Quite easily done.
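The same coefficients can be reproduced outside Excel by solving the normal equations (XᵀX)b = XᵀY. Here is a pure-Python sketch using the GNP table above (an illustration of what Excel’s Regression tool computes, not part of the original procedure):

```python
# GNP data from the table above
X1 = [50, 50, 50, 50, 70, 70, 80, 80]
X2 = [10, 20, 30, 40, 50, 70, 80, 90]
X3 = [100, 150, 200, 280, 240, 350, 200, 750]
Y  = [330, 260, 290, 306, 300, 260, 200, 520]

# Design matrix with an intercept column of ones
rows = [[1.0, u, v, w] for u, v, w in zip(X1, X2, X3)]
k = 4

# Build the normal equations (XtX) b = (XtY)
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
XtY = [sum(r[i] * yv for r, yv in zip(rows, Y)) for i in range(k)]

# Gaussian elimination with partial pivoting on the augmented matrix
A = [XtX[i] + [XtY[i]] for i in range(k)]
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(col + 1, k):
        f = A[r][col] / A[col][col]
        A[r] = [p - f * q for p, q in zip(A[r], A[col])]

# Back substitution
beta = [0.0] * k
for i in reversed(range(k)):
    beta[i] = (A[i][k] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]

print([round(c, 4) for c in beta])  # close to [62.341, 4.4545, -4.6387, 0.6824]
```

The printed coefficients match the Intercept, X1, X2 and X3 values in the Excel report.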
How to Interpret the figures on the Regression Report:

 What proportion of the total variation in Y can be
explained by the model?
This is the value of R Square; R² is called the coefficient of
determination.

From the Example:


Regression Statistics
Multiple R 0.974163
R Square 0.948993
Adjusted R Square 0.910738
Standard Error 28.13851
Observations 8

94.8993% of the variation in Y can be explained by the model using
X1, X2 and X3.

 To test whether the model containing all the included
variables is adequate to explain the variation in Y: check that
the ANOVA “Significance F” value is below your chosen level of
significance α (usually referred to as alpha). For most
statistical tests, α = 0.05 or less. The level of significance is the
error probability allowed for the test, so if the Significance F
value falls below the set value of α, there is only a small chance
that the model is not adequate. The confidence level is the
complement of alpha, (1 − α).

From the example: Is the model adequate at 5% level of significance?


ANOVA
df SS MS F Significance F
Regression 3 58924.4 19641.47 24.80685 0.004794
Residual 4 3167.103 791.7758
Total 7 62091.5

Here, we can see that Significance F is 0.004794, which is
well below the value of α = 0.05. Therefore, the model with the
three variables X1, X2 and X3 is adequate to explain the variation
in GNP Y; that is, one can use the model with the coefficients
shown to predict the changing/varying values of Y.

 To determine which set of variables Xi should be included
in a model that is adequate but uses the least number of
variables (model selection):

Look at the p-values in the bottom (third) table of the report;
all p-values must be below the significance level α.

If any variable’s p-value is above α, that variable is not
significant enough to explain the variation of Y. The variable
among X1, X2, ..., Xk whose p-value is highest (and above α)
must be eliminated first. Redo the regression in Excel using the
reduced set of variables Xi, and check the new p-values.
Iteratively eliminate the variables whose p-values are above α.
When all p-values of the remaining independent variables Xi are
below α, stop. The resulting multiple linear regression model
should be efficient and complete.

From the Example: Which of the variables X1, X2 and X3 must be
retained in an efficient model to predict Y (i.e.,
efficient = least number of significant variables) at a
5% level of significance?

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 62.341 94.4772 0.659852 0.545407 -199.97 324.6523
X1 4.454482 2.239976 1.98863 0.117635 -1.7647 10.67366
X2 -4.63867 1.271412 -3.64844 0.021801 -8.16868 -1.10866
X3 0.682428 0.082945 8.227451 0.00119 0.452135 0.912721

Since the p-value of variable X1 (total consumption in the
capital city) is 0.117635 > 0.05, X1 must be eliminated from the
model. Redo the regression using X2 and X3 as the input
X-values.

A resulting table should look like the report below:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.947926
R Square 0.898564
Adjusted R Square 0.85799
Standard Error 35.49168
Observations 8

ANOVA
df SS MS F Significance F
Regression 2 55793.2 27896.6 22.14614 0.003277
Residual 5 6298.298 1259.66
Total 7 62091.5

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 245.7008 25.9829 9.456247 0.000223 178.9097 312.4918
X2 -2.35443 0.687503 -3.42462 0.018744 -4.12171 -0.58716
X3 0.624944 0.098062 6.37296 0.001407 0.372869 0.87702

Since all p-values are below 0.05, we stop. The efficient model
to predict Y contains X2 and X3, and the model is:
Y = 245.7008 - 2.35443 X2 + 0.624944 X3.
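For a model with only two predictors, the normal equations can even be solved in closed form. A Python sketch using the same data (again an illustration of Excel’s computation, not a step in the lecture’s procedure):

```python
# Reduced model with X2 and X3 only
X2 = [10, 20, 30, 40, 50, 70, 80, 90]
X3 = [100, 150, 200, 280, 240, 350, 200, 750]
Y  = [330, 260, 290, 306, 300, 260, 200, 520]

n = len(Y)
m2, m3, my = sum(X2) / n, sum(X3) / n, sum(Y) / n

# Centered sums of squares and cross-products
s22 = sum((v - m2) ** 2 for v in X2)
s33 = sum((v - m3) ** 2 for v in X3)
s23 = sum((u - m2) * (v - m3) for u, v in zip(X2, X3))
s2y = sum((u - m2) * (yv - my) for u, yv in zip(X2, Y))
s3y = sum((v - m3) * (yv - my) for v, yv in zip(X3, Y))

# Cramer's rule on the 2x2 normal equations
det = s22 * s33 - s23 ** 2
b2 = (s2y * s33 - s3y * s23) / det
b3 = (s3y * s22 - s2y * s23) / det
b0 = my - b2 * m2 - b3 * m3

print(round(b0, 4), round(b2, 5), round(b3, 6))
# close to 245.7008, -2.35443, 0.624944 (the Excel report values)
```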

Practice Exercise:

Find the model that uses the least number of Xi variables to
predict Y. Use α = 0.01.

X1 X2 X3 X4 X5 X6 Y
7 9 10 12 13 22 995
6 10 18 18 19 25 1325
8 9 7 10 17 26 1452
6 7 9 25 37 27 1735
7 8 10 35 22 28 2188
8 9 11 18 15 29 1435
9 7 6 51 18 30 2980
10 6 8 16 41 32 1470
9 2 17 28 36 35 1240
