Simple Regression
Consider the following example comparing the returns of Consolidated Moose Pasture stock (CMP) and the TSX 300 Index. The next slide shows 25 monthly returns.
Regression studies the relationship between variables; specifically between a dependent variable (Y) and independent variables (x1, x2, etc.). For example:
The effect of years of education on income
The effect of engine size on gas mileage
The effect of house size on price
[Table of 25 monthly returns and scatter plots: CMP returns vs. TSX returns]
Most of the time when the TSX is up, CMP is up. Likewise, when the TSX is down, CMP is down most of the time. Sometimes, they move in opposite directions.
When points in the upper-right and lower-left quadrants dominate, the sum of the products of the deviations will be positive. When points in the lower-right and upper-left quadrants dominate, the sum of the products of the deviations will be negative.
Both have means of zero and standard deviations just under 3. However, each data point does not have simply one deviation from the mean; it deviates from both means. Consider points A, B, C and D on the next graph.
Statistics for Management Decisions

Covariance

An Important Observation
The sums of the products of the deviations will give us the appropriate sign of the slope of our relationship
Sample covariance: s_xy = Σ(xi − x̄)(yi − ȳ) / (n − 1)

Covariance is in the same units as variance (if both variables are in the same unit, i.e. units squared). It is a very important element of measuring portfolio risk in finance. Unfortunately, it is hard to interpret, for two reasons: it is affected by the size and scale of the variables, and it carries their units.
The Correlation Coefficient

A More Useful Statistic
The correlation coefficient measures the strength of the linear relationship between two variables.
We can simultaneously adjust for both of these shortcomings by dividing the covariance by the two relevant standard deviations. This operation:
Removes the impact of size & scale
Eliminates the units
Population: ρ = σ_xy / (σ_x σ_y)
Sample: r = s_xy / (s_x s_y)
Both variables move together, either in the same direction or in opposite directions (e.g. when one goes up, so does the other).
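The operation above (covariance divided by both standard deviations) can be sketched in a few lines of Python. The numbers here are made up for illustration, not data from the slides:

```python
import statistics

# Hypothetical sample data (illustration only)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Sample covariance: average product of the deviations (n - 1 denominator)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Dividing by both standard deviations removes scale and units,
# leaving the unit-free correlation coefficient r
r = cov / (statistics.stdev(x) * statistics.stdev(y))

print(cov, r)
```

Note that `r` is unit-free: rescaling x or y (say, dollars to thousands of dollars) changes the covariance but leaves r unchanged.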
Correlation coefficient

Example
  X     Y     X²     Y²     XY
 20    17    400    289    340
 22    21    484    441    462
 14    10    196    100    140
 18    10    324    100    180
 24    18    576    324    432
 18    12    324    144    216
 20    16    400    256    320
---------------------------------
136   104   2704   1654   2090
Create a scatter plot: what type of relationship exists? Compute the correlation coefficient. Test the significance of the correlation coefficient at the 0.05 level.
[Scatter plot of Y vs. X for the example data]
Correlation vs. Regression

Significance
Correlation indicates a relation between two variables; regression indicates causality between an independent and a dependent variable.

Test statistic (n − 2 degrees of freedom), estimated from the sample:

t = r √(n − 2) / √(1 − r²)
Changes in the independent variables are those causing the change in the dependent variable. We'll start with simple regression (one independent variable) and then look at multiple variables.
Simple regression
3.563 > 2.571 (the critical t value, 5 df), so we reject the null hypothesis and conclude that the correlation coefficient is significant (significantly different from 0).
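As a check, the correlation coefficient and the t statistic for the example data (ΣX = 136, ΣY = 104, n = 7) can be computed directly. A Python sketch:

```python
import math

# Data from the worked example (n = 7)
x = [20, 22, 14, 18, 24, 18, 20]
y = [17, 21, 10, 10, 18, 12, 16]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)      # ΣX² = 2704
syy = sum(v * v for v in y)      # ΣY² = 1654
sxy = sum(a * b for a, b in zip(x, y))  # ΣXY = 2090

# Correlation coefficient from the computational formula
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# t statistic with n - 2 = 5 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# r is about 0.847; t is about 3.56 (the slide reports 3.563, using r
# rounded to 0.847), which exceeds the 2.571 critical value
print(round(r, 3), round(t, 2))
```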
[Scatter plot: consumption vs. income]
Premise: there is a true relationship between income and consumption. This relationship can be described in a linear form:

consumption = β0 + β1 · income + ε

Or more generally:

y = β0 + β1x + ε
Simple Linear Regression Model

β1 = slope (= rise/run)
β0 = y-intercept

Note that both β0 and β1 are population parameters which are usually unknown and hence estimated from the data.

Is there indeed a significant effect? What is the magnitude of this effect? (We limit our discussion to linear effects.)

We'll create and test a regression model of the relationship between consumption and income.
A good regression line will be the one that minimizes the total of the squared errors (SSE).
[Scatter plot: consumption vs. income with the fitted regression line]
With simple linear regression we try to capture the true relationship between the two variables with a single line. The estimated regression model is:

ŷ = b0 + b1x
More formally
No line can hit all the points in the scatter plot, or even most of the points. The amount we miss by is called the error, or residual. It is the difference between the predicted value (from the regression line) and the true value.
Regression is a statistical technique for determining the best-fit line through a series of data. The regression line is the unique line that minimizes the total of the squared deviations (or errors).
b1 = (ΣXY − ΣX·ΣY / n) / (ΣX² − (ΣX)² / n)

b0 = ȳ − b1·x̄
The statistical term is Sum of Squared Errors, or SSE. This line is called the least squares line.
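The least squares formulas can be sketched in Python. The data below are made up for illustration (not from the slides):

```python
# Hypothetical data to illustrate the least squares formulas
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(v * v for v in x)

# Slope: b1 = (ΣXY − ΣXΣY/n) / (ΣX² − (ΣX)²/n)
b1 = (sxy - sx * sy / n) / (sxx - sx**2 / n)
# Intercept: b0 = ȳ − b1·x̄
b0 = sy / n - b1 * sx / n

print(b0, b1)  # the least squares line for this toy data
```

For this toy data the fitted line is roughly ŷ = 0.05 + 1.99x; by construction no other line has a smaller sum of squared errors.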
Obs.             1   2   3   4   5   6   7   8   9  10   Total
# of sites (X)  13  10  13   8   5   7   4   8   7   3   ΣX = 78
Required conditions for the error variable ε:
The probability distribution of ε is normal
E(ε) = 0
σε is constant and independent of x, the independent variable
The value of ε associated with any particular value of y is independent of the value of ε associated with any other value of y
The Harris Corporation has recently done a study of homes that have sold in the Detroit area within the past 18 months. Data were recorded for the asking price (x) and the number of weeks (y) each home was on the market before it sold.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7059
R Square             0.4984
Adjusted R Square    0.4649
Standard Error       11.9642
Observations         17

ANOVA
              df         SS          MS          F        Significance F
Regression     1    2133.1116   2133.1116   14.9021      0.0015
Residual      15    2147.1236    143.1416
Total         16    4280.2353

               Coefficients   Standard Error    t Stat    P-value
Intercept      -16.22506178   12.20252667      -1.3296    0.2035
Asking Price     0.000528163   0.000136818      3.8603    0.0015
In Excel: Tools > Data Analysis > choose Regression from the dialogue box menu.
ŷ = -16.2251 + 0.00053x
The hypotheses:
H0: β1 = 0
HA: β1 ≠ 0

We follow a t-test, with b1 estimated from the sample:

t = (b1 − β1) / s_b1
The standard error of the estimate
The standard error of the estimate (Se or SEE) measures how the data vary around the regression line.
t = (0.000528163 − 0) / 0.000136818 = 3.8603

Compare with the critical value t[α/2, (n−2) df].
We would like Se to be small: the smaller it is, the larger the t-statistic is, and the more likely we are to reject the null hypothesis that the slope is zero.
Se = SEE = √(SSE / (n − 2))
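Plugging the ANOVA numbers from the Excel output into this formula reproduces the reported Standard Error. A quick check in Python:

```python
import math

# Values from the Excel ANOVA table
sse = 2147.123648   # Residual (error) sum of squares
n = 17              # observations

# Standard error of the estimate: root of SSE over its degrees of freedom
see = math.sqrt(sse / (n - 2))
print(round(see, 4))  # 11.9642, matching the reported Standard Error
```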
If we wish to test for positive or negative linear relationships we conduct one-tail tests, i.e. our research hypothesis becomes H1: β1 < 0 (testing for a negative slope) or H1: β1 > 0 (testing for a positive slope).
t = b1 / s_b1
The regression output (Excel)

Is the slope different from 1?
The hypotheses:
H0: β1 = 1
HA: β1 ≠ 1
t = (0.000528163 − 1) / 0.000136818 ≈ −7305
Significant model
We need to see that at least one of our independent variables has a significant effect. Note: we only have b1, so this test should give us the same results as the previous t-test (and we'll see that it does).
Coefficient of Determination

Symmetry in Testing
As we did with analysis of variance, we can partition the variation in y into two parts:
SSE (Sum of Squares Error) measures the amount of variation in y that remains unexplained (i.e. due to error)
SSR (Sum of Squares Regression) measures the amount of variation in y explained by variation in the independent variable x
R² = SSR / SST = 2133.111647 / 4280.235294 = 0.4984
We would like to see high values (1 is the highest). Note: for simple regression, R-squared is the square of the correlation coefficient (r): R² = (r)².
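Both numbers in the Excel output can be reproduced from the ANOVA sums of squares. A quick check in Python:

```python
import math

# Sums of squares from the ANOVA table
ssr = 2133.111647   # Regression SS
sst = 4280.235294   # Total SS

r2 = ssr / sst       # coefficient of determination
r = math.sqrt(r2)    # for simple regression, |r| = sqrt(R²) = Multiple R

print(round(r2, 4), round(r, 4))  # 0.4984 and 0.7059, matching the output
```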
Suppose you wanted to know how many weeks it would take to sell a house priced at $100,000. The regression equation was: ŷ = -16.2251 + 0.00053x
Substitute x = 100,000: ŷ = -16.2251 + 0.00053·(100,000) = 36.7749 weeks. Important side note: pay attention to the units of measurement in the data.
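The same substitution in Python, using both the full-precision coefficients from the output and the rounded equation from the slide (the small difference is purely rounding of b1):

```python
# Coefficients from the regression output (full precision)
b0 = -16.22506178
b1 = 0.000528163

x = 100_000
y_hat = b0 + b1 * x                       # prediction with full-precision coefficients
y_hat_rounded = -16.2251 + 0.00053 * x    # using the rounded slide equation

print(y_hat, y_hat_rounded)  # about 36.59 vs. 36.7749 weeks
```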
Prediction Interval

Scatter plot
Prediction interval for an individual value of y (textbook form):

ŷ ± t(α/2, n−2) · Se · √(1 + 1/n + (x_g − x̄)² / Σ(xi − x̄)²)

(or, in the derived computational form, with Σ(xi − x̄)² written as Σxi² − n·x̄²)

[Scatter plot of the data with the fitted line]
[Plot: prediction interval bands, number of weeks vs. price; at x_p = 100,000, ŷ = 36.59]
Suppose I own several properties in Detroit and price them all at $100,000. What is the expected number of weeks for selling these homes? Instead of predicting an individual value, I am asking for an expected value (i.e. the mean number of weeks).

We can use a confidence interval for the estimation of the mean. The distinction between confidence interval and prediction interval is similar to the difference between the CI of the mean vs. the CI of an individual value.
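Both intervals can be computed with one helper, since they differ only by the extra "1 +" under the square root. This is a sketch, not textbook notation: the caller supplies the critical t value (e.g. from a t-table), and the data, names, and coefficients below are hypothetical:

```python
import math

def interval(x_g, x, y, b0, b1, t_crit, prediction=True):
    """Prediction interval (individual y) or, with prediction=False,
    confidence interval (expected y) at x = x_g.
    The only difference is the extra '1 +' under the square root."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(e * e for e in resid) / (n - 2))  # standard error of estimate
    extra = 1.0 if prediction else 0.0
    half = t_crit * se * math.sqrt(extra + 1 / n + (x_g - xbar) ** 2 / sxx)
    y_hat = b0 + b1 * x_g
    return y_hat - half, y_hat + half

# Toy data (hypothetical); b0 = 1.8, b1 = 0.8 is the least squares fit here
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
pi = interval(3, x, y, 1.8, 0.8, 3.182, prediction=True)   # 3.182 = t(0.025, 3 df)
ci = interval(3, x, y, 1.8, 0.8, 3.182, prediction=False)
print(pi, ci)  # the prediction interval is wider than the confidence interval
```

Both intervals are centered on ŷ; the prediction interval is always the wider one because it must also cover the scatter of an individual observation around the mean.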
Confidence interval for the expected value of y:

ŷ ± t(α/2, n−2) · Se · √(1/n + (x_g − x̄)² / Σ(xi − x̄)²)
Note: point, prediction and confidence intervals in Excel are obtained via Add-Ins > Data Analysis Plus > Prediction Interval.
Both intervals are curved, becoming narrower around the average value of x (x̄). The closer x_g is to x̄, the better our estimate and thus the narrower the interval.
Recall that the deviations between the actual data points and the regression line are called residuals. Excel calculates residuals as part of its regression analysis:
Price    Weeks   Predicted   Residual   Standard Residual
60000    18      15.4647      2.5353     0.2301
87000    25      29.7251     -4.7251    -0.4071
94000    62      33.4223     28.5777     2.4715
76000    33      23.9153      9.0847     0.7889
(excerpt of the residual output)

We can use these residuals to determine whether the error variable is nonnormal, whether the error variance is constant, and whether the errors are independent.
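The predicted values and residuals in the excerpt can be reproduced from the fitted coefficients reported in the Excel output. A Python sketch:

```python
# Fitted coefficients from the Harris Corporation regression output
b0 = -16.22506178
b1 = 0.000528163

# (price, weeks) observations from the residual excerpt
data = [(60000, 18), (87000, 25), (94000, 62), (76000, 33)]

rows = []
for price, weeks in data:
    predicted = b0 + b1 * price      # point on the regression line
    residual = weeks - predicted     # actual minus predicted
    rows.append((price, round(predicted, 4), round(residual, 4)))

for row in rows:
    print(row)
```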
Nonnormality

We can take the residuals and put them into a histogram:
There are three conditions that are required in order to perform a regression analysis:
The error variable must be normally distributed
The error variable must have a constant variance
The errors must be independent of each other
How can we diagnose violations of these conditions? Residual analysis: that is, examine the differences between the actual data points and those predicted by the linear equation.
We're looking for a bell-shaped histogram with the mean close to zero.
Nonindependence of the Error Variable (for time series data; not covered in this course)

Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
If we were to observe the number of weeks houses stay on the market over many weeks, for, say, a year, that would constitute a time series.
When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
If the variance of the error variable (σε²) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted values of y:
There doesn't appear to be a change in the spread of the plotted points; therefore, there is no heteroscedasticity.
Outliers

Outliers: our example
An outlier is an observation that is unusually small or unusually large. E.g. in our houses example the prices range from $53,000 to $133,000. Suppose we had a value of $1,000,000: this point is an outlier.
Possible reasons for the existence of outliers include:
1. There was an error in recording the value
2. The point should not have been included in the sample
3. Perhaps the observation is indeed valid
Outliers can be easily identified from a scatter plot. If the absolute value of the standard residual is > 2, we suspect the point may be an outlier and investigate further. Outliers need to be dealt with, since they can easily influence the least squares line.
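The "|standard residual| > 2" screening rule is easy to automate. A sketch, using hypothetical values that loosely echo the standard residuals from the residual output (signs inferred):

```python
# Hypothetical standard residuals, e.g. from Excel's residual output
std_residuals = [0.23, -0.41, 2.47, 0.79, -1.75]

# Flag observations whose standard residual exceeds 2 in absolute value
suspects = [i for i, sr in enumerate(std_residuals) if abs(sr) > 2]
print(suspects)  # [2]: that observation warrants further investigation
```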
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions (normality, homoscedasticity, independence).
6. Assess the model's fit (t-test for the slope, the overall F-ratio, R²).
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable (confidence/prediction intervals).