Simple Regression
Consider the following example comparing the returns of Consolidated Moose Pasture stock (CMP) and the TSX 300 Index. The next slide shows 25 monthly returns.
Regression studies the relationship between variables; specifically between a dependent variable (Y) and independent variables (x1, x2, etc.). For example:
The effect of years of education on income
The effect of engine size on gas mileage
The effect of house size on price
[Table of 25 monthly returns and scatter plots: CMP returns vs. TSX returns]
Most of the time when the TSX is up, CMP is up. Likewise, when the TSX is down, CMP is down most of the time. Sometimes, they move in opposite directions.
When points in the upper-right and lower-left quadrants dominate, the sum of the products of the deviations will be positive. When points in the lower-right and upper-left quadrants dominate, the sum of the products of the deviations will be negative.
Both have means of zero and standard deviations just under 3. However, each data point does not have simply one deviation from the mean; it deviates from both means. Consider points A, B, C and D on the next graph.
Statistics for Management Decisions

Covariance

An Important Observation
The sums of the products of the deviations will give us the appropriate sign of the slope of our relationship
Sample covariance: s_xy = Σ(xi − x̄)(yi − ȳ) / (n − 1)

Covariance is in the same units as variance (if both variables are in the same unit, i.e. units squared). It is a very important element of measuring portfolio risk in finance. Unfortunately, it is hard to interpret, for two reasons: it is affected by the size and scale of the variables, and it carries their units.
The Correlation Coefficient

A More Useful Statistic
The correlation coefficient measures the strength of the linear relationship between two variables.
We can simultaneously adjust for both of these shortcomings by dividing the covariance by the two relevant standard deviations. This operation:
Removes the impact of size & scale
Eliminates the units
Population: ρ = σ_xy / (σ_x σ_y)
Sample: r = s_xy / (s_x s_y)
Both variables move together, either in the same direction or in opposite directions (e.g. when one goes up, so does the other).
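The operation above (covariance divided by both standard deviations) can be sketched in a few lines of Python. The numbers here are made up for illustration, not data from the slides:

```python
import statistics

# Hypothetical sample data (illustration only)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Sample covariance: average product of the deviations (n - 1 denominator)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Dividing by both standard deviations removes scale and units,
# leaving the unit-free correlation coefficient r
r = cov / (statistics.stdev(x) * statistics.stdev(y))

print(cov, r)
```

Note that `r` is unit-free: rescaling x or y (say, dollars to thousands of dollars) changes the covariance but leaves r unchanged.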
Correlation coefficient

Example
  X     Y     X²     Y²     XY
 20    17    400    289    340
 22    21    484    441    462
 14    10    196    100    140
 18    10    324    100    180
 24    18    576    324    432
 18    12    324    144    216
 20    16    400    256    320
---------------------------------
136   104   2704   1654   2090
Create a scatter plot: what type of relationship exists? Compute the correlation coefficient. Test the significance of the correlation coefficient at the 0.05 level.
[Scatter plot of Y vs. X for the example data]
Correlation vs. Regression

Significance
Correlation indicates a relation between two variables; regression indicates causality between an independent and a dependent variable.

Test statistic (n − 2 degrees of freedom), estimated from the sample:

t = r √(n − 2) / √(1 − r²)
Changes in the independent variables are those causing the change in the dependent variable. We'll start with simple regression (one independent variable) and then look at multiple variables.
Simple regression
3.563 > 2.571 (the critical t value, 5 df), so we reject the null hypothesis and conclude that the correlation coefficient is significant (significantly different from 0).
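As a check, the correlation coefficient and the t statistic for the example data (ΣX = 136, ΣY = 104, n = 7) can be computed directly. A Python sketch:

```python
import math

# Data from the worked example (n = 7)
x = [20, 22, 14, 18, 24, 18, 20]
y = [17, 21, 10, 10, 18, 12, 16]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)      # ΣX² = 2704
syy = sum(v * v for v in y)      # ΣY² = 1654
sxy = sum(a * b for a, b in zip(x, y))  # ΣXY = 2090

# Correlation coefficient from the computational formula
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# t statistic with n - 2 = 5 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# r is about 0.847; t is about 3.56 (the slide reports 3.563, using r
# rounded to 0.847), which exceeds the 2.571 critical value
print(round(r, 3), round(t, 2))
```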
[Scatter plot: consumption vs. income]
Premise: there is a true relationship between income and consumption. This relationship can be described in a linear form:

consumption = β0 + β1 · income + ε

Or more generally:

y = β0 + β1x + ε
Simple Linear Regression Model

β1 = slope (= rise/run)
β0 = y-intercept

Note that both β0 and β1 are population parameters which are usually unknown and hence estimated from the data.

Is there indeed a significant effect? What is the magnitude of this effect? (We limit our discussion to linear effects.)

We'll create and test a regression model of the relationship between consumption and income.
A good regression line will be the one that minimizes the total of the squared errors (SSE).
[Scatter plot: consumption vs. income with the fitted regression line]
With simple linear regression we try to capture the true relationship between the two variables with a single line. The estimated regression model is:

ŷ = b0 + b1x
More formally
No line can hit all the points in the scatter plot, or even most of the points. The amount we miss by is called the error, or residual. It is the difference between the predicted value (from the regression line) and the true value.
Regression is a statistical technique for determining the best-fit line through a series of data. The regression line is the unique line that minimizes the total of the squared deviations (or errors).
b1 = (ΣXY − ΣX·ΣY / n) / (ΣX² − (ΣX)² / n)

b0 = ȳ − b1·x̄
The statistical term is Sum of Squared Errors, or SSE. This line is called the least squares line.
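The least squares formulas can be sketched in Python. The data below are made up for illustration (not from the slides):

```python
# Hypothetical data to illustrate the least squares formulas
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(v * v for v in x)

# Slope: b1 = (ΣXY − ΣXΣY/n) / (ΣX² − (ΣX)²/n)
b1 = (sxy - sx * sy / n) / (sxx - sx**2 / n)
# Intercept: b0 = ȳ − b1·x̄
b0 = sy / n - b1 * sx / n

print(b0, b1)  # the least squares line for this toy data
```

For this toy data the fitted line is roughly ŷ = 0.05 + 1.99x; by construction no other line has a smaller sum of squared errors.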
Obs.             1   2   3   4   5   6   7   8   9  10   Total
# of sites (X)  13  10  13   8   5   7   4   8   7   3   ΣX = 78
Required conditions for the error variable ε:
The probability distribution of ε is normal
E(ε) = 0
σε is constant and independent of x, the independent variable
The value of ε associated with any particular value of y is independent of the value of ε associated with any other value of y
The Harris Corporation has recently done a study of homes that have sold in the Detroit area within the past 18 months. Data were recorded for the asking price (x) and the number of weeks (y) each home was on the market before it sold.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7059
R Square             0.4984
Adjusted R Square    0.4649
Standard Error       11.9642
Observations         17

ANOVA
              df         SS          MS          F        Significance F
Regression     1    2133.1116   2133.1116   14.9021      0.0015
Residual      15    2147.1236    143.1416
Total         16    4280.2353

               Coefficients   Standard Error    t Stat    P-value
Intercept      -16.22506178   12.20252667      -1.3296    0.2035
Asking Price     0.000528163   0.000136818      3.8603    0.0015
In Excel: Tools > Data Analysis > choose Regression from the dialogue box menu.
ŷ = -16.2251 + 0.00053x
The hypotheses:
H0: β1 = 0
HA: β1 ≠ 0

We follow a t-test, with b1 estimated from the sample:

t = (b1 − β1) / s_b1
The standard error of the estimate
The standard error of the estimate (Se or SEE) measures how the data vary around the regression line.
t = (0.000528163 − 0) / 0.000136818 = 3.8603

Compare with the critical value t[α/2, (n−2) df].
We would like Se to be small: the smaller it is, the larger the t-statistic is, and the more likely we are to reject the null hypothesis that the slope is zero.
Se = SEE = √(SSE / (n − 2))
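Plugging the ANOVA numbers from the Excel output into this formula reproduces the reported Standard Error. A quick check in Python:

```python
import math

# Values from the Excel ANOVA table
sse = 2147.123648   # Residual (error) sum of squares
n = 17              # observations

# Standard error of the estimate: root of SSE over its degrees of freedom
see = math.sqrt(sse / (n - 2))
print(round(see, 4))  # 11.9642, matching the reported Standard Error
```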
If we wish to test for positive or negative linear relationships we conduct one-tail tests, i.e. our research hypothesis becomes H1: β1 < 0 (testing for a negative slope) or H1: β1 > 0 (testing for a positive slope).
t = b1 / s_b1
The regression output (Excel)

Is the slope different from 1?
The hypotheses:
H0: β1 = 1
HA: β1 ≠ 1
t = (0.000528163 − 1) / 0.000136818 ≈ −7305
Significant model
We need to see that at least one of our independent variables has a significant effect. Note: we only have b1, so this test should give us the same results as the previous t-test (and we'll see that it does).
Coefficient of Determination

Symmetry in Testing
As we did with analysis of variance, we can partition the variation in y into two parts:
SSE (Sum of Squares Error) measures the amount of variation in y that remains unexplained (i.e. due to error)
SSR (Sum of Squares Regression) measures the amount of variation in y explained by variation in the independent variable x
R² = SSR / SST = 2133.111647 / 4280.235294 = 0.4984
We would like to see high values (1 is the highest). Note: for simple regression, R-squared is the square of the correlation coefficient (r): R² = (r)².
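Both numbers in the Excel output can be reproduced from the ANOVA sums of squares. A quick check in Python:

```python
import math

# Sums of squares from the ANOVA table
ssr = 2133.111647   # Regression SS
sst = 4280.235294   # Total SS

r2 = ssr / sst       # coefficient of determination
r = math.sqrt(r2)    # for simple regression, |r| = sqrt(R²) = Multiple R

print(round(r2, 4), round(r, 4))  # 0.4984 and 0.7059, matching the output
```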
Suppose you wanted to know how many weeks it would take to sell a house priced at $100,000. The regression equation was: ŷ = -16.2251 + 0.00053x
Substitute x = 100,000: ŷ = -16.2251 + 0.00053·(100,000) = 36.7749 weeks. Important side note: pay attention to the units of measurement in the data.
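The same substitution in Python, using both the full-precision coefficients from the output and the rounded equation from the slide (the small difference is purely rounding of b1):

```python
# Coefficients from the regression output (full precision)
b0 = -16.22506178
b1 = 0.000528163

x = 100_000
y_hat = b0 + b1 * x                       # prediction with full-precision coefficients
y_hat_rounded = -16.2251 + 0.00053 * x    # using the rounded slide equation

print(y_hat, y_hat_rounded)  # about 36.59 vs. 36.7749 weeks
```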
Prediction Interval

Scatter plot
Prediction interval for an individual value of y (textbook form):

ŷ ± t(α/2, n−2) · Se · √(1 + 1/n + (x_g − x̄)² / Σ(xi − x̄)²)

(or, in the derived computational form, with Σ(xi − x̄)² written as Σxi² − n·x̄²)

[Scatter plot of the data with the fitted line]
[Plot: prediction interval bands, number of weeks vs. price; at x_p = 100,000, ŷ = 36.59]
Suppose I own several properties in Detroit and price them all at $100,000. What is the expected number of weeks for selling these homes? Instead of predicting an individual value, I am asking for an expected value (i.e. the mean number of weeks).

We can use a confidence interval for the estimation of the mean. The distinction between confidence interval and prediction interval is similar to the difference between the CI of the mean vs. the CI of an individual value.
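Both intervals can be computed with one helper, since they differ only by the extra "1 +" under the square root. This is a sketch, not textbook notation: the caller supplies the critical t value (e.g. from a t-table), and the data, names, and coefficients below are hypothetical:

```python
import math

def interval(x_g, x, y, b0, b1, t_crit, prediction=True):
    """Prediction interval (individual y) or, with prediction=False,
    confidence interval (expected y) at x = x_g.
    The only difference is the extra '1 +' under the square root."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(e * e for e in resid) / (n - 2))  # standard error of estimate
    extra = 1.0 if prediction else 0.0
    half = t_crit * se * math.sqrt(extra + 1 / n + (x_g - xbar) ** 2 / sxx)
    y_hat = b0 + b1 * x_g
    return y_hat - half, y_hat + half

# Toy data (hypothetical); b0 = 1.8, b1 = 0.8 is the least squares fit here
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
pi = interval(3, x, y, 1.8, 0.8, 3.182, prediction=True)   # 3.182 = t(0.025, 3 df)
ci = interval(3, x, y, 1.8, 0.8, 3.182, prediction=False)
print(pi, ci)  # the prediction interval is wider than the confidence interval
```

Both intervals are centered on ŷ; the prediction interval is always the wider one because it must also cover the scatter of an individual observation around the mean.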
Confidence interval for the expected value of y:

ŷ ± t(α/2, n−2) · Se · √(1/n + (x_g − x̄)² / Σ(xi − x̄)²)
Note: point, prediction and confidence intervals in Excel are obtained via Add-Ins > Data Analysis Plus > Prediction Interval.
Both intervals are curved, becoming narrower around the average value of x (x̄). The closer x_g is to x̄, the better our estimate and thus the narrower the interval.
Recall that the deviations between the actual data points and the regression line are called residuals. Excel calculates residuals as part of its regression analysis:
Price    Weeks   Predicted   Residual   Standard Residual
60000    18      15.4647      2.5353     0.2301
87000    25      29.7251     -4.7251    -0.4071
94000    62      33.4223     28.5777     2.4715
76000    33      23.9153      9.0847     0.7889
(excerpt of the residual output)

We can use these residuals to determine whether the error variable is nonnormal, whether the error variance is constant, and whether the errors are independent.
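The predicted values and residuals in the excerpt can be reproduced from the fitted coefficients reported in the Excel output. A Python sketch:

```python
# Fitted coefficients from the Harris Corporation regression output
b0 = -16.22506178
b1 = 0.000528163

# (price, weeks) observations from the residual excerpt
data = [(60000, 18), (87000, 25), (94000, 62), (76000, 33)]

rows = []
for price, weeks in data:
    predicted = b0 + b1 * price      # point on the regression line
    residual = weeks - predicted     # actual minus predicted
    rows.append((price, round(predicted, 4), round(residual, 4)))

for row in rows:
    print(row)
```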
Nonnormality

We can take the residuals and put them into a histogram:
There are three conditions that are required in order to perform a regression analysis:
The error variable must be normally distributed
The error variable must have a constant variance
The errors must be independent of each other
How can we diagnose violations of these conditions? Residual analysis: that is, examine the differences between the actual data points and those predicted by the linear equation.
We're looking for a bell-shaped histogram with the mean close to zero.
Nonindependence of the Error Variable (for time series data; not covered in this course)

Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
If we were to observe the number of weeks houses stay on the market over many weeks, for, say, a year, that would constitute a time series.
When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
If the variance of the error variable (σε²) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted values of y:
There doesn't appear to be a change in the spread of the plotted points; therefore, there is no heteroscedasticity.
Outliers

Outliers: our example
An outlier is an observation that is unusually small or unusually large. E.g. in our houses example the prices range from $53,000 to $133,000. Suppose we had a value of $1,000,000: this point is an outlier.
Possible reasons for the existence of outliers include:
1. There was an error in recording the value
2. The point should not have been included in the sample
3. Perhaps the observation is indeed valid
Outliers can be easily identified from a scatter plot. If the absolute value of the standard residual is > 2, we suspect the point may be an outlier and investigate further. Outliers need to be dealt with, since they can easily influence the least squares line.
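The "|standard residual| > 2" screening rule is easy to automate. A sketch, using hypothetical values that loosely echo the standard residuals from the residual output (signs inferred):

```python
# Hypothetical standard residuals, e.g. from Excel's residual output
std_residuals = [0.23, -0.41, 2.47, 0.79, -1.75]

# Flag observations whose standard residual exceeds 2 in absolute value
suspects = [i for i, sr in enumerate(std_residuals) if abs(sr) > 2]
print(suspects)  # [2]: that observation warrants further investigation
```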
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions (normality, homoscedasticity, independence).
6. Assess the model's fit (t-test for the slope, the overall F-ratio, R²).
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable (confidence/prediction intervals).