Contents
Basic terms and concepts
Simple regression
Multiple Regression
Regression terminology
Regression formulas
Basic terms and concepts
1. A scatter plot is a graphical representation of the relation between two or more variables. In the scatter plot of two variables x and y, each point on the plot is an x-y pair.
2. We use regression and correlation to describe the variation in one or more variables.
A. The variation of a variable is the sum of its squared deviations from its mean:
$$\text{Variation} = \sum_{i=1}^{N}(x_i-\bar{x})^2$$
B. The variance is the variation divided by the sample size less one:
$$\text{Variance} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}$$
C. Example 1: [scatter plot omitted — sales price (y-axis, $0 to $800,000) plotted against square footage (x-axis, 0 to 3,000), one point per observation]
3.
A. The covariance captures how one variable is different from its mean as the other variable is different from its mean.
B. A positive covariance indicates that the variables tend to move together; a negative covariance indicates that the variables tend to move in opposite directions.
C. The covariance is calculated as the ratio of the covariation to the sample size less one:
$$\text{Covariance} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{N-1}$$
where N is the sample size, x_i is the ith observation on variable x, x̄ is the mean of the variable x observations, y_i is the ith observation on variable y, and ȳ is the mean of the variable y observations.
D. The actual value of the covariance is not meaningful because it is affected by the scale of the two variables. That is why we calculate the correlation coefficient: to make something interpretable from the covariance information.
E. The correlation coefficient, r, is a measure of the strength of the relationship between or among variables.
[Note: Correlation does not imply causation. We may say that two variables X and Y are correlated, but that does not mean that X causes Y or that Y causes X; they simply are related or associated with one another.]
Calculation:
$$r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})/(N-1)}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)}\;\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2/(N-1)}}$$
Example 2
Observation    x     y     x − x̄   (x − x̄)²   y − ȳ    (y − ȳ)²   (x − x̄)(y − ȳ)
1              12    50    −1.50     2.25      8.40      70.56       −12.60
2              13    54    −0.50     0.25     12.40     153.76        −6.20
3              10    48    −3.50    12.25      6.40      40.96       −22.40
4               9    47    −4.50    20.25      5.40      29.16       −24.30
5              20    70     6.50    42.25     28.40     806.56       184.60
6               7    20    −6.50    42.25    −21.60     466.56       140.40
7               4    15    −9.50    90.25    −26.60     707.56       252.70
8              22    40     8.50    72.25     −1.60       2.56       −13.60
9              15    35     1.50     2.25     −6.60      43.56        −9.90
10             23    37     9.50    90.25     −4.60      21.16       −43.70
Sum           135   416     0.00   374.50      0.00    2,342.40      445.00
Calculations:
x̄ = 135 / 10 = 13.5
ȳ = 416 / 10 = 41.6
s²_x = 374.5 / 9 = 41.611
s²_y = 2,342.4 / 9 = 260.267
$$r = \frac{445/9}{\sqrt{41.611}\,\sqrt{260.267}} = \frac{49.444}{(6.451)(16.133)} = 0.475$$
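These calculations are mechanical enough to verify by machine. Here is a minimal sketch (Python with numpy assumed available; the variable names are ours, not the notes') that reproduces the Example 2 figures:

```python
import numpy as np

x = np.array([12, 13, 10, 9, 20, 7, 4, 22, 15, 23], dtype=float)
y = np.array([50, 54, 48, 47, 70, 20, 15, 40, 35, 37], dtype=float)
N = len(x)

x_bar, y_bar = x.mean(), y.mean()                     # 13.5 and 41.6
var_x = ((x - x_bar) ** 2).sum() / (N - 1)            # 374.5 / 9 = 41.611
var_y = ((y - y_bar) ** 2).sum() / (N - 1)            # 2,342.4 / 9 = 260.267
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (N - 1)  # 445 / 9 = 49.444

r = cov_xy / (var_x ** 0.5 * var_y ** 0.5)
print(round(r, 3))  # 0.475
```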
i. The sign of r indicates the direction of the relationship:
+1 ≥ r > 0: positive relationship (r = +1 is a perfect positive relationship)
r = 0: no relationship
0 > r ≥ −1: negative relationship (r = −1 is a perfect negative relationship)
ii. You can judge the direction of the correlation by looking at the scatter plot: if the relation is upward, there is positive correlation; if the relation is downward, there is negative correlation. [Two illustrative scatter plots omitted.]
iii. The correlation coefficient is bounded by −1 and +1. The closer the coefficient is to −1 or +1, the stronger the correlation.
iv. With the exception of the extremes (that is, r = 1.0 or r = −1.0), we cannot really talk about the strength of a relationship indicated by the correlation coefficient without a statistical test of significance.
v. To test whether the correlation is statistically significant, the hypotheses are:
H0: ρ = 0
Ha: ρ ≠ 0
vi. The test statistic is t-distributed with N − 2 degrees of freedom:
$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$

Example 2, continued
In the previous example, r = 0.475 and N = 10:
$$t = \frac{0.475\sqrt{8}}{\sqrt{1-0.475^2}} = \frac{1.3435}{0.88} = 1.5267$$
We lose two degrees of freedom because we use the mean of each of the two variables in performing this test.
Problem
Suppose the correlation coefficient is 0.2 and the number of observations is 32. What is the calculated test statistic? Is this significant correlation using a 5% level of significance?

Solution
Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0
Calculated t-statistic:
$$t = \frac{0.2\sqrt{32-2}}{\sqrt{1-0.04}} = \frac{0.2\sqrt{30}}{\sqrt{0.96}} = 1.11803$$
The critical t-value for a 5% level of significance and 30 degrees of freedom is 2.042. Because 1.118 < 2.042, we fail to reject the null hypothesis: the correlation is not significantly different from zero.
Problem
Suppose the correlation coefficient is 0.80 and the number of observations is 62. What is the calculated test statistic? Is this significant correlation using a 1% level of significance?

Solution
Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0
Calculated t-statistic:
$$t = \frac{0.80\sqrt{62-2}}{\sqrt{1-0.64}} = \frac{0.80\sqrt{60}}{\sqrt{0.36}} = \frac{6.19677}{0.6} = 10.32796$$
The critical t-value for a 1% level of significance and 60 degrees of freedom (N − 2) is 2.660. Therefore, we reject the null hypothesis and conclude that the correlation is significantly different from zero.
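Both problems above follow the same recipe, so they can be checked with a short sketch (Python; scipy is assumed to be available for the two-tailed critical values):

```python
import math
from scipy.stats import t as t_dist  # scipy assumed available

def corr_t_stat(r: float, n: int) -> float:
    """t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

for r, n, alpha in [(0.2, 32, 0.05), (0.80, 62, 0.01)]:
    t_calc = corr_t_stat(r, n)
    t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2)  # two-tailed critical value
    print(f"t = {t_calc:.5f}, critical = {t_crit:.3f}, "
          f"significant: {abs(t_calc) > t_crit}")
# t = 1.11803, critical = 2.042, significant: False
# t = 10.32796, critical = 2.660, significant: True
```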
F. An outlier is an extreme value of a variable. The outlier may be quite large or small (where large and small are defined relative to the rest of the sample).
i. The researcher must exercise judgment (and caution) when deciding whether to include or exclude an observation.
G. The correlation coefficient does not indicate a causal relationship. Certain data items may be highly correlated, but not necessarily as a result of a causal relationship.
Simple regression
1.
Regression is the analysis of the relation between one variable and some other variable(s),
assuming a linear relation. Also referred to as least squares regression and ordinary least
squares (OLS).
A. The purpose is to explain the variation in a variable (that is, how a variable differs from its mean value) using the variation in one or more other variables.
B. Suppose we want to describe, explain, or predict why a variable differs from its mean. Let the ith observation on this variable be represented as Y_i, and let N indicate the number of observations. The total variation to be explained is:
$$\sum_{i=1}^{N}(y_i-\bar{y})^2 = \text{SS}_{\text{Total}}$$
The least squares principle is that the regression line is determined by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y. [Figure omitted: a regression line fitted through a cloud of X-Y points.] A line is fit through the XY points such that the sum of the squared residuals (that is, the sum of the squared vertical distances between the observations and the line) is minimized.
2.
A. The dependent variable is the variable whose variation is being explained by the other variable(s). Also referred to as the explained variable, the endogenous variable, or the predicted variable.
B.
The independent variable is the variable whose variation is used to explain that of the
dependent variable. Also referred to as the explanatory variable, the exogenous
variable, or the predicting variable.
C. The parameters in a simple regression equation are the slope (b_1) and the intercept (b_0):
$$y_i = b_0 + b_1 x_i + \varepsilon_i$$
where y_i is the ith observation on the dependent variable, x_i is the ith observation on the independent variable, b_0 is the intercept, b_1 is the slope, and ε_i is the disturbance or error term for the ith observation.
D. The slope, b_1, is the change in Y for a given one-unit change in X. The slope can be positive, negative, or zero, and is calculated as:
$$\hat{b}_1 = \frac{\text{cov}(X,Y)}{\text{var}(X)} = \frac{\sum_{i=1}^{N}(y_i-\bar{y})(x_i-\bar{x})/(N-1)}{\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)}$$
An equivalent computational form is:
$$\hat{b}_1 = \frac{\sum_{i=1}^{N}x_i y_i - \left(\sum_{i=1}^{N}x_i\right)\left(\sum_{i=1}^{N}y_i\right)/N}{\sum_{i=1}^{N}x_i^2 - \left(\sum_{i=1}^{N}x_i\right)^2/N}$$
Suppose that:
$$\sum_{i=1}^{N}(y_i-\bar{y})(x_i-\bar{x}) = 1{,}000,\qquad \sum_{i=1}^{N}(x_i-\bar{x})^2 = 450,\qquad N = 30$$
Then:
$$\hat{b}_1 = \frac{1{,}000/29}{450/29} = \frac{34.48276}{15.51724} = 2.2222$$
E. The intercept, b_0, is the line's intersection with the Y-axis at X = 0. The intercept can be positive, negative, or zero, and is calculated as:
$$\hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}$$
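As a check, the sketch below (Python with numpy) applies these formulas to the Example 2 data; the (N − 1) terms cancel in the ratio, so they are omitted. The resulting slope and intercept generate the fitted values ŷ used in the continuation of Example 2 below.

```python
import numpy as np

# Slope and intercept for the Example 2 data.
x = np.array([12, 13, 10, 9, 20, 7, 4, 22, 15, 23], dtype=float)
y = np.array([50, 54, 48, 47, 70, 20, 15, 40, 35, 37], dtype=float)

b1 = ((y - y.mean()) * (x - x.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(round(b1, 5), round(b0, 4))  # 1.18825 25.5586
```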
3. The model is based on the following assumptions:
A. A linear relation exists between the dependent and the independent variable.
B. The independent variable is uncorrelated with the residuals.
[Example 1, continued: figure omitted — the regression line fitted to the sales price and square footage data, extended through X = 0 to show the intercept.]
C. The expected value of the disturbance term is zero; that is, E(ε_i) = 0.
D. There is a constant variance of the disturbance term; that is, the disturbance or residual terms are all drawn from a distribution with an identical variance. In other words, the disturbance terms are homoskedastic. [A violation of this is referred to as heteroskedasticity.]
E.
The residuals are independently distributed; that is, the residual or disturbance for one
observation is not correlated with that of another observation. [A violation of this is
referred to as autocorrelation.]
F.
The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.
4. The standard error of the estimate, SEE (also referred to as the standard error of the residual or standard error of the regression, and often indicated as s_e), is the standard deviation of predicted dependent variable values about the estimated regression line. With ŷ_i = b̂_0 + b̂_1 x_i, it is calculated as:
$$\text{SEE} = \sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N-2}} = \sqrt{\frac{\text{SS}_{\text{Residual}}}{N-2}}, \qquad s_e^2 = \frac{\sum_{i=1}^{N}\hat{\varepsilon}_i^2}{N-2}$$
Example 2, continued
Consider the following observations on X and Y:

Observation    X     Y      ŷ       y − ŷ     (y − ŷ)²
1              12    50    39.82    10.18     103.68
2              13    54    41.01    12.99     168.85
3              10    48    37.44    10.56     111.49
4               9    47    36.25    10.75     115.50
5              20    70    49.32    20.68     427.51
6               7    20    33.88   −13.88     192.55
7               4    15    30.31   −15.31     234.45
8              22    40    51.70   −11.70     136.89
9              15    35    43.38    −8.38      70.26
10             23    37    52.89   −15.89     252.44
Sum           135   416              0.00   1,813.63

Therefore:
SS_Residual = 1,813.63
s_e² = 1,813.63 / 8 = 226.70
SEE = √226.70 = 15.06
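A minimal sketch of the same computation (Python with numpy; it refits the line rather than reusing the rounded ŷ values, so the residual sum of squares matches to rounding):

```python
import numpy as np

x = np.array([12, 13, 10, 9, 20, 7, 4, 22, 15, 23], dtype=float)
y = np.array([50, 54, 48, 47, 70, 20, 15, 40, 35, 37], dtype=float)

b1 = ((y - y.mean()) * (x - x.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                       # fitted values
ss_residual = ((y - y_hat) ** 2).sum()    # about 1,813.63
see = (ss_residual / (len(y) - 2)) ** 0.5
print(round(see, 2))  # 15.06
```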
5. The standard error of the estimate helps us gauge the "fit" of the regression line; that is, how well we have described the variation in the dependent variable.
i. The standard error of the estimate is a measure of how close the estimated values (using the estimated regression), the ŷ's, are to the actual values, the Y's.
ii. The ε̂_i's (a.k.a. the disturbance terms; a.k.a. the residuals) are the vertical distances between the observed values of Y and those predicted by the equation, the ŷ's.
iii. The ε̂_i's are in the same terms (unit of measure) as the Y's (e.g., dollars, pounds, billions).
6. The coefficient of determination, R², is the percentage of variation in the dependent variable (variation of the Y_i's, or the sum of squares total, SST) explained by the independent variable(s).
A. It is calculated as:
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}}$$
An R² of 0.49 indicates that the independent variables explain 49% of the variation in the dependent variable.
B.
Example 2, continued
Continuing the previous regression example, we can calculate the R2:
Observation    x     y      ŷ      (y − ȳ)²    y − ŷ    (y − ŷ)²   (ŷ − ȳ)²
1              12    50    39.82     70.56     10.18     103.68       3.18
2              13    54    41.01    153.76     12.99     168.85       0.35
3              10    48    37.44     40.96     10.56     111.49      17.30
4               9    47    36.25     29.16     10.75     115.50      28.59
5              20    70    49.32    806.56     20.68     427.51      59.65
6               7    20    33.88    466.56    −13.88     192.55      59.65
7               4    15    30.31    707.56    −15.31     234.45     127.43
8              22    40    51.70      2.56    −11.70     136.89     102.01
9              15    35    43.38     43.56     −8.38      70.26       3.18
10             23    37    52.89     21.16    −15.89     252.44     127.43
Total         135   416            2,342.40     0.00    1,813.63     528.77

$$R^2 = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}} = \frac{528.77}{2{,}342.40} = 0.2257$$
or
$$R^2 = 1 - \frac{\text{SS}_{\text{Residual}}}{\text{SS}_{\text{Total}}} = 1 - \frac{1{,}813.63}{2{,}342.40} = 0.2257$$
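The same decomposition in code (Python with numpy); note that SS_Regression / SS_Total equals r² from the correlation calculation earlier:

```python
import numpy as np

x = np.array([12, 13, 10, 9, 20, 7, 4, 22, 15, 23], dtype=float)
y = np.array([50, 54, 48, 47, 70, 20, 15, 40, 35, 37], dtype=float)

b1 = ((y - y.mean()) * (x - x.mean())).sum() / ((x - x.mean()) ** 2).sum()
y_hat = y.mean() + b1 * (x - x.mean())            # fitted values

ss_total = ((y - y.mean()) ** 2).sum()            # 2,342.40
ss_regression = ((y_hat - y.mean()) ** 2).sum()   # about 528.77
print(round(ss_regression / ss_total, 4))  # 0.2257, which equals r squared
```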
7. A confidence interval is the range of regression coefficient values for a given value estimate of the coefficient and a given level of probability.
A. The confidence interval for a regression coefficient b_1 is calculated as:
$$\hat{b}_1 \pm t_c s_{\hat{b}_1}, \quad \text{or} \quad \hat{b}_1 - t_c s_{\hat{b}_1} < b_1 < \hat{b}_1 + t_c s_{\hat{b}_1}$$
where t_c is the critical t-value for the selected confidence level. If there are 30 degrees of freedom and a 95% confidence level, t_c is 2.042 [taken from a t-table].
B. The interpretation of the confidence interval is that this is an interval that we believe will include the true parameter (b_1 in the case above) with the specified level of confidence.
8. As the standard error of the estimate (the variability of the data about the regression line) rises, the confidence interval widens. In other words, the more variable the data, the less confident you will be when you're using the regression model to estimate the coefficient.
9. The standard error of the coefficient is the square root of the ratio of the variance of the regression to the variation in the independent variable:
$$s_{\hat{b}_1} = \sqrt{\frac{s_e^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
A. To test a hypothesis about the slope coefficient (that is, to see whether the estimated slope is equal to a hypothesized value, b_1), we calculate a t-distributed statistic with N − 2 degrees of freedom:
$$t_b = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$$
B. To test whether an independent variable explains the variation in the dependent variable, the hypothesis that is tested is whether the slope is zero:
H0: b_1 = 0
versus the alternative (what you conclude if you reject the null, H0):
Ha: b_1 ≠ 0
C. There are hypotheses in economics that refer to the sign of the relation between the dependent and the independent variables. In this case, the alternative is directional (> or <) and the t-test is one-sided (uses only one tail of the t-distribution). In the case of a one-sided alternative, there is only one critical t-value.
Example 3: Testing the significance of a slope coefficient
Suppose the estimated slope coefficient is 0.78, the sample size is 26, the standard error of the coefficient is 0.32, and the level of significance is 5%. Is the slope different from zero?
The calculated test statistic is:
$$t_b = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}} = \frac{0.78 - 0}{0.32} = 2.4375$$
The critical t-values for N − 2 = 24 degrees of freedom and a 5% level of significance are ±2.064:

    Reject H0    |    Fail to reject H0    |    Reject H0
             −2.064                     +2.064

Because 2.4375 > 2.064, we reject the null hypothesis, concluding that the slope is different from zero.
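Example 3 reduces to a few lines of arithmetic; a sketch (Python, with the critical value hard-coded from a t-table):

```python
# Example 3 as code: a two-tailed t-test on the slope. The +/-2.064
# critical value is t(0.025, 24) from a t-table.
b1_hat, b1_null, s_b1 = 0.78, 0.0, 0.32

t_calc = (b1_hat - b1_null) / s_b1  # 2.4375
t_crit = 2.064                      # 5% two-tailed, N - 2 = 24 df
print(abs(t_calc) > t_crit)         # True -> reject H0
```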
10. Interpretation of coefficients.
A. The estimated intercept is interpreted as the value of the dependent variable (the Y) if the independent variable (the X) takes on a value of zero.
B. The estimated slope coefficient is interpreted as the change in the dependent variable for a given one-unit change in the independent variable.
11. Forecasting using regression involves making predictions about the dependent variable based on average relationships observed in the estimated regression.
A. Forecasted values of the dependent variable are based on the estimated regression coefficients and a prediction about the values of the independent variables.
B. Example 4
Suppose you estimate a regression model with the following estimates:
ŷ = 1.50 + 2.5 X1
In addition, you have a forecasted value for the independent variable, X1 = 20. The forecasted value for y is 51.5:
$$\hat{y} = \hat{b}_0 + \hat{b}_1 x_p = 1.50 + 2.5(20) = 51.5$$
where x_p is the forecasted value of the independent variable.
12. An analysis of variance (ANOVA) table is a summary of the explanation of the variation in the dependent variable. The basic form of the ANOVA table is as follows:

Source of variation      Degrees of freedom   Sum of squares    Mean square
Regression (explained)   1                    SS_Regression     SS_Regression / 1
Error (unexplained)      N − 2                SS_Residual       SS_Residual / (N − 2)
Total                    N − 1                SS_Total
Example 5
Source of variation      Degrees of freedom   Sum of squares    Mean square
Regression (explained)   1                    5,050             5,050
Error (unexplained)      28                   600               21.429
Total                    29                   5,650

$$R^2 = \frac{5{,}050}{5{,}650} = 0.8938 \text{ or } 89.38\%$$
$$\text{SEE} = \sqrt{\frac{600}{28}} = \sqrt{21.429} = 4.629$$
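A sketch that reproduces the Example 5 quantities from the two sums of squares (Python; the F-statistic is not shown in the notes' table but follows directly from the mean squares):

```python
# Example 5 quantities recovered from the two sums of squares.
ss_regression, ss_residual, n, k = 5050.0, 600.0, 30, 1

ss_total = ss_regression + ss_residual  # 5,650
r_squared = ss_regression / ss_total    # 0.8938
mse = ss_residual / (n - k - 1)         # 600 / 28 = 21.429
see = mse ** 0.5                        # 4.629
f_stat = (ss_regression / k) / mse      # 235.67
print(round(r_squared, 4), round(see, 3), round(f_stat, 2))
```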
Multiple Regression
1.
Multiple regression is regression analysis with more than one independent variable.
3. The slope coefficient is the parameter that reflects the change in the dependent variable for a one-unit change in the independent variable, holding the other independent variables constant.
4. Regression model:
$$Y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \varepsilon_i$$
where b_j is the jth slope coefficient and x_{ji} is the ith observation on the jth independent variable.
5. Hypothesis tests on the individual slope coefficients take the same form as in simple regression.
A. The degrees of freedom for the test of a slope coefficient are N − k − 1, where N is the number of observations in the sample and k is the number of independent variables.
B. In multiple regression, the independent variables may be correlated with one another, resulting in less reliable estimates. This problem is referred to as multicollinearity.
6. The confidence interval for a slope coefficient b_i is:
$$\hat{b}_i \pm t_c s_{\hat{b}_i}, \quad \text{or} \quad \hat{b}_i - t_c s_{\hat{b}_i} < b_i < \hat{b}_i + t_c s_{\hat{b}_i}$$
A. This is the same interval used in simple regression for the interval of a slope coefficient.
B. If this interval contains zero, we conclude that the slope is not statistically different from zero.
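A multiple regression can be estimated by ordinary least squares on a design matrix that includes a column of ones for the intercept. The sketch below (Python with numpy) uses made-up data, since the notes do not provide a multiple-regression data set:

```python
import numpy as np

# Made-up illustration data; the true coefficients are
# b0 = 2.0, b1 = 1.5, b2 = -0.8.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=50)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # approximately [2.0, 1.5, -0.8]
```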
7. The assumptions of the multiple regression model are as follows:
A. A linear relation exists between the dependent variable and the independent variables.
B. The independent variables are uncorrelated with the residuals; that is, the independent variables are not random. In addition, there is no exact linear relation between two or more independent variables. [Note: this is modified slightly from the assumptions of the simple regression model.]
C. The expected value of the disturbance term is zero; that is, E(ε_i) = 0.
D. There is a constant variance of the disturbance term; that is, the disturbance or residual terms are all drawn from a distribution with an identical variance. In other words, the disturbance terms are homoskedastic. [A violation of this is referred to as heteroskedasticity.]
E.
The residuals are independently distributed; that is, the residual or disturbance for one
observation is not correlated with that of another observation. [A violation of this is
referred to as autocorrelation.]
F.
The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.
G. The residual (a.k.a. disturbance term, a.k.a. error term) is what is not explained by the independent variables. In a regression with two independent variables, the residual for the ith observation is:
$$\hat{\varepsilon}_i = Y_i - (\hat{b}_0 + \hat{b}_1 x_{1i} + \hat{b}_2 x_{2i})$$
8. The standard error of the estimate (SEE) is the standard error of the residual:
$$s_e = \text{SEE} = \sqrt{\frac{\sum_{t=1}^{N}\hat{\varepsilon}_t^2}{N-k-1}} = \sqrt{\frac{\text{SSE}}{N-k-1}}$$
9. The degrees of freedom used in this calculation are:
df = N − k − 1 = N − (k + 1) = number of observations − number of estimated parameters
A. The degrees of freedom are the number of independent pieces of information that are used to estimate the regression parameters. In calculating the regression parameters, we use the following pieces of information:
The mean of the dependent variable.
The mean of each of the independent variables.
B. Therefore:
if the regression is a simple regression, we use two degrees of freedom in estimating the regression line;
if the regression is a multiple regression with four independent variables, we use five degrees of freedom in estimating the regression line.
Example 6: Using information from an analysis of variance
Suppose SS_Residual = 789, N = 65, and k = 5. What is the standard error of the estimate?
Solution
Given: SS_Residual = 789, N = 65, k = 5
$$\text{SEE} = \sqrt{\frac{789}{65-5-1}} = \sqrt{\frac{789}{59}} = \sqrt{13.373} = 3.657$$
10. The predicted value of the dependent variable is:
$$\hat{y} = \hat{b}_0 + \hat{b}_1\hat{x}_1 + \hat{b}_2\hat{x}_2$$
where ŷ is the predicted value of the dependent variable, b̂_i is the estimated parameter, and x̂_i is the predicted value of the independent variable.

Example
Suppose you estimate a regression model with the following estimates:
ŷ = 1.50 + 2.5 X1 − 0.2 X2 + 1.25 X3
In addition, you have forecasted values for the independent variables: X1 = 20, X2 = 120, and X3 = 50.

Solution
The forecasted value for Y is 90:
$$\hat{Y} = 1.50 + 2.5(20) - 0.2(120) + 1.25(50) = 1.50 + 50 - 24 + 62.50 = 90$$
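The same forecast as a few lines of code (Python; the coefficient and X values are taken from the example above):

```python
# The Solution above as code.
b0, b1, b2, b3 = 1.50, 2.5, -0.2, 1.25
x1, x2, x3 = 20.0, 120.0, 50.0

y_hat = b0 + b1 * x1 + b2 * x2 + b3 * x3
print(y_hat)  # 1.5 + 50 - 24 + 62.5 = 90.0
```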
11. The F-statistic is a measure of how well a set of independent variables, as a group, explains the variation in the dependent variable.
A. The F-statistic is calculated as:
$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SS}_{\text{Regression}}/k}{\text{SS}_{\text{Residual}}/(N-k-1)} = \frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2/k}{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2/(N-k-1)}$$
B. The F-statistic can be formulated to test all independent variables as a group (the most common application). For example, if there are four independent variables in the model, the hypotheses are:
H0: b_1 = b_2 = b_3 = b_4 = 0
Ha: at least one b_i ≠ 0
C. The F-statistic can be formulated to test subsets of independent variables (to see whether they have incremental explanatory power). For example, if there are four independent variables in the model, a subset could be examined:
H0: b_1 = b_4 = 0
Ha: b_1 or b_4 ≠ 0
12. The coefficient of determination, R², is the percentage of variation in the dependent variable explained by the independent variables:
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\text{Total variation} - \text{Unexplained variation}}{\text{Total variation}} = \frac{\sum_{i=1}^{N}(y_i-\bar{y})^2 - \sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}$$
A. By construction, 0 ≤ R² ≤ 1.
B. The adjusted R² adjusts for the number of independent variables in the model:
$$\bar{R}^2 = 1 - \left(\frac{N-1}{N-k-1}\right)(1-R^2)$$
i. Adding independent variables to the model will increase R². Adding independent variables to the model may increase or decrease the adjusted R². (Note: the adjusted R² can even be negative.)
ii. The adjusted R² does not have the clean explanation of explanatory power that the R² has.
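A small helper makes the adjustment concrete (Python; the illustration plugs in the simple-regression Example 2 values R² = 0.2257, N = 10, k = 1):

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - [(N - 1) / (N - k - 1)] * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared)

print(round(adjusted_r_squared(0.2257, 10, 1), 4))  # 0.1289
```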
13. The purpose of the analysis of variance (ANOVA) table is to attribute the total variation of the dependent variable to the regression model (the regression source in the table) and the residuals (the error source).
A. SS_Total is the total variation of Y about its mean or average value (a.k.a. total sum of squares) and is computed as:
$$\text{SS}_{\text{Total}} = \sum_{i=1}^{N}(y_i-\bar{y})^2$$
B. SS_Residual (a.k.a. SSE) is the variability that is unexplained by the regression and is computed as:
$$\text{SS}_{\text{Residual}} = \text{SSE} = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{N}\hat{\varepsilon}_i^2$$
C. SS_Regression (a.k.a. SS_Explained) is the variability that is explained by the regression equation, computed as SS_Total − SS_Residual, or directly as:
$$\text{SS}_{\text{Regression}} = \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2$$
D. MSE is the mean square error: MSE = SS_Residual / (N − k − 1), where k is the number of independent variables in the regression.
E. The general format of the ANOVA table is:

Source       df (Degrees of Freedom)   SS (Sum of Squares)   Mean Square (SS/df)
Regression   k                         SS_Regression         MSR
Error        N − k − 1                 SS_Residual           MSE
Total        N − 1                     SS_Total

From the table:
$$R^2 = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\text{SS}_{\text{Residual}}}{\text{SS}_{\text{Total}}} \qquad F = \frac{\text{MSR}}{\text{MSE}}$$
14. Dummy variables are qualitative variables that take on a value of zero or one.
A. The data are assigned a value of "0" or "1". In many cases, you apply the dummy variable concept to quantify the impact of a qualitative variable. A dummy variable is a dichotomous variable; that is, it takes on a value of one or zero.
B. Use one dummy variable fewer than the number of classes (e.g., if you have three classes, use two dummy variables); otherwise you fall into the dummy variable "trap" (perfect multicollinearity, violating the assumption that there is no exact linear relation between the independent variables), as the sketch below illustrates.
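A sketch of the coding rule (Python; the quarter labels are made-up illustration data, not from the notes): with four classes, three dummies are created and the omitted class becomes the base case.

```python
# One dummy fewer than the number of classes avoids the dummy-variable "trap".
quarters = ["Q1", "Q2", "Q3", "Q4"]

def dummies(q: str) -> list[int]:
    """Dummies for Q2..Q4; Q1 is the omitted base class."""
    return [1 if q == level else 0 for level in quarters[1:]]

for q in quarters:
    print(q, dummies(q))
# Q1 -> [0, 0, 0] (the base class); Q2 -> [1, 0, 0]; and so on.
```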
15. Heteroskedasticity is the situation in which the variance of the residuals is not constant across all observations.
A.
An assumption of the regression methodology is that the sample is drawn from the same
population, and that the variance of residuals is constant across observations; in other
words, the residuals are homoskedastic.
B. Heteroskedasticity is a problem because the estimators do not have the smallest possible variance, and therefore the standard errors of the coefficients would not be correct.
16. Autocorrelation is the situation in which the residual terms are correlated with one another. This occurs frequently in time-series analysis.
A. Autocorrelation usually appears in time-series data. If last year's earnings were high, this means that this year's earnings may have a greater probability of being high than being low. This is an example of positive autocorrelation. When a good year is always followed by a bad year, this is negative autocorrelation.
B. Autocorrelation is a problem because the estimators do not have the smallest possible variance, and therefore the standard errors of the coefficients would not be correct.
17. Multicollinearity is the situation in which the independent variables are highly correlated with one another.
A. The presence of multicollinearity can cause distortions in the standard errors and may lead to problems with significance testing of individual coefficients.
18. Summary: choosing the right tool.
To test whether an individual slope coefficient differs from zero, use the t-statistic.
To test whether the independent variables as a group explain the variation in the dependent variable, use the F-statistic.
To estimate the change in the dependent variable for a one-unit change in the independent variable, use the slope coefficient.
To estimate the value of the dependent variable when the independent variables are zero, use the intercept.
To measure the percentage of the variation in the dependent variable explained by the independent variables, use the R².
Regression terminology
Analysis of variance
ANOVA
Autocorrelation
Positive correlation
Coefficient of determination
Predicted value
Confidence interval
R2
Correlation coefficient
Regression
Covariance
Residual
Covariation
Scatterplot
Cross-sectional
se
Degrees of freedom
SEE
Dependent variable
Simple regression
Explained variable
Slope
Explanatory variable
Slope coefficient
Forecast
Spurious correlation
F-statistic
SSResidual
Heteroskedasticity
SSRegression
Homoskedasticity
SSTotal
Independent variable
Intercept
Time-series
Multicollinearity
t-statistic
Multiple regression
Variance
Negative correlation
Variation
Regression formulas
Variances
$$\text{Variation} = \sum_{i=1}^{N}(x_i-\bar{x})^2 \qquad \text{Variance} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1} \qquad \text{Covariance} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{N-1}$$

Correlation
$$r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})/(N-1)}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)}\;\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2/(N-1)}} \qquad t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$
Regression
$$y_i = b_0 + b_1 x_i + \varepsilon_i$$
$$\hat{b}_1 = \frac{\text{cov}(X,Y)}{\text{var}(X)} = \frac{\sum_{i=1}^{N}(y_i-\bar{y})(x_i-\bar{x})/(N-1)}{\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)} \qquad \hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}$$
$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i \qquad s_e = \sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N}\hat{\varepsilon}_i^2}{N-2}}$$
$$s_{\hat{b}_1} = \sqrt{\frac{s_e^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \qquad t_b = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}} \qquad \hat{b}_1 - t_c s_{\hat{b}_1} < b_1 < \hat{b}_1 + t_c s_{\hat{b}_1}$$
$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SS}_{\text{Regression}}/k}{\text{SS}_{\text{Residual}}/(N-k-1)} = \frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2/k}{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2/(N-k-1)}$$
Forecasting
$$\hat{y} = \hat{b}_0 + \hat{b}_1 x_{1p} + \hat{b}_2 x_{2p} + \hat{b}_3 x_{3p} + \dots + \hat{b}_K x_{Kp}$$

Analysis of Variance
$$\text{SS}_{\text{Total}} = \sum_{i=1}^{N}(y_i-\bar{y})^2 \qquad \text{SS}_{\text{Residual}} = \text{SSE} = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2 \qquad \text{SS}_{\text{Regression}} = \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2$$
$$R^2 = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\text{SS}_{\text{Residual}}}{\text{SS}_{\text{Total}}} \qquad F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SS}_{\text{Regression}}/k}{\text{SS}_{\text{Residual}}/(N-k-1)}$$