Model m1: lm(AmountSpent ~ Catalogs + Salary + Gender + Married + Children, data = d)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.612e+02  7.638e+01   6.038  2.2e-09 ***
Catalogs      4.780e+01  2.757e+00  17.337  < 2e-16 ***
Salary        2.089e-02  8.238e-04  25.362  < 2e-16 ***
GenderMale    4.364e+01  3.732e+01   1.169    0.243
MarriedSingle 2.723e+01  4.850e+01   0.561    0.575
Children     -2.014e+02  1.723e+01 -11.690  < 2e-16 ***

Residual standard error: 562.6 on 994 degrees of freedom
Multiple R-squared: 0.659, Adjusted R-squared: 0.6573
F-statistic: 384.2 on 5 and 994 DF, p-value: < 2.2e-16

Model m2: lm(AmountSpent ~ Salary + Catalogs + Children, data = d)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.428e+02  5.372e+01   8.242 5.29e-16 ***
Salary      2.041e-02  5.929e-04  34.417  < 2e-16 ***
Catalogs    4.770e+01  2.755e+00  17.310  < 2e-16 ***
Children   -1.987e+02  1.709e+01 -11.628  < 2e-16 ***

Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared: 0.6584, Adjusted R-squared: 0.6574
F-statistic: 640 on 3 and 996 DF, p-value: < 2.2e-16
Question:
Model m2 has a slightly higher adjusted R-squared, but is it a valid model?
Does it meet the requirements (assumptions) of OLS regression?
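A minimal sketch of how the two models could be fit and compared, assuming a data frame d with the columns shown in the output above (the m1 formula is inferred from the coefficient names):

m1 <- lm(AmountSpent ~ Catalogs + Salary + Gender + Married + Children, data = d)
m2 <- lm(AmountSpent ~ Catalogs + Salary + Children, data = d)

summary(m1)$adj.r.squared   # 0.6573
summary(m2)$adj.r.squared   # 0.6574, marginally higher with two fewer predictors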
Assumptions of the Regression Model
1. Independence: Observations independent of each other.
2. Multivariate normality: Residuals (errors) follow a normal distribution.
3. Homoskedasticity: Constant variance of residuals.
4. Linearity: Linear relationship between response and predictor variables.
5. Little or no multicollinearity: Predictors should not be highly correlated.
6. No autocorrelation: Residuals are not correlated with themselves over time (relevant for panel or time-series data).
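Several of these assumptions can be screened visually in one step. A quick sketch, assuming the fitted model m2 from above:

par(mfrow = c(2, 2))   # 2x2 grid for the four standard lm diagnostic plots
plot(m2)               # residuals vs. fitted, normal Q-Q, scale-location,
                       # and residuals vs. leverage
par(mfrow = c(1, 1))   # restore the default plotting layout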
Independence
Observations (of random variables) are independent of each other, i.e., the occurrence of one observation does not affect the probability of another.
Example of violation: The same person was surveyed twice.
How to detect a violation of independence:
Independence cannot be tested statistically (though there are statistical tests for autocorrelation, discussed later).
It requires knowledge of the study design or data-collection process.
What can we do to ensure independence?
Collect your own data (i.e., design and run your own experiments or surveys), or use data from reputable sources that clearly describe their data-collection process, e.g., the Centers for Medicare & Medicaid Services (CMS), the Bureau of Labor Statistics (BLS), the World Bank, …
Residuals
Our actual model:       Y = β₀ + β₁X + ε   (Total variance, SST)
Our estimated model:    Ŷ = b₀ + b₁X       (Explained variance, SSR)
Error term or residual: ε = Y − Ŷ          (Unexplained variance, SSE)
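A quick sketch of this decomposition in R, assuming d and the fitted model m2 shown below:

y   <- d$AmountSpent
SST <- sum((y - mean(y))^2)    # total variance
SSE <- sum(m2$residuals^2)     # unexplained variance
SSR <- SST - SSE               # explained variance
SSR / SST                      # reproduces Multiple R-squared (0.6584)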
Call: lm(formula = AmountSpent ~ Catalogs + Salary + Children, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1775.9  -348.7   -38.7   255.5  3211.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.428e+02  5.372e+01   8.242 5.29e-16 ***
Catalogs    4.770e+01  2.755e+00  17.310  < 2e-16 ***
Salary      2.041e-02  5.929e-04  34.417  < 2e-16 ***
Children   -1.987e+02  1.709e+01 -11.628  < 2e-16 ***

Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared: 0.6584, Adjusted R-squared: 0.6574
F-statistic: 640 on 3 and 996 DF, p-value: < 2.2e-16

The residual estimates (Min/1Q/Median/3Q/Max) summarize the actual residuals, which are stored in the model vector m2$residuals.
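As a quick check (a sketch, assuming m2 as above), the residual quartiles printed in the output can be reproduced from that vector:

summary(m2$residuals)                                   # Min/1Q/Median/3Q/Max
quantile(m2$residuals, probs = c(0, .25, .5, .75, 1))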
OLS Requirement for Residuals
OLS assumes that the residual ε is a random variable that is independently and identically distributed (i.i.d.) with the property ε ~ N(0, σ²).
The ε must be independent of each other ("random").
The ε will be independent if the observations (X, Y) are independent.
Note: You can plot the residuals ε against X or Ŷ (or scaled X), but not against Y, because Y includes the residual ε.
qqnorm(m2$res); qqline(m2$res, col="red")                   # QQ-plot of the residuals
qqnorm(d$AmountSpent); qqline(d$AmountSpent, col="red")     # QQ-plot of the raw response, for comparison
Sample QQ-Plots
[Figure: three histograms of residuals with normal overlays and their QQ-plots (N = 100 each; left and center: Mean 0, StDev 1; right: Mean 0.00008, StDev 1.191).]
Left: an almost perfect normal distribution. Center: high kurtosis with one outlier. Right: highly skewed (positive).
Sample QQ-Plots
[Figure: histogram of residuals with a normal overlay (Mean 0, StDev 1, N = 100) and the corresponding normal probability plot of a heavy-tailed sample.]
Heavy-tailed distribution:
Has many extreme values (both high and low) or many outliers, more often than a normal random variable (e.g., Pareto, lognormal, etc.).
Implies larger variance.
Light-tailed distribution:
Has fewer extreme values than a normal random variable (e.g., Exponential, Gamma, Weibull distributions).
Implies smaller variance.
Question: Are the Catalogs-model residuals normally distributed?
Kolmogorov-Smirnov Test:
Two-sample test, suitable for large samples (n > 2000).
A more general but less powerful test than Shapiro-Wilk.
H0: Both samples have equal distributions.
Construct a normally distributed sample, and compare the residuals against this normal sample.
If p < 0.05, reject H0 ⇒ the residuals are not normally distributed.
If p > 0.05, fail to reject H0 ⇒ no evidence that the residuals are non-normal.
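A minimal sketch of this procedure in R, assuming m2 from above; the construction of the comparison sample (and the seed) is an assumption, not from the slides:

set.seed(42)                          # for a reproducible comparison sample
norm.sample <- rnorm(length(m2$residuals),
                     mean = 0, sd = sd(m2$residuals))
ks.test(m2$residuals, norm.sample)    # two-sample KS: H0 = equal distributions
shapiro.test(m2$residuals)            # Shapiro-Wilk alternative (n <= 5000)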
Homoskedasticity
plot(m2$res ~ m2$fit)
Levene Test:
Bartlett's test is more sensitive to violations of normality than Levene's test.
Use the Levene test if the residuals are non-normal.

leveneTest(m2$res, m2$fit, center=mean)
        Df   F value    Pr(>F)
group  996 4.935e+31 < 2.2e-16 ***
         3
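Passing the raw fitted values as the grouping variable (as above) creates nearly one group per observation, which is why the output is degenerate (996 group df vs. 3 residual df). A more typical sketch bins the fitted values first; the four-bin choice is an assumption, not from the slides:

library(car)                                    # provides leveneTest()
grp <- cut(m2$fitted.values, breaks = 4)        # bin fitted values into 4 groups
leveneTest(m2$residuals ~ grp, center = mean)   # robust to non-normal residuals
bartlett.test(m2$residuals ~ grp)               # more sensitive to non-normality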
[Figure: paired scatterplots of y vs. x. The second pair shows the original fit y = 1.73 + 5.12x in each panel together with a refit after adding one point: y = 8.51 + 3.32x (left) and y = 2.96 + 5.04x (right).]
Leverage and Residuals
[Figure: two scatterplots of y vs. x. Left: the blue dot has high leverage and a large residual. Right: the blue dot has low leverage and a low residual.]
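A sketch of how leverage and influence can be checked numerically, assuming the fitted model m2 from above; the 4/n cutoff is a common rule of thumb, not from the slides:

lev  <- hatvalues(m2)             # leverage of each observation
cook <- cooks.distance(m2)        # influence: combines leverage and residual
plot(lev, rstandard(m2))          # high leverage + large residual = influential
which(cook > 4 / length(cook))    # flag observations above the 4/n cutoff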
Is multicollinearity “bad”?
What does a correlation of 1 mean? (Hint: See the principal diagonal)
What do correlations of 0.96, 0.95, or 0.92 between Popn, Outlets, and Vehicles mean?
What should you do?
[Figure: scatterplot matrix of Sales vs. Popn and Sales vs. Reps.]

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.582457   5.414913   1.954  0.07437 .
Mktg         0.036144   0.009645   3.747  0.00278 **
Outlets      0.131756   0.015815   8.331 2.48e-06 ***
Popn        -6.050930   2.668976  -2.267  0.04266 *
Vehicles     6.527655   4.480298   1.457  0.17079
Reps        -0.041927   0.766622  -0.055  0.95729

How do you interpret βReps = -0.04?
For each additional sales rep, sales decrease by $0.04K, after accounting for the effects of Mktg, Outlets, Popn, and Vehicles.
But how is this possible? cor(Sales, Reps) = 0.85, and the scatterplot of Sales vs. Reps shows a strong positive relationship.
Does the regression model contradict the scatterplot?
No. The scatterplot shows the bivariate correlation between Sales and Reps; the regression model shows that Reps is useless for predicting Sales in the presence of all the other predictors.
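A sketch of that comparison, assuming the Autoparts data frame d used in the vif() example below:

cor(d$Sales, d$Reps)                  # about 0.85: strong bivariate correlation
summary(lm(Sales ~ Reps, data = d))   # Reps alone predicts Sales well
summary(lm(Sales ~ Mktg + Outlets + Popn + Vehicles + Reps, data = d))
# Reps adds nothing once the other, collinear predictors are in the model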
Multicollinearity
OLS model: y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε
What happens if x₁ and x₂ are highly collinear?
If two predictors are almost the same, very little is gained by adding both to the OLS model.
Collinearity leads to unstable regression coefficients βᵢ because it inflates the standard errors of the βᵢ (i.e., a βᵢ may be non-significant even when R² is high), as the simulation below illustrates.
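A small simulated illustration of this effect (entirely synthetic data, not from the slides):

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)   # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(100)

summary(lm(y ~ x1))        # x1 has a small standard error and is significant
summary(lm(y ~ x1 + x2))   # standard errors inflate sharply; neither
                           # predictor may appear significant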
Diagnosing and Treating Multicollinearity
Correlation matrix: a bivariate method.
Include only one of any pair of highly correlated variables.
Variance inflation factor (VIF): examines each predictor as a linear combination of all the other predictors. For each predictor xj, fit
xj = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε   (all predictors except xj itself)
A high Rj² for this model means that xj is highly collinear with the other predictors.
Compute VIFj = 1/(1 − Rj²), for j = 1, …, p.
If VIFj > 10 for any xj (high Rj² ⇒ high VIFj), exclude that xj from your OLS model (a worked example follows).
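A sketch of this computation by hand for one predictor, assuming the Autoparts data frame d; the result should match the vif() output below:

r2.popn <- summary(lm(Popn ~ Mktg + Outlets + Vehicles + Reps,
                      data = d))$r.squared     # Rj^2 for xj = Popn
1 / (1 - r2.popn)                              # about 25.4 = vif(m2)["Popn"]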
install.packages("car")
library("car")
m2 <- lm(Sales ~ Mktg + Outlets + Popn + Vehicles + Reps, data=d)
vif(m2)
Mktg Outlets Popn Vehicles Reps
1.295925 16.664400 25.404095 15.070743 4.210560
Inference: Drop the predictor with the highest VIF (Popn) first, then refit and re-check; if any VIF is still above 10, drop Outlets (and, if needed, Vehicles) as well, as sketched below.
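A sketch of that iteration, assuming d and the car package from above:

m3 <- lm(Sales ~ Mktg + Outlets + Vehicles + Reps, data = d)   # drop Popn first
vif(m3)   # re-check; drop the next-highest-VIF predictor only if still > 10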
Example: Autoparts
  Region Sales Mktg Outlets Popn Vehicles Reps
1      1    78  482     482    3        1    8
2      2    59  429     262    2        1   10
[Figure: two plots of residuals vs. observation order (time).]
Positive serial correlation: residuals are followed, in time, by residuals of the same sign and about the same magnitude.
Negative serial correlation: residuals of one sign are followed, in time, by residuals of the opposite sign.
Correcting for Autocorrelation Violation
Model the autocorrelation either explicitly or implicitly.
Modeling autocorrelation explicitly (panel-data models):
For first-order autocorrelation, a model of the form yₜ = β₀ + β₁yₜ₋₁ + β₂x + ε could alleviate the problem (see the sketch below).
Check the residuals to see if the problem is solved.
Will be discussed in time-series models (e.g., autoregressive models, moving-average models).
Can be modeled in R using the ar() and arima() functions (base stats) or arma() (package tseries).
See also http://en.wikipedia.org/wiki/Autoregressive.
To be discussed later.
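A minimal sketch of the lagged-dependent-variable approach, assuming a hypothetical data frame ts.d with a response y and a predictor x ordered in time (the names are illustrative, not from the slides):

cur.y <- tail(ts.d$y, -1)            # y[t]
lag.y <- head(ts.d$y, -1)            # y[t-1]
cur.x <- tail(ts.d$x, -1)            # x[t]

m.ar1 <- lm(cur.y ~ lag.y + cur.x)   # y_t = b0 + b1*y_{t-1} + b2*x_t + e
summary(m.ar1)

library(car)
durbinWatsonTest(m.ar1)              # check whether autocorrelation remains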
Summary
Question:
Which of the following regression assumptions hold for the Catalogs data? HousePrices data?
Autoparts data?
1. Independence.
2. Multivariate normality.
3. Homoskedasticity.
4. Linearity.
5. Little or no multicollinearity.
6. No autocorrelation.
Can you use OLS regression with the above data? What if your R² is high?
How can you test for overfitting?
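One common answer, as a sketch: hold out part of the data and compare in-sample and out-of-sample fit. This assumes the Catalogs data frame d and the m2 formula from above; the 70/30 split and the seed are assumptions:

set.seed(7)
idx   <- sample(nrow(d), floor(0.7 * nrow(d)))   # 70/30 train-test split
train <- d[idx, ]
test  <- d[-idx, ]

m.tr <- lm(AmountSpent ~ Catalogs + Salary + Children, data = train)
pred <- predict(m.tr, newdata = test)
cor(pred, test$AmountSpent)^2   # out-of-sample R-squared; a large drop from
                                # the in-sample value suggests overfitting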
Key Takeaways
Before running regression models, always test the six regression assumptions.
Choose predictors using business sense and domain knowledge, not a kitchen-sink approach.
Multicollinearity may lead to spurious correlations and invalid models.
Multivariate normality of residuals: check with QQ-plots.
Homoskedasticity/heteroskedasticity of error variances: check residual-vs-fitted plots.
Independence of error terms.
Autocorrelation for time-series data.
Examine outliers and influence points.
OLS regression models are invalid if they fail the regression assumptions.