
Linear Regression:

Assumptions and Violations

ANOL BHATTACHERJEE, PH.D.


UNIVERSITY OF SOUTH FLORIDA
Outline
 Understand assumptions of regression models.
 Independence, linearity, multivariate normality, homoskedasticity, multicollinearity.
 How to test them in R: QQ-plots, residual plots.
 How to correct for some violations of assumptions.
How To Model Data?
 Understanding the data:
 What is the core DV and the core IV(s) of interest? Why are we interested in these variables?
 What hypotheses do we wish to test? What is the logic behind these hypotheses?
 Descriptive analysis:
 Univariate analysis: Histograms, boxplots.
 Bivariate analysis: Scatterplots, correlation matrix.
 Examine data quality: Multicollinearity.
 Run alternative regression models:
 Build model from ground up based on domain knowledge/intuition; NOT kitchen sink or top down.
 Compare model goodness-of-fit (multiple R2, adjusted R2, AIC, BIC).
 Test for regression assumptions:
 Homoskedasticity, linearity, multivariate normality.
 Respecify model if necessary (log or other transform, non-linear models, etc.).
 Derive inferences about your hypotheses from your analysis.
 Test predictive ability of model: Validation, k-fold cross validation.
Catalogs Example
m1 <- lm(AmountSpent ~ Catalogs + Salary + Gender + Married + Children, data=d)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -4.612e+02  7.638e+01  -6.038  2.2e-09 ***
Catalogs       4.780e+01  2.757e+00  17.337  < 2e-16 ***
Salary         2.089e-02  8.238e-04  25.362  < 2e-16 ***
GenderMale    -4.364e+01  3.732e+01  -1.169    0.243
MarriedSingle  2.723e+01  4.850e+01   0.561    0.575
Children      -2.014e+02  1.723e+01 -11.690  < 2e-16 ***

Residual standard error: 562.6 on 994 degrees of freedom
Multiple R-squared:  0.659, Adjusted R-squared:  0.6573
F-statistic: 384.2 on 5 and 994 DF,  p-value: < 2.2e-16

m2 <- lm(AmountSpent ~ Salary + Catalogs + Children, data=d)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.428e+02  5.372e+01  -8.242 5.29e-16 ***
Salary       2.041e-02  5.929e-04  34.417  < 2e-16 ***
Catalogs     4.770e+01  2.755e+00  17.310  < 2e-16 ***
Children    -1.987e+02  1.709e+01 -11.628  < 2e-16 ***

Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared:  0.6584, Adjusted R-squared:  0.6574
F-statistic:  640 on 3 and 996 DF,  p-value: < 2.2e-16

 Question:
 Model m2 has slightly higher adjusted R-squared, but is this a valid model?
 Does it meet the requirements (assumptions) of OLS regression?
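As noted in the outline, goodness-of-fit can also be compared with AIC/BIC and a partial F-test; a minimal sketch, assuming m1 and m2 are fit as above:

AIC(m1, m2)      # lower AIC is better, after penalizing model complexity
BIC(m1, m2)      # BIC penalizes additional parameters more heavily than AIC
anova(m2, m1)    # partial F-test: do Gender and Married add explanatory power over m2?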
Assumptions of the Regression Model
1. Independence: Observations independent of each other.
2. Multivariate normality: Residuals (errors) follow a normal distribution.
3. Homoskedasticity: Constant variance of residuals.
4. Linearity: Linear relationship between response and predictor variables.
5. Little or no multicollinearity: Predictors should not be highly correlated.
6. No autocorrelation: A variable is not correlated with itself over time (in panel or time series data).
Independence
 Independence:
 Observations (of random variables) are independent of each other, i.e., occurrence of one
observation does not affect the probability of other.
 Example of violation: The same person was surveyed twice.
 How to detect violation of independence:
 Cannot be tested statistically (there are statistical tests for autocorrelation, discussed later).
 Requires knowledge of study design or data collection.
 What can we do about ensuring independence?
 Collect your own data (i.e., design and run your own experiments or surveys) or use data
from reputable sources that clearly describe the data collection process, e.g., Centers for
Medicare & Medicaid Services (CMS), Bureau of Labor Statistics (BLS), World Bank, …
Residuals
Our actual model:       Y = β0 + β1X + ε   (Total variance, SST)
Our estimated model:    Ŷ = β0 + β1X       (Explained variance, SSR)
Error term or residual: ε = Y – Ŷ          (Unexplained variance, SSE)

Call: lm(formula = AmountSpent ~ Catalogs + Salary + Children, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1775.9  -348.7   -38.7   255.5  3211.3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.428e+02  5.372e+01  -8.242 5.29e-16 ***
Catalogs     4.770e+01  2.755e+00  17.310  < 2e-16 ***
Salary       2.041e-02  5.929e-04  34.417  < 2e-16 ***
Children    -1.987e+02  1.709e+01 -11.628  < 2e-16 ***

Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared:  0.6584, Adjusted R-squared:  0.6574
F-statistic:   640 on 3 and 996 DF,  p-value: < 2.2e-16

Note: The "Residuals" block reports summary statistics of the residual estimates; the actual residuals are stored in the model vector m2$residuals.
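A quick sketch to confirm that the stored residuals are simply Y – Ŷ (assuming m2 and d as above):

head(m2$residuals)                  # residuals stored in the fitted model object
head(residuals(m2))                 # extractor function, same values
head(d$AmountSpent - fitted(m2))    # computed by hand as Y - Yhat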
OLS Requirement for Residuals
 OLS assumes that the residual ε is a random variable that is independently and identically
distributed (i.i.d.) with the properties ε ~ N(0, σ²).
 ε must be independent of each other ("random").
 ε will be independent if observations (X, Y) are independent.
 ε must have a mean of zero.
 ε must have a constant variance σ² at all values of X (the "homoskedasticity" assumption).
 ε ~ N(0, σ²) is the "multivariate normality" assumption.
 What if we fail to meet the above assumptions:
 ε may include a random component and a systematic component: ε = εR + εS
 εR ~ N(0, σ²) but εS ≁ N(0, σ²)
 ε ≁ N(0, σ²) implies that there is a systematic error that we ignored in our model (e.g., an unobserved
variable) that may have biased our estimates.
 Our estimated parameters are not trustworthy and possibly invalid.
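A minimal numerical check of the mean-zero and variance properties for m2 (a sketch, not a formal test):

mean(m2$residuals)    # essentially 0 for any OLS fit that includes an intercept
sd(m2$residuals)      # close to the residual standard error
sigma(m2)             # residual standard error reported by summary(), about 562.5 here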
What i.i.d. Means
 What does the i.i.d. assumption mean for Y:
 ε = Y – Ŷ
 ε ~ N(0, σ²)
 Hence Y ~ N(Ŷ, σ²) for all X.
 For any given value of X, Y ~ N(Ŷ, σ²).
 For different values of X, Ŷ (i.e., the mean value of Y) changes, but the variance σ² should remain the same.
 Note:
 This assumption ε ~ N(0, σ²) may or may not be true.
 Must be checked statistically.
 If not true, our OLS estimates are biased (not trustworthy).
 In some cases, the model can be corrected.

[Figure: a fitted regression line Ŷ = β0 + β1X; at every value of X, Y is assumed to be normally distributed about the estimated line as Y ~ N(Ŷ, σ²).]
Catalogs: i.i.d.
plot(m2$res ~ d$Catalogs)
plot(m2$res ~ m2$fit)

Note: You can plot residuals ε against X or Ŷ (or scaled X), but not against Y, because Y includes the residual ε.

Questions: Does it seem that the residuals in model m2 satisfy ε ~ N(0, σ²)?
Anything else you see in these plots that looks strange?
Multivariate Normality: Quick Test
plot(m2$res ~ m2$fit)

 Residual is a linear combination of Y and Ŷ: ε = Y – Ŷ
 Ŷ is usually normally distributed due to the OLS method; hence if Y is not somewhat normal, it is unlikely
that ε will be normal.
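A quick visual check along these lines (a sketch using base-R histograms):

par(mfrow=c(1,2))
hist(d$AmountSpent, breaks=30)    # distribution of Y
hist(m2$res, breaks=30)           # distribution of the residuals
par(mfrow=c(1,1))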
QQ-Plot for Normality
 Quantile-Quantile Plot:
 qqnorm() plots the sample quantiles of a vector against the theoretical quantiles of the normal distribution; qqline() adds a reference line through the quartiles.
 If the data follow a normal distribution, the qqnorm points should be roughly linear and overlap with the qqline.

qqnorm(m2$res)
qqline(m2$res, col="red")

qqnorm(d$AmountSpent)
qqline(d$AmountSpent, col="red")
Sample QQ-Plots
[Figure: three example residual distributions, each shown as a histogram with a fitted normal curve and a normal probability (QQ) plot. Left: an almost perfect normal distribution. Middle: high kurtosis with one outlier. Right: high positive skewness.]
Sample QQ-Plots
 Heavy-tailed distribution:
 Has many extreme values (both high and low) or many outliers, more often than a normal random variable would produce (e.g., Pareto, lognormal).
 Implies larger variance.
 Light-tailed distribution:
 Has fewer extreme values than a normal random variable (e.g., Exponential, Gamma, Weibull distributions).
 Implies smaller variance.

[Figure: histogram and normal probability plot of a heavy-tailed residual distribution; the QQ plot bends away from the reference line at both ends.]
Formal Tests for Normality
 Shapiro-Wilk Test:
 Single-sample test, for small samples (n < 2000).
 H0: Data vector is normally distributed.
 If p < 0.05, reject H0 ⇒ data is not normally distributed.
 If p > 0.05, fail to reject H0 ⇒ data is normally distributed.

shapiro.test(m2$res)
W = 0.93702, p-value < 2.2e-16

 Kolmogorov-Smirnov Test:
 Two-sample test, for large samples (n > 2000).
 More general but less powerful test than Shapiro-Wilk.
 H0: Both samples have equal distributions.
 Construct a normally distributed sample, and compare residuals against this normal distribution.
 If p < 0.05, reject H0 ⇒ data is not normally distributed.
 If p > 0.05, fail to reject H0 ⇒ data is normally distributed.

norm <- rnorm(50)
ks.test(norm, m2$res)
D = 0.54, p-value = 1.738e-12
alternative hypothesis: two-sided

Question: Are Catalogs residuals normally distributed?
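An alternative not shown in the slides: the one-sample form of ks.test( ) compares the residuals directly against a normal distribution with the residuals' own mean and standard deviation, avoiding the arbitrary rnorm(50) reference sample. (Estimating the parameters from the same data makes the standard p-value inexact; the Lilliefors variant corrects for this.)

ks.test(m2$res, "pnorm", mean = mean(m2$res), sd = sd(m2$res))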
Homoskedasticity
plot(m2$res ~ m2$fit)

 Heteroskedasticity: non-constant error variance.


 Opposite of homoskedasticity (constant error variance).
 Residuals vary in some non-random pattern, e.g., fanning pattern (residuals close to 0 for small fitted
values and more spread out for large fitted values).
 OLS regression not appropriate here; weighted least squares regression may work.
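Beyond eyeballing the residual plot, the car package offers a score test for non-constant error variance (an addition here, not part of the original slides):

library(car)
ncvTest(m2)    # H0: constant error variance; a small p-value indicates heteroskedasticity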
More Examples
OLS assumption: ε ~ N(0, σ²)
 Biased: E(ε) ≠ 0; presence of unobserved variables.
 Heteroskedastic: Var(ε) ≠ σ²

Test for linearity:
 Unbiased (flat) residual plots in (a) and (d) suggest linearity.
 Curvature in (c) and (d) suggests non-linearity.
Bias Versus Variance
Formal Tests for Homoskedasticity
 Bartlett Test:
 H0: σ1² = σ2² (equal variances).
 If p > 0.05, fail to reject H0, i.e., the two samples have equal variances.
 Can be extended to more than two groups.
 Note: list( ) bundles the two vectors into a list, which is an accepted input format for bartlett.test( ).

bartlett.test(list(m2$res, m2$fit))
Bartlett's K-squared = 105.65, df = 1, p-value < 2.2e-16

 Levene Test:
 Bartlett's test is more sensitive to violations of normality than Levene's test.
 Use the Levene test if residuals are non-normal.

leveneTest(m2$res, m2$fit, center=mean)
        Df   F value    Pr(>F)
group  996 4.935e+31 < 2.2e-16 ***
         3

Question: Are Catalogs residuals homoskedastic?
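Since Bartlett and Levene compare variances across groups, another way to apply them (an assumption on my part, not shown in the slides) is to bin the fitted values and test whether residual variance is equal across the bins:

grp <- cut(m2$fitted.values,
           breaks = quantile(m2$fitted.values, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE)               # four groups by fitted-value quartile
bartlett.test(m2$residuals, grp)                # H0: equal residual variance across groups
leveneTest(m2$residuals, grp, center = mean)    # more robust if residuals are non-normal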


Back to Catalogs
plot(m2$res ~ m2$fit)

Questions: Are Catalogs residuals normal? Homoskedastic? Biased?


Do relationships on scatterplots look linear?
Outlier Analysis
library(car)
influencePlot(m2, id.method="identify")

 Outlier:
 A data point whose response y does not follow the general trend of the rest of the data.
 A major source of non-linearity.
 Outliers may or may not be influential:
 They are influential if they unduly influence any part of a regression analysis, such as predicted responses, estimated beta coefficients, hypothesis test results, etc.
 InfluencePlot:
 A live plot of residuals against the influence of each observation on the regression line.
 Click on points on the plot to identify the observation number.
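The same influence information can be extracted numerically with base-R functions (a sketch; the 4/n cutoff is only a rule of thumb):

flagged <- which(cooks.distance(m2) > 4/nrow(d))    # observations with unusually high influence
cooks.distance(m2)[flagged]                         # overall influence on the fitted model
hatvalues(m2)[flagged]                              # leverage: extremeness in the x-direction
rstudent(m2)[flagged]                               # studentized residuals: extremeness in the y-direction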
Effect of Outliers on Regression
[Figure: two pairs of scatterplots of y versus x. Left pair (non-influential outlier): the fitted line barely changes when the outlier is included, from y = 1.73 + 5.12x to y = 2.96 + 5.04x. Right pair (influential outlier): the fitted line changes substantially, from y = 1.73 + 5.12x to y = 8.51 + 3.32x.]
Leverage and Residuals

[Figure: two scatterplots of y versus x, each with one highlighted (blue) point. Left: the blue dot has high leverage and a large residual. Right: the blue dot has low leverage and a low residual.]

 Influence of an outlier depends on how “extreme” it is in x-direction and/or y-direction.


 Scatterplots are a good start, but they may not reveal the precise influence of an outlier; we can gauge
this using “leverage” and “residuals”.
To Delete or Not to Delete
 Delete outliers if:
 The observation is invalid (e.g., data entry error) and cannot be corrected.
 The observation is not representative of the population (e.g., drawn from a different population).
 The observation is invalidated by a procedural (e.g., measurement) error.
 Do NOT delete if:
 There is no good objective reason to believe that the observation is invalid.
 Extreme observations may be valid data points.
 How to deal with valid outliers:
 If linear models don’t fit, consider data transformation (e.g., log model).
 Consider a different functional relationship (e.g., quadratic models).
 Comment on deleting data points:
 If unsure how to handle an outlier, analyze data with and without the outlier and report both results.
 If you must delete any data, justify and document it in your report.
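One way to report results with and without an outlier — a sketch in which observation 123 is purely hypothetical, standing in for whatever point influencePlot( ) flagged:

m2_trim <- update(m2, data = d[-123, ])    # refit after dropping the (hypothetical) flagged row 123
round(coef(m2), 4)                         # estimates with the outlier
round(coef(m2_trim), 4)                    # estimates without it; report both if they differ materially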
Simplified Assumption Testing in R
The plot( ) function in R generates four graphs that can provide a quick test of
(1) homoskedasticity, (2) linearity, (3) multivariate normality, and (4) outlier analysis.

[Figure: the four diagnostic plots produced by plot( ) for a fitted model, annotated: heteroskedastic and non-linear; not multivariate normal; a few outliers, but none are influential.]
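A sketch of that one-line diagnostic for the Catalogs model m2:

par(mfrow = c(2, 2))    # arrange the four diagnostic plots in a 2 x 2 grid
plot(m2)                # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))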
How to Handle Violations?
 If data is bimodal:
 Break data up into two (or more) clusters.
 If residuals (response variable) are not normal:
 Check for outliers.
 Consider data transformation: log, lognormal (exponential), Box-Cox.
 Consider more flexible generalized linear models (glm).
 If there is heteroskedasticity or multicollinearity:
 Consider generalized least squares (gls) models.
 If relationship is non-linear:
 Consider quadratic, general additive model (gam), or other non-linear models.

Statistical modeling is not an exact science. It requires understanding the data


(i.e., distribution), domain knowledge, gut feel, and trial-and-error.
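As one concrete illustration of the transformation route above, a log transform of the response is often the first thing to try when residuals are right-skewed — a sketch, assuming AmountSpent is strictly positive (use log1p( ) if zeros are possible):

m2_log <- lm(log(AmountSpent) ~ Salary + Catalogs + Children, data = d)
plot(m2_log)    # re-check the residual diagnostics on the transformed model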
Multicollinearity
 High correlation between independent (predictor) variables.
 Not an issue in the Catalogs data.
dtemp <- cbind(d[ ,6:7], d[ ,9:10])
round(cor(dtemp), 3)

             Salary Children Catalogs AmountSpent
Salary        1.000    0.050    0.184       0.700
Children      0.050    1.000   -0.113      -0.222
Catalogs      0.184   -0.113    1.000       0.473
AmountSpent   0.700   -0.222    0.473       1.000
Multicollinearity: Autoparts
round(cor(d), 3)
         Region  Sales   Mktg Outlets   Popn Vehicles   Reps
Region    1.000 -0.187  0.233  -0.218 -0.273   -0.238 -0.248
Sales    -0.187  1.000 -0.035   0.972  0.896    0.896  0.852
Mktg      0.233 -0.035  1.000  -0.217 -0.273   -0.172 -0.091
Outlets  -0.218  0.972 -0.217   1.000  0.953    0.920  0.858
Popn     -0.273  0.896 -0.273   0.953  1.000    0.960  0.787
Vehicles -0.238  0.896 -0.172   0.920  0.960    1.000  0.743
Reps     -0.248  0.852 -0.091   0.858  0.787    0.743  1.000

The Autoparts data (18 regions):
Region  Sales Mktg Outlets Popn Vehicles Reps
     1  77.51  482     482 2.83     0.83    8
     2  58.90  429     262 1.81     0.69   10
     3  71.44   84     604 3.67     1.09   11
     4 117.45  245     837 5.04     1.60   13
     5  98.78  134     819 6.95     2.55    9
     6 126.60  461     939 7.43     2.77   14
     7  37.99  323     158 1.09     0.39    6
     8  78.73  351     497 3.14     0.85    9
     9 104.00  298     796 5.59     1.63   14
    10  47.32  253     305 2.57     0.95    7
    11  61.45  335     394 2.88     0.95    8
    12  38.28  329     169 1.32     0.38    7
    13  38.62  396     134 1.20     0.42    5
    14  61.04  261     330 2.28     0.92    6
    15  60.90  474     375 2.08     0.84    9
    16 120.93  349     873 5.16     1.81   12
    17  75.75  549     422 3.15     1.30    8
    18  66.78  286     459 3.12     0.98   10

 Is multicollinearity “bad”?
 What does a correlation of 1 mean? (Hint: See the principal diagonal)
 What do correlations of 0.96 or 0.95, or 0.92 between Popn, Outlets, Vehicles mean?
 What should you do?
Autoparts: The Effect of Reps?

             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.582457   5.414913   1.954  0.07437 .
Mktg         0.036144   0.009645   3.747  0.00278 **
Outlets      0.131756   0.015815   8.331 2.48e-06 ***
Popn        -6.050930   2.668976  -2.267  0.04266 *
Vehicles     6.527655   4.480298   1.457  0.17079
Reps        -0.041927   0.766622  -0.055  0.95729

[Figure: scatterplots of Sales against Mktg, Outlets, Popn, and Reps.]

 How do you interpret βReps = -0.04?
 For each additional sales rep, sales decrease by about $0.04K, after accounting for the effects of Mktg, Outlets, Popn, and Vehicles.
 But how is this possible?
 cor(Sales, Reps) = 0.85
 The scatterplot of sales vs. reps shows a strong positive relationship.
 Does the regression model contradict the scatterplot?
 No. The scatterplot shows a positive correlation between Sales and Reps; the regression model shows
that Reps is useless in predicting Sales in the presence of all other factors.
Multicollinearity
 OLS model: y = β0 + β1x1 + β2x2 + β3x3 + ε
 What happens if x1 and x2 are highly collinear?
 If two predictors are almost the same, very little is gained by adding both to the OLS model.
 Collinearity leads to unstable regression coefficients βi because it inflates the standard errors of βi (i.e., βi
may be non-significant, even if R² is high).
Diagnosing and Treating Multicollinearity
 Correlation matrix: A bivariate method.
 Include only one of the highly correlated variables.
 Variance inflation factor (VIF): Examines each predictor as a linear combination of all other predictors.
 Regress each predictor on the others: xi = β0 + β1x1 + β2x2 + … + βpxp + ε (all predictors except xi on the right-hand side).
 A high R² for this auxiliary model means that xi is highly collinear with the other predictors.
 Compute VIFi = 1 / (1 − Ri²) for each predictor i = 1, …, p.
 If VIF > 10 for any xi (high R² ⇒ high VIF), exclude that xi from your OLS model.

install.packages("car")
library("car")
m2 <- lm(Sales ~ Mktg + Outlets + Popn + Vehicles + Reps, data=d)
vif(m2)
Mktg Outlets Popn Vehicles Reps
1.295925 16.664400 25.404095 15.070743 4.210560

Inference: Must drop Popn and Outlets from model; retest; if needed, drop Vehicles too
Example: Autoparts
Region Sales Mktg Outlets Popn Vehicles Reps
1 1 78 482 482 3 1 8
2 2 59 429 262 2 1 10

m2 <- lm(Sales ~ Mktg + Reps + Vehicles, data=d)

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.46349   12.26264   0.201 0.843671
Mktg         0.02567    0.02138   1.201 0.249729
Reps         4.25897    1.38835   3.068 0.008352 **
Vehicles    19.84729    4.33583   4.578 0.000431 ***

Residual standard error: 10.37 on 14 degrees of freedom
Multiple R-squared: 0.8907, Adjusted R-squared: 0.8672
F-statistic: 38.01 on 3 and 14 DF, p-value: 5.584e-07

AIC = 140.76, BIC = 145.21

vif(m2)
    Mktg     Reps Vehicles
1.033925 2.242911 2.292451

Question: Are these regression estimates trustworthy?


Autoparts: Assumptions
m2 <- lm(Sales ~ Mktg + Reps + Vehicles, data=d)
plot(m2)

 Question: Are regression assumptions met?


Autocorrelation
 Autocorrelation (also called serial correlation):
 Correlation of a variable with its own values at earlier points in time.
 Violates the independence assumption.
 Example: stock prices, sales by month, quarter, or year.
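A common formal test (an addition here, assuming the car package) is the Durbin-Watson test on a model fit to time-ordered data; the series below is simulated purely for illustration:

library(car)
set.seed(1)
t_idx <- 1:100
y_t   <- 5 + 0.3*t_idx + arima.sim(list(ar = 0.7), n = 100)    # trend plus AR(1) errors
m_ts  <- lm(y_t ~ t_idx)
durbinWatsonTest(m_ts)    # H0: no first-order autocorrelation; a statistic near 2 suggests none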
Residuals Versus the Order of the Data
[Figure: two plots of residuals against observation order (time), response is Value. Left: positive serial correlation. Right: negative serial correlation.]

 Positive serial correlation: Residuals are followed, in time, by residuals of the same sign and about the same magnitude.
 Negative serial correlation: Residuals of one sign are followed, in time, by residuals of the opposite sign.
Correcting for Autocorrelation Violation
 Model autocorrelation either explicitly or implicitly.
 Modeling autocorrelation explicitly (panel data models):
 For first-order autocorrelation, a model of the form yt = β0 + β1yt-1 + β2xt could alleviate the problem.
 Check residuals to see if the problem is solved.
 Modeling autocorrelation implicitly:
 Model the autocorrelation in the error term, e.g., εt = ρεt-1 + ut, where ut is a standard normal random variable.
 Will be discussed in time series models (e.g., autoregressive models, moving average models).
 Can be modeled in R using the ar( ), arma( ), and arima( ) functions.
 See also http://en.wikipedia.org/wiki/Autoregressive.
 To be discussed later.
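A minimal sketch of the implicit approach with base-R time series functions (ar( ) and arima( ) are in stats; arma( ) is in the tseries package), reusing the simulated m_ts from the Durbin-Watson sketch above:

e <- residuals(m_ts)           # residuals from a model fit to time-ordered data
acf(e)                         # autocorrelation function: spikes beyond lag 0 indicate serial correlation
arima(e, order = c(1, 0, 0))   # fit an AR(1) to the residuals; the ar1 coefficient estimates the persistence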
Summary
 Question:
 Which of the following regression assumptions hold for the Catalogs data? HousePrices data?
Autoparts data?
1. Independence.
2. Multivariate normality.
3. Homoskedasticity.
4. Linearity.
5. Little or no multicollinearity.
6. No autocorrelation.
 Can you use OLS regression with the above data? What if your R2 is high?
 How can you test for overfitting?
Key Takeaways
 Before running regression models, always test for six regression assumptions.
 Choose predictors using business sense and domain knowledge, not kitchen sink.
 Multicollinearity may lead to spurious correlations and invalid models.
 Multivariate normality of residuals: Use QQ-plots.
 Homoskedasticity/heteroskedasticity of error variances.
 Independence of error terms.
 Autocorrelation for time series data.
 Examine outliers and influential points.
 OLS regression models are invalid if they fail regression assumptions.
