
REGRESSION FANTASIES

Eleven Reasons for Doubting a Regression Model


Finding a model that fits a set of data is one of the most common goals in data analysis. Least
squares regression is the most commonly used tool for achieving this goal. It’s a relatively
simple concept, it’s easy to do, and there’s a lot of readily available software to do the
calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein
lies the problem. Even if there is no intention to mislead anyone, it does happen.
Here are some of the most common reasons to doubt a regression model.
1. Not Enough Samples
Accuracy is a critical component for evaluating a model. The coefficient of determination, also
known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more
accurate a model is the better, so data analysts look for large values of R-squared.
R-squared is designed to estimate the maximum relationship between the dependent and
independent variables based on a set of samples (cases, observations, records, or whatever you
want to call them). If there aren’t enough samples compared to the number of independent
variables in the model, the estimate of R-squared will be especially unstable. The effect is
greatest when the R-squared value is small, the number of samples is small, and the number of
independent variables is large, as shown in this figure.
The inflation in the value of R-squared can be adjusted by calculating the shrunken (adjusted)
R-squared. The figure shows that for an R-squared value above 0.8 with 30 cases per variable,
there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable
shrinkage.
You can’t control the magnitude of the relationship between a dependent variable and a set of
independent variables, and often, you won’t have total control of the number of samples and
variables either. So, you have to be aware that R-squared will be overestimated and treat your
regression models with some skepticism.
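The shrinkage described above can be sketched with the common adjusted R-squared formula. This is only an illustration: the original figure may have used a different shrinkage formula, and the function name and the sample sizes below are made up for the example.

```python
def adjusted_r_squared(r2: float, n_samples: int, n_predictors: int) -> float:
    """Common adjusted (shrunken) R-squared for a model that includes an intercept."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical scenarios: 5 independent variables with 3, 10, and 30 cases per variable.
for r2 in (0.2, 0.5, 0.8):
    for n in (15, 50, 150):
        adj = adjusted_r_squared(r2, n, 5)
        print(f"R2={r2:.1f}  n={n:3d}  adjusted={adj:.3f}")
```

With few cases per variable, a small R-squared can shrink all the way to zero or below, while a large R-squared with plenty of cases barely moves.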
2. No Intercept

Almost all software that performs regression analysis provides an option to not include an
intercept term in the model. This sounds convenient, especially for relationships that presume a
one-to-one relationship between the dependent and independent variables. But when an intercept
is excluded from the model, it’s not omitted from the analysis, it is set to zero. Look at any
regression model with “no intercept” and you’ll see that the regression line goes through the
origin of the axes.

With the regression line nailed down on one end at the origin, you might expect that the value of
R-squared would be diminished because the line wouldn’t necessarily travel through the data in a
way that minimizes the differences between the data points and the regression line, called the
errors or residuals. Instead, R-squared is artificially inflated because when the correction
provided by the intercept is removed, the total variation is measured around zero rather than
around the mean of the dependent variable, so the total variation increases. The ratio of the
variability attributable to the model compared to the total variability increases, hence the increase
in R-squared.

The solution is simple. Always have an intercept term in the model unless there is a compelling
theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the
F-tests).
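Here is a minimal sketch of the inflation, assuming NumPy is available and using synthetic data (not data from any real study). The no-intercept R-squared is computed the way most software reports it, with total variation measured around zero instead of around the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=50)
y = 30 + 0.5 * x + rng.normal(0, 2, size=50)  # synthetic data with a true intercept of 30


def fit_r2(x, y, intercept):
    """Least squares fit with or without an intercept; returns (R-squared, error sum of squares)."""
    X = np.column_stack([np.ones_like(x), x]) if intercept else x[:, None]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sse = resid @ resid
    # With an intercept, total variation is measured around the mean of y.
    # Without one, the usual convention measures it around zero.
    sst = ((y - y.mean()) ** 2).sum() if intercept else (y ** 2).sum()
    return 1 - sse / sst, sse


r2_with, sse_with = fit_r2(x, y, intercept=True)
r2_without, sse_without = fit_r2(x, y, intercept=False)
print(f"with intercept:    R2={r2_with:.3f}  SSE={sse_with:.1f}")
print(f"without intercept: R2={r2_without:.3f}  SSE={sse_without:.1f}")
```

The line forced through the origin fits these points worse (larger error sum of squares), yet it reports the higher R-squared.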

3. Stepwise Regression
Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup
of coffee, and the silicon chips will tell you which variables to use to get the best model. That
irritates hard-core statisticians who don’t like amateurs messing around with numbers. You can
bet, though, that at least some of them go home at night, throw all the food in their cupboard into
a crock pot, and expect to get a meal out of it.
The cause of some statisticians’ consternation is that stepwise regression will select the variables
that are best for the dataset, but not necessarily the population. Model test probabilities are
optimistic because they don’t account for the ability of the stepwise procedure to capitalize on
chance. Moreover, adding new variables will never decrease R-squared, so you have to have
some good ways to decide how many variables is too many. There are ways to do this, such as
adjusted R-squared and information criteria like AIC and BIC. So using stepwise regression
alone isn’t a fatal flaw. Like with guns, drugs, and fast food, you have to be careful how you use it.
If you use stepwise regression, be sure to look at the diagnostic statistics. Also, verify your
results using a different data set either by splitting the original data set before you do any
analysis, by extracting observations randomly from the original data set to create new data sets,
or by collecting new samples.
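The split-sample check can be sketched as follows, assuming NumPy and purely synthetic noise data. The selection step is a crude stand-in for stepwise regression (it just keeps the predictors most correlated with the response), not a real stepwise procedure, and all of the names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, keep = 60, 40, 5
X = rng.normal(size=(n, p))  # 40 candidate predictors, all pure noise
y = rng.normal(size=n)       # response unrelated to every predictor

# Split the data set before doing any analysis.
train = np.arange(n) < n // 2
test = ~train

# Crude stand-in for stepwise selection, using the training half only.
corr = np.array([abs(np.corrcoef(X[train, j], y[train])[0, 1]) for j in range(p)])
chosen = np.argsort(corr)[-keep:]


def design(rows):
    """Design matrix (intercept plus chosen predictors) for the selected rows."""
    return np.column_stack([np.ones(rows.sum()), X[rows][:, chosen]])


def r_squared(Xd, yd, coef):
    resid = yd - Xd @ coef
    return 1 - (resid @ resid) / ((yd - yd.mean()) ** 2).sum()


coef, *_ = np.linalg.lstsq(design(train), y[train], rcond=None)
r2_train = r_squared(design(train), y[train], coef)
r2_test = r_squared(design(test), y[test], coef)
print(f"R2 on the data used for selection: {r2_train:.3f}")
print(f"R2 on the held-out data:           {r2_test:.3f}")
```

Because the predictors are noise, any R-squared on the selection half comes from capitalizing on chance, and it does not carry over to the held-out half.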
4. Outliers
Outliers are a special irritant for data analysts. They’re not really that tough to identify but they
do cause a variety of problems that data analysts have to deal with. The first problem is
convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, they
have to convince all reviewers that what they want to do with the outliers, delete or include or
whatever, is the appropriate thing to do. But one way or another, outliers will wreak havoc with
R-squared.
Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic
conductivity of an aquifer. The red circles show the relationship between rising-head and falling-
head slug tests performed on groundwater monitoring wells. The model for this relationship has
an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation)
about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green
square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are
fairly sizable differences to have been caused by a single data point.
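To get a feel for how much one point can matter, here is a small sketch, assuming NumPy. The data are synthetic and are not the slug-test data from the figure, so the R-squared values will differ from those quoted above. Only the direction of the changes is the point.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of a straight-line least squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()


rng = np.random.default_rng(11)
x = np.arange(1.0, 21.0)
y = x + rng.normal(0, 2, size=x.size)  # synthetic points scattered around y = x

r2_base = r_squared(x, y)
# One extra point on the trend line, 60% beyond the largest x value.
r2_along = r_squared(np.append(x, 32.0), np.append(y, 32.0))
# One extra point far off the trend line.
r2_perp = r_squared(np.append(x, 2.0), np.append(y, 30.0))

print(f"no outlier:               R2={r2_base:.2f}")
print(f"outlier along the trend:  R2={r2_along:.2f}")
print(f"outlier across the trend: R2={r2_perp:.2f}")
```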

How should you deal with outliers? I usually delete them because I’m usually looking to model
trends and other patterns. But outliers are great thought provokers. Sometimes they tell you
things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the
analysis both with and without outliers, a time consuming and expensive approach. The other
approach would be to get the reviewer, an interested stakeholder, or an independent expert
involved in the decision. That approach is time consuming and expensive too. Pick your poison.
5. Non-linear relationships
Linear regression assumes that the relationship between a dependent variable and a set of
independent variables is additive, or linear. If the relationship is actually nonlinear, the
R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.
This figure shows the relationship between the number of employed individuals and the number
of individuals not in the U.S. work force between 1980 and 2009. The linear model has a
respectable R-squared value of 0.84, but the polynomial model fits the data much better with an
R-squared value of 0.95.

Non-linear relationships are a relatively simple problem to fix, or at least acknowledge, once you
know what to look for. Graph your data and go from there.
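If you want numbers to go with the graph, a quick comparison like this one can help. It is a sketch assuming NumPy and synthetic curved data, not the employment data from the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2 + 0.5 * x**2 + rng.normal(0, 3, size=x.size)  # synthetic curved relationship


def poly_r2(x, y, degree):
    """R-squared of a least squares polynomial fit of the given degree."""
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    return 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()


r2_linear = poly_r2(x, y, 1)
r2_quadratic = poly_r2(x, y, 2)
print(f"linear:    R2={r2_linear:.3f}")
print(f"quadratic: R2={r2_quadratic:.3f}")
```

Keep in mind that a higher-degree polynomial can never fit the same data worse than the straight line nested inside it, so a small gain in R-squared is not evidence of curvature. Look for a large gain and a visible pattern in the plot.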
6. Overfitting
Overfitting involves building a statistical model solely by optimizing statistical parameters, and
usually involves using a large number of variables and transformations of the variables. The
resulting model may fit the data almost perfectly but will produce erroneous results when applied
to another sample from the population.
The concern about overfitting may be somewhat overstated. Overfitting is like becoming too
muscular from weight training. It doesn’t happen suddenly or simply. It’s not something that
happens in a keystroke. It takes a lot of work fine tuning variables and what not. If you know
what overfitting is, you’re not likely to become a victim. It’s also usually easy to identify
overfitting in other people’s models. Simply look for a conglomeration of manual numerical
adjustments, mathematical functions, and variable combinations.
7. Misspecification
Misspecification involves including terms in a model that make the model look great statistically
even though the model is problematical. Often, misspecification involves placing the same or
very similar variable on both sides of the equation.
Consider this example from economics. A model for the U.S. Gross Domestic Product (GDP)
was developed using data on government spending and unemployment from 1947 to 1997. The
model:
GDP = (121*Spending) - (3.5*Spending^2) + (136*Time) - (61*Unemployment) - 566
had an R-squared value of 0.9994. Such a high R-squared value is a signal that something is
amiss. R-squared values that high are usually only seen in models involving equipment
calibration, and certainly not anything involving capricious human behavior. A closer look at the
study indicated that the model terms involving spending were an index of the government’s
outlays relative to the economy. Usually, indexing a variable to a baseline or standard is a good
thing to do. In this case, though, the spending index was the ratio of government outlays
to GDP. Thus, the model was:
GDP = (121*Outlays/GDP) - (3.5*(Outlays/GDP)^2) + (136*Time) - (61*Unemployment) - 566
GDP appears on both sides of the equation, thus accounting for the near perfect correlation. This
is a case in which an index, at least one involving the dependent variable, should not have been
used.
Another misspecification involves creating a prediction model having independent variables that
are more difficult, time consuming, or expensive to generate than the dependent variable. You
might as well just measure the dependent variable when you need to know its value. Similarly
with forecasting (prediction of the future) models, if you need to forecast something a year in
advance, don’t use predictors that are measured less than a year in advance.
8. Multicollinearity

Multicollinearity occurs when a model has two or more independent variables that are highly
correlated with each other. The consequences are that the model will look fine, but predictions
from the model will be erratic. It’s like a football team. The players perform well together but
you can’t necessarily tell how good individual players are. The team wins, yet in some situations,
the cornerback or offensive tackle will get beat on most every play.

If you ever tried to use independent variables that add to a constant, you’ve seen
multicollinearity in action. In the case of perfect correlations, such as these, statistical software
will crash because it won’t be able to perform the matrix mathemagics of regression. Most
instances of multicollinearity involve weaker correlations that allow statistical software to
function, yet the predictions of the model will still be erratic.

Multicollinearity occurs often in the social sciences and other fields of study in which many
variables are measured in the process of model building. Diagnosis of the problem is simple if
you have access to the data. Look at correlations between the independent variables. You can
also look at the variance inflation factors (VIFs). The VIF for an independent variable is the
reciprocal of one minus the R-squared value obtained by regressing that variable on all the other
independent variables. VIFs measure how much the variances of the model’s coefficient
estimates are inflated by multicollinearity. The VIF for a variable should be less than 10
and ideally near 1.
If you suspect multicollinearity, don’t worry about the model but don’t believe any of the
predictions.
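A VIF calculation can be sketched in a few lines, assuming NumPy and the usual definition: regress each independent variable on all the others and take 1 / (1 - R-squared). The function name `vif` and the three synthetic variables are made up for the example.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        factors.append(1 / (1 - r2))
    return np.array(factors)


rng = np.random.default_rng(7)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # nearly a copy of a
c = rng.normal(size=200)                 # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The two nearly identical variables should show VIFs far above 10, while the unrelated one stays near 1.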

9. Heteroscedasticity
Regression, and practically all parametric statistics, requires that the variances in the model
residuals be equal at every value of the dependent variable. This assumption is called equal
variances, homogeneity of variances, or coolest of all, homoscedasticity. Violate the assumption
and you have heteroscedasticity.
Heteroscedasticity is assessed much more commonly in analysis of variance models than in
regression models. This is probably because the independent variables in ANOVA are measured on a
categorical scale, which provides natural groups for comparing variances, while the independent
variables in regression are measured on a continuous scale.
The solution to this is fairly simple. Break the dependent variable scale into intervals, like in a
histogram, and calculate the variance for each interval. The variances don’t have to be precisely
equal, but variances different by a factor of five are problematical. Unequal variances will wreak
havoc on any tests or confidence limits calculated for model predictions.
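The interval check can be sketched like this, assuming NumPy and synthetic data. One substitution to note: the sketch bins the residuals by the model's fitted values instead of by the raw dependent variable, which is a common way to set up the same comparison. The five equal-count bins are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=300)
y = 5 + 2 * x + rng.normal(0, 0.5 * x)  # synthetic data: noise grows with x

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Sort residuals by fitted value and split them into 5 equal-count intervals.
order = np.argsort(fitted)
bins = np.array_split(resid[order], 5)
variances = np.array([b.var(ddof=1) for b in bins])
ratio = variances.max() / variances.min()

print("residual variance by interval:", np.round(variances, 2))
print(f"largest / smallest variance: {ratio:.1f}")
```

A ratio well above five, as the rule of thumb above suggests, is a sign that tests and confidence limits from the model shouldn't be trusted.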
10. Autocorrelation
Autocorrelation involves a variable being correlated with itself. It is the correlation between data
points with the previously listed data points (termed a lag). Usually, autocorrelation involves
time-series data or spatial data, but it can also involve the order in which data are collected. The
terms autocorrelation and serial correlation are often used interchangeably. If the data points are
collected at a constant time interval, the term autocorrelation is more typically used.
If the residuals of a model are autocorrelated, it’s a sure bet that the variances will also be
unequal. That means, again, that tests or confidence limits calculated from variances should be
suspect.
To check a variable or residuals from a model for autocorrelation, you can conduct a
Durbin-Watson test. The Durbin-Watson test statistic ranges from 0 to 4. If the statistic is close to 2.0,
then serial correlation is not a problem. Most statistical software will allow you to conduct this
test as part of a regression analysis.
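If your software doesn't report it, the statistic itself is easy to compute: the sum of squared differences between successive residuals divided by the sum of squared residuals. Here is a sketch assuming NumPy, with two synthetic series standing in for model residuals.

```python
import numpy as np


def durbin_watson(resid):
    """Durbin-Watson statistic for residuals listed in collection order."""
    resid = np.asarray(resid, dtype=float)
    return (np.diff(resid) ** 2).sum() / (resid ** 2).sum()


rng = np.random.default_rng(5)
white = rng.normal(size=500)  # no serial correlation

# Synthetic series in which each value carries over 80% of the previous one.
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(f"uncorrelated series:  DW={dw_white:.2f}")
print(f"autocorrelated series: DW={dw_ar1:.2f}")
```

Values well below 2 point to positive serial correlation, and values well above 2 point to negative serial correlation.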
11. Weighting
Most software that calculates regression parameters also allows you to weight the data points.
You might want to do this for several reasons. Weighting is used to make more reliable or
relevant data points more important in model building. It’s also used when each data point
represents more than one value. The issue with weighting is that it will change the degrees of
freedom, and hence, the results of statistical tests. Usually this is OK, a necessary change to
accommodate the realities of the model. However, if you ever come upon a weighted least
squares regression model in which the weightings are arbitrary, perhaps done by an analyst who
doesn’t understand the consequence, don’t believe the test results.
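The degrees-of-freedom issue is easiest to see with frequency weights, where each data point stands for several identical observations. This sketch assumes NumPy. The data, the weights, and the `wls` helper are hypothetical.

```python
import numpy as np


def wls(x, y, w):
    """Weighted least squares line fit; returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    # Scaling each row by the square root of its weight turns WLS into ordinary least squares.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.4])
w = np.array([1, 1, 1, 1, 6])  # hypothetical weights: the last point counts six times

coef_weighted = wls(x, y, w)
# Same fit obtained by literally repeating each row w times, with equal weights.
coef_repeated = wls(np.repeat(x, w), np.repeat(y, w), np.ones(w.sum()))

df_rows = len(x) - 2     # residual degrees of freedom if each row is one observation
df_weights = w.sum() - 2  # residual degrees of freedom if the weights are treated as counts
print(coef_weighted, coef_repeated)
print(df_rows, df_weights)
```

The coefficients agree, but the degrees of freedom jump from 3 to 8, so any test statistic built on them changes. If the weights are arbitrary, so is that change.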
Is Your Regression Model Telling the Truth?
There are many technologies we use in our lives without really understanding how they work.
Television. Computers. Cell phones. Microwave ovens. Cars. Even many things about the
human body are not well understood. But I don’t mean how to use these mechanisms. Everyone
knows how to use these things. I mean understanding them well enough to fix them when they
break. Regression analysis is like that too. Only with regression analysis, sometimes you can’t
even tell if there’s something wrong without consulting an expert.
Here are some tips for troubleshooting regression models.
Diagnosis
You may know how to use regression analysis, but unless you’re an expert, you may not know
about some of the more subtle pitfalls you may encounter. The biggest red flag that something is
amiss is the TGTBT, too good to be true. If you encounter an R-squared value above 0.9,
especially unexpectedly, there’s probably something wrong. Another red flag is inconsistency. If
estimates of the model’s parameters change between data sets, there’s probably something
wrong. And if predictions from the model are less accurate or precise than you expected, there’s
probably something wrong.
Here are some guidelines for troubleshooting a model you developed.

Your Model: Not Enough Samples
Identification: If you have fewer than 10 observations for each independent variable you want to
put in a model, you don’t have enough samples.
Correction: Collect more samples. 100 observations per variable is a good target to shoot for
although more is usually better.

Your Model: No Intercept
Identification: You’ll know it if you do it.
Correction: Put in an intercept and see if the model changes.

Your Model: Stepwise Regression
Identification: You’ll know it if you do it.
Correction: Don’t abdicate model building decisions to software alone.

Your Model: Outliers
Identification: Plot the dependent variable against each independent variable. If more than about
5% of the data pairs plot noticeably apart from the rest of the data points, you may have outliers.
Correction: Conduct a test on the aberrant data points to determine if they are statistical
anomalies. Use diagnostic statistics like leverage to evaluate the effects of suspected outliers.
Evaluate the metadata of the samples to determine if they are representative of the population
being modeled. If so, retain the outlier as an influential observation (AKA leverage point).

Your Model: Non-linear relationships
Identification: Plot the dependent variable against each independent variable. Look for nonlinear
patterns in the data.
Correction: Find an appropriate transformation of the independent variable.

Your Model: Overfitting
Identification: If you have a large number of independent variables, especially if they use a
variety of transformations and don’t contribute much to the accuracy and precision of the model,
you may have overfit the model.
Correction: Keep the model as simple as possible. Make sure the ratio of observations to
independent variables is large. Use diagnostic statistics like AIC and BIC to help select an
appropriate number of variables.

Your Model: Misspecification
Identification: Look for any variants of the dependent variable in the independent variables.
Assess whether the model meets the objectives of the effort.
Correction: Remove any elements of the dependent variable from the independent variables.
Remove at least one component of variables describing mixtures. Ensure the model meets the
objectives of the effort with the desired accuracy and precision.

Your Model: Multicollinearity
Identification: Calculate correlation coefficients and plot the relationships between all the
independent variables in the model. Look for high correlations.
Correction: Use diagnostic statistics like VIF to evaluate the effects of suspected
multicollinearity. Remove intercorrelated independent variables from the model.

Your Model: Heteroscedasticity
Identification: Plot the variance at each level of an ordinal-scale dependent variable or
appropriate ranges of a continuous-scale dependent variable. Look for any differences in the
variances of more than about five times.
Correction: Try to find an appropriate Box-Cox transformation or consider nonparametric
regression or data mining methods.

Your Model: Autocorrelation
Identification: Plot the data over time, location or the order of sample collection. Calculate a
Durbin-Watson statistic for serial correlation.
Correction: If the autocorrelation is related to time, develop a correlogram and a partial
correlogram. If the autocorrelation is spatial, develop a variogram. If the autocorrelation is
related to the order of sample collection, examine metadata to try to identify a cause.

Your Model: Weighting
Identification: You’ll know it if you do it.
Correction: Compare the weighted model with the corresponding unweighted model to assess the
effects of weighting. Consider the validity of weighting; seek expert advice if needed.

Sometimes the model you are skeptical about isn’t one you developed; it was developed
by another data analyst. The major difference is that with other analysts’ models, you
won’t have access to all their diagnostic statistics and plots, let alone their data. If you have been
retained to review another analyst’s work, you can always ask for the information you need. If,
however, you’re reading about a model in a journal article, book, or website, you’ve probably
got all the information you’re ever going to get. You have to be a statistical detective. Here are
some clues you might look for.

Another Analyst’s Model: Not Enough Samples
Identification: If the analyst reported the number of samples used, look for at least 10
observations for each independent variable in the model.

Another Analyst’s Model: No Intercept
Identification: If the analyst reported the actual model (some don’t), look for a constant term.

Another Analyst’s Model: Stepwise Regression
Identification: Unless another approach is reported, assume the analyst used some form of
stepwise regression.

Another Analyst’s Model: Outliers
Identification: Assuming the analyst did not provide plots of the dependent variable versus the
independent variables, look for R-squared values that are much higher or lower than expected.

Another Analyst’s Model: Non-linear relationships
Identification: Assuming the analyst did not provide plots of the dependent variable versus the
independent variables, look for a lower-than-expected R-squared value from a linear model. If
there are non-linear terms in the model, this is probably not an issue.

Another Analyst’s Model: Overfitting
Identification: Look for a large number of independent variables in the model, especially if they
use different types of transformation.

Another Analyst’s Model: Misspecification
Identification: Look for any variants of the dependent variable in the independent variables.
Assess whether the model meets the objectives of the effort.

Another Analyst’s Model: Multicollinearity
Identification: Assuming relevant plots and diagnostic statistics are not available, there may not
be any way to identify multicollinearity.

Another Analyst’s Model: Heteroscedasticity
Identification: Assuming relevant plots and diagnostic statistics are not available, there may not
be any way to identify heteroscedasticity.

Another Analyst’s Model: Autocorrelation
Identification: Assuming relevant plots and diagnostic statistics are not available, there may not
be any way to identify serial correlation.

Another Analyst’s Model: Weighting
Identification: Compare the reported number of samples to the degrees of freedom. Any
differences may be attributable to weighting.

Have No Doubts
So there are eleven reasons for doubting a regression model. Some are easy to identify, others are
more subtle. But if you’re grasping for flaws in a regression model, these are good places to start
looking. Just remember when evaluating other analysts’ models that not everyone is an expert
and that even experts make mistakes. Try to be helpful in your critiques, but at a minimum, be
professional.

Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other
Breeds of Data Analysis at Wheatmark. Stats with Cats is also available at amazon.com,
barnesandnoble.com, and other online booksellers. Read the blogs at Stats with Cats blog.