
ASSUMING THE WORST

If you’re going to be poking around data looking for patterns and anomalies, you should be aware
of the fundamental requirements you need to fulfill, or at least assume you fulfill. Consider this.
All models make assumptions, an evil necessity for simplifying complex analyses. If your model
deals in probabilities, like statistical models do, you’ll be making at least five assumptions:
Representativeness – The samples or measurements used to develop the model are
representative of the population of possible samples or measurements.
Linearity – The model can be expressed in an intrinsically linear, additive form.
Independence – Errors in the model are not correlated.
Normality – Errors in the model are normally distributed.
Homogeneity of Variances – Errors in the model have equal variances for all values of
the dependent variable.

Representativeness
The most important assumption common to all statistical models
is that the samples used to develop the model are representative of
a population of possible samples that are being investigated. Some
statistics books don’t discuss this as a basic assumption because it
is viewed as more of a requirement than an assumption. But,
obtaining representative samples of populations can be a
challenge. Unlike the other assumptions, failure to obtain a
representative sample from a population under study would
necessarily be a fatal flaw for any statistical analysis. You might not know it, though, because
there’s no good way to determine if a sample is representative of the underlying population. To do
that, you would need to know the characteristics of the population.
[Photo caption: Hmmm. Doesn’t taste like sushi. Must not be a representative sample.]
But if you knew about the population, you wouldn’t need to
bother with a sample. So, representativeness has to be addressed indirectly by building
randomization and variance control into the sampling program before it is undertaken. If
randomization cannot be incorporated into the sampling procedure in some way, the only
alternative is to try to evaluate how the sample might not be representative. This is seldom a
satisfying exercise. Statements like “the results are conservative because only the worst
cases were sampled” are usually conjectural, qualitative, and unconvincing to anyone who
understands statistics.
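If you can list the possible sampling units, randomization can be as simple as drawing them with a
random number generator. Here’s a minimal sketch in Python; the sampling frame, sample size, and
seed are made up for illustration.

    # A minimal sketch of building randomization into a sampling plan.
    # The sampling frame and sample size are hypothetical.
    import numpy as np

    rng = np.random.default_rng(seed=42)                      # seeded so the draw is reproducible
    sampling_frame = [f"grid_cell_{i}" for i in range(200)]   # every location that could be sampled

    # Simple random sample: each location has the same chance of being selected.
    selected = rng.choice(sampling_frame, size=20, replace=False)
    print(sorted(selected))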

Linearity
The linearity assumption requires that the statistical model
of the dependent variable being analyzed can be expressed
by a linear mathematical equation consisting of sums of
arithmetic coefficients times the independent variables.
The effects of nonlinear relationships are usually
substantial. Applying a linear model to a nonlinear pattern
of data will result in misleading statistics and a poor fit of
the model to the data. Evaluating the linearity assumption is usually straightforward. Start by
plotting the dependent variable versus the independent variables, calculate correlations, and go
from there.
[Photo caption: Linearity. It’s not so tough.]
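A quick first pass might look like the following sketch in Python; the data and variable names are
simulated for illustration.

    # A minimal sketch of a first-pass linearity check: plot the dependent
    # variable against an independent variable and compute the correlation.
    # The data here are simulated for illustration.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 50)
    y = 2.0 + 3.0 * x + rng.normal(0, 2, 50)   # a roughly linear relationship

    r = np.corrcoef(x, y)[0, 1]                # Pearson correlation
    plt.scatter(x, y)
    plt.xlabel("independent variable")
    plt.ylabel("dependent variable")
    plt.title(f"r = {r:.2f}")
    plt.show()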
This assumption is seldom a problem for three reasons. First, in practice, most models of
dependent variables can be expressed as linear mathematical equations consisting of arithmetic
sums of coefficients times the independent variables. Second, the assumption will still be met
when one or more of the independent variables have a nonlinear relationship with the dependent
variable if a mathematical transformation can be found to make the relationship linear. The only
catch is that the coefficients (termed the parameters of the model) must still be linear. These
models are termed intrinsically linear. In contrast, intrinsically nonlinear models have
coefficients that are nonlinear. Third, if a transformation cannot be found to correct a nonlinear
relationship, you can still resort to using statistical methods for intrinsically nonlinear models.
Nonlinear modeling uses different terminology and optimization processes than linear regression
and usually requires specialized software.
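As a hedged example, an exponential relationship is intrinsically linear because taking logarithms
makes it linear in the parameters. The numbers in this sketch are invented.

    # A minimal sketch of an intrinsically linear model: y = a * exp(b * x) is
    # nonlinear in x, but ln(y) = ln(a) + b * x is linear in the parameters,
    # so ordinary linear regression still applies. Values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(1, 10, 40)
    y = 1.5 * np.exp(0.4 * x) * rng.lognormal(0, 0.05, x.size)

    b, ln_a = np.polyfit(x, np.log(y), deg=1)  # fit the transformed, linear form
    print(f"estimated a = {np.exp(ln_a):.2f}, b = {b:.2f}")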

Independence
The third assumption common to statistical models is that
the errors in the model are independent of each other.
Some introductory statistics textbooks describe this
assumption in terms of the measurements on the
dependent variable. There are two reasons for this. First,
it’s a lot easier for beginning students to understand, especially if they aren’t familiar with the
mathematical form of statistical models and the concept of model errors.
[Photo caption: Am I independent? I’m a CAT!]
Second, and more importantly, the two approaches to describing the independence
assumption are equivalent. This is because a data value can be expressed as the sum of an
inherent “true” value and some random error. If you have controlled the sources of
extraneous variation, the data and the model errors should be identically distributed.
Say you were conducting a study that involved measuring the temperature of human subjects.
Without your knowledge, a well-meaning assistant provides beverages in the waiting room –
piping hot coffee and iced tea. When you plot a histogram of the temperature data, you might see
three peaks (called modes): one centered at 98.6°F, another a degree or so higher, and a third a
degree or so lower. Your data has violated the independence assumption. The subjects who drank
the coffee all had their temperatures linked to the higher temperature of the coffee. The subjects
who drank the iced tea all had their temperatures linked to the lower temperature of the iced tea.
What are the chances you might notice this dependency? If you had a dozen or so subjects, the
chances wouldn’t be good. With 100 subjects, you might notice something. With 1,000 subjects,
you would almost certainly notice the effect, though if you’re providing beverages to 1,000
subjects, you might consider getting out of research and opening a coffee shop.
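Here’s a simulated version of the beverage effect so you can see what those three modes might look
like; the group sizes and temperature shifts are made up.

    # A minimal sketch of how the beverage effect might show up in a histogram.
    # The group sizes and temperature shifts are hypothetical.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    no_drink = rng.normal(98.6, 0.3, 400)      # subjects with no beverage
    coffee   = rng.normal(99.6, 0.3, 300)      # shifted up by the hot coffee
    iced_tea = rng.normal(97.6, 0.3, 300)      # shifted down by the iced tea

    temps = np.concatenate([no_drink, coffee, iced_tea])
    plt.hist(temps, bins=40)
    plt.xlabel("body temperature (°F)")
    plt.show()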
Assessing independence involves looking for serial correlations, autocorrelations, and spatial
correlations. A serial correlation is the correlation of data points with the data points that precede
them in the sequence. For example, making measurements with an instrument that is drifting out of
calibration will introduce a serial correlation. Spatial or temporal dependence is often present in
environmental data. For example, two soil samples located very close together are more likely to
have similar attributes than two samples located very far apart. Likewise, two well water samples
collected a day apart are more likely to have similar attributes than two samples collected two
years apart.
Most statistical software will allow you to conduct the Durbin-Watson test for serial correlation
as part of a regression analysis. For temporally related data, correlograms are used to assess
autocorrelations and partial autocorrelations. Spatial independence can be evaluated using
variograms, plots of the spatial variance versus the distances between samples. Correlograms and
variograms require specialized software to produce and some experience to interpret.
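For instance, a residual check with the Durbin-Watson statistic and a correlogram might look like
the following sketch, using simulated data with autocorrelated errors standing in for a drifting
instrument.

    # A minimal sketch of checking residuals for serial correlation with the
    # Durbin-Watson statistic and a correlogram. The data are simulated.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.graphics.tsaplots import plot_acf
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    n = 100
    x = np.arange(n, dtype=float)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = 0.8 * e[t - 1] + rng.normal(0, 1)   # each error depends on the previous one
    y = 2.0 + 0.5 * x + e

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print("Durbin-Watson:", durbin_watson(model.resid))  # near 2 means little serial correlation

    plot_acf(model.resid, lags=20)                       # correlogram of the residuals
    plt.show()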
When the independence assumption is violated, the calculated probability that a population and a
fixed value (or two populations) are different will be underestimated if the correlation is
negative, or overestimated if the correlation is positive. The magnitude of the effect is related to
the degree of the correlation.
Some people confuse the independence assumption, which refers to model errors or
measurements of the dependent variable, with the assumption that the independent variables
(AKA, predictor variables) are not correlated. Correlations between predictor variables, termed
multicollinearity, are also problematical for many types of statistical models because statistics
associated with such models can be misleading.
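A quick way to screen for multicollinearity is a correlation matrix of the predictors and, if you
have statsmodels handy, variance inflation factors (VIFs). The predictors in this sketch are
simulated.

    # A minimal sketch of screening predictors for multicollinearity with a
    # correlation matrix and variance inflation factors (VIFs). Simulated data.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(5)
    x1 = rng.normal(size=100)
    x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)      # strongly correlated with x1
    x3 = rng.normal(size=100)                            # unrelated predictor

    print(np.corrcoef([x1, x2, x3]))                     # pairwise correlations of the predictors

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    for i, name in enumerate(["const", "x1", "x2", "x3"]):
        print(name, variance_inflation_factor(X, i))     # large VIFs (say, above 5-10) flag trouble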

Normality
The Normality assumption requires that model errors (or
the dependent variable) mimic the form of a Normal
distribution. This assumption is important because the
Normal model is used as the basis for calculating
probabilities related to the statistical model. If the model
errors don’t at least approximate a Normal distribution, the calculated probabilities will be
misleading. It would be like trying to put a square peg into a round hole.
[Photo caption: Normality is such a matter of opinion.]
There are many methods for evaluating the Normality of a distribution, which fall into one of three
categories (all three are illustrated in the sketch after this list):
•  Descriptive Statistics – Including the coefficient of variation (the standard deviation
divided by the mean), the skewness (a measure of distribution symmetry), and the kurtosis (a
measure of relative frequencies in the center versus the tails of the distribution). If the
coefficient of variation is less than about one, and the skewness and the kurtosis are close to
zero, it’s reasonable to assume the errors approximate a Normal distribution.
•  Statistical Graphics – Statistical graphics are more revealing than descriptive statistics
because they indicate visually what data deviate from the Normal model. Interpreting these
graphics can be somewhat subjective, however. The most commonly used statistical
graphics are histograms, box plots, and probability plots. Other statistical graphics
sometimes used to evaluate Normality include stem-and-leaf diagrams, dot plots, and Q-Q
plots.
•  Statistical Tests – Statistical tests are more rigorous than either descriptive statistics or
statistical graphics. Commonly used tests of normality include the Shapiro-Wilk test, the
Chi-squared test, and the Kolmogorov-Smirnov test. One of the problems with statistical
tests of Normality is that they become more and more sensitive as the sample size gets large.
So, a statistical test may indicate a significant departure from Normality that is so minor it is
unimportant. Thus, tests of normality may be definitive but irrelevant.
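Here’s a minimal sketch touching all three categories; the “errors” are simulated from a Normal
distribution, so every check should come back reassuring.

    # A minimal sketch of all three approaches to evaluating Normality:
    # descriptive statistics, a statistical graphic, and a statistical test.
    # The "errors" here are simulated.
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    errors = rng.normal(0, 1, 200)

    # Descriptive statistics: skewness and kurtosis near zero suggest Normality.
    print("skewness:", stats.skew(errors))
    print("kurtosis:", stats.kurtosis(errors))           # Fisher definition, 0 for a Normal

    # Statistical graphic: a probability plot against the Normal model.
    stats.probplot(errors, dist="norm", plot=plt)
    plt.show()

    # Statistical test: Shapiro-Wilk; a small p-value signals a departure from Normality.
    w, p = stats.shapiro(errors)
    print("Shapiro-Wilk p-value:", p)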
So how should you evaluate Normality? Focus on one method or decide on the basis of a
preponderance of the evidence? First, you have to understand that statistical tests, statistical
graphics, and descriptive statistics are like advisors. They all have an opinion, none is always
correct, and they sometimes provide conflicting advice.
One approach to evaluating Normality is to first look at a histogram to get a general impression of
whether the data distribution is even close to a Normal distribution. If it is, look at a test of
Normality, preferably a Shapiro-Wilk test. The test’s null hypothesis is Normality, so if there’s no
significant difference, it’s reasonable to proceed as if the data came from a Normally distributed
population. If there is a
significant difference, then your decision becomes problematical. Look at a probability plot to
determine where the departures from Normality are. You might have a problem if the deviations are
in the tails because that’s where the test probabilities are calculated. If there is an appreciable
deviation from Normality in the tails of the distribution of errors, consider transforming the
dependent variable or using a nonparametric procedure.
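This sketch shows both fallbacks on simulated right-skewed data: re-test after a log transform, or
drop down to a nonparametric comparison (a Mann-Whitney U test stands in for a two-sample t-test
here).

    # A minimal sketch of the two fallbacks: transform the dependent variable
    # and re-test, or switch to a nonparametric procedure. Simulated data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    group_a = rng.lognormal(mean=1.0, sigma=0.8, size=40)   # right-skewed data
    group_b = rng.lognormal(mean=1.3, sigma=0.8, size=40)

    w_raw, p_raw = stats.shapiro(group_a)
    w_log, p_log = stats.shapiro(np.log(group_a))
    print("Shapiro-Wilk p, raw data:       ", p_raw)        # likely significant (non-Normal)
    print("Shapiro-Wilk p, log-transformed:", p_log)        # likely not significant

    # Nonparametric alternative to a two-sample t-test if no transform works:
    u, p = stats.mannwhitneyu(group_a, group_b)
    print("Mann-Whitney U p-value:", p)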

Equal Variances
The last assumption, termed homoscedasticity,
means that the errors in a statistical model have
the same variance for all values of the
dependent variable. For models involving grouping variables, the assumption means that all groups
have about the same variance.
[Photo caption: We have equal variances.]
For models involving continuous-scale variables,
homoscedasticity means that the variances of the errors don’t change across the entire scale of
measurement.
For example, in the case of a measurement instrument, homoscedasticity requires that the error
variance be about the same for measurements at the low, middle, and high portions of the
instrument’s range, which can be a difficult requirement to meet. Another example would be
measurements made over many years. Improvements in measurement technologies could cause
more recent measurements to be less variable (i.e., more precise) than historical measurements.
Assessing homoscedasticity is more straightforward for discrete-scale variables than for
continuous-scale variables because there are usually more than a few data points at each scale
level. A simple qualitative approach is to calculate the variances for each group and look at the
ratios of the sample sizes and the variances. There are also more sophisticated ways to evaluate
homoscedasticity, such as Levene’s test.
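Here’s what that might look like for three hypothetical groups, using Levene’s test from scipy.

    # A minimal sketch of a homoscedasticity check for grouped data: compare
    # the group variances directly, then apply Levene's test. Simulated data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    group_1 = rng.normal(10, 1.0, 30)
    group_2 = rng.normal(10, 1.2, 30)
    group_3 = rng.normal(10, 3.0, 30)                       # noticeably more variable

    print("variances:", [round(np.var(g, ddof=1), 2) for g in (group_1, group_2, group_3)])
    stat, p = stats.levene(group_1, group_2, group_3)
    print("Levene's test p-value:", p)                      # small p suggests unequal variances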
Violations of the homoscedasticity assumption tend to affect statistical models more than do
violations of the Normality assumption. Generally, the effects of violating the homogeneity-of-
variances assumption will be small if the largest ratio of variances is near one and the sample
sizes are about the same for all values of the independent variables. As differences in both the
variances and the numbers of samples become large, the effects can be great. Violations of
homoscedasticity can often be corrected using transformations. In fact, transformations that
correct deviations from Normality will often also correct heteroscedasticity. Non-parametric
statistics also have been used to address violations of this assumption.
So, you don’t have to assume the worst will happen when you violate a statistical assumption. The
effects may be minor, or there may be an alternative approach you can use. You just have to
know what to look for.

Join the Stats with Cats group on Facebook.

http://statswithcats.wordpress.com/2010/10/03/assuming-the-worst/
