
Choosing the Right Statistical Test

Ray Woodcock
March 11, 2013


I commenced the following manuscript in summer 2012, in an attempt to work through and
articulate basic matters in statistics. I hoped that this would clarify matters for me and for others
who struggle with introductory statistics. I did not know whether I would have time to complete
the manuscript, and in fact I have not had time for that, and am not certain when I will. I am
therefore providing it in this imperfect and incomplete form. The manuscript begins here.

Preliminary Clarification

It seemed that some preliminary remarks might be advisable. First, regarding the entire
enterprise of statistics, I had repeatedly encountered warnings that statistics is as much an art as a
science, and that even experienced statisticians could make decisions that would prove
unwarranted. Among beginners, of course, mistakes would be relatively obvious; there were
many ways to do things visibly wrong. Statistics could be a fairly black-and-white affair at the
basic level. But in some situations, it seemed, one might not need to go far off the beaten track
to encounter disagreements and misunderstandings even among relatively well-trained users of
statistics.

As a second preliminary observation, it seemed appropriate to observe that what was called
statistics in introductory and intermediate courses in statistics tended to represent only a subset
of the larger universe of statistical activity. Statistics books commonly referred to a distinction
between theoretical and applied statistics, though this distinction did not appear to be rigidly and
universally adopted. For example, the University of California Davis offered divergent
undergraduate foci in general statistics, applied statistics, and computational statistics, while
Bernstein and Bernstein (1999, p. iii) characterized general statistics (with no mention of applied
statistics) as

an interpretative discipline . . . in which the presentation is greatly simplified and often
nonmathematical. . . . [from which] each specialized field (e.g., agriculture, anthropology,
biology, economics, engineering, psychology, sociology) takes material that is
appropriate for its own numerical data.

It appeared reasonable, for present purposes, to treat theoretical statistics as a predominantly
philosophical and mathematical enterprise; to treat applied statistics as the adaptation of
theoretical statistics to various research situations, covering the real-world half (or so) of
statistical research (as in e.g., the Annals of Applied Statistics); and to treat general statistics as a
catch-all reference to the broad variety of statistical work, typically but not necessarily having an
applied orientation (as at e.g., the Washington University Center for Applied Statistics). My
experience with statistics was largely in the applied rather than theoretical area, and this was the
zone in which I had encountered the variety of statistical tools mentioned at the outset.

Third, within applied statistics courses and texts, there seemed to be a widely accepted
distinction between descriptive and inferential statistics. It appeared, however, that the actual
difference might vary somewhat from its depiction. Descriptive statistics seemed to be construed
as a catch-all category of noninferential tools, including both the numerical (e.g., standard
deviations) and the non-numerical (e.g., graphs) (e.g., Wikipedia; Gravetter and Wallnau, 2000,
p. 1). This approach seemed to raise several problems:

In ordinary usage, this approach to descriptive statistics had the potential to privilege
technical devices (e.g., numbers, graphs) beyond their actual significance. A table, the
basis for a potential graph, was generally just a convenient way of stating numbers that
could instead have been recited in a paragraph. In other words, if the concept of
descriptive statistics included graphs, it would also have to include text.
Descriptive statistics, construed as just the textual or other description of statistical data,
could not really be segregated into a few chapters at the start of a textbook. It might
make sense to have one or more chapters on standard deviations, graphs, and so forth.
But whereas a chapter on ANOVA might say all that an author would intend to say on
that subject, a chapter on graphs or standard deviations would be unlikely to capture the
role played by graphs and standard deviations. To the contrary, graphs and standard
deviations and descriptive text invariably pervaded statistical articles and books. One
could have a statistics work without any inferential statistics; but one could not have a
statistics work without descriptive content. Indeed, one could not even have inferential
statistics without descriptive content: the understanding and use of inferential statistics
were heavily dependent upon standard deviations, graphs, and descriptive text, especially
but not only in applied contexts.
Inferential statistics could be mischaracterized as something more than a description of
data, when in fact it was something less than a description of data. Description was
necessarily based upon observation, whereas inference tended to be a derivative matter of
speculation based upon description. Such speculation could guide the interpretation of
observation, suggesting, for example, observations that might deserve more or less
attention, but speculation could not replace or do without the underlying descriptive
substance.

In short, it seemed that the applied statistician working with inferential material would be
especially concerned with identifying frontiers beyond which inference, ultimately expressed and
interpreted in largely textual terms, would be relatively unsupported mathematically.
Metaphorically, my interest in statistical tools was particularly oriented toward the beach of
application, on an island of inferential statistics, within a descriptive, noninferential ocean that
would sustain and also limit inference.

A fourth preliminary remark had to do with the relationship between inference and probability.
The foregoing discussion seemed to suggest that Wikibooks (2011, p. 3) was overly simplistic in
its assertion that "Statistics, in short, is the study of data." Even cockroaches study data.
Statistics could not be purely mathematical, but it did have undeniable roots in mathematics.
Both inference and probability were common topics in statistics; both were mathematical; the
question at hand was, how were these mathematical topics related? Wolfram MathWorld
seemed mistaken in suggesting that "The analysis of events governed by probability is called
statistics" because, again, as hinted above, statistics included a great deal of nonprobabilistic
content. A better version, built from the preceding paragraph, was that applied inferential
statistics was primarily concerned with determining the degree of probabilistic support for
inferences from available information. This seemed akin to the suggestions, by Kuzma and
Bohnenblust (2004, pp. 3, 71), that "Inferential statistics is concerned with reaching conclusions
from incomplete information, that is, generalizing from the specific," but that "Probability is the
ratio of the number of ways the specified event can occur to the total number of equally likely
events that can occur." In other words, it seemed that probability could be treated as the highly
abstract, mathematical engine used in a larger and potentially messier inferential statistics
project.

Given these observations and conclusions, it appeared that this post's goal of describing the
selection of a particular statistical tool was, more precisely, a goal of identifying the optimal
configuration of the probability engine for purposes of a particular task. Within the grand
ambiance of numbers, language, and graphics, there was a narrow question of whether
probability would support a certain inference from specified data. The study of probability had
developed better and worse ways to answer that question. At this relatively basic level, there
was apparently some consensus among statisticians as to which ways were better. The task of
this post, then, was to describe the indicators that would guide a statistician's choice of
probabilistic tools for purposes of statistical inference.

The Importance of Data Type

Numerous sources seemed to agree that the type of data would significantly influence the choice
of statistical test at an early stage in the decision process. Stevens (1946, p. 678) posited four
levels of data measurement. Those four, presented in the reverse of their usual order, were
commonly understood as follows:

Ratio. Data measured on a ratio scale were what one might consider ordinary numbers:
they could be added, subtracted, multiplied, and divided meaningfully. The defining
characteristic of ratio data was the existence of a non-arbitrary zero value. For instance,
one could count down, to zero, the number of seconds until an event. The existence of a
real zero also meant that values could be reported as fractions or multiples of one another
(i.e., in terms of ratios). For example, three pounds of a substance was half as much as
six pounds; two inches on a ruler was half as much as four inches. All statistical
measures could be used with ratio data. Thus, researchers generally considered ratio data
ideal.
Interval. Data measured on an interval scale lacked a true zero; zero was just one more
interval. For example, the Fahrenheit and Celsius (as distinct from Kelvin) temperature
scales used arbitrary zero values that did not indicate an actual absence of the thing being
measured. Therefore, multiplication and division were not directly meaningful; 33
degrees (F or C) was not half as warm as 66 degrees. Without division, some statistical
tools would be unavailable for interval data. As with ratio-level data, the existence of
uniform increments did provide consistency (for example, April 2 and 5 were as far apart
as November 8 and 11) and the possibility of subdivision into a continuous series of ever-
smaller subdivisions (e.g., 1.113 degrees would be more than 1.112 degrees). The barrier
to meaningful division of interval values would not prevent the calculation of ratios of
differences between values.
Ordinal. Data measured on an ordinal scale were not really mathematical. Numbers
were used just to show the rank order of entries, not the number of increments or the
difference between some value and true zero. For instance, a racer who came in fourth
was not necessarily half as fast as someone who came in second. A letter grading system
(A, B, C, D, F) was ordinal; so were IQ scores and Likert scale data (e.g., 1 = strongly
disagree, 5 = strongly agree). In such examples, the differences between the first and
second items on the scale were not necessarily equal to the differences between, say, the
fourth and fifth items. Values were discrete rather than continuous; for instance, nobody
in a race would finish in position 2.7. These limits would further restrict the kinds of
statistical tools that could be used on ordinal data.
Nominal. Data on a nominal scale could not be measured, in terms of the quality being
studied. They were simply sorted into categories (and could thus be counted). The
categories had no intrinsic arrangement. For example, unlike coming in first rather than
second in a race, being an apple was not, in itself, better than being an orange (though of
course there might be measurable differences in the chemical properties of pieces of fruit
extracted from apples and oranges). Unlike IQ scores or letter grades, the concept of
"house" did not automatically establish a rank order among "ranch," "Cape Cod," and
"colonial" types of houses; sorting would have to be arbitrary (by e.g., age or alphabetical
order). Few statistical tools were available for nominal data. (A brief sketch of how these
four levels of measurement might be represented in software follows this list.)

While the distinctions among these four data types would be important eventually, some decision
trees made a first distinction between the qualitative (also called categorical) data found in
ordinal and nominal scales and the quantitative (also called numerical) data found in ratio and
interval scales. This post thus proceeds with an introductory look at qualitative data analysis.

Nonparametric Analysis: Statistical Tools for Qualitative Data

The study of statistical methods seemed to make a fundamental distinction between parametric
and nonparametric analysis of data. Parameters were mathematical characteristics of a
population, that is, of a potentially large class of entities (e.g., cigarette smokers, beans in a
bag, cars in New York City). Means and standard deviations seemed to be the two parameters
most commonly mentioned in inferential statistics. The mean was the preferred measure of
central tendency, that is, of where the bulk of the data tended to be located. The standard deviation
was the preferred measure of dispersion, that is, of how much the individual values varied from the
mean. Together, the mean and standard deviation would give an approximate, concise sense of
how individual values tended to be distributed. Since means and standard deviations could not
be calculated for ordinal or nominal data, parametric methods were limited to ratio and interval
data.

Parametric methods would require certain assumptions about the population being studied. Such
assumptions could yield powerful and accurate estimates of population parameters, if justified,
but considerable error if unjustified. By contrast, nonparametric methods would depend upon
fewer assumptions, but would not support comparably powerful inferences, and therefore tended
to be treated as less desirable alternatives to parametric methods. But when testing samples
drawn from a population, all parametric and nonparametric tests depended on the crucial
assumption that the data contained in those samples were obtained through random selection
from the relevant population (Pett, 1997, p. 9).

It could seem practical for a statistics textbook to begin with a discussion of nonparametric
methods. Such methods could be more forgiving and more generally applicable across a variety
of quantitative as well as qualitative contexts. In addition, given the relatively slight attention
customarily provided to nonparametric methods, it could be appropriate to get this topic out of
the way, so as to focus on parametric methods.

On the other hand, there were also good reasons to begin with parametric methods. One such
reason was that there were numerous nonparametric methods. For example, in a book that did
not purport to be comprehensive, Pett (1997) discussed approximately 20 such methods. It
appeared that parametric methods were to be considered the standard, whereas one would go
searching among the nonparametric options primarily to meet an atypical need. An attempt to
begin by mastering the nonparametric methods could thus distract or confuse someone who was
only starting to learn about statistical methods.

In short, it appeared that the distinction of parametric versus nonparametric statistics at this point
served primarily to draw attention to the existence of such a distinction, and to encourage efforts
to obtain data amenable to parametric analysis when possible. This discussion therefore turns to
parametric analysis, leaving nonparametric methods as the best approach when dealing with
ordinal or nominal data, or when parametric assumptions were violated and could not be
counteracted.

Introduction to Parametric Analysis

This post begins with the observation that statisticians used many tools or methods to analyze
data. This observation arose from exposure to introductory statistics textbooks. Such textbooks
commonly offered chapters on descriptive statistics and other matters that might aid in
orientation to, or application of, statistical reasoning. But the central emphasis of such textbooks
typically seemed to be upon statistical inference, that is, upon the process of making inferences
about populations (especially about their means) based on information derived from samples
drawn from those populations. The analysis of samples was a matter of convenience and
feasibility: studying a sample could be much easier and less expensive than obtaining complete
raw data from an entire population; and if it was a good sample, it could be expected to yield
results representative of the population as a whole.

Customarily, the process of inference would be focused on the testing of a research hypothesis.
The researcher would hypothesize that one sample differed from another. This hypothesis would
be an alternative to the null hypothesis, which would state that there was, in fact, no
statistically significant difference between the two. The inferential tool selected for the task
would be the one that could best inform the question of whether there was sufficient evidence to
reject the null hypothesis. In such comparisons, the independent variable would be the thing
being put into comparison groups or otherwise manipulated (e.g., smoker or nonsmoker) to
produce effects upon the dependent variable (e.g., cancer rates).

The question of statistically significant difference was not necessarily the same as a
scientifically, practically, or clinically significant difference. The means of two samples could
be different enough from one another that one could infer, with a fair degree of confidence, that
they came from statistically divergent populations: that, for example, people who received a
certain treatment tended to display a relevant difference from people who did not receive that
treatment. Yet this mathematical difference might not get past the concern that the difference
was so minor (given e.g., the cost or difficulty of the treatment) as to be insignificant in practical
terms. One could perhaps devise separate statistical studies to examine such practical issues, but
practicality was not itself a part of statistical significance.

Key Parametric Assumptions

The process of using a parametric tool could begin with preliminary tests of assumptions. If a
dataset was not compatible with a given tool's assumptions, that tool might not provide accurate
analysis of the data. Not every tool required satisfaction of every assumption; that is, parametric
tools varied in robustness (i.e., in their ability to tolerate variation from the ideal).

A source at Northwestern University and others suggested that, in addition to the assumptions of
interval or ratio scale data and random sampling (above), the list of parametric assumptions
included independence, an absence of outliers, normality, and equal population or sample
variances (also known as homoscedasticity). These assumptions could be summarized briefly:

Independence meant that the data points should be independent of other data points, both
within a data set and between data sets. There were exceptions, such as where a tool was
designed specifically for use with datasets that were paired (as in e.g., a comparison of
pre- and post-intervention measurements) or otherwise matched to or dependent on one
another.

Outliers were anomalous data points, significant departures from the bulk of the data.
Renze (n.d.) and others suggested a rule of thumb by which a data point was deemed an
outlier if it fell more than 1.5 times the interquartile range above or below that range.
Seaman and Allen (2010) warned, however, that one must distinguish outliers from long
tails. Unlike a single outlier (or a small number of odd variations among a much larger
set of relatively consistent data points), a series of extreme values was likely to be a
legitimate part of the data and, as such, should not be removed. Outliers resulting from
errors in data collection seemed to be the best candidates for removal via e.g., a trimmed
mean. For example, a 10% trimmed mean would be calculated on the basis of all data
points except those in the top 5% and the bottom 5% of the data points. Winsorizing
could be used to replace those upper and lower values with repetitions of the next-lowest
and next-highest values remaining after such trimming. (See Erceg-Hurn & Morsevich,
2008, pp. 595-596.)

A normal (also known as Gaussian) distribution was symmetrical, with data evenly
distributed above and below the mean, and with most data points being found near the
mean and progressively fewer as one moved further away. Wikipedia and others
suggested that one could test normality, first, with graphical methods, notably a
histogram, possibly a stem-and-leaf plot, and/or a Q-Q Plot. The first two would be used
to see whether data appeared to be arranged along a normal curve. There were two
options with the Q-Q Plot: one could use it to compare a sample against a truly normal
distribution or (as noted by a source at the National Institute of Standards and
Technology) against another sample. Either way, significant departures from a straight
line would tend to indicate that the two did not come from the same population. Another
test for normality used the 68-95-99.7 Rule, that is, the rule that, in a normal
distribution, 99.7% of data points (i.e., 299 out of every 300) fell within three standard
deviations (s) of the mean. So if a value in a small sample fell more than 3s from the
mean, it probably indicated a nonnormal distribution. It was also possible to use the
SKEW and KURT functions in Microsoft Excel to get a sense of normality, for up to 30
values. (Excel's NORMDIST function would return values of a normal distribution.) Negative
values from those functions indicated left skew and flat (platykurtic) distribution; positive
values indicated right skew and peaked (leptokurtic) distribution; zero indicated
normality. Finally, there were more advanced tests for normality. These tests were built
into statistical software; some could also be added to Excel. Among these tests, Park
(2008, p. 8) and others indicated that the Shapiro-Wilk and Kolmogorov-Smirnov tests
were especially often used, with the former usually being preferred in samples of less
than 1,000 to 2,000. In addition to professional (sometimes expensive) Excel add-ons,
Mohammed Ovais offered a free Excel spreadsheet to calculate Shapiro-Wilk, and Scott
Guth offered one that reportedly used Kolmogorov-Smirnov. On the other hand, Erceg-
Hurn and Morsevich (2008, p. 594) cited prominent statisticians for the view that such
tests (naming Kolmogorov-Smirnov as well as Levene's (below)) are fatally flawed and
should never be used.

Homoscedasticity (Greek for "having the same scatter"), also known as homogeneity of
variance, meant that the populations being compared had similar standard deviations.
McDonald (2009) appeared to indicate that a lack of such homogeneity (i.e.,
heteroscedasticity) would tend to make the two populations appear different when, in
fact, they were not. Beyond the intuitive eyeballing of standard deviations and dispersion
of values, graphs (e.g., boxplots) could aid in detection of heteroscedasticity. De Muth
(2006, p. 174) alleged a rule of thumb by which one could assume homogeneity if the
largest variance, among samples being compared, was less than twice the size of the
smallest. (In an alternate version of that rule, Howell (1997, p. 321) proposed 4x rather
than 2x.) Formal tests of homoscedasticity included the F test, Levene's test, and
Bartlett's test. Subject to the warning of Erceg-Hurn and Morsevich (2008) (above), the
F test appeared especially common and was available in Excel's FTEST function.
According to that Northwestern site and others, however, the F test required a high
degree of normality. McDonald indicated that Bartlett's was more powerful than
Levene's if the data were approximately normal, and that Levene's was less sensitive to
nonnormality. McDonald offered a spreadsheet that performed Bartlett's test; Winner
offered one for Levene's. (A rough computational sketch of several of these assumption
checks follows this list.)
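
As a rough illustration of how several of these checks might be carried out with NumPy and SciPy (a sketch under the assumptions described above, with invented samples a and b; it is my own choice of tooling, not a substitute for the spreadsheets and packages the sources mention):

import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=5, size=40)         # hypothetical sample A
b = rng.normal(loc=52, scale=5, size=40)         # hypothetical sample B

# Outliers: flag points beyond 1.5 x IQR from the quartiles (the rule of thumb above).
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
outliers = a[(a < q1 - 1.5 * iqr) | (a > q3 + 1.5 * iqr)]
print(outliers)

# A "10% trimmed mean" in the sense used above drops the top 5% and bottom 5%.
trimmed = stats.trim_mean(a, proportiontocut=0.05)
winsorized = winsorize(a, limits=[0.05, 0.05])   # replace extremes rather than dropping them
print(trimmed, winsorized.mean())

# Normality: skewness, excess kurtosis, and the Shapiro-Wilk test.
print(stats.skew(a), stats.kurtosis(a))          # both near 0 for a roughly normal sample
print(stats.shapiro(a))                          # a small p-value suggests nonnormality

# Homogeneity of variance: Levene's and Bartlett's tests.
print(stats.levene(a, b))
print(stats.bartlett(a, b))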

Violation of these assumptions would apparently leave several options: the researcher could
ignore the violation and proceed with a parametric test, if the test was robust with respect to the
violation in question; the researcher could choose a nonparametric test that did not make the
relevant parametric assumption(s); or the researcher could transform the data to render it
amenable to parametric analysis. Here, again, Erceg-Hurn and Morsevich (2008, pp. 594-595)
criticized transformations on numerous grounds (e.g., the transformation may not solve the
problem; data interpretation is difficult because subsequent calculations are based on
transformed rather than original data, a point echoed by the Northwestern source) and also suggested that classic
nonparametric methods have been superseded. The following discussion begins with the first of
those options: a closer inspection of parametric tests.

Parametric Analysis Decision Trees

As noted above, the type of data being collected would significantly affect the choice of
statistical test. Wadsworth (2005) provided decision trees for nominal, ordinal, and scale data,
defining the last as including dependent variables measured on "approximately interval, interval,
or ratio scales." A source in the psychology department at Emory University defined
"approximately" (also sometimes called "quasi-") interval scales as

scales that are created from a series of likert rating [sic] (ordinal scale of measurement)
by either adding up the individual responses or calculating an average rating across items.
Technically, these scales are measured at the ordinal scale of measurement but many
researchers treat them as interval scales.

Wadsworth identified eight parametric options for scale data: four for cases involving only one
sample, two for cases involving two samples, and two for cases involving more than two
samples. Wadsworth also identified nine nonparametric options for nominal and ordinal data.
Other sources (e.g., Horn, n.d.; Garner, n.d.; SFSU, n.d.; Gardener, n.d.) offered alternate
decision trees or lists. Generally, it seemed that these criteria (i.e., type of data, number of
samples) would be among the first questions asked (along with whether the study was one of
correlation or difference among samples). It also appeared that various decision trees, as well
as the tables of contents of various statistics texts (e.g., Gravetter & Wallnau, 2000; Kuzma &
Bohnenblust, 2004), tended to point toward substantially the same list of primary parametric
tests. Yet this raised a different question. Ng (2008) and Erceg-Hurn and Morsevich (2008)
criticized the use of the old, classic tests appearing in such lists. In the words of Wilcox (2002),

To put it simply, all of the hypothesis testing methods taught in a typical introductory
statistics course, and routinely used by applied researchers, are obsolete; there are no
exceptions. Hundreds of journal articles and several books point this out, and no
published paper has given a counter argument as to why we should continue to be
satisfied with standard statistical techniques. These standard methods include Student's T
for means, Student's T for making inferences about Pearson's correlation, and the
ANOVA F, among others.

If that was so, there was a question as to why one would begin with a t test or with any of the
others in the classic lists, for that matter. One answer, suggested by the preceding paragraphs,
was that one might be limited by what one tended to find in classrooms, in online discussions,
and in textbooks (but see texts cited by Erceg-Hurn & Morsevich, 2008, pp. 596-597, e.g.,
Wilcox, 2003). A related answer was that the classic methods had been researched and explored
extensively, and had an advantage in that regard for conservatively minded researchers: such
methods might be obsolete in the sense of being relatively weak or limited, but were
presumably not terribly obsolete in the sense of being wrong. Another answer was that, at least
at the learning stage, it behooved one to learn what everyone else was using, so as to be
employable and basically conversant in one's field; and thereafter inertia would take over, and
people would tend to stay with that much, because, after all, most were not statisticians. That
answer suggested yet another: that the objection that people were using inept old methods was
akin to the objection that, in many instances, people were using those methods poorly.

It did seem that consumers, as distinct from producers, of research (e.g., the people taking
introductory statistics courses, as distinct from some of their instructors) would need to respect
the old methods for the foreseeable future. On this impression, it seemed appropriate, here, to
follow the common approach of starting with z and t scores.

z Scores

The z score for a datum (x) was calculated with a fairly simple formula: z = (x − μ) / σ. The idea
was to see how many times the standard deviation (σ) would go into the difference between the
data point and the mean (μ). That is, how far away from the mean was this data point, as measured
in terms of standard deviations?

The point of the z score was simply to convert raw data into standardized data. In a z
standardization, raw values were converted to z scores, where the mean was always zero and the
standard deviation was always 1.0. This had practical value. Ordinarily, for instance, it would
not be clear how one should interpret a score of 77 on an exam, or a fruit non-spoilage rate of
63%. But the information that such scores were two standard deviations below the mean would
indicate that these values were fairly unusual.
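
As a minimal sketch of that conversion (the scores here are hypothetical, and the use of NumPy is my own choice), z standardization can be done directly:

import numpy as np

scores = np.array([77.0, 82.0, 87.0, 92.0, 97.0])   # hypothetical exam scores
mu = scores.mean()
sigma = scores.std()          # population standard deviation (divides by N)

z = (scores - mu) / sigma     # each score restated in standard-deviation units
print(z)
print(z.mean(), z.std())      # standardized scores have mean 0 and standard deviation 1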

Even then, however, the situation would not be as clear as one might like. The mean score (μ)
on that exam could have been an 87, with a population standard deviation (σ) of 5, or it could
have been 79, with a standard deviation of 1. In the latter case, a student with a 77 would
probably be pretty much with the group; not so in the former. Yet even a 77 (μ = 87, σ = 5)
could have a shred of respect in a skewed distribution if, say, some scored as low as 30 or 40.

Those sorts of uncertainties would be resolved in the case of a normal (i.e., 68-95-99.7 (see
above)) distribution. Normal distributions were the place where z scores worked. If one knew
that grades on the exam were normally distributed, it would be possible to calculate the precise
percentage of grades falling more than two standard deviations below the mean (to calculate, that
is, the percentile of a 77) without access to the list of grades. A z table or calculation could be
found at the back of a statistics text and in some calculators and computer programs. As
distributions became less normal, however, z scores would provide less accurate estimates
(Ryan, 2007, p. 93).

According to the z table, in a normal distribution, a raw score precisely two standard deviations
below the mean was separated from the mean by 47.72% of all scores. That is, about 95.4% of all raw scores would
be within two standard deviations above or below the mean: about 2.3% of scores would be
more than two standard deviations above the mean, and another 2.3% would be more than two
standard deviations below the mean. So if the grades on this exam were normally distributed, a
77 would be (rounding upwards) at the 3rd percentile: 2.3% of scores would be lower than 77.

This ability to calculate percentages without knowing specific scores faded as a distribution
became less than perfectly normal. But it was still possible to calculate percentiles, at least, with
a nonnormal distribution. Doing so required ranking the scores and finding the desired point, as
in the calculation of a median (i.e., the 50th percentile). For example, the 20th percentile
occurred at the value where 20% of all scores were lower.
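
A brief sketch of that rank-based approach, using hypothetical scores (the choice of NumPy and SciPy here is mine, not something the sources prescribe):

import numpy as np
from scipy import stats

scores = np.array([40, 55, 61, 63, 70, 72, 77, 80, 85, 90])   # hypothetical, nonnormal data

# The 20th percentile is the value below which roughly 20% of the scores fall.
print(np.percentile(scores, 20))

# Conversely, find the percentile standing of one particular raw score.
print(stats.percentileofscore(scores, 63))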

The example just given involved a data point two standard deviations from the mean. What if
the score had been 78 instead of 77? This would require calculation of the z score, using the
formula noted above (i.e., z = (x − μ) / σ). In a normal distribution, a raw score of 77, two standard
deviations below the mean, would have a z score of −2.0. A raw score of 78 would be one and
four-fifths of a standard deviation below the mean; thus the calculation would give z = −1.8. In a z
table, an absolute z value of 1.8 indicated that 46.41% of all scores fell between z (inclusive) and the mean.
Of course, another 50% of the scores lay on the other side of the mean, above 87. So 96.41% of
scores would be at or above 78, leaving 3.59% below 78. So a 78 would be at about the 4th
percentile. In short, by converting raw scores into standardized scores, z scores made it possible
to compare the relative standing of raw scores in dissimilar contexts (comparing e.g., a score of
77 out of 100 on one test against a score of 35 out of 45 on another test), as long as the
distribution was normal.
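
The percentages quoted above can be checked against the normal cumulative distribution function; the following sketch simply reproduces that arithmetic for the hypothetical exam (mean 87, standard deviation 5):

from scipy.stats import norm

mu, sigma = 87.0, 5.0

z_77 = (77 - mu) / sigma            # -2.0
z_78 = (78 - mu) / sigma            # -1.8

print(norm.cdf(z_77))               # about 0.0228: share of scores below 77
print(norm.cdf(z_78))               # about 0.0359: share of scores below 78
print(norm.cdf(2) - norm.cdf(-2))   # about 0.954: within two standard deviations of the mean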

The Central Limit Theorem
and the Distribution of Sample Means

Researchers did not typically seek and derive conclusions from a single raw score. They were
more likely to collect a sample of scores and calculate a mean. In this way, especially if they
selected a random sample, their work would tend to reflect different experiences or
observations. Their conclusions were thus less at risk of veering off into a special case of no
particular importance.

Just as a single raw value could be compared against the population containing all raw values, a
single mean value could be compared against the population of mean values. More precisely,
just as the raw value could be compared against the mean of raw values, a mean could be
compared against the mean of means. The latter comparison would be made specifically against
the mean of means of all possible samples of the same size. From a population of 1,000 raw
values, for instance, one could draw a very large, but not infinitely large, number of samples
containing 30 raw values each, calculate the mean of each such sample, and compare the mean of
just one of those samples against the mean of the means of all samples of that size.

Moreover, just as raw values might be distributed normally, so also the distribution of those
sample means (also known as the sampling distribution of the mean) might be normal. In fact,
the distribution of sample means became increasingly normal as the sample size and/or the
number of samples increased. That was true especially in samples of moderate to large size (n =
30 or less, in the case of a normal distribution, to n = 500 or more, for a very nonnormal
population distribution; e.g., Chang, Huang, & Wu, 2006). In other words, the distribution of
sample means could be normal even if the underlying raw data distribution was not. And in fact,
the distribution of sample means would approach normality as one increased the number of
samples and/or the size of the samples.

That principle of increasing normality was known as the Central Limit Theorem, first published
by de Moivre in 1733. The Central Limit Theorem yielded some related insights. One was that,
not surprisingly, the mean of the distribution of all possible sample means of a given size (μ_x̄)
was equal to the mean of the underlying raw values in a population (i.e., μ = μ_x̄). The other, less
obvious, was that the standard deviation of the distribution of sample means (called the standard
error of the mean, σ_x̄ = σ / √n) was much less than the standard deviation of the underlying
distribution of raw values. That is, a sample mean would tend to be closer to the population
mean than a raw value would be. That was because samples would tend to be moderated by raw
values from both above and below the mean: an extreme sample mean was much less likely than
an extreme single data point. This moderating tendency increased with sample sizes until, of
course, a sample as large as the whole population would have a mean equal to that of the
population. So a single large sample might provide a fair point estimate of an unknown
population mean.
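
The following simulation (entirely my own illustration, with arbitrary parameters and a deliberately nonnormal population) sketches both points: the means of repeated samples cluster far more tightly than the raw values, and their spread comes out close to σ / √n:

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=100_000)   # skewed, nonnormal raw data
mu, sigma = population.mean(), population.std()

n = 30
sample_means = np.array([
    rng.choice(population, size=n, replace=True).mean()   # one mean per random sample
    for _ in range(5_000)
])

print(sample_means.mean(), mu)                  # the mean of sample means is close to mu
print(sample_means.std(), sigma / np.sqrt(n))   # its spread is close to the standard error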

Graphically, the distribution of sample means might look leptokurtic rather than normal, when
presented on the same scale as the raw data of the population. The fact that it was normal meant
that, regardless of the distribution of raw values in the population, the researcher could use the z
table and the standard error (σ_x̄), relying on the normality of this distribution of sample means, to
calculate the distance of his/her sample mean from a known population mean, or to estimate its
distance from an unknown population mean.
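
A minimal sketch of that use of the standard error, with hypothetical numbers (a known population mean of 87 and standard deviation of 5, and a sample of 25 scores averaging 85):

from math import sqrt
from scipy.stats import norm

mu, sigma, n = 87.0, 5.0, 25
sample_mean = 85.0                  # hypothetical sample mean

se = sigma / sqrt(n)                # standard error of the mean = 1.0
z = (sample_mean - mu) / se         # this sample mean sits 2 standard errors below mu
print(z, norm.cdf(z))               # about 2.3% of sample means would fall this low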

The t Distribution

Calculations of z assumed that the population's standard deviation (σ) was known. This
assumption was usually unrealistic (Bernstein, 1999, p. 207). Instead, it was commonly
necessary to estimate σ by substituting the sample's standard deviation (s). One would thus use s
directly in the z table for raw scores, and would calculate s_x̄ (instead of σ_x̄) = s (instead of σ) / √n
when using the z table to work with the distribution of sample means.

That substitution would alter the equation, since s and σ were not calculated in exactly the same
way. The difference was as follows:

σ = √( Σ(X − μ)² / N )     but     s = √( Σ(X − X̄)² / (n − 1) )

Because of the difference in denominators (i.e., number of items in the population (N) versus
items in the sample (n) minus 1), s would not have the same value as σ even if the sample mean
(X̄) and population mean (μ) were the same. Hence, the normal distribution, and the values that
would work with σ, in the z table, would not work with s. Substituting s for σ would thus require
an alternative to the z distribution. This alternative, the t distribution, was published by Gossett
(using the pen name of "Student") and refined by Fisher (Eisenhart, 1979, p. 6). Like the normal
distribution, the t distribution (more precisely, the family of t distributions) was essentially a
formula whose results, if graphed, would produce a bell-shaped curve. For relatively large
samples (n ≥ 120) the t distribution was virtually equivalent to the normal distribution (Hinkle,
1994, p. 186). The two were closely similar even at n = 30 (Kuzma & Bohnenblust, 2004, p.
114). But as sample size shrank, the t distribution became more dispersed, reflecting less
confidence in the sample as an indicator of the population from which it was drawn, until, at
small values such as n = 3, the graph of the t distribution was markedly flatter and heavier-tailed than the normal curve.
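
Two small checks, with hypothetical data, may make both points concrete: NumPy computes the σ-style and s-style standard deviations with different divisors (the ddof argument), and SciPy's t distribution visibly approaches the normal as the sample size grows. This is my own illustration, not drawn from the cited texts.

import numpy as np
from scipy.stats import norm, t

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical values

print(np.std(x, ddof=0))      # divides by N: the population formula for sigma
print(np.std(x, ddof=1))      # divides by n - 1: the sample formula for s

# Critical values for a two-tailed 95% interval: t is wider for small samples.
print(norm.ppf(0.975))        # about 1.96
print(t.ppf(0.975, df=120))   # about 1.98, nearly the normal value
print(t.ppf(0.975, df=2))     # about 4.30, far more dispersed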

In the t distribution, the calculation to determine the distance of the sample mean from the
population mean was much like the calculation of the z score (above): t = (x̄ − μ) / s_x̄. As
indicated above, s_x̄ = s / √n. So t = (x̄ − μ) / (s / √n). Just as z scores expressed the distance of
an individual raw value from the mean of raw values in terms of standard deviations, t scores
expressed the distance of an individual sample mean from the mean of all sample means in terms
of standard errors of the mean (s_x̄).
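
As a sketch of that calculation with hypothetical data (say, a claimed population mean of 87), the manual formula and SciPy's one-sample test should agree; the sample values below are invented for illustration:

import numpy as np
from scipy import stats

sample = np.array([81.0, 84.0, 85.0, 88.0, 79.0, 83.0, 86.0, 82.0])   # hypothetical scores
mu = 87.0                                    # population mean under the null hypothesis

s = sample.std(ddof=1)                       # sample standard deviation
se = s / np.sqrt(len(sample))                # standard error of the mean
t_manual = (sample.mean() - mu) / se

t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)
print(t_manual, t_scipy, p_value)            # the two t values match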

t Tests

As one would expect from a parametric statistical tool, t tests were used to learn about
populations. This occurred in several different settings. The one-sample t test would be used to
estimate a population mean or to compare a sample against a known population mean (De Muth,
2006, p. 174). The latter use would ask whether the sample appeared likely to be drawn from
that population. For example, the general population of 40-year-old men might have a certain
mean weight, whereas a sample of 40-year-old men who had received a certain treatment might
weigh three pounds less. The question for the t test would be whether that was a statistically
significant difference, such that the sample no longer seemed to represent the general population
of 40-year-old men, but instead seemed likely to represent a new population of treated 40-year-
old men.

While one-sample scenarios were commonly used for educational purposes, most research
studies "require the comparison of two (or more) sets of data" (Gravetter & Wallnau, 2000, p.
311). Two-sample t tests would compare the means of two samples against each other, treating
each as representative of a potentially distinct population, to determine whether their differences
from one another were statistically significant. There were two kinds of two-sample t tests.
Paired (also known as dependent or related) samples would involve matching of individual
data points. For example, the same person might be tested before and after a treatment.
Unpaired (also known as independent) samples would have no such relationship; for example,
the 40-year-old men cited in the previous paragraph would be in two separate (control and
treatment) groups, without any one-to-one correspondence or pairing between individual men.
In that example, the (generally nominal or ordinal) independent variable
would be the treatment (it would lead the inquiry, providing the basis for grouping), and the
thing being studied (e.g., weight) would be the (interval or ratio) dependent variable.
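
A brief sketch of the three settings just described, using SciPy and invented data (the group labels, sizes, and numbers are hypothetical, and the use of SciPy is my own choice):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# One-sample: does a treated group's mean weight differ from a known population mean?
treated = rng.normal(loc=187, scale=20, size=40)          # hypothetical weights (lb)
print(stats.ttest_1samp(treated, popmean=190))

# Paired (dependent): the same people measured before and after a treatment.
before = rng.normal(loc=190, scale=20, size=30)
after = before - rng.normal(loc=3, scale=5, size=30)      # matched, point by point
print(stats.ttest_rel(before, after))

# Unpaired (independent): separate control and treatment groups, no matching.
control = rng.normal(loc=190, scale=20, size=40)
print(stats.ttest_ind(treated, control, equal_var=True))  # assumes homogeneous variances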
