British Journal of Mathematical and Statistical Psychology (2004), 57, 173–181


© 2004 The British Psychological Society
www.bps.org.uk

A note on preliminary tests of equality of variances

Donald W. Zimmerman*
Carleton University, Canada

Preliminary tests of equality of variances used before a test of location are no longer
widely recommended by statisticians, although they persist in some textbooks and
software packages. The present study extends the findings of previous studies and
provides further reasons for discontinuing the use of preliminary tests. The study found
Type I error rates of a two-stage procedure, consisting of a preliminary Levene test on
samples of different sizes with unequal variances, followed by either a Student pooled-
variances t test or a Welch separate-variances t test. Simulations disclosed that the two-
stage procedure fails to protect the significance level and usually makes the situation
worse. Earlier studies have shown that preliminary tests often adversely affect the size
of the test, and also that the Welch test is superior to the t test when variances are
unequal. The present simulations reveal that changes in Type I error rates are greater
when sample sizes are smaller, when the difference in variances is slight rather than
extreme, and when the significance level is more stringent. Furthermore, the validity of
the Welch test deteriorates if it is used only on those occasions where a preliminary
test indicates it is needed. Optimum protection is assured by using a separate-variances
test unconditionally whenever sample sizes are unequal.

1. Introduction
Widely used significance tests of location, including the two-sample Student t test and
the ANOVA F test, assume homogeneity of variance of treatment groups. It is well
known that failure of this assumption alters Type I error rates, especially when sample
sizes are unequal. When a larger variance is associated with a larger sample size, the
probability of a Type I error declines below the nominal significance level. In contrast,
when a larger variance is associated with a smaller sample size, the probability increases,
sometimes far above the significance level (Hsu, 1938; Overall, Atlas, & Gibson, 1995;
Scheffé, 1959, 1970).

* Correspondence should be addressed to Donald W. Zimmerman, 1978 134A Street, Surrey BC V4A 6B6, Canada
(e-mail: dwzimm@telus.net).
The computer programs in this study were written in PowerBASIC, version 3.5, PowerBASIC, Inc., Carmel, CA. Listings of the
programs can be obtained by writing to the author.

This problem can be handled by substituting separate-variances tests, such as the
ones introduced by Welch (1938, 1947) and Satterthwaite (1946), and more recently
by Fligner and Policello (1981), for the Student t test. These modified tests, unlike the
two-sample Student t test, do not pool variances in computation of an error term. Also,
they alter the degrees of freedom by a function that depends on sample data. It has
been found that these methods restore Type I error probabilities to the nominal
significance level and also counteract increases or decreases of Type II error
probabilities.
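
As an illustration of the difference between the two error terms, the following minimal sketch computes the pooled-variances Student t statistic and the separate-variances Welch statistic with the Welch–Satterthwaite approximate degrees of freedom. The function names and the sample values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Student t statistic and df, using a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


def welch_t(x, y):
    """Welch t statistic with Welch-Satterthwaite approximate df."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    # The df depend on the sample variances, not only on n1 and n2.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Illustrative values only (not data from the study): unequal n, unequal spread.
x = [4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.6, 5.9]
y = [2.0, 9.5, 5.7]
print("pooled:", pooled_t(x, y))
print("welch: ", welch_t(x, y))
```

When sample sizes and sample variances are both equal, the two statistics coincide and the Welch degrees of freedom reduce to n1 + n2 − 2; they diverge as the sample sizes and variances become unbalanced.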
A procedure once popular among researchers involves preliminary tests of equality
of variances—for example, tests devised by Bartlett (1937), Levene (1960), Boos and
Brownie (1989), and Brown and Forsythe (1974). If a preliminary test results in rejection
of the hypothesis of equal variances, one of the modified tests of location mentioned
above or a related separate-variances test is performed. If the preliminary test does not
result in rejection of H0, homogeneity is assumed to be satisfied, and the usual two-
sample Student t test of location is chosen. In recent years, many authors have
discouraged the use of these preliminary tests. Nevertheless, some introductory
textbooks still recommend them, and various software packages, such as SPSS, NCSS,
Minitab, and Statistica, contain the required algorithms. Software packages sometimes
give the impression that preliminary tests are legitimate and without shortcomings.
Many simulation studies have compared a large number of proposed tests of equality
of variances, but doubt about their capabilities as preliminary tests remains. The present
study examined properties of these tests in a somewhat different way—by regarding the
entire procedure, consisting of a preliminary test of variances, followed by selection of
one of the tests of location, as a compound significance test with its own Type I errors,
Type II errors and power. That is, the Type I error rates of the Student t test and Welch t
test were regarded as conditional probabilities, given outcomes of the initial test, and
the unconditional probabilities associated with the entire two-stage procedure were
found by simulations. The same method can be applied to variables having any
distribution.

2. Method
The random number generator used in this study was introduced by Marsaglia, Zaman,
and Tsang (1990). Normal deviates were generated by the rejection method of Marsaglia
and Bray (1964), and each replication obtained two independent samples of n1 and n2
scores. All scores in one sample were multiplied by a constant, so that the ratio σ1/σ2
had a predetermined value, ranging from 1 to 3 in increments of 0.1, 0.2 or 0.5. The total
sample size n1 + n2 was fixed at 30 or 60, and the ratio n1/n2 ranged from 0.2 to 5.
There were 50 000 replications of the sampling procedure for each condition in the
study, except for the data in Table 2, where there were 200 000 replications for each
condition. All significance tests were non-directional.
On each replication, a test of equality of variances was first performed at the .01 or
.05 significance level. The test, originated by Levene (1960), consists of a two-sample
Student t test on the squared deviations, (X − X̄)² and (Y − Ȳ)². If this test did not
reject the hypothesis of equal variances, the usual two-sample Student t test based on
pooled variances was performed at the .01 or .05 level. If the Levene test rejected
the hypothesis of equal variances, the Welch–Satterthwaite version of the t test based
on separate variances (Welch, 1938, 1947; Satterthwaite, 1946) was performed.
The preliminary Levene test and the subsequent test of location were always performed
at the same significance level, except for the data in Table 3, where the significance level
of the Levene test was varied while that of the test of location remained fixed.
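
The two-stage procedure described above can be sketched in Python as follows. This is not the author's PowerBASIC program: it assumes NumPy's default generator in place of the Marsaglia generators, uses SciPy's `ttest_ind` for both the pooled and the separate-variances tests, and the function and argument names are hypothetical. Estimates will therefore differ from the published tables by simulation error.

```python
import numpy as np
from scipy import stats


def simulate(n1, n2, sd_ratio, alpha=0.05, alpha_levene=None,
             reps=50_000, seed=0):
    """Estimate rejection rates when both population means are equal.

    Returns the rates for the unconditional Student test (S), the
    unconditional Welch test (W), the compound procedure (C), and the
    rejection rate of the preliminary Levene test (L).
    """
    if alpha_levene is None:
        alpha_levene = alpha  # same level for both stages by default
    rng = np.random.default_rng(seed)
    # Each row is one replication; sample 1 is rescaled so sigma1/sigma2 = sd_ratio.
    x = sd_ratio * rng.standard_normal((reps, n1))
    y = rng.standard_normal((reps, n2))

    # Preliminary test as described in the text: a two-sample Student t test
    # on the squared deviations from each sample mean.
    dx = (x - x.mean(axis=1, keepdims=True)) ** 2
    dy = (y - y.mean(axis=1, keepdims=True)) ** 2
    levene_rejects = stats.ttest_ind(dx, dy, axis=1, equal_var=True).pvalue < alpha_levene

    student = stats.ttest_ind(x, y, axis=1, equal_var=True).pvalue < alpha
    welch = stats.ttest_ind(x, y, axis=1, equal_var=False).pvalue < alpha

    # Compound test: Welch if the pre-test rejects, Student otherwise.
    compound = np.where(levene_rejects, welch, student)
    return {
        "S": float(student.mean()),
        "W": float(welch.mean()),
        "C": float(compound.mean()),
        "L": float(levene_rejects.mean()),
    }


print(simulate(n1=10, n2=50, sd_ratio=1.5))
```

Passing a value for `alpha_levene` that differs from `alpha` corresponds to the design in which the significance level of the Levene test is varied while that of the test of location is held fixed.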

3. Results
Table 1 shows Type I error rates for the Student t test and Welch t test performed
unconditionally and for the procedure in which the choice was determined by the
Levene test. For both significance levels, the dependence on sample sizes and on the
magnitude of the ratio is apparent. The Type I error rate of the unconditional Student
t test is severely inflated or deflated as the ratio increases, whereas that of the
unconditional Welch test remains close to the significance level in all cases. The table
also shows the power of the Levene test, that is, the probability of rejecting the
hypothesis of equal variances, as a function of the same ratio over the range from 1 to
2.5 in increments of 0.5. When the probability of rejecting H0 by the Levene test is
close to zero, the outcome of the compound test is nearly the same as that of the
Student t test, and when that probability is close to 1, the outcome of the compound
test is nearly the same as that of the Welch t test.

Table 1. Probability of rejecting H0 for various combinations of sample sizes, significance levels, and
ratios of standard deviations. S = Student t test; W = Welch t test; C = choice conditional on
preliminary Levene test; L = power of Levene test.

                          α = .01                    α = .05
n1   n2   σ1/σ2     S     W     C     L        S     W     C     L
50   10   1.0     .010  .011  .010  .010     .048  .051  .048  .040
          1.5     .001  .011  .001  .002     .012  .051  .022  .094
          2.0     0     .010  0     .010     .003  .049  .022  .276
          2.5     0     .010  0     .026     .001  .050  .029  .468
40   20   1.0     .011  .011  .011  .008     .051  .051  .051  .049
          1.5     .003  .009  .004  .079     .026  .049  .037  .369
          2.0     .002  .010  .005  .346     .017  .049  .044  .808
          2.5     .001  .010  .006  .589     .013  .049  .046  .948
20   40   1.0     .010  .010  .010  .009     .051  .052  .052  .048
          1.5     .025  .011  .021  .293     .086  .051  .063  .539
          2.0     .035  .010  .016  .755     .111  .049  .051  .917
          2.5     .046  .011  .013  .933     .127  .050  .049  .989
10   50   1.0     .010  .011  .010  .010     .049  .051  .049  .040
          1.5     .046  .011  .036  .262     .133  .051  .088  .427
          2.0     .093  .011  .042  .658     .203  .053  .075  .796
          2.5     .134  .012  .029  .873     .253  .052  .058  .940
25    5   1.0     .010  .016  .010  .011     .051  .058  .050  .036
          1.5     .001  .013  .001  0        .012  .055  .014  .013
          2.0     0     .012  0     0        .004  .053  .007  .026
          2.5     0     .010  0     0        .001  .050  .004  .044
20   10   1.0     .010  .010  .010  .007     .050  .050  .050  .047
          1.5     .004  .010  .004  .011     .028  .050  .033  .126
          2.0     .002  .009  .003  .038     .019  .049  .029  .314
          2.5     .002  .009  .003  .078     .015  .050  .031  .481
10   20   1.0     .010  .010  .010  .007     .050  .049  .050  .047
          1.5     .023  .011  .021  .110     .087  .051  .073  .294
          2.0     .039  .011  .031  .340     .114  .052  .071  .635
          2.5     .046  .011  .029  .541     .130  .048  .059  .829
 5   25   1.0     .010  .016  .010  .011     .051  .057  .050  .034
          1.5     .048  .015  .043  .139     .133  .057  .105  .260
          2.0     .093  .015  .068  .368     .205  .054  .115  .532
          2.5     .138  .014  .077  .576     .263  .054  .100  .724

Table 1 also indicates that the outcome is influenced to some extent by the total
sample size. For n1 + n2 = 30, the inflation or depression of the probability of rejecting
H0 of the compound test is more extreme than for n1 + n2 = 60, and this is true for all
values of the ratio n1/n2. This dependence on sample size presumably is explained by
the fact that, for a fixed α, the power of the Levene test increases when the sample size is
larger (shown in column L).
Figure 1 is a more detailed picture of the Type I error rates as a function of the ratio
over the range from 1 to 2.2 in increments of 0.1. In both parts of the figure, the middle
curve, representing cases in which the test of location is conditional on the outcome of
the Levene test, at first increases or declines along with the Student t test, but then
reverses direction and gradually approaches the significance level. Again, the Welch t
test remains close to the significance level over the entire range. It is notable that the
Type I error rate of the compound procedure is most distorted when the inequality of
standard deviations is slight, rather than extreme.
Similar results are presented in a somewhat different way in Table 2. This table shows
conditional probabilities of Type I errors under the conditions that the Levene test does
or does not reject the hypothesis of equal variances. These conditional probabilities
apparently are influenced primarily by the ratio of population standard deviations,
irrespective of the outcome of the Levene test on samples. That is, if the Student t test is
performed only on those occasions when the Levene test fails to reject the hypothesis of
equal variances, the inflated or deflated Type I error rate does not improve and actually
becomes slightly worse. Furthermore, if the Welch test is performed only on those
occasions when the Levene test rejects, its good properties are spoiled to some degree.
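
As a rough consistency check, the conditional probabilities in Table 2 can be recombined by the law of total probability, weighting by the Levene rejection rates in Table 1, and compared with the compound rates (column C) in Table 1. The sketch below does this for four conditions at α = .05. The function name is hypothetical, and agreement is only approximate because the two tables come from separate simulation runs with different numbers of replications.

```python
# Values read from Tables 1 and 2 at alpha = .05.
# Key: (n1, n2, ratio of standard deviations).
# Value: (Levene rejection rate from Table 1,
#         Welch Type I error rate given that Levene rejects (Table 2),
#         Student Type I error rate given that Levene does not reject (Table 2),
#         compound rate C as tabled in Table 1).
cases = {
    (10, 50, 1.5): (0.427, 0.015, 0.141, 0.088),
    (10, 50, 2.5): (0.940, 0.041, 0.329, 0.058),
    (5, 25, 2.0): (0.532, 0.006, 0.238, 0.115),
    (50, 10, 1.5): (0.094, 0.103, 0.012, 0.022),
}


def compound_rate(levene_power, welch_given_reject, student_given_no_reject):
    """Unconditional rejection rate of the two-stage procedure."""
    return (levene_power * welch_given_reject
            + (1 - levene_power) * student_given_no_reject)


for condition, (power, w_rej, s_not, tabled) in cases.items():
    recombined = compound_rate(power, w_rej, s_not)
    print(condition, "recombined:", round(recombined, 3), "tabled C:", tabled)
```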
Table 3 shows the dependence of the Type I error rate of the compound test on the
significance level of the Levene test, along with the outcome of the unconditional
Student t test and the unconditional Welch t test. Somewhat paradoxically, as the
significance level becomes more stringent, the outcome of the compound test becomes
closer to that of the inflated Student t test. In other words, the test of location becomes
increasingly ineffective as the significance level of the preliminary test decreases.
A higher Type I error rate of the preliminary test actually improves the performance of
the compound test. These results are incompatible with the purpose for which the
preliminary test was designed in the first place.

4. Discussion
A preliminary test of variances itself is subject to Type I and Type II errors, and on some
occasions it leads to incorrect statistical decisions. Under the compound procedure, if
population variances are not equal, an appropriate test of location, the Welch t test, is
performed with probability equal to the power of the Levene test. An inappropriate test,
the Student t test, is performed on all other occasions. If population variances are equal,
the Student t test is performed with probability 1 − αL, and the Welch test with
probability αL, where αL is the significance level of the Levene test. Therefore, the
empirical probability of rejecting H0 by the compound procedure will exceed the
nominal significance level of the test of location, and sometimes will be nearly as high as
the inflated probability of the Student t test.

Figure 1. Type I error rates of unconditional Student t test, unconditional Welch t test, and choice of
test conditional on outcome of preliminary Levene test, as a function of ratio of population standard
deviations. Top: larger standard deviation associated with larger sample size. Bottom: larger standard
deviation associated with smaller sample size.
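
The mixing argument in this section can be written compactly. The notation is introduced here for illustration only and does not appear in the original: π_L denotes the probability that the Levene test rejects, which equals its power when population variances differ and α_L when they are equal.

```latex
\begin{align*}
P(\text{compound test rejects } H_0)
  &= \pi_L \, P(\text{Welch rejects } H_0 \mid \text{Levene rejects}) \\
  &\quad + (1 - \pi_L) \, P(\text{Student rejects } H_0 \mid \text{Levene does not reject})
\end{align*}
```

As π_L approaches 0 the right-hand side is dominated by the Student term, and as π_L approaches 1 it is dominated by the Welch term.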
These relations are borne out by the data in Table 1 and Fig. 1. The probability of
rejecting H0 by the compound procedure always falls between the probability of
rejecting H0 by the Welch test and the inflated probability of the Student t test. As the
discrepancy between the two variances increases, the Type I error rate of the compound
procedure at first increases along with that of the Student t test and then decreases,
because the power of the Levene test increases. Furthermore, the discrepancy is greater
when the total sample size is smaller (30 rather than 60). Eventually, as the ratio of
variances becomes more extreme, the power of the compound test is nearly the same as
that of the Welch test. Nothing is gained by the preliminary Levene test.

Table 2. Conditional probability of Type I error of the Student t test and Welch t test, given that the
Levene test rejects the hypothesis of equality of variances and given that the Levene test does not reject
(α = .05).

                        Student t test               Welch t test
                    Levene       Levene does     Levene       Levene does
n1   n2   σ1/σ2     rejects H0   not reject H0   rejects H0   not reject H0
50   10   1.0       .051         .050            .048         .051
          1.5       .009         .012            .103         .046
          2.0       .002         .003            .070         .042
          2.5       .001         .001            .057         .043
10   50   1.0       .053         .050            .048         .051
          1.5       .119         .141            .015         .078
          2.0       .194         .245            .030         .132
          2.5       .249         .329            .041         .201
25    5   1.0       .048         .050            .024         .057
          1.5       .012         .012            .133         .054
          2.0       .003         .004            .101         .051
          2.5       .001         .002            .065         .049
 5   25   1.0       .046         .050            .028         .059
          1.5       .111         .140            .004         .075
          2.0       .178         .238            .006         .110
          2.5       .235         .329            .012         .162

Table 3. Probability of rejecting H0 by unconditional Student t test, unconditional Welch t test, and
choice of test conditional on preliminary Levene test at various significance levels (αL = significance level
of Levene test; αt = significance level of test of location, Student or Welch t test).

                                     Conditional on Levene, αL =
n1   n2   σ1/σ2   αt    Student t    .01    .02    .05    .10    .20    Welch t
45   15   1.5     .01   .002         .002   .003   .006   .008   .010   .010
                  .05   .018         .020   .023   .031   .040   .047   .050
          2.0     .01   .001         .002   .004   .007   .009   .010   .010
                  .05   .008         .015   .023   .038   .047   .050   .050
          2.5     .01   0            .003   .005   .008   .009   .010   .010
                  .05   .005         .018   .028   .042   .047   .049   .049
15   45   1.5     .01   .034         .026   .024   .021   .018   .015   .011
                  .05   .107         .085   .079   .071   .064   .058   .051
          2.0     .01   .059         .024   .020   .015   .013   .011   .010
                  .05   .153         .072   .064   .057   .053   .050   .049
          2.5     .01   .080         .016   .013   .011   .010   .010   .010
                  .05   .184         .057   .053   .050   .049   .049   .050

This conclusion is also supported by the results in Table 3. As the Type I error rate of
the Levene test decreases, the Type I error rate of the compound test increases. In other
words, the more stringent the preliminary test, the less valid the compound procedure.
The inflated Type I error rate of the compound test disappears only when the Levene
test is far less stringent than most practical significance tests—with a significance level
of .20.
Moreover, the performance of the two tests, especially the Welch test, deteriorates
when restricted to samples that yield a designated outcome of the Levene test. That is,
the conditional probability of the t test, under the condition that the Levene test has not
rejected H0, is no longer as close to the significance level. The same is true for the
conditional probability of the Welch test under the condition that the preliminary test
has rejected H0.
The same line of reasoning applies to any preliminary test of equality of variances
that might be devised. A test more powerful than the Levene test conceivably could
detect smaller differences between variances with high probability, but nevertheless
some Type II errors would still occur and on some occasions an inappropriate Student t
test would be chosen. In the limiting case, if the power of the preliminary test were
close to 1.00, and if variances were unequal, the Welch t test would be chosen with
probability approaching 1.00. In that case the preliminary test would be pointless. The
same reasoning applies to tests that could be substituted for the Welch test. For
example, if the Fligner–Policello robust rank test were conditional on the Levene test or
a nonparametric counterpart of the Levene test, results similar to those in the present
study might be expected.
In the past, preliminary tests have been frowned upon for several reasons. Many tests
on variances are not robust to non-normality, and some that are robust lack acceptable
power. Some textbook authors have pointed out that such tests are not really necessary,
because the t and F tests are robust to heterogeneity of variance. Furthermore, many
authors have found that use of a preliminary test influences the size of the main
significance test (Albers, Boon, & Kallenberg, 2000; Arnold, 1970; Bancroft, 1964;
Gupta & Srivastava, 1993; Moser & Stevens, 1992; Moser, Stevens, & Watts, 1989; Rao &
Saxena, 1981; Saleh & Sen, 1983). Also, there is considerable evidence that the separate-
variance Welch–Satterthwaite test and related tests are superior to a pooled-variance t
test when variances are unequal (Cohen, 1974; Overall et al., 1995; Zimmerman &
Zumbo, 1993).
The data in the present note provide still more reasons for discontinuing preliminary
testing even if the above limitations did not apply. The data suggest that no possible
preliminary test can improve on the performance of a separate-variances test of location
when population variances are unequal. The Welch–Satterthwaite method, used
unconditionally without a preliminary test, keeps the Type I error rate close to the
nominal significance level. Furthermore, performance is usually much worse when
choice of a test of location is coupled with a test of variances.

5. Practical recommendations
Although many current textbooks state that preliminary tests are no longer favoured and
not really necessary, many researchers in psychology, education and other social
sciences are unaware of the serious disadvantages of these tests. Current
recommendations do not convey the message that they substantially modify the
significance level. All findings in the present note suggest that researchers should pay
more attention to differences in sample sizes as a danger signal rather than to
heterogeneity of variance.
One should also keep in mind that differences, or lack of differences, between
sample variances do not necessarily correspond to population parameters. In other
words, if the variances of sample values are nearly equal or differ only slightly, it is still
possible for population variances to differ to a larger extent. As can be seen from Fig. 1,
slight differences in population standard deviations have large effects if the ns of two
treatment groups are unequal. This means that the significance level of a test can be
inaccurate even when sample variances appear to be nearly the same.
Furthermore, the largest inaccuracy results when the disparity is slight rather than
extreme, and it is more difficult to rule out slight population differences from sample
data. In practice, researchers may be tempted to improve results by making the α-level
of a preliminary test more stringent, but this procedure, we have seen, is counterproductive.
Increasing sample size helps to a certain extent, but not much, as shown in
Table 1. When sample sizes are unequal, it appears that the most efficient strategy is to
perform the Welch t test or a related separate-variances test unconditionally, without
regard to the variability of sample values.
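
In software, the unconditional strategy amounts to requesting the separate-variances test directly and skipping the variance pre-test. A minimal sketch with SciPy, in which the simulated scores and variable names are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical scores: unequal sample sizes, larger spread in the smaller group.
group_a = rng.normal(loc=0.0, scale=2.0, size=10)
group_b = rng.normal(loc=0.0, scale=1.0, size=50)

# equal_var=False selects the Welch separate-variances t test.
# No preliminary test of variances is run beforehand.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print("Welch t:", result.statistic, "p-value:", result.pvalue)
```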

References
Albers, W., Boon, P. C., & Kallenberg, W. C. M. (2000). The asymptotic behavior of tests for normal
means based on a variance pre-test. Journal of Statistical Planning and Inference, 88, 47–57.
Arnold, B. C. (1970). Hypothesis testing incorporating a preliminary test of significance. Journal
of the American Statistical Association, 65, 1590–1596.
Bancroft, T. A. (1964). Analysis and inference for incompletely specified models involving the use
of preliminary test(s) of significance. Biometrics, 20, 427–442.
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal
Society, Series A, 160, 268–282.
Boos, D. D., & Brownie, C. (1989). Bootstrap methods for testing homogeneity of variances.
Technometrics, 31, 69–82.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances. Journal of the
American Statistical Association, 69, 364–367.
Cohen, A. (1974). To pool or not to pool in hypothesis testing. Journal of the American Statistical
Association, 69, 721–725.
Fligner, M. A., & Policello, G. E. II (1981). Robust rank procedures for the Behrens–Fisher
problem. Journal of the American Statistical Association, 76, 162–168.
Gupta, V. P., & Srivastava, V. K. (1993). Upper bound for the size of a test procedure using
preliminary tests of significance. Journal of the Indian Statistical Association, 7, 26–29.
Hsu, P. L. (1938). Contributions to the theory of Student's t test as applied to the problem of two
samples. Statistical Research Memoirs, 2, 1–24.
Levene, H. (1960). Robust tests for equality of variance. In I. Olkin (Ed.), Contributions to
probability and statistics. Palo Alto, CA: Stanford University Press.
Marsaglia, G., & Bray, T. A. (1964). A convenient method for generating normal variables. Society
for Industrial and Applied Mathematics Review, 6, 260–264.
Marsaglia, G., Zaman, A., & Tsang, W. W. (1990). Toward a universal random number generator.
Statistics and Probability Letters, 8, 35–39.
Moser, B. K., & Stevens, G. R. (1992). Homogeneity of variance in the two-sample means test.
American Statistician, 46(1), 19–21.
Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite's
approximate F test. Communications in Statistics - Theory and Methods, 18, 3963–3975.
Overall, J. E., Atlas, R. S., & Gibson, J. M. (1995). Tests that are robust against variance
heterogeneity in k × 2 designs with unequal cell frequencies. Psychological Reports, 76,
1011–1017.
Rao, C. V., & Saxena, K. P. (1981). On approximation of power of a test procedure based on
preliminary tests of significance. Communications in Statistics - Theory and Methods, A10,
1305–1321.
Saleh, A. K., & Sen, P. K. (1983). Asymptotic properties of tests of hypotheses following a
preliminary test. Statistical Decisions, 1, 455–477.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin, 2, 110–114.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Scheffé, H. (1970). Practical solutions of the Behrens–Fisher problem. Journal of the American
Statistical Association, 65, 1501–1508.
Welch, B. L. (1938). The significance of the difference between two means when the population
variances are unequal. Biometrika, 29, 350–362.
Welch, B. L. (1947). The generalization of Student's problem when several different population
variances are involved. Biometrika, 34, 29–35.
Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t
test and Welch t′ test for non-normal populations with unequal variances. Canadian Journal
of Experimental Psychology, 47, 523–539.

Received 12 November 2002; revised version received 18 February 2003
