Psychological Methods Copyright 1998 by the American Psychological Association, Inc.

1998, Vol. 3, No. 4, 424--453 1082-989X/98/$3.00

Fit Indices in Covariance Structure Modeling: Sensitivity to Underparameterized Model Misspecification

Li-tze Hu, University of California, Santa Cruz
Peter M. Bentler, University of California, Los Angeles

This study evaluated the sensitivity of maximum likelihood (ML)-, generalized least squares (GLS)-, and asymptotic distribution-free (ADF)-based fit indices to model misspecification, under conditions that varied sample size and distribution. The effect of violating assumptions of asymptotic robustness theory also was examined. Standardized root-mean-square residual (SRMR) was the most sensitive index to models with misspecified factor covariance(s), and Tucker-Lewis Index (1973; TLI), Bollen's fit index (1989; BL89), relative noncentrality index (RNI), comparative fit index (CFI), and the ML- and GLS-based gamma hat, McDonald's centrality index (1989; Mc), and root-mean-square error of approximation (RMSEA) were the most sensitive indices to models with misspecified factor loadings. With ML and GLS methods, we recommend the use of SRMR, supplemented by TLI, BL89, RNI, CFI, gamma hat, Mc, or RMSEA (TLI, Mc, and RMSEA are less preferable at small sample sizes). With the ADF method, we recommend the use of SRMR, supplemented by TLI, BL89, RNI, or CFI. Finally, most of the ML-based fit indices outperformed those obtained from GLS and ADF and are preferable for evaluating model fit.

Li-tze Hu, Department of Psychology, University of California, Santa Cruz; Peter M. Bentler, Department of Psychology, University of California, Los Angeles.

This research was supported by a grant from the Division of Social Sciences, by a Faculty Research Grant from the University of California, Santa Cruz, and by U.S. Public Health Service Grants DA00017 and DA01070. The computer assistance of Shinn-Tzong Wu is gratefully acknowledged.

Correspondence concerning this article should be addressed to Li-tze Hu, Department of Psychology, University of California, Santa Cruz, California 95064. Electronic mail may be sent to lth@cats.ucsc.edu.

This study addresses the sensitivity of various fit indices to underparameterized model misspecification. The issue of model misspecification has been almost completely neglected in evaluating the adequacy of fit indices used to evaluate covariance structure models. Previous recommendations on the adequacy of fit indices have been primarily based on the evaluation of the effect of sample size, or the effect of estimation method, without taking into account the sensitivity of an index to model misspecification. In other words, virtually all studies of fit indices have concentrated their efforts on the adequacy of fit indices under the modeling null hypothesis, that is, when the model is correct. Although such an approach is useful, as noted by Maiti and Mukherjee (1991), it misses the main practical point for the use of fit indices, namely, the ability to discriminate well-fitting from badly fitting models. Of course, it is certainly legitimate to ask that fit indices reliably reach their maxima when the model is correct, for example, under variations of sample size, but it seems much more vital to assure that a fit index is sensitive to misspecification of the model, so that it can be used to determine whether a model is incorrect. Maiti and Mukherjee term this characteristic sensitivity. Thus, a good index should approach its maximum under correct specification but also degrade substantially under misspecification. As far as we can tell, essentially no studies have inquired to what extent this basic requirement is met by the many indices that have been proposed across the years. Maiti and Mukherjee have provided an analysis of only a few indices under very restricted modeling conditions.

In this study, the sensitivity of four types of fit indices, derived from maximum-likelihood (ML), generalized least squares (GLS), and asymptotic distribution-free (ADF) estimators, to various types of underparameterized model misspecification is examined. Note that in an underparameterized model, one

SENSITIVITY OF FIT INDICES TO MISSPECIFICATION 425

or more parameters whose population values are nonzero are fixed to zero. In addition, we evaluate the adequacy of these four types of fit indices under conditions such as violation of underlying assumptions of multivariate normality and asymptotic robustness theory, providing evidence regarding the efficacy of the often stated idea that a model with a fit index greater than (or, in some cases, less than) a conventional cutoff value should be acceptable (e.g., Bentler & Bonett, 1980). Also, for the first time, we evaluated several new and supposedly superior indices (i.e., gamma hat, McDonald's [1989] centrality index [Mc], and root-mean-square error of approximation [RMSEA]) that have been recommended with little or no empirical support. We present here a nontechnical summary of the methods and the results of our study. Readers wishing a more detailed report of this study should consult our complete technical report (Hu & Bentler, 1997).

Historical Background

Structural equation modeling has become a standard tool in psychology for investigating the plausibility of theoretical models that might explain the interrelationships among a set of variables. In these applications, the assessment of goodness-of-fit and the estimation of parameters of the hypothesized model(s) are the primary goals. Issues related to the estimation of parameters have been discussed elsewhere (e.g., Bollen, 1989; Browne & Arminger, 1995; Chou & Bentler, 1995); our discussion here focuses on those issues that are critical to the assessment of goodness-of-fit of the hypothesized model(s).

The most popular ways of evaluating model fit are those that involve the chi-square goodness-of-fit statistic and the so-called fit indices that have been offered to supplement the chi-square test. The asymptotic chi-square test statistic was originally developed to serve as a criterion for model evaluation or selection. In its basic form, a large value of the chi-square statistic, relative to its degrees of freedom, is evidence that the model is not a very good description of the data, whereas a small chi-square is evidence that the model is a good one for the data. Unfortunately, as noted by many researchers, this simple version of the chi-square test may not be a reliable guide to model adequacy. The actual size of a test statistic depends not only on model adequacy but also on which one among several chi-square tests actually is used, as well as other conceptually unrelated technical conditions, such as sample size being too small or violation of an assumption underlying the test, for example, multivariate normality of variables, in the case of the standard chi-square test (e.g., Bentler & Dudgeon, 1996; Chou, Bentler, & Satorra, 1991; Curran, West, & Finch, 1996; Hu, Bentler, & Kano, 1992; Muthén & Kaplan, 1992; West, Finch, & Curran, 1995; Yuan & Bentler, 1997). Thus, a significant goodness-of-fit chi-square value may be a reflection of model misspecification, power of the test, or violation of some technical assumptions underlying the estimation method. More important, it has been commonly recognized that models are best regarded as approximations of reality, and hence, using chi-square to test the hypothesis that the population covariance matrix matches the model-implied covariance matrix, $\Sigma = \Sigma(\theta)$, is too strong to be realistic (e.g., de Leeuw, 1983; Jöreskog, 1978). Thus the standard chi-square test may not be a good enough guide to model adequacy.

As a consequence, alternative measures of fit, namely, so-called fit indices, were developed and recommended as plausible additional measures of model fit (e.g., Akaike, 1987; Bentler, 1990; Bentler & Bonett, 1980; Bollen, 1986, 1989; James, Mulaik, & Brett, 1982; Jöreskog & Sörbom, 1981; Marsh, Balla, & McDonald, 1988; McDonald, 1989; McDonald & Marsh, 1990; Steiger & Lind, 1980; Tanaka, 1987; Tanaka & Huba, 1985; Tucker & Lewis, 1973). However, despite the increasing popularity of using fit indices as alternative measures of model fit, applied researchers inevitably face a constant challenge in selecting appropriate fit indices among a large number of fit indices that have recently become available in many popular structural equation modeling programs. For instance, both LISREL 8 (Jöreskog & Sörbom, 1993) and the PROC CALIS procedure for structural equation modeling (SAS Institute, 1993) report the values of about 20 fit indices, and EQS (Bentler & Wu, 1995a, 1995b) prints the values of almost 10 fit indices. Frequently, the values of various fit indices reported in a given program yield conflicting conclusions about the extent to which the model matches the observed data. Applied researchers thus often have difficulties in determining the adequacy of their covariance structure models. Furthermore, as noted by Bentler and Bonett (1980), who introduced several of these indices and popularized the ideas, fit indices were designed to avoid some of the problems of sample size and distributional misspecification on evaluation of a model. Initially, it was hoped that
426 HU AND BENTLER

these fit indices would more unambiguously point to model adequacy as compared with the chi-square test. This optimistic state of affairs is unfortunately also not true.

The Chi-Square Test

The conventional overall test of fit in covariance structure analysis assesses the magnitude of discrepancy between the sample and fitted covariance matrices. Let $S$ represent the unbiased estimator of a population covariance matrix, $\Sigma$, of the observed variables. The population covariance matrix can be expressed as a function of a vector containing the fixed and free model parameters, that is, $\theta$: $\Sigma = \Sigma(\theta)$. The parameters are estimated so that the discrepancy between the sample covariance matrix $S$ and the implied covariance matrix $\Sigma(\hat{\theta})$ is minimal. A discrepancy function $F = F[S, \Sigma(\theta)]$ can be considered to be a measure of the discrepancy between $S$ and $\Sigma(\theta)$ evaluated at an estimator $\hat{\theta}$ and is minimized to yield $F_{\min}$. Under an assumed distribution and the hypothesized model $\Sigma(\theta)$ for the population covariance matrix $\Sigma$, the test statistic $T = (N - 1)F_{\min}$ has an asymptotic (large sample) chi-square distribution. The test statistic $T$ is usually called the chi-square statistic by other researchers. In general, the null hypothesis $\Sigma = \Sigma(\theta)$ is rejected if $T$ exceeds a value in the chi-square distribution associated with an $\alpha$ level of significance. The $T$ statistics can be derived from various estimation methods that vary in the degrees of sensitivity to the distributional assumptions. The $T$ statistic derived from ML under the assumption of multivariate normality of variables is the most widely used summary statistic for assessing the adequacy of a structural equation model (Gierl & Mulvenon, 1995).

Types of Fit Indices

Unlike a chi-square test that offers a dichotomous decision strategy implied by a statistical decision rule, a fit index can be used to quantify the degree of fit along a continuum. It is an overall summary statistic that evaluates how well a particular covariance structure model explains sample data. Like $R^2$ in multiple regression, fit indices are meant to quantify something akin to variance accounted for, rather than to test a null hypothesis $\Sigma = \Sigma(\theta)$. In particular, these indices generally quantify the extent to which the variation and covariation in the data are accounted for by a model. One of the most widely adopted dimensions for classifying fit indices is the absolute versus incremental distinction (Bollen, 1989; Gerbing & Anderson, 1993; Marsh et al., 1988; Tanaka, 1993). An absolute-fit index directly assesses how well an a priori model reproduces the sample data. Although no reference model is used to assess the amount of increment in model fit, an implicit or explicit comparison may be made to a saturated model that exactly reproduces the observed covariance matrix. As a result, this type of fit index is analogous to $R^2$ by comparing the goodness of fit with a component that is similar to a total sum of squares. In contrast, an incremental fit index measures the proportionate improvement in fit by comparing a target model with a more restricted, nested baseline model. Incremental fit indices are also called comparative fit indices. A null model in which all the observed variables are allowed to have variances but are uncorrelated with each other is the most typically used baseline model (Bentler & Bonett, 1980), although other baseline models have been suggested (e.g., Sobel & Bohrnstedt, 1985).

Incremental fit indices can be further distinguished among themselves. We define three groups of indices, Types 1-3 (Hu & Bentler, 1995).¹ A Type 1 index uses information only from the optimized statistic $T$, used in fitting baseline ($T_B$) and target ($T_T$) models. $T$ is not necessarily assumed to follow any particular distributional form, though it is assumed that the fit function $F$ is the same for both models. A general form of such indices can be written as Type 1 incremental indices = $|T_B - T_T|/T_B$. The ones we study in this article are the normed fit index (NFI; Bentler & Bonett, 1980) and a fit index by Bollen (1986; BL86).

¹ The terminology of Type 1 and Type 2 indices follows Marsh et al. (1988), although our specific definitions of these terms are not identical to theirs. Their Type 2 index has some definitional problems, and its proclaimed major example is not consistent with their own definition. They define Type 2 indices as $|T_T - T_B|/|E - T_B|$, where $T_T$ is the value of the statistic for the target model, $T_B$ is the value for a baseline model, and $E$ is the expected value of $T_T$ if the target model is true. Note first that $E$ may not be a single quantity: Different values may be obtained depending on additional assumptions, such as on the distribution of the variables. As a result, the formula can give more than one Type 2 index for any given absolute index. In addition, the absolute values in the formula have the effect that their Type 2 indices must be nonnegative; however, they state that an index called the Tucker-Lewis Index (TLI; discussed later in text) is a Type 2 index. This is obviously not true because TLI can be negative.

Table 1 contains algebraic definitions, properties, and citations for all fit indices considered in this article.

Type 2 and Type 3 indices are based on an assumed distribution of variables and other standard regularity conditions. A Type 2 index additionally uses information from the expected values of $T_T$ under the central chi-square distribution. It assumes that the chi-square estimator of a valid target model follows an asymptotic chi-square distribution with a mean of $df_T$, where $df_T$ is the degrees of freedom for a target model. Hence, the baseline fit $T_B$ is compared with $df_T$, and the denominator in the Type 1 index is replaced by $(T_B - df_T)$. Thus, a general form of such indices can be written as Type 2 incremental fit index = $|T_B - T_T|/(T_B - df_T)$. On the basis of the work of Tucker and Lewis (1973), Bentler and Bonett (1980) called such indices nonnormed fit indices, because they need not have a 0-1 range even if $T_B \geq T_T$. We study their index (NNFI or TLI) and a related index developed by Bollen (1989; BL89).

A Type 3 index uses Type 1 information but additionally uses information from the expected values of $T_T$ or $T_B$, or both, under the relevant noncentral chi-square distribution. A noncentrality fit index usually involves first defining a population-fit-index parameter and then using estimators of this parameter to define the sample-fit index (Bentler, 1990; McDonald, 1989; McDonald & Marsh, 1990; Steiger, 1989). When the assumed distributions are correct, Type 2 and Type 3 indices should perform better than Type 1 indices because more information is being used. We study Bentler's (1989, 1990) and McDonald and Marsh's (1990) relative noncentrality index (RNI) and Bentler's comparative fit index (CFI). Note also that Type 2 and Type 3 indices may use inappropriate information, because any particular $T$ may not have the distributional form assumed. For example, Type 3 indices make use of the noncentral chi-square distribution for $T_B$, but one could seriously question whether this is generally its appropriate reference distribution. We also study several absolute-fit indices. These include the goodness-of-fit (GFI) and adjusted-GFI (AGFI) indices (Bentler, 1983; Jöreskog & Sörbom, 1984; Tanaka & Huba, 1985); Steiger's (1989) gamma hat; a rescaled version of Akaike's information criterion (CAK; Cudeck & Browne, 1983); a cross-validation index (CK; Browne & Cudeck, 1989); McDonald's (1989) centrality index (Mc); Hoelter's (1983) critical N (CN); a standardized version of Jöreskog and Sörbom's (1981) root-mean-square residual (SRMR; Bentler, 1995); and the RMSEA (Steiger & Lind, 1980).

Issues in Assessing Fit by Fit Indices

There are four major problems involved in using fit indices for evaluating goodness of fit: sensitivity of a fit index to model misspecification, small-sample bias, estimation-method effect, and effects of violation of normality and independence. The issue on sensitivity of fit index to model misspecification has long been overlooked and thus deserves careful examination. The other three issues are a natural consequence of the fact that these indices typically are based on chi-square tests: A fit index will perform better when its corresponding chi-square test performs well. Because, as noted above, these chi-square tests may not perform adequately at all sample sizes and also because the adequacy of a chi-square statistic may depend on the particular assumptions it requires about the distributions of variables, these same factors can be expected to influence evaluation of model fit.

Sensitivity of Fit Index to Model Misspecification

Among various sources of effects on fit indices, the sensitivity of fit indices to model misspecification (Gerbing & Anderson, 1993; i.e., the effect of model misspecification) has not been adequately studied because of the intensive computational requirements. A correct specification implies that a population exactly matches the hypothesized model and also that the parameters estimated in a sample reflect this structure. On the other hand, a model is said to be misspecified when (a) one or more parameters are estimated whose population values are zeros (i.e., an overparameterized misspecified model), (b) one or more parameters are fixed to zeros whose population values are nonzeros (i.e., an underparameterized misspecified model), or both. In the very few studies that have touched on such an issue, the results are often inconclusive due either to the use of an extremely small number of data sets (e.g., Marsh et al., 1988; Mulaik et al., 1989) or to the study of a very small number of fit indices under certain limited conditions (e.g., Bentler, 1990; La Du & Tanaka, 1989; Maiti & Mukherjee, 1991). For example, using a small number of simulated data sets, Marsh et al. (1988) reported that sample size was substantially associated with several fit indices under both true and false models. They showed also that the values of most of the absolute-

Table 1
Algebraic Definitions, Properties, and Citations for Incremental and Absolute-Fit Indices

| Algebraic definition | Property | Citation |
|---|---|---|
| Incremental fit indices | | |
| Type 1 | | |
| $\mathrm{NFI} = (T_B - T_T)/T_B$ | Normed (has a 0-1 range) | Bentler & Bonett (1980) |
| $\mathrm{BL86} = [(T_B/df_B) - (T_T/df_T)]/(T_B/df_B)$ | Normed (has a 0-1 range) | Bollen (1986) |
| Type 2 | | |
| $\mathrm{TLI}$ (or $\mathrm{NNFI}$) $= [(T_B/df_B) - (T_T/df_T)]/[(T_B/df_B) - 1]$ | Nonnormed (can fall outside the 0-1 range); compensates for the effect of model complexity | Tucker & Lewis (1973); Bentler & Bonett (1980) |
| $\mathrm{BL89} = (T_B - T_T)/(T_B - df_T)$ | Nonnormed; compensates for the effect of model complexity | Bollen (1989) |
| Type 3 | | |
| $\mathrm{RNI} = [(T_B - df_B) - (T_T - df_T)]/(T_B - df_B)$ | Nonnormed; noncentrality based | McDonald & Marsh (1990); Bentler (1989, 1990) |
| $\mathrm{CFI} = 1 - \max[(T_T - df_T), 0]/\max[(T_T - df_T), (T_B - df_B), 0]$ | Normed (has a 0-1 range); noncentrality based | Bentler (1989, 1990) |
| Absolute fit indices | | |
| $\mathrm{GFI}_{ML} = 1 - [\mathrm{tr}(\hat{\Sigma}^{-1}S - I)^2/\mathrm{tr}(\hat{\Sigma}^{-1}S)^2]$ | Has a maximum value of 1.0; can be less than 0 | Jöreskog & Sörbom (1984) |
| $\mathrm{AGFI}_{ML} = 1 - [p(p + 1)/(2\,df_T)](1 - \mathrm{GFI}_{ML})$ | Has a maximum value of 1.0; can be less than 0 | Jöreskog & Sörbom (1984) |
| Gamma hat $= p/\{p + 2[(T_T - df_T)/(N - 1)]\}$ | Has a known distribution; noncentrality based | Steiger (1989) |
| $\mathrm{CAK} = [T_T/(N - 1)] + [2q/(N - 1)]$ | Compensates for the effect of model complexity | Cudeck & Browne (1983) |
| $\mathrm{CK} = [T_T/(N - 1)] + [2q/(N - p - 2)]$ | Compensates for the effect of model complexity | Browne & Cudeck (1989) |
| $\mathrm{Mc} = \exp\{-\tfrac{1}{2}[(T_T - df_T)/(N - 1)]\}$ | Noncentrality based; typically has the 0-1 range (but it may exceed 1) | McDonald (1989) |
| $\mathrm{CN} = \{(z_{crit} + \sqrt{2\,df_T - 1})^2/[2T_T/(N - 1)]\} + 1$ | A CN value exceeding 200 indicates a good fit of a given model | Hoelter (1983) |
| $\mathrm{SRMR} = \sqrt{\{2\sum_{i=1}^{p}\sum_{j=1}^{i}[(s_{ij} - \hat{\sigma}_{ij})/(s_{ii}s_{jj})]^2\}/p(p + 1)}$ | Standardized root-mean-square residual | Jöreskog & Sörbom (1981); Bentler (1995) |
| $\mathrm{RMSEA} = \sqrt{\hat{F}_0/df_T}$, where $\hat{F}_0 = \max[(T_T - df_T)/(N - 1), 0]$ | Has a known distribution; compensates for the effect of model complexity; noncentrality based | Steiger & Lind (1980); Steiger (1989) |

Note. NFI = normed fit index; $T_B$ = T statistic for the baseline model; $T_T$ = T statistic for the target model; BL86 = fit index by Bollen (1986); $df_B$ = degrees of freedom for the baseline model; $df_T$ = degrees of freedom for the target model; TLI = Tucker-Lewis index (1973); NNFI = nonnormed fit index; BL89 = fit index by Bollen (1989); RNI = relative noncentrality index; CFI = comparative fit index; GFI = goodness-of-fit index; ML = maximum likelihood; tr = trace of a matrix; AGFI = adjusted-goodness-of-fit index; CAK = a rescaled version of Akaike's information criterion; q = no. parameters estimated; CK = cross-validation index; Mc = McDonald's centrality index; CN = critical N; $z_{crit}$ = critical z value at a selected probability level; SRMR = standardized root-mean-square residual; $s_{ij}$ = observed covariances; $\hat{\sigma}_{ij}$ = reproduced covariances; $s_{ii}$ and $s_{jj}$ = observed standard deviations; RMSEA = root-mean-square error of approximation. The formulas for generalized least squares and asymptotic distribution-free versions of GFI and AGFI are shown in Hu and Bentler (1997).
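As a rough illustration of how the chi-square- and noncentrality-based definitions in Table 1 translate into computation, the following sketch evaluates several of them from summary statistics alone. The function name `fit_indices`, its argument names, and the example values are hypothetical. GFI, AGFI, CN, and SRMR are omitted because they require the covariance matrices or a critical value. Returning 1 for CFI when its denominator is zero is an assumption made here to avoid division by zero and is not part of the table.

```python
import math


def fit_indices(t_b, df_b, t_t, df_t, n, p, q):
    """Compute selected Table 1 indices from summary statistics.

    t_b, df_b: T statistic and degrees of freedom of the baseline model
    t_t, df_t: T statistic and degrees of freedom of the target model
    n: sample size; p: number of observed variables; q: number of estimated parameters
    """
    nc_t = t_t - df_t  # estimated noncentrality of the target model
    nc_b = t_b - df_b  # estimated noncentrality of the baseline model
    cfi_denominator = max(nc_t, nc_b, 0.0)
    f0 = max(nc_t / (n - 1), 0.0)
    return {
        "NFI": (t_b - t_t) / t_b,
        "BL86": ((t_b / df_b) - (t_t / df_t)) / (t_b / df_b),
        "TLI": ((t_b / df_b) - (t_t / df_t)) / ((t_b / df_b) - 1.0),
        "BL89": (t_b - t_t) / (t_b - df_t),
        # RNI is undefined when the baseline noncentrality estimate is zero.
        "RNI": (nc_b - nc_t) / nc_b,
        # Assumption: CFI is set to 1 when the denominator is zero.
        "CFI": 1.0 - max(nc_t, 0.0) / cfi_denominator if cfi_denominator > 0 else 1.0,
        "gamma_hat": p / (p + 2.0 * nc_t / (n - 1)),
        "CAK": t_t / (n - 1) + 2.0 * q / (n - 1),
        "CK": t_t / (n - 1) + 2.0 * q / (n - p - 2),
        "Mc": math.exp(-0.5 * nc_t / (n - 1)),
        "RMSEA": math.sqrt(f0 / df_t),
    }


# Hypothetical example: 15 observed variables, 40 free parameters, N = 201.
example = fit_indices(t_b=1000.0, df_b=105, t_t=100.0, df_t=80, n=201, p=15, q=40)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

In this example RNI and CFI coincide because both noncentrality estimates are positive; they differ only when truncation at zero applies, which is the normed versus nonnormed distinction listed in the table.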

and Type 2 fit indices derived from true models were significantly greater than those derived from false models. La Du and Tanaka (1989, Study 2) studied the effects of both overparameterized and underparameterized model misspecification (both with misspecified path[s] between observed variables) on the ML- and GLS-based GFI and NFI. No significant effect of overparameterized model misspecification on these fit indices was found. A very small but significant effect of underparameterized model misspecification was observed for some of these fit indices (i.e., the ML-based NFI and ML-/GLS-based GFI). The ML-based NFI also was found to be more sensitive to this type of model misspecification than was the ML- and GLS-based GFI. Marsh, Balla, and Hau (1996) found that degrees of model misspecification accounted for a large proportion of variance in NFI, BL86, TLI, BL89, RNI, and CFI. Although their study included several substantially misspecified models, their analyses failed to reveal the degree of sensitivity of these fit indices for a less misspecified model. In our study, the sensitivity of various fit indices to model misspecification, after controlling for other sources of effects, is examined.

Small-Sample Bias

Estimation methods in structural equation modeling are developed under various assumptions. One is that the model $\Sigma = \Sigma(\theta)$ is true. Another is the assumption that estimates and tests are based on large samples, which will not actually obtain in practice. The adequacy of the test statistics is thus likely to be influenced by sample size, perhaps performing more poorly in smaller samples that cannot be considered asymptotic enough. In fact, the relation between sample size and the adequacy of a fit index when the model is true has long been recognized; for example, Bearden, Sharma, and Teel (1982) found that the mean of NFI is positively related to sample size and that NFI values tend to be less than 1.0 when sample size is small. Their early results pointed out the main problem: possible systematic fit-index bias.

If the mean of a fit index, computed across various samples under the same condition when the model is true, varies systematically with sample size, such a statistic will be a biased estimator of the corresponding population parameter. Thus, the decision for accepting or rejecting a particular model may vary as a function of sample size, which is certainly not desirable. The general finding seems to be a positive association between sample size and the goodness-of-fit index size for Type 1 incremental fit indices. Obviously, Type 1 incremental indices will be influenced by the badness of fit of the null model as well as the goodness of fit of the target model, and Marsh et al. (1988) have reported this type of effect. On the other hand, the Type 2 and Type 3 indices seem to be substantially less biased. The results on absolute indices are mixed.

A few key studies can be mentioned. Bollen (1986, 1989, 1990) found that the means of the sampling distributions of NFI, BL86, GFI, and AGFI tended to increase with sample size. Anderson and Gerbing (1984) and Marsh et al. (1988) showed that the means of the sampling distributions of GFI and AGFI were positively associated with sample size whereas the association between TLI and sample size was not substantial. Bentler (1990) also reported that TLI (and NNFI) outperformed NFI on average; however, the variability of TLI (and NNFI) at a small sample size (e.g., N = 50) was so large that in many samples, one would suspect model incorrectness and, in many other samples, overfitting. Cudeck and Browne (1983) and Browne and Cudeck (1989) found that CAK and CK improved as sample size increased. Bollen and Liang (1988) showed that Hoelter's (1983) CN increased as sample size increased. McDonald (1989) reported that the value of Mc was consistent across different sample sizes. Anderson and Gerbing (1984) found that the mean values of RMR (the unstandardized root-mean-square residual; Jöreskog & Sörbom, 1981) were related to the sample size. J. Anderson, Gerbing, and Narayanan (1985) further reported that the mean values of RMR were related to the sample size and model characteristics, such as the number of indicators per factor, the number of factors, and indicator loadings. In one of the major studies that investigated the effect of sample size on the older fit indices, Marsh et al. (1988) found that many indices were biased estimates of their corresponding population parameters when sample size was finite. GFI appeared to perform better than any other stand-alone index (e.g., AGFI, CAK, CN, or RMR) studied by them. GFI also underestimated its asymptotic value to a lesser extent than did NFI.

The Type 2 and Type 3 incremental fit indices, in general, perform better than either the absolute or Type 1 incremental indices. This is true for the older indices such as TLI, as noted above, but appears to be especially true for the newer indices based on noncentrality. For example, Bentler (1990) reported that FI (called RNI in this article), CFI, and IFI (called

BL89 in this article) performed essentially with no bias, though by definition CFI must be somewhat downward biased to avoid out-of-range values greater than 1, which can occur with FI. The bias, however, is trivial, and it gains lower sampling variability in the index. The relation of RNI to CFI has been spelled out in more detail by Goffin (1993), who prefers RNI to CFI for model-comparison purposes.

Estimation-Method Effects

As noted above, the three major problems involved in using fit indices are a natural consequence of the fact that these indices typically are based on chi-square tests. This rationale is elaborated through a brief review of the ML, GLS, and ADF estimation methods, as well as their relationships to the chi-square statistics. For a more technical review of each method, readers are encouraged to consult Hu et al. (1992), Bentler and Dudgeon (1996), or, especially, the original sources.

Estimation methods such as ML and GLS in covariance structure analysis are traditionally developed under multivariate normality assumptions (e.g., Bollen, 1989; Browne, 1974; Jöreskog, 1969). A violation of multivariate normality can seriously invalidate normal-theory test statistics. ADF methods therefore have been developed (e.g., Bentler, 1983; Browne, 1982, 1984) with the promising claim that the test statistics for model fit are insensitive to the distribution of the observations when the sample size is large. However, empirical studies using Monte Carlo procedures have shown that when sample size is relatively small or model degrees of freedom are large, the chi-square goodness-of-fit test statistic based on the ADF method may be inadequate (Chou et al., 1991; Curran et al., 1996; Hu et al., 1992; Muthén & Kaplan, 1992; Yuan & Bentler, 1997).

The recent development of a theory for the asymptotic robustness of normal-theory methods offers hope for the appropriate use of normal-theory methods even under violation of the normality assumption (e.g., Amemiya & Anderson, 1990; T. W. Anderson & Amemiya, 1988; Browne, 1987; Browne & Shapiro, 1988; Mooijaart & Bentler, 1991; Satorra & Bentler, 1990, 1991). The purpose of this line of research is to determine under what conditions normal-theory-based methods such as ML or GLS can still correctly describe and evaluate a model with nonnormally distributed variables. The conditions are technical but require the very strong condition that the latent variables (common factors or errors) that are typically considered as simply uncorrelated must actually be mutually independent, and common factors, when correlated, must have freely estimated variance-covariance parameters. Independence exists when normally distributed variables are uncorrelated. However, when nonnormal variables are uncorrelated, they are not necessarily independent. If the robustness conditions are met in large samples, normal-theory ML and GLS test statistics still hold, even when the data are not normal. Unfortunately, because the data-generating process is unknown for real data, one cannot generally know whether the independence of factors and errors, or of the errors themselves, holds, and thus, the practical application of asymptotic robustness theory is unclear.

Although Hu et al. (1992) have examined the adequacy of six chi-square goodness-of-fit tests under various conditions, not much is known about estimation effects on fit indices. Even if the distributional assumptions are met, different estimators yield chi-square statistics that perform better or worse at various sample sizes. This may translate into differential performance of fit indices based on different estimators. However, the overall effect of mapping from chi-square to fit index, while varying estimation method, is unclear. In pioneering work, Tanaka (1987) and La Du and Tanaka (1989) have found that given the same model and data, NFI behaved erratically across ML and GLS estimation methods. On the other hand, they reported that GFI behaved consistently across the two estimation methods. Their results must be due to the differential quality of the null model chi-square used in the NFI but not the GFI computations.² On the basis of these results, Tanaka and Huba (1989) have suggested that GFI is more appropriate than NFI in finite samples and across different estimation methods. Using a large empirical data set, Sugawara and MacCallum (1993) have found that absolute-fit indices (i.e., GFI and RMSEA) tend to behave more consistently across estimation methods than do incremental fit indices (i.e., NFI, TLI, BL86, and BL89). This phenomenon is especially evident when there is a good fit between the hypothesized model and the observed data. As the degree of fit between hypothesized models and observed data decreases, GFI and RMSEA behave less consistently

² Earlier versions of EQS also incorrectly computed the null model chi-square under GLS, thus affecting all incremental indices.

across estimation methods. Sugawara and MacCallum have stated that the effect of estimation methods on fit is tied closely to the nature of the weight matrices used by the estimation methods. Ding, Velicer, and Harlow (1995) found that all fit indices they studied, except the TLI, were affected by estimation method.

Effects of Violation of Normality and Independence

An issue related to the adequacy of fit indices that has not been studied is the potential effect of violation of assumptions underlying estimation methods, specifically, violation of distributional assumptions and the effect of dependence of latent variates. The dependence condition is one in which two or more variables are functionally related, even though their linear correlations may be exactly zero. Of course, with normal data, a linear correlation of zero implies independence. Nothing is known about the adequacy of fit indices under conditions such as dependency among common and unique latent variates, along with violations of multivariate normality, at various sample sizes.

Study Questions and Performance Criteria

This study investigates several critical issues related to fit indices. First, the sensitivity of various incremental and absolute-fit indices derived from ML, GLS, and ADF estimation methods to underparameterized model misspecification is investigated. Two types of underparameterized model misspecification are studied: simple misspecified models (i.e., models with misspecified factor covariance[s]) and complex misspecified models (i.e., models with misspecified factor loading[s]). Second, the stability of various fit indices across ML, GLS, and ADF methods (i.e., the effect of estimation method on fit indices) is studied. Third, the performance of these fit indices, derived from the ML, GLS, and ADF estimators under the following three ways of violating theoretical conditions, is examined: (a) Distributional assumptions are violated, (b) assumed independence conditions are violated, and (c) asymptotic sample-size requirements are violated. Our primary goals are to recommend fit indices that perform the best overall and to identify those that perform poorly. Good fit indices should be (a) sensitive to model misspecification and (b) stable across different estimation methods, sample sizes, and distributions. Finally, attempts are also made to evaluate the "rule of thumb" conventional cutoff criterion for a given fit index (Bentler & Bonett, 1980), which has been used in practice to evaluate the adequacy of models.

Method

Two types of confirmatory factor models (called simple model and complex model), each of which can be expressed as $x = \Lambda\xi + \varepsilon$, were used to generate measured variables $x$ under various conditions on the common factors $\xi$ and unique variates (errors) $\varepsilon$. That is, the vector of observed variables ($x$s) was a weighted function of a common-factor vector ($\xi$) with weights given by the factor-loading matrix, $\Lambda$, plus a vector of error variates ($\varepsilon$). The measured variables for each model were generated by setting certain restrictions on the common factors and unique variates. Several properties are noted in the usual application of these types of factor analytic approaches. First, factors are allowed to be correlated and have a covariance matrix, $\Phi$. Second, errors are uncorrelated with factors. Third, various error variates are uncorrelated and have a diagonal covariance matrix, $\Psi$. Consequently, the hypothesized model can be expressed as $\Sigma = \Sigma(\theta) = \Lambda\Phi\Lambda' + \Psi$, and the elements of $\theta$ are the unknown parameters in $\Lambda$, $\Phi$, and $\Psi$.

Study Design

Simple and complex models are both confirmatory factor analytic models based on 15 observed variables with three common factors. Although many other model types are possible, most models used in practice involve latent variables, and the confirmatory factor model is most representative of such models. For example, variants of confirmatory factor models have been the typically studied models in the new journal Structural Equation Modeling, in the special section on "Structural Equation Modeling in Clinical Research" (Hoyle, 1994) published in the Journal of Consulting and Clinical Psychology, and in the larger models among the approximately two dozen modeling articles published in the Journal of Personality and Social Psychology (JPSP) during 1995. In practice, correlations among factors may be replaced by hypothesized paths, and correlated residuals may be added. Such models also form the basis of many recent simulation studies (e.g., Curran et al., 1996; Ding et al., 1995; Marsh et al., 1996). It is important to choose a number of variables that is not too small (e.g., Hu et al., 1992) yet remains practical in the context of a large simulation. We chose a number larger than the median number of variables (9-10)
used in JPSP's 1995 modeling studies but smaller than many ambitious studies (e.g., Hoyle's special section contains five studies with 24 or more variables). We also chose distributional conditions and sample sizes to cover a wider range of practical relevance. Figure 1 displays the structures of true-population and misspecified models used in this study.

The factor-loading matrix (transposed) $\Lambda'$ for the simple model had the following structure:

$$
\Lambda' = \begin{bmatrix}
.70 & .70 & .75 & .80 & .80 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 \\
.00 & .00 & .00 & .00 & .00 & .70 & .70 & .75 & .80 & .80 & .00 & .00 & .00 & .00 & .00 \\
.00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .70 & .70 & .75 & .80 & .80
\end{bmatrix}
$$

The structure of the factor-loading matrix (transposed) $\Lambda'$ for the complex model was as follows:

$$
\Lambda' = \begin{bmatrix}
.70 & .70 & .75 & .80 & .80 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .00 \\
.00 & .00 & .00 & .70 & .00 & .70 & .70 & .75 & .80 & .80 & .00 & .00 & .00 & .00 & .00 \\
.70 & .00 & .00 & .00 & .00 & .00 & .00 & .00 & .70 & .00 & .70 & .70 & .75 & .80 & .80
\end{bmatrix}
$$

For both the simple model and complex model, variances of the factors were 1.0, and the covariances among the three factors were 0.30, 0.40, and 0.50. The unique variances were taken as values that would yield unit-variance measured variables under normality for the simple model. For the complex model, the unique variances were taken as values that would yield unit variance for most measured variables (except for the 1st, 4th, and 9th observed variables in the model) under normality. The unique variances for the 1st, 4th, and 9th observed variables were 0.51, 0.36, and 0.36, respectively. In estimation, the factor loading of the last indicator of each factor was fixed for identification at 0.80, and the remaining nonzero parameters were free to be estimated.

Two hundred replications (samples) of a given sample size were drawn from a known population model in each of the seven distributional conditions as defined by Hu et al. (1992). The first was a baseline distributional condition involving normality, the next three involved nonnormal variables that were independently distributed when uncorrelated, and the final three distributional conditions involved nonnormal variables that, although uncorrelated, remained dependent.

Distributional Condition 1. The factors and errors and hence measured variables are multivariate normally distributed.

Distributional Condition 2. Nonnormal factors and errors, when uncorrelated, are independent, but asymptotic robustness theory does not hold because the covariances of common factors are not free parameters. The true excess kurtoses for the nonnormal factors in the population are -1.0, 2.0, and 5.0. The true excess kurtoses for the unique variates are -1.0, 0.5, 2.5, 4.5, 6.5, -1.0, 1.0, 3.0, 5.0, 7.0, -0.5, 1.5, 3.5, 5.5, and 7.5.

Distributional Condition 3. Nonnormal factors and errors are independent but not multivariate normally distributed. The true kurtoses for the factors and unique variates are identical to those in Distributional Condition 2.

Distributional Condition 4. The errors and hence the measured variables are not multivariate normally distributed. The true kurtoses for the unique variates are identical to those in Distributional Conditions 2 and 3, but the true kurtoses for the factors are set to zero.

Distributional Condition 5. An elliptical distribution: Factors and errors are uncorrelated but dependent on each other.

Distributional Condition 6. The errors and hence the measured variables are not multivariate normally distributed, and both factors and errors are uncorrelated but dependent on each other.

Distributional Condition 7. Nonnormal factors and errors are uncorrelated but dependent on each other.

In Distributional Conditions 5-7, the factors and error variates were divided by a random variable, $z = [\chi^2(5)]^{1/2}/\sqrt{3}$, that was distributed independently of the original common and unique factors. The division was made so that the variances and covariances of the factors remained unchanged but the kurtoses of the factors and errors became modified. As a consequence of this division, the factors and errors were
uncorrelated but dependent on each other. Because of the dependence, asymptotic robustness of normal-theory statistics was not to be expected under Distributional Conditions 5-7. To provide some idea about the degree of nonnormality of the factors and unique variates in Distributional Conditions 5-7 after the division, the empirical univariate kurtoses of the latent variables were computed across 5,000 × 200 = 1,000,000 observations. In Distributional Condition 5, the empirical kurtoses for the factors were 5.1, 6.0, and 5.5. The empirical kurtoses for the unique variates were 4.9, 6.0, 4.7, 4.5, 4.9, 6.1, 5.7, 5.2, 4.3, 4.8, 5.9, 4.8, 5.1, 4.8, and 5.1. In Distributional Condition 6, the empirical kurtoses for the factors were 5.1, 6.0, and 5.5. The empirical kurtoses for the unique variates were 2.6, 7.5, 10.4, 14.0, 19.3, 3.2, 9.5, 11.6, 15.1, 19.9, 4.4, 8.2, 14.2, 19.2, and 28.3. In Distributional Condition 7, the empirical kurtoses for the factors were 2.5, 18.0, and 2.14. The empirical kurtoses for the unique variates were 2.6, 7.5, 10.4, 14.0, 19.3, 3.2, 9.5, 11.6, 15.1, 19.9, 4.4, 8.2, 14.2, 19.2, and 28.3. Note that the empirical kurtoses for factors and unique variates in Distributional Conditions 1-4 were very close to the true kurtoses specified in these distributional conditions. By means of modified simulation procedures in EQS (Bentler & Wu, 1995b) and SAS program (SAS Institute, 1993), the various fit indices based on ML, GLS, and ADF estimation methods were computed in each sample.³

³ BL86, BL89, RNI, gamma hat, CAK, CK, Mc, CN, and RMSEA were computed by SAS programs.

[Figure 1: path diagram not reproduced.]

Figure 1. Structures of true-population and misspecified models used in this study. Solid lines (except solid line e) represent parameters that exist in the simple true-population model and both simple misspecified models, 1 and 2; dashed line a represents the parameter that exists in the simple true-population model but was omitted from both simple misspecified models, 1 and 2; dashed line b represents the parameter that exists in the simple true-population model but was omitted from simple Misspecified Model 2 only. Solid lines (including solid line e) and dashed lines (a and b) represent parameters that exist in the complex true-population model and both complex misspecified models, 1 and 2; dashed and dotted line c represents the parameter that exists in the complex true-population model but was omitted from both complex misspecified models, 1 and 2; dashed and dotted line d represents the parameter that exists in the complex true-population model but was omitted from Complex Misspecified Model 2 only. V = observed variable.

Specification of Models and Procedure

For each type of model (i.e., simple or complex), one true-population model and two misspecified models were used to examine the degree of sensitivity to model misspecification of various fit indices.

True-population model. The performance of four types of fit indices, derived from ML, GLS, and ADF estimation methods, was examined under the above-mentioned seven distributional conditions. A sample of a given size was drawn from the population, and the model was estimated in that sample. The results were saved, and the process was repeated for 200 replications. This process was repeated for sample sizes 150, 250, 500, 1,000, 2,500, and 5,000. In all, there were 7 (distributions) × 6 (sample sizes) × 200 (replications) = 8,400 samples. The fit indices based on ML, GLS, and ADF methods were calculated for each of these
samples. This procedure was conducted for simple and complex models separately.

Misspecified models. Although both underparameterized and overparameterized models were considered as incorrectly specified models, our study only examined the sensitivity of fit indices to underparameterization. For a simple model, the covariances among the three factors in the correctly specified population model (true-population model) were nonzero (see Figure 1). The covariance between Factors 1 and 2 (Covariance a in Figure 1) was fixed to zero for Simple Misspecified Model 1. The covariances between Factors 1 and 2, as well as between Factors 1 and 3 (Covariances a and b) were fixed to zero for Simple Misspecified Model 2. For a complex model, three observed variables loaded on two factors in the true-population model: (a) The first observed variable loaded on Factors 1 and 3, (b) the fourth observed variable loaded on Factors 1 and 2, and (c) the ninth observed variable loaded on Factors 2 and 3 (see Figure 1). In Complex Misspecified Model 1, the first observed variable loaded only on Factor 1 (Omitted Path c), whereas the rest of the model specification remained the same as the complex true-population model. In Complex Misspecified Model 2, the first and fourth observed variables loaded only on Factor 1 (Omitted Paths c and d).

Using the design parameters specified in either the simple or complex true-population model, a sample of a given size was drawn from the population, and each of the misspecified models was estimated in that sample. That is, the data for a given sample size were generated based on the structure specified by a true-population (correct) model, and then the goodness-of-fit between a misspecified model and the generated data was tested. For each misspecified model, there were 7 (distributions) × 6 (sample sizes) × 200 (replications) = 8,400 samples. The fit indices based on ML, GLS, and ADF methods were calculated for each of these samples.

Results

The adequacy of the simulation procedure and the characteristics specified in each distributional condition were verified by Hu et al. (1992), and thus are not discussed here. The overall mean distances (OMDs) between observed fit index values and the corresponding expected fit index values for the true-population models were calculated for each fit index and are tabulated in Table 2.⁴ Separate correlation matrices among fit indices derived from ML, GLS, and ADF methods also were obtained, to determine empirically which subset of fit indices might have similar characteristics. Results are shown in Table 3. A series of analyses of variance (ANOVAs) were conducted for each fit index obtained for the simple and complex models. The $\eta^2$s, indicating the proportion of variance in each fit index accounted for by each predictor variable or interaction term, are presented in Tables 4 through 9. Note that the $\eta^2$ reported in this article is equivalent to $R^2$ (Hays, 1988, p. 369) and was calculated by dividing the Type 3 sum of squares for a given predictor or interaction term by the corrected total sum of squares (i.e., corrected total variance).⁵ In addition, a statistical summary of the mean value and standard deviation of each fit index across the 200 replications and the empirical rejection frequency (for all but CAK and CK) based on rules of thumb were tabulated by distribution, sample size, and estimation method. Tables for the statistical summary for all fit indices are included in our technical report (Hu & Bentler, 1997).

⁴ The overall coefficient of variation, which is defined as the mean of a distribution divided by its standard deviation, also was calculated for each fit index derived from ML, GLS, and ADF estimation methods. The conclusions regarding the performance of fit indices based on the mean distance and coefficients of variation were similar. However, the overall mean distance provided a much better index when compared across fit indices with different expected values (i.e., 0 and 1) for a true-population model and thus is reported in this article.

⁵ We calculated $\eta^2$ values to determine the relative contribution of each main effect and interaction term. Given the very large sample size, significance tests would not be informative. Although our mixed-model ANOVA designs included a repeated measure (i.e., model misspecification or estimation method), we always used the total variance as the denominator in our calculations, so that all effects were in a common metric and are therefore directly comparable. This approach can underestimate the effect sizes for the repeated measures effects in mixed-model designs (Dodd & Schultz, 1973), and alternative approaches have been suggested (e.g., Dodd & Schultz, 1973; Dwyer, 1974; Kirk, 1995; Vaughan & Corballis, 1969); however, these approaches make comparison of between- and within-subjects estimates difficult because they are in different metrics. In our study, the error components were extremely small, and the sample size was very large, so that any advantage of using one of these alternative approaches would be negligible (see Sechrest & Yeaton, 1982).
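
The $\eta^2$ definition given in the Results section (effect sum of squares divided by the corrected total sum of squares) can be illustrated with a small sketch. This is not the authors' SAS code: it assumes a balanced two-factor design, in which Type 3 sums of squares coincide with the classical ones, whereas the study used three-factor designs. The function name `eta_squared` and the toy data are hypothetical.

```python
from statistics import fmean


def eta_squared(cells):
    """Eta-squared for a balanced two-factor design (illustrative assumption).

    cells[i][j] is the list of replicate values observed at level i of
    factor A and level j of factor B. Every effect is divided by the same
    corrected total sum of squares, so all effects share a common metric.
    """
    a, b, r = len(cells), len(cells[0]), len(cells[0][0])
    values = [v for row in cells for cell in row for v in cell]
    grand = fmean(values)
    ss_total = sum((v - grand) ** 2 for v in values)

    # Marginal and cell means.
    mean_a = [fmean([v for cell in cells[i] for v in cell]) for i in range(a)]
    mean_b = [fmean([v for i in range(a) for v in cells[i][j]]) for j in range(b)]
    mean_ab = [[fmean(cells[i][j]) for j in range(b)] for i in range(a)]

    # Classical sums of squares for main effects and the interaction.
    ss_a = b * r * sum((m - grand) ** 2 for m in mean_a)
    ss_b = a * r * sum((m - grand) ** 2 for m in mean_b)
    ss_ab = r * sum(
        (mean_ab[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    return {"A": ss_a / ss_total, "B": ss_b / ss_total, "AxB": ss_ab / ss_total}


# Toy data: factor A (e.g., model misspecification) dominates factor B.
toy = [
    [[1.0, 1.2], [1.1, 0.9]],
    [[2.0, 2.2], [2.1, 1.9]],
]
result = eta_squared(toy)
```

Because every effect shares the total sum of squares as its denominator, the values for different effects can be compared directly, which is the property footnote 5 emphasizes.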

Table 2
Overall Mean Distances Between Observed Fit-Index Values and the Corresponding True Values for Each Fit Index Under Simple and Complex True-Population Models

                  Simple model            Complex model
Fit index       ML     GLS    ADF       ML     GLS    ADF
NFI            .058   .237   .187      .047   .227   .175
BL86           .069   .284   .223      .058   .281   .216
TLI            .035   .132   .125      .029   .131   .115
BL89           .028   .102   .101      .023   .096   .090
RNI            .029   .110   .105      .023   .105   .093
CFI            .029   .106   .105      .023   .101   .093
GFI            .054   .050   .058      .052   .048   .054
AGFI           .075   .069   .079      .074   .069   .077
Gamma hat      .026   .016   .046      .025   .016   .042
CAK            .660   .585   .869      .663   .591   .832
CK             .681   .606   .890      .687   .614   .855
Mc             .092   .059   .156      .088   .057   .141
SRMR           .038   .053   .110      .035   .049   .114
RMSEA          .035   .028   .047      .034   .028   .045

Note. Mean distance = $\sqrt{\left[\sum(\text{observed fit-index value} - \text{true fit-index value})^2\right]/(\text{no. of observed fit indexes})}$. ML = maximum likelihood; GLS = generalized least squares; ADF = asymptotic distribution-free method; NFI = normed fit index; TLI = Tucker-Lewis Index (1973); BL86 = fit index by Bollen (1986); BL89 = fit index by Bollen (1989); RNI = relative noncentrality index; CFI = comparative fit index; GFI = goodness-of-fit index; AGFI = adjusted goodness-of-fit index; CAK = a rescaled version of Akaike's information criterion; CK = cross-validation index; Mc = McDonald's centrality index; CN = critical N; SRMR = standardized root-mean-square residual; RMSEA = root-mean-square error of approximation. Smallest value in each column is italicized. CN methods were not applicable.
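
The mean-distance formula in the table note can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' program; the function name `overall_mean_distance` and the sample values are hypothetical.

```python
import math


def overall_mean_distance(observed, true_value):
    """Root of the mean squared departure of observed index values from the true value.

    observed: fit-index values, one per simulated sample (8,400 in the study).
    true_value: expected value under a correct model (e.g., 1 for NFI, 0 for RMSEA).
    """
    if not observed:
        raise ValueError("observed must contain at least one value")
    squared = sum((x - true_value) ** 2 for x in observed)
    return math.sqrt(squared / len(observed))


# Hypothetical NFI values from three samples; NFI equals 1 under a correct model.
omd_example = overall_mean_distance([1.0, 0.9, 0.8], 1.0)
```

Smaller values indicate that an index stays closer to its expected value under a correct model.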

Overall Mean Distance

The OMDs between observed fit-index values and the corresponding expected fit-index values for the simple and complex true-population models were calculated for each fit index derived from ML, GLS, and ADF estimation methods. For example, the mean distance for ML-based NFI of the simple true-population model was equal to $\sqrt{\left[\sum(\text{observed fit-index value} - 1)^2\right]/8{,}400}$. The smaller the mean distance, the better the fit index. The purpose for calculating the OMD was to gauge how likely and how much each fit index might depart from its true value under a correct model. Theoretically, these fit indices would equal their true values under correct models, and thus any departure from their values would indicate instability resulting from small sample size or violation of other underlying assumptions. For example, TLI or RNI would behave as a normed fit index asymptotically, but it could fall outside the 0-1 range when sample size was small or other underlying assumptions were violated. Thus, the OMD was a fair criterion for comparing the performance of fit indices under true-population (correct) models, although one might argue that it was an unfair comparison because the ranges of fit indices differ (in fact, this only occurs under some unusual conditions such as small sample size). Table 2 contains the OMDs between the observed fit-index values and the corresponding expected fit-index values. Overall, the values of the ML-based TLI, BL89, RNI, CFI, gamma hat, SRMR, and RMSEA were much closer to their corresponding true values than the other ML-based fit indices. The values of the GLS- or ADF-based GFI, gamma hat, and RMSEA as well as the GLS-based Mc and SRMR also were closer to their corresponding true values than the other GLS- or ADF-based fit indices. The distances for CAK and CK were always unacceptable.

Similarities in Performance of Fit Indices

Separate correlation matrices among fit indices derived from ML, GLS, and ADF methods for simple and complex models were obtained, to determine which fit indices might behave similarly. Each correlation matrix was calculated by collapsing across sample sizes, distributions, and model misspecifications, to determine if fit indices derived from ML, GLS, or ADF method for simple or complex models behaved similarly along three major dimensions: sample size, distribution, and model misspecification. The resulting patterns of correlations were identical;
thus, we further calculated separate overall correlation matrices across simple and complex models for ML, GLS, and ADF methods. Table 3 contains the correlations. Inspection of the correlation matrix for the ML-based fit indices revealed that there were two major clusters of correlated fit indices. NFI, BL86, GFI, AGFI, CAK, and CK were clustered with high correlations. Another cluster of high intercorrelations included TLI, BL89, RNI, CFI, Mc, and RMSEA. CN and SRMR were found to be least similar to the other ML-based fit indices. The same pattern was observed for the GLS-based fit indices. Finally, three clusters of ADF-based fit indices were observed in the correlation matrix. The first cluster included NFI, BL86, TLI, BL89, RNI, and CFI. The second cluster included CAK, CK, gamma hat, Mc, and RMSEA. The last cluster included GFI and AGFI. As with ML and GLS, CN and SRMR seemed to be less similar to the other ADF-based fit indices.

Sensitivity to Underparameterized Model Misspecification and Effects of Sample Size and Distribution

Our preliminary analyses indicated that values of most fit indices vary across different estimation methods; thus, we performed a series of ANOVAs separately for fit indices based on ML, GLS, and ADF methods, to determine if different patterns of effects of model misspecification, sample size, and distribution existed among the three estimation methods. Specifically, to examine the potential additive or multiplicative effects of model misspecification (i.e., sensitivity to underparameterized model misspecification) to the effect of sample size and distribution on fit indices, we performed a series of 6 × 7 × 3 (Sample Size × Distribution × Model Misspecification) ANOVAs on each of the ML-, GLS-, and ADF-based fit indices. Separate analyses were performed for simple and complex models, to determine if different types of model misspecification (i.e., models with misspecified factor covariance[s] and models with misspecified factor loadings) exerted differential effects on fit indices derived from ML, GLS, and ADF methods. The larger the amount of variance accounted for by model misspecification and the smaller the amount of variance accounted for by sample size and distribution, the better the fit index was considered to be. Tables 4 through 6 display the $\eta^2$ for each predictor variable and interaction term derived from the ANOVA performed on each fit index.

Analyses for simple models. For the ML- and GLS-based fit indices derived for simple models (see Tables 4 and 5), an extremely large proportion of variance in SRMR ($\eta^2$s = .914 and .859, respectively) and a moderate proportion of variance in TLI, BL89, RNI, CFI, gamma hat, Mc, and RMSEA were accounted for by model misspecification ($\eta^2$s ranged from .309 to .487). Inspection of the cell means suggested that the mean values of these fit indices derived from the two simple misspecified models were substantially different from those derived from the simple true-population model. Thus, these fit indices, especially SRMR, were more sensitive to simple misspecified models than the rest of the other fit indices. Model misspecification accounted for a substantial amount of variance ($\eta^2$ = .608) in the ADF-based SRMR and a moderate amount of variance ($\eta^2$s ranged from .389 to .516) in the ADF-based NFI, BL86, TLI, BL89, RNI, and CFI; thus, these ADF-based fit indices were more sensitive to simple misspecified models than the other fit indices (see Table 6).

Furthermore, sample size accounted for a substantial amount of variance ($\eta^2$s ranged from .605 to .882) in the ML- and GLS-based NFI, BL86, GFI, AGFI, CAK, and CK, after controlling for the effects of distribution, model misspecification, and their interaction terms. Distribution accounted for a relatively small proportion of variance in any of the ML- and GLS-based indices. Sample size accounted for a large proportion of variance ($\eta^2$s ranged from .674 to .877) in the ADF-based gamma hat, CAK, CK, Mc, and RMSEA. Sample size also accounted for a moderate proportion of variance ($\eta^2$ = .343) in the ADF-based CN. Distribution exerted a moderate effect on the ADF-based GFI and AGFI ($\eta^2$s = .373 and .382, respectively). Also, a moderate interaction effect between sample size and model misspecification on the ML-, GLS-, and ADF-based CN ($\eta^2$s ranged from .340 to .390) indicated that the sample-size effect was more substantial for the simple true-population model than for the two simple misspecified models.

Analyses for complex models. For the ML- and GLS-based fit indices derived for complex models (see Tables 4 and 5), a relatively large proportion of variance in TLI, BL89, RNI, CFI, gamma hat, Mc, and RMSEA ($\eta^2$s ranged from .699 to .766) was accounted for by model misspecification. A moderate amount of variance in ML- and GLS-based NFI and BL86 and the ML-based GFI and AGFI ($\eta^2$s ranged from .454 to .549) was accounted for by model misspecification. Model misspecification accounted for a
small-to-moderate amount of variance in the GLS-based GFI and AGFI ($\eta^2$s = .331 and .320, respectively). It accounted for a moderate to relatively large amount of variance in the ML- and GLS-based SRMR ($\eta^2$s = .653 and .588, respectively). Model misspecification accounted for a moderate to relatively large amount of variance ($\eta^2$s ranged from .593 to .667) in the ADF-based NFI, BL86, TLI, BL89, RNI, and CFI (see Table 6). Overall, all types of fit indices (except SRMR) seemed more sensitive in detecting the complex misspecified models (i.e., models with misspecified factor loading[s]) than the simple misspecified models (i.e., models with misspecified factor covariance[s]).⁶ SRMR was more sensitive in detecting the simple than the complex misspecified models, although the ability to detect complex misspecified models for the ML- and GLS-based SRMR remained reasonably high.

Sample size accounted for a small-to-large proportion of variance in the ML- and GLS-based NFI, BL86, GFI, AGFI, CAK, and CK ($\eta^2$s ranged from .293 to .792). Sample size also accounted for a substantial amount of variance in the ADF-based gamma hat, CAK, CK, Mc, and RMSEA ($\eta^2$s ranged from .541 to .827). Distribution accounted only for a moderate amount of variance in the ADF-based GFI and AGFI ($\eta^2$s = .409 and .422, respectively). A moderate interaction effect between sample size and model misspecification on the ML-, GLS-, and ADF-based CN ($\eta^2$s ranged from .352 to .401) also was observed, indicating that the sample-size effect was more substantial for the complex true-population model than for the two complex misspecified models.

Effects of Estimation Method, Distribution, and Sample Size on Fit Indices

To determine the importance of the additive and multiplicative effects of sample size, distribution, and estimation method on fit indices, we conducted a series of ANOVAs on fit indices derived from each of the simple and complex true-population models and misspecified models. These analyses were performed separately for simple and complex true-population models and misspecified models, to determine if the effect of estimation method after controlling for the effects of sample size and distribution varied as a function of model quality, as reported by Sugawara and MacCallum (1993). The results for simple and complex models were similar and hence are discussed

size, distribution, estimation method, and various interaction terms derived from each ANOVA. Note that the smaller the effects of sample size, distribution, and estimation method, the better was the fit index.

Analyses for simple and complex true-population models. The 6 × 7 × 3 (Sample Size × Distribution × Estimation Method) ANOVAs performed on the fit indices derived for the two types of true-population models revealed that sample size accounted for a substantial amount of variance in each of the following fit indices (see Table 7): NFI, BL86, GFI, AGFI, CAK, CK, and CN ($\eta^2$s ranged from .480 to .888). A small-to-moderate amount of variance was observed also for the other fit indices. The interaction between sample size and estimation method accounted for relatively small amounts of variance in NFI, BL86, TLI, BL89, RNI, CFI, gamma hat, Mc, and RMSEA ($\eta^2$s ranged from .102 to .266). Inspection of cell means revealed that NFI, BL86, TLI, BL89, RNI, and CFI behaved differently across estimation methods at small sample sizes, but they behaved consistently across estimation methods at large sample sizes. Gamma hat, Mc, and RMSEA also behaved less consistently across estimation methods at small sample sizes. In addition, distribution accounted for a relatively small proportion of variance in TLI, BL89, RNI, CFI, GFI, AGFI, and RMSEA ($\eta^2$s ranged from .116 to .160). Estimation method accounted for a small proportion of variance in NFI and BL86 ($\eta^2$s ranged from .242 to .264).

Analysis for simple and complex misspecified models 1 and 2. A series of 6 × 7 × 3 (Sample Size × Distribution × Estimation Method) ANOVAs were conducted on the fit indices derived from the simple and complex misspecified models. The results were similar for all the misspecified models; however, the effect of estimation method was slightly increased as the degree of model misspecification increased (see Tables 8 and 9). Sample size was found to account for a relatively small proportion of variance in NFI and BL86 ($\eta^2$s ranged from .144 to .206) and a moderate-to-substantial amount of variance in GFI, AGFI, gamma hat, CAK, CK, Mc, CN, and RMSEA ($\eta^2$s

⁶ Results from a five-way ANOVA (Sample Size × Distribution × Model Misspecification × Estimation Method × Model Type) revealed that there were moderate-to-substantial interaction effects between model misspecifica-
together. Tables 7 through 9 contain the proportion of tion and model type (simple vs. complex model) for all fit
variance in each fit index accounted for by sample indices but CN.
Table 7
η² Derived From a 6 × 7 × 3 Analysis of Variance (Sample Size × Distribution × Estimation Method) Performed Separately on Each Fit Index of the Simple or Complex True-Population Model

                                                                                                                        Sample size ×
                                                                  Sample size ×     Sample size ×     Distribution ×    distribution ×
            Sample size       Distribution      Method            distribution      method            method            method
Fit index   Simple   Complex  Simple   Complex  Simple   Complex  Simple   Complex  Simple   Complex  Simple   Complex  Simple   Complex
NFI         .521     .481     .042     .040     .242     .264     .008     .008     .126     .139     .008     .009     .003     .003
BL86        .519     .480     .044     .043     .242     .263     .008     .008     .126     .139     .008     .010     .003     .003
TLI         .237     .213     .131     .134     .075     .079     .063     .063     .111     .102     .101     .115     .045     .050
BL89        .240     .217     .128     .131     .078     .081     .057     .057     .120     .112     .100     .115     .042     .046
RNI         .236     .213     .128     .131     .076     .079     .061     .061     .112     .103     .101     .116     .047     .052
CFI         .306     .283     .116     .119     .095     .104     .050     .049     .114     .107     .083     .096     .034     .037
GFI         .628     .620     .155     .153     .008     .006     .047     .045     .035     .043     .025     .023     .008     .009
AGFI        .621     .613     .160     .158     .008     .006     .048     .047     .034     .043     .025     .023     .008     .009
Gamma hat   .361     .354     .056     .062     .056     .049     .028     .030     .266     .244     .049     .054     .040     .044
CAK         .866     .879     .010     .010     .013     .009     .005     .005     .059     .046     .010     .009     .009     .009
CK          .874     .888     .010     .010     .012     .009     .005     .005     .056     .043     .009     .009     .009     .008
Mc          .368     .359     .064     .071     .055     .047     .030     .033     .261     .239     .053     .057     .039     .042
CN          .814     .815     .033     .032     .007     .007     .046     .046     .017     .017     .017     .016     .022     .022
SRMR        .415     .336     .111     .101     .185     .196     .025     .023     .096     .104     .086     .093     .018     .021
RMSEA       .398     .390     .142     .148     .031     .026     .020     .021     .186     .173     .086     .091     .019     .020

Note. η² = the proportion of variance accounted for by each predictor variable or interaction term (η² was calculated by dividing the Type 3 sum of squares for a given predictor or interaction term by the corrected total sum of squares). NFI = normed fit index; BL86 = fit index by Bollen (1986); TLI = Tucker-Lewis Index (1973); BL89 = fit index by Bollen (1989); RNI = relative noncentrality index; CFI = comparative fit index; GFI = goodness-of-fit index; AGFI = adjusted goodness-of-fit index; CAK = a rescaled version of Akaike's information criterion; CK = cross-validation index; Mc = McDonald's centrality index; CN = critical N; SRMR = standardized root-mean-square residual; RMSEA = root-mean-square error of approximation.
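
The η² definition in the table note (effect sum of squares divided by the corrected total sum of squares) can be illustrated with a minimal sketch. It is not the authors' analysis code. It assumes a balanced, fully crossed design, where the main-effect sum of squares computed from level means coincides with the Type 3 sum of squares, and it covers main effects only. The function name `eta_squared_main_effects` and the toy data are hypothetical.

```python
from statistics import mean


def eta_squared_main_effects(cells):
    """Return eta-squared for each main effect of a balanced factorial design.

    `cells` maps a tuple of factor levels, e.g. (sample_size, method), to the
    list of fit-index values observed in that cell. Every cell must hold the
    same number of observations (balanced design).
    """
    if len({len(obs) for obs in cells.values()}) != 1:
        raise ValueError("balanced design required: unequal cell sizes")

    values = [y for obs in cells.values() for y in obs]
    grand_mean = mean(values)
    # Corrected total sum of squares: squared deviations from the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in values)

    n_factors = len(next(iter(cells)))
    result = []
    for axis in range(n_factors):
        # Pool observations by the level of this factor.
        by_level = {}
        for levels, obs in cells.items():
            by_level.setdefault(levels[axis], []).extend(obs)
        # Main-effect sum of squares (matches Type 3 SS when the design is balanced).
        ss_effect = sum(
            len(group) * (mean(group) - grand_mean) ** 2
            for group in by_level.values()
        )
        result.append(ss_effect / ss_total)
    return result


# Hypothetical toy data: 2 sample sizes x 2 estimation methods, 2 replications per cell.
toy = {
    (150, "ML"): [0.0, 0.0],
    (150, "GLS"): [1.0, 1.0],
    (5000, "ML"): [2.0, 2.0],
    (5000, "GLS"): [3.0, 3.0],
}
eta_size, eta_method = eta_squared_main_effects(toy)
```

In this toy example the sample-size factor accounts for most of the variance and the method factor for the remainder, because the data contain no interaction or within-cell variation. Interaction terms and unbalanced designs would need a full ANOVA routine.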

[Tables 8 and 9: η² values from the 6 × 7 × 3 ANOVAs on the fit indices of the simple and complex misspecified models; table bodies not recoverable.]

ranged from .268 to .825) under simple misspecified models 1 and 2, as well as complex misspecified model 1. Sample size accounted only for a moderate-to-large proportion of variance in CAK, CK, and CN (η²s ranged from .332 to .709) under complex misspecified model 2. A small proportion of variance in GFI and AGFI also was accounted for by distribution (η²s ranged from .153 to .264). Estimation method had a moderate-to-substantial effect on NFI, BL86, TLI, BL89, RNI, CFI, and SRMR (η²s ranged from .292 to .673) derived from simple and complex misspecified models. A relatively small estimation-method effect (η²s ranged from .226 to .263) was observed for gamma hat, Mc, and RMSEA derived from complex misspecified model 2. Furthermore, there were also relatively small-to-moderate interaction effects between sample size and estimation method (η²s ranged from .222 to .345) on gamma hat, Mc, and RMSEA derived from simple and complex misspecified models. Inspection of cell means revealed that these three fit indices behaved less consistently at small sample sizes than at large sample sizes. Under the complex misspecified model 2, there were a small distribution effect and a small interaction effect between distribution and estimation method on GFI and AGFI. Inspection of cell means suggested that GFI and AGFI derived from complex misspecified model 2 behaved less consistently across estimation methods under Distributional Conditions 1, 3, and 4. Finally, inspection of Tables 7 through 9 yielded a systematic decrease in the magnitude of estimation-method effect as a result of a decrease in quality of models.⁷

⁷ Four-way ANOVAs (Sample Size × Distribution × Model Misspecification × Estimation Method) revealed that there are substantial interaction effects between model misspecification and estimation method for Type 1, Type 2, and Type 3 incremental fit indices and SRMR.

Discussion

Our findings suggest that the performance of fit indices is complex and that additional research with a wider class of models and conditions is needed, to provide final answers on the relative merits of many of these indices. In spite of this complexity, there are enough clear-cut results from this study to permit us to make some very specific recommendations for practice. We do this in a sequential manner, first making suggestions about which indices not to use, then concluding with suggestions about indices to use. A good fit index should have a large model misspecification effect accompanied by trivial effects of sample size, distribution, and estimation method. Summary tables and a detailed description of various sources of effects on fit indices are presented in our technical report (Hu & Bentler, 1997).

Recommendations for the Selection of Fit Indices in Practice

CAK and CK are not sensitive to model misspecification, estimation method, or distribution but are extremely sensitive to sample size. We do not recommend their use.

CN is not sensitive to model misspecification, estimation method, or distribution but is very sensitive to sample size. We do not recommend its use.

NFI and BL86 are not sensitive to simple model misspecification but are moderately sensitive to complex model misspecification. Although a slight effect of estimation method under true-population models and a substantial estimation-method effect under misspecified models were observed for NFI and BL86, they are not sensitive to distribution. ML- and GLS-based NFI and BL86 are sensitive to sample size. The ADF-based NFI and BL86 are less sensitive to sample size, but they substantially underestimate true-population values. We do not recommend their use.

GFI and AGFI are not sensitive to model misspecification and estimation method. ML- and GLS-based GFI and AGFI are not sensitive to distribution but are sensitive to sample size. ADF-based GFI and AGFI are sensitive to distribution but are not sensitive to sample size. We do not recommend their use.

TLI, BL89, RNI, and CFI are moderately sensitive to simple model misspecification but are very sensitive to complex model misspecification. They are not influenced by estimation method under true-population models but are substantially influenced by estimation method under misspecified models. These fit indices are less sensitive to distribution and sample size. We recommend these fit indices be used in general; however, ML-based TLI, BL89, RNI, and CFI are preferable when sample size is small (e.g., N ≤ 250), because the GLS- and ADF-based TLI, BL89, RNI, and CFI underestimate their true-population values and have much larger variances than those based on ML at small sample size.

ML- and GLS-based gamma hat, Mc, and RMSEA are moderately sensitive to simple model misspecification and are very sensitive to complex model misspecification. These fit indices based on the ADF method are less sensitive to both simple and complex model misspecification. Estimation method exerts an effect on gamma hat, Mc, and RMSEA at small sample sizes but exerts no effect at large sample sizes. ML- and GLS-based gamma hat, Mc, and RMSEA are less sensitive to distribution and sample size. The fit indices based on the ADF method are not sensitive to distribution but are very sensitive to sample size. ML- and GLS-based gamma hat, Mc, and RMSEA performed equally well, and we recommend their use. However, we do not recommend that the ADF-based gamma hat, Mc, and RMSEA be used in practice.

Among all the fit indices studied, SRMR is most sensitive to simple model misspecification and is moderately sensitive to complex model misspecification. SRMR is not sensitive to estimation method under true-population models but is sensitive to estimation method under misspecified models. SRMR is less sensitive to distribution and sample size. At small sample sizes, GLS-based SRMR has a slight tendency to overestimate true-population values, and ADF-based SRMR substantially overestimates true-population values. We recommend the ML-, GLS-, and ADF-based SRMR be used in general, but ML-based SRMR is preferable when sample size is small (e.g., N ≤ 250). The average absolute standardized residual computed by EQS, not studied here, has an identical rationale and should perform the same as SRMR.

On the basis of these results, with ML and GLS methods, we recommend a two-index presentation strategy for researchers. This would include definitely using SRMR and supplementing this with one of the following indices: TLI, BL89, RNI, CFI, gamma hat, Mc, or RMSEA. By using cutoff criteria for both SRMR and one of the supplemented indices, researchers should be able to identify models with underparameterized factor covariance(s), underparameterized factor loading(s), or a combination of both types of underparameterization. These alternative indices perform interchangeably in all distributional conditions (see Table 3) except when sample size is small (e.g., N ≤ 250). At small sample size, (a) the range of TLI (or NNFI) tends to be large (e.g., Bentler, 1990); (b) Mc tends to depart substantially from its true-population values; and (c) RMSEA tends to substantially overreject true-population models. Therefore a cautious interpretation of model acceptability based on any of these three fit indices is recommended when sample size is small. Note that Marsh et al. (1996) have proposed a normed version of TLI, to reduce the variance of TLI, and have suggested that the normed version of TLI may be preferable when sample size is small.

With the ADF method, we recommend the definite use of SRMR, supplemented with one of the following indices: TLI, BL89, RNI, or CFI. However, we do not recommend the use of any ADF-based fit indices when sample size is small, because they depart substantially from their true-population values and tend to overreject their true-population models (see also Hu et al., 1992). Better results may be observed with new approaches that attempt to improve ADF estimation in small samples.⁸

⁸ Under the ADF method, there was a substantial sample-size effect on the three noncentrality-based absolute-fit indices. Because these absolute-fit indices rely very heavily on the quality of the ADF chi-square statistic and because this statistic simply cannot be trusted at smaller sample sizes (e.g., Bentler & Dudgeon, 1996; Hu et al., 1992), we are optimistic that the finite sample improvements in the ADF tests made, for example, by Yuan and Bentler (1997) will remove this performance problem in the near future. In general, these indices also have good sensitivity to model misspecification. This does break down with ADF estimation, and it is possible that this breakdown also will be prevented with the Yuan-Bentler ADF test. Future work will have to evaluate this suggestion.

Finally, most of the fit indices (except gamma hat, Mc, and RMSEA, which perform equally well under ML and GLS methods) obtained from ML perform much better (less likely to be influenced by various sources of irrelevant effects and less likely to depart from their true-population values) than those obtained from GLS and ADF and should be preferred indicators for model selection and evaluation.

Other General Observations

The ability to discriminate well-fitting from badly fitting models for the ML-, GLS-, and ADF-based SRMR is substantially superior to that of any other fit index under simple misspecified models, but it is slightly less sensitive to complex model misspecification than several above-mentioned fit indices. One possible explanation for this finding is that the loadings of the observed indicators on a given factor become biased due to the misspecification of the covariance between two factors and thus the average of squared residuals is more likely to capture this type of misspecification as a result of a greater number of biased parameter estimates obtained. Our findings are consistent with La Du and Tanaka's (1989) findings that ML-based NFI is more sensitive to the underparameterized model misspecification than the ML- and GLS-based GFI. However, in contrast to the results of Maiti and Mukherjee (1991), we have found GFI to be quite insensitive to various types of underparameterized model misspecification. Because they found GFI to be as sensitive as their newly proposed indices of structural closeness (ISC), we suspect that ISC also would not have performed well in our study. However, ISC possesses, under some circumstances, an excellent property of going to an extremely small value under extreme misspecification, which they call specificity. Certainly this feature, and the ISC indices, require further evaluation under conditions of extreme model misfit.

A major effort in prior research on fit indices has been to examine sensitivity of fit indices to sample size. Virtually all of this research has been conducted under the true models (e.g., Anderson & Gerbing, 1984; Anderson et al., 1985; Bollen, 1986, 1989, 1990; Marsh et al., 1988). To test the generality of previous findings, we examined the effect of sample size on fit indices under both true-population and misspecified models. The means of the empirical sampling distributions for Type 2 and Type 3 incremental indices varied with sample size to a lesser extent than was found for Type 1 incremental fit indices. In keeping with the findings of Marsh et al. (1988), Type 1 incremental fit indices tended to underestimate their asymptotic values and overreject true models at small sample sizes. This was especially true for indices obtained from GLS and ADF. Obviously, Type 1 incremental indices are influenced by the badness of the null model as well as the goodness of fit of the target model. Among the absolute-fit indices, GFI, AGFI, CAK, and CK derived from ML and GLS methods, as well as CAK, CK, and the noncentrality-based absolute-fit indices derived from the ADF method, were substantially influenced by sample size. The quality of models does not have a substantial effect on the relationship between the sample size and the mean values of most of the fit indices studied here (CN is the only exception). The pattern of association between the mean values of all three types of fit indices and sample size for the two misspecified models is quite similar to that for the true-population model. Our results on absolute indices are mixed. The Type 2 and Type 3 incremental fit indices and the noncentrality-based absolute-fit indices, in general, outperform the Type 1 incremental and the rest of the absolute-fit indices. The underestimation of perfect fit by the fit indices studied here, which is evident at the smaller sample sizes, becomes trivially small at the two largest sample sizes (i.e., 2,500 and 5,000). This is consistent with the theoretically predicted asymptotic properties and has been noted previously in several other studies (e.g., Bearden et al., 1982; Bentler, 1990; La Du & Tanaka, 1989).

Our findings on the effect of estimation method on all three types of incremental fit indices are more optimistic than those of Sugawara and MacCallum (1993). Sugawara and MacCallum have reported that values of incremental fit indices such as NFI, BL86, BL89, and TLI varied substantially across estimation methods and that this phenomenon held for both poor- and well-fitting models. However, our results indicated that Type 2 and Type 3 incremental as well as absolute-fit indices behave relatively consistently across the three estimation methods under both types of true-population models (especially when sample size is relatively large), although Type 1 incremental fit indices seem to behave less consistently across estimation methods under both true-population and misspecified models. These inconsistent findings may be due to the differences in the range of sample sizes and quality of models used in each of the studies, for example, (a) small sample-size-to-model-size ratios and (b) the use of good-fitting models instead of true-population models by Sugawara and MacCallum.

Under both simple and complex misspecified models, all three types of incremental fit indices behave less consistently across ML, GLS, and ADF methods. These findings are consistent with those of Sugawara and MacCallum (1993). Sugawara and MacCallum have suggested that the effect of estimation methods on fit is tied closely to the nature of the weight matrices used by the methods. According to them, incremental fit indices, which use the discrepancy function value for the null model in their calculation, tend to behave erratically across estimation methods, because the discrepancy function values for a null model vary as a function of the weight matrices defined in various estimation methods. They also suggest that this phenomenon will occur even for a model that is quite consistent with the observed data. Our findings suggest that their proposition cannot be generalized to various situations (e.g., when there is dependence
among latent variates or when a true-population model is analyzed). For example, Type 2 and Type 3 incremental fit indices for the true-population model behave consistently at moderate or large sample sizes under the independence condition. It seems that when more information is used for deriving a fit-index value, the influence of weight matrices (and hence estimation methods) on the performance of incremental fit indices (e.g., Type 2 and Type 3 incremental fit indices) decreases. This is evident from our findings that Type 2 and Type 3 incremental fit indices behave much more consistently across estimation methods than Type 1 incremental fit indices. This is especially true when the sample size is large, the model is correctly specified, and the conditions for asymptotic robustness theory are satisfied. In addition, estimation method has no effect on GFI, AGFI, CAK, and CK derived from simple and complex true-population and misspecified models. Estimation method has no effect on CN under simple models, but it exerts a small effect on CN under complex models when sample size is small, especially when there is dependence among latent variates. Estimation method has a relatively small effect on SRMR under both simple and complex true-population models, whereas it has a moderate-to-large effect on SRMR under both types of misspecified models. Thus, Sugawara and MacCallum's suggestion that nonincremental fit indices tend to behave much more consistently across estimation methods than do incremental fit indices is only partially supported, and the differential performance among three types of incremental fit indices needs to be emphasized. Furthermore, the interaction effect between sample size and estimation method on the noncentrality-based absolute-fit indices (i.e., gamma hat, Mc, and RMSEA) seems to suggest that differences in the weight matrices used by the various estimation methods do not by themselves provide sufficient rationale for explaining the inconsistent performance of various fit indices across estimation methods. One plausible explanation for this unexpected finding may be that the difference between a sample test statistic T and its degrees of freedom provides a biased estimate of the corresponding population noncentrality parameter when sample size is small.

The quality of models (degrees of model misspecification) seems to be related to the inconsistent performance of all fit indices, although this relationship is much less substantial for GFI, AGFI, CAK, and CK. In general, they tend to perform less consistently across estimation methods under the misspecified models than under the true-population model. All the fit indices behave more consistently across estimation methods under the true-population model than under the two misspecified models. In keeping with Sugawara and MacCallum's (1993) findings, the extent of consistent performance across estimation methods for the absolute-fit indices depends on the quality of models. One relevant and interesting question is how the extent of model misspecification may affect the performance of the noncentrality-based Type 3 incremental and absolute-fit indices. As suggested, a test statistic T can be approximated in large samples by the noncentral χ² (df, λ) distribution with true or not extremely misspecified models and distributional assumptions. It is likely that the degree of model misspecification will influence the performance of these noncentrality-based fit indices more than it will affect the other types of fit indices because of the violation of the assumption underlying the noncentrality-based fit indices (i.e., they may not be distributed as a noncentral chi-square variate under extremely misspecified models). Future research needs to further address this issue.

The only important remaining issue is the cutoff value for these indices. Considering any model with a fit index above .9 as acceptable (Bentler & Bonett, 1980), and one with an index below this value as unacceptable, we have evaluated the rejection rates for most of the fit indices, except CAK, CK, CN, SRMR, and RMSEA. A cutoff value of 200 was used for CN (cf., Hoelter, 1983). A cutoff value of .05 was used for SRMR and RMSEA. Steiger (1989), Browne and Mels (1990), and Browne and Cudeck (1993) have recommended that values of RMSEA less than .05 be considered as indicative of close fit. Browne and Cudeck have also suggested that values in the range of .05 to .08 indicate fair fit and that values greater than .10 indicate poor fit. MacCallum, Browne, and Sugawara (1996) consider values in the range of .08 to .10 to indicate mediocre fit.

Although it is difficult to designate a specific cutoff value for each fit index because it does not work equally well with various types of fit indices, sample sizes, estimators, or distributions, our results suggest a cutoff value close to .95 for the ML-based TLI, BL89, CFI, RNI, and gamma hat; a cutoff value close to .90 for Mc; a cutoff value close to .08 for SRMR; and a cutoff value close to .06 for RMSEA, before one can conclude that there is a relatively good fit between the hypothesized model and the observed data. Furthermore, the proposed two-index presentation strategy
(i.e., the use of the ML-based SRMR, supplemented by either TLI, BL89, RNI, CFI, gamma hat, Mc, or RMSEA) and the proposed cutoff values for the recommended fit indices are required to reject reasonable proportions of various types of true-population and misspecified models. Finally, the ML-based TLI, Mc, and RMSEA tend to overreject true-population models at small sample sizes (N ≤ 250), and are less preferable when sample size is small. Note that different cutoff values under various conditions (e.g., various sample sizes) are required for GLS- and ADF-based fit indices and, hence, no cutoff values for GLS- and ADF-based fit indices are recommended here. We present a detailed discussion on the selection of cutoff values for the ML-based fit indices elsewhere (Hu & Bentler, 1997, 1999).

Conclusion

Our study has several strengths. First, a wide variety of fit indices, including several new indices such as gamma hat, Mc, and RMSEA, were evaluated under various conditions, such as estimation method, distribution, and sample size, often encountered in practice. Second, we studied performance of fit indices under various types of correct and misspecified models. However, there are also limitations to this study. Although a misspecified model has often been defined by a nonzero noncentrality parameter (e.g., MacCallum et al., 1996; Satorra & Saris, 1985), the rationale for model selection or misspecification remains a weak link in any simulation study, in the absence of consensus on the definition of model misspecification or systematic study of models in the literature and their likely misspecification. In our view, parsimony is a separate issue, and we did not evaluate the performance of fit indices against this criterion. Some fit indices include penalty functions for nonparsimonious models (e.g., AGFI, TLI, CAK, CK, RMSEA), whereas others do not (e.g., NFI, GFI, and CFI). Finally, our study examined the performance of fit indices only under correct and underparameterized confirmatory factor models. Further work should be performed to explore the limits of generalizability in various ways, for example, across types of structural models and overparameterized models.

On the basis of the findings from previous studies and our Monte Carlo study, we identified several critical factors that may influence the adequacy of performance of fit indices. These factors include the degree of sensitivity to model misspecification, sample size, assumptions regarding the independence of latent variates, and estimation methods. Violation of the multivariate normality assumption alone seems to exert less impact on the performance of fit indices. Like chi-square statistics, fit indices are measures of the overall model fit, but it is likely that one may acquire a very good overall fit of the model while one or more areas of local misspecification may remain. Thus, although our discussion has been focused on the issues regarding overall fit indices, consideration of other aspects such as the adequacy and interpretability of parameter estimates, model complexity, and many other issues remains critical in deciding on the validity of a model.

References

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.
Amemiya, Y., & Anderson, T. W. (1990). Asymptotic chi-square tests for a large class of factor analysis models. Annals of Statistics, 18, 1453-1463.
Anderson, J., & Gerbing, D. W. (1984). The effects of sampling error on convergence, improper solutions and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika, 49, 155-173.
Anderson, J., Gerbing, D. W., & Narayanan, A. (1985). A comparison of two alternate residual goodness-of-fit indices. Journal of the Market Research Society, 24, 283-291.
Anderson, T. W., & Amemiya, Y. (1988). The asymptotic normal distribution of estimators in factor analysis under general conditions. Annals of Statistics, 16, 759-771.
Bearden, W. D., Sharma, S., & Teel, J. E. (1982). Sample size effects on chi-square and other statistics used in evaluating causal models. Journal of Marketing Research, 19, 425-430.
Bentler, P. M. (1983). Some contributions to efficient statistics for structural models: Specification and estimation of moment structures. Psychometrika, 48, 493-571.
Bentler, P. M. (1989). EQS structural equations program manual. Los Angeles: BMDP Statistical Software.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Bentler, P. M., & Dudgeon, P. (1996). Covariance structure analysis: Statistical practice, theory, and directions. Annual Review of Psychology, 47, 541-570.
SENSITIVITY OF FIT INDICES TO MISSPECIFICATION 451

Bentler, P. M., & Wu, E. J. C. (1995a). EQS for Macintosh user's guide. Encino, CA: Multivariate Software.
Bentler, P. M., & Wu, E. J. C. (1995b). EQS for Windows user's guide. Encino, CA: Multivariate Software.
Bollen, K. A. (1986). Sample size and Bentler and Bonett's nonnormed fit index. Psychometrika, 51, 375-377.
Bollen, K. A. (1989). A new incremental fit index for general structural equation models. Sociological Methods and Research, 17, 303-316.
Bollen, K. A. (1990). Overall fit in covariance structure models: Two types of sample size effects. Psychological Bulletin, 107, 256-259.
Bollen, K. A., & Liang, J. (1988). Some properties of Hoelter's CN. Sociological Methods and Research, 16, 492-503.
Browne, M. W. (1974). Generalized least squares estimators in the analysis of covariance structures. South African Statistical Journal, 8, 1-24.
Browne, M. W. (1982). Covariance structures. In D. M. Hawkins (Ed.), Topics in applied multivariate analysis (pp. 72-141). Cambridge, England: Cambridge University Press.
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62-83.
Browne, M. W. (1987). Robustness of statistical inference in factor analysis and related models. Biometrika, 74, 375-384.
Browne, M. W., & Arminger, G. (1995). Specification and estimation of mean- and covariance-structure models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for social and behavioral science (pp. 185-249). New York: Plenum.
Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research, 24, 445-455.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
Browne, M. W., & Mels, G. (1990). RAMONA user's guide. Unpublished report, Department of Psychology, Ohio State University, Columbus.
Browne, M. W., & Shapiro, A. (1988). Robustness of normal theory methods in the analysis of linear latent variate models. British Journal of Mathematical and Statistical Psychology, 41, 193-208.
Chou, C.-P., & Bentler, P. M. (1995). Estimates and tests in structural equation modeling. In R. Hoyle (Ed.), Structural equation modeling: Issues, concepts, and applications (pp. 37-55). Newbury Park, CA: Sage.
Chou, C.-P., Bentler, P. M., & Satorra, A. (1991). Scaled test statistics and robust standard errors for nonnormal data in covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44, 347-357.
Cudeck, R., & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behavioral Research, 18, 147-167.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1, 16-29.
de Leeuw, J. (1983). Models and methods for the analysis of correlation coefficients. Journal of Econometrics, 22, 113-137.
Ding, L., Velicer, W. F., & Harlow, L. L. (1995). Effects of estimation methods, number of indicators per factor, and improper solutions on structural equation modeling fit indices. Structural Equation Modeling, 2, 119-144.
Dodd, D. H., & Schultz, R. F. (1973). Computational procedures for estimating magnitude of effect for some analysis of variance designs. Psychological Bulletin, 79, 391-395.
Dwyer, J. H. (1974). Analysis of variance and the magnitude of effects: A general approach. Psychological Bulletin, 81, 731-737.
Gerbing, D. W., & Anderson, J. C. (1993). Monte Carlo evaluations of goodness-of-fit indices for structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 40-65). Newbury Park, CA: Sage.
Gierl, M. J., & Mulvenon, S. (1995). Evaluation of the application of fit indices to structural equation models in educational research: A review of literature from 1990 through 1994. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Goffin, R. D. (1993). A comparison of two new indices for the assessment of fit of structural equation models. Multivariate Behavioral Research, 28, 205-214.
Hays, W. L. (1988). Statistics. New York: Holt, Rinehart & Winston.
Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices. Sociological Methods and Research, 11, 325-344.
Hoyle, R. H. (Ed.). (1994). Structural equation modeling in clinical research [Special section]. Journal of Consulting and Clinical Psychology, 62, 427-521.
Hu, L., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Issues, concepts, and applications (pp. 76-99). Newbury Park, CA: Sage.
Hu, L., & Bentler, P. M. (1997). Selecting cutoff criteria for fit indexes for model evaluation: Conventional criteria versus new alternatives (Technical report). Santa Cruz, CA: University of California.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional versus new alternatives. Structural Equation Modeling, 6, 1-55.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351-362.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Models, assumptions, and data. Beverly Hills, CA: Sage.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443-477.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by the method of maximum likelihood. Chicago: National Educational Resources.
Jöreskog, K. G., & Sörbom, D. (1984). LISREL VI user's guide (3rd ed.). Mooresville, IN: Scientific Software.
Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Hillsdale, NJ: Erlbaum.
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences. Pacific Grove, CA: Brooks/Cole.
La Du, T. J., & Tanaka, J. S. (1989). The influence of sample size, estimation method, and model specification on goodness-of-fit assessments in structural equation models. Journal of Applied Psychology, 74, 625-636.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Maiti, S. S., & Mukherjee, B. N. (1991). Two new goodness-of-fit indices for covariance matrices with linear structures. British Journal of Mathematical and Statistical Psychology, 28, 205-214.
Marsh, H. W., Balla, J. R., & Hau, K.-T. (1996). An evaluation of incremental fit indices: A clarification of mathematical and empirical properties. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 315-353). Mahwah, NJ: Erlbaum.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indices in confirmatory factor analysis: Effects of sample size. Psychological Bulletin, 103, 391-411.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97-103.
McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107, 247-255.
Mooijaart, A., & Bentler, P. M. (1991). Robustness of normal theory statistics in structural equation models. Statistica Neerlandica, 45, 159-171.
Mulaik, S. A., James, L. R., Van Alstine, J., Bonnett, N., Lind, S., & Stillwell, C. D. (1989). An evaluation of goodness of fit indices for structural equation models. Psychological Bulletin, 105, 430-445.
Muthén, B., & Kaplan, D. (1992). A comparison of some methodologies for the factor analysis of nonnormal Likert variables: A note on the size of the model. British Journal of Mathematical and Statistical Psychology, 45, 19-30.
SAS Institute. (1993). SAS/STAT user's guide. Cary, NC: Author.
Satorra, A., & Bentler, P. M. (1990). Model conditions for asymptotic robustness in the analysis of linear relations. Computational Statistics & Data Analysis, 10, 235-249.
Satorra, A., & Bentler, P. M. (1991). Goodness-of-fit test under IV estimation: Asymptotic robustness of a NT test statistic. In R. Gutierrez & M. J. Valderrama (Eds.), Applied stochastic models and data analysis (pp. 555-567). Singapore: World Scientific.
Satorra, A., & Saris, W. E. (1985). Power of the likelihood ratio test in covariance structure analysis. Psychometrika, 50, 83-90.
Sechrest, L., & Yeaton, W. H. (1982). Magnitudes of experimental effects in social science research. Evaluation Review, 6, 579-600.
Sobel, M. E., & Bohrnstedt, G. W. (1985). Use of null models in evaluating the fit of covariance structure models. In N. B. Tuma (Ed.), Sociological methodology (pp. 152-178). San Francisco: Jossey-Bass.
Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of common factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Sugawara, H. M., & MacCallum, R. C. (1993). Effect of estimation method on incremental fit indexes for covariance structure models. Applied Psychological Measurement, 17, 365-377.
Tanaka, J. S. (1987). How big is big enough? Sample size
