Methodology Review:
Assessing Unidimensionality
of Tests and Items
John Hattie
University of New England, Australia

Various methods for determining unidimensionality are reviewed and the rationale of these methods is assessed. Indices based on answer patterns, reliability, components and factor analysis, and latent traits are reviewed. It is shown that many of the indices lack a rationale, and that many are adjustments of a previous index to take into account some criticisms of it. After reviewing many indices, it is suggested that those based on the size of residuals after fitting a two- or three-parameter latent trait model may be the most useful to detect unidimensionality. An attempt is made to clarify the term unidimensional, and it is shown how it differs from other terms often used interchangeably such as reliability, internal consistency, and homogeneity. Reliability is defined as the ratio of true score variance to observed score variance. Internal consistency denotes a group of methods that are intended to estimate reliability, are based on the variances and the covariances of test items, and depend on only one administration of a test. Homogeneity seems to refer more specifically to the similarity of the item correlations, but the term is often used as a synonym for unidimensionality. The usefulness of the terms internal consistency and homogeneity is questioned. Unidimensionality is defined as the existence of one latent trait underlying the data.

One of the most critical and basic assumptions of measurement theory is that a set of items forming an instrument all measure just one thing in common. This assumption provides the basis of most mathematical measurement models. Further, to make psychological sense when relating variables, ordering persons on some attribute, forming groups on the basis of some variable, or making comments about individual differences, the variable must be unidimensional; that is, the various items must measure the same ability, achievement, attitude, or other psychological variable. As an example, it seems desirable that a test of mathematical ability not be confounded by varying levels of verbal ability on the part of persons completing the test. Yet despite its importance, there is not an accepted and effective index of the unidimensionality of a set of items. Lord (1980) contended that there is a great need for such an index, and Hambleton, Swaminathan, Cook, Eignor, and Gifford (1978) argued that testing the assumption of unidimensionality takes precedence over other goodness-of-fit tests under a latent trait model.

Initially identified as a desirable property of tests in the 1940s and 1950s (the earliest mention was in Walker, 1931), unidimensionality was used in the same way as homogeneity and internal consistency until the recent increase of interest in latent trait models prompted a clearer and more precise definition. One aim of this paper is to make the boundaries of the terms less fuzzy by detailing variations in the use of the terms and by proposing definitions that seem consistent with usage by many authors (or at least, what the authors appear to have intended when they used the terms).

A further aim is to review the many indices that have at various times been proposed for unidimensionality, with particular emphasis on their rationale. It is argued that most proposers do not offer a rationale for their choice of index, even fewer assess the performance of the index relative to other indices, and hardly anyone has tested the indices using data of known dimensionality.

As a working definition of unidimensionality, a set of items can be said to be unidimensional when it is possible to find a vector of values Φ = (φ_i) such that the probability of correctly answering an item is π_ig = f_g(φ_i) and local independence holds for each value of Φ. The first section, on methods based on answer patterns, assumes f_g to be a step function. The second section, on methods based on reliability, assumes f_g to be linear. The next two sections, on methods based on principal components and factor analysis, assume f_g to be generally linear (though a nonlinear function is considered). The final section, on methods based on latent traits, assumes different forms for f_g.

Each section details various indices that have been proposed for unidimensionality. Given the large number of indices, it is not possible to discuss each one fully or to cite every researcher who has used it. Table 1 lists the indices discussed in this paper and cites references to the authors who have used or recommended each index.

Indices Based on Answer Patterns

Several indices of unidimensionality are based on the idea that a perfectly unidimensional test is a function of the amount by which a set of item responses deviates from the ideal scale pattern. The ideal scale pattern occurs when a total test score equal to n is composed of correct answers to the n easiest questions, and thereafter of correct answers to no other questions.

Guttman (1944, 1950) proposed a reproducibility coefficient that provides "a simple method for testing a series of qualitative items for unidimensionality" (1950, p. 46). This index of reproducibility is a function of the number of errors; these errors are defined as unity entries that are below a cutoff point and zero entries that are above it. Various authors have suggested rules to determine this cutoff point. Guttman determined his coefficient of reproducibility by counting the number of responses that could have been predicted wrongly for each person on the basis of his/her total score, dividing these errors by the total number of responses, and subtracting the resulting fraction from 1. Guttman has suggested reproducibility indices of .80 (1944) and .90 (1945) as acceptable approximations to a perfect scale.

Guttman's coefficient has a lower bound that is a function of the item difficulties and can be well above .5 when the item difficulties depart from .5. Jackson (1949) proposed a Plus Percentage Ratio (PPR) coefficient that was free from the effect of difficulty values. His method is very cumbersome and time consuming, and has many of the problems of the coefficient of reproducibility (e.g., the problem of determining the cutoff). To overcome these deficiencies, Green (1956) proposed approximation formulas for dichotomous items and reported an average discrepancy between his and Guttman's formula of .002 (see Nishisato, 1980).

Loevinger (1944, 1947, 1948) argued that tests of ability are based on two assumptions: (1) scores at different points of the scale reflect different levels of the same ability, and (2) for any two items in the same test, the abilities required to complete one item may help or may not help but they will not hinder or make less likely adequate performance on the other items. These assumptions led Loevinger to define a unidimensional test as one such that, if A's score is greater than B's score, then A has more of some ability than B, and it is the same ability for all individuals A and B who may be selected.

Loevinger (1944) developed what she termed an index of homogeneity. When all the test items are arranged in order of increasing difficulty, the proportion of examinees who have answered correctly both items, P_ij, is calculated for all pairs of items. From this, the theoretical proportion, P_iP_j, who would have answered correctly both items had they been independent, is subtracted. These differences are summed over the n(n − 1) pairs of items:

S = Σ (P_ij − P_iP_j) . (1)

Table 1
The indices of unidimensionality discussed in this paper, with references to the authors who have used or recommended each index. [The body of this two-page table is not recoverable from this copy.]

A test made up of completely independent items would have a value for S of 0, but S does not have an upper limit of unity when the test is perfectly homogeneous. The upper limit is fixed by the proportion of examinees answering correctly the more difficult item in each pair (P_j):

S_max = Σ (P_j − P_iP_j) . (2)

The index of homogeneity thus proposed by Loevinger (1944) was the ratio of Equations 1 and 2 (H = S/S_max). The coefficient is unity for a perfectly homogeneous test and zero for a perfectly nonhomogeneous test. Yet for some sets of items, such as those that allow guessing, the lower bound may not necessarily be zero. (Incidentally, Loevinger's method of identifying errors is the same as Green's (1956), but Green sums over those item pairs whose members are adjacent in difficulty level and not over all pairs of items, as does Loevinger.)

Comments

There have been many criticisms of indices of unidimensionality based on answer patterns. The most serious objection is that these methods can only achieve their upper bounds if the strong assumption of scalability (i.e., a perfect scale) is made (Lumsden, 1959). Another major criticism is that there is nothing in the methods that enables a test of just one trait to be distinguished from a test composed of an equally weighted composite of abilities. Loevinger (1944), who recognized this objection, suggested that in such cases methods of factor analysis were available that could help in determining whether there were many abilities measured or only one. Guilford (1965) argued that it is possible to say that a test that measures, say, two abilities may be considered a unidimensional test provided that each and every item measures both abilities. An example is an arithmetic-reasoning test or a figure-analogies test.

A further objection is that it is possible to construct a set of items that forms a perfect scale yet would appear not to be unidimensional. For example, consider a set of 10 items testing different abilities, one item at a level of difficulty appropriate to each grade from 1 through 10. The test is given to a group of 10 children, each of which is an average student at each grade level from 1 to 10. A perfect scale very probably would result. Thus, perfect reproducibility may not necessarily imply that the items are unidimensional. It seems that using these examples, as have Loevinger (1947) and Humphreys (1949) among others, confuses the method of assessing dimensionality with the identification of the dimension(s) measured. It is possible to say that the items in the above example measure one characteristic, which could be labeled development. Saying that a test is unidimensional does not identify that dimension, in the same way that saying that a test is reliable does not determine what it is reliably measuring.

Because of these criticisms, there was a decline in the use of scaling methods to index unidimensionality. Over the past decade the indices have reemerged under different names. Loevinger's (1944) H has been termed Ct3 by Cliff (1977) and Reynolds (1981). A number of simulations (Wise, 1982, 1983) have compared various reproducibility indices (or order statistics as they are now called) and rediscovered many of the above problems. Wise (1983), for example, compared Loevinger's H and Ct1 described by Cliff (1977) and Reynolds (1981) using six simulated and three real data sets. He found that Ct1 was a poor index of dimensionality for data sets in which there were items of similar difficulty and in which items were of disparate difficulty but did not belong to the same factor. Loevinger's (1944) index also did not perform well in data sets in which there were correlations between factors. In these cases, Loevinger's index tended to overestimate the dimensionality and was not able to distinguish between items loading on different factors.
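To make the answer-pattern indices concrete, the following is a minimal sketch, in Python with NumPy, of Guttman's reproducibility coefficient as described above and of Loevinger's H as the ratio of Equations 1 and 2. The function names are mine, sample proportions are used throughout, and tie handling and the choice of cutoff are simplified relative to the published procedures.

```python
import numpy as np

def guttman_reproducibility(X):
    """Reproducibility = 1 - errors / (N * n), counting as errors the responses that
    differ from the ideal Guttman pattern implied by each person's total score
    (correct answers on the easiest items first). X is an N x n array of 0/1 scores."""
    N, n = X.shape
    order = np.argsort(-X.mean(axis=0))                 # easiest items (highest p) first
    X_ord = X[:, order]
    totals = X_ord.sum(axis=1).astype(int)
    ideal = (np.arange(n) < totals[:, None]).astype(int)
    errors = np.sum(X_ord != ideal)
    return 1.0 - errors / (N * n)

def loevinger_H(X):
    """H = S / S_max, with S = sum(P_ij - P_i*P_j) over item pairs (Equation 1) and
    S_max = sum(P_j - P_i*P_j), P_j referring to the more difficult item (Equation 2)."""
    p = X.mean(axis=0)
    n = X.shape[1]
    S = S_max = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            P_ij = np.mean(X[:, i] * X[:, j])
            p_easy, p_hard = max(p[i], p[j]), min(p[i], p[j])
            S += P_ij - p[i] * p[j]
            S_max += p_hard - p_easy * p_hard
    return S / S_max
```

Either function can be applied to any 0/1 response matrix (rows are persons, columns are items); as the text notes, values near the conventional cutoffs do not by themselves establish unidimensionality.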

Indices Based on Reliability

Perhaps the most widely used index of unidimensionality has been coefficient alpha (or KR-20). In his description of alpha, Cronbach (1951) proved (1) that alpha is the mean of all possible split-half coefficients, (2) that alpha is the value expected when two random samples of items from a pool like those in the given test are correlated, and (3) that alpha is a lower bound to the proportion of test variance attributable to common factors among the items. Cronbach stated that "for a test to be interpretable, however, it is not essential that all items be factorially similar. What is required is that a large proportion of the test variance should be attributable to the principal factor running through the test" (p. 320). He then stated that alpha "estimates the proportion of the test variance due to all common factors among the items," and indicates "how much the test score depends upon the general and group, rather than on the item specific, factors" (p. 320). Cronbach claimed that this was true provided that the inter-item correlation matrix was of unit rank, otherwise alpha was an underestimate of the common factor variance. (The rank of a matrix is the number of positive, nonzero characteristic roots of the correlation matrix.) However, Cronbach suggested that this underestimation was not serious unless the test contained distinct clusters (whereas Green, Lissitz, & Mulaik, 1977, demonstrated that with distinct clusters alpha can be overestimated). These statements led Cronbach, and subsequently many other researchers, to make the further claim that a high alpha indicated a "high first factor saturation" (p. 330), or that alpha was an index of "common factor concentration" (p. 331), and the implication was that alpha was related to dimensionality.

Underlying the intention to use alpha as an index of unidimensionality is the belief that it is somehow related to the rank of a matrix of item intercorrelations. Lumsden (1957) argued that a necessary condition for a unidimensional test (i.e., "a test in which all items measure the same thing," p. 106) is that the matrix of inter-item correlations is of unit rank (see also Lumsden, 1959, p. 89). He means that the matrix fits the Spearman case of the common factor model. From a more systematic set of proofs, Novick and Lewis (1967) established that alpha is less than or equal to the reliability of the test, and equality occurs when the off-diagonal elements of the variance-covariance matrix of observed scores are all equal. This implies the weaker property of unit rank. (However, the converse is not true, i.e., unit rank does not imply that the off-diagonals are all equal.) Thus, there exists a diagonal matrix that, on being subtracted from the variance-covariance matrix, reduces it to unit rank. Unfortunately, there is no systematic relationship between the rank of a set of variables and how far alpha is below the true reliability. Alpha is not a monotonic function of unidimensionality.

Green et al. (1977) observed that though high "internal consistency" as indexed by a high alpha results when a general factor runs through the items, this does not rule out obtaining high alpha when there is no general factor running through the test items. As an example, they used a 10-item test that occupied a five-dimensional common factor space. They used orthogonal factors and had each item loading equally (.45) on two factors in such a way that no two items loaded on the same pair of common factors. The factors were also well determined, with four items having high loadings on each factor. Each item had a communality of .90. Green et al. calculated alpha to be .811, and pointed out that "commonly accepted criteria" would lead to the conclusion that theirs was a unidimensional test. But this example is far from unidimensional. On another criterion, it can be determined that 15 of the 45 distinct inter-item correlations are zero. This should be a cause for concern.

In a Monte Carlo simulation, Green et al. (1977) found: (1) that alpha increases as the number of items (n) increases; (2) that alpha increases rapidly as the number of parallel repetitions of each type of item increases; (3) that alpha increases as the number of factors pertaining to each item increases; (4) that alpha readily approaches and exceeds .80 when the number of factors pertaining to each item is two or greater and n is moderately large (approximately equal to 45); and (5) that alpha decreases moderately as the item communalities decrease. They concluded that the chief defect of alpha as an index of dimensionality is its tendency to increase as the number of items increases.

Green et al. (1977) were careful to note that Cronbach realized the effect of the number of items and that he recommended the average inter-item correlation. Yet in their Monte Carlo study, Green et al. found that the average inter-item correlation is unduly influenced by the communalities of the items and by negative inter-item correlations.
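The two quantities at issue in this discussion, coefficient alpha and the average inter-item correlation, are easy to compute directly from an item-score matrix. The following is a minimal sketch; the function names are mine, and the point of computing both is that, as Green et al. argue, a high alpha can coexist with many near-zero inter-item correlations.

```python
import numpy as np

def coefficient_alpha(X):
    """Cronbach's alpha (KR-20 for 0/1 items):
    (n/(n-1)) * (1 - sum of item variances / variance of total scores)."""
    n = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (n / (n - 1)) * (1 - item_var / total_var)

def mean_interitem_r(X):
    """Average of the off-diagonal inter-item correlations."""
    R = np.corrcoef(X, rowvar=False)
    n = R.shape[0]
    return (R.sum() - n) / (n * (n - 1))
```

Inspecting the full correlation matrix (not only its mean) is what Armor recommends in the next section; alpha alone hides the spread of the correlations.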

Index/Index-Max Formulas

Because alpha is based on product-moment correlations, it cannot attain unity unless the items are all of equal difficulty. This limitation has led other authors to modify alpha in various ways, and to propose other indices of unidimensionality. Generally, these modifications have involved dividing the index by its maximum value (index/index-max). Loevinger's (1944) H is one such ratio. Horst (1953), who reconceptualized Loevinger's index in terms of intercorrelations between items, argued that instead of using Loevinger's method of estimating average item intercorrelation corrected for dispersion of item difficulties, it is more "realistic" to estimate average item reliability corrected for dispersion of the item difficulties. The problem in such an index is to obtain plausible estimates of item reliability.

Both Loevinger (1944) and Horst (1953) defined the maximum test variance given that the item difficulties remain the same, or alternatively, that the item covariances be maximum when the item difficulties are fixed. Terwilliger and Lele (1979) and Raju (1980) instead maximized the item covariances when test variances were fixed but the item difficulties were allowed to vary. (Thus, these indices are not estimates of the classical definition of reliability, i.e., the ratio of true score variance to observed score variance.)

Although his work predated these criticisms, Cronbach (1951) was aware of the problems of alpha, and reported that the difference between alpha and indices controlling for the dispersion of item difficulties was not practically important.

Indices Based on Correcting for the Number of Items

Alpha is dependent on the length of the test, and this leads to a new problem, and to more indices. Conceptually, the unidimensionality of a test should be independent of its length. To use Cronbach's (1951) analogy: A gallon of homogenized milk is no more homogeneous than a quart. Yet alpha increases as the test is lengthened, and so Cronbach proposed that an indication of inter-item consistency could be obtained by applying the Spearman-Brown formula to the alpha for the total test, thereby estimating the mean correlation between items. The derivation of this mean correlation is based on the case in which the items have equal variances and equal covariances, and then is applied more generally. Cronbach recommended the formula as an overall index of internal consistency. If the mean correlation is high, alpha is high; but alpha may be high even when items have small intercorrelations; it depends on the spread of the item intercorrelations and on the number of items. Cronbach pointed out that a low mean correlation could be indicative of a nonhomogeneous test and recommended that when the mean correlation is low, only a study of the correlations among items would indicate whether a test could be broken into more homogeneous subtests.

Armor (1974) argued that inspecting the inter-item correlations for patterns of low or negative correlations was usually not done, and was probably the most important step since it contained all information needed to decide on the dimensionality (also see Mosier, 1936). Armor claimed that by assessing the number of intercorrelations close to zero, it was possible to avoid a major pitfall in establishing unidimensionality. That is, it becomes possible to assess whether more than one independent dimension is measured.

Another problem in using an average inter-item correlation is that if the distribution of the correlations is skewed, then it is possible for a test to have a high average inter-item correlation and yet have a modal inter-item correlation of zero (Mosier, 1940). Clearly, despite its common usage as an index of unidimensionality, alpha is extremely suspect.
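The Spearman-Brown relation mentioned above can be inverted to recover the mean inter-item correlation implied by a given alpha. The sketch below shows that inversion under the equal-variance, equal-covariance case noted in the text; the function name is mine, and the numerical example is illustrative only.

```python
def mean_r_from_alpha(alpha, n):
    """Invert alpha = n*r / (1 + (n-1)*r) to recover the implied mean
    inter-item correlation r from alpha for an n-item test."""
    return alpha / (n - (n - 1) * alpha)

# Length dependence of alpha: for a fixed mean inter-item correlation of .20,
# alpha = .71 at n = 10 items but alpha = .92 at n = 45 items.
```

This makes Cronbach's milk analogy concrete: lengthening the test raises alpha without making the items any more homogeneous.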

Indices Based on Principal Components

Defining a unidimensional test in terms of unit rank leads to certain problems, the most obvious of which is how to determine statistically when a sample matrix of inter-item correlations has unit rank. Some of the estimation issues relate to whether component or factor analysis should be used, how to determine the number of factors, the problem of communalities, the role of eigenvalues, the choice of correlations, and the occurrence of "difficulty" factors.

Since the first principal component explains the maximum variance, this variance, usually expressed as the percentage of total variance, has been used as an index of unidimensionality. The implication is that the larger the amount of variance explained by the first component, the closer the set of items is to being unidimensional. An obvious problem is how "high" this variance needs to be before concluding that a test is unidimensional. Carmines and Zeller (1979), without any rationale, contended that at least 40% of the total variance should be accounted for by the first component before it can be said that a set of items is measuring a single dimension. Reckase (1979) recommended that the first component should account for at least 20% of the variance. However, it is not difficult to invent examples in which a multidimensional set of items has higher variance on the first component than does a unidimensional test.

Since many of the components probably have much error variance and/or are not interpretable, there have been many attempts to determine how many components should be "retained." A common strategy is to retain only those components with eigenvalues greater than 1.0 (see Kaiser, 1970, for a justification, though there are many critics of his argument, e.g., Gorsuch, 1974; Horn, 1965, 1969; Linn, 1968; Tucker, Koopman, & Linn, 1969). The number of eigenvalues greater than 1.0 has been used as an index of unidimensionality.

Lumsden (1957, 1961), without giving reasons, suggested that the ratio of the first and second eigenvalues would give a reasonable index of unidimensionality, though he realized that besides having no fixed maximum value, little is known about the extent to which such an index may be affected by errors of sampling or measurement. Hutten (1980) also assessed unidimensionality on the basis of the ratio of the first and second largest eigenvalues of matrices of tetrachoric correlations. Without citing evidence, Hutten wrote that the ratio criterion was "a procedure which has been used extensively for this purpose" and that "high values of the ratio indicate unidimensional tests. Low values suggest multidimensionality" (p. 15).

Lord (1980) argued that a rough procedure for determining unidimensionality was the ratio of first to second eigenvalues and an inspection as to whether the second eigenvalue is not much larger than any of the others. A possible index to operationalize Lord's criteria could be the difference between the first and second eigenvalues divided by the difference between the second and third eigenvalues. Divgi (1980) argued that this index seemed reasonable because if the first eigenvalue is relatively larger than the next two largest eigenvalues, then this index will be large; whereas if the second eigenvalue is not small relative to the third, then regardless of the variance explained by the first component and despite the number of eigenvalues greater than or equal to 1.0, this index will be small. It is not difficult, however, to construct cases in which this index must fail. For example, given four common factors, if the second and third eigenvalues are nearly equal, then the index could be high. But in a three-factor case, if the difference between the second and third eigenvalues is large, then the index would be low. Consequently, the index would identify the four-factor case as unidimensional, but not the three-factor case!

The sum of squared residuals, or sum of the absolute values of the residuals after removing one component, has been used as an index of unidimensionality. Like many other indices, there is no established criterion for how small the residuals should be before concluding that the test is unidimensional. It has been suggested that the absolute size of the largest residual or the average squared residual are useful indices of the fit. Thurstone (1935), Kelley (1935), and Harman (1979) argued that all residuals should be less than (N − 1)^(-1/2), where N is the sample size (i.e., the standard error of a series of residuals).
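The eigenvalue-based quantities discussed in this section can all be obtained from one decomposition of the inter-item correlation matrix. The sketch below is a minimal illustration: the dictionary keys and the use of Pearson correlations (rather than tetrachorics for binary items) are my choices, and the "Divgi-style" entry is simply the difference ratio described above.

```python
import numpy as np

def eigenvalue_indices(X):
    """Eigenvalue-based summaries of the inter-item correlation matrix
    for an N x n item-score matrix X."""
    R = np.corrcoef(X, rowvar=False)
    lam, vec = np.linalg.eigh(R)                 # eigenvalues in ascending order
    lam, vec = lam[::-1], vec[:, ::-1]           # reorder: largest first
    loadings = vec[:, 0] * np.sqrt(lam[0])       # first principal component loadings
    resid = R - np.outer(loadings, loadings)     # residuals after removing one component
    off = resid[~np.eye(R.shape[0], dtype=bool)]
    return {
        "pct_var_first": lam[0] / lam.sum(),                     # % of total variance
        "ratio_lambda1_lambda2": lam[0] / lam[1],                # Lumsden/Hutten ratio
        "divgi_style": (lam[0] - lam[1]) / (lam[1] - lam[2]),    # (l1 - l2)/(l2 - l3)
        "n_eigenvalues_gt_1": int((lam > 1.0).sum()),
        "largest_abs_residual": float(np.abs(off).max()),
        "mean_squared_residual": float((off ** 2).mean()),
    }
```

As the counterexamples in the text make clear, none of these summaries has an agreed cutoff, and each can be misled by particular factor structures.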

Indices Based on Factor Analysis

Factor analysis differs from components analysis primarily in that it estimates a uniqueness for each item given a hypothesis as to the number of factors. There are other differences between the two methods, and Hattie (1979, 1980) has clearly demonstrated that contrary conclusions can result from using the two methods. When using the maximum likelihood estimation method, assuming normality, the hypothesis of one common factor can be tested in large samples by a chi-square test. Using binary data, it clearly cannot be assumed that there is multivariate normality. Fuller and Hemmerle (1966) assessed the effects on chi-square when the assumption of normality was violated. They used uniform, normal, truncated normal, Student's t, triangular, and bimodal distributions in a Monte Carlo study. They concluded that the chi-square was relatively insensitive to departures from normality. The Fuller and Hemmerle study was confined to sample sizes of 200, five items, and two factors, and it is not clear what effect departures from normality have on the chi-square for other sets of parameters, particularly when binary items are used.

It should be noted, however, that the chi-square from the maximum likelihood method is proportional to the negative logarithm of the determinant of the residual covariance matrix, and is therefore one reasonable measure of the nearness of the residual covariance matrix to a diagonal matrix. This property is independent of distributional assumptions and justifies for the present purpose the use of the chi-square applied to the matrix of tetrachorics from binary data. (However, this does not justify use of the probability table for chi-square.) Further, it is possible to investigate whether two factors provide better fit than one factor by the difference in chi-squares. Jöreskog (1978) conjectured that the difference between the chi-squares from each hypothesis (for one factor and for two factors) is distributed as a chi-square with (df1 − df2) degrees of freedom. Given that many of the chi-square values typically reported are in the tails of the chi-square distribution, it is not clear what effect this has on testing the difference between two chi-square values. Given the important effect of sample size in determining the chi-square values in factor analysis, the chi-square values are commonly very large relative to degrees of freedom. Consequently, even trivial departures lead to rejection of the hypothesis. Jöreskog and Sörbom (1981) suggested that the chi-square measures are better thought of as goodness-of-fit measures rather than as test statistics.

Instead of using chi-squares, McDonald (1982) has recommended that the residual covariance matrix supplies a nonstatistical but very reasonable basis for judging the extent of the misfit of the model of one factor to the data. Further, McDonald argued that in practice the residuals may be more important than the test of significance for the hypothesis that one factor fits the data, since the hypothesis "like all restrictive hypotheses, is surely false and will be proved so by a significant chi-square if only the sample size is made sufficiently large. If the residuals are small the fit of the hypothesis can still be judged to be 'satisfactory'" (p. 385). This raises the issue that it is probably more meaningful to ask the degree to which a set of items departs from unidimensionality than to ask whether a set of items is unidimensional.

Tucker and Lewis (1973) provided what they called a "goodness-of-fit" test based on the ratio of the amount of variance associated with one factor to total test variance. The suggestion is that for the one-factor case this "reliability" coefficient may be interpreted as indicating how well a factor model with one common factor represents the covariances among the attributes for a population of objects. Lack of fit would indicate that the relations among the attributes are more complex than can be represented by one common factor. The sampling distribution of the Tucker-Lewis coefficient is not known, and there is no value suggested from which to conclude that a set of items is unidimensional. It is also not clear why the authors did not condone using the statistic to indicate how well the hypothesized number of factors explains the interrelationships, yet they did claim that it summarizes the quality of interrelationships (see Tucker & Lewis, 1973, p. 9).
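In the spirit of McDonald's suggestion of judging the misfit of a one-factor model from its residuals rather than from a significance test, the following sketch fits a single common factor and summarizes the residual correlations. It assumes scikit-learn is available, standardizes the items so the fitted covariances are on a correlation metric, and uses Pearson correlations; for binary items, tetrachorics and purpose-built software would normally be preferred, so this is an illustration rather than the procedure of any of the cited papers.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def one_factor_residuals(X):
    """Fit one common factor (ML factor analysis via EM, as implemented in
    scikit-learn) and return the largest absolute and the root-mean-square
    off-diagonal residual correlations."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    fa = FactorAnalysis(n_components=1).fit(Z)
    loadings = fa.components_.ravel()
    implied = np.outer(loadings, loadings) + np.diag(fa.noise_variance_)
    resid = np.corrcoef(Z, rowvar=False) - implied
    off = resid[~np.eye(resid.shape[0], dtype=bool)]
    return float(np.abs(off).max()), float(np.sqrt((off ** 2).mean()))
```

Small residuals would then be read, following McDonald, as an acceptable degree of departure from unidimensionality even when a chi-square test rejects the one-factor hypothesis.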

Maximizing Reliability

An index based on the largest eigenvalue (λ1) that has been often rediscovered is maximized-alpha (Armor, 1974; Lord, 1958):

maximized-alpha = [n/(n − 1)](1 − 1/λ1) . (3)

Armor compared maximized-alpha and alpha for the single-factor case and concluded that maximized-alpha and alpha differ in a substantial way only when some items have consistently lower correlations with all the remaining items in a set. This led Armor to propose what he termed "factor scaling," which involves dropping items that have lower factor loadings. He suggested dropping any item with factor loadings less than .3. Then, since alpha and maximized-alpha do not differ by much when only high loadings are retained, Armor proposed that rather than bothering to get the set of weights that would lead to maximized-alpha, the test developer could use unit weights (which leads to alpha). Armor also suggested that maximized-alpha could be used to discover unidimensionality, though he did not specify how this was done. The obvious interpretation of Armor's remarks is that a large maximized-alpha is an indication of a unidimensional test, whereas a low value indicates a multidimensional test. Armor argued that maximized-alpha was better than alpha, but in his examples, the differences between these two coefficients were .05, .02, .002, and 0.0, none of which justifies his claim of "substantial" differences.

Omega

McDonald (1970) and Heise and Bohrnstedt (1970) independently introduced another coefficient that has been used by other researchers as an index of unidimensionality. This index, called theta by McDonald and omega by Heise and Bohrnstedt, is based on the factor analysis model and is a lower bound to reliability, with equality only when the specificities are all zero. Omega is a function of the item specificities, and these can be satisfactorily estimated only by fitting the common factor model efficiently. Approximations from components analysis would generally yield underestimates of the uniquenesses (and therefore of the specificities) and thus lead to spuriously high values of omega.

Should Factor Analysis Be Used on Binary Items?

Many of the tests for which indices of unidimensionality have been derived are scored correct or incorrect. A problem in using the usual factor model on binary-scored items is the presence of "difficulty factors." This problem has a history dating back to Spearman (1927) and Hertzman (1936), and often is cited as a reason against performing factor analysis on dichotomous data (Gorsuch, 1974). Guilford (1941), in a factor analysis of the Seashore Test of Pitch Discrimination, obtained a factor that was related to the difficulty of the items. That is, the factor loadings showed a tendency to change as a linear function of item difficulty. Three possibilities were suggested to account for the presence of the difficulty factor: (1) it may have something to do with the choice of a measure of association; (2) it may be that difficulty factors are related to distinct human abilities; or (3) it may have something to do with chance or guessing. From each account various indices of unidimensionality have been proposed.

Choice of a measure of association. Wherry and Gaylord (1944) argued that difficulty factors were obtained because phi and not tetrachoric correlations were used. This was, they argued, because tetrachorics would be 1.0 in cases of items measuring the same ability regardless of differences in difficulty, whereas the sizes of phis are contingent upon difficulty. They contended that if difficulty factors were found even when tetrachorics were used (as in Guilford's, 1941, case), then this must be considered disproof of the unidimensional claim. Empirically, the sample matrix of tetrachorics is often not positive-definite (i.e., non-Gramian). This may be a problem of using (perhaps without justification) a maximum likelihood computer program as opposed to least squares methods. Further, Lord and Novick (1968) have contended that tetrachorics cannot be expected to have just one common factor except under certain normality assumptions, whereas such distributional considerations are irrelevant for dimensionality defined in terms of latent traits.

Carroll (1945) has further demonstrated that tetrachorics can be affected by guessing, and that the values of tetrachorics tend to decrease as the items become less similar in difficulty. Lord (1980) was emphatic that tetrachorics should not be used when there was guessing.

Gourlay (1951) was aware of the effect of guessing on tetrachorics yet argued that a more fundamental problem was that the test items often violated the assumption of normality. This hint led to an important discovery by Gibson (1959, 1960), who recognized that difficulty factors were caused by nonlinear regressions of tests on factors. However, a general treatment of nonlinearity was not given until McDonald's series of publications (1962a, 1965a, 1965b, 1967a, 1967b, 1967c, 1976).

Nonlinear factor analysis. McDonald (1965a, 1967c) used a form of nonlinear factor analysis to derive a theory for difficulty factors. Using a factor analysis of subtests of Raven's Progressive Matrices, McDonald demonstrated that a difficulty factor emerged in that the loadings of the subtests were highly correlated with the subtest means. He showed that this factor corresponded to the quadratic term in a single-factor polynomial model, and hence argued that the difficulty factor corresponded to variations in the curvatures of the regressions of the subtests on a single factor. A major consequence of using linear factor analysis on binary items is to distort the loadings of the very easy and very difficult items and to make it appear that such items do not measure the same underlying dimension as the other items.

McDonald and Ahlawat (1974), in a Monte Carlo study, convincingly demonstrated (1) that if the regressions of the items on the latent traits are linear, there are no spurious factors; (2) that a factor whose loadings are correlated with difficulty need not be spurious; (3) that binary variables whose regressions on the latent trait are quadratic curves may yield "curvature" factors, but there is no necessary connection between such "curvature" effects and item difficulty; and (4) that binary variables that conform to the normal ogive model yield, in principle, a series of factors due to departure from a linear model. The conclusion was clear: The notion of "factors due to difficulty" should be dropped altogether and could reasonably be replaced by "factors due to nonlinearity."

McDonald (1979) described a method of nonlinear factor analysis using a fixed factor score model which, unlike the earlier random model (McDonald, 1967c), obtains estimates of the parameters of the model by minimizing a loss function based either on a likelihood ratio or on a least squares function. Etezadi and McDonald (1983) have investigated numerical methods for the multifactor cubic case of this method with first-order interactions.

Although nonlinear factor analysis is conceptually very appealing for attempting to determine dimensionality, it has been used, unfortunately, relatively infrequently. Other than work done by McDonald (1967c), McDonald and Ahlawat (1974), and Etezadi (1981), it was possible to find only one use of the method. Lam (1980) found that one factor with a linear and a quadratic term provided much better fit to Raven's Progressive Matrices than a one- or two-factor linear model. If nonlinearities are common when dealing with binary data, then a nonlinear factor analysis seems necessary.

Moreover, if a nonlinear factor analysis specifying a one-factor cubic provides good fit to a set of data, then the rank of the inter-item correlation matrix is three. Hence, the claim that unit rank is a necessary condition for unidimensionality is incorrect.

Communality Indices

Green et al. (1977) have suggested two indices based on communalities. The first they called u, defined in terms of the inter-item correlations and the communalities (h²) obtained from a principal components analysis. Unlike many other proposers of indices, Green et al. did offer a rationale for these indices. When there is a single common factor among the items, the loadings on this factor equal the square roots of their respective communalities.

Also, the correlation between any two items equals the product of their respective factor loadings, or the square root of the product of the respective communalities. For any particular pair of items i and j, Green et al. suggested that the inequality |r_ij| ≤ (h_i²h_j²)^(1/2) holds. Equality is attained for items occupying a single common factor space, and the inequality is strict for items occupying more than one dimension. When there is one factor, u equals 1; when there are more factors, u takes on values less than 1 and has a lower limit somewhere above 0.

Green et al. (1977) calculated the value of u in a Monte Carlo study and found u to be relatively independent of both the number of items and the communality of the items. Although u did increase as the number of factors loading on an item increased, Green et al. reported that u did not increase as much as alpha. From an inspection of their summary statistics it appears that, contrary to the claim of Green et al., u is affected as much as alpha, though the values are much lower than alpha. Further, although u does seem to distinguish between a one-factor solution and a more-than-one-factor solution, it does not relate in any systematic way to the number of extra factors involved.

A serious problem with u is that it requires knowledge of the communalities of the items, which depends on knowledge of the correct dimensionality. In the simulation of Green et al. (1977), they provided the communalities, but in practice these are not known and u can be (and nearly always is) larger than 1, as |r_ij| > (h_i²h_j²)^(1/2) (because the communalities are usually underestimated). The usefulness of u is therefore most questionable.

Green et al. (1977) suggested a further index. Given that u is affected by the communalities of the items, one aim would be to counter this effect. If the correlations are first corrected for communality by dividing them by the product of the square roots of their respective communalities, in the same manner as a correction for attenuation would be calculated, and the resulting values averaged, then an index not affected by the communalities is obtained. Green et al. contended that such an index also takes on values 0 to 1, with 1 indicating unidimensionality. This index also depends on knowledge of the communalities and is affected by the number of factors determining a variable.

Second-Order Factor Analysis

Lumsden (1961) claimed that it is possible that variance in important group factors may be obscured if the group factors are each measured by only a single item in the set that was tested for unidimensionality. He gave as an example four items whose ideal factor constitutions were each of the form G + S + E, where G is a general factor, S a specific factor, and E is error. The correlation between these items will be determined by the product of the G loadings, and Lumsden stated that the remainder of the variance would be treated as error. He then wrote that the "matrix of item intercorrelations for these four items will be of unit rank. A verbal analogy, a number series, a matrix completion and a mechanical problem are not, however, measuring the same thing and unit rank is not, therefore, a sufficient condition of unidimensionality" (1957, pp. 107-108).

It can be argued that these items may measure a unidimensional trait (e.g., intelligence). It is quite reasonable to find a second-order factor underlying a set of correlations between first-order factors and then make claims regarding unidimensionality. Hattie (1981), for example, used a second-order unrestricted maximum likelihood factor analysis to investigate the correlations between four primary factors he identified (and cross-validated) on the Personal Orientation Inventory (Shostrom, 1972). The hypothesis of one second-order factor could not be rejected in 9 of the 11 data sets, and Hattie concluded that "It thus seems that there is a unidimensional construct underlying the major factors of the POI," which could be identified as self-actualization. Any method for assessing dimensionality does not necessarily identify the nature of the unidimensional set. The naming of the dimension is an independent task.

Indices Based on Latent Trait Models

Before outlining indices based on latent trait models, it is necessary to present some basic theory. This basic theory also provides a clear definition for the concept of unidimensionality.

A theory of latent traits is based on the notion that responses to items can be accounted for, to a substantial degree, by defining k characteristics of the examinees called latent traits, which can be denoted by θ = (θ1, θ2, ..., θk). The vector θ is a k-tuple that is interpreted geometrically as a point in k-dimensional space. The dimensionality of the latent space is one in the special case of a unidimensional test. The regression of item score on θ is called the item characteristic function. For a dichotomous item, the item characteristic function is the probability, P(θ), of a correct response to the item. For a one-dimensional case a common assumption is that this probability can be represented by a three-parameter logistic function,

P(θ) = c + (1 − c)/{1 + exp[−da(θ − b)]} , (5)

or alternatively by a normal ogive function,

P(θ) = c + (1 − c)Φ[a(θ − b)] , (6)

where Φ denotes the cumulative normal distribution function. The difference between Equations 5 and 6 is less than .01 for every set of parameter values when d is chosen as 1.7 (Haley, 1952).

The parameter a in Equations 5 and 6 represents the discriminating power of the item, the degree to which the item response varies with θ level. Parameter b is a location or difficulty parameter and determines the position of the item characteristic curve along the θ scale. The parameter c is the height of the lower asymptote, the probability that a person infinitely low on θ will answer the item correctly. It is called the guessing or pseudo-chance level. If an item cannot be answered correctly by guessing, then c = 0. The latent trait θ provides the scale on which all item characteristic curves have some specified mathematical form, for example, the logistic or normal ogive. A joint underidentifiability of θ, a, b, and c is removed by choosing an origin and unit for θ.

The most critical and fundamental assumption of the latent trait models is that of local independence (Anderson, 1959; McDonald, 1962b, 1981). Lord (1953) has stated that this is almost indispensable for any theory of measurement. The principle of local independence requires that any two items be uncorrelated when θ is fixed, and does not require that items be uncorrelated over groups in which θ varies. Lord and Novick (1968) gave the definition of local independence more substantive meaning by writing that

an individual's performance depends on a single underlying trait if, given his value on that trait, nothing further can be learned from him that can contribute to the explanation of his performance. The proposition is that the latent trait is the only important factor and, once a person's value on the trait is determined, the behavior is random, in the sense of statistical independence. (p. 538)

McDonald (1962b) argued that the statement of the principle of local independence contained the mathematical definition of latent traits. That is, θ1, ..., θk are latent traits if and only if they are quantities characterizing examinees such that, in a subpopulation in which they are fixed, the scores of the examinees are mutually statistically independent. Thus, a latent trait can be interpreted as a quantity that the items measure in common, since it serves to explain all mutual statistical dependencies among the items. Since it is possible for two items to be uncorrelated and yet not be entirely statistically independent, the principle is more stringent than the factor analytic principle that their residuals be uncorrelated. If the principle of local independence is rejected in favor of some less restrictive principle, then it is not possible to retain the definition of latent traits, since it is by that principle that latent traits are defined. McDonald pointed out, however, that it is possible to reject or modify assumptions as to the number and distribution of the latent traits and the form of the regression function (e.g., make it nonlinear instead of linear), without changing the definition of latent traits.
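The following is a minimal sketch of the three-parameter logistic item characteristic function of Equation 5 and of what local independence buys: once θ is fixed, the probability of an entire response pattern factors into a product of item probabilities. The scaling constant d = 1.7 and the parameter names follow the text; the function names are mine.

```python
import numpy as np

def p_correct(theta, a, b, c, d=1.7):
    """Three-parameter logistic item characteristic function (Equation 5)."""
    return c + (1.0 - c) / (1.0 + np.exp(-d * a * (theta - b)))

def pattern_probability(theta, responses, a, b, c):
    """Under local independence, the probability of a 0/1 response pattern given
    theta is the product over items of P (for a correct response) or 1 - P."""
    p = p_correct(theta, a, b, c)
    return np.prod(np.where(responses == 1, p, 1.0 - p))
```

For example, pattern_probability(0.5, np.array([1, 1, 0]), a=np.array([1.2, 0.8, 1.0]), b=np.array([-0.5, 0.0, 1.0]), c=np.array([0.2, 0.2, 0.2])) gives the likelihood of that pattern for a person at θ = 0.5; it is exactly this factorization that fails when more than one trait is needed to explain the dependencies among items.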

It is not correct to claim that the principle of local independence is the same as the assumption of unidimensionality (as have Hambleton et al., 1978, p. 487; Gustafsson, 1980, p. 218). The principle of local independence holds (by virtue of the above argument) for 1, 2, ..., k latent traits. It even holds for the case of zero latent traits, that is, for a set of n items that are mutually statistically independent. It holds for two latent traits, yet such a case would not be considered unidimensional, that is, measuring the same construct. Unidimensionality can be rigorously defined as the existence of one latent trait underlying the set of items.

At this point the working definition of unidimensionality presented in the opening paragraph can be clarified. It was claimed that a set of items can be said to be unidimensional when it is possible to find a vector of values Φ = (φ_i) such that the probability of correctly answering an item is π_ig = f_g(φ_i) and local independence holds for each value of Φ. This definition is not equating unidimensionality with local independence, because it further requires that it is necessary to condition on only one θ dimension and that the probabilities π_ig can be expressed in terms of only one θ dimension.

Lord (1953), in an early specification of the assumption (his restriction IV), suggested a heuristic for assessing whether the assumption of local independence is met in a set of data. First, take all examinees at a given score level and apply a chi-square test (or an exact test) to determine if their responses to any two items are independent. Then, because the distribution of combined chi-squares is ordinarily also a chi-square (with df = df1 + df2 + ... + dfn degrees of freedom), the resulting combined chi-square may be tested for significance. If the combined chi-square is significant, then Lord argued it must be considered that the test is not unidimensional. When assessing how this statistic behaved (for use in Hattie, 1984a), it soon became obvious that it does not matter whether a chi-square or an exact test is used, since in both cases the probability is nearly always close to 1.0.

McDonald (1981) outlined ways in which it is possible to weaken the strong principle, in his terminology, of local independence. The strong principle implies that not only are the partial correlations of the test items zero when the latent traits (which are the same as factor scores) are partialled out, but also the distinct items are then mutually statistically independent and their higher joint moments are products of their univariate moments. A weaker form, commonly used, is to ignore moments beyond the second order and test the dimensionality of test scores by assessing whether the residual covariances are zero (see also Lord & Novick, 1968, pp. 225, 544-545). Under the assumption of multivariate normality, the weaker form of the principle implies the strong form, as well as conversely. McDonald (1979, 1981) argued that this weakening of the principle does not create any important change in anything that can be said about the latent traits, though strictly it weakens their definition.

Indices Based on the One-Parameter Model

Using a combination of difficulty, discrimination, and guessing, various latent trait models have been proposed. The most used and the most researched model involves only the estimation of the difficulty parameters. It is often called the Rasch model after the pioneering work of Rasch (1960, 1961, 1966a, 1966b, 1968, 1977). It is assumed (1) that there is no guessing, (2) that the discrimination parameter a is constant, and (3) that the principle of local independence applies.

One of the major advantages of the Rasch model often cited is that there are many indices of how adequately the data "fits" the model. In one of the earliest statements, Wright and Panchapakesan (1969) contended that "if a given set of items fit the (Rasch) model this is evidence that they refer to a unidimensional ability, that they form a conformable set" (p. 25). These sentiments have often been requoted and an earnest effort made to find and delete misfitting items and people. Rentz and Rentz (1979) stated that "the most direct test of the unidimensionality assumption is the test of fit to the model that is part of the calibration process" (p. 5).

From the one-parameter model there has been a vast array of indices suggested (e.g., see Table 2), but they appear to lack theoretical bases, little is known of their sampling distributions, and they seem to be ad hoc attempts to understand the behavior of items. Their developers have paid insufficient attention to helping psychometricians understand the assumptions, the methods of calculation, and the justification for them (see Rogers, 1984, and below, for a detailed account of the derivations and behavior of these indices). One of the obvious problems of many of the tests of fit is that with large samples a chi-square is almost certain to be significant. For this reason, and also because the distribution of the indices is only an approximation to chi-square, it is not clear how valid the methods are for evaluating items (see George, 1979, for a critical evaluation of this approximation). The tests are motivated primarily by a desire to assess the various assumptions, particularly the unidimensionality assumption.

To illustrate the various "fit statistics," consider the following indices that are typically used.

Table 2
Some Fit Statistics Based on the One-Parameter IRT Model Commonly Used to Detect Departures From Assumptions (Including Unidimensionality)

X_ij = observed response of person i to item j
P_ij = probability of a correct response for person i to item j, obtained from the model equation using estimates of the person and item parameters
N = number of persons
m = number of score groups (usually 6)
O_kj = observed proportion of examinees in group k who correctly answer item j
E_kj = expected proportion of examinees in group k who answer item j correctly
n_kj = number of correct responses at each score level k for item j
En_kj = expected number of correct responses at each score level k for item j

[The formulas for the fit statistics themselves are not recoverable from this copy.]
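As a generic illustration of the kind of residual-based statistic summarized in Table 2, the sketch below computes standardized residuals z_ij = (X_ij − P_ij)/sqrt[P_ij(1 − P_ij)] and an unweighted (outfit-style) mean square per item. This is not the specific Between-Fit or Total-Fit t of Wright, Mead, and Bell (1979); it is only meant to show how such indices are built from the notation defined above.

```python
import numpy as np

def item_mean_square(X, P):
    """Unweighted mean-square residual per item: the mean over persons of
    z_ij**2, where z_ij = (X_ij - P_ij) / sqrt(P_ij * (1 - P_ij)).
    X and P are N x n arrays of observed 0/1 responses and model probabilities."""
    z = (X - P) / np.sqrt(P * (1.0 - P))
    return np.mean(z ** 2, axis=0)   # values near 1 suggest fit; large values, misfit
```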

The statistic, however, is sample-dependent. The sample is divided into subgroups based on score level according to estimated θ, and then the observed proportion of successes on each item in each estimated θ subgroup is compared with that predicted from the estimates of the item difficulties given by the total sample. Wright et al. derived a standardized mean-square statistic that has an expected value of 0 and variance of 1. A more general statistic, a Total-Fit t, evaluates the general agreement between the variable defined by the item and the variable defined by all other items over the whole sample. Again, the expected value is 0 and variance is 1.

Thus, the Total-Fit t summarizes overall item fit from person to person and the Between-Fit t focuses on variations in item responses between the various subgroups. Wright et al. (1979) noted that any t value greater than 1.5 ought to be examined for response irregularities and that values greater than 2.0 "are noteworthy" (p. 13). They also recommended that a "within-group" mean-square be calculated that summarizes the degree of misfit remaining within ability groups after the between-group misfit has been removed from the total.
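The between-group idea just described can be sketched as follows: persons are grouped on estimated ability and, for each item, the observed proportion correct in each group is set against the model-implied proportion. The quantile-based grouping and the six groups are illustrative assumptions, not the exact Wright, Mead, and Bell procedure.

```python
import numpy as np

def between_group_comparison(x_j, theta, p_j, n_groups=6):
    """Observed versus expected proportion correct on one item within
    groups formed from estimated ability.

    x_j   : 0/1 responses of N persons to item j
    theta : N estimated abilities (used only to form the groups)
    p_j   : N model probabilities P_ij for item j
    Returns one (group size, observed proportion, expected proportion) triple per group.
    """
    cuts = np.quantile(theta, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(cuts, theta, side="right") - 1, 0, n_groups - 1)
    out = []
    for g in range(n_groups):
        members = group == g
        out.append((int(members.sum()), x_j[members].mean(), p_j[members].mean()))
    return out
```

Large discrepancies between the observed and expected proportions within particular ability groups are what the Between-Fit t is designed to detect.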
There have been critics of the claims made by the Rasch proponents. Hambleton (1969) included five items in 15-, 30-, and 45-item simulations that were constructed to measure a second ability orthogonal to the first ability. (He did not specify how these items were generated.) In the simulations the items were constrained to have equal discriminations and no guessing. The main aim was to investigate whether the Rasch model was robust with respect to this kind of violation of its assumptions. Hambleton found that the model did not provide a good fit, and further that the fit for the other items was also affected. He wrote that

in the simulation where the proportion of items measuring a second ability was 33% (5 out of 15), the attempt to fit the data failed so completely that nearly every item was rejected by the model. Since 67% of the simulated items could be regarded as having been simulated to satisfy the assumptions of the model, this result suggests that rejecting items from a test on the basis of the chi-square test of the goodness of fit of the items to the model is, by itself, a very hazardous way to proceed. (p. 101)

Gustafsson and Lindblad (1978) found that tests of fit, like those used by Wright et al. (1979), did not lead to rejection of the one-parameter model even for data generated according to an orthogonal two-factor model. As an alternative, Gustafsson (1980) proposed tests of fit based on the person characteristic curve rather than the item characteristic curve, but he pointed out that prior to using these tests "it does seem necessary to use factor analysis to obtain information about the dimensionality of the observations" (p. 217).

Van den Wollenberg (1982a, 1982b) has demonstrated that many of the commonly used indices of unidimensionality based on the Rasch model are insensitive to violations of the unidimensionality axiom. Instead he proposed two new indices, one of which (his Q1) is easy to compute but seems to lack sensitivity, while the other (Q2) requires much computer time to calculate but does seem to be sensitive to violations of unidimensionality under some conditions. Van den Wollenberg discussed the conditions when Q2 seems to be useful, and he promised more systematic inquiries of Q2. Q2 is, however, more a global test of the fit of the one-parameter model than a specific index of unidimensionality. If Q2 is large relative to its expected value (yet to be accurately determined), it could be because of violations of other assumptions such as equal discriminations and/or no guessing.

Rogers (1984) investigated the performance of the Between-t and Total-t for persons and items and the Mean-Square Residual for persons and items, along with many other one-parameter indices to detect unidimensionality (including all those listed in Table 2). She varied test length, sample size, dimensionality, discrimination, and guessing. Rogers used the programs BICAL (Wright et al., 1979) and NOHARM (a method that fits the unidimensional and multidimensional normal ogive one-, two-, and three-parameter models by finding the best approximation to a normal ogive in terms of the Hermite-Tchebycheff polynomial series, using harmonic


analysis; see Fraser, 1981; McDonald, 1982, for details). When either BICAL or NOHARM one-parameter methods were used, all indices were insensitive to multidimensionality, except for the Total-t for persons and the Mean-Square for items when the dimensions were close to orthogonal. Rogers concluded that the inability of all indices "to detect multidimensionality under the one-parameter model restricts their use to conditions where unidimensionality is assured" (p. 92).

Indices Based on the Two-Parameter Model

The two-parameter model allows for estimation of difficulty and discrimination and assumes zero guessing. Bock and Lieberman (1970) detailed a method of estimating these two parameters by considering the pattern of a person's item responses across all items in the test. When compared to the more usual method of factoring tetrachoric correlations, Bock and Lieberman reported trivial differences in the estimates of item difficulties and discrimination, but there were differences in the assessment of dimensionality. Using two sets of data, they reported a sharp break between the size of the first eigenvalue and that of the remaining eigenvalues, and they concluded that "these results would ordinarily be taken to support the assumption of unidimensionality" (p. 191). The usual chi-square tests of one factor from a maximum likelihood factor analysis were 25.88 and 53.74, each with 5 degrees of freedom for the two data sets employed. Clearly in both cases, a one-factor solution must be rejected. The chi-square approximations based on Bock and Lieberman's (1970) two-parameter solution were 21.28 and 31.59, each based on 21 degrees of freedom. In these cases there is evidence for a unidimensional solution. Bock and Lieberman concluded that the "test of unidimensionality provided by maximum likelihood factor analysis cannot be relied upon when tetrachoric correlations are used" instead of their estimates (p. 191).

A major obstacle to using the Bock and Lieberman (1970) method is a practical one. The method requires the use of the 2^n possible response patterns across all items, which, for example, would require 32,768 possible patterns for a 15-item test and more than a billion response patterns for a 30-item test.

Nishisato (1966, 1970a, 1970b, 1971; see also Lim, 1974; Svoboda, 1972) and Christoffersson (1975) demonstrated that very little efficiency in estimation is lost using only information from the first- and second-order joint probabilities of binary-scored items compared to using all possible 2^n proportions, as in the Bock and Lieberman (1970) approach. Muthén (1978) developed an estimation method that was computationally faster than Christoffersson's. These methods solve many of the computation problems, and the resulting indices of fit, after fitting one factor (based on the size of residuals), are most promising in that they are based on sound theoretical considerations.

Bock and Aitkin (1981) pointed out the practical limitations of the Bock and Lieberman (1970) work, but stated that from a statistical point of view the Christoffersson (1975) and Muthén (1978) method "is also objectionable because it assumes that the form of the distribution of ability effectively sampled is known in advance. Since item calibration studies are typically carried out on arbitrarily selected samples, it is difficult to specify a priori the distribution of ability in the population effectively sampled" (p. 444). Instead, Bock and Aitkin proposed to estimate the item parameters by integrating over the empirical distribution. The Bock and Aitkin method was applied to the same data as were the Bock and Lieberman, Christoffersson, and Muthén methods (see Bock & Aitkin, 1981). The differences in the estimates were very small. It is also debatable whether the observed distribution of ability, with the usual sampling errors and errors of measurement, is the best distribution to work from.

The fit statistics listed in Table 2 can be made applicable to the two- and three-parameter models. Rogers (1984) derived an appropriate formula for each index and assessed their sensitivity to the same factors listed above as for the one-parameter model. For all indices there was an increased sensitivity to multidimensionality as more parameters were fitted. It appears that the restrictiveness and consequent


"superficiality" of the one-parameter model "may allow violations of the unidimensionality assumption to go undetected, while the greater detail and accuracy of prediction provided by the two-parameter model effectively exposes its presence. . . . The three-parameter model, at least as fitted under NOHARM, appears to offer little improvement on the two-parameter model" in assessing unidimensionality (Rogers, 1984, pp. 93-94).

Other Approaches

Hulin, Drasgow, and Parsons (1983) suggested a procedure that combined latent trait methods with the factoring of tetrachoric correlations. First, they obtain the eigenvalues of the matrix of item tetrachorics (replacing noncomputable tetrachorics by an ad hoc approximation; see Hulin, Drasgow, & Parsons, 1983, pp. 248-249). Using a two- or three-parameter estimation program, they then estimate the item parameters from this correlation matrix. Next, they generate a truly unidimensional item pool by using the estimated item parameters as the parameters of the simulated items (having the same number of simulated examinees and items as the real data set). The eigenvalues of the matrix of tetrachoric correlations obtained from this synthetic data set are then computed. Finally, and most importantly, the second eigenvalues of the real and synthetic data sets are compared. If the difference is large, this suggests a nonunidimensional set. Suggested magnitudes of differences are provided in Drasgow and Lissak (1983), though the guidelines cover only some very limited cases. If further simulations support the method, it may prove very useful.
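A compressed sketch of the comparison at the heart of this procedure is given below. It substitutes ordinary Pearson (phi) correlations of the binary items for the tetrachorics, and the routine that generates the synthetic unidimensional data is left as a user-supplied, hypothetical function, so this is an outline of the logic rather than the Drasgow and Lissak (1983) procedure itself.

```python
import numpy as np

def second_eigenvalue(binary_data):
    """Second-largest eigenvalue of the inter-item correlation matrix.
    Ordinary (phi) correlations stand in here for the tetrachorics of the
    actual procedure."""
    r = np.corrcoef(binary_data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(r))[-2]

def compare_second_eigenvalues(real_data, simulate_unidimensional, n_reps=20):
    """Contrast the second eigenvalue of the real data with that of synthetic,
    truly unidimensional data of the same dimensions.

    simulate_unidimensional is a user-supplied (hypothetical) function that
    returns one N x n binary data set generated from a unidimensional model
    whose item parameters were estimated from the real data."""
    observed = second_eigenvalue(real_data)
    baseline = float(np.mean([second_eigenvalue(simulate_unidimensional())
                              for _ in range(n_reps)]))
    return observed, baseline, observed - baseline   # a large difference is suspect
```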
Rosenbaum (1984) presented a theorem that seems critical in assessing unidimensionality. This theorem is based on monotone nondecreasing functions of latent variables. These nondecreasing functions include most applications, such as total number correct and positively weighted scores; that is, any score that does not decrease when an additional correct response is added. Rosenbaum's theorem is:

Let (Y, Z) be a partition of the item responses in X into two nonoverlapping groups of responses. Then local independence implies that for every function h(Z), the responses in Y are associated given h(Z) = a for all a; that is, for all nondecreasing functions g1(Y) and g2(Y), cov[g1(Y), g2(Y) | h(Z) = a] ≥ 0. (p. 427)

Rosenbaum provided many illustrations of applications of his theorem. For example, from a test of 120 multiple-choice items, the responses from two items were assessed using the above theorem. The 15,982 candidates were divided into 119 subgroups based on their total score on the remaining 118 items. By the above theorem, local independence implies a nonnegative population correlation (e.g., using the Goodman-Kruskal tau; Goodman & Kruskal, 1979) or, equivalently, a population odds ratio of at least 1 (e.g., using the Mantel-Haenszel ratio; Mantel & Haenszel, 1959) within each class.

Given the sound theoretical bases of Rosenbaum's (1984) procedure, it is expected that his procedure will become widely used. It must be noted, however, that Rosenbaum's theorem relates to assessing the assumption of local independence, which is not necessarily the same as the assumption of unidimensionality (see above). Other problems of the method relate to the choice of sample statistic (e.g., the Goodman-Kruskal tau tends to overestimate a relationship; see Reynolds, 1977, pp. 74-75), the existence of not too many cells with small frequencies, the adequacy of the sample, and the use of multiple significance tests.
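For the two-item check described above, the stratified comparison might be sketched as follows, using the Mantel-Haenszel common odds ratio across rest-score strata. The sketch ignores the problems just noted (sparse strata and multiple significance tests) and is offered only to make the computation concrete.

```python
import numpy as np

def mantel_haenszel_or(item1, item2, rest_score):
    """Mantel-Haenszel common odds ratio for two 0/1 items across strata
    defined by the rest score (number correct on the remaining items).
    Under a unidimensional, monotone latent trait model the population
    value should be at least 1; values clearly below 1 are a warning sign."""
    num = den = 0.0
    for s in np.unique(rest_score):
        stratum = rest_score == s
        n = stratum.sum()
        a = np.sum((item1 == 1) & (item2 == 1) & stratum)   # both items correct
        b = np.sum((item1 == 1) & (item2 == 0) & stratum)
        c = np.sum((item1 == 0) & (item2 == 1) & stratum)
        d = np.sum((item1 == 0) & (item2 == 0) & stratum)   # both items incorrect
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("inf")

# For items 0 and 1 of an N x n matrix x of 0/1 responses:
# mantel_haenszel_or(x[:, 0], x[:, 1], x[:, 2:].sum(axis=1))
```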


Generally, it seems that though latent trait theory provides a precise definition of unidimensionality, there is still debate as to the efficacy of the many proposed methods for determining decision criteria for unidimensionality.

Conclusions and Recommendations

At the outset it was argued that unidimensionality is a critical and basic assumption of measurement theory. Two major issues have pervaded this review. First, there is a paucity of understanding as to the meaning of the term unidimensionality and how it is distinguished from related terms. Second, there are too many indices that have been developed on an ad hoc basis with little reference to their rationale or behavior and little, if any, comparison with other indices.

Definition

A major problem in assessing indices of unidimensionality has been that unidimensionality has been confused and used interchangeably with other terms such as reliability, internal consistency, and homogeneity. Consequently, an index is developed from some estimate of reliability, and then it is claimed that the index relates to unidimensionality. It is important that the meaning of these terms is clarified.

Reliability is classically defined as the ratio of true score variance to observed score variance. There are various methods for estimating reliability, such as test-retest, parallel forms, and split-half methods. The internal consistency notion always involves an internal analysis of the variances and covariances of the test items and depends on only one test administration. Methods of internal consistency include at least split-half coefficients, alpha, and KR-21. Yet, there are methods that satisfy these criteria that have not been classified as internal consistency measures (e.g., omega). It seems that internal consistency is defined primarily in terms of certain methods that have been used to index it.

Homogeneity has been used in two major ways. Lord and Novick (1968) and McDonald (1981), for example, used homogeneity as a synonym for unidimensionality, whereas others have used it specifically to refer to the similarity of the item intercorrelations. In the latter case a perfectly homogeneous test is one in which all the items intercorrelate equally; that is, the items all measure the construct or constructs equally. Thus, homogeneity is often a desirable quality, but there have been authors who have advocated that test constructors should not aim for high homogeneity. Cattell (1964, 1978; Cattell & Tsujioka, 1964) has been a principal adversary of aiming for high homogeneity, in this latter sense. He has noted that many authors desire high homogeneity, and he commented that aiming for high homogeneity leads to scales in which the same question is rephrased a dozen different ways. He argued that a test that includes many items that are almost repetitions of each other can cause an essentially "narrow specific" to be blown up into a "bloated specific" or pseudo-general factor. In Cattell and Tsujioka's colorful words:

the bloated specific will then 'sit on top' of the true general personality factor as firmly as a barnacle on a rock and its false variance will be inextricably included in every attempted prediction from the general personality factor. Moreover, the 'crime' will be as hard to detect, without a skillful factor analysis, as it is insidious in its effects, for the intense pursuit of homogeneity has ended in a systematically biased measure. (p. 8)

Internal consistency relates more to a set of methods and seems of limited usefulness. Homogeneity has been used in two senses, one as a synonym for unidimensionality and the other as a measure of equality of item intercorrelations. In the first sense, the term homogeneity is redundant and may be confusing, and in the second sense it may not be desirable. Whether internal consistency and homogeneity are meaningful terms in describing attributes of items and/or tests remains questionable.

Unidimensionality can be defined as the existence of one latent trait underlying the data. This definition is based on latent trait theory and is a specific instance of the principle of local independence, though it is not synonymous with it. As a consequence of this definition, it is probable that indices based on the goodness of fit of data to a comprehensive latent trait model may be effective indices of unidimensionality. The problems of such indices relate to ensuring that the correct latent trait model is chosen and that the parameters of the model are satisfactorily estimated. Rosenbaum's (1984) theorem is also worth further investigation because it is so soundly based. Thus, a unidimensional test is one that has one latent trait underlying the data, and such a test may or may not necessarily


be reliable, internally consistent, or homogeneous.
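A small simulated illustration of this definition may help: under a single latent trait, two items are nearly independent within any narrow band of the trait, yet correlate once the trait is allowed to vary across persons. The item parameters and sample size below are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=200_000)                 # one latent trait
p1 = 1 / (1 + np.exp(-(theta - 0.5)))            # two Rasch-type items
p2 = 1 / (1 + np.exp(-(theta + 0.5)))
x1 = (rng.random(theta.size) < p1).astype(int)
x2 = (rng.random(theta.size) < p2).astype(int)

# Conditional on a narrow band of the trait, the items are nearly uncorrelated ...
band = np.abs(theta - 1.0) < 0.05
print(round(float(np.corrcoef(x1[band], x2[band])[0, 1]), 3))
# ... but marginally, pooling over the trait, they are clearly positively correlated.
print(round(float(np.corrcoef(x1, x2)[0, 1]), 3))
```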
The Indices

Altogether, over 30 indices of unidimensionality have been identified, and these were grouped into five sections: methods based on (1) answer patterns, (2) reliability, (3) principal components, (4) factor analysis, and (5) latent traits. Some indices must fail (e.g., the ratio of eigenvalues), some are clearly suspect (e.g., alpha), others seem more appropriate in specific conditions (e.g., fit statistics from factor analyzing tetrachoric correlations), while others look promising (e.g., fit statistics from the Christoffersson, 1975, and Muthén, 1978, methods). The major reasons for many indices not being adequate indices of unidimensionality are that unit rank is desired and/or they are based on linear models.

It has been argued above that if a one-factor cubic provides good fit (from a nonlinear factor analysis), then the rank of the inter-item correlation or covariance matrix is three. The claim that unit rank is a necessary condition for unidimensionality is incorrect. Of the numerous methods based on unit rank, alpha has been used most often as an index of unidimensionality. There is no systematic relationship between the rank of a set of variables and how far alpha is below the true reliability. Further, alpha can be high even if there is no general factor, since (1) it is influenced by the number of items and parallel repetitions of items, (2) it increases as the number of factors pertaining to each item increases, and (3) it decreases moderately as the item communalities increase. It seems that modifications of alpha also suffer the same problems. Despite the common use of alpha as an index of unidimensionality, it does not seem to be justified.
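The point can be made concrete with a small simulation: twenty items that split evenly onto two uncorrelated factors still yield a high coefficient alpha. The loadings, the use of continuous item scores, and the sample size are arbitrary choices made for this sketch; the alpha formula is the standard one.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(2)
n, k = 2000, 20
f1, f2 = rng.normal(size=(2, n))                       # two uncorrelated factors
items = np.empty((n, k))
items[:, :10] = 0.7 * f1[:, None] + rng.normal(scale=0.7, size=(n, 10))
items[:, 10:] = 0.7 * f2[:, None] + rng.normal(scale=0.7, size=(n, 10))
print(round(float(coefficient_alpha(items)), 2))       # high (about .85), yet no general factor
```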
Beside the nonnecessity of unit rank, a further problem of many procedures is that a linear model cannot be assumed. When items are scored dichotomously, the use of a linear factor model and the use of phi or tetrachoric correlations are not appropriate, since they assume linearly related variables. Nonlinear factor analysis may be appropriate, but the present problems appear to relate to efficient computer programs for estimating the parameters and a lack of understanding of the behavior of indices based on nonlinear methods. Recent research and computer programs by Etezadi (1981), however, could change this situation.

The indices based on latent trait methods seem to be more justifiable. Yet, if the incorrect model is used, then the resulting indices must fail. It seems unlikely that indices based on the one-parameter or Rasch model will prove useful. The increasing number of indices based on the Rasch model is of concern, particularly since most seem to lack a clear (if any) rationale and there is no supporting evidence, such as simulation studies, to demonstrate their effectiveness. The methods of Christoffersson (1975), Muthén (1978), and McDonald (1982) are based on a weaker form of the principle of local independence, and it is likely that some function of the residuals after estimating the parameters may serve as an adequate index of unidimensionality.

Yet, there are still no known satisfactory indices. None of the attempts to investigate unidimensionality has provided clear decision criteria for determining it. What is needed is a Monte Carlo simulation to assess the various indices under known conditions. Such a simulation is outlined in Hattie (1984a). The simulation assessed the adequacy of most of the indices cited in this review. Data with known dimensionality were generated using a three-parameter latent trait model. Altogether, there were 36 models: two levels of difficulty (-2 to 2, -1 to 1), three levels of guessing (all .0, all .2, and a mixture of .0, .1, and .2), and six levels of dimensionality (one factor with mixed discrimination, one factor with discrimination all 1, two factors intercorrelated .1, two factors intercorrelated .5, five factors intercorrelated .1, and five factors intercorrelated .5).

Only four of the indices could consistently distinguish one-dimensional from more than one-dimensional data sets. The four indices were the sums of (absolute) residuals after fitting a two- or three-parameter latent trait model using NOHARM (Fraser, 1981; McDonald, 1982) or FADIV (Andersson, Christoffersson, & Muthén, 1974). The


advantages of using NOHARM are that it is computationally faster and it can handle large data sets.

In subsequent simulations, Hattie (1984b) has investigated decision rules based on these four indices. It seems that if the sum of (absolute) residuals after specifying one dimension is reasonably small, and if the sum of residuals after specifying two dimensions is not much smaller, then it can be confidently assumed that the set of items is unidimensional.
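A decision rule of this general shape could be sketched as follows. The fitting routine itself is not reproduced; a hypothetical fit_model function stands in for a NOHARM- or FADIV-like normal ogive fit, and the numerical thresholds are arbitrary placeholders rather than the criteria investigated by Hattie (1984b).

```python
import numpy as np

def sum_abs_residuals(sample_cov, fitted_cov):
    """Sum of absolute off-diagonal residuals left by a fitted model."""
    resid = sample_cov - fitted_cov
    off_diagonal = ~np.eye(resid.shape[0], dtype=bool)
    return float(np.abs(resid[off_diagonal]).sum())

def looks_unidimensional(data, fit_model, small=0.05, close=0.9):
    """Illustrative decision rule in the spirit of the one described above.

    fit_model(data, ndim) is a hypothetical routine (for example, a NOHARM-
    or FADIV-like normal ogive fit) returning the model-implied inter-item
    covariance matrix for the requested number of dimensions.  The thresholds
    `small` and `close` are arbitrary placeholders, not published criteria."""
    s = np.cov(data, rowvar=False)
    r1 = sum_abs_residuals(s, fit_model(data, ndim=1))
    r2 = sum_abs_residuals(s, fit_model(data, ndim=2))
    n_items = s.shape[0]
    mean_abs_residual = r1 / (n_items * (n_items - 1))
    return mean_abs_residual < small and r2 > close * r1
```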
Final Caveat

Finally, it must be considered that it may be unrealistic to search for indices of unidimensionality or sets of unidimensional items. It may be that unidimensionality could be ignored and other desirable qualities of a set of items sought, such as whether the estimates are consistent, or whether the estimates provide a useful and effective summary of the data. Certainly, psychological measurement has so far done without these indices. Moreover, it may be that a set of items will not be unidimensional except for the most simple variables, yet it seems reasonable to claim that unidimensional tests can be factorially complex. Maybe it is meaningful to quest after an index if the question is rephrased from "Is a test unidimensional or not?" to "Are there decision criteria that determine how close a set of items is to being a unidimensional set?"

It may be that multidimensional tests can be confused with unidimensional tests if the multiple dimensions have proportional contributions to each item. In such a case, scores on a test would represent a weighted composite of the many underlying dimensions. In some cases the problem is theoretical in that the labeling of such a test is the major concern. For example, with an arithmetic-reasoning test it can be argued that such a test could be unidimensional, whereas others may wish to argue that it is multidimensional. This is a concern for the test developer and user. It would be advantageous for an index of unidimensionality to distinguish between tests involving one and more than one latent trait, but if the two dimensions contribute equally to the item variances, this might not be possible. The use of second-order factor analysis specifying one second-order factor may be necessary to detect such unidimensional sets of items. The existence of more than one second-order factor is convincing evidence of a multidimensional data set. Certainly, identifying a unidimensional set of items and labeling the set are separate processes.

Throughout this review the issue has been the unidimensionality of an item set. Obviously, it must be remembered that dimensionality is a joint property of the item set, or the pool of which it is a sample, and a particular sample of examinees from its underlying population. Much recent research has indicated that the same set of test items may be attacked by persons using different cognitive strategies (Klich & Davidson, 1984). Although agreeing that this occurs, it still seems worthwhile to devise tests that measure similar content and/or styles and thus aim for more dependable information about individual differences.

Further, it may be that an act of judgment and not an index is required. Kelley (1942) argued that embodied in such concepts as unidimensionality is a belief or point of view of the investigator, such that an act of judgment is demanded when a researcher asserts that items measure the same thing. Thus, not only may it be possible to recognize by inspection whether one test appears to be unidimensional when compared to another, but also, even if there is an index, judgment must still be used when interpreting it, particularly as the sampling distribution for most indices is not known. An index must therefore be seen as only part, but probably a very important part, of the evidence used to determine the degree to which a test is unidimensional.

References

Andersson, C. G., Christoffersson, A., & Muthén, B. (1974). FADIV: A computer program for factor analysis of dichotomized variables (Report No. 74-1). Uppsala, Sweden: Uppsala University, Statistics Department.
Anderson, T. W. (1959). Some scaling models and estimation procedures in the latent class model. In U. Grenander (Ed.), Probability and statistics (pp. 9-38). New York: Wiley.


Andrich, D., & Godfrey, J. R. (1978-1979). Hierarchies in the skills of Davis' reading comprehension test, Form D: An empirical investigation using a latent trait model. Reading Research Quarterly, 2, 183-200.
Armor, D. J. (1974). Theta reliability and factor scaling. In H. L. Costner (Ed.), Sociological methodology (pp. 17-50). San Francisco CA: Jossey-Bass.
Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. Journal of Educational Statistics, 17, 282-296.
Bentler, P. (1972). A lower-bound method for the dimension-free measurement of internal consistency. Social Science Research, 1, 343-357.
Birenbaum, M., & Tatsuoka, K. (1982). On the dimensionality of achievement test data. Journal of Educational Measurement, 19, 259-266.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills CA: Sage.
Carroll, J. B. (1945). The effect of difficulty and chance success on correlation between items or between tests. Psychometrika, 10, 1-19.
Cattell, R. B. (1964). Validity and reliability: A proposed more basic set of concepts. Journal of Educational Psychology, 55, 1-22.
Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life sciences. New York: Plenum.
Cattell, R. B., & Tsujioka, B. (1964). The importance of factor-trueness and validity, versus homogeneity and orthogonality, in test scales. Educational and Psychological Measurement, 24, 3-30.
Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40, 5-32.
Cliff, N. (1977). A theory of consistency or ordering generalizable to tailored testing. Psychometrika, 42, 375-399.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Divgi, D. R. (1980, April). Dimensionality of binary items: Use of a mixed model. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston MA.
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.
Dubois, P. H. (1970). Varieties of psychological test homogeneity. American Psychologist, 25, 532-536.
Etezadi, J. (1981). A general polynomial model for nonlinear factor analysis. Dissertation Abstracts International, 42, 4342a.
Etezadi, J., & McDonald, R. P. (1983). A second generation nonlinear factor analysis. Psychometrika, 48, 315-342.
Fraser, C. (1981). NOHARM: A FORTRAN program for non-linear analysis by a robust method for estimating the parameters of 1-, 2-, and 3-parameter latent trait models. Armidale, Australia: University of New England, Centre for Behavioural Studies in Education.
Freeman, F. S. (1962). Theory and practice of psychological testing (3rd ed.). New York: Henry Holt.
Fuller, E. L., & Hemmerle, W. J. (1966). Robustness of the maximum-likelihood estimation procedure in factor analysis. Psychometrika, 31, 255-266.
George, A. A. (1979, April). Theoretical and practical consequences of the use of standardized residuals as Rasch model fit statistics. Paper presented at the annual meeting of the American Educational Research Association, San Francisco CA.
Gage, N. L., & Damrin, D. E. (1950). Reliability, homogeneity, and number of choices. Journal of Educational Psychology, 41, 385-404.
Gibson, W. A. (1959). Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis. Psychometrika, 24, 229-252.
Gibson, W. A. (1960). Nonlinear factors in two dimensions. Psychometrika, 25, 381-392.
Goodman, L. A., & Kruskal, W. H. (1979). Measures of association for cross classifications. New York: Springer-Verlag.
Gorsuch, R. L. (1974). Factor analysis. Philadelphia PA: Saunders.
Gourlay, N. (1951). Difficulty factors arising from the use of the tetrachoric correlations in factor analysis. British Journal of Statistical Psychology, 4, 65-72.
Green, B. F. (1956). A method of scalogram analysis using summary statistics. Psychometrika, 21, 79-88.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-838.
Greene, V. C., & Carmines, E. G. (1980). Assessing the reliability of linear composites. In K. F. Schuessler (Ed.), Sociological methodology (pp. 160-175). San Francisco CA: Jossey-Bass.
Guilford, J. P. (1941). The difficulty of a test and its factor composition. Psychometrika, 6, 67-77.
Guilford, J. P. (1965). Fundamental statistics in psychology and education (4th ed.). New York: McGraw-Hill.
Gustafsson, J. E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.


Gustafsson, J. E., & Lindblad, T. (1978). The Rasch model for dichotomous items: A solution of the conditional estimation problem for long tests and some thoughts about item screening procedures (Report No. 67). Göteborg, Sweden: University of Göteborg, Institute of Education.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 80, 139-150.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.
Guttman, L. (1950). The principal components of scale analysis. In S. S. Stouffer (Ed.), Measurement and prediction (pp. 312-361). Princeton NJ: University Press.
Haley, D. C. (1952). Estimation of dosage mortality relationships when the dose is subject to error (Report No. 15). Stanford CA: Stanford University, Applied Mathematics and Statistics Laboratory.
Hambleton, R. K. (1969). An empirical investigation of the Rasch test theory model. Unpublished doctoral dissertation, University of Toronto, Canada.
Hambleton, R. K. (1980). Latent ability scales: Interpretations and uses. New Directions for Testing and Measurement, 6, 73-97.
Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., & Gifford, J. A. (1978). Developments in latent trait theory: Models, technical issues, and applications. Review of Educational Research, 48, 467-510.
Hambleton, R. K., & Traub, R. E. (1973). Analysis of empirical data using two logistic latent trait models. British Journal of Mathematical and Statistical Psychology, 26, 195-211.
Harman, H. H. (1979). Modern factor analysis (3rd ed.). Chicago: University of Chicago Press.
Hattie, J. A. (1979). A confirmatory factor analysis of the Progressive Achievement Tests: Reading comprehension, reading vocabulary, and listening comprehension tests. New Zealand Journal of Educational Studies, 14, 172-188. (Errata, 1980, 15, 109.)
Hattie, J. A. (1980). The Progressive Achievement Tests revisited. New Zealand Journal of Educational Studies, 15, 194-197.
Hattie, J. A. (1981). A four stage factor analytic approach to studying behavioral domains. Applied Psychological Measurement, 5, 77-88.
Hattie, J. A. (1984a). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19, 49-78.
Hattie, J. A. (1984b). Some decision rules for determining unidimensionality. Unpublished manuscript, University of New England, Centre for Behavioural Studies, Armidale, Australia.
Hattie, J. A., & Hansford, B. F. (1982). Communication apprehension: An assessment of Australian and United States data. Applied Psychological Measurement, 6, 225-233.
Heise, D. R., & Bohrnstedt, G. W. (1970). Validity, invalidity, and reliability. In E. F. Borgatta & G. W. Bohrnstedt (Eds.), Sociological methodology (pp. 104-129). San Francisco CA: Jossey-Bass.
Hertzman, M. (1936). The effects of the relative difficulty of mental tests on patterns of mental organization. Archives of Psychology, No. 197.
Hoffman, R. J. (1975). The concept of efficiency in item analysis. Educational and Psychological Measurement, 35, 621-640.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
Horn, J. L. (1969). On the internal consistency and reliability of factors. Multivariate Behavioral Research, 4, 115-125.
Horrocks, J. E., & Schoonover, T. I. (1968). Measurement for teachers. Columbus OH: Merrill.
Horst, P. (1953). Correcting the Kuder-Richardson reliability for dispersion of item difficulties. Psychological Bulletin, 50, 371-374.
Hulin, C. L., Drasgow, F., & Parsons, C. (1983). Item response theory: Applications to psychological measurement. Homewood IL: Dow Jones-Irwin.
Humphreys, L. G. (1949). Test homogeneity and its measurement. American Psychologist, 4, 245. (Abstract)
Humphreys, L. G. (1952). Individual differences. Annual Review of Psychology, 3, 131-150.
Humphreys, L. G. (1956). The normal curve and the attenuation paradox in test theory. Psychological Bulletin, 53, 472-476.
Hutten, L. (1979, April). An empirical comparison of the goodness of fit of three latent trait models to real test data. Paper presented at the annual meeting of the American Educational Research Association, San Francisco CA.
Hutten, L. (1980, April). Some empirical evidence for latent trait model selection. Paper presented at the annual meeting of the American Educational Research Association, Boston MA.
Jackson, J. M. (1949). A simple and more rigorous technique for scale analysis. In A manual of scale analysis, Part II. Montreal: McGill University.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443-447.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by the method of maximum likelihood. Chicago: National Educational Resources.
Kaiser, H. F. (1968). A measure of the average intercorrelation. Educational and Psychological Measurement, 28, 245-247.
Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35, 401-415.


Kelley, T. L. (1935). Essential traits of mental life. Harvard Studies in Education, 26, 146-153.
Kelley, T. L. (1942). The reliability coefficient. Psychometrika, 7, 75-83.
Klich, L. Z., & Davidson, G. R. (1984). Toward a recognition of Australian Aboriginal competence in cognitive formation. In J. Kirby (Ed.), Cognitive strategies and educational performance (pp. 155-202). New York: Academic Press.
Koch, W. R., & Reckase, M. D. (1979, September). Problems in application of latent trait models to tailored testing (Research Report 79-1). Columbia MO: University of Missouri, Educational Psychology Department, Tailored Testing Research Laboratory.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Laforge, R. (1965). Components of reliability. Psychometrika, 30, 187-195.
Lam, R. (1980). An empirical study of the unidimensionality of Ravens Progressive Matrices. Unpublished master's thesis, University of Toronto, Canada.
Lim, T. P. (1974). Estimation of probabilities of dichotomous response patterns using a simple linear model. Unpublished doctoral dissertation, University of Toronto, Canada.
Linn, R. L. (1968). A Monte Carlo approach to the number of factors problem. Psychometrika, 33, 37-71.
Loevinger, J. (1944). A systematic approach to the construction of tests of ability. Unpublished doctoral dissertation, University of California.
Loevinger, J. (1947). A systematic approach to the construction and evaluation of tests of ability. Psychological Monograph, 61 (No. 4, Whole No. 285).
Loevinger, J. (1948). The technique of homogeneous tests. Psychological Bulletin, 45, 507-529.
Lord, F. M. (1953). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517-548.
Lord, F. M. (1958). Some relations between Guttman's principal components of scale analysis and other psychometric theory. Psychometrika, 23, 291-296.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. New York: Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Wesley.
Lumsden, J. (1957). A factorial approach to unidimensionality. Australian Journal of Psychology, 9, 105-111.
Lumsden, J. (1959). The construction of unidimensional tests. Unpublished master's thesis, University of Western Australia, Perth, Australia.
Lumsden, J. (1961). The construction of unidimensional tests. Psychological Bulletin, 58, 122-131.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
McDonald, R. P. (1962a). A general approach to nonlinear factor analysis. Psychometrika, 27, 397-415.
McDonald, R. P. (1962b). A note on the derivation of the general latent class model. Psychometrika, 27, 203-206.
McDonald, R. P. (1965a). Difficulty factors and nonlinear factor analysis. British Journal of Mathematical and Statistical Psychology, 18, 11-23.
McDonald, R. P. (1965b). Numerical polynomial models in nonlinear factor analysis (Report No. 65-32). Princeton NJ: Educational Testing Service.
McDonald, R. P. (1967a). Factor interaction in nonlinear factor analysis. British Journal of Mathematical and Statistical Psychology, 20, 205-215.
McDonald, R. P. (1967b). Numerical methods for polynomial models in non-linear factor analysis. Psychometrika, 32, 77-112.
McDonald, R. P. (1967c). Nonlinear factor analysis. Psychometric Monographs (No. 15).
McDonald, R. P. (1970). The theoretical foundations of principal factor analysis, canonical factor analysis, and alpha factor analysis. British Journal of Mathematical and Statistical Psychology, 23, 1-21.
McDonald, R. P. (1976, April). Nonlinear and nonmetric common factor analysis. Paper presented to the Psychometric Society, Murray Hill SC.
McDonald, R. P. (1979). The structural analysis of multivariate data: A sketch of general theory. Multivariate Behavioral Research, 14, 21-38.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psychological Measurement, 6, 379-396.
McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, 82-99.
McDonald, R. P., & Fraser, C. (1985). A robustness study comparing estimation of the parameters of the two-parameter latent trait model. Unpublished manuscript.
Mosier, C. I. (1936). A note on item analysis and the criterion of internal consistency. Psychometrika, 1, 275-282.
Mosier, C. I. (1940). Psychophysics and mental test theory: Fundamental postulates and elementary theorems. Psychological Review, 47, 355-366.
Muthén, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551-560.


Muthén, B. (1981). Factor analysis of dichotomous variables: American attitudes toward abortion. In D. J. Jackson & E. F. Borgatta (Eds.), Factor analysis and measurement in sociological research: A multi-dimensional perspective (pp. 201-214). London: Sage.
Muthén, B., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46, 407-419.
Nishisato, S. (1966). Minimum entropy clustering of test items. Dissertation Abstracts International, 27, 2861B.
Nishisato, S. (1970a). Structure and probability distribution of dichotomous response patterns. Japanese Psychological Research, 12, 62-74.
Nishisato, S. (1970b). Probability estimates of dichotomous response patterns by logistic fractional factorial representation. Japanese Psychological Research, 12, 87-95.
Nishisato, S. (1971). Information analysis of binary response patterns. In S. Takagi (Ed.), Modern psychology and quantification method. Tokyo: University of Tokyo Press.
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto, Canada: University of Toronto Press.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.
Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw-Hill.
Payne, D. A. (1968). The specification and measurement of learning. Waltham MA: Blaisdell.
Raju, N. S. (1980, April). Kuder-Richardson Formula 20 and test homogeneity. Paper presented at the National Council for Measurement in Education, Boston MA.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielson & Lydiche.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-333.
Rasch, G. (1966a). An individualistic approach to item analysis. In P. Lazarsfeld & N. V. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago: Science Research Association.
Rasch, G. (1966b). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.
Rasch, G. (1968). A mathematical theory of objectivity and its consequences for model construction. Paper read at the European Meeting of Statistics, Econometrics and Management Science, Amsterdam.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Rentz, R. R., & Rentz, C. C. (1979). Does the Rasch model really work? NCME Measurement in Education, 10, 1-11.
Reynolds, H. T. (1977). The analysis of cross-classifications. New York: Free Press.
Reynolds, T. (1981). Ergo: A new approach to multidimensional item analysis. Educational and Psychological Measurement, 41, 643-660.
Rogers, H. J. (1984). Fit statistics for latent trait models. Unpublished master's thesis, University of New England, Armidale, Australia.
Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425-435.
Ryan, J. P. (1979, April). Assessing unidimensionality in latent trait models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco CA.
Shostrom, E. L. (1972). Personal Orientation Inventory: An inventory for the measurement of self-actualization. San Diego CA: EdITS.
Silverstein, A. B. (1980). Item intercorrelations, item-test correlations and test reliability. Educational and Psychological Measurement, 40, 353-355.
Smith, K. W. (1974a). Forming composite scales and estimating their validity through factor analysis. Social Forces, 53, 169-180.
Smith, K. W. (1974b). On estimating the reliability of composite indexes through factor analysis. Sociological Methods and Research, 4, 485-510.
Spearman, C. (1927). The abilities of man: Their nature and measurement. London: Macmillan.
Svoboda, M. (1972). Distribution of binary information in multidimensional space. Unpublished master's thesis, University of Toronto, Canada.
Terwilliger, J. S., & Lele, K. (1979). Some relationships among internal consistency, reproducibility, and homogeneity. Journal of Educational Measurement, 16, 101-108.
Thurstone, L. L. (1935). The vectors of the mind. Chicago: University of Chicago Press.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-459.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
van den Wollenberg, A. L. (1982a). A simple and effective method to test the dimensionality axiom of the Rasch model. Applied Psychological Measurement, 6, 83-91.


van den Wollenberg, A. L. (1982b). Two new test statistics for the Rasch model. Psychometrika, 42, 123-140.
Walker, D. A. (1931). Answer-pattern and score scatter in tests and examinations. British Journal of Psychology, 22, 73-86.
Watkins, D., & Hattie, J. A. (1980). An investigation of the internal structure of the Biggs' study process questionnaire. Educational and Psychological Measurement, 40, 1125-1130.
Wherry, R. J., & Gaylord, R. H. (1944). Factor pattern of test items and tests as a function of the correlation coefficient: Content, difficulty, and constant error factors. Psychometrika, 9, 237-244.
White, B. W., & Saltz, E. (1957). Measurement of reproducibility. Psychological Bulletin, 54, 81-99.
Wise, S. L. (1982, March). Using partial orders to determine unidimensional item sets appropriate for item response theory. Paper presented at the annual meeting of the National Council for Measurement in Education, New York.
Wise, S. L. (1983). Comparisons of order analysis and factor analysis in assessing the dimensionality of binary data. Applied Psychological Measurement, 7, 311-312.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
Wright, B. D., Mead, R. J., & Bell, S. R. (1979). BICAL: Calibrating items with the Rasch model (Research Report 23-B). Chicago IL: University of Chicago, Department of Education, Statistical Laboratory.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.

Acknowledgments

The author gratefully acknowledges the contributions of R. P. McDonald, S. Nishisato, R. Traub, M. Wahlstrom, W. Olphert, and reviewers whose criticisms greatly improved this paper.

Author's Address

Send requests for further information to John A. Hattie, Centre for Behavioural Studies, University of New England, Armidale, NSW 2351, Australia.

Reprints

Reprints of this article may be obtained prepaid for $2.50 (U.S. delivery) or $3.00 (outside U.S.; payment in U.S. funds drawn on a U.S. bank) from Applied Psychological Measurement, N658 Elliott Hall, University of Minnesota, Minneapolis, MN 55455, U.S.A.
