Jackson 1993 - Stopping Rules in PCA

Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical
Approaches
Author(s): Donald A. Jackson
Source: Ecology, Vol. 74, No. 8 (Dec., 1993), pp. 2204-2214
Published by: Ecological Society of America
Stable URL: http://www.jstor.org/stable/1939574 .
Ecology, 74(8), 1993, pp. 2204-2214
© 1993 by the Ecological Society of America

STOPPING RULES IN PRINCIPAL COMPONENTS ANALYSIS: A COMPARISON OF HEURISTICAL AND STATISTICAL APPROACHES¹

DONALD A. JACKSON²
Department of Zoology, University of Toronto, Toronto, Ontario, Canada M5S 1A1
Abstract. Approaches to determining the number of components to interpret from principal components analysis were compared. Heuristic procedures included: retaining components with eigenvalues (λs) > 1 (i.e., Kaiser-Guttman criterion); components with bootstrapped λs > 1 (bootstrapped Kaiser-Guttman); the scree plot; the broken-stick model; and components with λs totalling to a fixed amount of the total variance. Statistical approaches included: Bartlett's test of sphericity; Bartlett's test of homogeneity of the correlation matrix; Lawley's test of the second λ; bootstrapped confidence limits on successive λs (i.e., significant differences between λs); and bootstrapped confidence limits on eigenvector coefficients (i.e., coefficients that differ significantly from zero). All methods were compared using simulated data matrices of uniform correlation structure, patterned matrices of varying correlation structure, and data sets of lake morphometry, water chemistry, and benthic invertebrate abundance. The most consistent results were obtained from the broken-stick model and a combined measure using bootstrapped λs and associated eigenvector coefficients. The traditional and bootstrapped Kaiser-Guttman approaches overestimated the number of nontrivial dimensions, as did the fixed-amount-of-variance model. The scree plot consistently estimated one dimension more than the number of simulated dimensions. Bartlett's test of sphericity showed inconsistent results. Both Bartlett's test of homogeneity of the correlation matrix and Lawley's test are limited to testing for only one and two dimensions, respectively.

Key words: bootstrap; eigenvalues; multivariate; ordination; principal components analysis; statistics; stopping rules.
INTRODUCTION

Although ecologists have compared the results of various ordination methods (see Gauch 1982a, Pielou 1984, Digby and Kempton 1987, Minchin 1987 for comparisons), few guidelines exist to evaluate how many ordination axes should be considered nontrivial and interpretable. An implicit assumption in the use of ordination methods is that the experienced ecologist can separate meaningful patterns from random noise (i.e., ecologically meaningful information vs. sampling variation or measurement error; Gauch 1982b). An ability to distinguish "signal" from "noise" is essential and, in a statistical sense, these decisions provide "stopping rules." The failure to distinguish between signal and noise may lead to the rejection of useful information or the interpretation of ecologically meaningless information. In the former case, a loss of information may limit our understanding of ecological processes. In the latter case, erroneous conclusions may result as we would be interpreting essentially meaningless patterns (Jackson et al. 1992). Rexstad et al. (1988) questioned the value of multivariate analyses after their study demonstrated "significant" results from randomly collected data.

My study examines the issue of assessing multivariate data dimensionality using both heuristic and statistical approaches. I restrict this study to principal components analysis (PCA) because it represents the simplest and most commonly used multivariate method. In many instances, these results can be extrapolated to related multivariate techniques. I used a parallel analysis of field and simulated data to examine the implications of different: (1) degrees or strength of intervariable correlations; (2) numbers of variables; and (3) structure within correlation matrices (i.e., blocks of correlated variables that are uncorrelated with other variables). These three conditions are important factors in determining the success of a multivariate analysis. The types of data ecologists analyze often lead to different degrees of correlations. For example, variables like the morphology of organisms or lakes often show strongly correlated variables whereas correlations of the abundance of organisms may be weaker. Studies often differ in their ratio of the number of observations relative to variables in the analysis. Although some researchers recognize the importance of having a ratio of 3:1 or greater to provide a stable solution (see Grossman et al. 1991), it is not uncommon to find studies having a lower observation:variable ratio than recommended.

¹ Manuscript received 10 March 1992; revised 11 March 1993; accepted 15 March 1993.
² Present address: Department of Zoology, University of Western Ontario, London, Ontario, Canada N6A 5B7.

Often the implication for this latter situation remains unrecognized. Within data sets, there may be groups of variables (e.g., species) that are highly correlated with one another, but uncorrelated with other groups. The within- and between-group structure also contributes to substantial effects within principal components analysis. The use of simulated data permits data to be generated having an a priori underlying dimensionality such that the various methods may be compared relative to known values (i.e., the true number of nontrivial components; see Lambert et al. 1990 for a discussion).

METHODS

Data matrices - ecological

Three ecological data sets were used for comparisons with the simulated data. The first data set was based on four lake morphological variables from 40 lakes from south-central Ontario. These variables were associated strongly with one another and generally had correlations between 0.6 and 0.9. The second matrix included measurements on 12 chemical elements or compounds from the lakes. Correlations for this matrix ranged from near 0 to 0.7. The third matrix comprised abundance measurements on 32 benthic invertebrate taxa from the same 40 lakes. The correlations between these taxa varied between 0 and 0.7 with most correlations between 0.3 and 0.4. All data in these ecological data sets were transformed to linearize intervariable relationships and approximate normal distributions (see Jackson 1992 for details).

Simulated data matrices - uniform correlation

Normally distributed data were simulated to match the number of variables in the three ecological data sets. Population data matrices were constructed with 4, 12, and 32 variables and 1000 observations. Matrices were simulated having three levels of overall correlation structure. For each population, the correlations were uniformly generated to be 0, 0.3, or 0.8 for all off-diagonal correlations (i.e., R0, R3, and R8). This approach generated matrices having no interpretable dimensions (i.e., R0), a weak one-dimensional structure (R3), and a strong one-dimensional structure (R8). Analyses were done on three replicate samples of 40 observations each drawn from the population of 1000 individuals. This approach was used to assess the ability of the methods to correctly resolve the dimensionality of the population from the analysis of a sample. This parallels the same problems confronting ecologists when analyzing field data.

Structured correlation

For comparison, matrices were simulated with three-dimensional structure. These three-dimensional matrices also varied as to whether submatrices were uncorrelated or weakly correlated with one another. In the first set of simulations, 12-variable matrices were divided into groups of four variables each. Within-group correlations were equal to either 0.3 or 0.8 whereas the between-group correlations were equal to either 0 or 0.3 (Fig. 1 for correlation matrices from simulations S-I to S-IV). This approach was used to simulate S-I to S-III in order to examine the effect of differing degrees of strength in correlation structure. To study the effect of groups having different numbers of constituent variables, the matrices for S-IV used groups containing 5, 4, and 3 variables, respectively. Within-group correlations were set to 0.8 and between-group correlations to 0. From each population, three replicates of 40 observations were sampled.

The simulation resulted in a 3 × 3 × 3 design based on the number of variables, level of correlation, and replication for the uniform matrices and 4 matrices × 3 replicates for the structured-correlation matrices. Matrix names are coded such that the first number indicates the number of variables and the subsequent alphanumeric indicates the degree of correlation within the matrix. For example, 12R3 is a 12-variable matrix with all intervariable correlations equal to 0.3.

Fig. 1 panels (block correlation matrices):
S-I: variable groups 1-4, 5-8, 9-12; within-group r = 0.8; between-group r = 0.0
S-II: variable groups 1-4, 5-8, 9-12; within-group r = 0.3; between-group r = 0.0
S-III: variable groups 1-4, 5-8, 9-12; within-group r = 0.8; between-group r = 0.3
S-IV: variable groups 1-5, 6-9, 10-12; within-group r = 0.8; between-group r = 0.0

FIG. 1. Correlation structure for the patterned matrices used in S-I to S-IV. The values presented are the off-diagonal intervariable correlations. For example in S-I, variables 1-4 were correlated with one another at r = 0.8, as were variables 5-8, and 9-12. However, correlations between variables from different submatrices were equal to 0.

Statistical analysis

Principal components analyses were conducted using the PRINCOMP procedure in SAS (SAS 1989).

Methods of component assessment: heuristic approaches

1. Kaiser-Guttman. The most common stopping rule in principal components analysis (PCA) is based on the average value of the eigenvalues (i.e., the Kaiser-Guttman criterion; Guttman 1954, Cliff 1988; H. Kaiser, unpublished manuscript). Because variables are often measured in different units, most ecologists use a correlation matrix in PCA, thereby giving each variable equal weight in the analysis. As a result, the sum of the eigenvalues equals the number of variables. In the Kaiser-Guttman method, eigenvalues greater than the average eigenvalue (i.e., λ > 1.0) are retained because these axes summarize more information than any single original variable. Therefore, only components with λ > 1.0 are interpreted. Unfortunately, a PCA of randomly generated, uncorrelated data will produce eigenvalues exceeding one. As a result, this method has been criticized (e.g., Karr and Martin 1981, Stauffer et al. 1985, Rexstad et al. 1986, 1988, Grossman et al. 1991); however, the Kaiser-Guttman criterion remains the most popular stopping rule in ecology.

2. Bootstrapped Kaiser-Guttman. The bootstrap resampling technique (Efron 1979) was proposed as a means of determining the interpretability of eigenvalues by Lambert et al. (1990). They argued that the Kaiser-Guttman criterion was arbitrary and it ignored error associated with each λ due to sampling. Consequently, eigenvalues of 0.99 would be discarded, whereas an eigenvalue of 1.01 would be retained even though an eigenvalue of 1.01 may have 95% confidence limits ranging from 0.9 to 1.1. As a result, they proposed that the bootstrap should be used to determine how many eigenvalues had confidence limits encompassing the 1.0 criterion (i.e., a bootstrap Kaiser-Guttman approach).

3. Scree plot. Another common method (although used infrequently by ecologists; e.g., Zebra and Collins [1992]) is the scree plot. To apply the scree method, one plots the value of each successive eigenvalue against the rank order (Fig. 2; the log of the eigenvalues also can be used with covariance-based PCAs). The smaller eigenvalues, representing random variation, tend to lie along a straight line. The point where the first few eigenvalues depart from the line distinguishes the "interpretable" and trivial components. Cattell (1966) originally proposed that points to the left of the straight-line segment should be considered important (i.e., three components in the structured data of Fig. 2), but subsequently concluded (Cattell and Vogelmann 1977) that the first eigenvalue to the right of this point should be included also (i.e., four interpretable components in Fig. 2). Often the scree approach is complicated by either the lack of any obvious break or the possibility of multiple break points.

Horn (1965, Horn and Engstrom 1979) recognized that with matrices composed of random data, the scree plot would show a stable negative slope. Horn argued that distinguishing eigenvalues in scree plots remained quite arbitrary. As a result, he proposed a modification to the scree plot (Horn 1965). After analyzing a given data set and plotting the eigenvalues in a traditional scree plot, numerous matrices of rank equal to the observed data, but with uncorrelated variables, are generated and eigenvalues are calculated. These eigenvalues from the random data are tabulated and the mean values plotted on the scree plot of the original data. The point where the two lines cross indicates the maximum limit where eigenvalues are considered interpretable.

FIG. 2. Eigenvalues from a principal components analysis of a 12-variable data set of randomly generated, uncorrelated data and for a data set with underlying structure. (Horizontal axis: Component Number, 1-12; plotted series: Structured Data and Random Data.)
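
Horn's modification of the scree plot can be illustrated with a short numerical sketch. The Python code below assumes NumPy is available and compares the observed eigenvalues with the mean eigenvalues of uncorrelated normal data of the same dimensions. The function names, the number of random matrices (200), the seeds, and the example data (three blocks of four variables with within-block r = 0.8) are illustrative assumptions; none of this reproduces the software or settings of the original analysis.

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix of `data`, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def parallel_analysis(data, n_random=200, seed=0):
    """Count the leading observed eigenvalues that exceed the mean
    eigenvalues of uncorrelated normal data with the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = correlation_eigenvalues(data)
    random_mean = np.mean(
        [correlation_eigenvalues(rng.standard_normal((n, p)))
         for _ in range(n_random)],
        axis=0,
    )
    above = observed > random_mean
    # Position of the first crossing of the two curves; p if they never cross.
    n_keep = p if above.all() else int(np.argmin(above))
    return n_keep, observed, random_mean


# Hypothetical example: 40 observations on 12 variables in three blocks.
rng = np.random.default_rng(42)
blocks = np.kron(np.eye(3), np.full((4, 4), 0.8))
np.fill_diagonal(blocks, 1.0)
data = rng.multivariate_normal(np.zeros(12), blocks, size=40)
n_keep, observed, random_mean = parallel_analysis(data)
print(n_keep)
```

Because the random reference curve is itself estimated by simulation, the count can change slightly with `n_random` and the seed when an observed eigenvalue lies close to the reference curve.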

Further variations of this method have included regression or Monte Carlo approaches (e.g., Allen and Hubbard 1986, Lautenschlager 1989).

4. Broken-stick. Frontier (1976) proposed a broken-stick method that is based on eigenvalues from random data. Frontier's model assumes that if the total variance (i.e., sum of the eigenvalues) is divided randomly amongst the various components, then the expected distribution of the eigenvalues will follow a broken-stick distribution (i.e., the random data in Fig. 2). Observed eigenvalues are considered interpretable if they exceed eigenvalues generated by the broken-stick model. Frontier (1976) and Legendre and Legendre (1983) provide a table of eigenvalues based on the broken-stick distribution, but the solution is easily calculated as:

$$ b_k = \sum_{i=k}^{p} \frac{1}{i} $$

where p is number of variables and b_k is the size of the eigenvalue for the kth component under the broken-stick model.

5. Proportion of total variance. Another simple criterion for estimating the number of nontrivial components is to include all components up to some arbitrary proportion of the total variance. This method typically includes components comprising 95% of the total variance. Although this method is advocated by some statisticians (Jolliffe 1986), Jackson (1991) strongly recommended against its application as being unfounded and unreliable.

Statistical approaches. Some data analysts retain components with significant correlations (e.g., P < .05) between the component scores and the original variables. Statistically this approach is flawed because the PCA solution and original variables are not independent, and as a result, the attributed significance is inappropriate. In addition, components with only a single "significant" correlation suggest that the axis is not a satisfactory multivariate summary.

6. Test of sphericity. Bartlett's test of sphericity (Cooley and Lohnes 1971, Pimentel 1979) evaluates whether each sequential eigenvalue is significantly different from the remaining eigenvalues. Conceptually, the test attempts to reveal the point where the PCA summarizes a spherical distribution of points. The test statistic is calculated as

$$ (p - k) \ln\left[ \sum_{i=k+1}^{p} \lambda_i / (p - k) \right] - \sum_{i=k+1}^{p} \ln \lambda_i $$

where p is the number of variables, k represents a specific component, λ_i is the eigenvalue of component i, and n is the number of observations. If the resultant statistic is multiplied by n - k, the product is χ² distributed with 0.5(p - k - 1)(p - k + 2) degrees of freedom. This test was originally developed for a covariance matrix and several studies recommend its use only in covariance-based analyses (Dillon and Goldstein 1984, Morrison 1990, Grossman et al. 1991, Jackson 1991). However, the test can be used with a correlation matrix where such results are considered to be conservative estimates of the number of nontrivial components (Pimentel 1979, Kendall 1980).

7. Bartlett's test of the equality of λ₁. Bartlett (1954) also developed a statistical test of whether the first eigenvalue of a correlation matrix is equal to the remaining set of eigenvalues (i.e., correlation matrix homogeneity). A modified Bartlett's test (Box 1949, Krzanowski 1988) is calculated as

$$ \chi^2 = -\left[ n - \tfrac{1}{6}(2p + 11) \right] \ln |R| $$

where |R| is the determinant of the correlation matrix, and the test has p(p - 1)/2 degrees of freedom. The test is limited because it only examines the first eigenvalue. However, it provides an assessment of the overall PCA (i.e., if the null hypothesis is not rejected, it is pointless to interpret the PCA).

8. Lawley's test of λ₂. Lawley (1956, 1963) proposed a method to test for the equality of the p - 1 eigenvalues (i.e., all but the first eigenvalue). It is based on the following

$$ \chi^2 = \frac{n - 1}{\hat{\lambda}^2} \left[ \sum\sum_{i<j} (r_{ij} - \bar{r})^2 - \hat{\mu} \sum_{k=1}^{p} (\bar{r}_k - \bar{r})^2 \right] $$

where r_ij is the correlation between variable i and variable j and

$$ \bar{r} = \frac{2}{p(p - 1)} \sum\sum_{i<j} r_{ij}, \qquad \hat{\lambda} = 1 - \bar{r}, \qquad \hat{\mu} = \frac{(p - 1)^2 (1 - \hat{\lambda}^2)}{p - (p - 2)\hat{\lambda}^2}, \qquad \bar{r}_k = \frac{1}{p - 1} \sum_{i=1,\, i \neq k}^{p} r_{ik}, \quad k = 1, \ldots, p $$

with (p + 1)(p - 2)/2 degrees of freedom. One limitation of this approach is that the test only evaluates the second eigenvalue. Subsequent eigenvalues are not compared when the null hypothesis is rejected. This approach has been applied in a recent study of principal components analysis with ecological data (Grossman et al. 1991).

9. Bootstrap eigenvalue-eigenvector. In this study, each principal components analysis was bootstrapped 100 times. Eigenvalues for each bootstrap sample and the associated eigenvector coefficients were retained. Means, minima, maxima, and 95% confidence limits were calculated from the distribution of the eigenvalues. Where the confidence intervals overlapped between pairs of successive eigenvalues, these eigenvalues were considered to be indistinguishable from one another. However, if the ranges did not overlap, the eigenvalues were assumed to be different. This latter condition was considered to represent the break-point between "meaningful" or nontrivial components and those associated with sampling and random noise.

Similarly, the eigenvector coefficients were evaluated using a bootstrap approach. Coefficients that did not differ significantly from zero were categorized as trivial or nonsignificant. However, if zero fell outside the 95% confidence limits, then the coefficient was considered to be relatively stable and informative. Only bootstrapped components having two or more coefficients different from zero were considered meaningful. Components with only a single nonzero coefficient represented only a single variable, hence the component does not provide a true multivariate summary.

As a means of evaluating the overall similarity among the different approaches with the different data sets, a multivariate summary was done. The number of nontrivial components for each method from each data set was used as an input matrix. For example, the Kaiser-Guttman method had a value for each of 4R0-A, 4R0-B, ... S-IV-C. The number of dimensions that were simulated for each data set was included as an additional observation. A Euclidean distance matrix was calculated between the methods and a principal coordinates analysis done on the distance matrix to graphically integrate the differences among approaches across all the data sets (e.g., see Jackson and Somers 1991).

RESULTS

1. Kaiser-Guttman approach (λ > 1.0). For matrices with correlations of R0 or R3, the Kaiser-Guttman method retained ≈50% of the components for the 4- and 12-variable matrices and 30-40% of the components in the 32-variable matrices (Tables 1-3). For each R8 matrix, only one eigenvalue exceeded 1.0, indicating a single interpretable gradient. The Kaiser-Guttman method of interpreting λs > 1.0 indicated all PCAs from S-I to S-IV contained three interpretable components (i.e., within each analysis there were three eigenvalues exceeding 1.0; Table 4). For the S-II matrices the approach indicated retaining five components although only three dimensions were constructed in the simulations.

2. Bootstrapped Kaiser-Guttman. The bootstrap of the eigenvalues using the Kaiser-Guttman approach resulted in only one component being considered nontrivial with each of the 4-variable matrices (i.e., 4R0, 4R3, and 4R8). Four components were retained with 12R0 PCAs and 2-3 components for the 12R3 matrices. The 32R0 matrices had 9-10 components retained and 8 components for the 32R3 matrices. For all R8 matrices (i.e., 4R8, 12R8, and 32R8), only the first component was considered nontrivial using this method. In the structured matrices, 3-4 components were identified as nontrivial in the low-correlation S-II matrices, whereas only two components were retained from S-III, and three components from the other analyses (i.e., S-I and S-IV).

Both versions of the Kaiser-Guttman approach indicated a single interpretable dimension with the 4-variable matrix of lake morphometry. The PCA based on the 12-variable matrix of water chemistry revealed three nontrivial components with λs > 1.0 retained. However, the bootstrapped evaluation suggested that only the first eigenvalue was significantly greater than 1.0. Both the traditional and bootstrapped approaches indicated that nine and eight components, respectively, were nontrivial in the 32-variable matrix of benthic invertebrates, similar to results for the 32R0 and 32R3 matrices.

3. Scree plot. Results from the scree plot based on 4-variable matrices were difficult to interpret. In some cases, trends were apparent, but in other cases it is difficult to discern any pattern in the plot because only four points were available. Where a trend was apparent, the approach advocated by Cattell and Vogelmann (1977) suggested that two components were nontrivial.
TABLE 1. Number of nontrivial components indicated by various methods. Simulated data matrices have four variables and 40 observations with uniform correlations as follows: 4R0 has uniform correlation structure of r = 0, 4R3 has uniform correlation structure of r = 0.3, and 4R8 has uniform correlation structure of r = 0.8. The letters A-C indicate replicates drawn from a simulated population having that correlation structure. The morphology data set comprised four lake morphometric variables.

Columns: 4R0 (A, B, C); 4R3 (A, B, C); 4R8 (A, B, C); Morphology
Kaiser-Guttman 2 2 2 1 1 1 1 1 1 1
Bootstrap Kaiser-Guttman 1 1 1 1 1 1 1 1 1 1
Scree plot 0 2 2 2 2 2
Broken-stick 0 0 0 0 1 1 1 1 1 1
95% variance 4 4 4 4 4 4 3 2 3 3
Sphericity test 0 0 0 1 1 1 1 1 1 3
Bartlett's first eigenvalue 0 0 0 1 1 1 1 1 1 1
Lawley's <2 <2 <2 <2 <2 <2 <2 <2 <2 2+
Bootstrap eigenvalue 0 0 0 1 1 1 1 1 1 2
Bootstrap eigenvector 0 0 0 1 1 1 1 1 1 3
Known 0 0 0 1 1 1 1 1 1
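
Three of the heuristic rules tabulated above (Kaiser-Guttman, broken-stick, and the 95% variance criterion) reduce to simple arithmetic on a list of eigenvalues sorted in decreasing order. The Python sketch below is only an illustration: the function names and both sets of eigenvalues are hypothetical, and stopping the broken-stick count at the first eigenvalue that fails to exceed its expected value b_k is an assumed reading of the rule rather than a procedure stated in the text.

```python
def kaiser_guttman(eigenvalues):
    """Count eigenvalues above the mean eigenvalue (1.0 for a correlation matrix)."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for ev in eigenvalues if ev > mean)


def broken_stick(p):
    """Expected eigenvalues b_k = sum_{i=k}^{p} 1/i for k = 1..p."""
    return [sum(1.0 / i for i in range(k, p + 1)) for k in range(1, p + 1)]


def broken_stick_count(eigenvalues):
    """Count leading eigenvalues that exceed the broken-stick expectation.
    Assumption: counting stops at the first eigenvalue that does not exceed b_k."""
    count = 0
    for observed, b_k in zip(eigenvalues, broken_stick(len(eigenvalues))):
        if observed <= b_k:
            break
        count += 1
    return count


def variance_count(eigenvalues, proportion=0.95):
    """Smallest number of leading components reaching the given share of variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        # Small tolerance so rounding error does not add a spurious component.
        if running >= proportion * total - 1e-12:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues from 4-variable correlation matrices (each sums to 4).
strong = [3.40, 0.25, 0.20, 0.15]   # one dominant component
weak = [1.30, 1.05, 0.90, 0.75]     # no clear structure
for name, ev in (("strong", strong), ("weak", weak)):
    print(name, kaiser_guttman(ev), broken_stick_count(ev), variance_count(ev))
# prints "strong 1 1 3" and then "weak 2 0 4"
```

With these made-up values the weakly structured case shows the same qualitative pattern as the 4R0 columns of Table 1: the Kaiser-Guttman rule keeps two components, the broken-stick model keeps none, and the 95% criterion keeps all four.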

TABLE 2. Number of nontrivial components indicated by various methods. Data matrices having 12 variables and 40 observations with uniform correlations as follows: 12R0 has uniform correlation structure of r = 0, 12R3 has uniform correlation structure of r = 0.3, and 12R8 has uniform correlation structure of r = 0.8. The letters A-C indicate replicates drawn from a simulated population having that correlation structure. The chemistry data set comprised 12 lake water chemistry variables.

Columns: 12R0 (A, B, C); 12R3 (A, B, C); 12R8 (A, B, C); Chemistry
Kaiser-Guttman 5 6 6 3 5 3 1 1 1 3
Scree plot 4 6 7 2 2 3 2 2 2 4
Broken-stick 0 0 0 1 1 1 1 1 1 3
95% variance 11 10 11 10 11 10 6 7 7 8
Sphericity test 0 0 0 1 1 1 1 1 1 3
Lawley's <2 <2 <2 <2 <2 2+ 2+ 2+ 2+ 2+
Known 0 0 0 1 1 1 1 1 1
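
Bartlett's test of correlation-matrix homogeneity depends only on the determinant of R, so it can be sketched without specialised software. The Python example below uses the statistic χ² = -[n - (2p + 11)/6] ln|R| with p(p - 1)/2 degrees of freedom, as given in the Methods. The function names are hypothetical, the log-determinant is obtained from a hand-written Cholesky factorisation (an implementation choice made here, not something taken from the paper), and the statistic is evaluated on population matrices of uniform correlation with n = 40 purely for illustration. Comparing the statistic with a χ² critical value is left to the reader.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    p = len(matrix)
    lower = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(p))


def bartlett_homogeneity(corr, n):
    """Return (chi-square statistic, degrees of freedom) for H0: R is an identity matrix."""
    p = len(corr)
    # |R| <= 1 for a correlation matrix, so -ln|R| >= 0; max() guards against
    # a tiny negative value (or -0.0) produced by rounding.
    statistic = (n - (2 * p + 11) / 6.0) * max(0.0, -log_det_spd(corr))
    dof = p * (p - 1) // 2
    return statistic, dof


def uniform_corr(p, r):
    """Correlation matrix with every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(p)] for i in range(p)]


for r in (0.0, 0.3, 0.8):
    stat, dof = bartlett_homogeneity(uniform_corr(4, r), n=40)
    print(f"r = {r}: chi-square = {stat:.2f}, df = {dof}")
```

The statistic is zero when all correlations are zero and grows as the correlations strengthen, which is why the test can only say whether at least one component is worth interpreting.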
With the 12R0 analyses, the scree indicated 4-7 components should be interpretable. The number of components dropped to 2-3 when 12R3 matrices were used and 2 components would be retained with 12R8 matrices (Fig. 3). In the 32R0 analyses, the scree plot results suggested from 5 to 15 components should be interpreted, 2-6 components with 32R3, and 2 components with 32R8 analyses.

For the structured matrices, the scree plot suggested that there were four nontrivial components in the S-I, S-III, and S-IV matrices. For PCAs based on S-II, scree results indicated that between two and four components would be considered interpretable. As with the simulated 4-variable matrices, no estimate of the number of dimensions for the lake morphometry data could be made because no obvious trend was apparent. With the water chemistry data, there were three interpretable components, but a second break is evident (Fig. 3). If this latter point was considered, then a total of five components were nontrivial.

FIG. 3. Eigenvalues from principal components analyses (PCA) from data sets that contained uncorrelated, weakly correlated, or strongly correlated variables. Each matrix comprised 12 variables and 40 observations. Eigenvalues from a PCA of 12 lake water chemistry variables are plotted also. (Horizontal axis: Component Number, 1-12; plotted series: r = 0, r = 0.3, r = 0.8, and Chemistry.)

4. Broken-stick model. The broken-stick method correctly identified the dimensionality of all uniform-correlation matrices (a single exception being one of the replicates from 4R3). For the R0 matrices, the method indicated that the underlying dimensionality was 0, and one component as nontrivial with R3 or R8 matrices. This method revealed three interpretable components for matrices from S-I and S-IV, a single component from S-II, and 2-3 components from S-III. A single component would be retained from the lake morphometry data, three from the water chemistry data, and two components from the benthic invertebrate data. (The application of the broken-stick model to the eigenvalues presented in Rexstad et al.'s [1988] criticism of PCA showed no nontrivial components in contrast to 7 of the 15 being considered useful from the Kaiser-Guttman method.)

5. 95% of the total variance. The approach of retaining components until 95% of the total variance was achieved would result in all components being interpreted for the 4R0 or 4R3 analyses, and 2-3 components for the 4R8 analyses. Application of the method with the 12-variable PCAs led to 10-11 eigenvalues being considered important for 12R0 and 12R3 matrices, and 6-7 eigenvalues for the 12R8 matrices. A total of 22 components would be retained from the 32R0 analyses, 21 from the 32R3, and 14 components from the 32R8 PCAs. The fixed-variance approach also led to more components being retained with the S-I to S-IV matrices than any other method. In all cases, the method would lead to 7-11 components being considered important and interpreted. With the lake morphometry data, this method identified three components as nontrivial and eight components from the water chemistry data. This procedure retained three-quarters of the components from each PCA. Nineteen of a possible 32 components would be kept from the benthic invertebrate PCA.

TABLE 3. Number of nontrivial components indicated by various methods. Data matrices having 32 variables and 40 observations with uniform correlations as follows: 32R0 has uniform correlation structure of r = 0, 32R3 has uniform correlation structure of r = 0.3, and 32R8 has uniform correlation structure of r = 0.8. The letters A-C indicate replicates drawn from a simulated population having that correlation structure. The benthic invertebrate data set comprised abundances for 32 lake benthic invertebrate taxa.

Columns: 32R0 (A, B, C); 32R3 (A, B, C); 32R8 (A, B, C); Benthic invertebrates
Kaiser-Guttman 12 14 13 10 10 11 1 1 1 9
Scree plot 9 15 5 3 2 6 2 2 2 4
Broken-stick 0 0 0 1 1 1 1 1 1 2
95% variance 22 22 22 21 21 21 14 14 14 19
Sphericity test 0 0 0 1 1 11 1 2 6 19
Lawley's <2 <2 <2 2+ 2+ 2+ 2+ 2+ 2+ 2+
Known 0 0 0 1 1 1 1 1 1

6. Bartlett's test of sphericity. Bartlett's test correctly identified the dimensionality of the PCAs based on four variables (i.e., 0 for R0, 1 for R3 and R8). The method correctly estimated the dimensionality of the 12-variable matrices, but was more erratic with the 32-variable matrices. The method correctly indicated no significant eigenvalues with 32R0. However, one of the 32R3 matrices led to 11 eigenvalues being identified as significant and two PCAs from 32R8 matrices led to two and six components being retained. The test indicated three significant components for S-I and S-III matrices, two components for the S-II analyses, and 3-4 for those from S-IV although all these matrices were constructed with three-dimensional structure. Bartlett's test of sphericity yielded 3 significant eigenvalues from the 4-variable lake morphometry data and the 12-variable water chemistry, plus 19 significant eigenvalues from the PCA of lake benthic invertebrate data.

7. Bartlett's test of the equality of λ₁. Bartlett's method testing whether the first λ from a correlation matrix is equal to all others led to correct identification of the minimum number of interpretable dimensions in all 4R data sets. With the 12- and 32-variable data sets, the test erroneously indicated λ₁ in one 12R0 PCA and all of the 32R0 PCAs as being significantly different from the remaining eigenvalues. With the patterned matrices (i.e., S-I to S-IV) the test indicated at least one significant eigenvalue in each PCA. Bartlett's test indicated that there was at least one significant component in each of the lake morphometry, chemistry, and invertebrate PCAs.

8. Lawley's test of λ₂. Lawley's test for correlation-based PCAs correctly identified a maximum of one underlying dimension with the four-variable solutions. With 12R0 matrices, the test indicated <2 significant λs for the PCAs. However, with 12R3 matrices, one result suggested a minimum of two or more significant eigenvalues, and all 12R8 matrices had two or more significant eigenvalues. Results based on 32R0 matrices led to <2 eigenvalues being considered interpretable, whereas all 32R3 and 32R8 analyses indicated ≥2 significant λs. Lawley's method indicated that each PCA from S-I to S-IV contained a minimum of two significantly different eigenvalues.
TABLE 4. Number of nontrivial components indicated by various methods. Patterned data matrices having 12 variables and 40 observations. The intervariable correlations were generated following Fig. 1. Letters A-C represent replicate samples drawn from each simulated population.

Columns: S-I (A, B, C); S-II (A, B, C); S-III (A, B, C); S-IV (A, B, C)
Kaiser-Guttman 3 3 3 5 5 5 3 3 3 3 3 3
Bootstrap Kaiser-Guttman 3 3 3 4 3 3 2 2 2 3 3 3
Scree plot 4 4 4 3 4 2 4 4 4 4 4 4
Broken-stick 3 3 3 1 1 1 3 2 2 3 3 3
95% variance 7 7 8 10 10 10 7 7 8 8 7 8
Sphericity test 3 3 3 2 2 2 3 3 3 3 3 4
Bartlett's first eigenvalue 1 1 1 1 1 1 1 1 1 1 1 1
Lawley's 2+ 2+ 2+ 2+ 2+ 2+ 2+ 2+ 2+ 2+ 2+ 2+
Bootstrap eigenvalue 3 3 3 0 0 0 3 3 1 3 3 2
Bootstrap eigenvector 3 3 3 0 2 1 0 3 3 3 3 3
Known 3 3 3 3 3 3 3 3 3 3 3 3
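
The patterned matrices behind Table 4 and the bootstrapped eigenvalue rule can be sketched in a few lines of Python, assuming NumPy is available. The sketch builds a block correlation matrix of the kind shown in Fig. 1, draws 40 observations, bootstraps the eigenvalues 100 times, and looks for successive 95% percentile intervals that do not overlap. Several choices here are assumptions rather than details taken from the paper: the function names, the seeds, drawing the sample directly from a multivariate normal distribution instead of from a finite population of 1000 observations, the use of percentile intervals, and reporting the first non-overlapping pair as the break-point.

```python
import numpy as np


def block_correlation(group_sizes, within, between):
    """Population correlation matrix with equal within- and between-group r."""
    p = sum(group_sizes)
    corr = np.full((p, p), float(between))
    start = 0
    for size in group_sizes:
        corr[start:start + size, start:start + size] = within
        start += size
    np.fill_diagonal(corr, 1.0)
    return corr


def bootstrap_eigenvalues(data, n_boot=100, seed=0):
    """Eigenvalues (largest first) of the correlation matrix of each bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    out = []
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]  # resample rows with replacement
        corr = np.corrcoef(sample, rowvar=False)
        out.append(np.sort(np.linalg.eigvalsh(corr))[::-1])
    return np.array(out)


def first_gap(boot, level=95.0):
    """Number of leading eigenvalues before the first pair of successive
    percentile intervals that do not overlap (0 if every pair overlaps)."""
    tail = (100.0 - level) / 2.0
    lower = np.percentile(boot, tail, axis=0)
    upper = np.percentile(boot, 100.0 - tail, axis=0)
    for k in range(boot.shape[1] - 1):
        if lower[k] > upper[k + 1]:
            return k + 1
    return 0


s1 = block_correlation((4, 4, 4), within=0.8, between=0.0)  # S-I layout
s4 = block_correlation((5, 4, 3), within=0.8, between=0.0)  # S-IV layout
rng = np.random.default_rng(1)
sample = rng.multivariate_normal(np.zeros(12), s1, size=40)
boot = bootstrap_eigenvalues(sample, n_boot=100, seed=2)
print(first_gap(boot))
```

For a strongly structured matrix such as S-I, the gap between the third and fourth eigenvalues is wide, but an earlier pair of intervals may also fail to overlap in a particular sample, so the reported count can fall below three. This is consistent with the replicate-to-replicate variation of the bootstrap eigenvalue row in Table 4.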

PCA from S-I to S-IV contained a minimum of two significantly different eigenvalues. Although neither Lawley's nor Bartlett's test is capable of accurately assessing the true dimensionality of these matrices (i.e., three dimensions), they can establish that a minimal number of components is interpretable (i.e., either one or two components, respectively). Lawley's test indicates that a minimum of two components should be considered interpretable in each of the field data sets.

9. Bootstrapped eigenvalue-eigenvector.-When the bootstrap approach was used to determine overlap between eigenvalues, the number of components retained was lower. For the 4R, 12R, and 32R matrices, the method correctly identified that no components were interpretable with R0 and only one component for PCAs based on R3 and R8 matrices. Identical results occurred with the estimations based on the bootstrapped eigenvector coefficients. When the bootstrap was used to distinguish between overlapping eigenvalues with the patterned matrices, results suggested that three components be retained for S-I, no components from S-II, 1-3 components from S-III, and 2-3 components from S-IV. Bootstrapped eigenvector coefficients indicated 3 interpretable components for S-I and S-IV, 0-2 components for S-II, and 0-3 components for S-III. These approaches suggested two and three dimensions, respectively, for the PCA of lake morphometric data. For the water chemistry matrix, the bootstrapped eigenvector coefficients indicated that two components were interpretable, although the bootstrapped-eigenvalue version indicated only a single nontrivial component. With the benthic invertebrate data, only a single component would be retained based on either approach.

The multivariate summary using ordination analysis provided an overall assessment of the similarity amongst the methods. Results from the 4-variable analyses were not included because some of the eigenvalues could not be interpreted. Results from Bartlett's test of λ1 and Lawley's test of λ2 were not included because they are limited to evaluating only one or two eigenvalues, respectively. The first PCoA axis summarized most of the variation among the methods (Fig. 4). The 95% approach differed most from the other methods. The Kaiser-Guttman, bootstrapped Kaiser-Guttman, and scree plot provided relatively similar results, but differed from the true number of dimensions in each data set. The bootstrapped eigenvalue, bootstrapped eigenvector, and broken-stick methods all led to similar numbers of dimensions for each data set, and each was very similar to the number of underlying simulated dimensions.

FIG. 4. Principal coordinates analysis based on a Euclidean distance matrix. Data in the analysis are based on the number of nontrivial components resulting from each method with each data set. Codes are: KG is the Kaiser-Guttman; BKG is the bootstrapped Kaiser-Guttman; Scree is the scree plot; BStick is the broken-stick model; 95% is the 95% of the total variance criterion; Sphere is Bartlett's test of sphericity; BVal is the bootstrapped eigenvalue method; BVec is the bootstrapped eigenvector coefficient method; and Known is the true number of dimensions incorporated into the data sets during simulation. [Plot not reproduced; horizontal axis: Axis 1 (84.4%).]

DISCUSSION

The standard Kaiser-Guttman approach of interpreting λ > 1.0 led to the retention of too many components except with matrices having strong correlation structure. Although such correlation structures often are found in morphometric analyses, ecologists often analyze matrices of more limited structure (e.g., weakly correlated data). Jolliffe (1972, 1986) suggested that the choice of λ > 1.0 was too conservative and that components with λ > 0.7 should be considered useful. Clearly, this level of selection would only exacerbate a bad situation. Given that the Kaiser-Guttman approach is frequently used by ecologists, it would be wise to use an alternative method to choose nontrivial components. The use of eigenvalues in a modified Kaiser-Guttman approach does not change the results substantially. The bootstrapped Kaiser-Guttman did reduce the number of nontrivial components, but the number still exceeded the number of underlying dimensions except for the R8 matrices.

The scree plot (as applied by Cattell and Vogelmann [1977]) provided poor resolution of the underlying dimensionality. The method invariably overestimated the number of interpretable components. If Cattell's (1966) original criterion was used, the method was more conservative. This means including the eigenvalues up to, but not including, the first eigenvalue on the straight-line portion of the plot (Fig. 2). However, this modification still overestimates the number of interpretable components in analyses of matrices of weakly correlated data. Surprisingly, the original approach (Cattell 1966) provided a better estimate of the correct number of dimensions with the simulated data matrices.

The broken-stick method correctly assessed the dimensionality of the data matrices. It did underestimate the number of interpretable components for S-II matrices. This method provided a good combination of
simplicity of calculation and accurate evaluation of dimensionality relative to the other statistical approaches.

The 95%-variance-threshold method provided unsatisfactory results. Although the choice at 95% of the total variance is relatively high, any level is arbitrary. No matter what cumulative percentage level is selected, this approach does not appear promising because there is a high risk that many of the retained components will summarize noise, or that nontrivial components will not be included.

Bartlett's test of sphericity correctly identified the dimensionality in many of the data sets, but in some cases indicated up to 11 significant eigenvalues although only a single dimension was simulated (Table 3). Despite the statement by Kendall (1980) that this test is overly conservative when applied to correlation matrices, it appears to correctly identify the number of dimensions with many data sets, but it is too liberal a test with matrices having a low observation-to-variable ratio (e.g., less than the 3:1 ratio advocated by Grossman et al. 1991). With the ecological data, the test also retained large numbers of the components, i.e., 19 of 32 components were considered significant with the benthic invertebrate data.

Bartlett's approach to test for homogeneity of the correlation matrix (i.e., whether the first λ equalled all others) appeared to identify the correct minimal dimensionality except with the 32-variable analyses. Here the method indicated significant structure with random, uncorrelated data. Likewise, Lawley's test consistently overestimated the dimensionality of the 12- and 32-variable matrices having uniform correlations. Because the test is designed to evaluate only whether λ2 is the same as successive eigenvalues, the method is rather limited. As a result of its limited utility and relatively poor performance in this set of comparisons, the method is not recommended.

The combination of testing for overlap in ranges of bootstrapped eigenvalues and for eigenvector coefficients differing from 0 appears more promising. With simple matrices either lacking structure or having a single dimension, both approaches consistently revealed the underlying dimensionality of the simulation. However, with patterned matrices, the results were less reliable for either approach individually. Both methods worked well with S-I and S-IV matrices having strong inter-variable correlations. However, in S-II, where there were three underlying but weak dimensions, there were no differences between the bootstrapped eigenvalues. The eigenvector coefficient approach produced inconsistent results and frequently underestimated the correct dimensionality. With matrices from S-III, both methods correctly identified two dimensions, but from different replicated matrices.

The poor showing of the eigenvector approach for S-II and S-III is easily explained and similar to situations discussed elsewhere (Oksanen 1988). In both simulations, three dimensions were created by having three sets of four variables, each set having identical correlations. Due to this condition and chance selection of observations in the bootstrap, any specific dimension could be expressed on the first component of one PCA, but on the second or third component from another PCA. This is similar to the re-ordering of components or solution instability found by Oksanen (1988) with detrended correspondence analysis. When the order of expression of the underlying dimensions varies between components for different analyses, the eigenvector coefficient approach will fail. With these same data characteristics, the first three eigenvalues also overlap in their 95% confidence limits, but are significantly different from the fourth eigenvalue. The problem with the eigenvector coefficients is particularly evident when the initial correlation structure is weak (e.g., S-II). However, if the dimensions differ in (1) the strength of correlation structure (i.e., several high correlations vs. low or medium correlations) or (2) the number of constituent variables, or if (3) they have strong correlations, this method provides more accurate results. Overall, it appears that the combination of these two approaches, i.e., the bootstrapped eigenvalue and eigenvector coefficients, provides a better measure of the dimensionality than either approach alone. The maximum value obtained with either approach was close to the true dimensionality, except with the S-II data.

An additional consideration of the bootstrapped eigenvector method is that it assists with the evaluation of whether or not each variable contributes to a given component. If a specific variable is not significantly weighted on any nontrivial component, then that variable could be removed from the analysis. For example, in the PCA of the lake morphology data, three variables had eigenvector coefficients that differed from zero. However, lake volume coefficients included 0 in the 95% confidence limits on each component. Therefore, lake volume did not contribute to the analysis and added little information to the PCA.

With the use of any of the methods employing formal statistical tests, e.g., Bartlett's test of sphericity, it is important to recognize the increased probability of rejecting the null hypothesis when many components are evaluated sequentially. When such tests are used, researchers may remove this increased risk of a Type I error by employing some form of correction for multiple comparisons such as Bonferroni's adjustment.

The most promising approaches to component evaluation are the broken-stick model and the bootstrapped eigenvalue-eigenvector method. The broken-stick approach has the advantage of being simple to calculate. Within the scope of this study, both methods led to similar conclusions about the dimensionality of the simulated data sets. The matrices simulated in this study all represented relatively well-conditioned data (e.g., from a normal distribution, independent sampling). However, many data used in ecological studies
do not meet formal assumptions of classical statistical approaches. The extension of this comparison to simulated data varying in departure from the statistical "ideal" would be of considerable value (e.g., Davis 1977). Approaches such as the bootstrapped eigenvalue-eigenvector method would likely prove more useful with such data conditions than the relatively sensitive methods based on idealized distributions and formal tests (e.g., both of Bartlett's and Lawley's methods).

ACKNOWLEDGMENTS

This study was greatly assisted by the critical comments of K. P. Burnham, M. Dennison, R. H. Green, H. H. Harvey, K. M. Somers, and D. F. Stauffer. Funding was provided by a Natural Sciences and Engineering Research Council (NSERC) Graduate Scholarship and Ontario Graduate Scholarship to D. A. Jackson, an NSERC Operating grant to H. H. Harvey, and Ontario Ministry of Environment and Ontario Renewable Resources Research Grants to H. H. Harvey and D. A. Jackson.

LITERATURE CITED

Allen, S. J., and R. Hubbard. 1986. Regression equations for the latent roots of random data correlation matrices with unities on the diagonal. Multivariate Behavioral Research 21:393-398.
Bartlett, M. S. 1950. Tests of significance in factor analysis. British Journal of Psychology (Statistical Section) 3:77-85.
Bartlett, M. S. 1954. A note on the multiplying factors for various χ² approximation. Journal of the Royal Statistical Society, Series B 16:296-298.
Box, G. E. P. 1949. A general distribution theory for a class of likelihood criteria. Biometrika 36:317-346.
Cattell, R. B. 1966. The scree test for the number of factors. Journal of Multivariate Behavioral Research 1:245-276.
Cattell, R. B., and S. Vogelmann. 1977. A comprehensive trial of the scree and KG criteria for determining the number of factors. Multivariate Behavioral Research 12:289-325.
Cliff, N. 1988. The eigenvalues-greater-than-one rule and the reliability of components. Psychological Bulletin 103:276-279.
Cooley, W. W., and P. R. Lohnes. 1971. Multivariate data analysis. John Wiley & Sons, New York, New York, USA.
Davis, A. W. 1977. Asymptotic theory for principal components analysis: non-normal case. Australian Journal of Statistics 19:206-212.
Digby, P. G. N., and R. A. Kempton. 1987. Multivariate analysis of ecological communities. Chapman and Hall, New York, New York, USA.
Dillon, W. R., and M. Goldstein. 1984. Multivariate analysis: methods and applications. John Wiley & Sons, New York, New York, USA.
Dudziński, M. L., J. T. Chmura, and C. B. H. Edwards. 1975. Repeatability of principal components in samples: normal and non-normal data sets compared. Multivariate Behavioral Research 10:109-118.
Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1-26.
Frontier, S. 1976. Étude de la décroissance des valeurs propres dans une analyse en composantes principales: comparaison avec le modèle du bâton brisé. Journal of Experimental Marine Biology and Ecology 25:67-75.
Gauch, H. G., Jr. 1982a. Multivariate analysis in community ecology. Cambridge University Press, New York, New York, USA.
Gauch, H. G., Jr. 1982b. Noise reduction by eigenvector ordination. Ecology 63:1643-1649.
Grossman, G. D., D. M. Nickerson, and M. C. Freeman. 1991. Principal component analyses of assemblage structure data: utility of tests based on eigenvalues. Ecology 72:341-347.
Guttman, L. 1954. Some necessary conditions for common factor analysis. Psychometrika 19:149-161.
Horn, J. L. 1965. A rationale and test for the number of factors in factor analysis. Psychometrika 30:179-185.
Horn, J. L., and R. Engstrom. 1979. Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research 14:283-300.
Jackson, D. A. 1992. Fish and benthic invertebrates: analytical approaches and community-environment relationships. Dissertation. University of Toronto, Toronto, Ontario, Canada.
Jackson, D. A., and K. M. Somers. 1991. Putting things in order: the ups and downs of detrended correspondence analysis. American Naturalist 137:704-712.
Jackson, D. A., K. M. Somers, and H. H. Harvey. 1992. Null models and fish communities: evidence of nonrandom patterns. American Naturalist 139:930-951.
Jackson, J. E. 1991. A user's guide to principal components. John Wiley & Sons, New York, New York, USA.
Jolliffe, I. T. 1972. Discarding variables in a principal components analysis. I. Artificial data. Applied Statistics 23:160-173.
Jolliffe, I. T. 1986. Principal components analysis. Springer-Verlag, New York, New York, USA.
Karr, J. R., and T. E. Martin. 1981. Random numbers and principal components: further searches for the unicorn. Pages 20-24 in D. E. Capen, editor. The use of multivariate statistics in studies of wildlife habitat. United States Forest Service General Technical Report RM-87.
Kendall, M. 1980. Multivariate analysis. Second edition. Charles Griffin, London, England.
Krzanowski, W. J. 1983. Cross-validatory choice in principal components analysis: some sampling results. Journal of Statistical Computation and Simulation 18:299-314.
Krzanowski, W. J. 1988. Principles of multivariate analysis: a user's perspective. Oxford University Press, London, England.
Lambert, Z. V., A. R. Wildt, and R. M. Durand. 1990. Assessing sampling variation relative to number-of-factors criteria. Educational and Psychological Measurement 50:33-49.
Lautenschlager, G. J. 1989. A comparison of alternatives to conducting Monte Carlo analyses for determining parallel analysis criteria. Multivariate Behavioral Research 24:365-395.
Lawley, D. N. 1956. Tests of significance for the latent roots of covariance and correlation matrices. Biometrika 43:128-136.
Lawley, D. N. 1963. On testing a set of correlation coefficients for equality. Annals of Mathematical Statistics 34:149-151.
Legendre, L., and P. Legendre. 1983. Numerical ecology. Elsevier, Amsterdam, The Netherlands.
Morrison, D. F. 1990. Multivariate statistical methods. Third edition. McGraw-Hill, New York, New York, USA.
Oksanen, J. 1988. A note on the occasional instability of detrending in correspondence analysis. Vegetatio 74:29-32.
Orlóci, L. 1978. Multivariate analysis in vegetation research. Second edition. Dr. W. Junk, The Hague, The Netherlands.
Pielou, E. C. 1984. The interpretation of ecological data. John Wiley & Sons, New York, New York, USA.
Pimentel, R. A. 1979. Morphometrics: the multivariate analysis of biological data. Kendall-Hunt, Dubuque, Iowa, USA.
Rexstad, E. A., D. D. Miller, C. H. Flather, E. M. Anderson, J. W. Hupp, and D. R. Anderson. 1988. Questionable multivariate statistical inference in wildlife habitat and community studies. Journal of Wildlife Management 52:794-798.
Rexstad, E. A., D. D. Miller, C. H. Flather, E. M. Anderson, J. W. Hupp, and D. R. Anderson. 1990. Questionable multivariate statistical inference in wildlife habitat and community studies: a reply. Journal of Wildlife Management 54:189-193.
SAS. 1989. SAS/STAT user's guide. Version 6. SAS Institute, Cary, North Carolina, USA.
Stauffer, D. F., E. O. Gordon, and R. K. Steinhorst. 1985. A comparison of principal components from real and random data. Ecology 66:1693-1698.
Taylor, J. 1990. Questionable multivariate statistical inference in wildlife habitat and community studies: a comment. Journal of Wildlife Management 54:186-189.
Zebra, K. E., and J. P. Collins. 1992. Spatial heterogeneity and individual variation in diet of an aquatic predator. Ecology 73:268-279.
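
The bootstrapped-eigenvalue comparison favored in the Discussion can be illustrated with a short Python sketch. This is a hedged reconstruction rather than the author's procedure. It assumes observations are resampled with replacement, that eigenvalues come from the correlation matrix of each resample, that percentile limits stand in for the 95% confidence limits, and that a component counts as distinct when its interval lies wholly above the interval of the next component. All function names and defaults are hypothetical.

```python
import numpy as np


def bootstrap_eigenvalues(data, n_boot=1000, seed=0):
    """Eigenvalues of the correlation matrix for each bootstrap resample.

    Rows of `data` are observations and columns are variables. Returns an
    array of shape (n_boot, p) with eigenvalues in descending order.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    result = np.empty((n_boot, p))
    for b in range(n_boot):
        # Resample observations (rows) with replacement.
        sample = data[rng.integers(0, n, size=n)]
        corr = np.corrcoef(sample, rowvar=False)
        # eigvalsh returns ascending eigenvalues; reverse to descending.
        result[b] = np.linalg.eigvalsh(corr)[::-1]
    return result


def percentile_limits(boot, level=0.95):
    """Percentile confidence limits for each ordered eigenvalue."""
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot, alpha, axis=0)
    upper = np.quantile(boot, 1.0 - alpha, axis=0)
    return lower, upper


def count_separated(lower, upper):
    """Leading components whose interval lies above the next interval."""
    count = 0
    for i in range(len(lower) - 1):
        if lower[i] > upper[i + 1]:
            count += 1
        else:
            break
    return count


# Hypothetical data: four variables driven by one shared factor.
rng = np.random.default_rng(1)
factor = rng.normal(size=(200, 1))
example = factor + 0.3 * rng.normal(size=(200, 4))
low, high = percentile_limits(bootstrap_eigenvalues(example, n_boot=200))
print(count_separated(low, high))
```

Because the eigenvalues are re-sorted within each resample, the sketch shares the weakness described in the Discussion: when underlying dimensions are equally strong, their order of expression can change between resamples and the intervals overlap.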

