© 2012 by G. David Garson and Statistical Associates Publishing. All rights reserved
worldwide in all media. No permission is granted to any user to copy or post this work in
any format or any media.
The author and publisher of this eBook and accompanying materials make no
representations or warranties with respect to the accuracy, applicability, fitness, or
completeness of the contents of this eBook or accompanying materials. The author and
publisher disclaim any warranties (express or implied), merchantability, or fitness for any
particular purpose. The author and publisher shall in no event be held liable to any party for
any direct, indirect, punitive, special, incidental or other consequential damages arising
directly or indirectly from any use of this material, which is provided “as is”, and without
warranties. Further, the author and publisher do not warrant the performance,
effectiveness or applicability of any sites listed or linked to in this eBook or accompanying
materials. All links are for information purposes only and are not warranted for content,
accuracy or any other implied or explicit purpose. This eBook and accompanying materials
are copyright © G. David Garson and Statistical Associates Publishing. No part of this may
be copied, or changed in any format, sold, or used in any way under any circumstances
other than reading by the downloading individual.
Contact:
Email: gdavidgarson@gmail.com
Web: www.statisticalassociates.com
Table of Contents
Overview
Key Concepts and Terms
Correspondence analysis
Correspondence table
Points
Point distance
Correspondence map
The SPSS correspondence analysis interface
The main correspondence analysis dialog
The model dialog
Dimensions in the solution
Distance measure
Standardization method
Normalization method
The statistics dialog
The plots dialog
SPSS correspondence analysis output
Example
The summary of dimensions table
The correspondence table
The perceptual map
Row points and column points scatterplots
Row profiles and column profiles tables
Contribution tables
Row and column confidence points tables
Line Plots
The permuted correspondence table
Assumptions
Data level and distribution
Data do not need to be detrended
Correlated variables which meet assumptions
Correspondence Analysis
Overview
Correspondence analysis is useful when the research focus is on mapping values
(levels) of categorical variables. It is a method of factoring categorical variables
and displaying them in a property space which maps their association in two or
more dimensions. Correspondence analysis is a special case of canonical
correlation, where one set of entities (category levels rather than variables as in
conventional canonical correlation) is related to another set.
Correspondence analysis is often used where a tabular approach is less effective
due to large tables with many rows and/or columns, and/or due to categories
being nominal, with no particular order. Correspondence analysis has been
popular in marketing research, used, for instance, to display customer color, size,
and taste preferences in relation to preferences for Brands A, B, and C.
Key Concepts and Terms
Correspondence table
The correspondence table is the raw crosstabulation of two discrete variables,
with marginals. The object of correspondence analysis is to explain the inertia
(variance) in this table. In essence, the correspondence map is a graphical tool
which helps the researcher easily notice relationships within this table. When
interpreting a correspondence map it is often helpful to refer back to the original
correspondence table.
Points
Also known as "profile points," a point is one of the values (category levels) of one
of the discrete variables in the analysis. For instance, "male" would be a point for
the variable "gender."
Point distance
For distance measurement, SPSS correspondence analysis by default uses
chi-square distance rather than Euclidean distance between points. SPSS
supports Euclidean distance as an alternative. This is discussed further
below. The point distance matrix is the input to principal components analysis,
yielding the dimensions (factors) which correspondence analysis uses to map
points.
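To make the distinction between the two distance definitions concrete, here is a minimal Python/NumPy sketch using an invented 3x3 table of counts (not data from the text). Chi-square distance weights squared profile differences by the inverse column masses; Euclidean distance omits the weighting.

```python
import numpy as np

# Hypothetical 3x3 correspondence table of raw counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

P = N / N.sum()                        # correspondence matrix (proportions)
row_mass = P.sum(axis=1)               # row marginal proportions
col_mass = P.sum(axis=0)               # column marginal proportions
row_profiles = P / row_mass[:, None]   # each row profile sums to 1

def chi2_distance(i, j):
    """Chi-square distance between row profiles i and j: squared
    profile differences weighted by the inverse column masses."""
    diff = row_profiles[i] - row_profiles[j]
    return float(np.sqrt(np.sum(diff ** 2 / col_mass)))

def euclidean_distance(i, j):
    """Plain (unweighted) Euclidean distance between the same profiles."""
    return float(np.linalg.norm(row_profiles[i] - row_profiles[j]))
```

Because every column mass is below 1, the chi-square distance for a pair of differing profiles is always larger than the Euclidean distance for the same pair.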
Correspondence map
The correspondence map, also called the perceptual map, is the central output of
correspondence analysis. A correspondence map displays two of the dimensions
which emerge from principal components analysis of point distances, and points
are displayed in relation to these dimensions. For instance, a correspondence
analysis may seek to relate political outlook (conservative, liberal, etc.) with
region (South, West, etc.), and the correspondence map might show the South is
close to conservative, whereas the West is closer to liberal. This is illustrated
below using 1993 U. S. General Social Survey data.
Distance measure
If Euclidean distance is selected as the alternative to the default chi-square
distance, it is computed as the square root of the sum of squared differences
between pairs of rows and pairs of columns.
Standardization method
The default and “standard” method centers both rows and columns (“Row and
column means are removed”). This method is required for standard
correspondence analysis and is the only available option if the default chi-square
distance is selected. If Euclidean distance is selected, three other
standardization methods are available: (1) only columns are centered (“Column
means are removed”); (2) only rows are centered, after row marginals are
equalized (“Row totals are equalized and means are removed”); and (3) only
columns are centered, after column marginals are equalized (“Column totals
are equalized and means are removed”).
Normalization method
SPSS supports five normalization options:
1. Symmetrical. This is the SPSS default and is recommended when the
research purpose is to explore relationships among the category levels of
the two variables. Row and column scores reflect weighted averages.
2. Principal. This method is recommended when the research purpose is to
explore relationships of category levels within either or both variables. Row
and column scores reflect distances in the correspondence table.
3. Row principal. This method is recommended for comparing among
categories of the row variable. Distances between row points reflect row
distances in the correspondence table.
4. Column principal. The same but for the column variable.
5. Custom. This option allows the researcher to specify a value between –1
and 1, where –1 is equivalent to the column principal option, 0 is equivalent
to the symmetrical option, and +1 is equivalent to the row principal option.
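The options above can be sketched as different scalings of a single singular value decomposition of the standardized residuals of the correspondence table. In the NumPy sketch below (invented counts, not data from the text), the exponent pairs 0.5/0.5, 1/0, 0/1, and 1/1 correspond to symmetrical, row principal, column principal, and principal normalization respectively.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
Dr = np.diag(1 / np.sqrt(r))
Dc = np.diag(1 / np.sqrt(c))

# Standardized residuals; the squared singular values are the inertia
# coefficients (eigenvalues) of the solution.
S = Dr @ (P - np.outer(r, c)) @ Dc
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

def scores(alpha_row, alpha_col):
    """Scale the standard coordinates by powers of the singular values."""
    rows = (Dr @ U) * sv ** alpha_row
    cols = (Dc @ Vt.T) * sv ** alpha_col
    return rows, cols

sym_rows, sym_cols = scores(0.5, 0.5)  # symmetrical (SPSS default)
rp_rows, rp_cols = scores(1.0, 0.0)    # row principal
cp_rows, cp_cols = scores(0.0, 1.0)    # column principal
pp_rows, pp_cols = scores(1.0, 1.0)    # principal (both in principal coords)
```

In this formulation, the custom value q in [–1, 1] maps to the exponent pair ((1+q)/2, (1–q)/2), so q = 0 reproduces the symmetrical option and q = ±1 the row and column principal options.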
The normalization comparison figure below compares row and column
normalization with the default symmetrical normalization, given rows = political
views and columns = region. Euclidean distance is used for this figure.
• Row normalization: Row points (political views) are close together, column
points (regions) are spread out. The most extreme points on both axes
reflect the column variable, region.
• Column normalization: Conversely, column points (regions) are close
together and row points (political views) are spread out.
Note that South, the most conservative region, is correctly shown as close to
conservative and extremely conservative under symmetrical normalization. Under
row normalization (rows are political views), South is still closer to conservative
and extremely conservative than to any other political view, but the relationship
is difficult to see and South is mapped as an isolate region distant from the main
cluster of region-polviews points. Under column normalization (columns are
regions), South is actually mapped closer to liberal than to conservative and by
this regional normalization, political views become hard to interpret meaningfully.
In general, symmetrical normalization, which is the default, should be selected
unless there are strong theoretical reasons not to do so.
Summary and cautions. The distance between one row point and another row
point is best interpretable if row standardization has been used, as are distances
between one column point and another column point if column standardization
has been used. Row principal is used to compare row variable points. Column
principal is used to compare column variable points. Principal normalization is a
compromise used for comparing points within either or both variables but not
between variables. Symmetrical normalization, called canonical standardization
elsewhere, standardizes on both row and column profiles and is suitable for
comparing two variables (that is, comparing row points to column points). Though
symmetrical standardization involves a form of averaging which could lead to less
meaningful results than row or column standardization employed separately,
many researchers find symmetrical normalization the most useful type.
Though symmetrical normalization is designed for this purpose, under any form of
standardization the researcher cannot precisely interpret the distance between a
row point and a column point. Rather the researcher must make a non-precise
general statement, such as noting where particular row points and column points
appear in the same map quadrant. In the example above, column points were
regions and row points were political views. The correspondence map distance
between a region and a political view is not an indicator of how highly rated that
region is on a given political view like conservatism. It will not always be true that
the more conservative the region, the less the map distance between the region
and that trait. That is, the map location of a region will be a multivariate
"compromise" position in which the distances are not reliably precise indicators
of "closeness" of row points to column points. As a result, (1) researchers must
make general statements, such as whether row and column points are in the
same quadrant, rather than specific comparisons of exact map distances of row
points to column points; and (2) the researcher may find greater understanding of
the meaning of map distances by referring back to values in the correspondence
table, using the map as an easy graphical guide for where to examine the
correspondence table closely.
The statistics dialog
The statistics dialog allows the researcher to select any or all of the output
options described below in the SPSS example section.
The plots dialog
The plots dialog allows the researcher to select any or all of the output options
described below in the SPSS example section.
The “Plot Dimensions” area of the dialog shown above defaults to display of all
dimensions in the solution. However, the dialog allows the researcher to select
“Restrict the number of dimensions” as an alternative, in which case the
researcher may set the range of dimensions to be displayed. The maximum range
is lowest = 1 to the highest = the number of dimensions in the solution.
The sum of the eigenvalues is the total inertia. Total inertia reflects the spread of
points around the centroid. The percent of inertia (variance) in the original
correspondence table explained by the model is the sum of the percents
accounted for by the computed dimensions. However, usually only the first two
dimensions are computed and used in the correspondence map, so the effective
model will explain a percent of inertia in the original table equal to the sum for
the first two dimensions only. Above, the sum is 6.4% for the first two
dimensions.
Chi-square significance of total inertia. SPSS computes a chi-square test for total
inertia, along with the corresponding probability level. If this level is <= .05, the
conventional cutoff, the researcher concludes the dimensions are at least
somewhat associated with the values of the variables in the original
correspondence table and therefore correspondence analysis may proceed. In the
example above, the model is significant at less than the .001 level even though
the percent of variance (inertia) explained is only 6.4%. Note a finding of
significance does not demonstrate that the row and column variables are
significantly associated. Rather, a significant chi-square for total inertia merely
shows that the total inertia is not so low as to be insignificantly different from
zero.
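The relationship behind this test can be sketched with invented counts: total inertia equals the Pearson chi-square statistic divided by the sample size, so the chi-square test of the table is equivalently a test of total inertia.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

n = N.sum()
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n   # independence model
chi2 = (((N - expected) ** 2) / expected).sum()         # Pearson chi-square
total_inertia = chi2 / n                                # total inertia = chi2 / n
dof = (N.shape[0] - 1) * (N.shape[1] - 1)               # degrees of freedom
```

The probability level SPSS reports is the upper-tail chi-square probability of `chi2` on `dof` degrees of freedom (for example, `scipy.stats.chi2.sf(chi2, dof)` would supply it).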
Inertia as a criterion for selecting the number of dimensions for the solution. To
identify the appropriate number of dimensions to specify, some authors apply a
“scree test”. Not supported directly by SPSS, the scree test involves plotting the
size of the inertia coefficients on the Y axis against the dimension numbers (1, 2,
…n) on the X axis. As successive dimensions will have lower inertia coefficients,
the plotted line will tend to form an exponentially declining curve. The scree
criterion is that the point of inflection in the curve, where it begins to level off,
marks the optimal number of dimensions. That is, the scree test has the
researcher stop interpreting dimensions when the curve forms an "elbow," often
at an early dimension. The scree test is a bit subjective but is widely used in
correspondence analysis. Other criteria use absolute inertia as a criterion (stop
when the inertia falls below .01, for example) or the variance explained criterion
(stop when 90% is explained; others use 80%).
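The absolute-inertia and variance-explained stopping rules can be sketched as follows, using invented inertia coefficients for a hypothetical five-dimension solution (the scree elbow itself is judged visually and is not computed here).

```python
import numpy as np

# Hypothetical inertia coefficients (eigenvalues), one per dimension.
inertia = np.array([0.040, 0.018, 0.006, 0.003, 0.001])

prop = inertia / inertia.sum()   # proportion of inertia per dimension
cum = np.cumsum(prop)            # cumulative proportion explained

# Absolute-inertia criterion: keep dimensions with inertia >= .01.
k_absolute = int((inertia >= 0.01).sum())

# Variance-explained criterion: smallest k explaining >= 80% of inertia.
k_variance = int(np.argmax(cum >= 0.80) + 1)
```

With these invented coefficients both criteria retain two dimensions, but on real data the rules can disagree, which is why the choice of criterion should be reported.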
Singular value. A singular value is the square root of an inertia coefficient (that is,
of an eigenvalue). A singular value is interpreted as the maximum canonical
correlation between the categories of the variables in analysis for any given
dimension. Dimension 1 in the table above has a singular value of .206,
representing the maximum canonical correlation between categories of the row
and column variables on dimension 1. (Note that taking the square root of an
inertia coefficient manually from the table above will give approximate but not
fully correct singular values in most cases as the inertia coefficients shown are
rounded values.) The standard deviation columns refer back to the singular values
and help the researcher assess the relative precision of each dimension.
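For example, taking the .206 singular value cited above, squaring it recovers (approximately, because the displayed value is rounded) the corresponding inertia coefficient.

```python
# Singular value for dimension 1 as reported in the summary table above.
singular_value = 0.206

# Inertia coefficient (eigenvalue) = singular value squared.
inertia_coefficient = singular_value ** 2   # approximately 0.0424
```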
Centroid. Though not shown in the table below, the centroid is the weighted
mean of the row and column profiles and sets the origin of the axes in a
correspondence map.
Contribution tables
Output as “Overview” tables, illustrated below, these tables contain the
contribution of row and column points to the dimensions of the solution. As such,
the contribution tables may throw additional light on the dimensions shown in
the perceptual map and other plots.
Category mass: The mass columns in the contribution/overview tables below
contain the marginal proportions of the row and column categorical variables. Mass
coefficients are used to weight the point profiles when computing point distance.
This weighting has the effect of compensating for unequal numbers of cases in
the columns.
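A minimal sketch of computing the mass coefficients, again with invented counts: masses are simply the marginal totals expressed as proportions of the grand total.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

row_mass = N.sum(axis=1) / N.sum()   # marginal proportions of row categories
col_mass = N.sum(axis=0) / N.sum()   # marginal proportions of column categories
```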
The sum of contributions of dimensions to points will add to 1.0 across all
dimensions for a given point in the full solution where all possible dimensions are
computed. However, the interpreted dimensions usually will sum to less than 1.0,
as is the case in the “Total” column of the tables above.
Confidence row points and confidence column points show the standard
deviations of the row or column scores and are used to assess their precision.
Line Plots
A middle section of the “Plots” button dialog, shown above, supports plotting row
or column variables on each dimension of the solution. “Transformed row
categories” outputs plots of the row variable category values against the
corresponding row scores. “Transformed column categories” does the same for
the column variable. The same dialog allows the researcher to specify label
lengths (max = 20). The plot shown below is for political views on dimension 1,
but similar plots can be output for each variable-dimension combination.
One use for the line plots is to determine if there is an ordinal relationship among
the categories of a variable on a given dimension. Below, polviews is not totally
ordinal on dimension 1 as shown by the dip in the plot line in the middle. While
dimension 1 in general goes from extremely liberal to extremely conservative,
liberal scores higher on dimension 1 than slightly liberal or moderate.
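This ordinality check can be sketched numerically. The scores below are invented dimension-1 values (not the actual GSS output) that mimic the dip described: "liberal" outscores both "slightly liberal" and "moderate", breaking the monotone order.

```python
import numpy as np

# Hypothetical dimension-1 row scores for the seven polviews categories,
# ordered extremely liberal -> extremely conservative (invented values).
scores = np.array([-1.10, -0.45, -0.60, -0.50, 0.35, 0.80, 1.25])

# The categories are ordinal on this dimension only if the scores
# increase monotonically along the category order.
is_ordinal = bool((np.diff(scores) > 0).all())
```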
Assumptions
Data level and distribution
Correspondence analysis is a nonparametric technique which makes no
distributional assumptions, unlike factor analysis. While correspondence analysis
may be used with any level of data, if continuous data are used, they must be
categorized into ranges. Categorical variables are scaled as if they were nominal.
For ordinal and binned interval data, this involves a loss of information which may
attenuate the level of computed association of variables. Differences in ranging
continuous data also may have a significant effect on later interpretation of
results. For this reason, some researchers prefer other techniques when key
variables are continuous.
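A minimal sketch of categorizing a continuous variable into ranges before correspondence analysis, with invented ages and arbitrary cut points; as noted above, a different choice of cut points can change later interpretation.

```python
import numpy as np

# Hypothetical continuous variable (invented ages) to be binned.
age = np.array([23, 35, 47, 52, 61, 29, 44, 70, 38, 55])

edges = [30, 45, 60]                 # three cut points -> four ordered bins
codes = np.digitize(age, edges)      # bin code 0..3 for each case
labels = np.array(["<30", "30-44", "45-59", "60+"])
binned = labels[codes]               # nominal categories for the analysis
```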
Homogeneity of categories
Homogeneity of column categories across categories of row variables is assumed,
otherwise the measure of distance between points of the row variable is
misleading.
Although correspondence analysis is in principle able to handle n-way tables, as
in other forms of factor analysis its dimensions fall off sharply in
interpretability. Typically, correspondence analysis handles only two or three
categorical variables well.
Numerous categories
Correspondence analysis is usually used with discrete variables which have many
categories (that is, with large tables). With only two or three categories, the
dimensions computed in correspondence analysis usually are not more
informative than the original small table itself. For variables with few categories,
log-linear analysis may be preferable to correspondence analysis.
Non-negative values
Case values cannot be negative.
In the dialog above, enter the minimum and maximum value and click Update.
The “Category Constraints” area then populates with the category values (1 to 7
in this case). Click on the first value and check the “Categories must be equal”
radio button. “Equal” will appear after “1”. Repeat for values 2 and 3, then click
the “Update” button, then the “Continue” button.
While the categories of the dependent variable, which is typically the row
variable, are the usual subject for equality constraints when appropriate, it is
possible to constrain column variable categories to be equal as well.
While the categories of the dependent variable, which is typically the row
variable, are the usual subject for designation as supplementary when
appropriate, it is possible to designate column variable categories as
supplementary as well.
Supplemental points will still appear in the correspondence table, profile tables,
line plots, the permuted correspondence table, and in the perceptual map itself.
The contributions/overview table will show a supplemental point as having .000
contribution of the point to the inertia of all dimensions, but the contribution of
the dimensions to the inertia of the point is still displayed.
The arch effect occurs when one variable has a unimodal distribution with respect
to a second (e.g., fish population is highest at a given pH level but decreases above
or below that level). This will cause the distribution of points in the
correspondence map to form an arch shape.
Compression occurs when points at the ends of the distribution appear on the
map very close together, such that their spacing along the primary map axis is not
well related to the amount of change along that dimension. Detrended
correspondence analysis (DCA) corrects these problems.
Detrending removes the arch effect. This is done by dividing the map into a series
of vertical partitions, thus dividing the map along the primary (horizontal) axis.
Within each partition, that cluster of points is relocated to center on the second
(vertical) axis's 0 point. This arbitrary adjustment of the data has been the subject
of methodological criticism.
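The partition-and-center step can be illustrated with a toy sketch (invented arch-shaped scores; this illustrates the idea only and is not DCA software): the map is cut into vertical partitions along axis 1, and within each partition the points are re-centered on axis 2's zero point.

```python
import numpy as np

# Invented arch-shaped configuration: dimension-2 scores (y) are a
# unimodal function of dimension-1 scores (x), plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)                 # dimension 1 scores
y = 1 - x ** 2 + rng.normal(0, 0.05, 100)   # arched dimension 2 scores

n_segments = 5
edges = np.linspace(x.min(), x.max(), n_segments + 1)
y_detrended = y.copy()
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x <= hi)
    # Center each vertical slice on the second axis's zero point.
    y_detrended[mask] -= y_detrended[mask].mean()
```

After the loop, each slice is centered at zero on axis 2, flattening the arch; the arbitrariness lies in the number and placement of the partitions, which is the adjustment criticized in the literature.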
Rescaling is a second step in DCA. Where detrending realigned the points with
respect to the secondary (vertical) axis, rescaling realigns the points along the
primary (horizontal) axis as well as the vertical axis. Both axes are rescaled such
that units represent standard deviations, seeking to make distance in ordination
space mean the same thing along the axes of the map. Note that rescaling
requires numeric (not nominal) measurement of points associated with the
primary axis.
Detrending and rescaling may remove the arch effect, remove compression at
the ends of axes, and make the distances separating points more easily
interpretable.
Copyright 1998, 2008, 2011, 2012 by G. David Garson and Statistical Associates Publishers.
Worldwide rights reserved in all languages and on all media. Do not copy, lend, or post in any
format. Last update, 9/15/2012.
Association, Measures of
Assumptions, Testing of
Canonical Correlation
Case Studies
Cluster Analysis
Content Analysis
Correlation
Correlation, Partial
Correspondence Analysis
Cox Regression
Creating Simulated Datasets
Crosstabulation
Curve Fitting & Nonlinear Regression
Data Levels
Delphi Method
Discriminant Function Analysis
Ethnographic Research
Evaluation Research
Event History Analysis
Factor Analysis
Focus Groups
Game Theory
Generalized Linear Models/Generalized Estimating Equations
GLM (Multivariate), MANOVA, and MANCOVA
GLM (Univariate ), ANOVA, and ANCOVA
GLM Repeated Measures
Grounded Theory
Hierarchical Linear Modeling/Multilevel Analysis/Linear Mixed Models
Integrating Theory in Research Articles and Dissertations
Latent Class Analysis
Life Tables and Kaplan-Meier Survival Analysis
Literature Reviews
Logistic Regression
Log-linear Models,
Longitudinal Analysis
Missing Values Analysis & Data Imputation
Multidimensional Scaling
Multiple Regression
Narrative Analysis
Network Analysis
Ordinal Regression
Parametric Survival Analysis
Partial Least Squares Regression
Participant Observation
Path Analysis
Power Analysis
Probability
Probit Regression and Response Models
Reliability Analysis
Resampling
Research Designs
Sampling
Scales and Standard Measures
Significance Testing
Structural Equation Modeling
Survey Research
Two-Stage Least Squares Regression
Validity
Variance Components Analysis
Weighted Least Squares Regression