© 2012 by G. David Garson and Statistical Associates Publishing. All rights reserved
worldwide in all media. No permission is granted to any user to copy or post this work in
any format or any media.
The author and publisher of this eBook and accompanying materials make no
representations or warranties with respect to the accuracy, applicability, fitness, or
completeness of the contents of this eBook or accompanying materials. The author and
publisher disclaim any warranties (express or implied), merchantability, or fitness for any
particular purpose. The author and publisher shall in no event be held liable to any party for
any direct, indirect, punitive, special, incidental or other consequential damages arising
directly or indirectly from any use of this material, which is provided “as is”, and without
warranties. Further, the author and publisher do not warrant the performance,
effectiveness or applicability of any sites listed or linked to in this eBook or accompanying
materials. All links are for information purposes only and are not warranted for content,
accuracy or any other implied or explicit purpose. This eBook and accompanying materials
are copyright © G. David Garson and Statistical Associates Publishing. No part of this may
be copied, or changed in any format, sold, or used in any way under any circumstances
other than reading by the downloading individual.
Contact:
Email: gdavidgarson@gmail.com
Web: www.statisticalassociates.com
Table of Contents
Overview
Key Concepts and Terms
Correspondence analysis
Correspondence table
Points
Point distance
Correspondence map
The SPSS correspondence analysis interface
The main correspondence analysis dialog
The model dialog
Dimensions in the solution
Distance measure
Standardization method
Normalization method
The statistics dialog
The plots dialog
SPSS correspondence analysis output
Example
The summary of dimensions table
The correspondence table
The perceptual map
Row points and column points scatterplots
Row profiles and column profiles tables
Contribution tables
Row and column confidence points tables
Line Plots
The permuted correspondence table
Assumptions
Data level and distribution
Data do not need to be detrended
Correlated variables which meet assumptions
Correspondence Analysis
Overview
Correspondence analysis is useful when the research focus is on mapping values
(levels) of categorical variables. It is a method of factoring categorical variables
and displaying them in a property space which maps their association in two or
more dimensions. Correspondence analysis is a special case of canonical
correlation, where one set of entities (category levels rather than variables as in
conventional canonical correlation) is related to another set.
Correspondence analysis is often used where a tabular approach is less effective
due to large tables with many rows and/or columns, and/or due to categories
being nominal, with no particular order. Correspondence analysis has been
popular in marketing research, used, for instance, to display customer color, size,
and taste preferences in relation to preferences for Brands A, B, and C.
Key Concepts and Terms
Correspondence table
The correspondence table is the raw crosstabulation of two discrete variables,
with marginals. The object of correspondence analysis is to explain the inertia
(variance) in this table. In essence, the correspondence map is a graphical tool
which helps the researcher easily notice relationships within this table. When
interpreting a correspondence map it is often helpful to refer back to the original
correspondence table.
Points
Also known as "profile points," a point is one of the values (category levels) of one
of the discrete variables in the analysis. For instance, "male" would be a point for
the variable "gender."
Point distance
For distance measurement, SPSS correspondence analysis by default uses
chi-square distance rather than Euclidean distance between points. SPSS
supports Euclidean distance as an alternative. This is discussed further
below. The point distance matrix is the input to principal components analysis,
yielding the dimensions (factors) which correspondence analysis uses to map
points.
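To make the distinction between the two distance definitions concrete, here is a minimal Python/NumPy sketch using an invented 3x3 table of counts (not data from the text). Chi-square distance weights squared profile differences by the inverse column masses; Euclidean distance omits the weighting.

```python
import numpy as np

# Hypothetical 3x3 correspondence table of raw counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

P = N / N.sum()                        # correspondence matrix (proportions)
row_mass = P.sum(axis=1)               # row marginal proportions
col_mass = P.sum(axis=0)               # column marginal proportions
row_profiles = P / row_mass[:, None]   # each row profile sums to 1

def chi2_distance(i, j):
    """Chi-square distance between row profiles i and j: squared
    profile differences weighted by the inverse column masses."""
    diff = row_profiles[i] - row_profiles[j]
    return float(np.sqrt(np.sum(diff ** 2 / col_mass)))

def euclidean_distance(i, j):
    """Plain (unweighted) Euclidean distance between the same profiles."""
    return float(np.linalg.norm(row_profiles[i] - row_profiles[j]))
```

Because every column mass is below 1, the chi-square distance for a pair of differing profiles is always larger than the Euclidean distance for the same pair.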
Correspondence map
The correspondence map, also called the perceptual map, is the central output of
correspondence analysis. A correspondence map displays two of the dimensions
which emerge from principal components analysis of point distances, and points
are displayed in relation to these dimensions. For instance, a correspondence
analysis may seek to relate political outlook (conservative, liberal, etc.) with
region (South, West, etc.), and the correspondence map might show the South is
close to conservative, whereas the West is closer to liberal. This is illustrated
below using 1993 U. S. General Social Survey data.
Distance measure
If Euclidean distance is selected as the alternative to the default chi-square
distance, it is computed as the square root of the sum of squared differences
between pairs of rows and pairs of columns.
Standardization method
The default and “standard” method centers both rows and columns (“Row and
column means are removed”). This method is required for standard
correspondence analysis and is the only available option if the default chi-square
distance is selected. If Euclidean distance is selected, three other
standardization methods are available: (1) only columns are centered (“Column
means are removed”); (2) only rows are centered, after row marginals are
equalized (“Row totals are equalized and means are removed”); and (3) only
columns are centered, after column marginals are equalized (“Column totals
are equalized and means are removed”).
Normalization method
SPSS supports five normalization options:
1. Symmetrical. This is the SPSS default and is recommended when the
research purpose is to explore relationships among the category levels of
the two variables. Row and column scores reflect weighted averages.
2. Principal. This method is recommended when the research purpose is to
explore relationships of category levels within either or both variables. Row
and column scores reflect distances in the correspondence table.
3. Row principal. This method is recommended for comparing among
categories of the row variable. Distances between row points reflect row
distances in the correspondence table.
4. Column principal. The same but for the column variable.
5. Custom. This option allows the researcher to specify a value between –1
and 1, where –1 is equivalent to the column principal option, 0 is equivalent
to the symmetrical option, and +1 is equivalent to the row principal option.
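The options above can be sketched as different scalings of a single singular value decomposition of the standardized residuals of the correspondence table. In the NumPy sketch below (invented counts, not data from the text), the exponent pairs 0.5/0.5, 1/0, 0/1, and 1/1 correspond to symmetrical, row principal, column principal, and principal normalization respectively.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
Dr = np.diag(1 / np.sqrt(r))
Dc = np.diag(1 / np.sqrt(c))

# Standardized residuals; the squared singular values are the inertia
# coefficients (eigenvalues) of the solution.
S = Dr @ (P - np.outer(r, c)) @ Dc
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

def scores(alpha_row, alpha_col):
    """Scale the standard coordinates by powers of the singular values."""
    rows = (Dr @ U) * sv ** alpha_row
    cols = (Dc @ Vt.T) * sv ** alpha_col
    return rows, cols

sym_rows, sym_cols = scores(0.5, 0.5)  # symmetrical (SPSS default)
rp_rows, rp_cols = scores(1.0, 0.0)    # row principal
cp_rows, cp_cols = scores(0.0, 1.0)    # column principal
pp_rows, pp_cols = scores(1.0, 1.0)    # principal (both in principal coords)
```

In this formulation, the custom value q in [–1, 1] maps to the exponent pair ((1+q)/2, (1–q)/2), so q = 0 reproduces the symmetrical option and q = ±1 the row and column principal options.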
The normalization comparison figure below compares row and column
normalization with the default symmetrical normalization, given rows = political
views and columns = region. Euclidean distance is used for this figure.
• Row normalization: Row points (political views) are close together, column
points (regions) are spread out. The most extreme points on both axes
reflect the column variable, region.
• Column normalization: Conversely, column points (regions) are close
together and row points (political views) are spread out.
Note that South, the most conservative region, is correctly shown as close to
conservative and extremely conservative under symmetrical normalization. Under
row normalization (rows are political views), South is still closer to conservative
and extremely conservative than to any other political view, but the relationship
is difficult to see and South is mapped as an isolate region distant from the main
cluster of region-polviews points. Under column normalization (columns are
regions), South is actually mapped closer to liberal than to conservative and by
this regional normalization, political views become hard to interpret meaningfully.
In general, symmetrical normalization, which is the default, should be selected
unless there are strong theoretical reasons not to do so.
Summary and cautions. The distance between one row point and another row
point is best interpretable if row standardization has been used, as are distances
between one column point and another column point if column standardization
has been used. Row principal is used to compare row variable points. Column
principal is used to compare column variable points. Principal normalization is a
compromise used for comparing points within either or both variables but not
between variables. Symmetrical normalization, called canonical standardization
elsewhere, standardizes on both row and column profiles and is suitable for
comparing two variables (that is, comparing row points to column points). Though
symmetrical standardization involves a form of averaging which could lead to less
meaningful results than row or column standardization employed separately,
many researchers find symmetrical normalization the most useful type.
Though symmetrical normalization is designed for this purpose, under any form of
standardization the researcher cannot precisely interpret the distance between a
row point and a column point. Rather the researcher must make a non-precise
general statement, such as noting where particular row points and column points
appear in the same map quadrant. In the example above, column points were
regions and row points were political views. The correspondence map distance
between a region and a political view is not an indicator of how highly rated that
region is on a given political view like conservatism. It will not always be true that
the more conservative the region, the less the map distance between the region
and that trait. That is, the map location of a region will be a multivariate
"compromise" position in which the distances are not reliably precise indicators
of "closeness" of row points to column points. As a result, (1) researchers must
make general statements, such as whether row and column points are in the
same quadrant, rather than specific comparisons of exact map distances of row
points to column points; and (2) the researcher may find greater understanding of
the meaning of map distances by referring back to values in the correspondence
table, using the map as an easy graphical guide for where to examine the
correspondence table closely.
The statistics dialog
The statistics dialog allows the researcher to select any or all of the output
options described below in the SPSS example section.
The plots dialog
The plots dialog allows the researcher to select any or all of the output options
described below in the SPSS example section.
The “Plot Dimensions” area of the dialog shown above defaults to display of all
dimensions in the solution. However, the dialog allows the researcher to select
“Restrict the number of dimensions” as an alternative, in which case the
researcher may set the range of dimensions to be displayed. The maximum range
is lowest = 1 to the highest = the number of dimensions in the solution.
The sum of the eigenvalues is the total inertia. Total inertia reflects the spread of
points around the centroid. The percent of inertia (variance) in the original
correspondence table explained by the model is the sum of the percents
accounted for by the computed dimensions. However, usually only the first two
dimensions are computed and used in the correspondence map, so the effective
model will explain a percent of inertia in the original table equal to the sum for
the first two dimensions only. Above, the sum is 6.4% for the first two
dimensions.
Chi-square significance of total inertia. SPSS computes a chi-square test for total
inertia, along with the corresponding probability level. If this level is <= .05, the
conventional cutoff, the researcher concludes the dimensions are at least
somewhat associated with the values of the variables in the original
correspondence table and therefore correspondence analysis may proceed. In the
example above, the model is significant at less than the .001 level even though
the percent of variance (inertia) explained is only 6.4%. Note a finding of
significance does not demonstrate that the row and column variables are
significantly associated. Rather, a significant chi-square for total inertia merely
shows that the total inertia is not so low as to be insignificantly different from
zero.
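The relationship behind this test can be sketched with invented counts: total inertia equals the Pearson chi-square statistic divided by the sample size, so the chi-square test of the table is equivalently a test of total inertia.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

n = N.sum()
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n   # independence model
chi2 = (((N - expected) ** 2) / expected).sum()         # Pearson chi-square
total_inertia = chi2 / n                                # total inertia = chi2 / n
dof = (N.shape[0] - 1) * (N.shape[1] - 1)               # degrees of freedom
```

The probability level SPSS reports is the upper-tail chi-square probability of `chi2` on `dof` degrees of freedom (for example, `scipy.stats.chi2.sf(chi2, dof)` would supply it).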
Inertia as a criterion for selecting the number of dimensions for the solution. To
identify the appropriate number of dimensions to specify, some authors apply a
“scree test”. Not supported directly by SPSS, the scree test involves plotting the
size of the inertia coefficients on the Y axis against the dimension numbers (1, 2,
…n) on the X axis. As successive dimensions will have lower inertia coefficients,
the plotted line will tend to form an exponentially declining curve. The scree
criterion is that the point of inflection in the curve, where it begins to level off,
marks the optimal number of dimensions. That is, the scree test has the
researcher stop interpreting dimensions when the curve forms an "elbow," often
at an early dimension. The scree test is a bit subjective but is widely used in
correspondence analysis. Other criteria use absolute inertia as a criterion (stop
when the inertia falls below .01, for example) or the variance explained criterion
(stop when 90% is explained; others use 80%).
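The absolute-inertia and variance-explained stopping rules can be sketched as follows, using invented inertia coefficients for a hypothetical five-dimension solution (the scree elbow itself is judged visually and is not computed here).

```python
import numpy as np

# Hypothetical inertia coefficients (eigenvalues), one per dimension.
inertia = np.array([0.040, 0.018, 0.006, 0.003, 0.001])

prop = inertia / inertia.sum()   # proportion of inertia per dimension
cum = np.cumsum(prop)            # cumulative proportion explained

# Absolute-inertia criterion: keep dimensions with inertia >= .01.
k_absolute = int((inertia >= 0.01).sum())

# Variance-explained criterion: smallest k explaining >= 80% of inertia.
k_variance = int(np.argmax(cum >= 0.80) + 1)
```

With these invented coefficients both criteria retain two dimensions, but on real data the rules can disagree, which is why the choice of criterion should be reported.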
Singular value. A singular value is the square root of an inertia coefficient (that is,
of an eigenvalue). A singular value is interpreted as the maximum canonical
correlation between the categories of the variables in analysis for any given
dimension. Dimension 1 in the table above has a singular value of .206,
representing the maximum canonical correlation between categories of the row
and column variables on dimension 1. (Note that taking the square root of an
inertia coefficient manually from the table above will give approximate but not
fully correct singular values in most cases as the inertia coefficients shown are
rounded values.) The standard deviation columns refer back to the singular values
and help the researcher assess the relative precision of each dimension.
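For example, taking the .206 singular value cited above, squaring it recovers (approximately, because the displayed value is rounded) the corresponding inertia coefficient.

```python
# Singular value for dimension 1 as reported in the summary table above.
singular_value = 0.206

# Inertia coefficient (eigenvalue) = singular value squared.
inertia_coefficient = singular_value ** 2   # approximately 0.0424
```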
Centroid. Though not shown in the table below, the centroid is the weighted
mean of the row and column profiles and sets the origin of the axes in a
correspondence map.
Contribution tables
Output as “Overview” tables, illustrated below, these tables contain the
contribution of row and column points to the dimensions of the solution. As such,
the contribution tables may throw additional light on the dimensions shown in
the perceptual map and other plots.
Category mass: The mass columns in the contribution/overview tables below
contain the marginal proportions of the row and column categorical variables. Mass
coefficients are used to weight the point profiles when computing point distance.
This weighting has the effect of compensating for unequal numbers of cases in
the columns.
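A minimal sketch of computing the mass coefficients, again with invented counts: masses are simply the marginal totals expressed as proportions of the grand total.

```python
import numpy as np

# Hypothetical correspondence table of counts (invented data).
N = np.array([[20, 10,  5],
              [10, 30, 15],
              [ 5, 10, 25]], dtype=float)

row_mass = N.sum(axis=1) / N.sum()   # marginal proportions of row categories
col_mass = N.sum(axis=0) / N.sum()   # marginal proportions of column categories
```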
The sum of contributions of dimensions to points will add to 1.0 across all
dimensions for a given point in the full solution where all possible dimensions are
computed. However, the interpreted dimensions usually will sum to less than 1.0,
as is the case in the “Total” column of the tables above.
Confidence row points and confidence column points show the standard
deviations of the row or column scores and are used to assess their precision.
Line Plots
A middle section of the “Plots” button dialog, shown above, supports plotting row
or column variables on each dimension of the solution. “Transformed row
categories” outputs plots of the row variable category values against the
corresponding row scores. “Transformed column categories” does the same for
the column variable. The same dialog allows the researcher to specify label
lengths (max = 20). The plot shown below is for political views on dimension 1,
but similar plots can be output for each variable-dimension combination.
One use for the line plots is to determine if there is an ordinal relationship among
the categories of a variable on a given dimension. Below, polviews is not totally
ordinal on dimension 1 as shown by the dip in the plot line in the middle. While
dimension 1 in general goes from extremely liberal to extremely conservative,
liberal scores higher on dimension 1 than slightly liberal or moderate.
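This ordinality check can be sketched numerically. The scores below are invented dimension-1 values (not the actual GSS output) that mimic the dip described: "liberal" outscores both "slightly liberal" and "moderate", breaking the monotone order.

```python
import numpy as np

# Hypothetical dimension-1 row scores for the seven polviews categories,
# ordered extremely liberal -> extremely conservative (invented values).
scores = np.array([-1.10, -0.45, -0.60, -0.50, 0.35, 0.80, 1.25])

# The categories are ordinal on this dimension only if the scores
# increase monotonically along the category order.
is_ordinal = bool((np.diff(scores) > 0).all())
```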
Assumptions
Data level and distribution
Correspondence analysis is a nonparametric technique which makes no
distributional assumptions, unlike factor analysis. While correspondence analysis
may be used with any level of data, if continuous data are used, they must be
categorized into ranges. Categorical variables are scaled as if they were nominal.
For ordinal and binned interval data, this involves a loss of information which may
attenuate the level of computed association of variables. Differences in ranging
continuous data also may have a significant effect on later interpretation of
results. For this reason, some researchers prefer other techniques when key
variables are continuous.
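A minimal sketch of categorizing a continuous variable into ranges before correspondence analysis, with invented ages and arbitrary cut points; as noted above, a different choice of cut points can change later interpretation.

```python
import numpy as np

# Hypothetical continuous variable (invented ages) to be binned.
age = np.array([23, 35, 47, 52, 61, 29, 44, 70, 38, 55])

edges = [30, 45, 60]                 # three cut points -> four ordered bins
codes = np.digitize(age, edges)      # bin code 0..3 for each case
labels = np.array(["<30", "30-44", "45-59", "60+"])
binned = labels[codes]               # nominal categories for the analysis
```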
Homogeneity of categories
Homogeneity of column categories across categories of row variables is assumed,
otherwise the measure of distance between points of the row variable is
misleading.
Although correspondence analysis is in principle able to handle n-way tables, as
in other forms of factor analysis its dimensions fall off sharply in
interpretability. Typically, correspondence analysis handles only two or three
categorical variables well.
Numerous categories
Correspondence analysis is usually used with discrete variables which have many
categories (that is, with large tables). With only two or three categories, the
dimensions computed in correspondence analysis usually are not more
informative than the original small table itself. For variables with few categories,
log-linear analysis may be preferable to correspondence analysis.
Non-negative values
Case values cannot be negative.
In the dialog above, enter the minimum and maximum value and click Update.
The “Category Constraints” area then populates with the category values (1 to 7
in this case). Click on the first value and check the “Categories must be equal”
radio button. “Equal” will appear after “1”. Repeat for values 2 and 3, then click
the “Update” button, then the “Continue” button.
While the categories of the dependent variable, which is typically the row
variable, are the usual subject for equality constraints when appropriate, it is
possible to constrain column variable categories to be equal as well.
While the categories of the dependent variable, which is typically the row
variable, are the usual subject for designation as supplementary when
appropriate, it is possible to designate column variable categories as
supplementary as well.
Supplemental points will still appear in the correspondence table, profile tables,
line plots, the permuted correspondence table, and in the perceptual map itself.
The contributions/overview table will show a supplemental point as having .000
contribution of the point to the inertia of all dimensions, but the contribution of
the dimensions to the inertia of the point is still displayed.
The arch effect occurs when one variable has a unimodal distribution with respect
to a second (e.g., fish population is highest at a given pH level but decreases above
or below that level). This will cause the distribution of points in the
correspondence map to form an arch shape.
Compression occurs when points at the ends of the distribution appear on the
map very close together, such that their spacing along the primary map axis is not
well related to the amount of change along that dimension. Detrended
correspondence analysis (DCA) corrects these problems.
Detrending removes the arch effect. This is done by dividing the map into a series
of vertical partitions, thus dividing the map along the primary (horizontal) axis.
Within each partition, that cluster of points is relocated to center on the second
(vertical) axis's 0 point. This arbitrary adjustment of the data has been the subject
of methodological criticism.
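The partition-and-center step can be illustrated with a toy sketch (invented arch-shaped scores; this illustrates the idea only and is not DCA software): the map is cut into vertical partitions along axis 1, and within each partition the points are re-centered on axis 2's zero point.

```python
import numpy as np

# Invented arch-shaped configuration: dimension-2 scores (y) are a
# unimodal function of dimension-1 scores (x), plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)                 # dimension 1 scores
y = 1 - x ** 2 + rng.normal(0, 0.05, 100)   # arched dimension 2 scores

n_segments = 5
edges = np.linspace(x.min(), x.max(), n_segments + 1)
y_detrended = y.copy()
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x <= hi)
    # Center each vertical slice on the second axis's zero point.
    y_detrended[mask] -= y_detrended[mask].mean()
```

After the loop, each slice is centered at zero on axis 2, flattening the arch; the arbitrariness lies in the number and placement of the partitions, which is the adjustment criticized in the literature.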
Rescaling is a second step in DCA. Where detrending realigned the points with
respect to the secondary (vertical) axis, rescaling realigns the points along the
primary (horizontal) axis as well as the vertical axis. Both axes are rescaled such
that units represent standard deviations, seeking to make distance in ordination
space mean the same thing along the axes of the map. Note that rescaling
requires numeric (not nominal) measurement of points associated with the
primary axis.
Detrending and rescaling may remove the arch effect, remove compression at
the ends of axes, and make the distances separating points more easily
interpretable.
Copyright 1998, 2008, 2011, 2012 by G. David Garson and Statistical Associates Publishers.
Worldwide rights reserved in all languages and on all media. Do not copy, lend, or post in any
format. Last update, 9/15/2012.
Association, Measures of
Assumptions, Testing of
Canonical Correlation
Case Studies
Cluster Analysis
Content Analysis
Correlation
Correlation, Partial
Correspondence Analysis
Cox Regression
Creating Simulated Datasets
Crosstabulation
Curve Fitting & Nonlinear Regression
Data Levels
Delphi Method
Discriminant Function Analysis
Ethnographic Research
Evaluation Research
Event History Analysis
Factor Analysis
Focus Groups
Game Theory
Generalized Linear Models/Generalized Estimating Equations
GLM (Multivariate), MANOVA, and MANCOVA
GLM (Univariate ), ANOVA, and ANCOVA
GLM Repeated Measures
Grounded Theory
Hierarchical Linear Modeling/Multilevel Analysis/Linear Mixed Models
Integrating Theory in Research Articles and Dissertations
Latent Class Analysis
Life Tables and Kaplan-Meier Survival Analysis
Literature Reviews
Logistic Regression
Log-linear Models,
Longitudinal Analysis
Missing Values Analysis & Data Imputation
Multidimensional Scaling
Multiple Regression
Narrative Analysis
Network Analysis
Ordinal Regression
Parametric Survival Analysis
Partial Least Squares Regression
Participant Observation
Path Analysis
Power Analysis
Probability
Probit Regression and Response Models
Reliability Analysis
Resampling
Research Designs
Sampling
Scales and Standard Measures
Significance Testing
Structural Equation Modeling
Survey Research
Two-Stage Least Squares Regression
Validity
Variance Components Analysis
Weighted Least Squares Regression