
The Sociological Quarterly 22 (Summer 1981):413420

Interpreting Proportional Reduction in Error


Measures as Percentage of Variation Explained *
Frederick J. Kviz, University of Illinois-Medical Center
Costner (1965) showed that many of the most commonly used measures of association may be conceptually interpreted as indicating proportional reduction in prediction error. But because it is verbally cumbersome, and conceptually complex for
some measures, the proportional reduction in error interpretation has not been widely
incorporated in research reports where appropriate. Instead, measures of association
continue to be interpreted in abstract terms distinguishing between broad levels of
strength. This paper demonstrates that all proportional reduction in error measures
of association may be alternately interpreted as indicating the percent of variation explained. Because this interpretation is conceptually meaningful in a manner highly
relevant for scientific investigation, more convenient to apply in research reports, and
already familiar to most social scientists, it is argued that it be standardly applied to
all proportional reduction in error measures.

Costner (1965) showed that many of the measures of association most commonly used in the social sciences may be interpreted as indicating the proportional reduction in error by predicting categories, pair orders, ranks, or values
based on the bivariate distribution of observations as opposed to the distribution
for a single variable or a condition of independence between two variables. Although not an exhaustive list, the following may be considered as proportional
reduction in error measures: Goodman and Kruskal's lambda (Guttman's coefficient of relative predictability), Goodman and Kruskal's tau-b, Goodman and
Kruskal's gamma, Yule's Q, Pearson's r², correlation ratio (eta squared) (Costner, 1965); Somers' d_yx and d_xy (Somers, 1968; Costner, 1968); Kendall's tau
(Wilson, 1969); the square of Spearman's rho (ρ) (Mueller et al., 1970); and
Freeman's theta (Crittenden and Montgomery, 1980).
The most significant advantage to be derived from this conceptual approach
to measures of association is that it would elucidate the interpretation of research
findings which focus on the analysis of bivariate relationships. That is, instead of
the previously standard interpretation of observed measures of association as indicating either no relationship or a relationship whose degree is abstractly evaluated as weak, moderate, or strong (cf., Davis, 1971; Gehring, 1978; Levin, 1977;
Ott, Mendenhall, and Larson, 1978), the proportional reduction in error approach provides a clear conceptual basis for the interpretation of such results.
Proportional reduction in error has been increasingly included as a heuristic
device in statistics textbooks (cf., Blalock, 1972; Leonard, 1976; Loether and
McTavish, 1974; Mueller, Schuessler, and Costner, 1977; Ott, Mendenhall and
Larson, 1978; Reynolds, 1977).

© 1981 by The Sociological Quarterly. All rights reserved. 0038-0253/81/1400-0413$00.75
* The author is grateful to Herbert L. Costner and anonymous reviewers for helpful comments on drafts of this manuscript. Frederick J. Kviz's address is: University of Illinois-Medical Center, 2121 West Taylor Street, Chicago, Illinois 60680.

Additionally, it continues to be used in the development of new measures of association. For example, Crittenden and Montgomery (1980) recently introduced two asymmetric measures of association (nu
and iota) for cases involving a nominal level independent variable and an ordinal
level dependent variable. But the proportional reduction in error interpretation
has not been used widely in reports of research findings. This is probably because the terminology is cumbersome and because a unique interpretation is
required for each measure. For example, it is rarely reported that a lambda value
of .57 indicates that errors committed when predicting the marginal
modal category for the dependent variable for all observations are reduced by 57
percent by predicting the conditional modal category for the dependent variable
within categories of the independent variable. Similarly, it is not frequently reported that a gamma value of .62 indicates that errors committed when predicting pair orders by random guessing are reduced by 62 percent by predicting the
order of the most prevalent type of ordered pair observed.
This paper argues for a single, convenient interpretation of proportional reduction in error measures in terms already familiar to social scientists. Specifically,
all proportional reduction in error measures may be interpreted as indicating the
percentage of variation explained in a manner similar to that which is typically
employed in the interpretation of the square of the bivariate linear correlation
coefficient (Pearson's r²). This interpretation is derived directly from the proportional reduction in error computational format that is common to all such
measures.

Proportional Reduction in Error


The basic logic underlying the proportional reduction in error approach to measuring association is that if two variables are related, then information about both
variables should be more useful than information about either of them alone.
Proportional reduction in error measures of association indicate how much more
useful information about two variables is than information about only one of
them.
The usefulness of information is measured by the extent to which errors are
committed in attempting to predict categories, pair orders, ranks, or values of
what may be generally termed a dependent variable. If a relationship exists between two variables then relatively fewer errors should be committed when predictions are based on information about the bivariate distribution of observations
in comparison with basing predictions upon information about the univariate
distribution for the dependent variable alone.
The proportional reduction in error approach to measuring association consists of four elements:
1. Prediction Rule 1. A method for predicting categories, pair orders, ranks, or values on the basis of knowledge about the distribution of observations on one variable only.
2. Errors Committed Using Prediction Rule 1. A method of measuring prediction errors committed when prediction rule 1 is applied.
3. Prediction Rule 2. A method for predicting categories, pair orders, ranks, or values with the addition of knowledge about the distribution of observations on a second variable.
4. Errors Committed Using Prediction Rule 2. A method of measuring prediction errors committed when prediction rule 2 is applied.


The specification of each prediction rule and method of measuring prediction
errors is unique for each proportional reduction in error measure depending on
the level of measurement of the variables and the purpose of the analysis.
The general computational format for all proportional reduction in error
measures is:

PRE = (E1 - E2) / E1                                    (1)

where PRE = proportional reduction in error,
E1 = prediction errors committed using prediction rule 1, and
E2 = prediction errors committed using prediction rule 2.
The magnitude of a proportional reduction in error measure indicates the proportion of prediction errors committed using prediction rule 1 that are eliminated
by switching to prediction rule 2.
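This computational format can be sketched in a few lines of Python; the function name and the error counts are illustrative, not from the paper.

```python
def pre(e1, e2):
    """Proportional reduction in error: the share of rule-1 prediction
    errors eliminated by switching to prediction rule 2."""
    if e1 == 0:
        raise ValueError("no rule-1 errors; PRE is undefined")
    return (e1 - e2) / e1

# If rule 1 commits 100 prediction errors and rule 2 commits 43,
# the hypothetical measure equals .57:
print(pre(100, 43))  # 0.57
```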

Variation and Prediction Error


By definition, variation may be observed for any variable. Indeed, the fact that
observations display variation is what makes them interesting and worthy of statistical investigation. Research generally consists of an effort to describe, explain,
and predict variation among a set of observations. Because variables are measured in different ways, the manner in which variation may be measured differs
according to the level of measurement. For example, at the nominal level, variation (often referred to as dispersion for nominal variables) may be measured as
the number of observations that deviate from an expected or predicted category.
This operational definition of variation is employed, for instance, in the computation of the variation ratio, which is described by Freeman (1965) as the proportion of observations that are not in the modal category. It is also employed
in the index of qualitative variation (Leik and Gove, 1971:284; Mueller et al.,
1977:179-81) and related measures of population diversity (Agresti and Agresti, 1978). There are two basic operational definitions of variation for ordinal
level variables. First, variation may be measured as the number of observed
pairs that deviate from an expected or predicted pair ordering, which is the procedure used in the computation of pair-based measures of association (cf., Leik
and Gove, 1971). A second approach is to compute deviations from an expected
(mean) rank, as in the computation of Spearman's rho squared (Mueller et al.,
1970:271). Finally, for interval and ratio level variables, variation may be measured as arithmetic deviations of observed values from an expected or predicted
value (usually a mean). Among the most commonly used interval level measures
of variation are the average deviation, sum of squares, variance, and standard
deviation.
Thus, the concept of variation may be applied to variables at all levels of measurement and may be generally defined as deviation from a predicted category, pair order, rank, or value. Variation is not unique to interval level variables and its measurement is not restricted to the computation of the variance, even at the interval level.
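As a small numerical illustration of the nominal-level case, the variation ratio described by Freeman (1965) can be computed directly; the function below is a sketch for illustration, not code from the paper.

```python
from collections import Counter

def variation_ratio(observations):
    """Proportion of observations not in the modal category --
    a nominal-level measure of variation (Freeman, 1965)."""
    counts = Counter(observations)
    modal_frequency = max(counts.values())
    return 1 - modal_frequency / len(observations)

# Three of five observations fall in the modal category "a":
print(variation_ratio(["a", "a", "a", "b", "c"]))  # 0.4
```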
Furthermore, deviation from a predicted category, pair order, rank, or value
may be generally termed prediction error. Therefore, prediction error, as defined
in proportional reduction in error terminology, is equivalent to variation as the
term may be generally defined. This identity was pointed out by Senter (1969:429), who noted that, contrary to everyday usage, the statistical definition of error is equivalent to variation. Senter maintained, therefore, that "reduction in error, statistically speaking, means reduction in variability" (emphasis in original; also see Leonard, 1976:326).

Percent of Variation Explained


Researchers are able to explain variation on a given variable to the extent that
they are successful in making predictions regarding observations on that variable.
Prediction errors will be committed to the extent that a researcher is not able to
explain all of the variation on that variable. In order to improve predictive success, a researcher may collect additional information about the same set of observations regarding a second variable, which is believed to be related to the first
variable. If a relationship exists between the variables then this additional information will enable the researcher to explain additional variation among observations on the first variable and commit fewer prediction errors than would have
been the case initially.
For all proportional reduction in error measures, the values E1 and E2 (prediction error) are measures of variation from the predictions made according to prediction rules 1 and 2, respectively. Specifically, E1 indicates the total variation to be explained and E2 indicates the unexplained variation that remains after prediction rule 2 is applied. Therefore, equation 1 may be re-expressed as

PRE = (Total Variation - Unexplained Variation) / Total Variation       (2)

Furthermore, because total variation may be partitioned into explained and unexplained variation (Total Variation = Explained Variation + Unexplained Variation), explained variation can be expressed as

Explained Variation = Total Variation - Unexplained Variation       (3)

By substitution in equation 2, proportional reduction in error may be expressed as the ratio of explained variation to total variation, or percentage of variation explained when the result obtained from equation 4 is multiplied by 100.

PRE = Explained Variation / Total Variation       (4)

The interpretation as percentage of variation explained may be applied to any proportional reduction in error measure. It has been applied almost exclusively to Pearson's r², however, because it is easily recognized that E1 and E2 are measures of total and unexplained variation, respectively, in that case. That is, r² indicates the proportion of variation (measured as variance in this case) observed in variable Y that can be explained by variable X when values for variable Y are predicted using the least-squares regression formula (Ŷ = a_yx + b_yx·X) as opposed to predicting values for variable Y based on the distribution of Y alone, where the mean of Y (Ȳ) is the predicted value. Accordingly, total variation (E1) = Σ(Y - Ȳ)², and unexplained variation (E2) = Σ(Y - Ŷ)². Dividing each of these expressions by N, total variation is recognized as the variance of observations about the mean of variable Y and unexplained variation as the variance about the least-squares regression line.
It is not so readily recognized that E1 and E2 are measures of total and unexplained variation for other measures of association as well. This relationship is particularly elusive when measures for ordinal and nominal level variables are considered. Although Costner (1965:343-44) recognized that E1 and E2 are measures of variation, he did not completely explore the similarities between Pearson's r² and other proportional reduction in error measures because, as indicated in the following quotation, he conceived a measure of association to be interpretable as the percentage of variation explained only when applied to interval or ratio level variables, where it is possible to measure variance: ". . . although the general intuitive notion of variation is by no means inapplicable to ordinal and nominal scales, the concept of variance as used in r², defined in terms of squared deviations, is well defined for interval and ratio scales only."
But employing the general definition of variation as deviation from a predicted
category, pair order, rank, or value (i.e., prediction error), any proportional
reduction in error measure may be interpreted as indicating the percentage of
variation explained regardless of the level of measurement and the specification
of prediction rules and errors.
For example, the concept of variation employed in computing Spearman's rho squared (ρ²) is parallel to that for r² in that it is measured as the sum of squared deviations from a predicted rank rather than a predicted value. For ρ², total variation is equal to the sum of squared deviations of observed ranks from the mean rank for the predicted variable (Y): E1 = Σ(R_Y - R̄_Y)², where R_Y = the rank for each observation on variable Y and R̄_Y = the mean of the ranks on variable Y, computed as (N + 1)/2. Unexplained variation is equal to the sum of the squared deviations of observed ranks from the rank predicted based on the rank of each observation on the second variable (X): E2 = Σ(R_Y - R̂_Y)², where R̂_Y = (R_X · ρ) + ((N + 1)/2)(1 - ρ), and R_X = an observed rank on variable X.
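Under this logic, and assuming untied ranks, the identity can be checked numerically; the sketch below uses illustrative rank data and a function name of its own.

```python
def rho_squared_as_pre(rank_x, rank_y):
    """Spearman's rho^2 as PRE (untied ranks): E1 is the sum of squared
    deviations of the Y ranks from the mean rank (N + 1)/2; E2 is the
    sum of squared deviations from the ranks predicted via the X ranks."""
    n = len(rank_y)
    mean_rank = (n + 1) / 2
    # With untied ranks, the regression slope of Y ranks on X ranks is rho
    rho = (sum((rx - mean_rank) * (ry - mean_rank)
               for rx, ry in zip(rank_x, rank_y))
           / sum((rx - mean_rank) ** 2 for rx in rank_x))
    e1 = sum((ry - mean_rank) ** 2 for ry in rank_y)
    e2 = sum((ry - (rho * rx + (1 - rho) * mean_rank)) ** 2
             for rx, ry in zip(rank_x, rank_y))
    return (e1 - e2) / e1   # equals rho squared

print(round(rho_squared_as_pre([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 6))  # 0.64
```

Here rho is .80 for the illustrative ranks, so the PRE form returns .64, the square of rho.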

Goodman and Kruskal's gamma, probably the most often used measure of association between ordinal variables, indicates the proportional reduction in error when predicting ordered pairs of observations first by random guessing (rule 1) and then by examination of the relative preponderance of same- and reverse-ordered pairs (also referred to as concordant and discordant pairs, or agreements and inversions, respectively) actually observed (rule 2). Total variation (E1) is equal to one-half the total number of ordered pairs because the probability of making a correct prediction by random guessing for a set of dichotomous categories is .5. According to rule 2, each pair is predicted to be either same- or reverse-ordered depending on which type of pair occurs most frequently, and unexplained variation (E2) is equal to the number of same- or reverse-ordered pairs, whichever is smaller. Furthermore, this interpretation may be extended to Yule's Q (which is identical to gamma for a 2×2 cross-classification), Somers' d_yx and d_xy, and Kendall's tau once ties are taken into account, because their structure differs only slightly from that of gamma regarding the treatment of tied pairs.
Finally, an example of the applicability of the percentage of variation explained interpretation for nominal level variables is Goodman and Kruskal's lambda, which measures total variation as the number of observations which are not located within the modal category for the predicted variable. In other words, E1 = N - f_mo, where N = the total number of observations and f_mo = the number of observations located within the modal category. If this expression is divided by N, the result is equivalent to the variation ratio. Unexplained variation (E2) in the computation of lambda is equal to the total number of observations which are not located within the modal category of the predicted variable when predictions are made within each category of a second variable.

Discussion
The percentage of variation explained interpretation is most appropriate for
asymmetrical measures of association, for which the direction of prediction from
an independent variable to a dependent variable is unambiguous. Caution must
be exercised in the case of symmetrical measures (e.g., Goodman and Kruskal's gamma, Kendall's tau, and Pearson's r²) because a result may be interpreted in either of two directions; that is, as indicating the percentage of variation in the first variable that is explained by the second variable or as the percentage of variation in the second variable that is explained by the first variable. Detailed discussions of this problem as it pertains to the interpretation of Pearson's r², and
which may be generalized to other symmetrical proportional reduction in error
measures, are presented in many texts (cf., Blalock, 1972; Korin, 1975; Loether
and McTavish, 1974; Mueller, Schuessler, and Costner, 1977).
It is also important to guard against the inappropriate use of this interpretation with certain correlation coefficients, such as Pearson's r, eta, and Spearman's rho, which are not proportional reduction in error measures. The squares of these coefficients, however, are proportional reduction in error measures and may therefore be interpreted as indicating the percentage of variation explained. In a typical Pearson correlation analysis, for example, the most often reported result is the correlation coefficient, r. The main disadvantage of squared coefficients such as r² is that they are always positive values and therefore do not indicate the direction of a relationship. But although coefficients such as r, which may range in value from -1.00 to 1.00, indicate direction, there is no convenient conceptual basis for interpreting them as indicators of the strength of a relationship. Although many researchers do attempt to consider the size of r to evaluate the strength of a relationship, this can be seriously misleading because the relationship between r and the percent of variation explained is not linear. As a result, rather high values of r may be observed when much less than 50 percent of the variation is explained. For example, when r = .50, only 25 percent of the variation is explained; when r = .60, 36 percent of the variation is explained; and even when r is as high as .70, only 49 percent of the variation is explained.
Therefore, both values, r and r², should be reported for a Pearson correlation analysis. Similarly, both the unsquared and squared coefficients should be reported for analyses in which eta and Spearman's rho are computed.
The reader is especially cautioned regarding possible confusion between the
terms variation and variance. Variation is a general term referring to the spread
or dispersion of observations on any variable and, as described earlier in this
paper, may be measured by various methods according to the level of measurement and the purpose at hand. Variance is one particular method for measuring
variation among interval level observations. Therefore, the terminology percentage of variance explained, which is most familiar from Pearson correlation analysis, may be substituted when, and only when, prediction errors have been
measured as variance from a mean value. In all other cases, the general term
variation applies.
A major advantage of the proportional reduction in error interpretation is that
it communicates information regarding the nature or form of a relationship in
accordance with the specification of the prediction rules and prediction errors
for each measure. Although the alternate interpretation as percentage of variation explained is less precise in this regard it nevertheless provides a conceptually
useful and convenient universal interpretive approach for all proportional reduction in error measures.
Just as Costner (1965) argued that the proportional reduction in error interpretation obviates the arbitrary approach to interpreting observed values of a
measure of association in terms of broad levels of degree of strength, so does
the percentage of variation explained interpretation. Furthermore, a more useful
evaluation of the strength of a relationship is provided by the latter approach. For
example, because it is desirable to explain a majority, if not all, of the variation
observed for a variable, a conceptual rather than arbitrary cutting-point for distinguishing between a weak and a strong relationship, as indicated by a proportional reduction in error measure, might be set at ±.50. This would define a
strong relationship as one where at least 50 percent of the variation is explained
and a weak relationship as one where less than 50 percent of the variation is
explained.
This is only a suggested guideline, however. Additionally, evaluations of
strength must also consider the quality of the data and the purpose of the analysis. That is, when exploring an area in which little or no research has been conducted previously, or valid and reliable measurement methods have not been
developed, it would be wise for the investigator not to ignore relatively weak relationships because they may indicate areas that warrant further investigation. In
contrast, when in an area in which a considerable amount of research has already
been reported, and when using measurement methods whose validity and reliability are well established, the investigator may focus only on relationships that
are relatively strong.
The percentage of variation explained interpretation contributes greatly to the
evaluation of the substantive significance of research findings where proportional reduction in error measures are reported. But it is emphasized that proportional reduction in error measures, regardless of how great their magnitude, must still
be tested for statistical significance when they are computed for data collected
from a sample.
REFERENCES
Agresti, A. and B. F. Agresti. 1978. Statistical analysis of qualitative variation. Pp. 204-37 in K. F. Schuessler (ed.), Sociological Methodology: 1978. San Francisco: Jossey-Bass.
Blalock, H. M., Jr. 1972. Social Statistics. New York: McGraw-Hill.
Costner, H. L. 1965. Criteria for measures of association. American Sociological Review 30:341-53.
———. 1968. Reply to Somers. American Sociological Review 33:292.
Crittenden, K. S. and A. C. Montgomery. 1980. A system of paired asymmetric measures of association for use with ordinal dependent variables. Social Forces 58:1178-94.
Davis, J. A. 1971. Elementary Survey Analysis. Englewood Cliffs, N.J.: Prentice-Hall.
Freeman, L. C. 1965. Elementary Applied Statistics: For Students in Behavioral Science. New York: Wiley.
Gehring, R. E. 1978. Basic Behavioral Statistics. Boston: Houghton Mifflin.
Korin, B. P. 1975. Statistical Concepts for the Social Sciences. Cambridge: Winthrop.
Leik, R. K. and W. R. Gove. 1971. Integrated approach to measuring association. Pp. 279-301 in H. L. Costner (ed.), Sociological Methodology: 1971. San Francisco: Jossey-Bass.
Leonard, W. M., II. 1976. Basic Social Statistics. St. Paul, Minn.: West.
Levin, J. 1977. Elementary Statistics in Social Research. New York: Harper and Row.
Loether, H. J. and D. G. McTavish. 1974. Descriptive Statistics for Sociologists. Boston: Allyn and Bacon.
Mueller, J. H., K. F. Schuessler, and H. L. Costner. 1970. Statistical Reasoning in Sociology. 2d ed. Boston: Houghton Mifflin.
———. 1977. Statistical Reasoning in Sociology. 3d ed. Boston: Houghton Mifflin.
Ott, L., W. Mendenhall and R. F. Larson. 1978. Statistics: A Tool for the Social Sciences. North Scituate, Mass.: Duxbury Press.
Reynolds, H. T. 1977. The Analysis of Cross-Classifications. New York: Free Press.
Senter, R. J. 1969. Analysis of Data: Introductory Statistics for the Behavioral Sciences. Glenview, Ill.: Scott, Foresman and Company.
Somers, R. H. 1968. On the measurement of association. American Sociological Review 33:291-92.
Wilson, T. P. 1969. A proportional-reduction-in-error interpretation for Kendall's tau-b. Social Forces 47:340-42.
