The calculation of the correlation coefficient for two variables, say X and
Y, is simple to understand. Let zX and zY be the standardized versions of X
and Y, respectively. That is, zX and zY are both re-expressed to have means
equal to zero, and standard deviations (std) equal to one. The re-expressions
used to obtain the standardized scores are in Equations (2.1) and (2.2):
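Written out in the usual notation, the two re-expressions can be stated as follows. The bar notation for the sample means and the subscript i for the i-th observation are assumed here; std denotes the standard deviation as defined above.

```latex
\begin{align}
z_{X_i} &= \frac{X_i - \bar{X}}{\mathrm{std}(X)} \tag{2.1} \\
z_{Y_i} &= \frac{Y_i - \bar{Y}}{\mathrm{std}(Y)} \tag{2.2}
\end{align}
```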
For a simple illustration of the calculation, consider the sample of five
observations in Table 2.1. Columns zX and zY contain the standardized scores
of X and Y, respectively. The last column is the product of the paired
standardized scores. The sum of these products is 1.83. Their mean (using
the adjusted divisor n–1, not n) is 0.46. Thus, rX,Y = 0.46.
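The calculation can be sketched in a few lines of Python. This is a minimal sketch, not the author's code: the function names are illustrative, and the five observations are hypothetical values chosen for easy arithmetic, not the values in Table 2.1. The standard deviation is assumed to use the same n–1 divisor as the mean of the products.

```python
from statistics import mean, stdev


def standardize(values):
    """Re-express values to have mean 0 and standard deviation 1 (n-1 divisor)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Sum the products of paired standardized scores and divide by n-1."""
    zx, zy = standardize(x), standardize(y)
    products = [a * b for a, b in zip(zx, zy)]
    return sum(products) / (len(x) - 1)


# Hypothetical sample of five observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = correlation(x, y)
print(round(r, 2))  # 0.8
```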
2.2 Scatterplots
The linearity assumption of the correlation coefficient can easily be checked
with a scatterplot, which is a plot of the paired points (Xi,Yi) on a graph
with a horizontal X-axis perpendicular to a vertical Y-axis; i indexes the
observations from 1 to n, where n is the sample size. If the scatter of
points appears to overlay a straight line, then the assumption is satisfied,
and rX,Y provides a meaningful measure of the linear relationship between X
and Y. If the scatter does not appear to overlay a straight line, then the
assumption is not satisfied and the rX,Y value is at best questionable.
Thus, when using the correlation coefficient to measure the strength of the
linear relationship, it is advisable to construct the scatterplot in order
to test the linearity assumption. Unfortunately, many data analysts do not
construct the scatterplot, which renders any analysis based on the
correlation coefficient potentially invalid. The following illustration is
presented to reinforce the importance of evaluating scatterplots.
Consider the four data sets with eleven observations in Table 2.2. [1] There
are four sets of (X,Y) points, with the same correlation coefficient value of
0.82. However, the four X-Y relationships are distinct from one another, each
reflecting a different underlying structure, as depicted in the scatterplots
in Figure 2.1. The scatterplot for X1-Y1 (upper left) indicates a linear
relationship; thus, the rX1,Y1 value of 0.82 correctly suggests a strong
positive linear relationship between X1 and Y1. The scatterplot for X2-Y2
(upper right) reveals a curved relationship; rX2,Y2 = 0.82. The scatterplot
for X3-Y3 (lower left) reveals a straight line except for the “outside”
observation #3, data point (13, 12.74); rX3,Y3 = 0.82. The scatterplot for
X4-Y4 (lower right) has a “shape of its own,” which is clearly not linear;
rX4,Y4 = 0.82. Accordingly, the correlation coefficient value of 0.82 is not
a meaningful measure for the latter three X-Y relationships.
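The equal correlation values can be reproduced numerically. The sketch below uses the data values as published by Anscombe [1], which are assumed to match Table 2.2; the dictionary layout and function name are illustrative choices.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation via standardized scores with the n-1 divisor."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)


# Anscombe's four data sets [1]; the first three share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
anscombe = {
    "X1-Y1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "X2-Y2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "X3-Y3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "X4-Y4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: round(correlation(x, y), 2) for name, (x, y) in anscombe.items()}
for name, r in results.items():
    print(name, r)  # each pair prints 0.82
```

The identical output for all four pairs is the point: the number alone cannot distinguish the linear case from the other three, so the scatterplots have to be inspected.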
2.3.1 Example #1
Consider the quantitative target variable TOLL CALLS (TC) in dollars, and
the predictor variable HOUSEHOLD INCOME (HI) in dollars from a sample
of size 102,000. The calculated rTC, HI is 0.09. The TC-HI scatterplot in
Figure 2.2 shows a cloud of points obscuring the underlying relationship
within the data (assuming a relationship exists). This scatterplot gives no
indication of whether the calculated rTC, HI can be used reliably.
FIGURE 2.1
Four Different Datasets with the Same Correlation Coefficient
© 2003 by CRC Press LLC
FIGURE 2.2
Scatterplot for TOLL CALLS and HOUSEHOLD INCOME
2.3.2 Example #2
Consider the qualitative target variable RESPONSE (RS), which measures
the response to a mailing, and the predictor variable HOUSEHOLD
INCOME (HI) from a sample of size 102,000. RS assumes “yes” and “no”
values, which are coded as 1 and 0, respectively. The calculated rRS, HI is 0.01.
The RS-HI scatterplot in Figure 2.3 shows “train tracks” obscuring the
underlying relationship within the data (assuming a relationship exists).
The tracks appear because the target variable takes on only two values, 0
and 1. As in the first example, this scatterplot gives no indication of
whether the calculated rRS, HI can be used reliably.
1 There are other designs for slicing continuous as well as categorical variables.
FIGURE 2.3
Scatterplot for RESPONSE and HOUSEHOLD INCOME
5. Plot the smooth points (smooth Y, smooth X), producing a smoothed
scatterplot.
6. Connect the smooth points, starting from the left-most smooth point.
The resultant smooth trace line reveals the underlying relationship
between X and Y.
Returning to Examples #1 and #2, the HI data are grouped into ten equal-sized
slices, each consisting of 10,200 observations. Within each slice (numbered
from 0 to 9), the averages (means) of HI and TC, and of HI and RS, are the
smooth points; they are presented in Tables 2.3 and 2.4, respectively. The
smooth points are plotted and connected.
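The slicing and averaging can be sketched as follows. This is a minimal sketch under stated assumptions: the slices are formed by sorting the pairs on X and cutting them into equal-sized groups, slice 0 holds the largest X values (as in Table 2.3), and any remainder goes into the last slice. The function name and the demonstration data are hypothetical; plotting the resulting points is left to whatever graphics tool is at hand.

```python
from statistics import mean


def smooth_points(x, y, n_slices=10):
    """Return one (mean X, mean Y) smooth point per slice, slice 0 first."""
    # Sort the pairs by X, largest first, so slice 0 holds the highest X values.
    pairs = sorted(zip(x, y), key=lambda p: p[0], reverse=True)
    size = len(pairs) // n_slices
    points = []
    for k in range(n_slices):
        start = k * size
        # The last slice absorbs any observations left over by integer division.
        end = (k + 1) * size if k < n_slices - 1 else len(pairs)
        chunk = pairs[start:end]
        points.append((mean([p[0] for p in chunk]), mean([p[1] for p in chunk])))
    return points


# Hypothetical data: 100 observations with a rising trend plus a small wobble.
x = list(range(1, 101))
y = [2 * v + (v % 7) for v in x]

points = smooth_points(x, y)
for slice_number, (sx, sy) in enumerate(points):
    print(slice_number, sx, sy)
```

To draw the trace line, sort the smooth points by their X value in ascending order and connect them from the left-most point.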
The TC smooth trace line in Figure 2.4 clearly indicates a linear relation-
ship. Thus, the rTC, HI value of 0.09 is a reliable measure of a weak positive
linear relationship between TC and HI. Moreover, the variable HI itself
(without any re-expression) can be tested for inclusion in the TC Model.
Note that the small r value does not preclude testing HI for model inclusion.
This point is discussed in Chapter 4, Section 4.5.
The RS smooth trace line in Figure 2.5 is clear: the relationship between
RS and HI is not linear. Thus, the rRS, HI value of 0.01 is invalid. This raises
the following question: does the RS smooth trace line indicate a general
association between RS and HI, implying a nonlinear relationship, or does
the RS smooth trace line indicate a random scatter, implying no relationship
between RS and HI? The answer can be readily found with the graphical
nonparametric General Association Test. [5]
TABLE 2.3
Smooth Points: Toll Calls and Household Income

Slice   Average Toll Calls   Average HH Income
0       $31.98               $26,157
1       $27.95               $18,697
2       $26.94               $16,271
3       $25.47               $14,712
4       $25.04               $13,493
5       $25.30               $12,474
6       $24.43               $11,644
7       $24.84               $10,803
8       $23.79               $9,796
9       $22.86               $6,748
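As a rough numeric companion to the visual check of the TC smooth trace line, the ten smooth points of Table 2.3 can be ordered from the left-most point and tested for how closely they follow a straight line. The sketch below is an illustrative check, not part of the method described in the text. The correlation it computes is among the ten slice averages only; because averaging removes the scatter within each slice, it is not an estimate of rTC, HI.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation via standardized scores with the n-1 divisor."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)


# Smooth points from Table 2.3 as (average HH income, average toll calls),
# listed in slice order 0 through 9.
smooth = [
    (26157, 31.98), (18697, 27.95), (16271, 26.94), (14712, 25.47),
    (13493, 25.04), (12474, 25.30), (11644, 24.43), (10803, 24.84),
    (9796, 23.79), (6748, 22.86),
]

# Order the points from the left-most (lowest income) for the trace line.
trace = sorted(smooth)
income = [p[0] for p in trace]
toll_calls = [p[1] for p in trace]

r_smooth = correlation(income, toll_calls)
print(round(r_smooth, 2))  # close to 1 when the trace is nearly straight
```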
FIGURE 2.4
Smoothed Scatterplot for TOLL CALLS and HOUSEHOLD INCOME
FIGURE 2.5
Smoothed Scatterplot for RESPONSE and HOUSEHOLD INCOME
2.5 General Association Test
Here is the General Association Test:
Reject the null hypothesis if TS is greater than or equal to the cutoff score
in Table 2.5. It is then concluded that there is an association between the
two variables. The smooth trace line indicates the “shape” or structure of
the association.
Fail to reject the null hypothesis if TS is less than the cutoff score in
Table 2.5. It is then concluded that there is no association between the two
variables.
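The decision rule needs a value for TS. The sketch below shows one way such a statistic could be computed, under the assumption that TS = N - 1 - m, where N is the number of smooth points and m is the number of connecting line segments that cross a horizontal medial line dividing the points into two equal-sized groups. Both the formula and the function name are assumptions made for illustration, the smooth values are hypothetical, and the cutoff scores of Table 2.5 are not reproduced.

```python
from statistics import median


def general_association_ts(smooth_y):
    """Return TS = N - 1 - m for smooth Y values ordered left to right.

    ASSUMPTION: m is the number of line segments between consecutive smooth
    points that cross the medial (median) line. Points lying exactly on the
    medial line are treated as below it, a simplification of this sketch.
    """
    n = len(smooth_y)
    medial = median(smooth_y)
    above = [value > medial for value in smooth_y]
    # A segment crosses the medial line when its endpoints are on opposite sides.
    m = sum(1 for left, right in zip(above, above[1:]) if left != right)
    return n - 1 - m


# Hypothetical smooth points: five above the medial line, then five below.
one_crossing = [0.90, 0.80, 0.85, 0.70, 0.75, 0.30, 0.20, 0.25, 0.10, 0.15]
# Hypothetical smooth points that zigzag across the medial line every step.
zigzag = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

print(general_association_ts(one_crossing))  # 8
print(general_association_ts(zigzag))        # 0
```

Under this assumption, a trace that stays on one side of the medial line for long stretches yields a large TS, while a trace that zigzags across it yields a small one.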
Returning to the smoothed scatterplot of RS and HI, I determine the
following:
Thus, an association between RS and HI can be concluded at the 99% (and, of
course, the 95%) confidence level. The RS smooth trace line in Figure 2.5
suggests that the observed relationship between RS and HI appears to be a
polynomial of the third degree. Accordingly, the linear (HI), quadratic
(HI^2), and cubic (HI^3) HOUSEHOLD INCOME terms should be tested in the
RESPONSE model.
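Building the three candidate terms is mechanical. A minimal sketch, with a hypothetical function name and hypothetical income values; expressing income in thousands of dollars is an assumption made here only to keep the cubed values readable.

```python
def polynomial_terms(value, degree=3):
    """Return [value, value**2, ..., value**degree] for one observation."""
    return [value ** power for power in range(1, degree + 1)]


# Hypothetical household incomes in thousands of dollars.
incomes = [10, 20, 30]

# Each row holds the HI, HI^2 and HI^3 terms for one household.
rows = [polynomial_terms(hi) for hi in incomes]
for row in rows:
    print(row)
```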
2.6 Summary
It should be clear that an analysis based on the uncritical use of the
correlation coefficient is problematic. The strength of a relationship
between two variables cannot simply be taken as the calculated r value.
Testing the linearity assumption, which is made easy by the simple
scatterplot or smoothed scatterplot, is necessary for a thorough and valid
analysis. If the observed relationship is linear, then the r value can be
taken at face value for the strength of the relationship at hand. If the
observed relationship is not linear, then the r value must be disregarded or
used with extreme caution.
When a smoothed scatterplot for big data does not reveal a linear
relationship, its scatter can be tested for randomness or for a noticeable
general association by the proposed nonparametric method. If the former is
true, then it is concluded that there is no association between the
variables. If the latter is true, then the predictor variable is re-expressed
to reflect the observed relationship, and consequently tested for inclusion
into the model.
FIGURE 2.6
General Association Test for Smoothed RS-HI Scatterplot
References
1. Anscombe, F.J., Graphs in statistical analysis, American Statistician, 27, 17–22,
1973.
2. Tukey, J.W., Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.
3. Härdle, W., Smoothing Techniques, Springer-Verlag, New York, 1990.
4. Simonoff, J.S., Smoothing Methods in Statistics, Springer-Verlag, New York, 1996.
5. Quenouille, M.H., Rapid Statistical Calculations, Hafner Publishing, New York,
1959.