Learning Objectives
Introduction
When two or more characteristics are measured from each experimental unit, statistical
inferences frequently involve regression and correlation analysis. For example, when
both grain yield and plant height are collected from a rice experiment, one may wish to
determine how and to what magnitude the two characteristics are related to one another.
The relationship of these characters may either be expressed in a functional form or in
terms of the degree of their association with one another. The techniques used in
determining such relationships are known as regression and correlation.
Correlation is concerned with the study of linear dependency between variables. The correlation coefficient is a measure of the intensity of the linear relationship between two variables; it is applicable when there is no clear-cut cause-and-effect relationship between the variables.
Correlation reflects the extent to which deviations from the mean in one variable are
accompanied by proportional deviations in the other in either direction according to the
sign of the correlation. The population correlation coefficient, ρ, measures the linear
relationship between all possible values of two variables X and Y. It is defined as
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\operatorname{SD}(X)\,\operatorname{SD}(Y)}$$
The sample correlation coefficient is computed from the observed data as follows:
$$r = \frac{\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)/n}{\sqrt{\left[\sum X_i^2 - \left(\sum X_i\right)^2/n\right]\left[\sum Y_i^2 - \left(\sum Y_i\right)^2/n\right]}}$$
r is an estimate of ρ and it can be used to test the null hypothesis Ho: ρ = ρo. The test
statistic used is
$$t_c = \frac{r - \rho_0}{\sqrt{(1 - r^2)/(n - 2)}}$$
tc has a t-distribution with (n − 2) degrees of freedom when the population correlation is ρ0. A test for linear independence is made by setting ρ0 = 0, i.e., testing Ho: ρ = 0 using this test statistic.
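As a sketch (Python, with made-up data), the computational formula for r and the test statistic for Ho: ρ = 0 can be coded directly:

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r, via the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return num / den

def t_for_rho_zero(r, n):
    """Test statistic for Ho: rho = 0; t-distributed with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r**2))

# Illustrative (made-up) data:
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 6]
r = pearson_r(x, y)
tc = t_for_rho_zero(r, len(x))
```

The computed tc is then compared with the critical value of the t distribution with n − 2 degrees of freedom.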
The correlation coefficient (r) measures the strength of the linear relationship existing
between observations on two variables. If two variables are independent, they have zero
correlation. However, the converse is not necessarily true. A value of r equal to zero
does not necessarily mean that there is no relationship between the variables. It may
mean that the relationship is not linear. The value of r ranges from -1 to +1 with the
extreme values indicating a close linear association and the mid-value, zero, indicating no
linear association between the variables. A positive or negative value of r indicates the direction of change in one variable relative to the change in the other: r is negative when a positive change in one variable is associated with a negative change in the other, and positive when the two variables change in the same direction.
Regression Analysis
Note that the linearity refers to the parameters and not the variables. For example, Y = α
+ βX1 + γX1X2 is nonlinear in the variables but is a linear regression model since it is
linear in the parameters α, β, γ.
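A brief sketch of this point with hypothetical data: the cross-product term X1X2 is simply another column of the design matrix, so ordinary least squares still applies even though the model is nonlinear in the variables.

```python
import numpy as np

# Hypothetical, noise-free data generated from Y = 1 + 2*X1 + 0.5*X1*X2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 0.5 * x1 * x2

# Treat X1*X2 as just another regressor column: the model stays linear
# in the parameters (alpha, beta, gamma), so least squares recovers them.
X = np.column_stack([np.ones_like(x1), x1, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```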
Linear Regression
The linear regression equation for response Y and p independent variables, Xj,
j =1, 2 ... p, takes the form:
$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + e$$
where Xj, j = 1, 2, . . ., p are independent variables, β0 and the βj’s are the regression coefficients, and e is a random error. The linear model assumes that the errors are independent, normally distributed with mean zero, and of constant variance.
The parameter estimates, β̂j, are derived from a set of n observations, Yi, Xi1, Xi2, ..., Xip, i = 1, 2, ..., n, by the method of least squares. This involves choosing values β̂j that minimize the residual sum of squares

$$\sum \hat{e}_i^2 = \sum \left(Y_i - \hat{Y}_i\right)^2$$

where $\hat{Y}_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j X_{ij}$.
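A minimal sketch of the least-squares fit, assuming NumPy and noise-free illustrative data:

```python
import numpy as np

# Illustrative data lying exactly on the line Y = 2 + 3X.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
residual_ss = np.sum((y - y_hat) ** 2)      # sum of e_i^2, minimized by beta_hat
```

Because the data are exactly linear here, the fitted coefficients reproduce (2, 3) and the residual sum of squares is essentially zero.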
With the assumption of normality among the error terms, the null hypothesis, Ho: βj = βjo
can be tested using the statistic
$$t = \frac{\hat{\beta}_j - \beta_{j0}}{\operatorname{s.e.}(\hat{\beta}_j)}$$
where s.e.(β̂j) is computed from the residual SS, Σ ê²i, and the Xi values; the s.e.’s are always provided by computer programs. The test statistic t follows Student’s t distribution with (n − p − 1) degrees of freedom under the null hypothesis. Setting βj0 = 0 tests the hypothesis that Y and Xj are not related by the stated regression model.
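The computation of s.e.(β̂j) and of t for Ho: βj = 0 can be sketched as follows (illustrative data; the estimated-covariance-matrix route shown here is one standard way to obtain the s.e.’s):

```python
import numpy as np

# Illustrative (made-up) data: simple regression, p = 1 predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n, p = len(x), 1

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sse = resid @ resid
sigma2_hat = sse / (n - p - 1)                   # residual mean square
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Cov(beta_hat)
se = np.sqrt(np.diag(cov_beta))
t = beta_hat / se                                # test statistics for Ho: beta_j = 0
```

Each t value is then compared against Student’s t with n − p − 1 degrees of freedom.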
The F-test of the ANOVA procedure is used to test the significance of the regression
equation, that is, the null hypothesis Ho: β1 = β2 = . . . = βp = 0 (Note: β0 is excluded). In
the ANOVA for linear regression, the total SS of Y, SSY, is partitioned into two
components, namely, the sum of squares due to regression (SSReg) and the sum of
squares due to deviations from regression (SSE).
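The ANOVA quantities can be sketched for a simple regression (p = 1) with made-up data; note that for p = 1 the F statistic equals the square of the t statistic for the slope:

```python
# Illustrative simple-regression ANOVA (pure Python, made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 1

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

ssy = sum((yi - ybar) ** 2 for yi in y)                          # total SS of Y
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))    # deviations SS
ssreg = ssy - sse                                                # regression SS
F = (ssreg / p) / (sse / (n - p - 1))      # F with (p, n - p - 1) df
```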
A 100(1 − α)% confidence interval for βj is

$$\hat{\beta}_j \pm t(n - p - 1,\ \alpha/2)\,[\operatorname{s.e.}(\hat{\beta}_j)]$$

where t(n − p − 1, α/2) is the upper-tail 100α/2% critical value of Student’s t distribution with n − p − 1 df.
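A small worked sketch of the interval, with hypothetical values for the slope estimate and its s.e., and the critical value t(3, 0.025) ≈ 3.182 taken from a t-table:

```python
# Hypothetical example values: slope, its s.e., and 3 residual df.
b1 = 0.6
se_b1 = 0.28284
t_crit = 3.182           # t(3, 0.025) from a t-table

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
# Here the interval contains 0, so Ho: beta_1 = 0 would not be
# rejected at the 5% level for these values.
```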
The coefficient of determination is

$$R^2 = \frac{SSReg}{SSY} = 1 - \frac{SSE}{SSY}$$

R² is the square of the ordinary correlation between Yi and Ŷi and so ranges from 0 to 1. It is also the
proportion of the variability in the response variable which is accounted for by the
regression analysis. A value of R2 close to unity indicates that the model fits the data
well. With a good fit, the observed and predicted values will be close to each other so
SSE will be small. On the other hand, if the model does not fit well, SSE will be large
and R2 will be near zero. The value of R2 is therefore used as a summary measure to
judge the fit of the linear model to a given set of data.
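A short sketch of the R² computation from SSY and SSE, using hypothetical observed and fitted values:

```python
# Hypothetical observed values and fitted values from a regression.
y     = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

n = len(y)
ybar = sum(y) / n
ssy = sum((yi - ybar) ** 2 for yi in y)                  # total SS of Y
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual SS
r2 = 1.0 - sse / ssy                                     # proportion explained
```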
Figure 2a. Curvilinear data fitted by a straight line with high R².
Figure 2b. Segregated data fitted by a straight line with high R².
For detecting these kinds of departures from the regression model, there is no substitute for plotting the data.
The second problem with R2 is that as additional variables are added to a regression
equation, R2 tends to increase, regardless of the true importance of these variables in
determining the values of the dependent variable. A related statistic, the adjusted R2 (Ra2)
is used to compensate for this effect. It is defined as
$$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
The purpose of Ra2 is to assist goodness of fit comparisons between regression equations
which differ with respect to either the number of explanatory variables or the number of
observations.
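A sketch of Ra², written here with residual degrees of freedom n − p − 1 to match the df used for the t-tests above; with the hypothetical values below, adding a weak explanatory variable raises R² slightly yet lowers Ra²:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2, penalizing the number of explanatory variables p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: a second variable adds little to R^2.
base = adjusted_r2(0.60, n=20, p=1)   # one explanatory variable
more = adjusted_r2(0.61, n=20, p=2)   # two variables, tiny gain in R^2
```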
A simple and effective method for detecting model adequacy in regression analysis is by
examining the standardized residuals,
$$\hat{e}_i^{*} = \frac{Y_i - \hat{Y}_i}{\operatorname{s.e.}(\hat{e}_i)}$$
If the fitted model is correct, the residuals should conform with the assumptions made on
the error terms, that is, normality, homoscedasticity, and independence.
In general, when the model is correct, the standardized residuals tend to fall between -2
and +2 and are randomly distributed about zero. The process of checking for model
violations by analyzing residuals is useful for uncovering hidden structures in the data.
If any standardized residual falls outside the range −2.5 to +2.5, it should be checked for possible data errors or other causes of deviation from the model.
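A sketch of the standardized-residual check, assuming NumPy; here s.e.(êi) is taken as √(MSE(1 − hii)) from the hat matrix, one common choice:

```python
import numpy as np

# Illustrative (made-up) simple-regression data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n, p = len(x), 1

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
y_hat = H @ y
resid = y - y_hat
mse = resid @ resid / (n - p - 1)
std_resid = resid / np.sqrt(mse * (1.0 - np.diag(H)))

flagged = np.abs(std_resid) > 2.5           # candidates for data checking
```

For a well-fitting model the standardized residuals should mostly lie between −2 and +2, as they do here.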
Residual plots provide a visual indication of model adequacy and suggest possible
modifications if it is inadequate. Some of the more commonly used plots are those in
which the standardized residuals are plotted against the fitted value, Ŷ ; the independent
variable Xj; and the time order, t, in which the observations occur (if relevant).
Essentially, a good regression model should result in residuals whose graphs against Ŷ, each Xj, or t do not exhibit any distinct pattern of variation (Figure 3).
Any distinct patterns depicted in the residual plots (Figures 4a-c) indicate inadequacy of
the fitted model or violations of the assumptions in the regression procedure.
1. Plot against the fitted values, Ŷi. Plot forms as in Figure 4 indicate:
a) Variance not constant; need for weighted least squares or a transformation on the observations Yi before making a regression analysis.
c) Model inadequate; need for extra terms in the model (e.g., square or cross-product terms), or need for a transformation on the observations Yi before analysis.
2. If the data were collected in time order, a plot against time can be used to detect auto-
correlation or time trends. The plot forms in Figure 4 indicate:
a) The variance is not constant; implying the need for weighted least squares
analysis.
b) A linear term in time should have been included in the model.
c) Linear and quadratic terms in time should have been included in the model.
3. Plot against the predictor variables Xji . Plot forms as in Figure 4 indicate: