
REGRESSION AND CORRELATION ANALYSIS

Learning Objectives

At the end of this lesson, the participants should be able to:


• discuss the basic concepts of regression and correlation techniques in characterizing associations between variables;
• enumerate the assumptions in regression analysis and the consequences of a violation of any of the assumptions;
• derive a regression model for a given set of data; and
• check for model adequacy through residual analysis, goodness-of-fit tests, and the coefficient of determination.

Introduction

When two or more characteristics are measured from each experimental unit, statistical
inferences frequently involve regression and correlation analysis. For example, when
both grain yield and plant height are collected from a rice experiment, one may wish to
determine how and to what magnitude the two characteristics are related to one another.
The relationship of these characters may either be expressed in a functional form or in
terms of the degree of their association with one another. The techniques used in
determining such relationships are known as regression and correlation.

In crop research, associations between responses, treatments, and environmental factors are frequently evaluated. Associations of particular interest are:
1. Associations between response variables (e.g., between weed density and tiller number or panicle weight).
2. Associations between response and treatment (e.g., between grain yield and nitrogen rate).
3. Associations between response and environment (e.g., between grain yield and rainfall).



Correlation Analysis

Correlation is concerned with the study of linear dependency between variables. The correlation coefficient is a measure of the intensity of the linear relationship between two variables; it is applicable when there is no clear-cut cause-and-effect relationship between the variables.

Correlation reflects the extent to which deviations from the mean in one variable are
accompanied by proportional deviations in the other in either direction according to the
sign of the correlation. The population correlation coefficient, ρ, measures the linear
relationship between all possible values of two variables X and Y. It is defined as
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)}$$

The sample correlation coefficient is computed from the observed data as follows:

$$r = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sqrt{\left[\sum X_i^2 - (\sum X_i)^2/n\right]\left[\sum Y_i^2 - (\sum Y_i)^2/n\right]}}$$

r is an estimate of ρ and it can be used to test the null hypothesis Ho: ρ = ρo. The test
statistic used is
$$t_c = \frac{r - \rho_0}{\sqrt{(1 - r^2)/(n - 2)}}$$

tc has a t-distribution with (n − 2) degrees of freedom when the population correlation is ρ0. A test for linear independence is made by setting ρ0 = 0, i.e., testing Ho: ρ = 0 using this test statistic.

The correlation coefficient (r) measures the strength of the linear relationship existing
between observations on two variables. If two variables are independent, they have zero
correlation. However, the converse is not necessarily true. A value of r equal to zero
does not necessarily mean that there is no relationship between the variables. It may
mean that the relationship is not linear. The value of r ranges from -1 to +1 with the
extreme values indicating a close linear association and the mid-value, zero, indicating no
linear association between the variables. A positive or negative value of r indicates the
direction of change in one variable relative to the change in the other. That is, the value
of r is negative when a positive change in one variable is associated with a negative
change in another, and positive when the two variables change in the same direction.
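
As a rough illustration, the computation of r and the test of Ho: ρ = 0 might be sketched in Python as follows (the data values, and the use of numpy and scipy, are illustrative assumptions, not part of the original text):

```python
import numpy as np
from scipy import stats

# Illustrative paired measurements, e.g., plant height (cm) and grain yield (t/ha)
x = np.array([85.0, 92.0, 88.0, 100.0, 95.0, 90.0, 102.0, 98.0])
y = np.array([4.1, 4.6, 4.4, 5.2, 4.9, 4.5, 5.4, 5.0])
n = len(x)

# Sample correlation coefficient r from the computational formula above
num = (x * y).sum() - x.sum() * y.sum() / n
den = np.sqrt(((x**2).sum() - x.sum()**2 / n) * ((y**2).sum() - y.sum()**2 / n))
r = num / den

# Test Ho: rho = 0 with tc = r / sqrt((1 - r^2)/(n - 2)) on n - 2 df
tc = r / np.sqrt((1 - r**2) / (n - 2))
p_value = 2 * stats.t.sf(abs(tc), df=n - 2)
print(f"r = {r:.3f}, t = {tc:.3f}, p = {p_value:.4f}")
```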



Figure 1 illustrates the various degrees of association between two variables as reflected
in the r values.

Figure 1. Graphical representation of various values of the simple correlation coefficient, r.

Regression Analysis

Regression analysis is a statistical technique for investigating and modeling the relationship of a dependent (response) variable to a set of independent or explanatory
variables. The relationship is expressed in the form of an equation connecting the
response variable Y and one or more independent variables X1, X2, . . ., Xp. The
regression equation involves some parameters which must be estimated from the data.

A regression equation containing only one independent variable is called a simple regression equation. An equation containing more than one variable is referred to as a multiple regression equation. The regression equation or model defines the structural form of the relationship by linking the dependent variable with the independent variables through parameters, for example, Y = α + β1X1 + β2X1^2, where Y and X1 are variables and α, β1, β2 are parameters. If the relationship between the dependent and the independent variables is a linear function of the parameters, then it is termed a linear regression, such as the equation above or Y = β0 + β1X1 + β2X2. On the other hand, a non-linear regression exists if the relationship between the dependent and the independent variables is non-linear in the parameters, as in Y = αβ^X, where the parameters α and β are multiplied.

Note that the linearity refers to the parameters and not the variables. For example, Y = α
+ βX1 + γX1X2 is nonlinear in the variables but is a linear regression model since it is
linear in the parameters α, β, γ.



Most often, regression analysis is done to:

* obtain estimates of the parameters,
* estimate the variance of the error term,
* estimate the standard error of the parameter estimates,
* test hypotheses about the parameters,
* calculate predicted values using the estimated equation,
* evaluate the fit or lack of fit of the model.
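
As an illustration of these tasks, a minimal sketch using the statsmodels library in Python is given below (the data, variable names, and choice of library are assumptions for illustration only):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: grain yield (t/ha) observed at several nitrogen rates (kg/ha)
nitrogen = np.array([0.0, 30.0, 60.0, 90.0, 120.0, 150.0])
grain_yield = np.array([3.1, 3.9, 4.6, 5.0, 5.3, 5.4])

X = sm.add_constant(nitrogen)        # design matrix with an intercept column
fit = sm.OLS(grain_yield, X).fit()   # least squares estimation

print(fit.params)       # estimates of the parameters
print(fit.mse_resid)    # estimate of the variance of the error term
print(fit.bse)          # standard errors of the parameter estimates
print(fit.tvalues)      # t statistics for testing hypotheses about the parameters
print(fit.predict(X))   # predicted values from the estimated equation
print(fit.rsquared)     # R^2, one summary of the fit of the model
```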

Linear Regression

The linear regression equation for response Y and p independent variables, Xj,
j =1, 2 ... p, takes the form:
$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + e$$

where Xj, j = 1, 2, . . ., p are independent variables, β0 and the βj’s are the regression
coefficients and e is a random error. The linear model assumes that:

* the expected values of the error terms are zero,
* the variances of the error terms are unrelated to the magnitude of Y or Xj (homoscedasticity),
* the error terms are uncorrelated, and
* the error terms are normally distributed.

The parameter estimates, β̂j, are derived from a set of n observations, Yi, Xi1, Xi2, ..., Xip, i = 1, 2, ..., n, by the method of least squares. This involves choosing values β̂j for βj which minimize the sum of squares of the residuals:

$$\sum \hat{e}_i^{\,2} = \sum (Y_i - \hat{Y}_i)^2,$$

where Ŷi = β̂0 + Σj β̂j Xij.
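
A minimal numerical sketch of the least squares computation (the data and the use of numpy are illustrative assumptions):

```python
import numpy as np

# Illustrative observations on a response Y and p = 2 independent variables
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([4.9, 5.8, 8.1, 8.9, 11.2, 11.8])

# Design matrix with a leading column of ones corresponding to beta_0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares: the beta_hat that minimizes the residual sum of squares
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat                      # fitted values Y_hat_i
residual_ss = ((Y - Y_hat) ** 2).sum()    # sum of squared residuals
print(beta_hat, residual_ss)
```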

With the assumption of normality among the error terms, the null hypothesis Ho: βj = βj0 can be tested using the statistic

$$t = \frac{\hat{\beta}_j - \beta_j^{0}}{\mathrm{s.e.}(\hat{\beta}_j)}$$

where s.e.(β̂j) is computed from the residual SS, Σ êi², and the Xi values, but the



computation is complicated except for simple linear regression, where

$$\mathrm{s.e.}(\hat{\beta}_1) = \sqrt{\frac{\sum \hat{e}_i^{\,2} / (n - 2)}{\sum (X_{i1} - \bar{X}_1)^2}}.$$

The s.e.'s are always provided by computer programs. The test statistic t follows Student's t distribution with (n − p − 1) degrees of freedom under the null hypothesis. Setting βj0 = 0 tests the hypothesis that Y and Xj are not related by the stated regression model.
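
For the simple linear regression case, the slope test might be sketched as follows (the data are illustrative; the standard error uses the expression given above):

```python
import numpy as np
from scipy import stats

# Illustrative simple linear regression data (one independent variable)
x = np.array([0.0, 30.0, 60.0, 90.0, 120.0, 150.0])
y = np.array([3.1, 3.9, 4.6, 5.0, 5.3, 5.4])
n = len(x)

# Least squares estimates of beta_0 and beta_1
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# s.e.(beta_1_hat) from the residual SS and the X values
se_b1 = np.sqrt((resid ** 2).sum() / (n - 2) / ((x - x.mean()) ** 2).sum())

# t statistic for Ho: beta_1 = 0 on n - p - 1 = n - 2 df
t = (b1 - 0.0) / se_b1
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"b1 = {b1:.4f}, s.e. = {se_b1:.4f}, t = {t:.2f}, p = {p_value:.4f}")
```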

The F-test of the ANOVA procedure is used to test the significance of the regression
equation, that is, the null hypothesis Ho: β1 = β2 = . . . = βp = 0 (Note: β0 is excluded). In
the ANOVA for linear regression, the total SS of Y, SSY, is partitioned into two
components, namely, the sum of squares due to regression (SSReg) and the sum of
squares due to deviations from regression (SSE).

Table 1. ANOVA for linear regression for testing Ho: βj = 0 for all j = 1, 2, ..., p vs. Ha: βj ≠ 0 for some j.

SV          DF           SS       MS     F
Regression  p            SSReg    MSR    MSR/MSE
Error       n − p − 1    SSE      MSE
Total       n − 1        SSY
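
A sketch of how the quantities in Table 1 might be computed (the data and the use of numpy/scipy are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Illustrative data: response Y and p = 2 independent variables
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
Y = np.array([4.9, 5.8, 8.1, 8.9, 11.2, 11.8, 14.1, 14.7])
n, p = len(Y), 2

X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta

ssy = ((Y - Y.mean()) ** 2).sum()      # total SS (SSY)
sse = ((Y - Y_hat) ** 2).sum()         # SS due to deviations from regression (SSE)
ss_reg = ssy - sse                     # SS due to regression (SSReg)

msr, mse = ss_reg / p, sse / (n - p - 1)
F = msr / mse                          # F = MSR / MSE
p_value = stats.f.sf(F, p, n - p - 1)
print(f"F = {F:.2f}, p = {p_value:.4f}")
```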

Confidence limits for βj with coefficient (1 − α) are given by

$$\hat{\beta}_j \pm t(n - p - 1,\ \alpha/2)\,[\mathrm{s.e.}(\hat{\beta}_j)]$$

where t(n − p − 1, α/2) is the upper-tail 100(α/2)% critical value of Student's t distribution with n − p − 1 df.
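
As a brief sketch, such intervals can be obtained directly from a fitted statsmodels model (the data and choice of library are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative fit; conf_int() applies beta_hat_j +/- t(n-p-1, alpha/2) * s.e.(beta_hat_j)
x = np.array([0.0, 30.0, 60.0, 90.0, 120.0, 150.0])
y = np.array([3.1, 3.9, 4.6, 5.0, 5.3, 5.4])
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.conf_int(alpha=0.05))   # 95% confidence limits for beta_0 and beta_1
```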

A useful measure of the importance of the independent variables is the coefficient of determination, or squared multiple correlation, R², where

$$R^2 = \frac{\mathrm{SSReg}}{\mathrm{SSY}}.$$

R² is the square of the ordinary correlation between Yi and Ŷi and so ranges from 0 to 1. It is also the proportion of the variability in the response variable which is accounted for by the regression analysis. A value of R² close to unity indicates that the model fits the data well. With a good fit, the observed and predicted values will be close to each other, so SSE will be small. On the other hand, if the model does not fit well, SSE will be large and R² will be near zero. The value of R² is therefore used as a summary measure to judge the fit of the linear model to a given set of data.
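
A small numerical check of this interpretation, assuming an ordinary least squares fit with an intercept (the data are illustrative):

```python
import numpy as np

# R^2 computed as SSReg/SSY and as the squared correlation of Y with Y_hat
x = np.array([0.0, 30.0, 60.0, 90.0, 120.0, 150.0])
y = np.array([3.1, 3.9, 4.6, 5.0, 5.3, 5.4])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r2_anova = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(round(r2_anova, 4), round(r2_corr, 4))   # the two values agree
```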



There are two dangers in using R² alone as a measure of goodness of fit for a regression model. First, it gives no information on the appropriateness of the model: a curve may be well approximated by a straight line according to R², even though the distinction is very important from a practical point of view (Figure 2a), or the data may be segregated into groups yet still appear well described according to R² (Figure 2b).

Figure 2a. Curvilinear data fitted by a straight line with high R².
Figure 2b. Segregated data fitted by a straight line with high R².

For detecting these kinds of departures from the regression model there is no substitute for plotting the data.

The second problem with R² is that as additional variables are added to a regression equation, R² tends to increase, regardless of the true importance of these variables in determining the values of the dependent variable. A related statistic, the adjusted R² (Ra²), is used to compensate for this effect. It is defined as

$$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p}.$$

The purpose of Ra² is to assist goodness-of-fit comparisons between regression equations which differ with respect to either the number of explanatory variables or the number of observations.
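
A one-line sketch of the adjustment, using the formula above with hypothetical values of n, p, and R²:

```python
# Hypothetical values: n observations, p explanatory variables, unadjusted R^2
n, p, r2 = 20, 3, 0.85
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)   # adjusted R^2 as defined above
print(round(r2_adj, 3))
```

Note that some texts and software packages use n − p − 1 in the denominator when p counts only the explanatory variables and the intercept is estimated in addition.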



Residual Analysis

A simple and effective method for checking model adequacy in regression analysis is to examine the standardized residuals,

$$\hat{e}_i^{*} = \frac{Y_i - \hat{Y}_i}{\mathrm{s.e.}(\hat{e}_i)}.$$

If the fitted model is correct, the residuals should conform with the assumptions made on
the error terms, that is, normality, homoscedasticity, and independence.

In general, when the model is correct, the standardized residuals tend to fall between -2
and +2 and are randomly distributed about zero. The process of checking for model
violations by analyzing residuals is useful for uncovering hidden structures in the data.

If any standardized residual is outside the range −2.5 to +2.5, it should be checked for possible data errors or other reasons for deviation from the model.

Residual plots provide a visual indication of model adequacy and suggest possible modifications if the model is inadequate. Some of the more commonly used plots are those in which the standardized residuals are plotted against the fitted value, Ŷ; the independent variables Xj; and the time order, t, in which the observations occur (if relevant). Essentially, a good regression model should result in residuals whose plots against Ŷ, each Xj, or t do not exhibit any distinct pattern of variation (Figure 3). Any distinct patterns in the residual plots (Figures 4a-c) indicate inadequacy of the fitted model or violations of the assumptions of the regression procedure.
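
A rough sketch of such a residual plot in Python (the data are illustrative; the standardization here divides by the square root of MSE only, whereas the exact s.e.(êi) would also involve the leverage of each observation):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative simple linear fit
x = np.array([0.0, 30.0, 60.0, 90.0, 120.0, 150.0, 180.0, 210.0])
y = np.array([3.1, 3.9, 4.6, 5.0, 5.3, 5.4, 5.4, 5.3])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
resid = y - y_hat

# Approximate standardization of the residuals (divides by sqrt(MSE) only)
mse = (resid ** 2).sum() / (len(y) - 2)
std_resid = resid / np.sqrt(mse)
print(np.where(np.abs(std_resid) > 2.5)[0])   # observations worth re-checking

# Plot standardized residuals against the fitted values; look for patterns
plt.scatter(y_hat, std_resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Standardized residual")
plt.show()
```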

Figure 3. A satisfactory residual plot should give this overall impression



Figure 4. Plots (a), (b), and (c) indicating unsatisfactory residual behavior.

1. Plot against Ŷi. Plot forms as in Figure 4 indicate:

a) Variance not constant; need for weighted least squares or a transformation on the observations Yi before making a regression analysis.

b) Systematic departure from the fitted equation (negative residuals correspond to low Ŷ's, positive residuals to high Ŷ's). This indicates an incorrect model, such as use of linear regression when a curve is appropriate. This can also be caused by wrongly omitting a β0 term in the model.

c) Model inadequate - need for extra terms in the model (e.g., square or cross-product terms), or need for a transformation on the observations Yi before analysis.

2. If the data were collected in time order, a plot against time can be used to detect autocorrelation or time trends. The plot forms in Figure 4 indicate:

a) The variance is not constant; implying the need for weighted least squares
analysis.
b) A linear term in time should have been included in the model.
c) Linear and quadratic terms in time should have been included in the model.

3. Plot against the predictor variables Xj. Plot forms as in Figure 4 indicate:

a) Variance not constant; need for weighted least squares or a preliminary transformation on the Y's.
b) Error in calculations; linear effect of Xj not removed.
c) Need for extra terms in powers of Xj in the model or a transformation on the Y's.




