
CORRELATION AND REGRESSION
Correlation and regression (linear)
are the most commonly used techniques
for investigating the relationship
between two quantitative variables.
 
Many hydrologic variables are related to
each other through cause and effect –
changes in the
values of one or more variables cause
changes in some other variable.
INTRODUCTION
WHAT IS CORRELATION?
The goal of a correlation analysis is to see whether two measurement
variables covary, and to quantify the strength of the relationship between
the variables. If a change in one variable is accompanied by a change in the
other variable, they are said to be correlated. Correlation is concerned only
with the strength of the relationship; no causal effect is implied by correlation.

A set of variables may be related due to two reasons:


• If one variable drives the other, they may be correlated, such as rainfall and
runoff.
• The variables may also be correlated if they share the same cause.
Examples include dependent variables, such as river discharge,
concentration or transport rates of sediment, and concentration or transport
rates of substances that are transported in association with suspended
sediment.
WHAT IS REGRESSION?
Regression expresses the relationship, which was determined in
the correlation analysis, in the form of an equation. In regression
analysis, the problem of interest is the nature of the relationship
itself between the dependent (response) variable and the
independent (explanatory) variable.
Regression analysis is used to detect a relation between the
values of two or more variables, of which at least one is subject
to random variation, and to test whether such a relation, either
assumed or calculated, is statistically significant. It is a tool for
detecting relations between hydrologic parameters in different
places, between the parameters of a hydrologic model, between
hydraulic parameters and soil parameters, between crop growth
and water table depth, and so on.
 REGRESSION ANALYSIS IS USED TO:
• Predict the value of a dependent variable based on the value of at
least one independent variable
• Explain the impact of changes in an independent variable on the
dependent variable
• Dependent variable: the variable we wish to predict or explain (e.g.
runoff)
• Independent variable: the variable used to explain the dependent
variable (e.g. rainfall)
• Simple linear regression uses only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
 WHAT IS THE SCATTER DIAGRAM?
A scatter diagram can be used to show the
relationship between two variables. The starting
point is to draw a scatter of points on a graph, with
one variable on the X-axis and the other variable on
the Y-axis, to get a feel of the relationship (if any)
between the variables as suggested by the data.
The closer the points are to a straight line, the
stronger the linear relationship between two
variables. A scatter diagram of the data provides an
initial check of the assumptions for regression.
 
ASSUMPTIONS
Some underlying assumptions governing the uses
of correlation and regression are as follows. The
observations are assumed to be independent.
For correlation, both variables should be random
variables, but for regression only the dependent
variable Y must be random. In carrying out hypothesis
tests, the response variable should follow a normal
distribution and the variability of Y should be the
same for each value of the predictor variable.
THREE MAIN USES OF CORRELATION AND REGRESSION
• One is to test hypotheses about cause-and-effect relationships. In this case, the
experimenter determines the values of the X-variable and sees whether variation
in X causes variation in Y. For example, giving people different amounts of a drug
and measuring their blood pressure.

• The second main use for correlation and regression is to see whether two
variables are associated, without necessarily inferring a cause-and-
effect relationship. In this case, neither variable is determined by the
experimenter; both are naturally variable. If an association is found, the inference
is that variation in X may cause variation in Y, or variation in Y may cause
variation in X, or variation in some other factor may affect both X and Y.

• The third common use of regression (linear) is estimating the value of one
variable corresponding to a particular value of the other variable.
 
CORRELATION
CORRELATION COEFFICIENT:
A) The Pearson Product-Moment Correlation Coefficient is a measure
of correlation that quantifies the strength as well as the
direction of a linear relationship. It is usually denoted by
the Greek letter ρ.
 
CONDITIONS
This coefficient is used if two conditions are satisfied
• the variables are in the interval or ratio scale of measurement
• a linear relationship between them is suspected
POSITIVE AND NEGATIVE CORRELATION
The coefficient (ρ) is computed as the ratio of the covariance between the
variables to the product of their standard deviations:
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
This formulation is advantageous for two reasons.
• First, it tells us the direction of the relationship. Once the coefficient is
computed, ρ > 0 indicates a positive relationship, ρ < 0 indicates a
negative relationship, and ρ = 0 indicates that no linear relationship
exists.
• Second, it ensures (mathematically) that the numerical value of ρ ranges
from -1.0 to +1.0. This gives an idea of the strength of the linear
relationship between the variables. The closer the coefficient is to +1.0
or -1.0, the greater the strength of the linear relationship.
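The definition above (covariance divided by the product of the standard deviations) can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `pearson_r` and the rainfall and runoff numbers are made up for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    # The 1/n factors of the covariance and the standard deviations cancel.
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical rainfall (mm) and runoff (mm) values, invented for illustration.
rainfall = [10.0, 20.0, 30.0, 40.0, 50.0]
runoff = [2.0, 5.0, 9.0, 12.0, 18.0]
r = pearson_r(rainfall, runoff)
print(f"r = {r:.3f}")  # positive and close to +1: strong positive linear relationship
```

A value near +1 or -1 signals a strong linear relationship; the sign gives its direction.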
Properties of ρ
This measure of correlation has interesting properties, some of
which are listed below:
• It is independent of the units of measurement. It is in fact unit
free.
• It is symmetric. This means that ρ between X and Y is exactly the
same as ρ between Y and X.
• Pearson's correlation coefficient is independent of change in origin
and scale.
• If the variables are independent of each other, then one would
obtain ρ = 0. However, the converse is not true. In other words, ρ
= 0 does not imply that the variables are independent; it only
indicates the non-existence of a linear relationship.
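Three of these properties can be checked numerically: symmetry, independence of origin and scale, and the fact that ρ = 0 does not rule out dependence. The sketch below uses a hand-written helper `pearson_r` and invented data. The change of scale uses a positive factor, because a negative factor would flip the sign of the coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, invented for illustration.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 5.0, 9.0, 12.0, 18.0]

# Symmetry: swapping the variables leaves the coefficient unchanged.
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)

# Change of origin and (positive) scale, e.g. a change of units.
x_rescaled = [xi / 25.4 + 3.0 for xi in x]
r_scaled = pearson_r(x_rescaled, y)

# Exact but non-linear dependence: y = x^2 on values symmetric about zero.
u = [-2.0, -1.0, 0.0, 1.0, 2.0]
v = [ui ** 2 for ui in u]
r_nonlinear = pearson_r(u, v)  # zero, although v is fully determined by u
```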
B) The Spearman Rank Correlation Coefficient is a non-parametric
measure of correlation that uses the ranks of the observations rather
than their values. It is sometimes denoted by rs.
 
rs = correlation coefficient,

In general,
• rs > 0 implies positive agreement among ranks
• rs < 0 implies negative agreement (or agreement in the reverse
direction)
• rs = 0 implies no agreement
The closer rs is to +1, the better the agreement, while rs closer to -1 indicates
strong agreement in the reverse direction.
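The rank-based calculation can be sketched as follows. The formula for rs is not reproduced in the text above, so this sketch assumes the standard textbook form for data without ties, rs = 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of an observation. The function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rs(x, y):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this simple formula assumes there are no tied values")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A monotonic but non-linear relationship still gives perfect rank agreement.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 3 for xi in x]
rs = spearman_rs(x, y)
```

Because only ranks enter the calculation, any strictly increasing relationship gives rs = +1, whether or not it is linear.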
SIGNIFICANCE OF CORRELATION

Look up r in a table of correlation coefficients (ignoring the + or -
sign). The number of degrees of freedom is two less than
the number of points on the graph. If our calculated r value
exceeds the tabulated value at p = 0.05, then the correlation
is significant.
Degrees of Freedom    Probability, p
                      0.05     0.01     0.001
 1                    0.997    1.000    1.000
 2                    0.950    0.990    0.999
 3                    0.878    0.959    0.991
 4                    0.811    0.917    0.974
 5                    0.755    0.875    0.951
 6                    0.707    0.834    0.925
 7                    0.666    0.798    0.898
 8                    0.632    0.765    0.872
 9                    0.602    0.735    0.847
10                    0.576    0.708    0.823
11                    0.553    0.684    0.801
12                    0.532    0.661    0.780
13                    0.514    0.641    0.760
14                    0.497    0.623    0.742
15                    0.482    0.606    0.725
16                    0.468    0.590    0.708
17                    0.456    0.575    0.693
18                    0.444    0.561    0.679
19                    0.433    0.549    0.665
20                    0.423    0.537    0.652
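The lookup procedure can be sketched in code. The dictionary below copies a few p = 0.05 entries from the table above. The names `CRITICAL_R_05`, `is_significant` and `t_statistic` are illustrative. The t statistic is included on the assumption that the tabulated values come from the usual t test for a correlation coefficient, t = r √(df / (1 - r²)).

```python
import math

# Critical values of |r| at p = 0.05 for selected degrees of freedom,
# copied from the table above.
CRITICAL_R_05 = {3: 0.878, 5: 0.755, 8: 0.632, 10: 0.576, 16: 0.468, 20: 0.423}


def is_significant(r, n_points, table=CRITICAL_R_05):
    """Return True if |r| exceeds the tabulated critical value for n_points - 2 df."""
    df = n_points - 2
    if df not in table:
        raise KeyError(f"no tabulated value for {df} degrees of freedom")
    return abs(r) > table[df]


def t_statistic(r, n_points):
    """Equivalent t statistic for testing a correlation coefficient against zero."""
    df = n_points - 2
    return r * math.sqrt(df / (1.0 - r * r))


# Example: r = 0.695 from 18 data points, i.e. 16 degrees of freedom.
significant = is_significant(0.695, 18)
t_value = t_statistic(0.695, 18)
```

With 16 degrees of freedom the critical value is 0.468, so a coefficient of 0.695 is significant at p = 0.05.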


PARTIAL CORRELATION ANALYSIS
Partial correlation analysis involves studying the linear relationship between
two variables after excluding the effect of one or
more independent factors. In order to get a correct
picture of the relationship between two variables,
we should first eliminate the influence of other
variables. For example, study of partial correlation
between price and demand would involve studying
the relationship between price and demand
excluding the effect of money supply, exports, etc.
The partial correlation analysis assumes
great significance in cases where the
phenomena under consideration have
multiple factors influencing them, especially
in physical and experimental sciences, where
it is possible to control the variables and
the effect of each variable can be studied
separately.
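For the simplest case, where the effect of a single third variable Z is removed, the partial correlation can be computed from the three simple correlation coefficients. The sketch below uses the standard first-order formula, which is not given in the text above. The function name and the numerical values are hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with the effect of Z removed.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical simple correlations, invented for illustration.
r_partial = partial_r(r_xy=0.8, r_xz=0.6, r_yz=0.7)

# If X and Y are related only through Z, the partial correlation vanishes.
r_spurious = partial_r(r_xy=0.42, r_xz=0.6, r_yz=0.7)
```

The second call illustrates the point made above: a simple correlation of 0.42 drops to zero once the shared influence of Z is excluded.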
DISADVANTAGE OF SIMPLE CORRELATION:
In simple correlation, we measure the
strength of the linear relationship
between two variables, without taking
into consideration the fact that both
these variables may be influenced by a
third variable.
MULTIPLE CORRELATION
Another technique used to overcome the drawbacks of simple
correlation is multiple regression analysis.
Here, we study the effects of all the independent variables
simultaneously on a dependent variable. For example, the
correlation co-efficient between the yield of paddy (X1) and the
other variables, viz. type of seedlings (X2), manure (X3), rainfall
(X4), humidity (X5) is the multiple correlation co-efficient R1.2345.
This co-efficient takes a value between 0 and +1.
The limitations of multiple correlation are similar to those of
partial correlation.
Coaxial graphical correlations of runoff, with rainfall and other
parameters like time of the year, storm duration and antecedent
CORRELATION COEFFICIENT
P – rainfall
R – Run-off
N – Number of observations
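The formula to which these symbols belong is not reproduced in the text. Assuming it is the usual computational form of the Pearson coefficient applied to N pairs of rainfall and runoff observations, it would read as follows (all sums run over the N observations):

```latex
r = \frac{N \sum PR - \sum P \sum R}
         {\sqrt{\left[ N \sum P^2 - \left( \sum P \right)^2 \right]
                \left[ N \sum R^2 - \left( \sum R \right)^2 \right]}}
```

This form is algebraically the same as the covariance divided by the product of the standard deviations, but it works directly from the column totals of a calculation table.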
REGRESSION
ASSUMPTION OF LINEARITY
Linear regression does not test whether data is
linear. It finds the slope and the intercept
assuming that the relationship between the
independent and dependent variable can be best
explained by a straight line.
One can construct the scatter plot to confirm this
assumption. If the scatter plot reveals a non-linear
relationship, a suitable transformation (e.g. logarithmic)
can often be used to attain linearity.
Depending on the number of independent variables, regression
analysis can also be classified as:
a) Simple linear regression: most commonly used model in
hydrology, where the dependent variable is regressed on only one
independent variable
$y_i = a + b x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$
where $y_i$ is the i-th value of the dependent or regressed variable, $x_i$ is
the i-th value of the independent (regressor or predictor) variable,
and a and b are the regression coefficients. The regression line crosses the
y-axis at the point a (the intercept) and has a slope b, and $\varepsilon_i$ is the
random error or residual term for the i-th data point. Since the actual
(or observed) values of variable Y will not match the values
estimated by the regression equation, there will be residuals.
b) Multiple regression: where the
dependent variable is regressed on more
than one independent variable
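For the simple linear model, the least-squares estimates of the coefficients are b = Sxy / Sxx and a = ȳ - b x̄. A minimal sketch follows, with the function name `fit_line` and the data invented for illustration.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the model y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx          # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical rainfall (x) and runoff (y) values, invented for illustration.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 5.0, 9.0, 12.0, 18.0]
a, b = fit_line(x, y)

# Residuals: observed minus estimated values; they sum to zero for a least-squares line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```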
DEPENDENT AND INDEPENDENT VARIABLES
By simple linear regression, we mean models with
just one independent and one dependent
variable. The variable whose value is to be
predicted is known as the dependent
variable and the one whose known value is
used for prediction is known as
the independent variable.
CHOICE OF REGRESSION LINE
For example, consider two variables crop yield (Y) and
rainfall (X). Here construction of regression line of Y on X
would make sense and would be able to demonstrate the
dependence of crop yield on rainfall. We would then be able
to estimate crop yield given rainfall.
Careless use of linear regression analysis could mean
construction of regression line of X on Y which would
demonstrate the laughable scenario that rainfall is
dependent on crop yield; this would suggest that if you
grow really big crops you will be guaranteed a heavy
rainfall.
MAKING THE REGRESSION LINE
The way to draw the line is to take three values of x, one on
the left side of the scatter diagram, one in the middle and one
on the right. Use the regression equation to get the values.
Although two points are enough to define the line, three are
better as a check.
y = a + bx
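The three-point procedure can be sketched as follows. The coefficients a and b are hypothetical placeholders for values obtained from a fitted regression equation, and `line_points` is an illustrative helper.

```python
def line_points(a, b, xs):
    """Evaluate y = a + b*x at the chosen x values."""
    return [(x, a + b * x) for x in xs]


# Hypothetical coefficients and x values (left, middle, right of the scatter).
a, b = 2.0, 0.5
points = line_points(a, b, [10.0, 30.0, 50.0])
(x1, y1), (x2, y2), (x3, y3) = points

# Check: the middle point must lie on the straight line through the outer two.
slope_outer = (y3 - y1) / (x3 - x1)
middle_on_line = abs(y1 + slope_outer * (x2 - x1) - y2) < 1e-9
```

If the check fails when plotting by hand, one of the three values was computed or plotted incorrectly.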
REGRESSION COEFFICIENT
The coefficient of X in the line of regression of Y on X is called the
regression coefficient of Y on X. It represents change in the value
of dependent variable (Y) corresponding to unit change in the
value of independent variable (X).
For instance if the regression coefficient of Y on X is 0.53 units, it
would indicate that Y will increase by 0.53 if X increased by 1
unit. A similar interpretation can be given for the regression
coefficient of X on Y.
Once a line of regression has been constructed, one can check
how good it is (in terms of predictive ability) by examining the
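coefficient of determination ($R^2$). $R^2$ always lies between 0 and 1.
Coefficient of determination: $R^2 = 1 - S_{se} / S_{yy}$,
where $S_{se}$ is the sum of squared residuals and $S_{yy}$ is the sum of squared deviations of Y about its mean.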
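The coefficient of determination can be computed directly from the observed and estimated values. This is a minimal sketch; the function name and the numbers are invented for illustration, and the fitted values correspond to the hypothetical line y = -2.5 + 0.39x.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - S_se / S_yy."""
    n = len(observed)
    mean_y = sum(observed) / n
    # Total variation of y about its mean.
    s_yy = sum((yi - mean_y) ** 2 for yi in observed)
    # Variation left unexplained by the regression (sum of squared residuals).
    s_se = sum((yi - fi) ** 2 for yi, fi in zip(observed, fitted))
    return 1.0 - s_se / s_yy


# Hypothetical observed values and values estimated by a fitted line.
observed = [2.0, 5.0, 9.0, 12.0, 18.0]
fitted = [1.4, 5.3, 9.2, 13.1, 17.0]
r2 = r_squared(observed, fitted)
```

A perfect fit gives a value of 1, while a line that does no better than the mean of Y gives 0.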
Non-linear Regression: In a non-linear
regression equation, the dependent and
independent variable(s) are related through a
non-linear relationship:
$Y = a \, X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$
 
Note that by logarithmic transformation, the above
non-linear equation can be written as a linear
equation and the coefficients can be estimated in
the same manner as for linear regression.
Transforming Non Linear Relations
The relationship between some variables may be non-linear but can be transformed to
linear form so that the technique of linear regression can be applied. For example,
consider that two variables X and Y are non-linearly related as follows:
$Y = \alpha X^{\beta}$
This non-linear relation can be linearized by logarithmic transformation of the equation:
$\ln Y = \ln \alpha + \beta \ln X$, or
$A = a + bB$
where $A = \ln Y$, $a = \ln \alpha$, $b = \beta$, and $B = \ln X$. Now, one can use the regression
technique to estimate parameters a and b and thereby α and β. In this procedure, two
important points are worth noting.
a) The values of a and b are estimated by minimizing $\sum (A - A_{reg})^2$ and not by minimizing
$\sum (Y - Y_{reg})^2$. Here $A_{reg}$ and $Y_{reg}$ are the values of A and Y estimated by the regression
equation.
b) In the log-transformed equation, the error term is additive ($A = a + bB + c$), which
means that it is multiplicative in the original equation:
$Y = \alpha X^{\beta} \varepsilon$ (11.60)
The errors are related as $c = \ln \varepsilon$. Hence, the assumptions in hypothesis testing and
confidence intervals should be valid for c.
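The transformation procedure can be sketched in code: take logarithms of both variables, fit a straight line to the transformed values, and convert the intercept back with the exponential. The function name `fit_power_law` and the data are illustrative. As point (a) notes, the fit minimizes the squared errors in ln Y, not in Y.

```python
import math


def fit_power_law(x, y):
    """Estimate alpha and beta in Y = alpha * X**beta by regressing ln Y on ln X."""
    if any(v <= 0 for v in list(x) + list(y)):
        raise ValueError("logarithms require strictly positive values")
    big_b = [math.log(xi) for xi in x]  # B = ln X
    big_a = [math.log(yi) for yi in y]  # A = ln Y
    n = len(big_a)
    mean_b = sum(big_b) / n
    mean_a = sum(big_a) / n
    s_bb = sum((bi - mean_b) ** 2 for bi in big_b)
    s_ab = sum((bi - mean_b) * (ai - mean_a) for bi, ai in zip(big_b, big_a))
    b = s_ab / s_bb          # slope of the log-log line, equal to beta
    a = mean_a - b * mean_b  # intercept of the log-log line, equal to ln(alpha)
    return math.exp(a), b


# Synthetic, error-free data generated from Y = 3 * X^1.5 (invented for illustration).
x = [1.0, 2.0, 4.0, 8.0]
y = [3.0 * xi ** 1.5 for xi in x]
alpha, beta = fit_power_law(x, y)
```

With error-free data the original parameters are recovered; with real data the estimates differ from those obtained by minimizing the squared errors in Y directly.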
MAKING A REGRESSION EQUATION
Solution: In regression analysis, it is always helpful to first plot the data
and note the variations in the dependent and independent variables. Fig.
11.5 gives a plot of the precipitation and runoff data which shows that
there is not much scatter around the line of best fit.
(a) The values of the various quantities required to calculate a and b are computed in Table
11.1.
Here, $\bar{x} = 763.67/18 = 42.43$ and $\bar{y} = 272.75/18 = 15.15$.
The regression coefficients are:
$b = S_{xy}/S_{xx} = 321.443/678.979 = 0.473$
and $a = \bar{y} - b\bar{x} = 15.15 - 0.473 \times 42.43 = -4.933$ (evaluated with unrounded intermediate values).
Hence, the regression equation is: $y = -4.933 + 0.473x$.
 
(b) The percent of variation in y that is accounted for by the regression is computed as the
coefficient of determination ($R^2$) multiplied by 100. The value of $S_{se}$ has been computed in
Table 11.1.
Coefficient of determination $R^2 = 1 - S_{se}/S_{yy} = 1 - 163.073/315.251 = 0.483$.
The coefficient of correlation (r) = square root of the coefficient of determination = $(0.483)^{1/2} = 0.695$.
Thus, nearly 48 percent of the variation in y is explained by the regression equation. The remaining
52 percent of the variation is due to unexplained causes.
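The arithmetic of this example can be rechecked from the summary values quoted above (n = 18, Σx = 763.67, Σy = 272.75, Sxx = 678.979, Sxy = 321.443, Syy = 315.251). The sketch relies on the least-squares identity Sse = Syy - b·Sxy, which is not stated in the text; the variable names are illustrative.

```python
import math

# Summary values quoted in the worked example.
n = 18
sum_x, sum_y = 763.67, 272.75
s_xx, s_xy, s_yy = 678.979, 321.443, 315.251

mean_x = sum_x / n
mean_y = sum_y / n

b = s_xy / s_xx            # slope, about 0.473
a = mean_y - b * mean_x    # intercept, about -4.933 when b is not rounded first

# For a least-squares line, the residual sum of squares is S_yy - b * S_xy.
s_se = s_yy - b * s_xy     # about 163.073

r2 = 1.0 - s_se / s_yy     # about 0.483, i.e. roughly 48 percent of the variation
r = math.sqrt(r2)          # about 0.695

print(f"b = {b:.3f}, a = {a:.3f}, R^2 = {r2:.3f}, r = {r:.3f}")
```

Carrying b at full precision matters here: using the rounded values 0.473 and 42.43 in the intercept gives about -4.919 instead of -4.933.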