45 pages, Apr 26, 2020

CORRELATION AND RECESSION.pptx

© All Rights Reserved

REGRESSION

Correlation and regression (linear)

are the most commonly used techniques

for investigating the relationship

between two quantitative variables.

Many hydrologic variables are related to

each other through cause and effect –

changes in the

values of one or more variables cause

changes in some other variable.

INTRODUCTION

WHAT IS CORRELATION?

The goal of a correlation analysis is to see whether two measurement variables covary, and to quantify the strength of the relationship between the variables. If a change in one variable is accompanied by a change in the other variable, they are said to be correlated. Correlation is only concerned with the strength of the relationship. No causal effect is implied by correlation.

• If one variable drives the other, they may be correlated, as rainfall and

runoff.

• The variables may also be correlated if they share the same cause.

Examples include dependent variables, such as river discharge,

concentration or transport rates of sediment, and concentration or transport

rates of substances that are transported in association with suspended

sediment.

WHAT IS REGRESSION?

Regression expresses the relationship, which was determined in

the correlation analysis, in the form of an equation. In regression

analysis, the problem of interest is the nature of the relationship itself between the dependent (response) variable and the independent (explanatory) variable.

Regression analysis is used to detect a relation between the

values of two or more variables, of which at least one is subject

to random variation, and to test whether such a relation, either

assumed or calculated, is statistically significant. It is a tool for

detecting relations between hydrologic parameters in different

places, between the parameters of a hydrologic model, between

hydraulic parameters and soil parameters, between crop growth

and water table depth, and so on.

REGRESSION ANALYSIS IS USED TO:

• Predict the value of a dependent variable based on the value of at

least one independent variable

• Explain the impact of changes in an independent variable on the

dependent variable

• Dependent variable: the variable we wish to predict or explain (e.g. runoff)

• Independent variable: the variable used to explain the dependent variable (e.g. rainfall)

In simple linear regression:

• Only one independent variable, X

• Relationship between X and Y is described by a linear function

• Changes in Y are assumed to be caused by changes in X

WHAT IS THE SCATTER DIAGRAM?

A scatter diagram can be used to show the

relationship between two variables. The starting

point is to draw a scatter of points on a graph, with

one variable on the X-axis and the other variable on

the Y-axis, to get a feel of the relationship (if any)

between the variables as suggested by the data.

The closer the points are to a straight line, the

stronger the linear relationship between two

variables. A scatter diagram of the data provides an

initial check of the assumptions for regression.

ASSUMPTIONS

Some underlying assumptions governing the uses

of correlation and regression are as follows. The

observations are assumed to be independent.

For correlation, both variables should be random variables, but for regression only the dependent variable Y must be random. In carrying out hypothesis tests, the response variable should follow a normal distribution and the variability of Y should be the same for each value of the predictor variable.

THREE MAIN USES OF CORRELATION AND REGRESSION

• One is to test hypotheses about cause-and-effect relationships. In this case, the

experimenter determines the values of the X-variable and sees whether variation

in X causes variation in Y. For example, giving people different amounts of a drug

and measuring their blood pressure.

• The second main use for correlation and regression is to see whether two

variables are associated, without necessarily inferring a cause-and-

effect relationship. In this case, neither variable is determined by the

experimenter; both are naturally variable. If an association is found, the inference

is that variation in X may cause variation in Y, or variation in Y may cause

variation in X, or variation in some other factor may affect both X and Y.

• The third common use of regression (linear) is estimating the value of one

variable corresponding to a particular value of the other variable.

CORRELATION

CORRELATION COEFFICIENT:

A) The Pearson Product-Moment Correlation is a measure of correlation that quantifies both the strength and the direction of the linear relationship between two variables. It is usually denoted by the Greek letter ρ.

CONDITIONS

This coefficient is used if two conditions are satisfied:

• the variables are in the interval or ratio scale of measurement

• a linear relationship between them is suspected

POSITIVE AND NEGATIVE CORRELATION

The coefficient (ρ) is computed as the ratio of the covariance between the variables to the product of their standard deviations:

$$\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \, \sigma_Y}$$

This formulation is advantageous for two reasons.

• First, it shows the direction of the relationship. Once the coefficient is computed, ρ > 0 indicates a positive relationship, ρ < 0 indicates a negative relationship, and ρ = 0 indicates that there is no linear relationship.

• Second, it ensures (mathematically) that the numerical value of ρ ranges from -1.0 to +1.0. This gives an idea of the strength of the linear relationship between the variables: the closer the coefficient is to +1.0 or -1.0, the greater the strength of the linear relationship.
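As a minimal sketch of the definition above, the function below computes ρ as the covariance divided by the product of the standard deviations. The function name `pearson` and the rainfall and runoff values are hypothetical and serve only as an illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations; the 1/n factors cancel in the ratio.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (sd_x * sd_y)  # undefined (division by zero) if either variable is constant

# Hypothetical rainfall and runoff values, for illustration only.
rainfall = [10.0, 20.0, 30.0, 40.0, 50.0]
runoff = [2.0, 5.0, 9.0, 12.0, 18.0]

rho = pearson(rainfall, runoff)                          # close to +1: strong positive relationship
rho_reversed = pearson(rainfall, [-v for v in runoff])   # same magnitude, opposite sign
print(round(rho, 3), round(rho_reversed, 3))
```

Reversing the sign of one variable flips the sign of ρ but leaves its magnitude unchanged, which matches the interpretation of positive and negative correlation.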

Properties of ρ

• This measure of correlation has interesting properties, some of

which are enunciated below:

• It is independent of the units of measurement. It is in fact unit

free.

• It is symmetric. This means that ρ between X and Y is exactly the

same as ρ between Y and X.

• Pearson's correlation coefficient is independent of change in origin

and scale.

• If the variables are independent of each other, then one would obtain ρ = 0. However, the converse is not true. In other words, ρ = 0 does not imply that the variables are independent; it only indicates the absence of a linear relationship.
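Two of these properties can be checked numerically. The sketch below uses hypothetical data and a helper named `pearson` (an assumption of this example, not part of the slides). It shows that ρ is unchanged by a change of origin and a positive change of scale, and that ρ can be zero for variables that are fully dependent but not linearly related.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 5.0, 9.0, 12.0, 18.0]

# Change of origin and (positive) scale: u = 3 + 2x, v = 10 + 0.5y.
u = [3 + 2 * xi for xi in x]
v = [10 + 0.5 * yi for yi in y]
rho_original = pearson(x, y)
rho_transformed = pearson(u, v)   # same value as rho_original

# Y = X**2 depends completely on X, yet the linear correlation is zero
# because the x values are symmetric about 0.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sq = [xi ** 2 for xi in x_sym]
rho_quadratic = pearson(x_sym, y_sq)

print(rho_original, rho_transformed, rho_quadratic)
```

Note that a negative scale factor applied to only one variable would reverse the sign of ρ while keeping its magnitude.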

B) The Spearman Rank Correlation Coefficient is a non-parametric measure of correlation that uses the ranks of the observations, rather than their values, to calculate the correlation. It is sometimes denoted by rs.

In general,

• rs > 0 implies positive agreement among ranks

• rs < 0 implies negative agreement (or agreement in the reverse

direction)

• rs = 0 implies no agreement

The closer rs is to +1, the better the agreement, while rs close to -1 indicates strong agreement in the reverse direction.
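The slides do not preserve the formula for rs, so the sketch below assumes the standard rank-difference form, rs = 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of each observation and n is the number of pairs. This form is exact only when there are no tied values. The function names and data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data, for illustration only.
x = [10, 20, 30, 40, 50]
rs_perfect = spearman(x, [1, 4, 9, 16, 25])    # ranks agree exactly
rs_reverse = spearman(x, [25, 16, 9, 4, 1])    # ranks agree in reverse order
rs_partial = spearman(x, [3, 1, 4, 2, 5])      # partial agreement
print(rs_perfect, rs_reverse, rs_partial)
```

Because only ranks are used, a monotonic but non-linear relationship (such as the squares in the first call) still gives rs = 1.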

SIGNIFICANCE OF CORRELATION

To test significance, the calculated r value is compared with the tabulated critical value (ignoring its sign). The number of degrees of freedom is two less than the number of points on the graph. If the calculated r value exceeds the tabulated value at p = 0.05, the correlation is significant.

Degrees of freedom    Probability, p
                      0.05     0.01     0.001
1                     0.997    1.000    1.000

PARTIAL CORRELATION ANALYSIS

Partial correlation analysis involves studying the linear relationship between

two variables after excluding the effect of one or

more independent factors. In order to get a correct

picture of the relationship between two variables,

we should first eliminate the influence of other

variables. For example, study of partial correlation

between price and demand would involve studying

the relationship between price and demand

excluding the effect of money supply, exports, etc.

The partial correlation analysis assumes

great significance in cases where the

phenomena under consideration have

multiple factors influencing them, especially

in physical and experimental sciences, where

it is possible to control the variables and

the effect of each variable can be studied

separately.
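For the simplest case of one controlling variable, the partial correlation can be computed from the three pairwise simple correlations. The sketch below assumes the standard first-order formula, which the slides do not state; the function name and the correlation values are hypothetical.

```python
import math

def partial_correlation(r12, r13, r23):
    """First-order partial correlation between variables 1 and 2, controlling for variable 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical simple correlations, for illustration only.
r12 = 0.8   # between variables 1 and 2
r13 = 0.6   # between variable 1 and the controlled variable 3
r23 = 0.7   # between variable 2 and the controlled variable 3

r12_3 = partial_correlation(r12, r13, r23)
print(round(r12_3, 3))
```

In this example the partial correlation is smaller than the simple correlation of 0.8, because part of the association between variables 1 and 2 is shared with variable 3.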

DISADVANTAGE:

In simple correlation, we measure the

strength of the linear relationship

between two variables, without taking

into consideration the fact that both

these variables may be influenced by a

third variable.

MULTIPLE CORRELATION

Another technique used to overcome the drawbacks of simple

correlation is multiple regression analysis.

Here, we study the effects of all the independent variables

simultaneously on a dependent variable. For example, the

correlation co-efficient between the yield of paddy (X1) and the

other variables, viz. type of seedlings (X2), manure (X3), rainfall

(X4), humidity (X5) is the multiple correlation co-efficient $R_{1.2345}$.

This co-efficient takes value between 0 and +1.

The limitations of multiple correlation are similar to those of

partial correlation.
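For the smallest case, one dependent variable and two independent variables, the multiple correlation coefficient can be written in terms of the pairwise simple correlations. The sketch below assumes that standard two-predictor formula, which is not given in the slides; the function name and the correlation values are hypothetical.

```python
import math

def multiple_correlation(r12, r13, r23):
    """Multiple correlation R_1.23 of variable 1 on variables 2 and 3."""
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return math.sqrt(r_squared)

# Hypothetical simple correlations, for illustration only.
r12, r13, r23 = 0.8, 0.6, 0.7

R = multiple_correlation(r12, r13, r23)
print(round(R, 3))
```

The result lies between 0 and +1 and is at least as large as the strongest simple correlation with the dependent variable (0.8 here).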

Coaxial graphical correlations of runoff, with rainfall and other

parameters like time of the year, storm duration and antecedent

CORRELATION COEFFICIENT

P – rainfall

R – Run-off

N – Number of observations
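The formula that this legend belongs to did not survive extraction. As a hedged sketch, the code below assumes the usual computational form of the correlation coefficient written with sums, r = (N ΣPR - ΣP ΣR) / sqrt([N ΣP² - (ΣP)²][N ΣR² - (ΣR)²]), with P, R and N as defined above. The function name and the data are hypothetical.

```python
import math

def correlation_from_sums(P, R):
    """Correlation coefficient between rainfall P and runoff R, using raw sums."""
    N = len(P)
    sum_p = sum(P)
    sum_r = sum(R)
    sum_pr = sum(p * r for p, r in zip(P, R))
    sum_p2 = sum(p * p for p in P)
    sum_r2 = sum(r * r for r in R)
    numerator = N * sum_pr - sum_p * sum_r
    denominator = math.sqrt((N * sum_p2 - sum_p ** 2) * (N * sum_r2 - sum_r ** 2))
    return numerator / denominator

# Hypothetical rainfall and runoff observations, for illustration only.
P = [10.0, 20.0, 30.0, 40.0, 50.0]
R = [2.0, 5.0, 9.0, 12.0, 18.0]

r = correlation_from_sums(P, R)
print(round(r, 3))
```

This form needs only running totals, which is why it is convenient for hand calculation in a table.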

REGRESSION

ASSUMPTION OF LINEARITY

Linear regression does not test whether data is

linear. It finds the slope and the intercept

assuming that the relationship between the

independent and dependent variable can be best

explained by a straight line.

One can construct a scatter plot to check this assumption. If the scatter plot reveals a non-linear relationship, a suitable transformation (e.g. logarithmic) can often be used to attain linearity.

Depending on the number of independent variables, regression

analysis can also be classified as:

a) Simple linear regression: most commonly used model in

hydrology, where the dependent variable is regressed on only one

independent variable

$$y_i = a + b x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

where y_i is the ith value of the dependent (regressed) variable, x_i is the ith value of the independent (regressor or predictor) variable, and a and b are the regression coefficients. The regression line crosses the y-axis at a (the intercept) and has slope b; ε_i is the random error, or residual, term for the ith data point. Since the actual (observed) values of Y will not exactly match the values estimated by the regression equation, there will be residuals.

b) Multiple regression: where the

dependent variable is regressed on more

than one independent variable
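The simple linear model in (a) can be fitted by least squares. The sketch below uses the estimates b = Sxy/Sxx and a = ȳ - b x̄, where Sxy and Sxx are sums of products and squares of deviations from the means. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx              # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

# Hypothetical rainfall (x) and runoff (y) values, for illustration only.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 5.0, 9.0, 12.0, 18.0]

a, b = fit_line(x, y)
# Residuals: observed value minus the value estimated by the regression equation.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(a, 3), round(b, 3), [round(e, 2) for e in residuals])
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick check on the arithmetic.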

DEPENDENT AND INDEPENDENT VARIABLES

By simple linear regression, we mean models with just one independent and one dependent variable. The variable whose value is to be

predicted is known as the dependent

variable and the one whose known value is

used for prediction is known as

the independent variable.

CHOICE OF REGRESSION LINE

For example, consider two variables crop yield (Y) and

rainfall (X). Here construction of regression line of Y on X

would make sense and would be able to demonstrate the

dependence of crop yield on rainfall. We would then be able

to estimate crop yield given rainfall.

Careless use of linear regression analysis could mean

construction of regression line of X on Y which would

demonstrate the laughable scenario that rainfall is

dependent on crop yield; this would suggest that if you

grow really big crops you will be guaranteed a heavy

rainfall.

MAKING THE REGRESSION LINE

The way to draw the line is to take three values of x, one on the left side of the scatter diagram, one in the middle and one on the right, and use the regression equation to compute the corresponding y values. Although two points are enough to define the line, three are better as a check.

y = a + bx

REGRESSION COEFFICIENT

The coefficient of X in the line of regression of Y on X is called the

regression coefficient of Y on X. It represents change in the value

of dependent variable (Y) corresponding to unit change in the

value of independent variable (X).

For instance, if the regression coefficient of Y on X is 0.53, Y increases by 0.53 units for each 1-unit increase in X. A similar interpretation can be given for the regression coefficient of X on Y.

Once a line of regression has been constructed, one can check how good it is (in terms of predictive ability) by examining the coefficient of determination (R²). R² always lies between 0 and 1.

$$R^2 = 1 - \frac{S_{se}}{S_{yy}}$$

where Sse is the sum of squared residuals and Syy is the sum of squared deviations of Y from its mean.
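A minimal sketch of this calculation is given below, taking Sse as the sum of squared residuals and Syy as the sum of squared deviations of Y from its mean. The function name, the data, and the line y = -2.5 + 0.39x (the least-squares line for these hypothetical data) are illustrative assumptions.

```python
def coefficient_of_determination(y, y_hat):
    """R^2 = 1 - Sse/Syy for observed values y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    s_se = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
    s_yy = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    return 1 - s_se / s_yy

# Hypothetical data, for illustration only.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 5.0, 9.0, 12.0, 18.0]

# Fitted values from the least-squares line for these data: y = -2.5 + 0.39 x.
y_hat = [-2.5 + 0.39 * xi for xi in x]

r_squared = coefficient_of_determination(y, y_hat)
print(round(r_squared, 4))
```

A value close to 1 means that the line accounts for most of the variation in Y; a perfect fit gives exactly 1.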

Non-linear Regression: In a non-linear

regression equation, the dependent and

independent variable(s) are related through a

non-linear relationship:

$$Y = a \, X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$$

Note that by logarithmic transformation, the above

non-linear equation can be written as a linear

equation and the coefficients can be estimated in

the same manner as for linear regression.

Transforming Non Linear Relations

Relationships between some variables may be non-linear but can be transformed to a linear form so that the technique of linear regression can be applied. For example, consider two variables X and Y that are non-linearly related as follows:

$$Y = \alpha X^{\beta}$$

This non-linear relation can be linearized by logarithmic transformation of the equation:

$$\ln Y = \ln \alpha + \beta \ln X$$

or

$$A = a + b B$$

where A = ln Y, a = ln α, b = β, and B = ln X. Now, one can use the regression technique to estimate the parameters a and b, and thereby α and β. In this procedure, two important points are worth noting.

a) The values of a and b are estimated by minimizing Σ(A - Areg)² and not by minimizing Σ(Y - Yreg)². Here Areg and Yreg are the values of A and Y estimated by the regression equation.

b) In the log-transformed equation, the error term is additive (A = a + bB + c), which means that it is multiplicative in the original equation:

$$Y = \alpha X^{\beta} \varepsilon \qquad (11.60)$$

The errors are related as c = ln ε. Hence, the assumptions in hypothesis testing and confidence intervals should be valid for c.
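The transformation can be sketched as follows: take logarithms of both variables, fit a straight line to the transformed values, and convert the intercept back with the exponential. The data below are hypothetical and generated exactly from Y = 2 X^1.5, so the fit recovers α = 2 and β = 1.5; the function name is also an assumption of this example.

```python
import math

def fit_power_law(x, y):
    """Estimate alpha and beta in Y = alpha * X**beta by regressing ln Y on ln X."""
    B = [math.log(xi) for xi in x]   # B = ln X
    A = [math.log(yi) for yi in y]   # A = ln Y
    n = len(B)
    mean_b = sum(B) / n
    mean_a = sum(A) / n
    s_ba = sum((bi - mean_b) * (ai - mean_a) for bi, ai in zip(B, A))
    s_bb = sum((bi - mean_b) ** 2 for bi in B)
    slope = s_ba / s_bb                   # b = beta
    intercept = mean_a - slope * mean_b   # a = ln(alpha)
    return math.exp(intercept), slope

# Hypothetical, error-free data generated from Y = 2 * X**1.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 * xi ** 1.5 for xi in x]

alpha, beta = fit_power_law(x, y)
print(round(alpha, 6), round(beta, 6))
```

With real, scattered data the estimates minimize the squared errors in ln Y rather than in Y, as noted in point (a), so they generally differ from a direct non-linear fit. All values must be positive for the logarithms to exist.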

MAKING A REGRESSION EQUATION


Solution: In regression analysis, it is always helpful to first plot the data and note the variations in the dependent and independent variables. Fig. 11.5 gives a plot of the precipitation and runoff data, which shows that there is not much scatter around the line of best fit.

(a) The values of the variables required to calculate a and b are computed in Table 11.1.

Here, x̄ = 763.67/18 = 42.43 and ȳ = 272.75/18 = 15.15.

The regression coefficients are:

b = Sxy/Sxx = 321.443/678.979 = 0.473

and a = ȳ - b x̄ = 15.15 - 0.473 × 42.43 = -4.933 (evaluated with the unrounded values of b, x̄ and ȳ).

Hence, the regression equation is: y = -4.933 + 0.473 x.

(b) The percent of variation in y that is accounted for by the regression is the coefficient of determination (R²) multiplied by 100. The value of Sse has been computed in Table 11.1.

Coefficient of determination R² = 1 - Sse/Syy = 1 - 163.073/315.251 = 0.483.

The coefficient of correlation r = square root of the coefficient of determination = (0.483)^(1/2) = 0.695.

Thus, about 48 percent of the variation in y is explained by the regression equation. The remaining 52 percent is due to unexplained causes.
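The arithmetic of this solution can be rechecked from the summary values quoted in it (n = 18, Σx = 763.67, Σy = 272.75, Sxy = 321.443, Sxx = 678.979, Sse = 163.073, Syy = 315.251). The sketch below uses only those numbers; the variable names are illustrative.

```python
import math

# Summary values quoted in the worked solution.
n = 18
sum_x, sum_y = 763.67, 272.75
s_xy, s_xx = 321.443, 678.979
s_se, s_yy = 163.073, 315.251

mean_x = sum_x / n
mean_y = sum_y / n

b = s_xy / s_xx                 # slope
a = mean_y - b * mean_x         # intercept, using unrounded intermediate values
r_squared = 1 - s_se / s_yy     # coefficient of determination
r = math.sqrt(r_squared)        # coefficient of correlation (positive, since b > 0)
percent_explained = 100 * r_squared

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"R^2 = {r_squared:.3f}, r = {r:.3f}")
print(f"explained variation = {percent_explained:.1f} percent")
```

Carrying the unrounded slope and means through the calculation gives an intercept of about -4.933, and R² of about 0.483 corresponds to roughly 48 percent of the variation in y being explained.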
