Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Correlations
Correlation is used to test for a relationship between two numerical variables or two
ranked (ordinal) variables.
Correlation is a bivariate analysis that measures the strengths of association between two
variables. In statistics, the value of the correlation coefficient varies between +1 and -1.
When the value of the correlation coefficient lies around 1, then it is said to be a perfect
degree of association between the two variables. As the correlation coefficient value goes
towards 0, the relationship between the two variables will be weaker. Usually, in statistics,
we measure three types of correlations:
Pearson correlation
Kendall rank correlation
Spearman correlation
Where:
r = Pearson r correlation coefficient
N = number of value in each data
xy = sum of the products of paired scores
x = sum of x scores
y = sum of y scores
x2= sum of squared x scores
y2= sum of squared y scores
For the Pearson r correlation, both variables should be normally distributed. Other
assumptions include linearity and homoscedasticity. Linearity assumes a straight line
relationship between each of the variables in the analysis and homoscedasticity assumes
that data is normally distributed about the regression line.
BPS651
Where:
Nc= number of concordant
Nd= Number of discordant
Concordant: Ordered in the same way
Discordant: Ordered differently
Spearman rank correlation: Spearman rank correlation is a non-parametric test that is
used to measure the degree of association between two variables. Spearman rank
correlation test does not assume any assumptions about the distribution of the data and is
the appropriate correlation analysis when the variables are measured on a scale that is at
least ordinal.
The following formula is used to calculate the Spearman rank correlation:
Where:
P= Spearman rank correlation
di= the difference between the ranks of corresponding values Xi and Yi
n= number of value in each data set
Correlations with R
The cor( ) function to produce correlations. A simplified format is:-
Description
use
Specifies the handling of missing data. Options are all.obs (assumes no missing data, *
missing data will produce an error), complete.obs (listwise deletion), and
pairwise.complete.obs (pairwise deletion)
method
BPS651
Visualizing Correlations
Simple Scatterplot
The basic function is plot(x, y) that is used to plot scatter plot, where x and y are numeric vectors denoting
the (x,y) points to plot.
Example
>plot(wt, mpg, main="Scatterplot Example", xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
BPS651
Scatterplot Matrices
pairs() to create scatterplot matrices
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
Exercise 10
Protein intake X and fat intake Y (in gm) for ten old women given as
X
56,47,33,39,42,38,46,47,38,32
56,83,49,52,65,52,56,48,59,70
Calculate correlation Coefficient (Pearson) , draw scatter plot matrix and scatter plot
Exercise 11
Find correlation coefficient (Pearson) between the sales and expenses from the data given below:
Firm:
10
50
50
55
60
65
65
65
60
60
50
11
13
14
16
16
15
15
14
13
13
BPS651
Regression
Simple Linear Regression
A simple linear regression model that describes the relationship between two variables x
and y can be expressed by the following equation. The numbers and are called
parameters, and is the error term.
For example, in the data set faithful, it contains sample data of two random variables
named waiting and eruptions. The waiting variable denotes the waiting time until the next
eruptions, and eruptions denote the duration. Its linear regression model can be expressed
as:
Topic to be studied
Estimated Simple Regression Equation
Coefficient of Determination
Significance Test for Linear Regression
Confidence Interval for Linear Regression
Prediction Interval for Linear Regression
Residual Plot
Standardized Residual
Normal Probability Plot of Residuals
>data()
>faithful
>head(faithful)
Estimated Simple Regression Equation
If we choose the parameters and in the simple linear regression model so as to
minimize the sum of squares of the error term , we will have the so called estimated
simple regression equation. It allows us to compute fitted values of y based on values of x.
Problem: Apply the simple linear regression model for the data set faithful, and estimate
the next eruption duration if the waiting time since the last eruption has been 80 minutes.
Solution : We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
>eruption.lm=lm(eruptions~waiting,data=faithful)
>eruption.lm
Then we extract the parameters of the estimated regression equation with the coefficients
function.
>coeffs=coefficients(eruption.lm)
BPS651
Problem: Find the coefficient of determination for the simple linear regression model of
the data set faithful.
Solution: We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
>eruption.lm=lm(eruptions~waiting,data=faithful)
Then we extract the coefficient of determination from the r.squared attribute of its
summary.
>summary(eruption.lm)$r.squared
[1]0.81146
Answer: The coefficient of determination of the simple linear regression model for the
data set faithful is 0.81146.
Significance Test for Linear Regression
Assume that the error term in the linear regression model is independent of x, and is
normally distributed, with zero mean and constant variance. We can decide whether there
is any significant relationship between x and y by testing the null hypothesis that = 0.
Problem: Decide whether there is a significant relationship between the variables in the
linear regression model of the data set faithful at .05 significance level.
Solution: We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
BPS651
>eruption.lm=lm(eruptions~waiting,data=faithful)
Then we print out the F-statistics of the significance test with the summary function.
>summary(eruption.lm)
Call:
lm(formula=eruptions~waiting,data=faithful)
Residuals:
Min 1Q Median 3Q Max
-1.2992 -0.3769 0.0351 0.3491 1.1933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402 0.16014 -11.7 <2e-16 ***
waiting 0.07563 0.00222 34.1 <2e-16 ***
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16
Answer: As the p-value is much less than 0.05, we reject the null hypothesis that = 0.
Hence there is a significant relationship between the variables in the linear regression
model of the data set faithful.
Confidence Interval for Linear Regression
Assume that the error term in the linear regression model is independent of x, and is
normally distributed, with zero mean and constant variance. For a given value of x, the
interval estimate for the mean of the dependent variable, , is called the confidence
interval.
Problem: In the data set faithful, develop a 95% confidence interval of the mean eruption
duration for the waiting time of 80 minutes.
Solution: We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
>attach(faithful)#attach the data frame
>eruption.lm=lm(eruptions~waiting)
Then we create a new data frame that set the waiting time value.
>newdata=data.frame(waiting=80)
We now apply the predict function and set the predictor variable in the newdata argument.
We also set the interval type as "confidence", and use the default 0.95 confidence level.
>predict(eruption.lm, newdata, interval="confidence")
fit lwr upr
BPS651
Problem: Plot the residual of the simple linear regression model of the data set faithful
against the independent variable waiting.
Solution: We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
Then we compute the residual with the resid function.
>eruption.lm = lm(eruptions ~ waiting, data=faithful)
>eruption.res = resid(eruption.lm)
We now plot the residual against the observed values of the variable waiting.
BPS651
Standardized Residual
The standardized residual is the residual divided by its standard deviation.
Problem: Plot the standardized residual of the simple linear regression model of the data
set faithful against the independent variable waiting.
Solution:We apply the lm function to a formula that describes the variable eruptions by
the variable waiting, and save the linear regression model in a new variable eruption.lm.
Then we compute the standardized residual with the rstandard function.
>eruption.lm = lm(eruptions ~ waiting, data=faithful)
>eruption.stdres = rstandard(eruption.lm)
We now plot the standardized residual against the observed values of the variable waiting.
>plot(faithful$waiting, eruption.stdres,
+ ylab="Standardized Residuals",
+ xlab="Waiting Time",
+ main="Old Faithful Eruptions")
>abline(0, 0)
# the horizon
BPS651
Exercise 12
Geographical area x and area under paddy cultivated y ( in hectares) for 15 villages of a
district are given belowX 103,106,120,120,100,151,160,155,136,178,196,140,160,166,112
Y 041,033,087,078,035,081,090,085,070,100,102,070,082,085,050
Calculate correlation coefficient, Calculate regression equation of y on x, Calculate
Coefficient of determination, display significance test.
Estimate paddy cultivation where geographical area is 136 hectares, develop a 95%
confidence interval & 95% prediction interval of the mean y for 136 hectares.
Draw Residual plot, Standardized residual plot and Normal probability plot of residuals.
Exercise 13
BPS651
12,14,9.5,10.5,8,11.5,10,14,8,9.5,11,12
II
11.5,13.5,12,14,7,14,8,12.5,6.5,10,9,12
Calculate correlation coefficients, Calculate regression equation of y(i.e. II) on x(i.e. I),
Calculate Coefficient of determination, display significance test.
Draw Residual plot, Standardized residual plot and Normal probability plot of residuals.
Exercise 14
Twelve students for the following percentage of makes in Physics & Statistics calculate:Correlation coefficient
Linear regression equation of y on x
X
73,42,88,38,68,75,80,54,64,48,35,37
73,48,86,58,65,60,76,54,50,38,32,30
Estimate Y where X is 80, develop a 95% confidence interval & 95% prediction interval of
the mean Y for 80 hectares.
BPS651
For example, in the built-in data set stackloss from observations of a chemical plant
operation, if we assign stackloss as the dependent variable, and assign Air.Flow (cooling air
flow), Water.Temp (inlet water temperature) and Acid.Conc. (acid concentration) as
independent variables, the multiple linear regression model is:
>data()
>stackloss
Topic to be studied
Problem: Apply the multiple linear regression model for the data set stackloss, and predict
the stack loss if the air flow is 72, water temperature is 20 and acid concentration is 85.
Solution: We apply the lm function to a formula that describes the variable stack.loss by
the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression
model in a new variable stackloss.lm.
>stackloss.lm=lm(stack.loss~+Air.Flow+Water.Temp+Acid.Conc.,data=stackloss)
>stackloss.lm
We also wrap the parameters inside a new data frame named newdata.
>newdata= data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
Lastly, we apply the predict function to stackloss.lm and newdata.
BPS651
Problem: Find the coefficient of determination for the multiple linear regression model of
the data set stackloss.
Solution: We apply the lm function to a formula that describes the variable stack.loss by
the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression
model in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
Then we extract the coefficient of determination from the r.squared attribute of its
summary.
>summary(stackloss.lm)$r.squared
[1] 0.91358
Answer: The coefficient of determination of the multiple linear regression model for the
data set stackloss is 0.91358.
Adjusted Coefficient of Determination
The adjusted coefficient of determination of a multiple linear regression model is defined
in terms of the coefficient of determination as follows, where n is the number of
observations in the data set, and p is the number of independent variables.
Problem: Find the adjusted coefficient of determination for the multiple linear regression
model of the data set stackloss.
Solution: We apply the lm function to a formula that describes the variable stack.loss by
the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression
model in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
BPS651
Answer:As the p-values of Air.Flow and Water.Temp are less than 0.05, they are both
statistically significant in the multiple linear regression model of stackloss.
Confidence Interval for MLR
BPS651
We now apply the predict function and set the predictor variable in the newdata argument.
We also set the interval type as "confidence", and use the default 0.95 confidence level.
> predict(stackloss.lm, newdata, interval="confidence")
fit lwr upr
1 24.582 20.218 28.945
> detach(stackloss) # clean up
Answer: The 95% confidence interval of the stack loss with the given parameters is
between 20.218 and 28.945.
Prediction Interval for MLR
Assume that the error term in the multiple linear regression (MLR) model is independent
of xk (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. For
a given set of values of xk (k = 1, 2, ..., p), the interval estimate of the dependent variable y is
called the prediction interval.
Problem:In data set stackloss, develop a 95% prediction interval of the stack loss if the air
flow is 72, water temperature is 20 and acid concentration is 85.
Solution: We apply the lm function to a formula that describes the variable stack.loss by
the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression
model in a new variable stackloss.lm.
> attach(stackloss) # attach the data frame
> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.)
Then we wrap the parameters inside a new data frame variable newdata.
> newdata = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
We now apply the predict function and set the predictor variable in the newdata argument.
We also set the interval type as "predict", and use the default 0.95 confidence level.
BPS651
Exercise 15
Following are the data on yield per plant (Y) days to maturity (x1) and length of ear
head(x2) for 12 plants of a crop. To calculate
Multiple Correlation Coefficients
Multiple regression equation of Y on X1 and X2
Y
x1
x2
10
50
1.0
20
50
1.2
30
50
1.5
50
51
2.0
70
52
2.5
90
53
3.0
100
54
3.3
130
55
4.0
140
55
5.0
150
55
6.0
155
56
6.5
160
58
7.5
Exercise 16
Compute the values of correlation coefficients, multiple regression equation Calculate
Multiple Coefficient of determination, Adjusted Coefficient of determination, display
significance test for MLR for the following data.
X1
2931.5
1054.9
4671.7
7090.0
6280.4
2737.6
3176.9
2719.1
8274.9
4571.7
8965.9
6907.3
8852.9
2469.2
9861.1
BPS651
X2
1055
689
1102
1785
993
395
836
663
1370
766
1030
1160
778
822
1186
X3
242
290
132
235
337
211
441
181
233
196
254
219
189
185
144
X4
867.3
313.9
394.5
1209.7
2454.0
1002.0
2352.9
3083.0
1239.5
1053.8
853.5
762.7
1023.0
942.0
443.0
X5
282
884
297
257
281
293
288
270
281
250
283
281
280
281
285
X6
476
172
166
299
532
155
270
229
497
145
164
360
388
350
250
X7
985
111
674
534
735
492
317
440
938
451
360
401
347
340
600
X8
430
413
441
475
434
570
540
344
441
402
391
343
365
360
400
X9
276
276
276
276
276
280
284
281
376
278
277
277
285
292
278
X10
1093
1473
1218
1613
1066
1235
1305
1401
1029
1095
1056
1062
1621
1146
1246
Y
16.5
16.0
13.0
18.0
18.5
18.5
17.5
20.5
18.5
17.5
21.1
19.5
20.0
21.1
13.5
Laboratory -IV
Logistic Regression
Analysis of Variance
BPS651