
CHAPTER 2

Describing bivariate data

• NOTATION
• SCATTER PLOT
• SIMPLE LINEAR REGRESSION
  o What is it?
  o Least squares estimation
  o Watch out!
  o Log transformation and regression
  o A practical approach
• SAMPLE CORRELATION COEFFICIENT
• RANK CORRELATION COEFFICIENT

STAT6202 Chapter 2 2012/2013
BIVARIATE DATA
What is it?
• VALUES OF 2 QUANTITATIVE VARIABLES x AND y FOR n INDIVIDUALS
  o Measurements (x1, y1), (x2, y2), …, (xn, yn), or (xi, yi) for i = 1, 2, …, n

• AN EXAMPLE: DATA ON ALCOHOL CONSUMPTION

  Number  Country   Alcohol consumption    Cirrhosis & alcoholism
                    (liters/person/year)   (death rate/100,000)
  1       France    24.7                   46.1
  2       Italy     15.2                   23.6
  3       Germany   12.3                   23.7
  :       :         :                      :
  12      Ireland   5.6                    6.4
  13      Norway    4.2                    4.3
  14      Finland   3.9                    3.6
  15      Israel    3.1                    5.4

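BIVARIATE DATA
Notation
• 2 QUANTITATIVE VARIABLES x AND y, n INDIVIDUALS
  o Some old statistics

    $\bar{x} = \frac{\sum x_i}{n}$,  $s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$,  $\bar{y} = \frac{\sum y_i}{n}$,  $s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}}$

  o And some new

    $C_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$

    $C_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$,  $C_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$

    $s_{xy} = \frac{C_{xy}}{n-1} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$,  $r_{xy} = \frac{s_{xy}}{s_x s_y}$
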
BIVARIATE DATA
Alcohol example (without France)

  i      Country   xi      yi      xi^2      yi^2      xi yi
  1      Italy     15.2    23.6    231.04    556.96    358.72
  2      Germany   12.3    23.7    151.29    561.69    291.51
  3      Austria   10.9    7       118.81    49        76.3
  :      :         :       :       :         :         :
  14     Israel    3.1     5.4     9.61      29.16     16.74
  Total            109.5   132.4   1020.03   1863.80   1297.74

  $n = 14$,  $\bar{x} = \frac{\sum x_i}{n} = \frac{109.5}{14} = 7.8214$,  $\bar{y} = \frac{\sum y_i}{n} = \frac{132.4}{14} = 9.4571$,  $s_y = 6.86$

  $C_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = 1297.74 - 14 \times 7.8214 \times 9.4571 = 262.18$

  $C_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 1020.03 - 14 \times 7.8214^2 = 163.58$

  $C_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = 1863.8 - 14 \times 9.4571^2 = 611.67$

BIVARIATE DATA
Scatter plot (including France)
• REPRESENT DATA AS A SCATTER OF POINTS

  Country   Alcohol consumption    Cirrhosis & alcoholism
            (liters/person/year)   (death rate/100,000)
  France    24.7                   46.1
  Italy     15.2                   23.6
  Germany   12.3                   23.7
  :         :                      :
  Ireland   5.6                    6.4
  Norway    4.2                    4.3
  Finland   3.9                    3.6
  Israel    3.1                    5.4

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), France included]

BIVARIATE DATA
Scatter plot (excluding France)
• REPRESENT DATA AS A SCATTER OF POINTS

  Country   Alcohol consumption    Cirrhosis & alcoholism
            (liters/person/year)   (death rate/100,000)
  Italy     15.2                   23.6
  Germany   12.3                   23.7
  :         :                      :
  Ireland   5.6                    6.4
  Norway    4.2                    4.3
  Finland   3.9                    3.6
  Israel    3.1                    5.4

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), France excluded]

REGRESSION
What is it? It’s everywhere!
• TRYING TO PREDICT (OR EXPLAIN) ONE VARIABLE AS A FUNCTION OF ANOTHER ONE
  o Or multiple other ones
• ONE OF THE MOST OFTEN USED STATISTICAL TOOLS
• REGRESSION CAN BE VERY POWERFUL AND COMPLEX
• WE WILL LOOK AT THE MOST BASIC FORM
  o Simple linear regression:
    - Linear relationship between two variables

SIMPLE LINEAR REGRESSION
Linear relationship
• IN GENERAL: A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES x AND y IS
  o A straight line, or equivalently
  o y = a + bx, where
    - a is the intercept (value of y when x = 0)
    - b is the slope (amount that y changes when x changes by 1 unit)

[Figure: a straight line, with x from 0 to 5 and y from 0 to 10]

SIMPLE LINEAR REGRESSION
The challenge
• QUESTION: Given the data, what is the linear relationship, i.e.
  o What is the “correct” straight line? Or equivalently,
  o What is the “correct” equation: yi = a + b xi

• THE ALCOHOL EXAMPLE (WITHOUT FRANCE)

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year)]

• ANSWER: Least squares estimation

SIMPLE LINEAR REGRESSION
Least Squares Estimation
• COMPUTING THE STRAIGHT LINE WHICH IS CLOSER TO THE DATA POINTS THAN ANY OTHER LINE
• MINIMIZING THE SUM OF SQUARED RESIDUALS

  $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$

• LEAST SQUARES ESTIMATORS ARE

  $b = \frac{C_{xy}}{C_{xx}}$   Estimated change in mean y for a 1 unit increase in x
  $a = \bar{y} - b\bar{x}$   Estimated mean y for x = 0

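A minimal sketch of the two estimators, assuming plain Python lists as input. The function names `least_squares` and `rss` and the four toy points are made up for illustration; they are not course data. The last lines check numerically that nudging either coefficient away from the estimates increases the sum of squared residuals.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    b = c_xy / c_xx
    a = y_bar - b * x_bar
    return a, b


def rss(xs, ys, a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Toy points, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

a, b = least_squares(xs, ys)
best = rss(xs, ys, a, b)

# Any other line should do worse than the least squares line.
worse_intercept = rss(xs, ys, a + 0.1, b)
worse_slope = rss(xs, ys, a, b + 0.1)
print(a, b, best, worse_intercept, worse_slope)
```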
SIMPLE LINEAR REGRESSION
Residuals
• SCATTERED ABOUT THE LINE: RESIDUALS
  o Vertical distance between data points and regression line, or
  o yi - a - bxi

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year)]

• LEAST SQUARES ESTIMATION RESULTS IN

  $s_{res} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - a - b x_i)^2} = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{C_{yy}(1 - r_{xy}^2)}{n-2}}$

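The two expressions for the residual standard deviation can be compared numerically. This is a sketch on invented toy points (not course data); `residual_sd` is a hypothetical helper name. It computes s_res once from the residuals themselves and once from C_yy and r_xy.

```python
import math


def residual_sd(xs, ys):
    """Return s_res computed (1) from the residuals and (2) from C_yy and r_xy."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    c_yy = sum((y - y_bar) ** 2 for y in ys)

    # Least squares line and its residual sum of squares.
    b = c_xy / c_xx
    a = y_bar - b * x_bar
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

    r = c_xy / math.sqrt(c_xx * c_yy)
    direct = math.sqrt(rss / (n - 2))
    shortcut = math.sqrt(c_yy * (1 - r ** 2) / (n - 2))
    return direct, shortcut


# Toy points, invented for illustration only.
direct, shortcut = residual_sd([1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0])
print(direct, shortcut)
```

The two values agree up to floating-point error, which is the identity RSS = C_yy(1 - r_xy^2) used on the slide.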
SIMPLE LINEAR REGRESSION
The alcohol example (without France)
• LEAST SQUARES ESTIMATES ARE

  $b = \frac{C_{xy}}{C_{xx}} = \frac{262.18}{163.58} = 1.6027$
  $a = \bar{y} - b\bar{x} = 9.4571 - 1.6027 \times 7.8214 = -3.0786$

• ESTIMATED LINEAR REGRESSION EQUATION

  $y_i = -3.0786 + 1.6027\,x_i$

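As a check on these estimates, the sketch below recomputes b and a from the column totals of the alcohol table (France excluded, sum of squares of x taken as 1020.03). Working from the totals rather than the rounded intermediate values appears to be what reproduces the intercept to four decimals; `predict` is an illustrative helper name.

```python
# Totals from the alcohol table (France excluded).
n = 14
sum_x, sum_y = 109.5, 132.4
sum_xx, sum_xy = 1020.03, 1297.74

x_bar, y_bar = sum_x / n, sum_y / n
c_xy = sum_xy - n * x_bar * y_bar
c_xx = sum_xx - n * x_bar ** 2

b = c_xy / c_xx          # slope
a = y_bar - b * x_bar    # intercept


def predict(x):
    """Fitted death rate (per 100,000) at consumption x (liters/person/year)."""
    return a + b * x


print(round(b, 4), round(a, 4))
# Two fitted points, enough to draw the line on the scatter plot.
print(round(predict(3), 2), round(predict(15), 1))
```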
SIMPLE LINEAR REGRESSION
Drawing the regression line
• CALCULATE 2 POINTS USING THE REGRESSION EQUATION

  $x_i = 3$:   $y_i = -3.0786 + 1.6027 \times 3 = 1.73$
  $x_i = 15$:  $y_i = -3.0786 + 1.6027 \times 15 = 21.0$

• CONNECT THE TWO POINTS

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), with the regression line drawn through the two points]

SIMPLE LINEAR REGRESSION
The alcohol example (without France)

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), with the regression line -3.0786 + 1.6027 xi]

• INTERPRETATION OF THE REGRESSION COEFFICIENTS
  o b: mean number of deaths per 100,000 is estimated to increase by 1.6027 for a 1 liter increase in alcohol intake per person per year
  o a: mean number of deaths per 100,000 is estimated to be -3.0786 for an alcohol intake of 0 liters per person per year

SIMPLE LINEAR REGRESSION
Prediction
• WHAT IS THE EXPECTED DEATH RATE FOR AN ALCOHOL INTAKE OF 15 LITERS PER PERSON PER YEAR?
  o Don’t read from the graph, use the regression equation
  o Answer: 21.0 deaths per 100,000 people
    ($x_i = 15$, $y_i = -3.0786 + 1.6027 \times 15 = 21.0$)

• WHAT IS THE STANDARD DEVIATION OF THE DEATH RATE FOR A GIVEN LEVEL OF ALCOHOL INTAKE?

  $s_{res} = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{C_{yy}(1 - r_{xy}^2)}{n-2}} = \sqrt{\frac{611.67(1 - 0.83^2)}{12}} = 3.98$

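Both answers on this slide are one-line calculations. The sketch below uses the rounded values exactly as they appear on the slide (a = -3.0786, b = 1.6027, C_yy = 611.67, r_xy = 0.83, n = 14); the variable names are illustrative. Carrying more digits of r_xy would shift the last digit of s_res slightly.

```python
import math

# Rounded values as quoted on the slide.
a, b = -3.0786, 1.6027
c_yy, r_xy, n = 611.67, 0.83, 14

# Predicted death rate (per 100,000) at 15 liters per person per year.
predicted = a + b * 15

# Residual standard deviation about the regression line.
s_res = math.sqrt(c_yy * (1 - r_xy ** 2) / (n - 2))

print(round(predicted, 1), round(s_res, 2))
```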
SIMPLE LINEAR REGRESSION
Watch out!
• YOU CAN ALWAYS DRAW A REGRESSION LINE, BUT
  o It is not always appropriate or a good fit

    [Figure: example scatter plots of y against x]

  o The relationship does not necessarily hold outside the observed range
    - Average death rate for an alcohol intake of 1 liter per person per year? The fitted line gives -3.0786 + 1.6027 × 1 = -1.48, an impossible death rate.
  o Regression ≠ causality
    - Drawing a line does not prove that one variable is causing a change in another one
    - Causality could be the case, but just a line is not proof
    - There can be confounding variables
    - An example: national ice cream sales and my mood

LOG TRANSFORMS IN REGRESSION
Why/When?
• When natural comparisons are in terms of ratios.
  REMEMBER: LOG APPROXIMATION PROPERTY
  o Difference between two numbers, as a fraction of their mean, approximately equals the difference between their natural logs:

    $\frac{105 - 95}{100} = 0.10$,   $\log_e(105) - \log_e(95) = 0.10008$

    $\frac{110 - 90}{100} = 0.20$,   $\log_e(110) - \log_e(90) = 0.20067$

  o This works well for fractional differences up to 0.5

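The approximation is easy to try out. The sketch below defines two small helpers (the names are illustrative) and evaluates the two pairs from the slide; the third pair, with a fractional difference of 0.5, is an added illustration of where the approximation starts to loosen.

```python
import math


def fractional_difference(a, b):
    """Difference between a and b as a fraction of their mean."""
    return (a - b) / ((a + b) / 2)


def log_difference(a, b):
    """Difference between the natural logs of a and b."""
    return math.log(a) - math.log(b)


# First two pairs are from the slide; the third is an extra illustration.
for a, b in [(105, 95), (110, 90), (125, 75)]:
    print(a, b, round(fractional_difference(a, b), 5), round(log_difference(a, b), 5))
```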
LOG TRANSFORMS IN REGRESSION
Why/When?
• When natural comparisons are in terms of ratios.
• When data vary over several orders of magnitude.
• For transforming a positively skewed distribution to a more symmetric scale.
  o An example: the Handbook of Biological Statistics (http://udel.edu/~mcdonald/stattransform.html)

[Figure: Eastern mudminnow (Umbra pygmaea).]

LOG TRANSFORMS IN REGRESSION
Why/When?
• When natural comparisons are in terms of ratios.
• When data vary over several orders of magnitude.
• For transforming a positively skewed distribution to a more symmetric scale.
• For making relationships more linear.

LOG TRANSFORMS IN REGRESSION
Pay attention to the following
• PREVIOUSLY WE SAW FOR y = a + bx
  o Least squares estimates: $b = C_{xy} / C_{xx}$, $a = \bar{y} - b\bar{x}$
  o Interpretations:
    - b: mean y is estimated to increase by b for a 1 unit increase in x
    - a: mean y is estimated to be a when x is 0

• THE ABOVE NEEDS TO BE ‘TRANSLATED’ FOR TRANSFORMATIONS

• AN EXAMPLE: y = a + bz, where z = log(x)
  o Least squares estimates: $b = C_{zy} / C_{zz}$, $a = \bar{y} - b\bar{z}$
  o Interpretations:
    - b: mean y is estimated to increase by b for a 1 unit increase in z
    - a: mean y is estimated to be a when z is 0, i.e. mean y is estimated to be a when x is 1

LOG TRANSFORMS IN REGRESSION
An example: BMI and GDP

[Figure: scatter plot of mean BMI against GDP]

LOG TRANSFORMS IN REGRESSION
An example: BMI and log(GDP)

[Figure: scatter plot of mean BMI against log(GDP)]

LOG TRANSFORMS IN REGRESSION
An example: BMI and GDP
• DATA FOR BMI (kg/m^2) AND GDP FOR DIFFERENT COUNTRIES
• PREDICT BMI FOR A COUNTRY WITH A GDP OF 4000
• APPLYING A LOG TRANSFORMATION MIGHT BE USEFUL
• REGRESSION EQUATION FOR BMI ON LOG(GDP)

  BMI = 6.89 + 2 · loge(GDP)

• LOGe(4000) = 8.29, SO PREDICTED BMI IS

  BMI = 6.89 + 2 · loge(4000) = 23.5

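The prediction can be reproduced directly from the fitted equation. In the sketch below, `predict_bmi` is an illustrative helper name, and the coefficients 6.89 and 2 are the ones quoted on the slide. The second calculation is an added illustration of how the slope reads on the original scale: with a natural-log predictor, doubling GDP changes the predicted BMI by b · loge(2), whatever the starting GDP.

```python
import math


def predict_bmi(gdp, a=6.89, b=2.0):
    """Predicted mean BMI from the fitted line BMI = a + b * log_e(GDP)."""
    return a + b * math.log(gdp)


bmi_4000 = predict_bmi(4000)

# Effect of doubling GDP: b * log_e(2), independent of the starting level.
doubling_effect = predict_bmi(8000) - predict_bmi(4000)

print(round(math.log(4000), 2), round(bmi_4000, 1), round(doubling_effect, 3))
```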
SIMPLE LINEAR REGRESSION
A practical approach
1. DRAW A SCATTER PLOT
2. THINK OF POTENTIALLY USEFUL TRANSFORMATIONS
3. TRANSFORM VARIABLES
4. SELECT VARIABLES WITH STRONGEST LINEAR RELATIONSHIP BY
   1. Comparing scatter plots of original and transformed variables
   2. Comparing correlation coefficients of original and transformed variables
5. USE VARIABLES WITH STRONGEST LINEAR RELATIONSHIP
6. CALCULATE REGRESSION EQUATION USING LEAST SQUARES ESTIMATION
   In this course we consider the following 4 options:
   a. y = a + b·x
   b. loge(y) = a + b·x
   c. y = a + b·loge(x)
   d. loge(y) = a + b·loge(x)
7. USE REGRESSION EQUATION FOR PREDICTION

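Step 4 can be sketched in code: compute the correlation coefficient for each of the four options and keep the one furthest from 0. The helper names `correlation` and `compare_options` and the toy data (constructed so that y = x^2 exactly) are assumptions made for illustration, not course material. Comparing scatter plots remains part of the procedure; the correlations alone are only half of step 4.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r_xy = C_xy / sqrt(C_xx * C_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    c_yy = sum((y - y_bar) ** 2 for y in ys)
    return c_xy / math.sqrt(c_xx * c_yy)


def compare_options(xs, ys):
    """Correlation for each of the four options (requires all x > 0 and y > 0)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    return {
        "a: y on x": correlation(xs, ys),
        "b: log(y) on x": correlation(xs, log_y),
        "c: y on log(x)": correlation(log_x, ys),
        "d: log(y) on log(x)": correlation(log_x, log_y),
    }


# Toy data, invented for illustration: y = x^2, so the log-log option is exactly linear.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [x ** 2 for x in xs]

results = compare_options(xs, ys)
best = max(results, key=lambda name: abs(results[name]))
for name, r in results.items():
    print(name, round(r, 4))
print("strongest linear relationship:", best)
```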


BIVARIATE DATA
Scatter plot (excluding France)
• REPRESENT DATA AS A SCATTER OF POINTS

  Country   Alcohol consumption    Cirrhosis & alcoholism
            (liters/person/year)   (death rate/100,000)
  Italy     15.2                   23.6
  Germany   12.3                   23.7
  :         :                      :
  Ireland   5.6                    6.4
  Norway    4.2                    4.3
  Finland   3.9                    3.6
  Israel    3.1                    5.4

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), with the regression line -3.0786 + 1.6027 xi]

CORRELATION COEFFICIENT
The theory
• CORRELATION COEFFICIENT MEASURES THE STRENGTH OF A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES
• FORMULA: $r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}} = \frac{s_{xy}}{s_x s_y}$
• SOME PROPERTIES
  o -1 ≤ rxy ≤ 1
  o rxy > 0: y tends to increase as x increases and vice versa
  o rxy < 0: y tends to decrease as x increases and vice versa
  o rxy = 1 or rxy = -1: all points (xi, yi) lie on a straight line
  o the further rxy is away from 0, the closer the points are to a straight line
  o |rxy| is not affected by linear transformations

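The last property can be checked numerically. The sketch below uses invented toy points (not course data) and an illustrative helper `correlation`: rescaling x by a positive linear transformation leaves r_xy unchanged, and reversing the sign of x flips the sign of r_xy while keeping |r_xy|.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r_xy = C_xy / sqrt(C_xx * C_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    c_yy = sum((y - y_bar) ** 2 for y in ys)
    return c_xy / math.sqrt(c_xx * c_yy)


# Toy points, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

r = correlation(xs, ys)
r_rescaled = correlation([3 + 2 * x for x in xs], ys)  # x -> 3 + 2x
r_flipped = correlation([-x for x in xs], ys)          # x -> -x

print(round(r, 4), round(r_rescaled, 4), round(r_flipped, 4))
```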
CORRELATION COEFFICIENT
Illustrations (1)

CORRELATION COEFFICIENT
Illustrations (2)

CORRELATION COEFFICIENT
The alcohol example (without France)
• REPRESENT DATA AS A SCATTER OF POINTS

  Country   Alcohol consumption    Cirrhosis & alcoholism
            (liters/person/year)   (death rate/100,000)
  Italy     15.2                   23.6
  Germany   12.3                   23.7
  :         :                      :
  Ireland   5.6                    6.4
  Norway    4.2                    4.3
  Finland   3.9                    3.6
  Israel    3.1                    5.4

[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year)]

  $C_{xy} = 262.18$,  $C_{xx} = 163.58$,  $C_{yy} = 611.67$

  $r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}} = \frac{262.18}{\sqrt{163.58 \times 611.67}} = 0.83$

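The same calculation in code, using the three quantities quoted on the slide (variable names are illustrative):

```python
import math

# Values for the alcohol example without France, as quoted on the slide.
c_xy, c_xx, c_yy = 262.18, 163.58, 611.67

r_xy = c_xy / math.sqrt(c_xx * c_yy)
print(round(r_xy, 2))
```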
CORRELATION COEFFICIENT
Illustrations (3)
• IN BOTH GRAPHS, THE CORRELATION COEFFICIENT BETWEEN THE X AND Y POINTS IS rxy = 0.7
• THE PRESENCE OF REMOTE POINTS AND/OR OUTLIERS DOES MODIFY THE APPEARANCE OF THE GRAPHS

SPEARMAN’S CORRELATION COEFFICIENT
The theory
• SPEARMAN’S RANK CORRELATION COEFFICIENT, rS, MEASURES THE STRENGTH OF THE LINEAR RELATIONSHIP BETWEEN ORDERINGS OF 2 VARIABLES
• FORMULA: $r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
  o Where di is the difference between the rank of xi and the rank of yi
• PROPERTIES (SIMILAR TO rxy)
  o -1 ≤ rS ≤ 1
  o rS > 0: y tends to increase as x increases and vice versa
  o rS < 0: y tends to decrease as x increases and vice versa
  o the further rS is away from 0, the stronger the relationship
  o |rS| is not affected by linear transformations

RANKING DATA/OBSERVATIONS
How to go about it?
• RANKING DATA
  o Put n observations in ascending order
  o Assign each observation its order number from 1 to n
  o Rank:
    - For unique observations: the order number
    - For identical observations: the average order number

• AN EXAMPLE: 20 40 30 20 10 20 30
  o Order data:           10   20   20   20   30    30    40
  o Assign order number:  1    2    3    4    5     6     7
  o Final rank:           1    3    3    3    5.5   5.5   7
    (the three 20s share rank (2+3+4)/3 = 3; the two 30s share rank (5+6)/2 = 5.5)

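The ranking rule can be written as a short function. This is a minimal sketch with an illustrative name, `average_ranks`; it returns the ranks in the original order of the data rather than in sorted order.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their order numbers."""
    # Indices of the observations, sorted by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of observations tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Order numbers are 1-based, so the run covers pos+1 .. end+1.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


data = [20, 40, 30, 20, 10, 20, 30]
ranks = average_ranks(data)
print(ranks)
```

Sorting the returned ranks gives 1, 3, 3, 3, 5.5, 5.5, 7, the "Final rank" row of the example.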
SPEARMAN’S CORRELATION COEFFICIENT
Illustrations

SPEARMAN’S CORRELATION COEFFICIENT
An example (1)
• A HOUSEHOLD INCOME AND EXPENDITURE EXAMPLE

  Obs   Household income (£)   Household expenditure (£)
  1     100                    50
  2     100                    100
  3     200                    95
  4     300                    225
  5     400                    280
  6     400                    270
  7     400                    340
  8     500                    380
  9     500                    400
  10    500                    455
  11    500                    480
  12    600                    535

[Figure: scatter plot of household expenditure (£) against household income (£)]

  $r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

SPEARMAN’S CORRELATION COEFFICIENT
An example (2)
• A HOUSEHOLD SPENDING EXAMPLE

  Original data                           Ordered values and their ranks
  Obs   Income (£)   Expenditure (£)      Income (£)   Rank   Expenditure (£)   Rank
  1     100          50                   100          1.5    50                1
  2     100          100                  100          1.5    95                2
  3     200          95                   200          3      100               3
  4     300          225                  300          4      225               4
  5     400          280                  400          6      270               5
  6     400          270                  400          6      280               6
  7     400          340                  400          6      340               7
  8     500          380                  500          9.5    380               8
  9     500          400                  500          9.5    400               9
  10    500          455                  500          9.5    455               10
  11    500          480                  500          9.5    480               11
  12    600          535                  600          12     535               12

SPEARMAN’S CORRELATION COEFFICIENT
An example (3)
• A HOUSEHOLD SPENDING EXAMPLE

  Obs   Income (£)   Expenditure (£)   Rank xi   Rank yi   di     di^2
  1     100          50                1.5       1         0.5    0.25
  2     100          100               1.5       3         -1.5   2.25
  3     200          95                3         2         1      1
  4     300          225               4         4         0      0
  5     400          280               6         6         0      0
  6     400          270               6         5         1      1
  7     400          340               6         7         -1     1
  8     500          380               9.5       8         1.5    2.25
  9     500          400               9.5       9         0.5    0.25
  10    500          455               9.5       10        -0.5   0.25
  11    500          480               9.5       11        -1.5   2.25
  12    600          535               12        12        0      0
  Total                                                           10.5

  $r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 10.5}{12(12^2 - 1)} = 0.9633$

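The whole household example can be reproduced in a few lines. The sketch below assumes the twelve income and expenditure values from the table; `average_ranks` and `spearman` are illustrative helper names. It applies the d² formula exactly as on the slide; when there are ties, as in the income column, this formula can differ slightly from the ordinary correlation coefficient computed on the ranks.

```python
def average_ranks(values):
    """Ranks in ascending order; tied values share the mean of their order numbers."""
    sorted_vals = sorted(values)
    ranks = []
    for v in values:
        first = sorted_vals.index(v) + 1       # first order number held by v
        count = sorted_vals.count(v)           # how many observations equal v
        ranks.append(first + (count - 1) / 2)  # mean of first .. first+count-1
    return ranks


def spearman(xs, ys):
    """Return (sum of d_i^2, r_S) using r_S = 1 - 6*sum(d_i^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Household data from the example table (in £).
income = [100, 100, 200, 300, 400, 400, 400, 500, 500, 500, 500, 600]
expenditure = [50, 100, 95, 225, 280, 270, 340, 380, 400, 455, 480, 535]

sum_d2, r_s = spearman(income, expenditure)
print(sum_d2, round(r_s, 4))
```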