STAT6202 Chapter 2 2012/2013
BIVARIATE DATA
What is it?
VALUES OF 2 QUANTITATIVE VARIABLES x AND y, MEASURED ON n INDIVIDUALS.
o Measurements (x1,y1), (x2,y2),…, (xn,yn), or: (xi,yi), i = 1,2,…,n
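\[ \bar{x} = \frac{\sum x_i}{n}, \qquad s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
\[ \bar{y} = \frac{\sum y_i}{n}, \qquad s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}} \]
\[ s_{xy} = \frac{C_{xy}}{n-1} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}, \qquad r_{xy} = \frac{s_{xy}}{s_x s_y} \]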
BIVARIATE DATA
Alcohol example (without France)
xi = alcohol consumption (liters/person/year), yi = cirrhosis & alcoholism death rate (per 100,000)
i      Country    xi      yi      xi^2      yi^2      xi*yi
1      Italy      15.2    23.6    231.04    556.96    358.72
2      Germany    12.3    23.7    151.29    561.69    291.51
3      Austria    10.9    7       118.81    49        76.3
:      :          :       :       :         :         :
14     Israel     3.1     5.4     9.61      29.16     16.74
Total             109.5   132.4   1020.03   1863.80   1297.74
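\[ n = 14, \qquad \bar{x} = \frac{\sum x_i}{n} = \frac{109.5}{14} = 7.8214, \qquad \bar{y} = \frac{\sum y_i}{n} = \frac{132.4}{14} = 9.4571, \qquad s_y = 6.86 \]
\[ C_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = 1297.74 - 14 \times 7.8214 \times 9.4571 = 262.18 \]
\[ C_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 1020.03 - 14 \times 7.8214^2 = 163.58 \]
\[ C_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = 1863.8 - 14 \times 9.4571^2 = 611.67 \]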
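The summary quantities for the alcohol example can be checked with a few lines of Python. This is only a sketch: it starts from the column totals of the table (taking the sum of xi squared as 1020.03, the value used in the C_xx calculation), and the variable names are illustrative rather than part of the course material.

```python
# Column totals from the alcohol table (without France).
n = 14
sum_x, sum_y = 109.5, 132.4
sum_x2, sum_y2, sum_xy = 1020.03, 1863.80, 1297.74

# Sample means.
x_bar = sum_x / n
y_bar = sum_y / n

# Corrected sums of squares and cross-products.
c_xy = sum_xy - n * x_bar * y_bar
c_xx = sum_x2 - n * x_bar ** 2
c_yy = sum_y2 - n * y_bar ** 2

# Sample standard deviation of y, using the n - 1 divisor.
s_y = (c_yy / (n - 1)) ** 0.5

print(round(x_bar, 4), round(y_bar, 4))
print(round(c_xy, 2), round(c_xx, 2), round(c_yy, 2), round(s_y, 2))
```

Working from unrounded means can change the last printed digit compared with the hand calculation, which uses means rounded to four decimals.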
CHAPTER 2
Describing bivariate data
NOTATION
SCATTER PLOT
BIVARIATE DATA
Scatter plot (including France)
REPRESENT DATA AS A SCATTER OF POINTS
[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), including France]
BIVARIATE DATA
Scatter plot (excluding France)
REPRESENT DATA AS A SCATTER OF POINTS
Country    Alcohol consumption      Cirrhosis & alcoholism
           (liters/person/year)     (death rate/100,000)
Italy      15.2                     23.6
:          :                        :
Ireland    5.6                      6.4
[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), excluding France]
REGRESSION
What is it? It’s everywhere!
TRYING TO PREDICT (/EXPLAIN) ONE VARIABLE AS A
FUNCTION OF ANOTHER ONE
o Or multiple other ones
SIMPLE LINEAR REGRESSION
Linear relationship
IN GENERAL: A LINEAR RELATIONSHIP BETWEEN TWO
VARIABLES X AND Y IS
o A straight line, or equivalently
o y = a+bx, where
a is the intercept (value of y when x = 0)
b is the slope (amount that y changes when x changes by 1 unit)
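A short sketch of what the intercept and slope mean in practice. The values a = 2 and b = 1.5 are hypothetical, chosen only for illustration, and the function name is likewise an assumption.

```python
def line(x, a, b):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x

a, b = 2.0, 1.5  # hypothetical intercept and slope

# The intercept is the value of y at x = 0.
y_at_0 = line(0, a, b)

# The slope is the change in y when x increases by 1 unit (here from 3 to 4).
step = line(4, a, b) - line(3, a, b)

print(y_at_0, step)
```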
[Figure: a straight line plotted as y against x, with x from 0 to 5 and y from 0 to 10]
SIMPLE LINEAR REGRESSION
The challenge
QUESTION: Given the data, what is the linear relationship, i.e.
o What is the “correct” straight line? Or equivalently,
o What is the “correct” equation: yi = a + b xi
[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year)]
SIMPLE LINEAR REGRESSION
Least Squares Estimation
COMPUTING THE STRAIGHT LINE WHICH IS CLOSER TO
THE DATA POINTS THAN ANY OTHER LINE
\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \]
[Figure: scatter plot against consumption (liters per person per year)]
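The idea of a line that is closer to the data than any other can be illustrated numerically. The sketch below uses a small hypothetical data set (not the alcohol data) and the closed-form estimates b = C_xy / C_xx and a = y-bar - b * x-bar, then compares the sum of squared residuals with that of a few nearby lines. Function and variable names are illustrative assumptions.

```python
def least_squares(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    b = c_xy / c_xx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of e_i^2 = (y_i - a - b*x_i)^2 over all points."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Any other line, such as these perturbed ones, has a larger sum of squares.
others = [
    sum_squared_residuals(a + 0.1, b, xs, ys),
    sum_squared_residuals(a, b + 0.1, xs, ys),
    sum_squared_residuals(a - 0.1, b - 0.05, xs, ys),
]
print(a, b, best, others)
```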
SIMPLE LINEAR REGRESSION
Drawing the regression line
CALCULATE 2 POINTS USING THE REGRESSION EQUATION
\[ x_i = 3: \quad y_i = -3.0785 + 1.6027 \times 3 = 1.73 \]
\[ x_i = 15: \quad y_i = -3.0785 + 1.6027 \times 15 = 21.0 \]
[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), with the regression line drawn through the two calculated points]
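As a rough check of these two points, the sketch below recomputes the estimates from the column totals of the alcohol table (without France), using b = C_xy / C_xx and a = y-bar - b * x-bar, and then evaluates the line at x = 3 and x = 15. Variable names are illustrative, and small differences in the last digit of a are due to rounding in the hand calculation.

```python
# Column totals from the alcohol table (without France).
n = 14
sum_x, sum_y = 109.5, 132.4
sum_x2, sum_xy = 1020.03, 1297.74

x_bar = sum_x / n
y_bar = sum_y / n
c_xy = sum_xy - n * x_bar * y_bar
c_xx = sum_x2 - n * x_bar ** 2

# Least squares estimates of slope and intercept.
b = c_xy / c_xx
a = y_bar - b * x_bar

def predict(x):
    """Fitted death rate at consumption x."""
    return a + b * x

# Two points are enough to draw the straight line.
y_at_3 = predict(3)
y_at_15 = predict(15)
print(round(a, 4), round(b, 4), round(y_at_3, 2), round(y_at_15, 1))
```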
SIMPLE LINEAR REGRESSION
The alcohol example (without France)
[Figure: scatter plot of death rate (per 100 000) against consumption (liters per person per year), with the fitted line y_i = -3.0786 + 1.6027 x_i]
SIMPLE LINEAR REGRESSION
Watch out!
YOU CAN ALWAYS DRAW A REGRESSION LINE, BUT
o Not always appropriate/good fit
[Figure: example plots of y against x]
o Relationship does not necessarily hold outside the range of the data
Average death rate for alcohol intake of 1 liter per person per year?
o Regression ≠ causality
Drawing a line does not prove that one variable is causing a change in another one
Causality could be the case, but just a line is not proof
There can be confounding variables
An example: national ice cream sales and my mood
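The extrapolation warning can be made concrete with the fitted alcohol line (without France). The sketch below plugs in an intake of 1 liter per person per year, which is lower than any consumption value listed in the excerpted table; the coefficients are the rounded values quoted on the slides and the function name is an assumption.

```python
# Rounded estimates quoted for the alcohol example (without France).
a = -3.0786
b = 1.6027

def predict(x):
    """Fitted death rate (per 100,000) at consumption x (liters/person/year)."""
    return a + b * x

# Extrapolating below the observed consumption values.
y_at_1 = predict(1.0)
print(round(y_at_1, 2))

# A negative death rate is impossible, so the line cannot be trusted here.
print(y_at_1 < 0)
```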
LOG TRANSFORMS IN REGRESSION
Why/When?
When natural comparisons are in terms of ratios.
REMEMBER: LOG APPROXIMATION PROPERTY
o Difference between two numbers, as a fraction of their mean,
approximately equals difference between their natural logs:
\[ \frac{105 - 95}{100} = 0.10, \qquad \log_e(105) - \log_e(95) = 0.10008 \]
\[ \frac{110 - 90}{100} = 0.20, \qquad \log_e(110) - \log_e(90) = 0.20067 \]
o This works well for fractional differences up to 0.5
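The approximation is easy to check numerically. The sketch below reproduces the two cases above and adds two larger gaps (pairs chosen here purely for illustration) to show how the agreement weakens as the fractional difference grows. Function names are assumptions.

```python
import math

def fractional_difference(a, b):
    """Difference between a and b as a fraction of their mean."""
    return (a - b) / ((a + b) / 2)

def log_difference(a, b):
    """Difference between the natural logs of a and b."""
    return math.log(a) - math.log(b)

# The first two pairs are from the slide; the last two are illustrative.
for a, b in [(105, 95), (110, 90), (125, 75), (150, 50)]:
    frac = fractional_difference(a, b)
    logd = log_difference(a, b)
    print(a, b, round(frac, 2), round(logd, 5))
```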
LOG TRANSFORMS IN REGRESSION
Pay attention to the following
PREVIOUSLY WE SAW FOR y = a + bx
o Least squares estimates: \( b = C_{xy} / C_{xx} \), \( a = \bar{y} - b\bar{x} \)
o Interpretations:
b : mean y is estimated to increase by b for a 1 unit increase in x
a: mean y is estimated to be a when x is 0
LOG TRANSFORMS IN REGRESSION
An example: BMI and GDP
[Figure: scatter plot of mean BMI against GDP]
LOG TRANSFORMS IN REGRESSION
An example: BMI and log(GDP)
[Figure: scatter plot of mean BMI against log(GDP)]
LOG TRANSFORMS IN REGRESSION
An example: BMI and GDP
DATA FOR BMI (kg/m2) AND GDP FOR DIFFERENT
COUNTRIES
SIMPLE LINEAR REGRESSION
A practical approach
1. DRAW A SCATTER PLOT
3. TRANSFORM VARIABLES
CORRELATION COEFFICIENT
The theory
CORRELATION COEFFICIENT MEASURES THE STRENGTH OF
A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES
FORMULA:
\[ r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}} = \frac{s_{xy}}{s_x s_y} \]
SOME PROPERTIES
o –1 ≤ rxy ≤ 1
o rxy > 0: y tends to increase as x increases and vice versa
o rxy < 0: y tends to decrease as x increases and vice versa
o rxy = 1 or rxy = -1: all points (xi, yi) lie on a straight line
o the further rxy is away from 0, the closer the points are to a
straight line
o |rxy| is not affected by linear transformations
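The formula and the last property can be sketched in Python. The first part uses the corrected sums quoted for the alcohol example (without France); the second part uses a small hypothetical data set to show that a linear transformation of x leaves the magnitude of the correlation unchanged. Function and variable names are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient r = C_xy / sqrt(C_xx * C_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    c_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    c_xx = sum((x - x_bar) ** 2 for x in xs)
    c_yy = sum((y - y_bar) ** 2 for y in ys)
    return c_xy / math.sqrt(c_xx * c_yy)

# Alcohol example: corrected sums as quoted in the notes.
r_alcohol = 262.18 / math.sqrt(163.58 * 611.67)
print(round(r_alcohol, 2))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson(xs, ys)

# Rescaling and shifting x keeps r; a negative scale factor only flips the sign.
r_scaled = pearson([3 + 2 * x for x in xs], ys)
r_flipped = pearson([10 - 2 * x for x in xs], ys)
print(r, r_scaled, r_flipped)
```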
CORRELATION COEFFICIENT
Illustrations (1)
CORRELATION COEFFICIENT
Illustrations (2)
CORRELATION COEFFICIENT
The alcohol example (without France)
REPRESENT DATA AS A SCATTER OF POINTS
Country    Alcohol consumption      Cirrhosis & alcoholism
           (liters/person/year)     (death rate/100,000)
Italy      15.2                     23.6
:          :                        :
Ireland    5.6                      6.4
[Figure: scatter plot of death rate against consumption (liters per person per year), excluding France]
\[ C_{xy} = 262.18, \qquad C_{xx} = 163.58, \qquad C_{yy} = 611.67 \]
\[ r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}} = \frac{262.18}{\sqrt{163.58 \times 611.67}} = 0.83 \]
CORRELATION COEFFICIENT
Illustrations (3)
SPEARMAN’s CORRELATION COEFFICIENT
The theory
SPEARMAN’S RANK CORRELATION COEFFICIENT, rS, MEASURES THE
STRENGTH OF THE LINEAR RELATIONSHIP BETWEEN ORDERINGS
OF 2 VARIABLES
FORMULA:
\[ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where d_i is the difference between the two ranks of observation i
AN EXAMPLE: 20 40 30 20 10 20 30
o Order data: 10 20 20 20 30 30 40
o Assign order number: 1 2 3 4 5 6 7
o Final rank: 1 3 3 3 5.5 5.5 7
Tied values share the average of their order numbers: (2+3+4)/3 = 3 and (5+6)/2 = 5.5
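The ranking steps of this example (order the data, number the positions, average the numbers of tied values) can be sketched as follows. The function name `average_ranks` is an illustrative assumption, not something defined in the course.

```python
def average_ranks(values):
    """Rank values from smallest to largest; ties share the mean of their order numbers."""
    # Indices of the values in increasing order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Order numbers are 1-based: pos + 1 up to end + 1; their mean is the shared rank.
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

data = [20, 40, 30, 20, 10, 20, 30]
ranks = average_ranks(data)
print(ranks)          # ranks in the original data order
print(sorted(ranks))  # ranks in increasing order, as on the slide
```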
SPEARMAN’s CORRELATION COEFFICIENT
Illustrations
SPEARMAN’s CORRELATION COEFFICIENT
An example (1)
A HOUSEHOLD INCOME AND EXPENDITURE EXAMPLE
Obs   Household income (£)   Household expenditure (£)
1     100                    50
2     100                    100
3     200                    95
4     300                    225
5     400                    280
6     400                    270
7     400                    340
8     500                    380
9     500                    400
10    500                    455
11    500                    480
12    600                    535
[Figure: scatter plot of household expenditure (£) against household income (£)]
\[ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
SPEARMAN’s CORRELATION COEFFICIENT
An example (2)
A HOUSEHOLD SPENDING EXAMPLE
Obs   Household    Household         Income (£)   Rank   Expenditure (£)   Rank
      income (£)   expenditure (£)   (ordered)           (ordered)
1     100          50                100          1.5    50                1
2     100          100               100          1.5    95                2
3     200          95                200          3      100               3
4     300          225               300          4      225               4
5     400          280               400          6      270               5
6     400          270               400          6      280               6
7     400          340               400          6      340               7
8     500          380               500          9.5    380               8
9     500          400               500          9.5    400               9
10    500          455               500          9.5    455               10
11    500          480               500          9.5    480               11
12    600          535               600          12     535               12
SPEARMAN’s CORRELATION COEFFICIENT
An example (3)
A HOUSEHOLD SPENDING EXAMPLE
\[ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
Obs   Household    Household         Rank   Rank   di     di^2
      income (£)   expenditure (£)   xi     yi
1     100          50                1.5    1      0.5    0.25
2     100          100               1.5    3      -1.5   2.25
3     200          95                3      2      1      1
4     300          225               4      4      0      0
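The remaining rows and the final coefficient can be worked out with a short script. This is a sketch under stated assumptions: it applies the formula above to the twelve household observations, with tied incomes given average ranks, and the helper names are illustrative. With ties present, this formula is usually treated as an approximation to the correlation of the ranks.

```python
def average_ranks(values):
    """Rank values from smallest to largest; ties share the mean of their order numbers."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based order numbers
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Household data from the example.
income = [100, 100, 200, 300, 400, 400, 400, 500, 500, 500, 500, 600]
expenditure = [50, 100, 95, 225, 280, 270, 340, 380, 400, 455, 480, 535]

rank_x = average_ranks(income)
rank_y = average_ranks(expenditure)

# Rank differences d_i and their squares.
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d2 = sum(di ** 2 for di in d)

# Spearman's rank correlation via the d_i formula.
n = len(income)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(d[:4], sum_d2, round(r_s, 4))
```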