
Student's Copy

H2 Mathematics JC 2
Correlation Coefficient and Linear Regression

Include:
• concepts of scatter diagram, correlation coefficient and linear regression
• calculation and interpretation of the product moment correlation coefficient and of the
  equation of the least squares regression line
• interpolation and extrapolation
• use of a square, reciprocal or logarithmic transformation to achieve linearity

Exclude:
• derivation of formulae
• hypothesis tests

In this unit, students will:

(a) understand that bivariate data consists of the values of two variables (independent
    and dependent variables) obtained from the same sample, expressed as ordered pairs;
(b) use a graphic calculator to plot the scatter diagram for a set of bivariate data to
    determine if there is a linear relationship between the two variables;
(c) understand that the correlation coefficient is a measure of the fit of a scatter diagram
    to a linear model;
(d) calculate the product moment correlation coefficient for a set of bivariate data using a
    graphic calculator, and relate the value (in particular, values close to -1, 0 and 1) to
    the appearance of the scatter diagram; [Note: Zero correlation does not necessarily
    imply 'no relationship', but rather 'no linear relationship'.]
(e) understand that a high correlation between two variables does not necessarily
    imply one directly causes the other;
(f) understand the concepts of linear regression and 'least squares' with reference to the
    scatter diagram;
(g) use a graphic calculator to find the equation of the least squares regression line, and
    interpret its slope and intercept; [Note: A different line of regression will be obtained
    if we interchange the independent and dependent variables.]
(h) understand the concepts of extrapolation and interpolation of data and use the
    appropriate regression line to make a prediction or estimate a value in practical
    situations;
(i) use an appropriate transformation to linearise a set of bivariate data to fit the
    regression model.
§ 1 Introduction

In the previous chapters, we have been studying data with one variable. In this chapter, we
shall investigate data with two variables and their relationship. For example, if we can find a
relationship between the midyear examination scores and year-end examination scores of
students, we will be able to use the information to help us make statistical inferences about
the two examination scores.

§ 2 Bivariate Data

Examples of data with two variables include
• displacement of an object and its velocity,
• speed and time of a falling object,
• annual income level and education level of individuals,
• students' examination scores in Chemistry and Physics.
Data with two variables are known as bivariate data and they are usually expressed as
ordered pairs (x, y).

Studies of bivariate data first began in the 1860s when Sir Francis Galton (1822-1911)
investigated the degree of resemblance between children and their parents. In a study carried
out by Galton and his student, Karl Pearson, they measured the heights of 1078 fathers
and the heights of their sons. In order to investigate the data, the
first use of correlation and regression studies of data emerged.

Suppose we wish to examine the relationship between the midyear scores and final
examination scores of students. We may want to find a model that can be used to predict the
final examination score for a student having a known midyear examination score.

§ 3 Scatter Diagram (or Scatter Plot)

A scatter diagram is obtained when each of the observed bivariate data $(x_i, y_i)$,
$i = 1, 2, \ldots, n$, is plotted on the Cartesian plane.
Each $(x_i, y_i)$ on the scatter diagram represents a single data point. From the scatter diagram,
we can judge visually if there is any relationship between x and y.

An example of a data set and its scatter diagram is shown below. Each column is one student.

Midyear score              40   50   55   60   65   80
Final examination score    50   53   58   60   70   88

[Figure: scatter diagram of Final Examination Score against Midyear Score]
Example 1

The temperature, T, in degrees Celsius (°C), of the tyre of a car is measured when the car
travels at different speeds, v (km h⁻¹). Eight sets of data are obtained. Sketch the scatter
diagram for the data.

v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98   104

Solution

Using TI-84+:

Create two new lists, L1 and L2, using the v and T values respectively.
Press [STAT], select 1:Edit, press [ENTER]. Key into L1 and L2 the values for v and T
respectively.

To plot the data, press [2nd] [STAT PLOT] for Stat Plot.
Press [ENTER] to select Plot 1.
Press [ENTER] to highlight On.
Under Type, choose the first icon for scatter plot.
To select L1 for Xlist and L2 for Ylist, press [2nd] [1] for L1 and [2nd] [2] for L2.
For 'Mark', select the desired style to represent the data points.

To view the scatter diagram, press [ZOOM], select 9:ZoomStat, press [ENTER].
To read the coordinates of each point, press [TRACE].

Using CASIO GC:

1. Go to the statistics menu.
2. Create two new lists, List 1 and List 2, using the given v and T values respectively.
3. To plot the data, select Graph.
4. Select [SET] (settings).
5. Choose the settings for a scatter plot (graph type, lists, Frequency and Mark Type).
6. Choose List 1 for XList (independent) and List 2 for YList (dependent).
7. For Mark Type, select the style to represent the data points.
8. To view the plot, leave the settings screen and draw the graph.
9. To read the coordinates of each point, use (TRACE).
§ 4 Analysis of Scatter Diagrams

Scatter diagrams can reveal general patterns and relationships between two variables. We can
comment on the following:

1. Direction

The two variables x and y can be positively related or negatively related.
In general, if y increases as x increases, then x and y are positively related.
In general, if y decreases as x increases, then x and y are negatively related.

[Figure: Negative relationship]

2. Form

The form of a scatter diagram refers to the shape of the distribution of
data points. The relationship may be linear or curvilinear and perhaps
there is no clear relationship.

[Figures: Quadratic relationship; No clear relationship]

3. Strength

The strength of a scatter diagram describes how tightly clustered the
points are to the underlying form.
If the data points are tightly clustered to a straight line, there is a
strong linear relation between x and y.
If the data points are loosely clustered to a straight line, it may indicate
a weak linear relation between x and y.

[Figures: Strong linear relationship; Weak linear relationship]

Example 2

Comment on the relationship between the two variables based on the scatter diagrams below.

[Figure: scatter diagrams for Example 2]

Note

Scatter diagrams can also give us visual evidence of outliers or suspicious observations.
These data points are points in a sample that lie outside the overall pattern of a distribution.

§ 5 Correlation

Interpretation of the relationship between the variables of sample data solely based on the
scatter diagram is subjective. It can even be deceiving when different scales for the axes are
used. The scatter diagrams below are plotted using the same set of data but on different scales
for the y-axis.

[Figure: the same data plotted with two different y-axis scales]

Since the scale of a scatter diagram can be manipulated, it may be more helpful to use a
numerical approach to measure the strength of a relationship between two variables. The
product moment correlation coefficient, r, is used to measure the strength of a linear
relationship. The formula for r, with n data points, is as follows:

\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}
     = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \]

Using

\[ S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad
   S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}, \]
\[ S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n}, \]

we have

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}. \]

Example 3

In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample of
ten randomly chosen students were recorded and summarised as shown.

Student   1    2    3    4    5    6    7    8    9    10
x         27   22   15   35   30   52   35   55   40   40
y         30   26   25   42   38   40   32   54   50   43

$\sum xy = 14257$, $\sum x^2 = 13717$, $\sum y^2 = 15298$, $\sum x = 351$, $\sum y = 380$

Find the product moment correlation coefficient of the sample.

Solution

Since the actual data is given, we can use the GC to calculate the product moment correlation
coefficient.

Using TI-84+:

To turn on Stat Diagnostics for TI-84+:
Press [MODE] and scroll down to STAT DIAGNOSTICS, select On, and press [ENTER].
Subsequently, Stat Diagnostics will be switched on by default and we will not need to
activate it manually.

Store data under L1 (x) and L2 (y). (Recall Example 1.)
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L2 under Ylist, and press Calculate.

The GC displays y = a + bx with a ≈ 14.908, b ≈ 0.65789, r² ≈ 0.70466 and r ≈ 0.83944.

From the results displayed, r = 0.839.
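The GC result can be cross-checked by hand from the summary statistics of Example 3. The sketch below is only an illustration: the function name `pmcc_from_sums` is a hypothetical helper, not part of the notes, and it simply applies the formula for r in terms of $S_{xy}$, $S_{xx}$ and $S_{yy}$.

```python
import math


def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return r computed from the summary statistics of n data pairs."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary statistics quoted in Example 3 (n = 10 students).
r = pmcc_from_sums(n=10, sum_x=351, sum_y=380,
                   sum_x2=13717, sum_y2=15298, sum_xy=14257)
print(f"r = {r:.3f}")
```

Working from the sums is useful when a question supplies only the summary statistics rather than the raw data.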

Using CASIO GC:

1. Store data under List 1 (x) and List 2 (y).
2. Select [CALC], then [REG], and select X (linear).

Note: If the independent variable is not in List 1 or the dependent variable is not in List 2,
select [SET] and change the lists in 2Var XList (independent) and 2Var YList (dependent).

We will be looking at the significance of the other values that appear in the screenshot in the
next section.

Note

1. For any sample data, -1 ≤ r ≤ 1.

2. The sign of the correlation coefficient indicates the direction of linear correlation.
   When r > 0, the correlation between x and y is positive.
   When r < 0, the correlation between x and y is negative.

3. The magnitude of r indicates the strength of the linear correlation.
   Generally, the bigger the value of |r|, the stronger the linear relationship.
   When r = 1, we have perfect positive linear correlation. All the points lie on a straight
   line with positive gradient.
   When r = -1, we have perfect negative linear correlation. All the points lie on a straight
   line with negative gradient.
   When r = 0, there is no linear correlation between x and y. It does not mean there is
   no relationship between the two variables.

4. A scatter diagram together with the product moment correlation coefficient should be
   used to determine if there is a linear relationship. See Appendix A for an example of data
   sets with the same product moment correlation coefficient but different scatter diagrams.

5. Correlation does not imply causation.

6. In general,
   0.8 ≤ |r| ≤ 1 indicates strong linear correlation between the two variables.
   0.5 ≤ |r| < 0.8 indicates moderately strong linear correlation between the two variables.
   |r| < 0.5 indicates weak linear correlation between the two variables.
   The above classification should only be treated as a guide.

7. The measure r has no units. It is independent of the scale of measurement of the
   variables. (See Appendix B)

Example 4

The temperature, T, in degrees Celsius (°C), of the tyre of a car is measured when the car
travels at different speeds, v (km h⁻¹). Eight sets of data are obtained.

v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98   104

(i)  Find the product moment correlation coefficient between v and T.
(ii) Find the corresponding product moment correlation coefficient between v and F,
     where F is the temperature of the tyre measured in degrees Fahrenheit (°F). (Use
     the formula $F = \frac{9}{5}T + 32$.)

Solution (TI-84+)

(i) Using TI-84+, r = 0.975.

Store data under L1 (v) and L2 (T).
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L2 under Ylist, and press Calculate.
The GC displays r² ≈ 0.95061 and r ≈ 0.97499.
From the results displayed, r ≈ 0.975.

(ii) Using TI-84+, r = 0.975.

Scroll up to the header L3 and highlight it by pressing [ENTER].
Key into the header L3 the formula for F, which is 9/5 L2 + 32, and press [ENTER]. Note that
to enter L2, we press [2nd] [2].
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L3 under Ylist, and press Calculate.
The GC displays y = a + bx with a ≈ 81.843, b ≈ 1.5729, r² ≈ 0.95061 and r ≈ 0.97499.
From the results displayed, r ≈ 0.975.
Solution (CASIO GC):

(i) Using CASIO GC, r = 0.975.

1. Store data under List 1 (v) and List 2 (T).
2. Select [CALC], then [REG], and select X (linear).

(ii) Using CASIO GC, r = 0.975.

1. Scroll up to the header of List 3.
2. Key in the formula for F. Since List 2 stores the data for T and $F = \frac{9}{5}T + 32$,
   key in List 3 = (9 ÷ 5) List 2 + 32, using the [List] key to enter "List".

Note
Notice that the values of r calculated in (i) and (ii) are the same. This illustrates an important
property of r: it is independent of the scale of measurement for temperature.
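The scale-independence of r in Example 4 can also be checked without a GC. The following is a minimal sketch, assuming the eight (v, T) pairs of Example 4; the helper name `pmcc` is illustrative only.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


v = [20, 30, 40, 50, 60, 70, 80, 90]        # speed
T = [45, 52, 64, 66, 91, 86, 98, 104]       # temperature in degrees Celsius
F = [9 / 5 * t + 32 for t in T]             # same temperatures in Fahrenheit

r_celsius = pmcc(v, T)
r_fahrenheit = pmcc(v, F)
print(round(r_celsius, 3), round(r_fahrenheit, 3))
```

Both values agree because converting to Fahrenheit multiplies every deviation of T from its mean by the same positive constant, which cancels in the ratio.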

§ 6 Linear Regression

In the last section, we used both the scatter diagram and the product moment correlation
coefficient between the two variables to indicate whether it is meaningful to model the
observed data with a straight line. If it appears that the data fits into a linear model, we then
attempt to find an equation to represent the relationship by linear regression.
In Example 4, the speed, v, is controlled and the temperature, T, is measured based on v.
Thus, v is known as the independent variable while T, whose value depends on v, is called
the dependent variable.

In Example 3, we were investigating the number of sit-ups and push-ups a student can do. In
this case, there is no clear dependency between the two variables.

§ 7 Method of Least Squares

Consider the data given in the scatter diagram below. We randomly draw a line to fit the data
first. The line drawn below may not be the line that best represents the data, the line of best
fit. There are several ways to find a line of best fit and we use a method most commonly used
for finding such a line called the least squares method. The line obtained by this method is
called the least squares regression line.

To understand this method, we consider any line drawn to fit the data, for example

y = a + bx.

We consider x as the independent variable and y as the dependent variable. The circled
points in the diagram correspond to observed data points. If we were to use the line given
above to model the data, we would predict a different y value for the corresponding x-value.
The difference is the error, e, which is known as the residual and is calculated as
e = observed y value - predicted y value.
Each pair of observations $(x_i, y_i)$ produces a residual, $e_i$, for $i = 1, 2, \ldots, n$.

$\sum_{i=1}^{n} e_i^2$ is known as the sum of the squares of the residuals and we use
$\sum e^2$ to denote $\sum_{i=1}^{n} e_i^2$.

The least squares regression line of y on x is the line that produces the smallest $\sum e^2$,
where x is the independent variable and y is the dependent variable.
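To make the idea of "smallest sum of squared residuals" concrete, the sketch below compares an arbitrarily chosen line with the least squares line for the push-up and sit-up data of Example 3. The arbitrary line y = 10 + 0.8x and the function names are assumptions made purely for illustration; the least squares coefficients use the standard formulas $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b) of the least squares regression line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]   # push-ups
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]   # sit-ups

a, b = least_squares_line(x, y)
sse_best = sum_squared_residuals(x, y, a, b)
sse_guess = sum_squared_residuals(x, y, 10.0, 0.8)  # an arbitrary trial line
print(round(sse_best, 1), round(sse_guess, 1))
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as `sse_best`.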

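§ 8 Equation of the Least Squares Regression Line of y on x

Consider a line with equation y = a + bx. Given a set of data, we want to find the values of a
and b that minimise

\[ \sum e^2 = \sum (\text{observed } y \text{ value} - \text{predicted } y \text{ value})^2
            = \sum_{i=1}^{n} \bigl(y_i - (a + bx_i)\bigr)^2. \]

It can be proven (Appendix C) that in order to minimise $\sum e^2$,

\[ a = \bar{y} - b\bar{x} \quad \text{------ (1)} \qquad \text{and} \qquad
   b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}. \]

Consider the general equation of a line y = a + bx and substituting (1) into the equation,

\[ y = (\bar{y} - b\bar{x}) + bx \;\Rightarrow\; y - \bar{y} = b(x - \bar{x}). \]

Therefore, the equation of the regression line of y on x is

\[ y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \]

is also known as the regression coefficient of y on x.

Note

1. $\sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n}$

2. $\sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$

3. Another formula for b is $b = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$.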

§ 9 Equation of the Least Squares Regression Line of x on y

Let the equation of the regression line of x on y be x = c + dy, where c and d are to be found
so that the sum of the squares of the residuals in the x-direction is a minimum.
We consider y as the independent variable and x as the dependent variable.

We can show that the equation of the regression line of x on y is

\[ x - \bar{x} = d(y - \bar{y}), \quad \text{where } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}. \]

Note

1. $\sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$

2. Another formula for d is $d = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$.

3. The equation of the regression line of x on y cannot be found by making x the
   subject in the equation of the regression line of y on x.
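The last point can be seen numerically. The sketch below, again using the Example 3 data as an illustration, compares the regression coefficient d of x on y with the slope 1/b obtained by rearranging the line of y on x; the variable names are assumptions for this example only.

```python
x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]   # push-ups
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]   # sit-ups

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

b = s_xy / s_xx              # regression coefficient of y on x
d = s_xy / s_yy              # regression coefficient of x on y
c = mean_x - d * mean_y      # intercept of the line x = c + d*y

rearranged_slope = 1 / b     # slope if x were made the subject of y = a + b*x
print(round(d, 3), round(rearranged_slope, 3))
```

The two slopes differ (roughly 1.07 against 1.52 here), so the rearranged line is not the regression line of x on y. They would coincide only when r = ±1.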
§ 10 Use of Regression Lines for Estimation

Consider the data (x, y). Given a value of one of the variables, regression lines can be used to
predict or estimate the value of the other. The choice of the regression line used depends on
the context of the situation:

(a) If there is a clear indication that x is the independent variable, we will always use the
    regression line of y on x to do estimation.

(b) For cases where there is no clear independent variable, if we want to estimate y for a
    given value of x, we use the regression line of y on x. If we want to estimate x for a
    given value of y, use the regression line of x on y.

Note

1. When we do estimation within the given range of values of the data, it is known as
   interpolation.

2. When we do estimation outside the given range of values of the data, it is known as
   extrapolation. Values obtained by extrapolation may not be reliable since
   the relationship between the two variables may not follow the same linear model
   outside the range of the data.

3. Estimates using regression lines are more reliable if both the following conditions are
   met:
   (a) The value of r of the data is close to ±1, and the scatter diagram also suggests
       that there is a strong linear correlation.
   (b) The estimation is done within the given range of values of data.
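Whether an estimate is an interpolation or an extrapolation depends only on where the given value falls relative to the observed data. A small hypothetical helper, shown here with the speeds from Example 4, makes the check explicit.

```python
def estimation_type(observed, value):
    """Classify an estimate at `value` relative to the observed data range."""
    if min(observed) <= value <= max(observed):
        return "interpolation"
    return "extrapolation"


speeds = [20, 30, 40, 50, 60, 70, 80, 90]   # observed values of v in Example 4

print(estimation_type(speeds, 55))    # inside the range 20 to 90
print(estimation_type(speeds, 120))   # outside the range 20 to 90
```

An estimate classified as extrapolation should be treated with caution even when r is close to ±1.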

Example 5

In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample
of ten randomly chosen students were recorded in the table below.

Student   1    2    3    4    5    6    7    8    9    10
x         27   22   15   35   30   52   35   55   40   40
y         30   26   25   42   38   40   32   54   50   43

(i)   Find the equation of the regression line of y on x.

(ii)  Interpret the slope and intercept in the context of the question.

(iii) Predict the number of sit-ups a student can do when he can do 50 push-ups. State, with
      a reason, whether the predicted value is reliable.

(iv)  State, with a reason, whether it is reliable to use the equation in (i) to predict the
      number of sit-ups when 60 push-ups are done.

Solution
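As a numerical check for Example 5 (not a replacement for the GC working or the written reasoning), the sketch below computes the regression line of y on x and the prediction at x = 50. The function and variable names are illustrative assumptions.

```python
x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]   # push-ups
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]   # sit-ups


def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    s_xx = sum((p - mean_x) ** 2 for p in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


a, b = regression_y_on_x(x, y)
predicted_at_50 = a + b * 50

print(f"y = {a:.1f} + {b:.3f}x")
print(f"estimate at x = 50: {predicted_at_50:.1f}")
print(f"observed x range: {min(x)} to {max(x)}")
```

Since 50 lies inside the observed range of x and 60 lies outside it, part (iii) is an interpolation and part (iv) an extrapolation.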

Example 6

An electrical fire was switched on in a cold room and the temperature of the room was noted
at 5-minute intervals.

Time, x (in minutes) from switching on fire   0     5     10    15    20    25    30     35     40
Temperature, y (in °C)                        0.4   1.5   3.4   5.5   7.7   9.7   11.7   13.5   15.4

(a) Find the equation of the regression line of y on x and the product moment
    correlation coefficient between x and y. Comment on what its value implies about
    the relationship between x and y.

(b) Explain why the regression line of y on x, rather than the regression line of x on y,
    should be used to predict the time that has passed after switching on the fire if the
    temperature is 9 °C.

(c) Predict the temperature of the room when the fire is switched on for 30 and 60
    minutes. Comment on the reliability of your answers.

(d) Starting with the equation of the regression line of y on x, deduce the equation of
    the regression line of
    (i)   y on t, where y is the temperature in °C and t is time in hours,
    (ii)  z on x, where z is the temperature in Kelvin (K) and x is time in minutes.
          (A temperature in °C is converted to K by adding 273.)
    (iii) Comment on the values of r obtained in (i) and (ii).
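Part (d) of Example 6 is about how a change of units affects the regression line. The sketch below refits the line after each change so the effect can be seen directly. It assumes the nine readings were taken at x = 0, 5, ..., 40 minutes (the first two times are inferred from the 5-minute interval), and the helper names are illustrative.

```python
def regression_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    s_xx = sum((p - mean_x) ** 2 for p in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


x = list(range(0, 45, 5))                               # minutes (assumed 0, 5, ..., 40)
y = [0.4, 1.5, 3.4, 5.5, 7.7, 9.7, 11.7, 13.5, 15.4]    # degrees Celsius

a, b = regression_line(x, y)                            # y on x
a_t, b_t = regression_line([xi / 60 for xi in x], y)    # y on t, t in hours
a_z, b_z = regression_line(x, [yi + 273 for yi in y])   # z on x, z in kelvin

print(f"y = {a:.3f} + {b:.4f} x")
print(f"y = {a_t:.3f} + {b_t:.2f} t")
print(f"z = {a_z:.3f} + {b_z:.4f} x")
```

Measuring time in hours multiplies the slope by 60 and leaves the intercept unchanged, while converting to kelvin adds 273 to the intercept and leaves the slope unchanged.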

§ 11 Properties of Regression Lines

In general, the regression line of y on x, i.e., y = a + bx, is different from the regression line
of x on y, i.e., x = c + dy.

These are some observations about the lines and r:

(i)   Both the regression line of y on x, y = a + bx, as well as the regression line of x on y,
      x = c + dy, pass through the point $(\bar{x}, \bar{y})$.

(ii)  If r > 0, both regression coefficients b and d are positive.
      If r < 0, both regression coefficients b and d are negative.

(iii) $r^2 = bd$.

(iv)  The larger the value of |r|, i.e., almost 1, the "closer" the regression line of y on x
      is to the regression line of x on y.
      If r = ±1, the regression lines of y on x and x on y are identical.
      If r = 0, the regression line of y on x and the regression line of x on y are a
      pair of horizontal and vertical lines.
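Properties (i) and (iii) can be verified numerically on any data set. The following sketch does so for the Example 3 data; it is only a check on one sample, not a proof, and the variable names are assumptions.

```python
import math

x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
s_xx = sum((p - mean_x) ** 2 for p in x)
s_yy = sum((q - mean_y) ** 2 for q in y)

b = s_xy / s_xx                      # y on x: y = a + b*x
a = mean_y - b * mean_x
d = s_xy / s_yy                      # x on y: x = c + d*y
c = mean_x - d * mean_y
r = s_xy / math.sqrt(s_xx * s_yy)

# Property (i): both lines pass through the mean point.
on_line_y_on_x = math.isclose(a + b * mean_x, mean_y)
on_line_x_on_y = math.isclose(c + d * mean_y, mean_x)

# Property (iii): r squared equals the product of the regression coefficients.
product_matches = math.isclose(b * d, r ** 2)
print(on_line_y_on_x, on_line_x_on_y, product_matches)
```

Property (iii) follows directly from the formulas: $bd = S_{xy}^2 / (S_{xx} S_{yy}) = r^2$.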

§ 12 Transformations

Not all relationships between x and y are linear. If the relationship between x and y is not
linear, we can sometimes use a suitable transformation to linearise the relationship. Here are
some examples:

Relationship         Transformation                         Linear Relationship

$y = ax^b$           Take natural logarithm                 $\ln y = \ln a + b \ln x$,
                     (or take logarithm of another base)    i.e., ln y and ln x have a linear relationship.

$y = ae^{bx}$        Take natural logarithm                 $\ln y = \ln a + bx$,
                     (or take logarithm of another base)    i.e., ln y and x have a linear relationship.

$y = \sqrt{ax+b}$    Square both sides                      $y^2 = ax + b$,
                                                            i.e., y² and x have a linear relationship.

$y = \frac{1}{ax+b}$ Take reciprocal                        $\frac{1}{y} = ax + b$,
                                                            i.e., 1/y and x have a linear relationship.

Example 7

A school bookshop sells a popular guidebook. The sales of the guidebook in each of five
successive years (Yr 1 - 5) are given.

Year (x)     1      2      3      4       5
Sales (y)    1000   3000   7000   14000   21000

(i)   Draw a scatter plot of the above data and find the product moment correlation
      coefficient between x and y. Comment on the suitability of the use of a linear
      model, y = ax + b, for the sales of the guidebook.
(ii)  By calculating the product moment correlation coefficient between ln y and ln x,
      comment on the suitability of the use of the model $y = ax^b$ as compared to the linear
      model in (i).
(iii) Find the least squares regression line of ln y on ln x. Hence estimate the values of a
      and b.

Solution

(i)

[Figure: scatter plot of sales (y) against year (x)]

Using GC, the product moment correlation coefficient, r, between x and y is ______.

(ii)

Key in x and y data into L1 and L2 respectively.
Scroll up to the header L3 and highlight it by pressing [ENTER].
Key in the transformation, in this case L3 = ln(L1), and press [ENTER] to generate the
transformed data.
Similarly, generate L4 = ln(L2).
Find the equation of the regression line using L3 and L4.

The GC displays:
y=a+bx
a=6.815631664
b=1.919318251
r²=.9932916127
r=.9966481621

(iii)

Note

From part (i) of the above example, we have seen that the value of r may be very close to 1
but it does not necessarily imply that a linear model y = a + bx is the most suitable for the
data. It is always important to draw a scatter diagram to decide which model is more suitable.
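The GC steps for Example 7 amount to fitting a straight line to (ln x, ln y). The sketch below reproduces that calculation and also computes r for the untransformed data for comparison. It assumes the model $y = ax^b$ from part (ii); the helper `fit` is an illustrative name.

```python
import math


def fit(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    s_xx = sum((p - mean_x) ** 2 for p in xs)
    s_yy = sum((q - mean_y) ** 2 for q in ys)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope, s_xy / math.sqrt(s_xx * s_yy)


year = [1, 2, 3, 4, 5]
sales = [1000, 3000, 7000, 14000, 21000]

_, _, r_linear = fit(year, sales)

ln_x = [math.log(v) for v in year]
ln_y = [math.log(v) for v in sales]
ln_a, b, r_log = fit(ln_x, ln_y)     # ln y = ln a + b ln x
a = math.exp(ln_a)                   # back-transform the intercept

print(f"r (x, y)       = {r_linear:.3f}")
print(f"r (ln x, ln y) = {r_log:.4f}")
print(f"ln y = {ln_a:.3f} + {b:.3f} ln x, so a is about {a:.0f} and b about {b:.2f}")
```

The intercept of the fitted line is ln a, so a is recovered by taking the exponential, while the slope is b itself.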

Summary

A scatter diagram is obtained when each of the observed bivariate data $(x_i, y_i)$,
$i = 1, 2, \ldots, n$, is plotted on the Cartesian plane.

The product moment correlation coefficient is

\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}
     = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}. \]

In order to determine if there is a linear relationship, a scatter diagram, together with the
product moment correlation coefficient, should be used.

-1 ≤ r ≤ 1

When r > 0, there is positive linear correlation between x and y.
When r < 0, there is negative linear correlation between x and y.
When r = 0, there is no linear correlation between x and y.
r is independent of the scale of measurement of x and y.

The least squares regression line of y on x is the line that produces the smallest $\sum e^2$.

Equation of the regression line of y on x is

\[ y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}. \]

Equation of the regression line of x on y is

\[ x - \bar{x} = d(y - \bar{y}), \quad \text{where } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}. \]

If there is a clear independent variable x, we will always use the regression line of y on x
to do estimation.

For cases where there is no clear independent variable, if we want to estimate y for a
given value of x, we use the regression line of y on x. If we want to estimate x for a given
value of y, use the regression line of x on y.

Appendix A - Anscombe Quartet

To illustrate the importance of using the scatter diagram to support the product moment
correlation coefficient, F. J. Anscombe constructed the following 4 data sets which have
almost exactly the same r value (about 0.82) but appear very different when graphed:

Anscombe's Quartet

       I                II               III               IV
   x       y        x       y        x       y        x       y
 10.0    8.04     10.0    9.14     10.0    7.46      8.0    6.58
  8.0    6.95      8.0    8.14      8.0    6.77      8.0    5.76
 13.0    7.58     13.0    8.74     13.0   12.74      8.0    7.71
  9.0    8.81      9.0    8.77      9.0    7.11      8.0    8.84
 11.0    8.33     11.0    9.26     11.0    7.81      8.0    8.47
 14.0    9.96     14.0    8.10     14.0    8.84      8.0    7.04
  6.0    7.24      6.0    6.13      6.0    6.08      8.0    5.25
  4.0    4.26      4.0    3.10      4.0    5.39     19.0   12.50
 12.0   10.84     12.0    9.13     12.0    8.15      8.0    5.56
  7.0    4.82      7.0    7.26      7.0    6.42      8.0    7.91
  5.0    5.68      5.0    4.74      5.0    5.73      8.0    6.89

Note that all 4 sets of data have the same mean, variance, r value and linear regression line
(to about two decimal places).

[Figure: scatter diagrams of data sets I, II, III and IV]
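The near-identical r values of the four Anscombe data sets can be confirmed directly. The sketch below uses the tabulated values; the helper name `pmcc` and the dictionary layout are illustrative choices.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    s_xx = sum((p - mean_x) ** 2 for p in xs)
    s_yy = sum((q - mean_y) ** 2 for q in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I, II and III
x_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x_4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pmcc(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(name, round(r, 2))
```

Every set gives r of about 0.82, even though only set I resembles a typical linear scatter. This is why r should always be read alongside the scatter diagram.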

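Appendix B - Properties of r

The measure r will not change if we add a constant to, or multiply by a positive constant, all
the values of a variable.
If s = a + bx and t = c + dy, where a, b, c, d are constants, with b and d both positive,
then $r_{st} = r_{xy}$.

\[ r_{st} = \frac{\sum (s-\bar{s})(t-\bar{t})}{\sqrt{\sum (s-\bar{s})^2 \sum (t-\bar{t})^2}} \]
\[ = \frac{\sum \bigl[(a+bx)-(a+b\bar{x})\bigr]\bigl[(c+dy)-(c+d\bar{y})\bigr]}
          {\sqrt{\sum \bigl[(a+bx)-(a+b\bar{x})\bigr]^2 \sum \bigl[(c+dy)-(c+d\bar{y})\bigr]^2}} \]
\[ = \frac{\sum (bx-b\bar{x})(dy-d\bar{y})}{\sqrt{\sum (bx-b\bar{x})^2 \sum (dy-d\bar{y})^2}} \]
\[ = \frac{bd \sum (x-\bar{x})(y-\bar{y})}{\sqrt{b^2 d^2 \sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} \]
\[ = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = r_{xy}. \]

(Note: $\sqrt{b^2 d^2} = bd$ since bd is positive.)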

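Appendix C - Partial Differentiation to Determine Regression Coefficients

Let $y_1, y_2, \ldots, y_n$ be the observed values corresponding to the values $x_1, x_2, \ldots, x_n$.
The least squares line y = a + bx is the line that minimises the sum of the squares of the
residuals, i.e.

\[ S = \sum e_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2. \]

Calculating the partial derivative of S with respect to a (note that the notation
$\frac{\partial S}{\partial a}$ is used instead of $\frac{dS}{da}$, with b being treated as a constant):

\[ \frac{\partial S}{\partial a} = -2\sum (y_i - a - bx_i) = -2\Bigl(\sum y_i - na - b\sum x_i\Bigr). \]

Setting $\frac{\partial S}{\partial a} = 0$ gives

\[ a = \frac{\sum y_i}{n} - b\,\frac{\sum x_i}{n} = \bar{y} - b\bar{x}. \]

Calculating the partial derivative of S with respect to b:

\[ \frac{\partial S}{\partial b} = -2\sum x_i (y_i - a - bx_i), \quad \text{and let } \frac{\partial S}{\partial b} = 0. \]

Thus

\[ \sum x_i y_i - a\sum x_i - b\sum x_i^2 = 0
   \;\Rightarrow\; \sum x_i y_i - (\bar{y} - b\bar{x})\,n\bar{x} - b\sum x_i^2 = 0 \]
\[ \Rightarrow\; b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
                   = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}. \]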

Questions

1. The product moment correlation coefficient is denoted by r. Comment on the validity of
   the following statements.

   (a) r = 0 for a set of data (x, y) implies that X and Y are unrelated.
   (b) If X is the number of cigarettes smoked per day by people dying of lung cancer
       and Y is the age at death, then r = -0.9 implies that smoking more cigarettes
       per day causes a person to die younger.
   (c) The value of r for a set of sample data (x, y) is 1 means that a linear relation
       holds for the population (X, Y).

2. The product moment correlation coefficient for a set of data (x, y) is denoted by r.

   (a) State a data point which can be included such that r will remain the same for the new
       set of data.

   (b) Express, in terms of r,
       (i)   the product moment correlation coefficient of (y, x).
       (ii)  the product moment correlation coefficient when all values of x are increased
             by 5.
       (iii) the product moment correlation coefficient for the set of (*,* y).

3. With the aid of a suitable diagram, describe the difference between the regression line
   of Y on X and that of X on Y. Under what circumstances would the above two lines be
   coincident?

   The following summarizes the data from 10 sets of lengths (x) and breadths (y) in mm:

   $\sum x = 1782$, $\sum y = 1483$, $\sum x^2 = 318086$, $\sum y^2 = 220257$, $\sum xy = 264582$

   (i)   Find the value of the product moment correlation coefficient.
   (ii)  Find the equation of the regression line of y on x and that of x on y.
   (iii) Predict the breadth when the length is 185 mm. Comment on the reliability of
         your prediction.

   [(i) 0.744 (ii) y = 0.584x + 44.3; x = 0.949y + 37.4 (iii) 152]

4. Explain, with the aid of a diagram, what is meant by the term "least squares" in the
   context of regression lines.
   Delegates who travelled by car to a Statistics Conference were asked to report d, the
   distance travelled (in km), and t, the time taken (in minutes). A random sample of the
   reported values was given in the table below:

   d    113   14   98    130   75    120   143   55   127
   t    130   25   180   148   100   120   196   48   165

   (i)   Find the value of the product moment correlation coefficient.

   (ii)  Find the equation of the least squares regression line of t on d in the form
         t = a + bd. Interpret the coefficient b in the context of the question.

   (iii) Explain why it is more appropriate to regress t on d.

   (iv)  Estimate the time taken by the delegates who travelled
         (a) 100 km
         (b) 150 km
         and comment on the reliability of your estimates.

   [(i) 0.894 (ii) t = 3.43 + 1.24d (iv) (a) 127 (b) 189]

5. FM N94/4/9

   The following summary data refers to concentrations of carbon dioxide in the atmosphere (y)
   in parts per million for the past 8 years 1971, 1973, ..., 1985 (x).

   $\sum (x - 1971) = 56$, $\sum (y - 325) = 69$, $\sum (x - 1971)^2 = 560$,
   $\sum (y - 325)^2 = 887$, $\sum (x - 1971)(y - 325) = 704$
   [Source: Council on Environmental Quality, 1987.]

   (i)   Let u = x - 1971 and v = y - 325. Calculate the equation of the least squares regression
         line of v on u. Hence find the equation of the least squares regression line of y on x.

   (ii)  Calculate the product moment correlation coefficient for x and y. Comment on what
         its value implies about the regression line.

   (iii) Estimate the concentration of carbon dioxide in the atmosphere in (a) 1974 and (b)
         1988. Comment on the reliability of your answers.

   [(i) v = 1.32u - 0.583; y = 1.32x - 2268 (ii) 0.998 (iii) 328; 347]
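Question 5 works with coded data, u = x - 1971 and v = y - 325. The sketch below shows one way to carry the calculation through from the coded sums and then convert back to a line of y on x. It assumes the value $\sum (y - 325) = 69$, which is read from a partly illegible figure in the question and is consistent with the printed answers; the variable names are illustrative.

```python
import math

n = 8
sum_u, sum_v = 56, 69            # sum of (x - 1971), sum of (y - 325); 69 is an assumed reading
sum_u2, sum_v2 = 560, 887
sum_uv = 704

s_uv = sum_uv - sum_u * sum_v / n
s_uu = sum_u2 - sum_u ** 2 / n
s_vv = sum_v2 - sum_v ** 2 / n

b = s_uv / s_uu                          # slope of v on u, also the slope of y on x
a_v = sum_v / n - b * sum_u / n          # intercept of v on u
r = s_uv / math.sqrt(s_uu * s_vv)        # unchanged by the coding

# Substitute u = x - 1971 and v = y - 325 into v = a_v + b*u.
intercept_y = 325 + a_v - b * 1971

print(f"v = {b:.2f}u + ({a_v:.3f})")
print(f"y = {b:.2f}x + ({intercept_y:.0f})")
print(f"r = {r:.3f}")
```

Subtracting a constant from x or y changes only the intercept, so the slope and r found from the coded data apply to the original variables as well.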

6. A comprehensive guide: H2 Mathematics for 'A' Level. Q16 p221 (modified)

   The data shows the result of an experiment to investigate the relationship between two
   variables x and t, where x is dependent on t.

   x    22.5   25.0   28.0   30.5   38.0   40.5   42.5   48.0   54.5   55.0   70.0
   t    44.0   42.0   33.5   28.0   18.0   13.6   15.0   10.3   9.0    6.3    4.0

   (i)   Obtain the scatter diagram and comment on any relationship between x and t.

   (ii)  State, with a reason, which of the following models is more appropriate to fit the data
         points:
         (a) $x = at^b$ where a > 0 and b < 0
         (b) $x = a + bt^2$ where a > 0 and b < 0.

   (iii) For the appropriate model, find the product moment correlation coefficient for the
         transformed data. Estimate the values of a and b.

   [(iii) -0.990, a = 136, b = -0.453]

7. N2009/II/6

   The table gives the world record time, in seconds above 3 minutes 30 seconds, for running
   1 mile as at 1st January in various years.

   Year, x    1930   1940   1950   1960   1970   1980   1990   2000
   Time, t    40.4   36.4   31.3   24.5   21.1   19.0   16.3   13.1

   (i)   Draw a scatter diagram to illustrate the data.

   (ii)  Comment on whether a linear model would be appropriate, referring both to the scatter
         diagram and the context of the question.

   (iii) Explain why in this context a quadratic model would probably not be appropriate for
         long-term predictions.

   (iv)  Fit a model of the form ln t = a + bx to the data, and use it to predict the world record
         time as at 1st January 2010. Comment on the reliability of your prediction.

   [(ii) inappropriate (iv) 3 mins 41.4 secs, unreliable]

8.

'

fM2O03/IUllOR modilied

A random sample of eight pairs of values


equations ofthe regression lines

ofy onx

ofx

and,y is used to obtain the following

and )c

71517
Y='-fr*tl0,

ony respectively.

7s=--Y+20

Seven ofthe pairs of data are given in the table.

x
v

l0

11

l2

11

t7

t4

l9

Find the eighth pair of values ofx and.y.


Determine the value of the product nnoment corelation coefficient, and say what it leads you

to expect about the scatter diagram for this sample.

Let Y be the value obtained by substituting a sample value of -r into the equation of the
regression line

ofy

on x. Evaluate Y for each of the eight values of x and state the value

of

z?-v)'.
For each of the eigtrt sample values of x, Y'is givr:n by Y'
constants. What can you say about the value

of

: a * bx, where a and b are any

I(y -y')'Z
l(to,s;,

r:

-0.904,

8.81

32

H2 Mathematics JC 2

9. FM2006/II/10

   For a random sample of 12 observations of pairs of values (x, y), the equation of the
   regression line of y on x is y = 4.82 - 2.25x. The sum of the 12 values of x is 20.64 and the
   product moment correlation coefficient for the sample is -0.3.

   (i)   Find the sum of the 12 values of y.
   (ii)  Find the equation of the regression line of x on y.
   (iii) Find the estimated value of y when x = 2.8 and comment on the reliability of this
         estimate.

   [(i) 11.4 (ii) x = 1.76 - 0.04y (iii) -1.48]

10. N2007/II/11

   Research is carried out into how the concentration of a drug in the bloodstream varies with
   time, measured from when the drug is given. Observations at successive times give the data
   shown in the following table.

   Time (t minutes)                          15   30   60   90   120   150   180   240   300
   Concentration (x micrograms per litre)    82   65   43   37   22    19    12

   It is given that the value of the product moment correlation coefficient for this data is -0.912
   correct to 3 decimal places.

   Obtain the scatter diagram of x on t and calculate the equation of the regression line of x on t.

   Calculate the corresponding estimated value of x when t = 300, and comment on the suitability
   of the linear model.

   The variable y is defined by y = ln x. For the variables y and t,

   (i)  calculate the product moment correlation coefficient and comment on its value,
   (ii) calculate the equation of the appropriate regression line.

   Use a regression line to give the best estimate that you can of the time when the drug
   concentration is 15 micrograms per litre.

[(il x = -0.260 r + 66.2: x :

-l

1.7 or

-l

1.8

(ii) r

- 0.994:ln x :

4.6t

- 0 01 ?1 r' r :

551
