
Chi-square test
or χ² test

A nonparametric hypothesis test
Parametric vs. Nonparametric Tests
• Parametric hypothesis test
– about population parameter (μ or p)
– z, t tests
– interval/ratio data
• Nonparametric tests
– do not test a specific parameter
– nominal & ordinal data
– frequency data
Chi-square test
Used to test the counts of categorical data
Three types
• Goodness of fit (univariate)
• Independence (bivariate)
• Homogeneity (univariate with two samples)
χ² distribution
[Figure: χ² density curves for df = 3, df = 5, and df = 10]
χ² distribution
• Different df have different curves
• Skewed right
• As df increases, the curve shifts toward the right & becomes more like a normal curve
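The change in shape with df can be checked numerically. The sketch below is only an illustration. It relies on standard properties of the χ² distribution that the slide does not state (mean = df, peak at df - 2 for df ≥ 2, skewness = sqrt(8/df)), and the function names are hypothetical.

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom at x > 0."""
    half = df / 2.0
    log_pdf = (half - 1.0) * math.log(x) - x / 2.0 - half * math.log(2.0) - math.lgamma(half)
    return math.exp(log_pdf)

def chi2_skewness(df):
    """Skewness of the chi-square distribution; it shrinks toward 0 as df grows."""
    return math.sqrt(8.0 / df)

# Same df values as the curves on the slide.
for df in (3, 5, 10):
    peak = max(df - 2, 0)  # location of the highest point of the curve
    print(f"df={df:2d}  mean={df}  peak at x={peak}  skewness={chi2_skewness(df):.3f}")
```

The peak and the mean both move right as df grows, while the skewness falls toward 0, the value for a normal curve.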
χ² assumptions
• SRS – reasonably random sample
• Have counts of categorical data & we expect each category to happen at least once
• Sample size – to ensure that the sample size is large enough we should expect at least five in each category.
Combine these together: All expected counts are at least 5.

***Be sure to list expected counts!!
χ² formula

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where
$O_i$ is the observed frequency
$E_i$ is the expected frequency
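As a minimal sketch, the formula translates directly into a short function. The function name and the three-category data are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 60 observations over three equally likely categories.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(chi_square_statistic(observed, expected))  # (4 + 4 + 0) / 20
```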
χ² Goodness of fit test
• Uses univariate data
• Want to see how well the observed counts “fit” what we expect the counts to be
• Based on df: df = number of categories - 1
Hypotheses – written in words
H0: the observed counts equal the expected counts, i.e., there is no significant difference between the observed and the expected counts
H1: the observed counts are not equal to the expected counts

Be sure to write in context!
Steps for Computation of χ²
(1) Compute the expected frequencies E1, E2, …, En corresponding to the observed frequencies O1, O2, …, On under some hypothesis.
(2) Compute the deviations (O-E) for each frequency and then square them to obtain (O-E)².
(3) Divide the square of the deviations (O-E)² by the corresponding expected frequency to obtain (O-E)²/E.
(4) Add the values obtained in step 3 to compute:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
Steps for Computation of χ² (con’t)
(5) Under the null hypothesis that the theory fits well, the above statistic follows a χ² distribution with v = n-1 d.f.
(6) Look up the tabulated value of χ² for (n-1) d.f. at the given level of significance.
(7) If the calculated value of χ² is less than the corresponding tabulated value obtained in step 6, then it is said to be non-significant at the required level of significance.
(8) If the calculated value of χ² is greater than the corresponding tabulated value obtained in step 6, then it is said to be significant at the required level of significance.
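The comparison in steps 6 to 8 can be sketched in code. The slides work from a printed table. As an assumption beyond the slides, the sketch also computes the upper-tail probability (p-value) with the standard library only, using the recurrence for the regularized incomplete gamma function. The names `chi2_sf` and `decide`, and the calculated value 4.2, are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting values: df = 2 gives exp(-x/2), df = 1 gives erfc(sqrt(x/2)).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def decide(chi2, tabulated):
    """Steps 7 and 8: compare the calculated value with the tabulated value."""
    return "significant" if chi2 > tabulated else "non-significant"

# Hypothetical calculated value 4.2 with n = 4 categories (3 d.f.),
# compared with the tabulated 5% value 7.815 for 3 d.f.
print(decide(4.2, 7.815), round(chi2_sf(4.2, 3), 3))
```

A p-value above the chosen level of significance leads to the same decision as a calculated value below the tabulated one.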
Let’s test our dice!

CASELETS
1. A die is rolled 100 times with the following distribution:
Number             : 1  2  3  4  5  6
Observed frequency : 17 14 20 17 17 15
At the 0.01 level of significance, determine whether the die is true (unbiased).
Solution. We are given:
Number of categories = 6
N = total frequency = 17+14+20+17+17+15 = 100

Null Hypothesis: H0: The die is unbiased, i.e., the probability of obtaining each of the six faces is the same
Alternate Hypothesis: H1: The die is biased
CASELETS (con’t)
Expected frequency for each face (E) = N*p = 100/6 = 16.67
(all expected values are greater than 5)

Number   Observed        Expected        (O-E)    (O-E)²     (O-E)²/E
         Frequency (O)   Frequency (E)
1        17              16.67            0.33     0.1089    0.0065
2        14              16.67           -2.67     7.1289    0.4276
3        20              16.67            3.33    11.0889    0.6652
4        17              16.67            0.33     0.1089    0.0065
5        17              16.67            0.33     0.1089    0.0065
6        15              16.67           -1.67     2.7889    0.1673
Total                                                        χ² = 1.2796

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 1.2796$$

Since the calculated value of χ² = 1.2796 is less than the tabulated value of χ², i.e., 15.086 (for 5 d.f. at 1% level of significance), the null hypothesis is accepted and we conclude that the die may be regarded as true (unbiased).
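The die computation can be reproduced with a few lines of code. This is a sketch with hypothetical variable names. It keeps the unrounded expected count 100/6, so the statistic is 1.28 rather than the 1.2796 obtained with E rounded to 16.67.

```python
observed = [17, 14, 20, 17, 17, 15]            # counts for faces 1..6
total = sum(observed)                           # N = 100
expected = [total / 6] * 6                      # equal probability 1/6 per face

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
tabulated = 15.086                              # 5 d.f. at the 1% level, as quoted in the solution

print(f"chi-square = {chi2:.4f}")
print("significant" if chi2 > tabulated else "non-significant")
```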
CASELETS
2. Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short). A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively.
What are the expected counts? df?
We expect 9/16 of the 100 flies to have yellow bodies and normal wings (Y & N).
Expected counts: Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25
Since there are 4 categories, df = 4 – 1 = 3
Are the results consistent with the theoretical distribution predicted by the genetic model? (5% level of significance)
CASELETS (con’t)
Assumption:
All expected counts are greater than 5.
Expected counts:
Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25
H0: The distribution of fruit flies is the same as the theoretical model.
H1: The distribution of fruit flies is not the same as the theoretical model.

Number   Obser.      Expec.      (O-E)    (O-E)²     (O-E)²/E
         Freq. (O)   Freq. (E)
1        59          56.25        2.75     7.5625    0.134
2        20          18.75        1.25     1.5625    0.083
3        11          18.75       -7.75    60.0625    3.203
4        10           6.25        3.75    14.0625    2.250

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 5.671$$

Since the calculated value of χ² = 5.671 is less than the tabulated value of χ², i.e., 7.815 (for 3 d.f. at 5% level of significance), the null hypothesis is accepted and we conclude that the distribution of fruit flies is consistent with the theoretical model.
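When the hypothesis is stated as a ratio, the expected counts follow by scaling the ratio to the sample size. The sketch below, with hypothetical variable names, rebuilds the expected counts from 9:3:3:1 and recomputes the statistic.

```python
ratio = [9, 3, 3, 1]                 # Y & N, Y & S, E & N, E & S
observed = [59, 20, 11, 10]
total = sum(observed)                # 100 flies

# Each category is expected to receive its share of the ratio.
expected = [total * r / sum(ratio) for r in ratio]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

print(expected)
print(f"chi-square = {chi2:.3f} with {df} d.f.")
```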
CASELETS
3. Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?
Aries    23    Libra        18    Leo       20
Taurus   20    Scorpio      21    Virgo     19
Gemini   18    Sagittarius  19    Aquarius  24
Cancer   23    Capricorn    22    Pisces    29
How many would you expect in each sign if there were no difference between them?
I would expect CEOs to be equally born under all signs. So 256/12 = 21.333333
How many degrees of freedom?
Since there are 12 signs, df = 12 – 1 = 11
CASELETS (con’t)
Assumption
All expected counts are greater than 5. (I expect 21.33 CEOs to be born in each sign.)
H0: The number of CEOs born under each sign is the same.
H1: The number of CEOs born under each sign is different.

$$\chi^2 = \frac{(23 - 21.33)^2}{21.33} + \frac{(20 - 21.33)^2}{21.33} + \dots + \frac{(29 - 21.33)^2}{21.33} = 5.094$$

Since the calculated value of χ² = 5.094 is less than the tabulated value of χ², i.e., 19.675 (for 11 d.f. at 5% level of significance), the null hypothesis is accepted and we conclude that the number of CEOs born under each sign may be regarded as the same.
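With twelve categories the sum is tedious by hand, so a short script helps to check it. This is a sketch with a hypothetical dictionary layout. The counts are those given in the problem statement.

```python
counts = {
    "Aries": 23, "Taurus": 20, "Gemini": 18, "Cancer": 23,
    "Libra": 18, "Scorpio": 21, "Sagittarius": 19, "Capricorn": 22,
    "Leo": 20, "Virgo": 19, "Aquarius": 24, "Pisces": 29,
}
total = sum(counts.values())              # 256 CEOs
expected = total / len(counts)            # 21.33 per sign under H0

chi2 = sum((o - expected) ** 2 / expected for o in counts.values())
df = len(counts) - 1

print(f"chi-square = {chi2:.3f} with {df} d.f.")
print("below tabulated value 19.675:", chi2 < 19.675)
```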
CASELETS
4. Records taken of the number of male and female births in 800 families having four children are given as follows:

No. of births       Frequency
Male    Female
0       4           32
1       3           178
2       2           290
3       1           236
4       0           64

Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of a female birth.
CASELETS (con’t)
Let us set up the null hypothesis that the data are consistent with the binomial law of equal probability for male and female births.
We are given n = 4, N = 800
According to the binomial probability law, the frequency of r male births is given by:
F(r) = N*p(r) = N * nCr * p^r * q^(n-r)
     = 800 * 4Cr * (0.5)^r * (0.5)^(4-r)
     = 50 * 4Cr; (r = 0,1,2,3,4)

No. of male births   Expected frequency F(r)
0                    50 * 4C0 = 50
1                    50 * 4C1 = 200
2                    50 * 4C2 = 300
3                    50 * 4C3 = 200
4                    50 * 4C4 = 50
Total                800
CASELETS (con’t)
No. of male   Obser.      Expec.      (O-E)   (O-E)²   (O-E)²/E
births        Freq. (O)   Freq. (E)
0             32          50          -18     324      6.48
1             178         200         -22     484      2.42
2             290         300         -10     100      0.33
3             236         200          36     1296     6.48
4             64          50           14     196      3.92
Total                                                  χ² = 19.63

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 19.63$$

Since the calculated value of χ² = 19.63 is greater than the tabulated value of χ², i.e., 9.488 (for 4 d.f. at 5% level of significance), the null hypothesis is rejected and we conclude that the hypothesis of equal male and female births is wrong.
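The binomial expected frequencies and the resulting statistic can be sketched as follows. Variable names are hypothetical. `math.comb` (Python 3.8 or later) supplies the 4Cr terms.

```python
from math import comb

N, n, p = 800, 4, 0.5                       # families, children per family, P(male)
observed = [32, 178, 290, 236, 64]          # families with 0..4 male births

# F(r) = N * nCr * p^r * q^(n-r), which reduces to 50 * 4Cr here.
expected = [N * comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)
print(f"chi-square = {chi2:.2f}; exceeds 9.488: {chi2 > 9.488}")
```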
CASELETS
5. A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g of hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises.
Why is the chi-square goodness-of-fit test NOT appropriate here?
Because we do NOT have counts of the type of nuts.
What might you do instead of weighing the nuts in order to use chi-square?
We could count the number of each type of nut and then perform a χ² test.
Practice CASELETS
1. The following figures show the distribution of digits in numbers chosen at random from a telephone directory:
Digit     : 0    1    2   3   4    5   6    7   8   9
Frequency : 1026 1107 997 966 1075 933 1107 972 964 853
Test whether the digits may be taken to occur equally frequently in the directory. (tabular value for 9 d.f. at 5% level of significance is 16.92)
2. The number of scooter accidents per month in a certain town were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Are these frequencies in agreement with the belief that accident conditions were the same during this 10 month period? (tabular value for 9 d.f. at 5% level is 16.92)
χ² test for independence
• Used with categorical, bivariate data from ONE sample
• Used to see if the two categorical variables are associated (dependent) or not associated (independent)
Assumptions & formula remain the same!
Hypotheses – written in words
H0: the two variables are independent
H1: the two variables are dependent

Be sure to write in context!
CASELETS
1. A beef distributor wishes to determine
whether there is a relationship between
geographic region and cut of meat preferred.
If there is no relationship, we will say that
beef preference is independent of geographic
region. Suppose that, in a random sample of
500 customers, 300 are from the North and
200 from the South. Also, 150 prefer cut A,
275 prefer cut B, and 75 prefer cut C.
Also suppose that in the actual sample of 500
consumers the observed numbers were as
follows:
CASELETS (con’t)
         North   South   Total
Cut A    100     50      150
Cut B    150     125     275
Cut C    50      25      75
Total    300     200     500

Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)
CASELETS (con’t)
Solution
Expected Counts
Assuming H0 is true,

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$
CASELETS (con’t)
Degrees of freedom

$$df = (r - 1)(c - 1)$$

Or cover up one row & one column & count the number of cells remaining!
CASELETS (con’t)
If beef preference is independent of geographic region, how would we expect this table to be filled in?
         North   South   Total
Cut A    90      60      150
Cut B    165     110     275
Cut C    45      30      75
Total    300     200     500
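The expected table can be generated from the row and column totals of the observed table. A minimal sketch with hypothetical function names, applied to the observed beef counts:

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def degrees_of_freedom(observed):
    """df = (rows - 1) * (columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)

# Rows: Cut A, Cut B, Cut C; columns: North, South.
beef = [[100, 50], [150, 125], [50, 25]]
for row in expected_counts(beef):
    print(row)
print("df =", degrees_of_freedom(beef))
```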
CASELETS (con’t)
Assumptions:
All expected counts are greater than 5.

H0: geographic region and beef preference are independent
H1: geographic region and beef preference are dependent

Obser.      Expec.      (O-E)   (O-E)²   (O-E)²/E
Freq. (O)   Freq. (E)
100         90           10     100      1.11
50          60          -10     100      1.67
150         165         -15     225      1.36
125         110          15     225      2.05
50          45            5     25       0.56
25          30           -5     25       0.83
Total                                    χ² = 7.58

CASELETS (con’t)
Since the calculated value of χ² = 7.58 is greater than the tabulated value of χ², i.e., 5.991 (for (3-1)*(2-1) = 2 d.f. at 5% level of significance), the null hypothesis is rejected and we conclude that geographic region and beef preference are dependent.
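As a cross-check on the hand arithmetic, the sketch below recomputes every (O-E)²/E term directly from the observed table. The function name is hypothetical. The tabulated value 5.991 for 2 d.f. at the 5% level is the one quoted in the solution.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count under H0
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: Cut A, Cut B, Cut C; columns: North, South.
chi2, df = chi_square_independence([[100, 50], [150, 125], [50, 25]])
print(f"chi-square = {chi2:.3f} with {df} d.f.")
print("significant at 5%" if chi2 > 5.991 else "non-significant at 5%")
```

The individual terms are 100/90, 100/60, 225/165, 225/110, 25/45 and 25/30, which add up to about 7.58.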
CASELETS
2. In a certain sample of 2000 families, 1400 families are consumers of tea. Out of 1800 Hindu families, 1236 families consume tea. Use the Chi-Square test and state whether there is any significant difference between consumption of tea among Hindu and non-Hindu families. (5% level of significance)

Number of                    Hindu   Non-Hindu   Total
Families consuming tea       1236    164         1400
Families not consuming tea   564     36          600
Total                        1800    200         2000
CASELETS (con’t)
Solution: Expected Counts
• Assuming H0 is true,

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$

Number of                    Hindu   Non-Hindu   Total
Families consuming tea       1260    140         1400
Families not consuming tea   540     60          600
Total                        1800    200         2000
CASELETS (con’t)
Assumptions:
All expected counts are greater than 5.

H0: consumption of tea and community are independent, i.e., there is no significant difference between the consumption of tea among Hindu and non-Hindu families
H1: consumption of tea and community are dependent

Obser.      Expec.      (O-E)   (O-E)²   (O-E)²/E
Freq. (O)   Freq. (E)
1236        1260        -24     576      0.457
564         540          24     576      1.067
164         140          24     576      4.114
36          60          -24     576      9.600
Total                                    χ² = 15.238

CASELETS (con’t)
Since the calculated value of χ² = 15.238 is greater than the tabulated value of χ², i.e., 3.841 (for (2-1)*(2-1) = 1 d.f. at 5% level of significance), the null hypothesis is rejected and we conclude that the two communities differ significantly as regards the consumption of tea.
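In a 2×2 table with fixed row and column totals, every cell deviates from its expected count by the same absolute amount (24 here), so the statistic reduces to that squared deviation times the sum of 1/E. The sketch below, with hypothetical variable names, checks this for the tea data.

```python
# Rows: consuming tea, not consuming tea; columns: Hindu, Non-Hindu.
observed = [[1236, 164], [564, 36]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

expected = [[r * c / total for c in col_totals] for r in row_totals]
deviations = [abs(o - e) for o_row, e_row in zip(observed, expected)
              for o, e in zip(o_row, e_row)]

d = deviations[0]                                   # common |O - E| of all four cells
chi2 = d ** 2 * sum(1 / e for e_row in expected for e in e_row)

print("absolute deviations:", deviations)
print(f"chi-square = {chi2:.3f}")
```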
CASELETS
3. A sample of 400 students of under-graduate and 400 students of post-graduate classes was taken to know their opinion about autonomous colleges. 290 of the under-graduate and 310 of the post-graduate students favored the autonomous status. Present these facts in the form of a table and test, at the 5% level, that the opinion regarding autonomous status of college is independent of the level of classes of students.

Solution
Observed Frequencies
Class            Number of Students     Total
                 Favoring   Opposing
Under Graduate   290        110         400
Post Graduate    310        90          400
Total            600        200         800
CASELETS (con’t)
Expected Counts
• Assuming H0 is true,

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$

Class            Number of Students     Total
                 Favoring   Opposing
Under Graduate   300        100         400
Post Graduate    300        100         400
Total            600        200         800
CASELETS (con’t)
Assumptions:
All expected counts are greater than 5.

H0: Opinion about autonomous colleges is independent of the level of classes
H1: Opinion about autonomous colleges depends on the level of classes

Obser.      Expec.      (O-E)   (O-E)²   (O-E)²/E
Freq. (O)   Freq. (E)
290         300         -10     100      0.33
110         100          10     100      1.00
310         300          10     100      0.33
90          100         -10     100      1.00
Total                                    χ² = 2.67

CASELETS (con’t)
Since the calculated value of χ² = 2.67 is less than the tabulated value of χ², i.e., 3.841 (for (2-1)*(2-1) = 1 d.f. at 5% level of significance), the null hypothesis is accepted and we conclude that the opinion about autonomous colleges may be regarded as independent of the level of classes of the students.
CASELETS
4. Suppose that, in a public opinion survey, answers to the questions
(a) Do you drink?
(b) Are you in favor of local option on sale of liquor?
were as given in the table

Question (b)   Question (a)     Total
               Yes     No
Yes            56      31       87
No             18      6        24
Total          74      37       111

Can you infer that opinion on local option is dependent on whether or not an individual drinks?
CASELETS (con’t)
Solution: Expected Counts
• Assuming H0 is true,

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$

Question (b)   Question (a)     Total
               Yes     No
Yes            58      29       87
No             16      8        24
Total          74      37       111
CASELETS (con’t)
Assumptions:
All expected counts are greater than 5.

H0: opinion on the local option on sale of liquor is independent of whether or not an individual drinks
H1: opinion on the local option on sale of liquor is dependent on whether or not an individual drinks

Obser.      Expec.      (O-E)   (O-E)²   (O-E)²/E
Freq. (O)   Freq. (E)
56          58          -2      4        0.069
31          29           2      4        0.138
18          16           2      4        0.250
6           8           -2      4        0.500
Total                                    χ² = 0.957

CASELETS (con’t)
Since the calculated value of χ² = 0.957 is less than the tabulated value of χ², i.e., 3.841 (for (2-1)*(2-1) = 1 d.f. at 5% level of significance), the null hypothesis is accepted and we conclude that the opinion on local option on sale of liquor is independent of whether or not an individual drinks.
Yates correction for Continuity
If any cell frequency in a 2×2 table is less than 5, then for the application of the Chi-Square test it has to be pooled with the preceding or succeeding frequency so that the total is greater than 5. This results in the loss of 1 d.f. Since for a 2×2 table d.f. = (2-1)×(2-1) = 1, the d.f. left after adjusting for pooling are v = 1-1 = 0, which is absurd. In such a situation we apply Yates correction for ‘continuity’. In this method we add 0.5 to the cell frequency which is less than 5 and adjust the remaining frequencies accordingly, since row and column totals are fixed, and then apply the Chi-Square test.
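The adjustment can be sketched in code. The 2×2 table below is hypothetical, and so are the function names. The sketch also uses the form in which the correction is commonly written, Σ(|O-E| - 0.5)²/E, which is an assumption beyond the slide. For this table the small cell lies below its expected count, and the two routes give the same value.

```python
def expected_2x2(table):
    """Expected counts from fixed row and column totals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def chi_square(table, expected):
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))

def chi_square_yates(table, expected):
    """Shrink every |O - E| by 0.5 before squaring."""
    return sum((abs(o - e) - 0.5) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))

observed = [[2, 8], [10, 5]]            # hypothetical counts; one cell is below 5
expected = expected_2x2(observed)

# Slide's route: add 0.5 to the small cell, adjust the others so totals stay fixed.
adjusted = [[2.5, 7.5], [9.5, 5.5]]

print(f"uncorrected      : {chi_square(observed, expected):.3f}")
print(f"adjusted table   : {chi_square(adjusted, expected):.3f}")
print(f"|O-E| - 0.5 form : {chi_square_yates(observed, expected):.3f}")
```

The corrected statistic is smaller than the uncorrected one, which makes the test more conservative. It is still compared with the tabulated value for 1 d.f.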
χ² test for homogeneity
• Used with a single categorical variable from two (or more) independent samples
• Used to see if the two populations are the same (homogeneous)
Assumptions & formula remain the same!
Expected counts & df are found the same way as in the test for independence.

Only change is the hypotheses!
Hypotheses – written in words
H0: the two (or more) distributions are the same
H1: the distributions are different

Be sure to write in context!
