Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Outline
● Statistical Background
● Significance Testing
● Hypothesis T esting
● An algorithm applied to a set of data will usually produce some
result(s)
– There h ave b een claims that the results reported in more than 5 0%
of p ublished p apers a re false. (Ioannidis)
● Results may be a result of random variation
– Any p articular d ata set is a finite sample from a larger p opulation
– Often significant variation a mong instances in a d ata set o r
heterogeneity in the p opulation
– Unusual e vents o r coincidences d o h appen, e specially when looking
at lots o f e vents
– For this a nd o ther reasons, results may n ot replicate, i.e., g eneralize
to o ther samples o f d ata
● Results may not have domain significance
– Finding a d ifference that makes n o d ifference
● Data scientists need to help ensure that results of data analysis
are not false discoveries, i.e., not meaningful or reproducible
02/14/2018 Introduction to Data Mining, 2nd Edition 3
Statistical Testing
Gaussian Distribution
\mathcal{N }
Statistical Testing …
Significance Testing
●s
Hypothesis Testing …
● The value of a blood test is used as the test statistic, 𝑅, to
identify whether a patient has a particular disease or not.
– H0:
For
p atients
𝐧𝐨𝐭
having
the
d isease,𝑅
h as
distribution
𝒩(40, 5 )
– H1:
For
p atients
having
the
d isease,𝑅
h as
distribution
𝒩(60, 5 )
rQs S r Qvw S
G K G K
𝑒?
𝑒?
𝛼 =
∫tu STS 𝑑𝑅 = ∫tu xw 𝑑𝑅 = 0.023, µ = 4 0, σ = 5
LMN S tuM
rQs S r Qzw S
tu K tu K
𝑒?
𝑒?
𝛽 =
∫?G STS 𝑑𝑅 = ∫?G xw 𝑑𝑅 = 0.023, µ = 60, σ = 5
LMN S tuM
Probability Density
0.05 0.05
0.04 0.04
0.03 0.03
0.02 0.02
0.01 0.01
0
0
20 30 40 50 60 70 80 20 30 40 50 60 70 80
R R
Distribution of test statistic for the alternative hypothesis (rightmost density curve)
and null hypothesis (leftmost density curve). Shaded region in right subfigure is 𝛼.
0.08 0.08
0.07 0.07
0.06 0.06
Probability Density
Probability Density
0.05 0.05
0.04 0.04
0.03 0.03
0.02 0.02
0.01 0.01
0
0
20 30 40 50 60 70 80 20 30 40 50 60 70 80
R R
Shaded region in left subfigure is β and shaded region in right subfigure is power.
Bonferroni Procedure
● Goal of F WER is to ensure that F WER < 𝛼,
where 𝛼 is often 0.05
● Bonferroni Procedure:
– m results are to be tested
– Require FWER < α
– set the significance level, 𝛼 ∗ for every test to be
𝛼 ∗ = 𝛼/m.
0.6
0.5
0.4
0.3
0.2
0.1
0.05
0
0 10 20 30 40 50 60 70 80 90 100
Number of Results, m
The family wise error rate (FWER) curves for the naïve approach and the
Bonferroni procedure as a function of the number of results, m. 𝜶 = 𝟎 .𝟎𝟓 .
𝐹𝑃 𝐹𝑃
𝑄= =
if
𝑃𝑝𝑟𝑒𝑑 > 0
𝑃𝑝𝑟𝑒𝑑 𝑇𝑃 + 𝐹𝑃
= 0
if
𝑃𝑝𝑟𝑒𝑑 =
0 ,
where 𝑃𝑝𝑟𝑒𝑑 is the n umber o f p redicted p ositives
● If we know FP, the n umber o f a ctual false p ositives, then FDR = FP.
– Typically we don’t k now FP in a t esting s ituation
0.1 Naive Approach
Bonferroni
False Discovery Rate (FDR)
BH Procedure
0.08
0.06
0.05
0.04
0.02
0
0 10 20 30 40 50 60 70 80 90 100
Number of stockbrokers, m
0.8
0.7
0.6
0.5
0.4
0.3
Naive Approach
0.2 Bonferroni
BH Procedure
0.1
0
0 10 20 30 40 50 60 70 80 90 100
Number of stockbrokers, m