
Testing for normality and Transforming data
How to tell whether data is
appropriate for parametric tests?
IF DATA normally distributed
can utilize Parametric Tests
POWERFUL & ROBUST

IF DATA NOT normally distributed
limited to Non-Parametric Tests
LESS POWERFUL
HOWEVER
Are all normal distributions the same?

Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
ALSO
Visual prediction of normal distribution
greatly limited by sample size

[Figure: six panels showing samples of size n=10, n=15, n=20, n=25, n=30 and n=120]
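One way to see why small samples are hard to judge by eye is to draw many samples from a truly normal population and watch how much a shape statistic wobbles. The sketch below is a minimal illustration using only Python's standard library; the helper names, the seed and the 500 repetitions are assumptions chosen for the example, not part of the slides.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def skewness_spread(n, repeats=500, seed=1):
    """Standard deviation of the sample skewness over many normal samples of size n."""
    rng = random.Random(seed)
    skews = [sample_skewness([rng.gauss(0, 1) for _ in range(n)])
             for _ in range(repeats)]
    return statistics.stdev(skews)

# Same sample sizes as the panels above; every sample comes from a normal population.
for n in (10, 15, 20, 25, 30, 120):
    print(f"n={n:>3}: spread of skewness = {skewness_spread(n):.2f}")
```

The spread shrinks as n grows, so a lopsided histogram at n=10 says little about the parent population.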


SOLUTION:
Maximize possibility of parametric testing
only if data is appropriate

Test “normality” of data distribution

Transform data if needed


Parametric tests: hypothesis tests that provide generalizations for making statements about the mean of the parent population

- assume underlying statistical distributions in the data

- are appropriate if each sample follows a normal distribution and if sample variances are homogeneous

- will have more statistical power when these assumptions hold

Nonparametric tests: hypothesis tests that are not based on underlying distributional assumptions, i.e. they do not require the population's distribution to be described by specific parameters.

- are also called distribution-free tests because they don't assume that your data follow a specific distribution.
ASSIGNMENT: Give examples of data sets/analyses using the one-sample t-test, two-sample t-test, one-way ANOVA, two-way ANOVA and chi-square test.

1. Variables to be measured (e.g. zone of bacterial inhibition, difference in gene expression between normal and cancer groups)
2. Results and interpretation/conclusion (include graph/table)
3. Title of the journal/reference
4. Submit hard copy
TEST :
“shape” of distribution can be measured

SKEWNESS
“asymmetry”

KURTOSIS
“height of hump”

MODALITY
“no. of humps”
SKEWNESS: how sample differs in shape
from a symmetrical distribution

Normal          Skewed right         Skewed left
symmetric       longer right tail    longer left tail
Mean = Median   Mean > Median        Mean < Median


Many software programs compute the
adjusted Fisher-Pearson coefficient of SKEWNESS

sample skewness = G1

G1 includes an adjustment for sample size


syntax in MS Excel:
= SKEW (number1,[number2],…)
Interpreting sample skewness G1:

= zero (or near 0) : symmetric

= between -1 and +1 : moderate (+/-) skew

= beyond -1 or +1 : substantial (+/-) skew
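The adjusted coefficient can be computed by hand to see what the adjustment does. This is a minimal sketch, assuming the usual definition G1 = sqrt(n(n-1))/(n-2) × m3/m2^1.5 (the form Excel documents for SKEW); the function names and the `near_zero` cut-off of 0.1 are hypothetical choices for illustration, since the slides do not say how close to 0 counts as "near".

```python
import math

def adjusted_skewness(values):
    """Adjusted Fisher-Pearson skewness G1 (needs at least 3 observations)."""
    n = len(values)
    if n < 3:
        raise ValueError("G1 needs at least 3 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # unadjusted skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1    # sample-size adjustment

def describe_skew(g1_adj, near_zero=0.1):
    """Apply the rule of thumb above; near_zero is an assumed tolerance."""
    if abs(g1_adj) <= near_zero:
        return "symmetric"
    direction = "positive (right)" if g1_adj > 0 else "negative (left)"
    size = "moderate" if abs(g1_adj) <= 1 else "substantial"
    return f"{size} {direction} skew"

scores = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]
g1_adj = adjusted_skewness(scores)
print(round(g1_adj, 4), describe_skew(g1_adj))  # 0.3595 moderate positive (right) skew
```

The adjustment factor sqrt(n(n-1))/(n-2) is largest for small n and approaches 1 as the sample grows.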


KURTOSIS: extent to which data are distributed in the
“tails vs. center” of the distribution

Mesokurtic      Leptokurtic     Platykurtic
standard bell   more peaked     flatter
Many software programs compute
excess KURTOSIS: actual value - 3

sample kurtosis = G2

actual “kurtosis” of any normal distribution is 3, so its excess kurtosis is 0.


syntax in MS Excel:
= KURT (number1,[number2],…)
Interpreting sample kurtosis G2:

= zero (0) : mesokurtic, “normal”
= + value : leptokurtic, “fat in tails”
= - value : platykurtic, “thin in tails”
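The sample excess kurtosis can be sketched the same way. The code below assumes the bias-corrected formula that Excel documents for KURT, G2 = n(n+1)/((n-1)(n-2)(n-3)) × Σ((x - mean)/s)^4 - 3(n-1)^2/((n-2)(n-3)); the function names are hypothetical.

```python
def excess_kurtosis(values):
    """Sample excess kurtosis G2 (needs at least 4 observations)."""
    n = len(values)
    if n < 4:
        raise ValueError("G2 needs at least 4 observations")
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)       # sample variance
    z4 = sum((v - mean) ** 4 for v in values) / s2 ** 2       # sum of standardized 4th powers
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))       # plays the role of the "- 3"
    return lead * z4 - correction

def describe_kurtosis(g2):
    """Label the sign of G2 using the terms above."""
    if g2 == 0:
        return "mesokurtic"
    return "leptokurtic" if g2 > 0 else "platykurtic"

scores = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]
g2 = excess_kurtosis(scores)
print(round(g2, 4), describe_kurtosis(g2))  # -0.1518 platykurtic
```

In practice a small nonzero G2 such as this one is well within sampling noise for n = 10, so the label should be read as a description of the sample rather than a verdict on the population.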
MODALITY: “mode” is more than just
the “observation value” that occurs most frequently

Unimodal    Bimodal      Multimodal
one peak    two peaks    more than two peaks

exactly how "high" must the “nth" peak be to qualify as bi/multimodal rather than unimodal with some un-smooth regions?
A haphazard way to count the number of “modes” is to use the
multi-mode function in Excel

syntax in MS Excel:
= MODE.MULT (number1,[number2],…)

Returns an array of the most frequently occurring values in a data range
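For comparison, Python's standard library has a close analogue of this function, `statistics.multimode`. The short sketch below uses made-up data to show both what it returns and why the approach is haphazard for measured data.

```python
from statistics import multimode

# Discrete data: two values tie for the highest frequency.
ratings = [1, 2, 2, 3, 4, 4, 5]
print(multimode(ratings))        # [2, 4]

# Continuous measurements rarely repeat, so every value "ties" as a mode.
lengths = [4.2, 11.5, 7.3, 5.8, 6.4]
print(len(multimode(lengths)))   # 5
```

Counting repeated values therefore says little about the number of humps in a continuous distribution.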


Estimate multimodality by superimposing
Kernel Density Estimates on the distribution

REMEMBER, we are estimating the “probability density function” of the underlying distribution
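A kernel density estimate can be sketched in a few lines to show how the number of apparent peaks depends on the smoothing bandwidth. This is a minimal illustration with a Gaussian kernel; the function names, the data, the bandwidths and the 200-point grid are all assumptions made for the example.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the probability density at x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return density

def count_peaks(data, bandwidth, grid_points=200):
    """Count local maxima of the estimated density on an evenly spaced grid."""
    density = gaussian_kde(data, bandwidth)
    lo = min(data) - 3 * bandwidth
    hi = max(data) + 3 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    ys = [density(lo + i * step) for i in range(grid_points)]
    # ">=" on the right side counts a two-point flat top once instead of missing it.
    return sum(1 for i in range(1, grid_points - 1)
               if ys[i - 1] < ys[i] >= ys[i + 1])

two_groups = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
print(count_peaks(two_groups, bandwidth=0.5))  # narrow kernel: 2 peaks
print(count_peaks(two_groups, bandwidth=3.0))  # wide kernel: 1 peak
```

The same data look bimodal or unimodal depending on the bandwidth, which is one concrete form of the question of how "high" a peak must be to count.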
TEST:
“Normality tests” are the 1st inferential statistics you will employ

1. State the QUESTION
2. Formulate the NULL HYPOTHESIS
3. VISUALIZATION
4. TEST STATISTIC
5. SIGNIFICANCE
6. INFERENCE

Shapiro-Wilk Test
best “power” at the same significance level, compared to the other two tests.
not advised for samples beyond n > 2,000
The Shapiro-Wilk test is a test of normality in frequentist statistics. It was published in 1965 by Samuel Sanford Shapiro and Martin Wilk.

Jarque-Bera Test
best for testing normality of large samples, n > 2,000

Anderson-Darling Test
measures how well data fit a specified distribution (not just the normal distribution!)
TEST:
Purpose of usage is common to all three tests

1. State the QUESTION
Does my sample data set come from a population with a normal distribution?
TEST:
TAKE NOTE! Null hypotheses are different for the three tests

2. Formulate the NULL HYPOTHESIS

Shapiro-Wilk Test
H0: the population from which the sample is sourced is normally distributed.

Jarque-Bera Test
H0: skewness and excess kurtosis are both zero (0).

Anderson-Darling Test
H0: the data come from the specified distribution.
https://youtu.be/YrWzDc7VoWI
TEST:
Visualization is as important as the test statistics

Histogram with Kernel Density Estimates

Quantile-Quantile Plot
The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.
[Figure: q-q plots of normal data]

Normal Quantile-Quantile Plots:
your data set vs. theoretical normal distribution

Normally Distributed: points tend to fall in a straight line
Skewed: points form a curve instead of a straight line
Leptokurtic: points fall along a line in the middle of the graph, but curve off in the extremities
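The points of a normal q-q plot can be computed without any plotting library, which makes the "straight line" criterion concrete. The sketch below assumes the plotting positions (i - 0.5)/n, one common convention among several, and uses the correlation of the points as a rough straightness measure; the function names and data are hypothetical.

```python
from statistics import NormalDist, mean

def normal_qq_points(values):
    """Pair each sorted observation with the matching standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i - 0.5) / n keeps every probability strictly between 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the q-q points; 1.0 means a perfectly straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Idealized normal data (exact quantiles of a normal with mean 10, sd 2) vs. skewed data.
normal_like = [NormalDist(10, 2).inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10, 30]

print(round(straightness(normal_qq_points(normal_like)), 3))
print(round(straightness(normal_qq_points(skewed)), 3))
```

The skewed sample gives a visibly lower value because its points bend away from the line, matching the "points form a curve" description above.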
ASSIGNMENT:

Histogram / density plot using RStudio
# my_data is assumed to be a data frame with a numeric column len
library("ggpubr")
ggdensity(my_data$len,
          main = "Density plot of tooth length",
          xlab = "Tooth length")

Q-Q plot using RStudio
library("ggpubr")
ggqqplot(my_data$len)
library("car")
qqPlot(my_data$len)

Shapiro-Wilk using RStudio
shapiro.test(my_data$len)

Shapiro-Wilk normality test
data: my_data$len
W = 0.96743, p-value = 0.1091
TEST:
understanding of equation components is important!

Shapiro-Wilk
Jarque-Bera
Anderson-Darling
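As one example of reading a test statistic through its components, the Jarque-Bera statistic is built directly from skewness and excess kurtosis. The sketch below assumes the textbook form JB = n/6 × (S^2 + K^2/4) with moment-based S and K, and the large-sample chi-square reference distribution with 2 degrees of freedom, whose upper-tail probability is exp(-JB/2); the function name is hypothetical.

```python
import math

def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for H0: skewness = excess kurtosis = 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5            # S: asymmetry component
    excess_kurt = m4 / m2 ** 2 - 3   # K: tail-weight component
    jb = n / 6 * (skew ** 2 + excess_kurt ** 2 / 4)
    p_value = math.exp(-jb / 2)      # chi-square upper tail with 2 degrees of freedom
    return jb, p_value

jb, p = jarque_bera([1, 2, 3, 4, 5])
print(round(jb, 4), round(p, 4))
```

Because the chi-square reference is a large-sample approximation, the p-value from a sample this small is only illustrative, which is consistent with the advice to reserve this test for large samples.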
TEST:
Take note of the deeper meaning of p-values

p-value definition:
probability of getting a result as or more extreme than the observed result, assuming the null hypothesis is true.

It is not the probability of rejecting the null hypothesis.

DECISION FOR OBSERVED p-VALUE:
If the p-value is very small, less than or equal to the significance level (traditionally 5% or 1%), this suggests that the observed data are inconsistent with the assumption that the null hypothesis is true.
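The definition can be made concrete by simulation: generate many samples for which the null hypothesis is true and count how often they look at least as extreme as the observed one. The sketch below uses the absolute sample skewness as the "result" purely for illustration; the function names, the seed, the 2000 simulations and the data are assumptions, and this is not one of the three named tests.

```python
import random

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulated_p_value(values, simulations=2000, seed=0):
    """Share of normal samples (same n) whose |skewness| is at least the observed one."""
    rng = random.Random(seed)
    n = len(values)
    observed = abs(sample_skewness(values))
    hits = sum(
        1 for _ in range(simulations)
        if abs(sample_skewness([rng.gauss(0, 1) for _ in range(n)])) >= observed
    )
    return hits / simulations

def decide(p_value, alpha=0.05):
    """Decision rule from the slide: reject H0 when p <= significance level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

skewed = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 40]
print(decide(simulated_p_value(skewed)))
print(decide(0.1091))  # fail to reject H0
```

Applied to the Shapiro-Wilk output shown earlier (p-value = 0.1091), the rule gives "fail to reject H0" at the 5% level, which means the data are not shown to be inconsistent with normality, not that normality is proven.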
If a measurement variable does not fit a normal distribution or
has greatly different standard deviations in different groups, you
should try a data transformation.
TRANSFORM :
data transformations are important tools

“Playing around with data” VS. “re-expression”

Playing around with data: trying different transformations until one gives a significant result is cheating.

Re-expression: better to use transformations that other researchers commonly use in your field.
TRANSFORM :
Two most common
data transformations in Biology

Log transformation
often useful when there is a high degree of variation within variables or among attributes within a sample.

Square-root transformation
used for reducing right skewness, and also has the advantage that it can be applied to zero values.
TRANSFORM :
Log and square root transformations
have different advantages

Log transformation
compresses high values and spreads low values by expressing
the values as orders of magnitude.

Square-root transformation
can bring data from a Poisson (discrete) distribution closer to a
normal (continuous) distribution
The log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

X Log10(X)

1 0
10 1
100 2

reduce skewness
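The effect can be checked numerically on a small right-skewed data set. The sketch below uses Pearson's second skewness coefficient, 3 × (mean - median) / sd, as a simple measure; the function name and the data are made up for illustration.

```python
import math
from statistics import mean, median, stdev

def median_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean(values) - median(values)) / stdev(values)

# Values growing roughly by a constant factor: strongly right-skewed on the raw scale.
raw = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
logged = [math.log10(x) for x in raw]

print(round(median_skew(raw), 2))     # mean far above the median
print(round(median_skew(logged), 2))  # mean and median nearly equal
```

On the log scale each step is roughly equal, so the long right tail is pulled in and the mean moves back toward the median.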
The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values.

Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
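The advantage with zero values is easy to demonstrate on count data. A minimal sketch with made-up counts follows; `log10_or_none` is a hypothetical helper that only exists to show where the logarithm fails.

```python
import math

# Small counts including zeros, e.g. colonies per plate (made-up numbers).
counts = [0, 0, 1, 1, 1, 2, 2, 3, 4, 9]

rooted = [math.sqrt(c) for c in counts]   # defined for every count, including 0
print([round(r, 2) for r in rooted])

def log10_or_none(x):
    """Return log10(x), or None where the logarithm is undefined (x <= 0)."""
    try:
        return math.log10(x)
    except ValueError:
        return None

print(log10_or_none(0))  # None
```

The largest count shrinks from 9 to 3 while the zeros stay at 0, so the right tail is compressed without dropping any observation.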
TRANSFORM :
Data transformations suggested by
Tabachnick & Fidell (2007) and Howell (2007)

Moderate positive skewness

Substantial positive skewness

If with zero values:
newx = lg10(x + C)
C = a constant added to each score so that the smallest score is 1.
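The constant C follows directly from its definition: if the smallest score must become 1, then C = 1 - min(x). The sketch below applies newx = lg10(x + C) on that basis; the function name and data are hypothetical. The slide lists the moderate and substantial cases without formulas, so the commonly cited choices sqrt(x) and lg10(x) appear here only as a labeled assumption.

```python
import math

def log10_shifted(values):
    """Apply newx = log10(x + C), with C chosen so the smallest score becomes 1."""
    c = 1 - min(values)
    return [math.log10(v + c) for v in values], c

scores = [0, 3, 9, 99]
new_scores, c = log10_shifted(scores)
print(c, [round(v, 3) for v in new_scores])

# Assumption (not shown on the slide): with strictly positive scores the commonly
# cited choices are sqrt(x) for moderate and log10(x) for substantial positive skew.
positive = [1, 4, 9, 100]
moderate = [math.sqrt(v) for v in positive]
substantial = [math.log10(v) for v in positive]
```

With C = 1 the zero score maps to log10(1) = 0, so no observation is lost to an undefined logarithm.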
Moderate negative skewness

Substantial negative skewness

Negatively skewed data require a reflected transformation: each data point must be reflected, and then transformed.

k = constant from which each score is subtracted so that the smallest score = 1; usually equal to the largest score + 1.
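Reflection turns a long left tail into a long right tail, after which the positive-skew transformations apply. The sketch below reflects with k = largest score + 1 as described; applying sqrt for the moderate case and log10 for the substantial case is an assumption based on the commonly cited recommendations, since the slide does not show the formulas, and the function name and data are hypothetical.

```python
import math

def reflect(values):
    """Subtract each score from k = largest score + 1, so the smallest result is 1."""
    k = max(values) + 1
    return [k - v for v in values], k

# Left-skewed scores: most values are high, with one low straggler.
scores = [1, 7, 8, 9, 9, 10]
reflected, k = reflect(scores)
print(k, reflected)  # 11 [10, 4, 3, 2, 2, 1]

# Assumed follow-up transformations on the reflected scores.
moderate = [math.sqrt(r) for r in reflected]      # sqrt(k - x)
substantial = [math.log10(r) for r in reflected]  # log10(k - x)
```

Reflection reverses the ordering, so a high transformed value corresponds to a low original score and results must be interpreted accordingly.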


Heavy positive skewness

Heavy negative skewness

IF log transform does not normalize data: try inverse (1/x) transformation.

RELATED TO LOGIT transformation, applicable to proportions data.
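Both transformations mentioned here are one-liners. The sketch below shows the inverse transformation and, because the slide names it for proportions, the standard logit ln(p / (1 - p)); the function names and example values are hypothetical, and the slide does not spell out how the two are related.

```python
import math

def inverse(values):
    """newx = 1 / x; every score must be nonzero."""
    return [1 / v for v in values]

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))

print(inverse([1, 2, 4]))       # [1.0, 0.5, 0.25]
print(round(logit(0.9), 3))
```

The inverse reverses the ordering of positive scores (the largest value becomes the smallest), and the logit is undefined at exactly 0 or 1, so both need care before use.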
