Testing For Normality and Transforming Data

How to tell whether data is appropriate for parametric tests?
IF DATA normally distributed: can utilize Parametric Tests (POWERFUL & ROBUST)
IF DATA NOT normally distributed: limited to Non-Parametric Tests (LESS POWERFUL)
HOWEVER
Are all normal distributions the same?
Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
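A minimal sketch of how the two parameters change the curve, using base R's dnorm(). The particular values of μ and σ below are made up for illustration only.

```r
# Density of a normal distribution for a given mean (mu) and sd (sigma).
x <- seq(-10, 10, by = 0.1)

standard <- dnorm(x, mean = 0, sd = 1)   # mu = 0, sigma = 1
shifted  <- dnorm(x, mean = 3, sd = 1)   # same spread, centre moved to 3
wider    <- dnorm(x, mean = 0, sd = 3)   # same centre, three times the spread

# Peak location follows mu; peak height shrinks as sigma grows.
peak_x <- c(standard = x[which.max(standard)],
            shifted  = x[which.max(shifted)],
            wider    = x[which.max(wider)])
peak_height <- c(standard = max(standard),
                 shifted  = max(shifted),
                 wider    = max(wider))

plot(x, standard, type = "l", ylab = "density")
lines(x, shifted, lty = 2)
lines(x, wider, lty = 3)
```

All three curves are normal, yet they differ in location and spread, so "normal" describes a family of distributions rather than a single curve.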
ALSO
Visual assessment of a normal distribution is greatly limited by sample size
[Figure: histograms of samples of size n = 10, 15, 20, 25, 30 and 120]
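The effect of sample size on a histogram can be reproduced with simulated data. This is only a sketch: the seed and the population mean and standard deviation (50 and 10) are arbitrary assumptions, not values from the slides.

```r
set.seed(1)  # arbitrary seed, only so the figure is reproducible
sample_sizes <- c(10, 15, 20, 25, 30, 120)

# Every sample is drawn from the SAME normal population
# (mean 50 and sd 10 are made-up values).
samples <- lapply(sample_sizes, function(n) rnorm(n, mean = 50, sd = 10))
names(samples) <- paste0("n=", sample_sizes)

# One histogram per sample size, in a 2 x 3 grid.
old_par <- par(mfrow = c(2, 3))
for (label in names(samples)) {
  hist(samples[[label]], main = label, xlab = "value")
}
par(old_par)
```

Small samples from a truly normal population often look lumpy or lopsided, which is why a histogram alone is a weak basis for judging normality.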

SOLUTION:
Maximize possibility of parametric testing, but only if data is appropriate
Test “normality” of data distribution
Transform data if needed


Parametric tests: hypothesis tests which provide generalizations for making statements about the mean of the parent population
- assume underlying statistical distributions in the data
- appropriate if each sample follows a normal distribution and if sample variances are homogeneous
- will have more statistical power when these assumptions hold
Nonparametric tests: hypothesis tests which are not based on underlying distributional assumptions, i.e. they do not require the population’s distribution to be denoted by specific parameters
- are also called distribution-free tests because they don’t assume that your data follow a specific distribution
ASSIGNMENT: Give examples of data sets/analyses using the 1-sample t test, 2-sample t test, 1-way ANOVA, 2-way ANOVA and chi-square test.
1. Variables to be measured (e.g. zone of bacterial inhibition, difference of gene expression in normal and cancer groups)
2. Results and interpretation/conclusion (include graph/table)
3. Title of the journal/reference
4. Submit hard copy
TEST :
“shape” of distribution can be measured
SKEWNESS
“asymmetry”
KURTOSIS
“height of hump”
MODALITY
“no. of humps”
SKEWNESS: how sample differs in shape from a symmetrical distribution
Normal           Skewed right         Skewed left
symmetric        longer right tail    longer left tail
Mean = Median    Mean > Median        Mean < Median

Many software programs compute the adjusted Fisher-Pearson coefficient of SKEWNESS
sample skewness = G1
includes an adjustment for sample size
syntax in MS Excel: = SKEW (number1,[number2],…)
G1 = zero (or near 0) : symmetric
G1 between -1 and +1 : moderate (+/-) skew
G1 beyond -1 or +1 : substantial (+/-) skew
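Base R has no built-in skewness function, so the sketch below computes G1 by hand as n / ((n - 1)(n - 2)) × Σ((x - mean) / s)³, the formula Excel documents for SKEW. The function name sample_skewness is hypothetical, and the data are Excel's own documentation example.

```r
# Adjusted Fisher-Pearson sample skewness (G1).
# sample_skewness is a hypothetical helper name, not part of any package.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 3) stop("need at least 3 observations")
  z <- (x - mean(x)) / sd(x)          # sd() uses the n - 1 denominator
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

# Example values from the Excel SKEW documentation.
excel_example <- c(3, 4, 5, 2, 3, 4, 5, 6, 4, 7)
sample_skewness(excel_example)        # about 0.3595, a mild positive skew

sample_skewness(c(1, 2, 3, 4, 5))     # symmetric data: 0 up to rounding error
```

Packaged implementations exist (for example in the e1071 and moments packages), but their defaults do not all use the same small-sample adjustment, so results can differ slightly from Excel's.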

KURTOSIS: extent to which data are distributed in the “tails vs. center” of distribution
Mesokurtic       Leptokurtic    Platykurtic
standard bell    more peaked    flatter
actual “kurtosis” of a normal distribution is 3.
excess KURTOSIS: actual value - 3
sample kurtosis = G2 (an excess kurtosis, so the normal reference value is 0)
syntax in MS Excel: = KURT (number1,[number2],…)
G2 = zero (0) : mesokurtic, “normal”
G2 = + value : leptokurtic, “fat in tails”
G2 = - value : platykurtic, “thin in tails”
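As with skewness, base R has no kurtosis function. The sketch below follows the formula Excel documents for KURT, which already subtracts the normal reference value and therefore returns an excess kurtosis. The function name sample_excess_kurtosis is hypothetical.

```r
# Sample excess kurtosis (G2), following the formula documented for Excel's KURT.
# sample_excess_kurtosis is a hypothetical helper name.
sample_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 4) stop("need at least 4 observations")
  z <- (x - mean(x)) / sd(x)          # sd() uses the n - 1 denominator
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

# Example values from the Excel KURT documentation.
excel_example <- c(3, 4, 5, 2, 3, 4, 5, 6, 4, 7)
sample_excess_kurtosis(excel_example)   # about -0.1518, slightly platykurtic
```

A value near zero is what a normal sample would give; the sign tells which way the tails depart from that reference.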
MODALITY: “mode” is more than just the “observation value” that occurs most frequently
Unimodal    Bimodal      Multimodal
one peak    two peaks    more than two peaks
Exactly how "high" must the "nth" peak be to qualify as bi/multimodal rather than unimodal with some un-smooth regions?
A haphazard way to count the number of “modes” is the multi-mode function in Excel
syntax in MS Excel: = MODE.MULT (number1,[number2],…)
Provides an array of the most frequently occurring values from a data range
Estimate multimodality by superimposing Kernel Density Estimates on the distribution
REMEMBER, we are estimating the “probability density function” of the underlying distribution
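One possible way to look for multiple peaks is to superimpose a kernel density estimate on the histogram and count its local maxima. This is a sketch only: the data are simulated, and the 10% height cut-off for ignoring small bumps is an arbitrary assumption, not an established rule.

```r
set.seed(42)
# Made-up bimodal sample: a mixture of two well-separated normal groups.
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1))

kde <- density(x)   # Gaussian kernel with the default bandwidth

# A local maximum is where the slope of the estimate turns from + to -.
slope_sign <- sign(diff(kde$y))
peak_index <- which(diff(slope_sign) < 0) + 1
peaks <- data.frame(x = kde$x[peak_index], height = kde$y[peak_index])

# Ignore tiny bumps: keep peaks at least 10% as high as the tallest one
# (the 10% cut-off is an arbitrary illustration).
peaks <- peaks[peaks$height >= 0.10 * max(peaks$height), ]

hist(x, freq = FALSE, breaks = 30,
     main = "Histogram with kernel density estimate")
lines(kde, lwd = 2)
nrow(peaks)   # number of retained peaks
```

The answer depends on the bandwidth and on the cut-off, which is exactly the difficulty raised above: there is no single agreed threshold for when a bump counts as a mode.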
TEST:
“Normality tests” are the 1st inferential statistics you will employ
Shapiro-Wilk Test
Jarque-Bera Test
Anderson-Darling Test
Steps:
1. State the QUESTION
2. Formulate the NULL HYPOTHESIS
3. VISUALIZATION
4. TEST STATISTIC
5. SIGNIFICANCE
6. INFERENCE
TEST:
Shapiro-Wilk Test
- a test of normality in frequentist statistics, published in 1965 by Samuel Sanford Shapiro and Martin Wilk
- best “power” at the same significance level compared to the other two tests
- not advised for samples larger than n = 2,000
Jarque-Bera Test
- best for testing normality of large samples, n > 2,000
Anderson-Darling Test
- measures how well data fit a specific distribution (not just normal distribution!)
TEST:
Purpose of usage is common to all three tests
QUESTION: Does my sample data set come from a population with a normal distribution?
TEST:
TAKE NOTE! Null hypotheses are different for the three tests
Shapiro-Wilk Test
H0: population from which sample is sourced is normally distributed.
Jarque-Bera Test
H0: skewness and excess kurtosis are both zero (0).
Anderson-Darling Test
H0: data comes from specified distribution.
https://youtu.be/YrWzDc7VoWI
TEST:
Visualization is as important as the test statistics
Histogram with Kernel Density Estimates
Quantile-Quantile Plot
The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.
Normal Quantile-Quantile Plots: your data set vs. theoretical normal distribution
[Figure: q-q plots of normal data]
Normally Distributed: points tend to fall in a straight line
Skewed: points form a curve instead of a straight line
Leptokurtic: points fall along a line in the middle of the graph, but curve off in the extremities
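The three q-q patterns can be reproduced with simulated data and base R's qqnorm() and qqline(). This is a sketch under assumptions: the seed, the sample size and the choice of an exponential and a Student t (3 df) distribution as stand-ins for "skewed" and "leptokurtic" are illustrative only.

```r
set.seed(7)   # arbitrary seed for reproducibility
n <- 200

normal_data     <- rnorm(n)         # normally distributed
skewed_data     <- rexp(n)          # right-skewed (exponential)
heavy_tail_data <- rt(n, df = 3)    # leptokurtic (Student t, 3 df)

# Sample quantiles against theoretical normal quantiles, with a reference line.
old_par <- par(mfrow = c(1, 3))
qqnorm(normal_data, main = "Normally distributed")
qqline(normal_data)
qqnorm(skewed_data, main = "Skewed")
qqline(skewed_data)
qqnorm(heavy_tail_data, main = "Leptokurtic")
qqline(heavy_tail_data)
par(old_par)
```

In the first panel the points should hug the reference line; in the second they bend away along a curve; in the third the middle follows the line while both ends peel off.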
ASSIGNMENT:
Histogram / density plot using RStudio
  # assumption: my_data is R's built-in ToothGrowth data set (len = tooth length)
  my_data <- ToothGrowth
  library("ggpubr")
  ggdensity(my_data$len, main = "Density plot of tooth length", xlab = "Tooth length")
Q-Q plot using RStudio
  library("ggpubr")
  ggqqplot(my_data$len)
  library("car")
  qqPlot(my_data$len)
Shapiro-Wilk using RStudio
  shapiro.test(my_data$len)
Output:
  Shapiro-Wilk normality test
  data: my_data$len
  W = 0.96743, p-value = 0.1091
TEST:
Understanding of equation components is important!
Shapiro-Wilk
Jarque-Bera
Anderson-Darling
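Of the three statistics, Jarque-Bera is the simplest to compute by hand, so it serves here as a worked example: JB = n/6 × (S² + K²/4), where S is the moment-based skewness and K the moment-based excess kurtosis, compared against a chi-square distribution with 2 degrees of freedom. The function name jarque_bera is hypothetical, and using the built-in ToothGrowth data is an assumption made for illustration.

```r
# Jarque-Bera statistic computed from its components.
# jarque_bera is a hypothetical helper name.
jarque_bera <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)                     # central moments with denominator n
  m3 <- mean(d^3)
  m4 <- mean(d^4)
  skew <- m3 / m2^1.5                 # moment-based skewness
  excess_kurt <- m4 / m2^2 - 3        # moment-based excess kurtosis
  statistic <- n / 6 * (skew^2 + excess_kurt^2 / 4)
  p_value <- pchisq(statistic, df = 2, lower.tail = FALSE)
  c(statistic = statistic, p_value = p_value)
}

jarque_bera(ToothGrowth$len)

# Shapiro-Wilk W is available directly in base R.
shapiro.test(ToothGrowth$len)$statistic
```

The statistic grows whenever skewness or excess kurtosis moves away from zero, which is exactly the null hypothesis the test examines. The chi-square approximation is a large-sample result, so the p-value is rough for small n. Packaged versions of Jarque-Bera and Anderson-Darling exist in add-on packages such as tseries and nortest, if those are installed.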
TEST:
Take note of the deeper meaning of p-values
p-value definition: probability of getting a result as or more extreme than the observed result, assuming the null hypothesis is true.
It is not the probability of rejecting the null hypothesis.
DECISION FOR OBSERVED p-VALUE:
If the p-value is very small, less than or equal to the significance level (traditionally 5% or 1%), this suggests that the observed data are inconsistent with the assumption that the null hypothesis is true.
If the p-value is larger than the significance level, the null hypothesis is not rejected; this does not prove that the data are normally distributed.
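The decision rule can be written out for the tooth-length example. The sketch assumes that data set is R's built-in ToothGrowth, which is consistent with the W = 0.96743, p-value = 0.1091 output shown in the assignment, and that a 5% significance level was chosen in advance.

```r
alpha <- 0.05   # significance level, chosen BEFORE looking at the result

# Assumption: the assignment's tooth-length data is the built-in ToothGrowth.
result <- shapiro.test(ToothGrowth$len)

if (result$p.value <= alpha) {
  decision <- "reject H0: data are inconsistent with a normal population"
} else {
  decision <- "fail to reject H0: no evidence against normality"
}

result$p.value   # about 0.109
decision
```

Because 0.109 is larger than 0.05, the null hypothesis of normality is not rejected for these data, so a parametric test would be a defensible choice here.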
SOLUTION:

If a measurement variable does not fit a normal distribution or
has greatly different standard deviations in different groups, you
should try a data transformation.
TRANSFORM :
data transformations are important tools
“Playing around with data” VS. “re-expression”
Trying different transformations until one gives a significant result is cheating.
Better to use transformations that other researchers commonly use in your field.
TRANSFORM :
Two most common data transformations in Biology
Log transformation
often useful when there is a high degree of variation within variables or among attributes within a sample.
Square-root transformation
used for reducing right skewness, and has the advantage that it can be applied to zero values.
TRANSFORM :
Log and square root transformations have different advantages
Log transformation
compresses high values and spreads low values by expressing the values as orders of magnitude.
Square-root transformation
can bring count data from a Poisson (discrete) distribution closer to a normal (continuous) distribution
The log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.
X      Log10(X)
1      0
10     1
100    2
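A short sketch of the log transformation reducing skewness, using simulated right-skewed data. The log-normal parameters and the seed are made-up values, and skew below is a simple moment-based helper defined only for this comparison.

```r
set.seed(3)   # arbitrary seed for reproducibility
# Made-up right-skewed data: log-normal values.
x <- rlnorm(500, meanlog = 2, sdlog = 1)
log_x <- log10(x)

# Simple moment-based skewness, used only to compare before and after.
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

c(raw = skew(x), log10 = skew(log_x))   # skewness shrinks after the transform

log10(c(1, 10, 100))                    # 0 1 2: one unit per order of magnitude

old_par <- par(mfrow = c(1, 2))
hist(x, main = "Raw values")
hist(log_x, main = "log10 values")
par(old_par)
```

The transformed histogram should look far more symmetric than the raw one, because the long right tail is pulled in while the small values are spread apart.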
The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and has the advantage that it can be applied to zero values.
Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
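For counted data, one practical effect of the square root is that it evens out the spread between groups with different means. The sketch below uses simulated Poisson counts; the two means (4 and 25), the sample size and the seed are made-up values.

```r
set.seed(11)   # arbitrary seed for reproducibility
# Made-up count data from two groups with different means.
low  <- rpois(5000, lambda = 4)
high <- rpois(5000, lambda = 25)

# Raw counts: the variance grows with the mean (for Poisson, variance = mean).
raw_var <- c(low = var(low), high = var(high))

# After the square root, the two variances are far closer to each other.
sqrt_var <- c(low = var(sqrt(low)), high = var(sqrt(high)))

raw_var
sqrt_var

sqrt(0)   # 0: defined at zero, whereas log(0) is -Inf
```

The raw variances differ several-fold, while the transformed ones are of similar size, which helps with the "greatly different standard deviations in different groups" problem as well as with skewness.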
TRANSFORM :
Data transformations suggested by Tabachnick & Fidell (2007) and Howell (2007)
Moderate positive skewness
newx = sqrt(x)
Substantial positive skewness
newx = lg10(x)
If with zero values
newx = lg10(x + C)
C = a constant added to each score so that the smallest score is 1.
TRANSFORM :
Negatively skewed data requires a reflected transformation: each data point must be reflected, and then transformed
Moderate negative skewness
newx = sqrt(K - x)
Substantial negative skewness
newx = lg10(K - x)
K = constant from which each score is subtracted so that the smallest score = 1; usually equal to largest score + 1.

TRANSFORM :
Heavy positive skewness
IF log transform does not normalize data: try inverse (1/x) transformation.
Heavy negative skewness
newx = 1/(K - x)
Related: LOGIT transformation, applicable to proportions data.
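The suggestions above can be gathered into one helper. This is a sketch, assuming the usual pairing attributed to Tabachnick & Fidell: square root for moderate, log10 for substantial and inverse for heavy skewness, with reflection (K = largest score + 1) for negative skew and a constant C added when a log or inverse would meet values below 1. The function name transform_skewed and the example scores are hypothetical.

```r
# Hypothetical helper (not from the source): picks a transformation from the
# direction and degree of skewness.
transform_skewed <- function(x,
                             direction = c("positive", "negative"),
                             degree = c("moderate", "substantial", "heavy")) {
  direction <- match.arg(direction)
  degree <- match.arg(degree)
  if (direction == "negative") {
    x <- (max(x) + 1) - x        # reflect with K = largest score + 1
  } else if (degree != "moderate" && min(x) <= 0) {
    x <- x + (1 - min(x))        # add C so that the smallest score is 1
  }
  switch(degree,
         moderate    = sqrt(x),
         substantial = log10(x),
         heavy       = 1 / x)
}

scores <- c(0, 1, 1, 2, 3, 5, 8, 20)   # made-up, positively skewed, with a zero

transform_skewed(scores, "positive", "moderate")      # sqrt(x)
transform_skewed(scores, "positive", "substantial")   # log10(x + C), here C = 1

# The mirror image of the same scores is negatively skewed;
# reflecting it first gives the same transformed values.
mirrored <- max(scores) - scores
transform_skewed(mirrored, "negative", "substantial")
```

Reflection reverses the order of the scores, and so does the inverse, so the direction of any effect has to be read accordingly after these transformations. The choice of degree should be made from the field's conventions, not by trying each option until a test turns significant.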
