
SW388R7
Data Analysis & Computers II

Slide 1
Assumption of normality

Assumption of normality

Transformations

Assumption of normality script

Practice problems

Slide 2
Assumption of Normality
Many of the statistical methods that we will apply
require the assumption that a variable or variables
are normally distributed.

With multivariate statistics, the assumption is that
the combination of variables follows a multivariate
normal distribution.

Since there is not a direct test for multivariate
normality, we generally test each variable
individually and assume that they are multivariate
normal if they are individually normal, though this is
not necessarily the case.

Slide 3
Evaluating normality
There are both graphical and statistical methods for
evaluating normality.

Graphical methods include the histogram and
normality plot.

Statistical methods include diagnostic hypothesis
tests for normality, and a rule of thumb that says a
variable is reasonably close to normal if its skewness
and kurtosis have values between -1.0 and +1.0.

None of the methods is absolutely definitive.
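
The rule of thumb can be sketched in a few lines of Python. This is a minimal illustration, not the course's procedure: the function names are hypothetical, and the bias-adjusted sample formulas used here are assumed to match what a statistics package reports for skewness and (excess) kurtosis.

```python
import math


def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) using bias-adjusted sample formulas."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if sd == 0:
        raise ValueError("all values are identical")
    z = [(x - mean) / sd for x in values]
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


def within_rule_of_thumb(skew, kurt, limit=1.0):
    """True when both statistics fall between -limit and +limit."""
    return abs(skew) <= limit and abs(kurt) <= limit
```

A variable passes the rule of thumb only when both statistics are inside the band; failing either one is enough to treat it as not reasonably close to normal.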

Slide 4
Transformations
When a variable is not normally distributed, we can
create a transformed variable and test it for
normality. If the transformed variable is normally
distributed, we can substitute it in our analysis.

Three common transformations are: the logarithmic
transformation, the square root transformation, and
the inverse transformation.

All of these change the measuring scale on the
horizontal axis of a histogram to produce a
transformed variable that is mathematically
equivalent to the original variable.
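
As a rough sketch of the three transformations, assuming strictly positive data (the logarithm and the inverse are undefined at zero) and a hypothetical function name:

```python
import math


def transform(values):
    """Return the log10, square root, and inverse of each value."""
    if any(x <= 0 for x in values):
        # Assumption: we require positive values so all three are defined.
        raise ValueError("all values must be greater than zero")
    return {
        "log10": [math.log10(x) for x in values],
        "sqrt": [math.sqrt(x) for x in values],
        "inverse": [1 / x for x in values],
    }
```

Each transformed list would then be tested for normality in the same way as the original variable.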

Slide 5
When transformations do not work
When none of the transformations induces normality
in a variable, including that variable in the analysis
will reduce our effectiveness at identifying statistical
relationships, i.e. we lose power.

We do have the option of changing the way the
information in the variable is represented, e.g.
substitute several dichotomous variables for a single
metric variable.
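
One way to picture the substitution is dummy coding. The sketch below is only an illustration: the function name and the cut points are hypothetical, and the lowest interval is treated as the reference category (all indicators zero).

```python
from bisect import bisect_right


def dummy_code(values, cut_points):
    """Replace each metric value with 0/1 indicators, one per non-reference interval."""
    cuts = sorted(cut_points)
    rows = []
    for x in values:
        # 0 = below the first cut point (reference), 1 = next interval, and so on.
        category = bisect_right(cuts, x)
        rows.append([1 if category == k else 0 for k in range(1, len(cuts) + 1)])
    return rows
```

With cut points of 5 and 20 hours, for example, each case gets two dichotomous variables: one for 5 up to 20 hours and one for 20 hours or more.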

Slide 6
Problem 1
In the dataset GSS2000.sav, is the following
statement true, false, or an incorrect application of
a statistic? Use 0.01 as the level of significance.

Based on a diagnostic hypothesis test of normality,
total hours spent on the Internet is normally
distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

Slide 7
Computing Explore descriptive statistics
To compute the statistics
needed for evaluating the
normality of a variable, select
the Explore command from
the Descriptive Statistics
menu.

Slide 8
Adding the variable to be evaluated
First, click on the
variable to be included
in the analysis to
highlight it.
Second, click on the right
arrow button to move
the highlighted variable
to the Dependent List.

Slide 9
Selecting statistics to be computed
To select the statistics for the
output, click on the
Statistics command button.

Slide 10
Including descriptive statistics
First, click on the
Descriptives checkbox
to select it. Clear the
other checkboxes.
Second, click on the
Continue button to
complete the request for
statistics.

Slide 11
Selecting charts for the output
To select the diagnostic charts
for the output, click on the
Plots command button.

Slide 12
Including diagnostic plots and statistics
First, click on the
None option button
on the Boxplots panel
since boxplots are not
as helpful as other
charts in assessing
normality.
Second, click on the
Normality plots with tests
checkbox to include
normality plots and the
hypothesis tests for
normality.
Third, click on the Histogram
checkbox to include a
histogram in the output. You
may want to examine the
stem-and-leaf plot as well,
though I find it less useful.

Finally, click on the
Continue button to
complete the request.

Slide 13
Completing the specifications for the analysis
Click on the OK button to
complete the specifications
for the analysis and request
SPSS to produce the
output.

Slide 14
[Figure: histogram of TOTAL TIME SPENT ON THE INTERNET; x-axis 0.0 to 100.0, y-axis Frequency (0 to 50). Std. Dev = 15.35, Mean = 10.7, N = 93]
The histogram
An initial impression of the
normality of the distribution
can be gained by examining
the histogram.

In this example, the
histogram shows a substantial
violation of normality caused
by an extremely large value in
the distribution.

Slide 15
[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET; x-axis Observed Value (-40 to 120), y-axis Expected Normal (-3 to 3)]
The normality plot
The problem with the normality of this
variable's distribution is reinforced by the
normality plot.

If the variable were normally distributed,
the red dots would fit the green line very
closely. In this case, the red points in the
upper right of the chart indicate the
severe skewing caused by the extremely
large data values.
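
The points on a normality plot can be approximated as follows. This is a sketch under stated assumptions: the function name is hypothetical, Blom's plotting positions are one common choice and may differ from the software's setting, and tied values are not given any special handling.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair each sorted observed value with its expected standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, observed in enumerate(sorted(values), start=1):
        # Blom's plotting position (an assumption, see the note above).
        p = (rank - 0.375) / (n + 0.25)
        points.append((observed, std_normal.inv_cdf(p)))
    return points
```

For a normally distributed variable these pairs fall close to a straight line; a few very large observed values pull the upper points away from it.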

Slide 16
Tests of Normality

                                    Kolmogorov-Smirnov(a)       Shapiro-Wilk
                                    Statistic   df   Sig.       Statistic   df   Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93   .000       .606        93   .000

a. Lilliefors Significance Correction
The test of normality
Problem 1 asks about the results of the test of normality. Since the sample
size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample
size were 50 or less, we would use the Shapiro-Wilk statistic instead.

The null hypothesis for the test of normality states that the actual
distribution of the variable is equal to the expected distribution, i.e., the
variable is normally distributed. Since the probability associated with the
test of normality (< 0.001) is less than or equal to the level of significance
(0.01), we reject the null hypothesis and conclude that total hours spent on
the Internet is not normally distributed. (Note: we report the probability as
<0.001 instead of .000 to be clear that the probability is not really zero.)

The answer to problem 1 is false.
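
The reasoning for problem 1 can be written as a small decision sketch. The function names are hypothetical, and because the table prints the probability as .000, we pass 0.001 as an upper bound on it rather than an exact value.

```python
def choose_normality_test(n):
    """Pick the test by sample size, following the guideline above."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"


def is_normal_by_test(p_value, alpha):
    """Reject normality when p <= alpha; otherwise fail to reject it."""
    return p_value > alpha


test_name = choose_normality_test(93)          # sample size from the table
# 0.001 is an upper bound on the reported probability, not the exact value.
normal = is_normal_by_test(0.001, alpha=0.01)
```

Here `normal` is `False`, so the statement in problem 1 is false.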


Slide 17
The assumption of normality script
An SPSS script to produce all
of the output that we have
produced manually is
available on the course web
site.

After downloading the script,
run it to test the assumption
of normality.
Select Run Script
from the Utilities
menu.

Slide 18
Selecting the assumption of normality script
First, navigate to the folder containing your
scripts and highlight the
NormalityAssumptionAndTransformations.SBS
script.
Second, click on
the Run button to
activate the script.

Slide 19
Specifications for normality script
First, move variables from
the list of variables in the
data set to the Variables to
Test list box.
Second, note that the default output is to do all of the
transformations of the variable. To
exclude some transformations from the
calculations, clear the checkboxes.
Third, click on the OK
button to run the script.

Slide 20
Tests of Normality

                                    Kolmogorov-Smirnov(a)       Shapiro-Wilk
                                    Statistic   df   Sig.       Statistic   df   Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93   .000       .606        93   .000

a. Lilliefors Significance Correction
The test of normality
The script produces the same output that we
computed manually; in this example, the tests
of normality.


Slide 21
Problem 2
In the dataset GSS2000.sav, is the following
statement true, false, or an incorrect application of
a statistic?

Based on the rule of thumb for the allowable
magnitude of skewness and kurtosis, total hours
spent on the Internet is normally distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

Slide 22
Descriptives: TOTAL TIME SPENT ON THE INTERNET

                                                Statistic   Std. Error
Mean                                            10.731      1.5918
95% Confidence Interval for Mean  Lower Bound   7.570
                                  Upper Bound   13.893
5% Trimmed Mean                                 8.295
Median                                          5.500
Variance                                        235.655
Std. Deviation                                  15.3511
Minimum                                         .2
Maximum                                         102.0
Range                                           101.8
Interquartile Range                             10.200
Skewness                                        3.532       .250
Kurtosis                                        15.614      .495
Table of descriptive statistics
To answer problem
2, we look at the
values for skewness
and kurtosis in the
Descriptives table.
The skewness (3.532) and kurtosis (15.614) for the variable both exceed the rule of
thumb limit of 1.0. The variable is not normally distributed.

The answer to problem 2 is false.

Slide 23
Problem 3
In the dataset GSS2000.sav, is the following
statement true, false, or an incorrect application of
a statistic? Use 0.01 as the level of significance.

Based on a diagnostic hypothesis test of normality,
"total hours spent on the Internet" is not normally
distributed. A logarithmic transformation of "total
hours spent on the Internet" results in a variable that
is normally distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

Slide 24
Tests of Normality

                                        Kolmogorov-Smirnov(a)       Shapiro-Wilk
                                        Statistic   df   Sig.       Statistic   df   Sig.
Logarithm of NETIME [LG10(NETIME)]      .047        93   .200*      .994        93   .951
Square Root of NETIME [SQRT(NETIME)]    .118        93   .003       .868        93   .000
Inverse of NETIME [1/(NETIME)]          .288        93   .000       .495        93   .000

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
The test of normality
Problem 3 specifically asks about the results of the test of
normality for the logarithmic transformation. Since our sample
size is larger than 50, we use the Kolmogorov-Smirnov test.

The null hypothesis for the Kolmogorov-Smirnov test of normality
states that the actual distribution of the transformed variable is
equal to the expected distribution, i.e., the transformed variable
is normally distributed. Since the probability associated with the
test of normality (0.200) is greater than the level of significance (0.01),
we fail to reject the null hypothesis and conclude that the
logarithmic transformation of total hours spent on the Internet is
normally distributed.

The answer to problem 3 is true.

Slide 25
Other problems on assumption of normality
A problem may ask about the assumption of normality
for a nominal level variable. The answer will be
"Incorrect application of a statistic" since there is
no expectation that a nominal variable be normal.

A problem may ask about the assumption of normality
for an ordinal level variable. If the variable or
transformed variable is normal, the correct answer to
the question is "True with caution" since we may be
required to defend treating an ordinal variable as
metric.

Questions will specify a level of significance to use and
the statistical evidence upon which you should base
your answer.

Slide 26
Steps in answering questions about the
assumption of normality: question 1
The following is a guide to the decision process for answering
problems about the normality of a variable:
1. Is the variable to be evaluated metric?
   No: Incorrect application of a statistic
   Yes: go to step 2
2. Does the statistical evidence support the normality assumption?
   No: False
   Yes: go to step 3
3. Are any of the metric variables ordinal level?
   Yes: True with caution
   No: True
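
The decision process for a single variable can be sketched as a function. The function name and the three level labels ("nominal", "ordinal", "metric") are hypothetical, and an ordinal variable is assumed to pass the "metric" check with a caution attached, as described on the previous slide.

```python
def answer_normality_question(level, evidence_supports_normality):
    """Return the answer choice for a question about the normality of a variable."""
    if level == "nominal":
        # A nominal variable is not expected to be normal.
        return "Incorrect application of a statistic"
    if not evidence_supports_normality:
        return "False"
    # Treating an ordinal variable as metric may need to be defended.
    return "True with caution" if level == "ordinal" else "True"
```

For problem 1, `answer_normality_question("metric", False)` gives "False".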

Slide 27
Steps in answering questions about the
assumption of normality: question 2
The following is a guide to the decision process for answering
problems about the normality of a transformation:
1. Is the variable to be evaluated metric?
   No: Incorrect application of a statistic
   Yes: go to step 2
2. Statistical evidence supports normality?
   Yes: False
   No: go to step 3
3. Statistical evidence for transformation supports normality?
   No: False
   Yes: go to step 4
4. Either variable ordinal level?
   Yes: True with caution
   No: True
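
The decision process for a transformation can be sketched the same way. The function name and level labels are hypothetical. The sketch also assumes that when the original variable is already normal the answer is "False", because problems of this kind state that the original variable is not normally distributed.

```python
def answer_transformation_question(level, original_is_normal, transformed_is_normal):
    """Return the answer choice for a question about a transformed variable."""
    if level == "nominal":
        return "Incorrect application of a statistic"
    if original_is_normal:
        # Assumption: the problem claims the original variable is not normal.
        return "False"
    if not transformed_is_normal:
        return "False"
    return "True with caution" if level == "ordinal" else "True"
```

For problem 3, `answer_transformation_question("metric", False, True)` gives "True".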
