INTRODUCTION TO SIMPLE STATISTICS
THE SQUIRREL, THE NUT AND THE NONPARAMETRIC SOLUTION
Version 1.1 (August 1993)
Carlos Drews
University of Cambridge
Dept. of Zoology, Downing Street
Cambridge CB2 3EJ, U.K.
Aim of the seminar:
The aim of the seminar is to provide an overview of basic
concepts in statistics and familiarize the audience with the
value and applications of nonparametric statistical
procedures. The presentation is restricted to the essential
facts, leaving out formulae as well as mathematical
derivations of the central notions. The seminar will
unavoidably suffer from oversimplification. Nonetheless, it
should encourage the audience to start reading and learning
about statistics, which can be a fascinating subject in its
own right in addition to being a powerful and necessary tool
in research.
Recommended reading:
Siegel, S. & N.J. Castellan, Jr. (1988) Nonparametric statistics
for the behavioral sciences, 2nd Edition, McGraw-Hill, Singapore
Huff, D. (1954) How to lie with statistics, Penguin Books, London
PART I. BASIC CONCEPTS
1 Why statistics?
Chance is a major determinant of the outcome of a sampling
procedure. Statistical analysis of data is a quantification of the
possible effect of chance. It enables an objective distinction
between what could be attributed to chance and a "true" effect,
such as a difference, a similarity or an association. Descriptive
statistical parameters, which describe e.g. the central tendency
or the spread of a data set, characterize a sample in a way that
enables comparisons with other samples or with values which would
be expected according to specific theoretical premises.
2 Characterizing distributions
The distribution of data is described by how often each value
occurs among the possible range of values of a scale.
Example 1: Distribution of nut sizes from a tropical tree grown in
a botanical garden in London and in a Brazilian forest patch.
2.1 Various distributions
(Note: Figures 1-7 are to be drawn on the board)
Figure 1. Nut size in botanical garden (normal)
Figure 2. Nut size in Brazil where squirrels are common (left skew,
right skew, bimodal, even, I Don’t Know)
Figure 3. Nut sizes of two tree species in Botanical Garden (normal
distributions with different variance and central tendency)
2.2 Describing distributions in statistical terms
- "average" is a neutral term for the central tendency; the
specific statistic used depends strictly on the distribution
Figure 4. The Normal distribution, the mean and standard deviation
Figure 5. Other measures of central tendency and spread: the mode,
the median and interquartile ranges
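The measures named in Figures 4 and 5 can be tried out on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the nut sizes are invented for illustration and include one very large nut to show how the mean and the median react differently to an extreme value.

```python
import statistics

# Hypothetical nut diameters in millimetres (invented values, not real data).
nut_sizes = [12, 13, 13, 14, 14, 14, 15, 15, 16, 30]

mean = statistics.mean(nut_sizes)        # pulled upwards by the one large nut
sd = statistics.stdev(nut_sizes)         # sample standard deviation
median = statistics.median(nut_sizes)    # middle of the ranked values
mode = statistics.mode(nut_sizes)        # most frequent value

# Quartiles split the ranked data into four parts; the interquartile
# range is the distance between the first and the third quartile.
q1, q2, q3 = statistics.quantiles(nut_sizes, n=4)
iqr = q3 - q1

print("mean:", mean, "standard deviation:", round(sd, 2))
print("median:", median, "mode:", mode, "interquartile range:", iqr)
```

In this skewed sample the mean lies above the median, which is why the mean and standard deviation are best reserved for roughly normal distributions.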
3 Levels of measurement
3.1 Nominal or categorical
E.g. the different kinds of seeds eaten by the squirrel (even if
numeric codes are used for seed kinds), sex of squirrel.
Values cannot be ordered or ranked.
3.2 Ordinal or ranking
There is a scale underlying the data, values can be ranked.
Continuous scale: weather conditions during foraging
Discontinuous scale: age classes infant, juvenile, adult
Use median (or mode) to describe central tendency, because it is
insensitive to changes in the numerical assignments to scores. The
median always remains in the middle of the distribution, while the
mean does not.
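The insensitivity of the median to the numerical assignment can be checked directly. The sketch below is an illustration only: the observations and both numeric codings are invented, and any coding that preserves the order infant < juvenile < adult would serve equally well.

```python
import statistics

# Hypothetical age classes recorded for nine squirrels (invented data).
observations = ["infant", "juvenile", "juvenile", "adult", "adult",
                "adult", "adult", "infant", "juvenile"]

# Two arbitrary codings that both preserve the rank order of the classes.
coding_a = {"infant": 1, "juvenile": 2, "adult": 3}
coding_b = {"infant": 1, "juvenile": 5, "adult": 100}

def summarize(coding):
    """Return the median age class and the mean score under a coding."""
    scores = [coding[obs] for obs in observations]
    decode = {score: name for name, score in coding.items()}
    # median_low always returns a value that occurs in the data,
    # so it can be translated back into an age class.
    median_class = decode[statistics.median_low(scores)]
    return median_class, statistics.mean(scores)

median_a, mean_a = summarize(coding_a)
median_b, mean_b = summarize(coding_b)

print("coding A:", median_a, round(mean_a, 2))
print("coding B:", median_b, round(mean_b, 2))
```

The median class is the same under both codings, whereas the mean shifts with the arbitrary numbers chosen.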
3.3 Interval and ratio scales
E.g. temperature in roost of squirrel, foraging height of squirrel
on trees in metres, weight of squirrel
The distances and differences between any two numbers on the scale
have meaning. The ratio of any two intervals is independent of the
unit used and the zero point, both of which are arbitrary. If a
true zero point exists then it is a ratio scale.
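Temperature illustrates the statement about intervals. In the sketch below (roost temperatures are invented), the ratio of two intervals is the same in Celsius and Fahrenheit, while the ratio of two values is not, because neither scale has a true zero point.

```python
def celsius_to_fahrenheit(celsius):
    """Convert to a scale with a different unit and a different zero point."""
    return celsius * 9 / 5 + 32

# Hypothetical roost temperatures in degrees Celsius (invented values).
roost_c = [10.0, 20.0, 40.0]
roost_f = [celsius_to_fahrenheit(t) for t in roost_c]

# Ratio of two intervals: (40 - 20) compared with (20 - 10).
interval_ratio_c = (roost_c[2] - roost_c[1]) / (roost_c[1] - roost_c[0])
interval_ratio_f = (roost_f[2] - roost_f[1]) / (roost_f[1] - roost_f[0])

# Ratio of two values: "40 degrees is twice 20 degrees" only holds in Celsius.
value_ratio_c = roost_c[2] / roost_c[1]
value_ratio_f = roost_f[2] / roost_f[1]

print("interval ratios:", interval_ratio_c, interval_ratio_f)
print("value ratios:", value_ratio_c, round(value_ratio_f, 3))
```

For a ratio scale such as the weight of the squirrel, the ratio of two values would also be preserved under a change of unit.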
4 Parametric vs. Nonparametric (distribution free) Statistics
Conditions to apply parametric tests:
- normal distributions
- equal variances
- at least interval level of measurement
Advantages and disadvantages of Nonparametric statistics:
Advantages:
- deal with nominal and ordinal levels of measurement
- only alternative when sample sizes are small
- no assumption that the data are normally distributed
Disadvantages:
- slightly less powerful than parametric statistics
- some limitations in multivariate procedures
If interval data and large samples are available, then check for
deviation from normality. If the data deviate, either use a
transformation to normalize the distribution or use a
(conservative) nonparametric test.
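As a rough illustration of checking deviation from normality and applying a transformation, the sketch below computes the skewness of an invented, right-skewed sample before and after a logarithmic transformation. Skewness is only one crude indicator (a symmetric distribution has a skewness near zero), and the logarithm is only one of several possible transformations; both are assumptions of this example rather than a recommended procedure.

```python
import math
import statistics

def skewness(values):
    """Third standardized moment: about 0 for a symmetric distribution."""
    n = len(values)
    centre = statistics.mean(values)
    second_moment = sum((v - centre) ** 2 for v in values) / n
    third_moment = sum((v - centre) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5

# Hypothetical right-skewed measurements (invented, all positive).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

# The logarithm compresses large values more than small ones.
transformed = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(transformed)

print("skewness before:", round(skew_raw, 2))
print("skewness after log transformation:", round(skew_log, 2))
```

If the transformed data still deviate clearly from a normal distribution, the nonparametric test remains the safe choice.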
5 Conclusions from statistical tests: the rejection region and p
p is the probability of obtaining a value equal to or more extreme
than the one observed purely by chance, i.e. if the null hypothesis
were true.
Figure 6. Rejection region in a normal distribution
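The meaning of p can be made concrete with an exact permutation test, a nonparametric procedure chosen here purely for illustration. The sketch below uses invented nut sizes for the two sites and asks: if the site labels were assigned purely by chance, in what proportion of all possible assignments would the difference between the group means be at least as large as the one observed? The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical nut sizes in millimetres (invented values).
garden = [14, 15, 15, 16, 17]
forest = [10, 11, 12, 12, 13]

def mean_difference(group_a, group_b):
    """Exact difference between the two group means."""
    return (Fraction(sum(group_a), len(group_a))
            - Fraction(sum(group_b), len(group_b)))

def permutation_p(sample_a, sample_b):
    """Two-tailed exact p: share of relabellings at least as extreme."""
    pooled = sample_a + sample_b
    observed = abs(mean_difference(sample_a, sample_b))
    extreme = 0
    total = 0
    # Every way of choosing which pooled values are labelled "sample a".
    for chosen in combinations(range(len(pooled)), len(sample_a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled))
                   if i not in chosen_set]
        total += 1
        if abs(mean_difference(group_a, group_b)) >= observed:
            extreme += 1
    return Fraction(extreme, total)

p = permutation_p(garden, forest)
print("p =", p, "which is about", round(float(p), 4))
```

With these values only 2 of the 252 possible relabellings are as extreme as the observed one, so p is about 0.008 and falls inside the conventional rejection region.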
- p<0.05 is a convention, and p remains a statistical probability
- Type I error: the null hypothesis is rejected when it is, in
fact, true (an effect is reported when there isn’t one). The
smaller p is, the less likely this error is.
- Type II error: failure to reject the null hypothesis
when it is, in fact, false (the true effect was not
detected). A large, representative sample makes this error
less likely.
- One-tailed hypothesis if the direction of the effect is predicted
from a sound theoretical argument or from previous
results of similar analyses
- Two-tailed hypothesis when the direction of the effect is not
predicted in advance (the test is more robust)
- if p<0.05 then the result is significant regardless of the sample
size
- make sure that the sample is representative of the population
for which your generalization should hold
- report p along with non-significant results as well
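The difference between one-tailed and two-tailed hypotheses can be illustrated with a test statistic that follows the standard normal distribution. The sketch below is an assumption-laden example: the value z = 1.8 is invented, and `p_values` is a hypothetical helper, not part of any particular test described here.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-tailed, two-tailed) p for a standard normal statistic.

    The one-tailed value assumes the effect was predicted in advance
    to lie in the positive direction.
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    one_tailed = 1 - standard_normal.cdf(z)
    two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))
    return one_tailed, two_tailed

# Hypothetical test statistic (invented for illustration).
one_tailed, two_tailed = p_values(1.8)

print("one-tailed p:", round(one_tailed, 4))
print("two-tailed p:", round(two_tailed, 4))
```

For this value the one-tailed p lies below 0.05 and the two-tailed p above it, which is why the direction of the effect has to be predicted before looking at the data.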
6 Pseudoreplication or the "Pooling fallacy"
For all statistical tests, whether parametric or not, data points
have to be independent.
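A common way to restore independence is to reduce repeated observations to one value per individual before testing. The sketch below shows this under stated assumptions: the foraging heights and squirrel names are invented, and the median per individual is only one possible summary.

```python
from statistics import median

# Hypothetical foraging heights in metres, with several observations
# per squirrel (invented data). Squirrel A was simply watched more often.
observations = {
    "squirrel_A": [2.0, 2.5, 3.0, 2.5, 2.0, 3.5],
    "squirrel_B": [8.0],
    "squirrel_C": [7.5, 9.0],
}

# Pooling treats all nine observations as if they were independent.
pooled = [height for heights in observations.values() for height in heights]

# One value per individual: the sample size is the number of squirrels.
per_individual = [median(heights) for heights in observations.values()]

print("pooled: n =", len(pooled), "median =", median(pooled))
print("per individual: n =", len(per_individual),
      "median =", median(per_individual))
```

Pooling inflates the sample size from three squirrels to nine data points and lets the most frequently observed animal dominate the result.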