
Multivariate Analysis

Objectives:
Introduce the theory of multivariate analysis
Discuss the nature of community composition data

Learning Outcomes:
Understand the principles underlying multivariate analysis
Become familiar with matrix data and their use in PC-ORD

Multivariate Data Analysis: Background


The need for multi-variate analysis arises whenever more
than one characteristic is measured on a number of individuals,
and relationships among the characteristics make it necessary
for them to be studied simultaneously (Krzanowski 1972)
Community data are multi-variate because each sample
unit is characterized by:
the abundance (or presence / absence) of a number of
intercorrelated species
a set of (cross-correlated) environmental factors affecting
species distributions

Multivariate Data Analysis
Why are species data cross-correlated?
If there are no specific niches:
Species are still using a finite
amount of resources

If there are specific niches:
Species are affecting each
other's distributions / abundances

Community Analysis: Introduction

Community ecologists analyze the effects of multiple
environmental factors (variables) on large numbers of
co-occurring species (simultaneously), and deal with
substantial statistical errors (measurement / structural).
Community analysis techniques fall into two groups:
classification and ordination

Community Analysis: Introduction
Classification versus ordination
Classification is the placement of species and / or
sample units into (discrete) groups
Ordination is the arrangement or ordering of species
and / or sample units along gradients

What is an Ecological Community?
Two views have dominated the debate over the nature of
ecological communities since the 1920s:

Clements: discrete unit

Gleason: loose assemblage of species

What is an Ecological Community?
Clements' Perspective:
Discrete entities with recognizable boundaries
The community is fully integrated functionally
Species have coevolved, enhancing their interdependence
Gleason's Perspective:
Community is a chance association of species with
similar adaptations and ecological requirements
No distinct boundaries where communities meet

Ecological Communities: Scale Matters
An ecozone or biogeographic realm is the largest-scale
biogeographic division of the earth's surface, based on
historic and evolutionary distribution patterns of organisms

(Spalding et al. 2007)

Ecological Communities: Scale Matters

Ecozones consist of clusters of adjacent
ecoregions that span several habitat types,
but have a strong biogeographic affinity
(Spalding et al. 2007)

Small areas where physical / biological
properties change

What do Community Data Look Like?
[Figure: number of species plotted against abundance of species. The distribution is log-normal; a second panel shows abundance transformed to a log scale]
Properties of Community Data

Most species infrequent:
The majority of species are present
in a minority of locations,
and contribute little to the
overall abundance
Some species abundant:
Few species dominate;
some are very numerous

Data tend to be sparse:
A large portion of entries are
zeros (species absent)

[Figure: number of species plotted against abundance of species]

Properties of Community Data
The number of important factors is small:
A few factors can explain the majority of the
explainable variation (variance)
Especially if we create synthetic variables:
combinations of multiple collinear variables
There is substantial noise in the data:
Even under ideal circumstances, replicate
samples vary from each other due to stochastic
events, and potentially through observer error
We can use this overlap (or smear) to quantify
the degree of sample similarity along gradients

Properties of Community Data
There is much redundant information:
species often share similar distributions
replicate samples vary from each other
This redundancy is at the core of multi-variate statistics

DATA
Environmental Variables, Species, Sample Units

Patterns
Associations between:
Species * Environmental Variables * Sample Units

Format of Community Data
Multi-variate methods operate on a community data matrix
(or species by sample matrix).
A community data matrix has sample units as rows
and taxa (species) as columns.
PC-ORD terms this a NORMAL matrix:
Rows: Sample Units
Columns: Species

Plant Community Composition

Format of Community Data

Normal Matrix
          Spp1    Spp2    Spp3
Sample1   10      2       1
Sample2   0.50    0.40    0.10
Sample3   1       0       1

Samples refer to the basic unit of observation
The units are quadrats, transects, grid cells

Environmental Matrix
          Char1   Char2   Char3
Sample1   Oahu    Kauai   Oahu
Sample2   0.53    0.47    0.12
Sample3   25      37      56

Format of Community Data
In linear algebra, the transpose of a matrix A is another
matrix A^T (also written A^tr, ^tA, or A') created by:
writing the rows of A as the columns of A^T
writing the columns of A as the rows of A^T
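
As a minimal illustration of the transpose, the sketch below swaps rows and columns of a small samples-by-species matrix. The function name `transpose` and the list-of-lists representation are choices made for this example only; they are not part of PC-ORD.

```python
def transpose(matrix):
    """Return the transpose of a rectangular list-of-lists matrix."""
    # zip(*matrix) pairs up the i-th entries of every row, i.e. the columns.
    return [list(column) for column in zip(*matrix)]


# Rows are sample units, columns are species (a "normal" layout).
normal = [
    [10, 2, 1],
    [0.50, 0.40, 0.10],
    [1, 0, 1],
]

# After transposing, rows are species and columns are sample units.
transposed = transpose(normal)
```

Transposing twice returns the original layout, so no information is lost by switching between the two orientations.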

PC-ORD: Install the Software
Open Help and Read:
Getting Started
Using PC-ORD
Introduction

Using PC-ORD for Multivariate Data Analysis
PC-ORD was written
for community ecologists
and includes procedures
best suited for species data
This class will help you in three areas:
1. to recognize which techniques are appropriate for
specific data types and analysis objectives
2. to become familiar with how to use PC-ORD within
the context of a proper analysis process
3. to understand what PC-ORD does and when you
might need specific features

Using PC-ORD for Multivariate Data Analysis
In this class we will cover the following issues / techniques:
General use of multivariate analysis
using PC-ORD
summarizing data / results
Data analysis flow
data transformations
selecting a distance measure
Ordination techniques in PC-ORD:
PCA, NMS, polar ordination
Grouping techniques in PC-ORD:
MRPP, indicator species analysis, Mantel tests

PC-ORD: Example of Datasets
PC-ORD works with multiple datasets arranged in matrices

[Figure: the MAIN MATRIX shown alongside the Secondary Matrix]

PC-ORD: File Types
Data Matrices: Main / Secondary
Main = Data of Interest; Secondary = Associated Data

PC-ORD: File Types
Data Matrices: Main / Secondary
(*.wk1 spreadsheet files) Created with Excel

Can be exported from Excel as:
worksheet format WK (1-2-3)

Main Matrix Format
There are certain conventions:
List number of sample
units and species
Name of sample units
(list them on column 1)
Describe variable
types and names
(list them on rows)
Add data on columns
(species abundance)

Second Matrix Format
There are certain conventions:
List number of sample
units and variables
Name of sample units
(list them on column 1)
Describe variable
types and names
(list them on rows)
Add data on columns
(environmental data)

Main / Secondary Matrix Format
1. Cells A1 and B1 indicate the number of entities (rows) and
what the entities are (8-character maximum).
2. Cells A2 and B2 indicate the number of attributes (columns)
and what the attributes are (8-character maximum).
3. Row 3 contains single letters which indicate the variable
type for each column. There are three acceptable values:
Q = Quantitative C = Categorical M = Mixed
4. Row 4 contains the name for each column variable. ONLY
THE FIRST 8 characters are used.
5. Rows 5 and below contain a row name in the first cell,
followed by numeric values for each column attribute. Row
names, like the column names, are limited to 8 characters.
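
The five layout rules above can be sketched as a small helper that arranges data into spreadsheet rows. This is only an illustration: the function name `build_matrix_rows`, the default labels "plots" and "species", and the choice to leave the first cell of rows 3 and 4 blank are assumptions, not a documented PC-ORD export routine.

```python
VALID_TYPES = {"Q", "C", "M"}  # Quantitative, Categorical, Mixed


def build_matrix_rows(row_names, col_names, col_types, data,
                      row_label="plots", col_label="species"):
    """Arrange data in the spreadsheet layout described above."""
    if len(col_names) != len(col_types):
        raise ValueError("one type letter is needed per column")
    if any(t not in VALID_TYPES for t in col_types):
        raise ValueError("column types must be Q, C or M")
    if len(row_names) != len(data):
        raise ValueError("one data row is needed per row name")
    rows = [
        [len(row_names), row_label[:8]],          # cells A1 and B1
        [len(col_names), col_label[:8]],          # cells A2 and B2
        [""] + list(col_types),                   # row 3: variable types
        [""] + [name[:8] for name in col_names],  # row 4: column names
    ]
    for name, values in zip(row_names, data):     # rows 5 and below
        rows.append([name[:8]] + list(values))
    return rows


rows = build_matrix_rows(
    ["Sample1", "Sample2", "Sample3"],
    ["Spp1", "Spp2", "Spp3"],
    ["Q", "Q", "Q"],
    [[10, 2, 1], [0.50, 0.40, 0.10], [1, 0, 1]],
)
```

Names longer than 8 characters are truncated rather than rejected, mirroring the note that only the first 8 characters are used.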

Import Options
Several formats allowed from Excel

NOTE: The same formatting rules hold for all import file types

PC-ORD: Other File Types
Graph Row / Graph Column = Ordination scores for rows
and columns (*.gph text files) created by PC-ORD
Result = Procedure output (*.txt text file) created by PC-ORD
Dendrograms / Plots = (*.den) (*.str) created by PC-ORD
Project = After saving a main, second, graph and result file,
they can be grouped together (*.prj file) specific to PC-ORD

Sample Dataset: Bryophyte mats from Oregon
Datasets posted on the course web-page:
aMoss1M = normal matrix of average relative % volume
of epiphytic bryophytes
aMoss2M = environmental matrix with measurements from
the same sample units where bryophytes were sampled
Objectives:
Investigate association of different species in space
Study environmental correlates of species distributions

Start Exploring PC-ORD
Objectives:
Discuss species-environmental associations
Explore the nature of community composition data

Learning Outcomes:
Understand the value of bivariate plots and species / env. plots
Become familiar with graphing options and formatting in PC-ORD

Species on Environmental Gradients: Ideal
Robert H. Whittaker (Ed), Classification of Plant Communities,
1978 (Handbook of Vegetation Science), Kluwer
Academic Publishers
Ideal species distributions across environmental gradients:
Gaussian Response: Smooth normal curves
Characterized by: mean + SD, peak

Linear Response:

Smooth lines
Characterized by range, slope

Gaussian (Normal) Distribution

Figure 5.1. Hypothetical species abundance in response
to an environmental gradient. Lettered curves represent
different species. Figure adapted from Whittaker (1954).

Linear Distribution

Figure 5.2. Hypothetical linear responses of species
abundance to an environmental gradient. Lettered lines
represent different species.

Again: A Matter of Scale
Study 1 (4-6 km depth): Species prefers shallow habitat
Study 2 (0-3 km depth): Species prefers deep habitat
Study 3 (3-4 km depth): Species shows no preference

[Figure: Gaussian and Linear response curves]

Sometimes the answer is scale-dependent

Species on Environmental Gradients: Real
3 important issues to consider:
Zero-truncation Problem
Solid Curves
Complex Curves

Observation: Species are often below their optimal
abundance, given the environmental factor. Why?
Other limiting factors
(e.g., other environmental factors,
other species, life-history, chance)

[Figure: points represent species abundance]

Species on Environmental Gradients: Real
[Figure: observed species responses along gradients. Panel labels: Linear Regression, Fitted Envelope, Abrupt Ranges (Boundaries), Multiple Modes, Linear Responses, Peaks (Optima). Reported r values: 0.30, 0.34, 0.21, 0.44]

Community Analysis: Bivariate Plots
It is more fruitful to explore how pairs of species
abundances are related with bivariate plots.
[Figure: Species B plotted against Species A, marking joint absences (0,0), joint occurrences (lots, lots), perfect correlation and no correlation]

Community Analysis: Bivariate Plots
Bivariate plots from pairs of species responses
to the same environmental gradients
[Figure: positively associated and negatively associated species pairs. Both have joint absence (0, 0) data; joint occurrence versus single occurrence]

Community Analysis: Correlations
Normal Clouds
r should be positive

Dust Bunnies
r should be negative

Community Analysis: Correlations

Consider two species with different habitat responses
(a negative association)

Beware:
As we sample beyond their
habitats, we record more
and more joint absences

Summary
Plots of species responses (abundance) to single
environmental variables are informative:
unimodal / multimodal
linear / normal
peak (optimum)
Bivariate plots (sp1 vs sp2) are more informative, since
they allow us to integrate all the possible responses
of the species to the environmental variables
Typical responses: normal cloud and dust bunnies

PC-ORD: Graphing Tools

Two variables plotted at once
(from either matrix)
Jittering moves points randomly
to allow users to see overlap
Plots pair-wise combinations
(up to ten variables) at once
(from either matrix)

PC-ORD: Scatterplot Figures
Examples of Scatterplots:
Single (2 species)

Multiple (Matrix)

PC-ORD: Dominance Curves
Dominance curves: To study distribution
of abundance among species in sample.
How are individuals distributed across species?
Program creates result.txt file (with all species)
Species - name
RankAbun - ranking abundance (1: most numerous)
Log(SumAbund) - log (base 10) of total sum
Sum - sum of all abundance data for species
RankFreq - ranking frequency (1: most frequent)
Freq - frequency (number of non-zero counts)
Mean - average
S.Dev. - SD
CV% - 100 * (SD / Mean)
V/M - Variance / Mean ratio
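
The columns listed above can be reproduced by hand for a small matrix. The sketch below is an illustration only: the function name `dominance_table` is hypothetical, the sample standard deviation (n - 1 denominator) is assumed because the slides do not say which form PC-ORD uses, and tied ranks simply keep input order.

```python
import math
import statistics


def dominance_table(species_names, matrix):
    """Summarise each species (column) of a samples-by-species matrix."""
    table = []
    for j, name in enumerate(species_names):
        column = [row[j] for row in matrix]
        total = sum(column)
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)  # sample SD (n - 1): an assumption
        table.append({
            "Species": name,
            "Sum": total,
            "Log(SumAbund)": math.log10(total) if total > 0 else None,
            "Freq": sum(1 for value in column if value != 0),
            "Mean": mean,
            "S.Dev.": sd,
            "CV%": 100 * sd / mean if mean != 0 else None,
            "V/M": sd ** 2 / mean if mean != 0 else None,
        })
    # Rank 1 goes to the largest sum / frequency; ties keep input order.
    for key, rank_name in (("Sum", "RankAbun"), ("Freq", "RankFreq")):
        ordered = sorted(table, key=lambda rec: rec[key], reverse=True)
        for rank, rec in enumerate(ordered, start=1):
            rec[rank_name] = rank
    return table


matrix = [
    [10, 2, 1],
    [0.50, 0.40, 0.10],
    [1, 0, 1],
]
table = dominance_table(["Spp1", "Spp2", "Spp3"], matrix)
```

Plotting Log(SumAbund) against RankAbun from such a table gives the dominance curve itself.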

PC-ORD: Distributions
Distributions:

How are your data distributed across
environmental conditions?

Depends on whether a variable is continuous or discrete.
Discrete variables are easily summarized with a frequency
distribution counting the number of occurrences of each
discrete value in the variable.
Continuous variables are summarized using smooth density
distributions - the frequency of observations along a scale.
PC-ORD uses several methods to represent distributions,
along with some classic distributions for comparison
(normal, lognormal, Poisson, binomial, negative binomial).
The program estimates moments from the observed data,
and uses them as parameters in theoretical distributions.
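
The idea of estimating moments and plugging them into a theoretical distribution can be sketched for the normal case. This is a minimal illustration, not PC-ORD's implementation: the function names are hypothetical, and the `steps` argument is only loosely modelled on a number of increments along the axis.

```python
import math
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def reference_curve(values, steps=50):
    """Evaluate a normal curve whose parameters are the sample moments."""
    mean = statistics.fmean(values)   # first moment
    sd = statistics.stdev(values)     # from the second moment (n - 1)
    low, high = min(values), max(values)
    xs = [low + (high - low) * k / (steps - 1) for k in range(steps)]
    return [(x, normal_pdf(x, mean, sd)) for x in xs]


observed = [1, 2, 3, 4, 5]
curve = reference_curve(observed, steps=5)
peak_x = max(curve, key=lambda point: point[1])[0]
```

Overlaying such a curve on a histogram of the observed values shows how far the data depart from the reference distribution.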

PC-ORD: Distributions
[Figure: the observed data plotted with reference data]

The user can change the
number / size of bins
Other curves included to
show reference distributions

PC-ORD: Distributions
The Main Matrix Distributions and Second Matrix
Distributions ask the following:
Variable (Choose one variable from the matrix)
Distribution type (Discrete or Continuous)
Curve steps (number of increments along the axis)
Constant for lognormal distribution (leave default)

Note: Kernels are Smoothing Parameters
(Read Instructions in PC-ORD help)

Your Task
Start Exploring PC-ORD using bryophyte matrices (1, 2)
Load data (import and open)
Format and export figures
Make scatterplots (species pairs & species-env data)
Find a good example of a dust bunny distribution
Find a good example of a normal cloud distribution
Try other plots: scatterplot matrix, dominance, distribution
Create a dominance curve for matrix 1
Pick one species and make a distribution plot

Quantifying Distances
Objectives:
Discuss Similarity and Dissimilarity
Go over the rules for selecting a Distance Measure

Learning Outcomes:
Understand the value of Distance metrics
Become familiar with the approach for calculating Distance metrics

Distance Measures
The first step of most multi-variate analyses involves creating a
matrix of distances or similarities for all pairs of samples
This step is extremely important:
If information is ignored, it will not be included in the results
If noise / outliers are exaggerated, they add distorting influences
There are a myriad of indices and lots of details.
Let's begin with the general principles:

Distance = Difference

Distance Concepts
Resemblance can be quantified in two ways:
dissimilarity or similarity
These two metrics can be translated:
Similarity = 1 - Dissimilarity
Distance metrics can be applied to a variety of data:
Quantitative (discrete / continuous)
Binary (Presence / Absence)
Distances calculated among two types of objects:
either the rows or the columns of the primary data matrix
Sample unit distances -> sample space
Species distances -> species space

Species Space vs Sample Space

Sample Space: Compare species across pairs of samples
Species Space: Compare samples across pairs of species

Species Space vs Sample Space

Species abundance shown as points in sample space
[Figure: species Sp 1 and Sp 2 plotted on axes Sample Unit A and Sample Unit B]

Sample unit composition as points in species space
[Figure: sample units SU A and SU B plotted on axes Species 1 and Species 2]

Types of Distance Metrics
Three types: Metric, Semimetric, Non-Metric
A good metric needs to meet four rules:
Minimum distance value is 0 (e.g., for identical samples)
When two items differ, distance > 0
Distances are symmetrical:
from A to B = from B to A
Triangle Axiom: Whenever we have 3 objects
Any one pair-wise distance
(between any two objects)
CANNOT be larger than the
sum of the other two distances
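
The four rules can be checked numerically on any distance matrix. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, all objects are assumed to be distinct (so every off-diagonal distance should be positive), and the Sorensen distance is written in its equivalent form sum|a - b| / (sum a + sum b).

```python
import math
from itertools import permutations


def check_rules(dist, tol=1e-12):
    """Report which of the four rules a square distance matrix meets."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return {
        "minimum_zero": all(abs(dist[i][i]) <= tol for i in range(n)),
        "positive": all(dist[i][j] > tol for i, j in pairs),
        "symmetric": all(abs(dist[i][j] - dist[j][i]) <= tol
                         for i, j in pairs),
        "triangle": all(dist[i][j] <= dist[i][k] + dist[k][j] + tol
                        for i, j, k in permutations(range(n), 3)),
    }


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sorensen_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def distance_matrix(objects, measure):
    return [[measure(a, b) for b in objects] for a in objects]


# Three hypothetical samples described by two species each.
samples = [(1, 0), (0, 1), (1, 1)]
euclid_rules = check_rules(distance_matrix(samples, euclidean))
sorensen_rules = check_rules(distance_matrix(samples, sorensen_distance))
```

For these three samples the Sorensen distance between the first two (1.0) exceeds the sum of their distances to the third (1/3 + 1/3), which is the kind of triangle violation that makes an index a semimetric.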

Types of Distance Metrics
Rule                      Metrics   Semimetrics
Minimum Distance = 0      yes       yes
Distance > 0              yes       yes
Symmetrical Distances     yes       yes
Triangle Axiom            yes       no

Both metrics and semimetrics are used in ecology
Watch out: if an index violates any other rule, it is a Nonmetric
Do NOT use Nonmetrics

Examples: Distance of species

Because each object is represented by 4 variables,
we say that these objects have 4 dimensions.
The coordinate of Apple is (1,1,1,1).
The coordinate of Banana is (0,1,0,0).
Jaccard's coefficient between Apple and Banana is 1/4
Shared: 0 + 1 + 0 + 0
Total: 4 variables (union of banana and apple)
Jaccard's distance between Apple and Banana is 3/4

Examples: Distance of samples
We have 2 sets:
A = {2,3,4,6,7}
B = {1,4,5,7,8}

Union between the sets is: {1,2,3,4,5,6,7,8}

Intersection between sets is: {4,7}

Jaccard's coefficient can be computed as the number of
elements in the intersection set divided by the number of
elements in the union set: 2 / 8 = 0.25
Jaccard's distance: 1 - Jaccard's coefficient = 0.75
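
Both worked examples above (the Apple / Banana vectors and the two sets) can be reproduced in a few lines. The function names below are hypothetical and chosen only for this sketch.

```python
def jaccard_sets(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard is undefined for two empty sets")
    return len(a & b) / len(union)


def jaccard_binary(x, y):
    """Jaccard coefficient for two presence / absence (0 or 1) vectors."""
    m11 = sum(1 for p, q in zip(x, y) if p and q)          # joint presence
    m10 = sum(1 for p, q in zip(x, y) if p and not q)      # only in x
    m01 = sum(1 for p, q in zip(x, y) if q and not p)      # only in y
    return m11 / (m11 + m10 + m01)  # joint absences are ignored


set_coefficient = jaccard_sets({2, 3, 4, 6, 7}, {1, 4, 5, 7, 8})
set_distance = 1 - set_coefficient

fruit_coefficient = jaccard_binary((1, 1, 1, 1), (0, 1, 0, 0))
fruit_distance = 1 - fruit_coefficient
```

Note that joint absences never enter the calculation, which is one reason the index suits sparse community data.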

Selecting a Distance Metric
Two things to consider:
Input: Acceptable domain of input data
Presence / Absence vs Count Data
Positive / Negative values
Output: Range of output distances
Within bounded range
Meet triangle axiom

Selecting a Metric: Proportional Overlap

Jaccard Index: Binary Data Example
Two species (A, B). For each sample, values are 0 or 1.
Total number of each combination, as follows:
M11 = number of attributes where A = B = 1. (Joint Presence)
M01 = number of attributes where A = 0, B = 1. (B present)
M10 = number of attributes where A = 1, B = 0. (A present)
M00 = number of attributes where A = B = 0. (Joint Absence)
Each sample falls into exactly one of the four combinations:
M11 + M01 + M10 + M00 = n (sample size)
Jaccard similarity coefficient = M11 / (M01 + M10 + M11)
Jaccard distance = (M01 + M10) / (M01 + M10 + M11)

Selecting a Metric: Proportional Overlap
The Jaccard index measures similarity
between samples, and is defined as the size
of the intersection divided by the size of the
union of the sample sets (Jaccard 1901)
The Jaccard distance, which measures
dissimilarity between samples, is obtained
by dividing the difference of the sizes of the
union and the intersection of two sets by the
size of the union

Jaccard Distance is complementary to the Jaccard coefficient
and is obtained by subtracting the Jaccard index from 1

Selecting a Metric: Proportional Overlap

JS = w / (A + B - w)

JD = 1 - JS = (A + B - 2w) / (A + B - w)
Properties:
Proportion of combined abundance not shared
Works with binary data or quantitative data (counts)
Output is metric (does meet triangle axiom)

Selecting a Metric: Proportional Overlap
Sorensen index (Sorensen 1948), also known as Sorensen's
similarity coefficient (QS), is defined as:
QS = 2C / (A + B)
Where:
A and B are the number of species in sample A and B,
respectively, and C is the number of shared species.

Sorensen Distance is complementary to the QS metric, and is
obtained by subtracting the QS index from 1

Selecting a Metric: Proportional Overlap

SS (Bray-Curtis) = 2w / (A + B)

SD = 1 - SS = (A + B - 2w) / (A + B)
Properties:
Proportion of shared abundance
(divided by total abundance)
Works with binary data or quantitative data (counts)
Output is semimetric (does not meet triangle axiom)
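
For quantitative data, the Jaccard and Sorensen (Bray-Curtis) distances can be sketched as below. The slides do not define the symbols, so this example assumes the usual reading: w is the shared abundance (the sum over species of the smaller of the two abundances) and A and B are the total abundances of the two samples. The function names are hypothetical.

```python
def shared_and_totals(a, b):
    """Return (w, A, B) under the assumed definitions of the symbols."""
    w = sum(min(x, y) for x, y in zip(a, b))  # shared abundance
    return w, sum(a), sum(b)


def jaccard_distance(a, b):
    w, total_a, total_b = shared_and_totals(a, b)
    return 1 - w / (total_a + total_b - w)


def sorensen_distance(a, b):
    w, total_a, total_b = shared_and_totals(a, b)
    return 1 - 2 * w / (total_a + total_b)


# Two sample units described by three species abundances.
sample_1 = [10, 2, 1]
sample_3 = [1, 0, 1]

jd = jaccard_distance(sample_1, sample_3)   # w = 2, A = 13, B = 2
sd = sorensen_distance(sample_1, sample_3)
```

Both functions divide by the combined abundance, so two completely empty samples would need separate handling.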

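Selecting a Metric: Continuous Distance
Euclidean Distance:

ED(i,h) = sqrt( sum over j = 1..p of (A(i,j) - A(h,j))^2 )

Where A(i,j) is the abundance of species j in sample unit i,
i and h are the two sample units being compared,
and p is the number of dimensions (axes used)

Widely used for both species abundances
and environmental conditions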
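
A minimal sketch of the Euclidean distance between two sample units follows. The function name is hypothetical; each list holds the abundances of the same p species in one sample unit.

```python
import math


def euclidean_distance(sample_i, sample_h):
    """Straight-line distance between two sample units in species space."""
    if len(sample_i) != len(sample_h):
        raise ValueError("both samples need the same number of dimensions")
    return math.sqrt(sum((a_ij - a_hj) ** 2
                         for a_ij, a_hj in zip(sample_i, sample_h)))


# Classic 3-4-5 check in two dimensions.
d_simple = euclidean_distance([0, 0], [3, 4])

# Two sample units described by three species abundances.
d_samples = euclidean_distance([10, 2, 1], [1, 0, 1])
```

Because differences are squared, the single large difference in the first species dominates `d_samples`, which illustrates why the measure is sensitive to outliers.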

Summary: Distance Metrics
Many distance indices are available. How to select?
Consider 4 rules: metric / semimetric
Look for proportionality (scaled from 0 to 1)
Think about what makes intuitive sense to you
Check what indices are compatible with given tests
The choice of a distance metric is based on empirical
evidence (e.g., methodological studies, previous literature)
Recommendations (according to PC-ORD):
Sorensen index shown to be effective for assessing
species and sample similarity (community data)
Euclidean distance well suited for environmental data

Summary: Recommendations
Sorensen:
Quantifies proportion of shared abundance among species
Works well for community data (empirically)
Relative Sorensen:
Includes general relativization (by totals)

Summary: Recommendations
Euclidean:
Sensitive to outliers
Bad with community data (lots of 0s)
Relativized Euclidean:
Euclidean after scaling abundances to %s
Focuses on relative abundance among species
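
The effect of relativizing before measuring distance can be sketched as follows. This follows the description on the slide (abundances scaled to shares of the sample total, here proportions rather than percentages); the exact relativization PC-ORD applies may differ, and the function names are hypothetical.

```python
import math


def relativize_by_total(sample):
    """Scale abundances so that they sum to 1 within the sample."""
    total = sum(sample)
    if total == 0:
        raise ValueError("cannot relativize an empty sample")
    return [value / total for value in sample]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relativized_euclidean(a, b):
    return euclidean(relativize_by_total(a), relativize_by_total(b))


# Same relative composition, very different total abundance.
rich = [10, 10]
poor = [1, 1]

raw = euclidean(rich, poor)
rel = relativized_euclidean(rich, poor)
```

The raw distance is large because the totals differ, while the relativized distance is zero because both samples split their abundance evenly between the two species.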

Summary: Distance Metrics
Additional considerations (dataset specific):
Are the data very noisy?
Is there a lot of variation in the data?
Are there many 0s in the data?
Are the environmental responses not normal?

PC-ORD: Advisor Tools

The Advisor menu provides two tools to help you decide what
transformations or analyses might be appropriate for your data
Use Show Current Profile to generate summary statistics
describing some of the important properties of your data
The Wizard is a decision tree to help you decide what data
adjustments or analyses to use

PC-ORD: Profile Warnings
Notes are written at the end of the profile, if certain conditions
are encountered. These are listed below.
1. If fields are filled with asterisks, then they could not be
calculated -- incompatible with the data.
2. Negative numbers in main matrix are incompatible with CV
3. Negative numbers in second matrix are incompatible with CV
4. Warning: one or more CV (coefficient of variation) could not
be calculated; replaced with missing value indicator: 99999.99
5. Negative numbers are present in main matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.
6. Negative numbers are present in 2nd matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.

Ordination Methods: PCA
Objectives:
Discuss PCA within context of Ordination Methods
Go over the output of Ordination Methods

Learning Outcomes:
Become familiar with the approach for doing a PCA

Ordination
Arranging items (samples / species) along one or more axes
Graphical summarization of complex relationships
Extracting one or more dominant patterns (% variance)
Synthesis (reduction) of large datasets into fewer variables
These variables are then related to environmental variables

Components are independent from each other

Ordination Diagrams
Typically, a 2-dimensional plot of samples / species in terms
of synthetic axes (combinations of variables)
Ideally, the distance between points in ordination space is
proportional to the underlying distance measures
NOT LIKE A REGRESSION
(Axes uncorrelated, by definition)

Plot samples / species
If possible, use points for samples, overlays for species
Also can code samples by habitat types (using a key
environmental variable)

[Figure: ordination plots on Axis 1 and Axis 2 showing stands (STAND1 to STAN19) and sample units (SU1 to SU19), with samples coded by Topo Class (draw, flat, slope, ridge) and an overlay for Festuca idahoensis]

Ordination Results
How many axes? How many discrete signals are in the dataset?
Different rules for selecting number of axes
Significance assigned to axes and contributing variables
Yet, most studies select 2 or 3 axes
Interpretation of results: overlays, correlations with axes

Ordination Results
How many axes? Number of discrete signals in dataset
Coefficient of determination: % variance represented
Rules of thumb: relative (% variance vs NMDS axes)
vs absolute (PCA eigenvalues)
Yet, most studies select 2 (or 3) axes: Intuitive explanation
Strength assigned to axes and contributing variables (r, r^2)

Ordination Results
Interpretation of results: overlays, correlations with axes
Conclusions:
Species ALSA negatively correlated with Axis 1.
Variance explained = 25%

r = + 0.031
tau = + 0.045

r = - 0.534
tau = - 0.327

Interpretation of time change: successional vectors

Ordination Results
Beware when interpreting correlation coefficients:
outliers can have strong influence
coefficients are meaningless if relationships are not linear
correlation coefficients are invalid with binary data

Graphical representation of
environmental correlates using
Joint Plots: the angles / lengths
indicate strength and direction
of environmental variable
association with ordination axes

[Figure: joint plot on Axis 1 and Axis 2 (both from -1.5 to 2.0) with vectors for Bryo%, Age, Last burn, Canopy, Radiation, Topo position and UnderRad]

Ordination Results
External Evaluation: Correlations with second matrix
(e.g., how are environmental variables related to the axes?)
Beware of biases: proving the expected results
Comparisons with null model: Comparing results of
real data with those from randomized data is promising, but
can yield no clear results if strong outliers cause spurious
significant patterns with the random dataset. Beware

Principal Components (PCA)
Describing a system of points in multiple dimensions
using best-fit straight lines (Pearson 1901)
Y = a0 + a1 X1 + a2 X2 + ...
Start with cloud of n points
in p-dimensional space
Center the axes in the
point cloud (centroid)
Rotate axes to maximize
the variance along axes
As rotation angle changes,
the variance changes
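
The center-and-rotate steps above can be sketched for the simplest case of two variables, where the rotation angle that maximizes variance along the first axis has a closed form. This is an illustration only, not PC-ORD's algorithm (which handles p dimensions); the function name `pca_2d` and the n - 1 variance denominator are assumptions.

```python
import math


def pca_2d(points):
    """Center a 2-D point cloud and rotate it onto its principal axes."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Step 1: center the axes on the centroid of the cloud.
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Step 2: the angle that maximizes the variance along axis 1.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Step 3: scores are the coordinates on the rotated axes.
    scores = [(x * c + y * s, -x * s + y * c) for x, y in centered]
    var_1 = sum(u * u for u, _ in scores) / (n - 1)
    var_2 = sum(v * v for _, v in scores) / (n - 1)
    return theta, scores, (var_1, var_2)


# Points lying exactly on the line y = x.
theta, scores, variances = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For points on a straight line the first axis captures all of the variance and the second axis none; real data spread the variance over both axes, but the total is unchanged by the rotation.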

Principal Components (PCA): When to Use
Normality: Ideal for normal data with approximately linear
relationships amongst variables. Rarely the case for community data
Beware of heterogeneous community data
Critical to justify the use of this linear approach
Sample size: Need a good estimate of correlation structure
Stronger patterns require smaller sample sizes
Rule of thumb: 5 sample units per variable
(Tabachnick and Fidell 1989)
Increasing the number of variables strengthens results
(Pillar 1999)

Principal Components (PCA): Normality
Assessed with skewness (asymmetry) / kurtosis (peakiness)
skew = 0 (normal data)
skew > 0 (right tail too long)
skew < 0 (left tail too long)
kurtosis = 0 (normal data)
kurtosis > 0 (more peaky)
kurtosis < 0 (less peaky)

Rule of thumb (McCune & Grace 2002):
-1 < Skew < 1
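
Skewness and kurtosis can be computed directly from central moments. The sketch below uses the simple moment-based estimators (with kurtosis reported as excess kurtosis, so that normal data give 0); PC-ORD may apply bias corrections, so small differences from its output are possible. All function names are hypothetical.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def within_rule_of_thumb(values):
    """Apply the -1 < skew < 1 rule of thumb."""
    return -1 < skewness(values) < 1


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]   # one large value stretches the right tail
```

A variable that fails the rule of thumb, like `right_skewed` here, is a candidate for transformation before running a PCA.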

Principal Components (PCA): Linearity
Use bivariate scatterplots to assess linear relationships
Solutions: Data transformations

Beware of outliers: they can change cross-correlations
[Figure: scatterplots with r = 0.92 and r = -0.96]
Solutions: Transform the data, or remove those data

Principal Components (PCA): Reporting
What type of cross-correlation matrix did you use?
Correlation or Covariance - Use Euclidean distance

If used with community data, justify using this linear
model for species data.
Were assumptions of linearity / normality met?

How many axes were interpreted, and what proportion
of variance was explained by these axes?
Describe the axes and the individual / cumulative variance

Principal eigenvectors - Test of significance?
Not necessary, but an option using randomization tests

Rotation of the solution? Use of interpretation aids?
Explain overlays and correlations of variables with axes

Principal Components (PCA): Example
Setup:
PCA uses only Euclidean
Distances (real metric)
Matrix can be calculated
in three ways:
Correlation: very susceptible
to outliers (DO NOT USE)
Variance / Covariance: Less
sensitive to outliers (USE)
Non-centered: Experimental
(DO NOT USE)

Principal Components (PCA): Example
Setup II:
Scores for species can be
calculated in two ways:
Distance-based: Relates
species and samples to each
axis; represents species as
vectors from the centroid
(Standard) USE
Weighted Average: Species
represented as points; sensitive to outliers
(DO NOT USE)

Principal Components (PCA): Example
Setup III:
Output Options:
Cross-Product Matrix:
Shows pair-wise distances
(USE)
Randomization tests: Use
bootstrap to assess
significance of the results
(USE)

Principal Components (PCA): Example
Selected options: Variance, Distance-based, Cross-product Matrix, Randomization

Principal Components (PCA): Example
Setting up the Randomization Test:
Seed: make multiple tests comparable by using the same
sequence of random numbers (supply seed)
Runs: number of permutations used in the test
(determines p value of statistic)
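
The roles of the seed and the number of runs can be sketched with a toy randomization test on two variables. This is not PC-ORD's procedure: it assumes a simple scheme in which one variable is shuffled to break the association, the first eigenvalue of the 2 x 2 covariance matrix is recomputed each run, and the p value is (1 + number of runs at least as large as observed) / (1 + runs). All names are hypothetical.

```python
import math
import random


def first_eigenvalue(xs, ys):
    """Largest eigenvalue of the 2 x 2 covariance matrix of xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)


def randomization_test(xs, ys, runs=199, seed=42):
    """Return the observed first eigenvalue and its randomization p value."""
    rng = random.Random(seed)      # same seed -> same shuffles every time
    observed = first_eigenvalue(xs, ys)
    shuffled = list(ys)
    at_least_as_large = 0
    for _ in range(runs):
        rng.shuffle(shuffled)      # break the association between xs and ys
        if first_eigenvalue(xs, shuffled) >= observed:
            at_least_as_large += 1
    p_value = (1 + at_least_as_large) / (1 + runs)
    return observed, p_value


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]           # perfectly correlated with xs
observed, p_value = randomization_test(xs, ys, runs=199, seed=42)
```

With this scheme the smallest attainable p value is 1 / (1 + runs), which is how the number of runs limits the resolution of the test.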

Principal Components (PCA): Example

Enter a descriptive explanation to document the
analysis: this label will be added to results

Principal Components (PCA): Example
Results: Covariance matrix species distances

Principal Components (PCA): Example
Results: Eigenvalues - Variance explained (up to 10 axes)

Eigenvalues are proportional to variance explained

Broken-stick eigenvalues are those expected by chance
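
Broken-stick eigenvalues can be computed from the number of variables alone. The sketch below uses the standard broken-stick model, in which the expected share of axis k out of p axes is (1/p) times the sum of 1/i for i from k to p; scaling by the total variance is an assumption about how the values are reported, and the function name is hypothetical.

```python
def broken_stick(p, total_variance=1.0):
    """Expected eigenvalues for p axes under the broken-stick model."""
    return [total_variance / p * sum(1.0 / i for i in range(k, p + 1))
            for k in range(1, p + 1)]


# Expected proportions of variance for three variables.
expected = broken_stick(3)
```

An observed eigenvalue that exceeds its broken-stick counterpart suggests that the axis carries more variance than a random partition would give it.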

Principal Components (PCA): Example
Results: Never explain 100% of variance (axes = variables)
[Figure: observed versus expected variance explained]

Principal Components (PCA): Example
Results: Randomization tests

Principal Components (PCA): Final Result
Correlation with Axes

Principal Components (PCA): Example
Results: Graphs (for Eigenvalues > 100)

Principal Components (PCA): Example
Results: Graphs
Samples: points
Species: vectors

Principal Components (PCA): Example
Results: Species Response Graphs
Loadings NEDO - Axis 1: +0.51, Axis 2: -0.74
Loadings STLI - Axis 1: 0.00, Axis 2: 0.00

Principal Components (PCA): Example
Selected options: Variance, Weighted Average, Cross-product Matrix, Randomization

Principal Components (PCA): Example
Results: Covariance matrix species distances
Same Result - GOOD

Principal Components (PCA): Example
Results: Eigenvalues
Same Result - GOOD

Principal Components (PCA): Example
Results: Graphs
Samples Labeled
Species Labeled

Principal Components (PCA): Example
Display Recommendation:
PC-ORD recommends displaying
species as vectors / samples as points
Rotation:
Useful to show patterns you are interested in.
Need to keep track and report in results

Principal Components (PCA): Example
Rotation by NEDO
We stretch the graph
along direction of most
variation of the species

Loadings of NEDO
Axis 1: +0.51
Axis 2: -0.74

PCA: Application & Examples
Objectives:
Showcase PCA analysis in PC-ORD and the literature

Learning Outcomes:
Become familiar with the output / results of PCA

Principal Components (PCA): Example
Results: Species Loadings onto the PC Axes

Use the scaled eigenvectors

Principal Components (PCA): Example
Rotation: Highlights certain patterns. Report in results

NEDO Axes Correlations
Axis 1: +0.51
Axis 2: -0.74
Rotation by NEDO
Stretch plot along direction of most
variation for species

Principal Components (PCA): Paper I
Published Example: Ainley, D.G. et al. (2005).
Objective: Relate densities of the 12 most abundant
species of seabirds to 12 habitat variables:
5 biological, 4 oceanographic, 3 geographic (spatial)

Principal Components (PCA): Paper I
Oceanographic variables examined:
sea-surface temperature / salinity, thermocline depth / strength

[Figure panels: Date, Distance to Fronts, Chl Max, Acoustic Biomass]

PrincipalComponents(PCA) PaperI
Data Manipulations To Avoid Biases:
Densities log-transformed to meet normality assumptions
Nevertheless, residuals generated in the regressions for
some species did not meet those assumptions (Skewness /
Kurtosis Test for Normality of Residuals, P < 0.05)
Least-squares regression analysis (ANOVA), however,
is a very robust procedure with respect to non-normality
(Seber, 1977, Kleinbaum et al., 1988)
Yet, while these analyses yield the best linear unbiased
estimator in the absence of normally distributed residuals,
P-values near 0.05 must be viewed with caution (Seber, 1977)

PrincipalComponents(PCA) PaperI
To avoid double-absences:
Only 15-min transects in which any given species was
recorded were analyzed
The total sample size for the 12 species was 1209
Is this an adequate sample size ?
Rule of thumb:
5 samples per variable (Tabachnick and Fidell 1989)
1209 / 12 ~ 100 samples per variable

PrincipalComponents(PCA) PaperI
Analysis Methods:
Principal components analysis (PCA), in combination
with Sidak multiple comparison tests, used to assess
differences in habitat selection among 12 seabird species
To test for significant differences in habitat affinities
among seabird species, we used two one-way ANOVAs:
In the first, we tested for differences among PC1 scores of
each species; in the second, we compared the PC2 scores
Considered differences between two species to be
significant if either one or both of the PC1 or PC2 scores
differed significantly

PrincipalComponents(PCA) PaperI
Community-Wide Result: The first and second PC axes
explain 60% of variance in habitat use by 12 seabird species

PrincipalComponents(PCA) PaperI
Species-specific Results:
Species mapped onto two (independent) dimensions
Pair-wise associations denoted by circles
Figure axis labels: Salty, Green; Near Fronts; Zoop Prey; Fish Prey

PrincipalComponents(PCA) PaperII
Published Example: Weichler et al. (2004).
Objective: Relate seabird densities to seven
environmental parameters:
(1) water depth, (2) distance to nearest land, (3) number
of trawlers within a radius of 5 km, (4) sea surface
temperature, (5) water temperature difference (0-10 m),
(6) water temperature difference (0-30 m), and (7) water
temperature difference (10-50 m)
Did Not Report Cross-correlations of Habitat Variables

PrincipalComponents(PCA) PaperII
Data Manipulations To Avoid Biases:
Species densities were selected as variables and 10 min
intervals (samples), were selected as cases
Only species seen in at least five counting intervals were
included, an arbitrary choice that allowed covering a wide
spectrum of species while ignoring those with few occurrences
Only commoner species with numbers exceeding 1% of all
individuals counted were included in the analysis
Dataset of 46 sections of the cruise tracks. Each section
comprised a hydrographic station approximately midway and
10 min intervals in two opposite directions (4-8 km away)
Sample Size: 46 samples / 7 variables: Ratio of ~6.6

PrincipalComponents(PCA) PaperII
Community-Wide Result: Six principal eigenvalues (> 1),
showing % of variation explained and ecological interpretation

PrincipalComponents(PCA) PaperII
Community-Wide Result:
Loadings for the 11 seabird
species and 7 variables on
the six principal eigenvalues
3 principal components:
50 % of variance

6 principal components:
78 % of variance

PrincipalComponents(PCA) Comparisons
Number of Axes:
- Selected 2 easy to interpret (Ainley et al. 2005)
- Selected 6 based on eigenvalues > 1 (Weichler et al. 2004)
Display of Results:
- Plot and table of eigenvalues (Ainley et al. 2005)
- Eigenvalues and interpretation (loadings) (Weichler et al. 2004)
Significance Tests:
- Pairwise species comparisons (ANOVA) (Ainley et al. 2005)
- Correlations with selected variables (Weichler et al. 2004)

PrincipalComponents(PCA) Tools

Percent of pattern
explained in original
distance matrix

Orthogonality of PCA axes

PrincipalComponents(PCA) Tools

Ranking of species scores according to Axis 1


Showing Presence / Absence of species on samples
Categorizing samples by a categorical variable

PrincipalComponents(PCA) Examples
Ainley DG, Spear LB, Tynan CT, Barth JA, Pierce SD, Ford RG, Cowles
TJ (2005). Physical and biological variables affecting seabird distributions
during the upwelling season of the northern California Current. Deep-Sea
Research II 52: 123-143
Weichler T, Garthe S, Luna-Jorquera G, Moraga J (2004). Seabird
distribution on the Humboldt Current in northern Chile in relation to
hydrography, productivity, and fisheries. ICES J. Marine Science
61 (1):148-154

DisclaimerReferences
Seber, G.A.F. (Ed.), 1977, Linear Regression Analysis. Wiley, New York.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., 1988. Applied Regression
Analysis and other Multivariable Methods. PWS-KENT Publishing Company,
Boston.
Tabachnick, B.G. and L.S. Fidell. 1989. Using Multivariate Statistics. 2nd ed.
New York: Harper and Row.

DataScreeningandTransformations
Objectives:
Discuss Steps for Analysis: Data Screening, Data Manipulation
Go over the principles of data exploration

LearningOutcomes:
Be ready to plan your analysis: Develop Metadata and Analysis Log
Be able to screen and manipulate your data with PC-ORD

DataExploration DocumentingFlow
Flow diagram: sequence of changes / analysis
Analysis log: input, output, results
Save all input and output files and data edits

Metadata:
data
about
data

List of
errors

Clean
Data

Clean
Data

List of
errors

DataExploration DocumentingFlow
File Names
PC-ORD

File Contents
Connections
Links to other
software
Products:
- figures
- tables
- results
GIS - Stats

DataExploration DocumentingFlow
Use clear,
descriptive
titles (dated)
Save all
output files
Keep a
flowchart or
dated record
Record WHY
you did it
you will

forget!

DataExploration DocumentingFlow
Screening:

Are column / rows means and


ranges reasonable?
Are the sample sizes correct?
Are there missing data / outliers?

Cleaning:

Fix typos
Erase / Correct incomplete data
Check effects of corrections

Transformations:
Look up assumptions of test
Check data distributions
Make transformations (re-check)

DataExploration DataScreening
Metadata
96 samples
5 variables
Data type?
Explanation

Show Current Profile


% zeros, data ranges, skewness

DataExploration CurrentProfile
% zeros:
species data
Lowest / highest value:
typos (errors)
Skewness:
non-normality
-1 < SK < 1
Outliers:
(in SD units)
2 SD -> ~95%

DataExploration SummaryI
Data Summary:

Mean

SD

Range

Diversity

S = Richness = number of non-zero elements in row
E = Evenness = H / ln (Richness)
H = Diversity = - sum (Pi*ln(Pi)) = Shannon's diversity index
D = Simpson's diversity index for infinite population = 1 - sum (Pi*Pi)
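
The four row summaries above can be computed directly from one sample unit's abundances. The sketch below is a minimal illustration in plain Python; the function name and the example row are hypothetical, and Pi is taken to be each species' share of the row total.

```python
import math


def diversity_summary(row):
    """Return richness S, evenness E, Shannon H and Simpson D for one row of abundances."""
    present = [x for x in row if x > 0]
    s = len(present)
    if s == 0:
        return {"S": 0, "E": 0.0, "H": 0.0, "D": 0.0}
    total = sum(present)
    p = [x / total for x in present]           # Pi = proportion of the row total
    h = -sum(pi * math.log(pi) for pi in p)    # H = -sum(Pi * ln(Pi))
    e = h / math.log(s) if s > 1 else 0.0      # E = H / ln(S); undefined for S = 1, set to 0 here
    d = 1.0 - sum(pi * pi for pi in p)         # D = 1 - sum(Pi * Pi)
    return {"S": s, "E": e, "H": h, "D": d}


# Hypothetical sample unit: four equally abundant species and one absent species
print(diversity_summary([10, 10, 10, 10, 0]))
```

With four equally abundant species, H equals ln(4) and evenness equals 1, which is a quick sanity check on the formulas.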

DataExploration SummaryII
Skewness:

Steps to Fix
Skewness:
Taking the log
or square root
works for data
with moderate
skewness
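
A small sketch of how a square root or log transformation reduces skewness. It assumes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which may differ slightly from the estimator PC-ORD reports; the counts are hypothetical.

```python
import math


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments; fails on constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


counts = [1, 1, 2, 2, 3, 4, 50]            # hypothetical right-skewed counts
rooted = [math.sqrt(x) for x in counts]    # square root transformation
logged = [math.log10(x) for x in counts]   # log transformation (no zeros in this example)

for name, data in (("raw", counts), ("sqrt", rooted), ("log10", logged)):
    print(f"{name:>5}: skewness = {skewness(data):.2f}")
```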

DataExploration 1DOutliers

Frequency
distribution of
a univariate
outlier falling
5.5 standard
deviations
above mean

DataExploration 1DOutliers
Describe the distribution:
In graph and tabular form

DataExploration 1DOutliers

Discrete Distribution

Continuous Distribution
Test for significance: off PC-ORD

DataExploration 2DOutliers
Figure: scatterplot of Sp2 (vertical axis, 0-25) against Sp1 (horizontal axis, 0-25)
A bivariate outlier that is not a univariate outlier
for either of the two variables Sp1 and Sp2

DataExploration Outliers

Figure: histogram of Average Distance (horizontal axis, 0.90-1.20) against Frequency (vertical axis, 0-6)
Frequency distribution of average relative Euclidean
distances to a sample unit, given a sample size of
25. The sample marked with the red circle is 3.2 SD
units above the mean of the average distances

DataManipulations
You can manipulate data directly in PC-ORD
Modify / Append Data
Delete Columns / Rows
Multiply / Add Constant
Randomly Sample
Shuffle Data
Note: Beals smoothing is experimental. DO NOT USE

DataTransformations
What are the two reasons for data transformations?
Statistical:
Meet assumptions (normality, linearity, variances, ...)
Express variables in the same units (km, km/hr)
Ecological:
Make distance measures work better
Reduce influence of total quantity (sample totals)
Deal with importance of rare / common species
Identify informative species

DataTransformations Nomenclature
Monotonic: Element values are changed, but
ranks stay the same (e.g., change unit from km to m)
Relativization: Adjusts matrix elements by one
column / row standard (e.g., total, maximum)

Note: Not all transformations are reasonable or


feasible with all types of data (e.g., negative, P/A)

DataTransformation

DataTransformation
Monotonic transformations retain ranks, but change values

Power transformation: f(x) = x^p
Power exponents (p): 1/2 (square root), 2 (squared), 3 (cubed)
Note: p = 0 is used to recode data as Presence / Absence (0 / 1)

DataTransformations ExampleI
Logarithmic transformation: f(x) = ln(x) OR log(x)

This transformation is useful when:


high degree of variation within attributes (e.g., Chl Conc.)
high degree of variation among attributes within a sample
helps if there are large outliers and lots of zeros
Note: to log-transform data containing zeros, a
small number should be added to all data points.
With count data, add one, so that: f(x) = log(0+1) = 0
With density data, add a constant smaller than the smallest
possible sample value, so that (e.g.): f(x) = log10(0+0.001) = -3
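
The "add a constant, then take the log" rule can be written as a one-line helper. This is a minimal sketch using base-10 logs; the function name and the data are hypothetical, and the constant is chosen by the analyst as described in the note above.

```python
import math


def log_transform(values, constant):
    """Return log10(x + constant) for each value; x + constant must be positive."""
    return [math.log10(x + constant) for x in values]


counts = [0, 9, 99]                # hypothetical count data: add 1
densities = [0.0, 0.009, 0.999]    # hypothetical density data: add 0.001

print(log_transform(counts, 1))
print(log_transform(densities, 0.001))
```

Zeros in the count data map to 0, while zeros in the density data map to the log of the added constant (here about -3), so the choice of constant sets where absences sit relative to the smallest non-zero values.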

DataTransformations ExampleII
Arcsine / Arcsine-squareroot transformation

This transformation is useful when:


normalizing proportion data (e.g., Percent Cover)
Note: data must range between zero and one, inclusive.
If they do not, you should relativize (general relativization
or relativization by maximum) before selecting this option.
The constant 2 / pi scales the result of arcsin(x) [in
radians] to range from 0 to 1, assuming that 0 <= x <= 1.
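
A minimal sketch of the arcsine-squareroot version of this transformation, f(x) = (2 / pi) * arcsin(sqrt(x)), in plain Python. The function name and the example proportions are hypothetical; percent cover would first be divided by 100.

```python
import math


def arcsine_sqrt(p):
    """Arcsine-squareroot transformation, scaled by 2/pi so the output runs from 0 to 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1, inclusive")
    return (2.0 / math.pi) * math.asin(math.sqrt(p))


# Hypothetical proportion data (e.g., percent cover / 100)
for p in (0.0, 0.05, 0.5, 0.95, 1.0):
    print(f"{p:.2f} -> {arcsine_sqrt(p):.3f}")
```

The transformation stretches values near 0 and 1 and leaves 0.5 unchanged, which is why it is suggested for proportion data.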

DataTransformation Howto
Note:
Need to
accept
TEMP file

DataRelativization
Relativization re-scales data using some criterion / standard.
When it is done by columns (e.g., species), variation across
plots is retained, but variation across species is standardized.
Two approaches:
General Relativization: (by totals or sums) makes area
under each species distribution response curve = 1.
(input: x > 0; output: from 0 to 1)
Relativization by Maximum: (by max for column or row)
equalizes the heights of the peaks along the gradient
(input: x > 0; output: from 0 to 1)
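
Both column relativizations can be sketched with a single helper. This is an illustration only, assuming a samples-by-species matrix stored as a list of rows with non-negative values; the function name and the data are hypothetical.

```python
def relativize_columns(matrix, by="total"):
    """Divide each column by its total (general relativization) or by its maximum."""
    if by not in ("total", "max"):
        raise ValueError("by must be 'total' or 'max'")
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        standard = sum(column) if by == "total" else max(column)
        for i in range(n_rows):
            # A column of zeros has no standard to divide by; leave it as zeros
            out[i][j] = matrix[i][j] / standard if standard > 0 else 0.0
    return out


# Hypothetical data: 2 sample units (rows) by 2 species (columns),
# the second species being ten times more abundant than the first
data = [[1, 10], [3, 30]]
print(relativize_columns(data, by="total"))   # each column sums to 1
print(relativize_columns(data, by="max"))     # each column peaks at 1
```

After either relativization the two species columns become identical, which shows how the difference in overall abundance between species is removed while the pattern across sample units is kept.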

DataRelativization

Deviations: Value - Mean
Z scores: (Value - Mean) / SD
Binary response: Above (1) / Below (0)
Ranks: Assigns ranks
(e.g., 0, 0, 6, 9 would receive the ranks 1.5, 1.5, 3, 4)
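
Deviations, z scores and tied ranks can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption rather than a statement about PC-ORD.

```python
import statistics


def deviations(values):
    """Value minus the mean of the column."""
    mean = statistics.mean(values)
    return [x - mean for x in values]


def z_scores(values):
    """(Value - Mean) / SD, using the sample standard deviation (an assumption)."""
    sd = statistics.stdev(values)
    return [d / sd for d in deviations(values)]


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # mean of the 1-based positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


print(z_scores([2, 4, 6]))
print(average_ranks([0, 0, 6, 9]))   # the tie example from the slide
```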

DataRelativization Howto
Note:
Need to
accept
TEMP file

DataExploration Summary
Create naming convention for your files (metadata record)
(DATE_AREA_SP_suffix)
9710_Oahu_WTSH_raw
Create a data flow archive in your analysis notebook
Check assumptions of statistical tests / approaches
(PCA: normality of data, linear relationships)
Visually inspect your data: 1-D, 2-D, many-D.
Look for missing data and outliers in individual datasets
Inspect

relationships between variables (pairs, multiple)

DataManipulation Summary
Add missing data and fix typos
Ensure variables expressed in the same units (km / m)
Select the number and identity of species
(Rare species that occur in a single sample
contribute virtually no information, but add noise)
Look for and deal with outliers: (Remove OR Transform)
Deal with confounding factors, such as the different
magnitude of environmental variables (e.g., depth in m or km)
and the proportional representation of different species
(Relativize your data)

CHAPTER 9

Data Transformations

Tables, Figures, and Equations

From: McCune, B. & J. B. Grace. 2002. Analysis of


Ecological Communities. MjM Software Design,
Gleneden Beach, Oregon http://www.pcord.com

A general procedure for data adjustments


Species data
Table 9.3. Suggested procedure for data adjustments of species data matrices.
Action to be considered: 1. Calculate descriptive statistics. Repeat
this after each step below. (In PC-ORD run Row & column summary)
Beta diversity (community data sets)
Average skewness of columns
Coefficient of variation (CV, %)
CV of row totals
CV of column totals
Criteria: Always
Action to be considered: 2. Delete rare species (< 5% of sample units)
Criteria: Usually applied to community data sets,
unless contrary to study goals

Species data, cont.


Action to be considered: 3. Monotonic transformation (if applied to species,
then usually applied uniformly to all of them, so that
all are scaled the same)
Criteria:
A. Average skewness of columns (species)
B. Data range over how many orders of magnitude?
(Count and biomass data often are extreme.)
C. Beta diversity. (Consider presence/absence
transformation for community data when beta diversity is high.)

Species data, cont.

Action to be considered: 4. Row or column relativizations
Criteria:
What is the question?
Are units for all variables the same?
Is relativization built into the subsequent analysis?
CV of row totals
CV of column totals
What distance measure do you intend to use?
Note: regardless of your decision to relativize or not,
you should state your decision and justify it briefly on
biological grounds.

Species data, cont.


Action to be considered: 5. Check for outliers based on the average distance of
each point from all other points. Calculate standard
deviation of these average distances. Describe
outliers and take steps to reduce influence, if
necessary
Criteria:
Standard deviations      Degree of problem
< 2                      no problem
2 - 2.3                  weak outlier
2.3 - 3                  moderate outlier
> 3                      strong outlier
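
The outlier check in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions: plain Euclidean distance between rows (the distance measure is the analyst's choice), the sample standard deviation of the average distances, and hypothetical function names and data.

```python
import math
import statistics


def average_distances(matrix):
    """Average Euclidean distance from each sample unit (row) to all other rows."""
    n = len(matrix)
    return [
        sum(math.dist(matrix[i], matrix[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def outlier_report(matrix):
    """Express each average distance in SD units and label it with the table's categories."""
    avg = average_distances(matrix)
    mean = statistics.mean(avg)
    sd = statistics.stdev(avg)   # sample SD: an assumption
    report = []
    for a in avg:
        z = (a - mean) / sd
        if z < 2:
            label = "no problem"
        elif z < 2.3:
            label = "weak outlier"
        elif z <= 3:
            label = "moderate outlier"
        else:
            label = "strong outlier"
        report.append((z, label))
    return report


# Hypothetical data: ten similar sample units plus one very different unit
samples = [[i, 0] for i in range(10)] + [[100, 100]]
for z, label in outlier_report(samples):
    print(f"{z:6.2f}  {label}")
```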

Environmental data
Table 9.4. Suggested procedure for data adjustments of quantitative variables in environmental data matrices.
Action to be considered: 1. Calculate descriptive statistics for
quantitative variables. Repeat this after each step below. (In PC-ORD
run Row & column summary)
Skewness and range for each variable (column)
Criteria: Always
Action to be considered: 2. Monotonic transformation (applied
to individual variables, depending on need)
Criteria: Consider log or square root transformation for variables with
skewness > 1 or ranging over several orders of magnitude.
Consider arcsine squareroot transformation for proportion data.

Environmental data
Action to be considered: 3. Column relativizations
Criteria: Consider column relativization (by norm or standard deviates) if
environmental variables are to be used in a distance-based
analysis that does not automatically relativize the variables (for
example, using MRPP to answer the question: do groups of
sample units defined by species differ in environmental space?).
Column relativization is not necessary for analyses that use the
variables one at a time (e.g., ordination overlays) or for analyses
with built-in standardization (e.g., PCA of a correlation matrix).

Environmental data
Action to be considered: 4. Check for univariate outliers and
take corrective steps if necessary.
Criteria: Examine scatterplots or frequency distributions or relativize by
standard deviates (z scores) and check for high absolute
values.

NonmetricMultidimensionalScaling
(NMS)

Objectives:

Discuss Steps for Analysis: Advantages / Disadvantages
Go over output and interpretation of Autopilot Analysis

LearningOutcomes:
Understand what an NMS analysis does and tells you
Be able to do an NMS analysis with PC-ORD

NMS Whatisit?
Non-metric: Non-parametric data analysis (ranks)
(Relationships between object pair-wise
distances and dissimilarities are not linear)
Output:

Representation of relationships between


objects (samples, species) and descriptors
(environmental variables) in a reduced
number of dimensions (axes)
Axes do not correspond to eigenvectors
(User cannot deduce contribution of various
descriptors / objects to described axes)

NMS Howdoesitwork?
NMS searches for best position of n objects on k
dimensions (axes) to minimize stress of k-d configuration
Compares the pair-wise distances (difference) of the
objects in reduced space (expressed in terms of the axes)
and the dissimilarity of the objects in the real world
(expressed in terms of the samples / species / variables):
The Real World
(e.g., 3D)

Reduced Space
(e.g., 1D)

NMS Howdoesitwork?
Approach:

Mechanics:

Iterative procedure
(Manipulates the coordinates of pairs of
observations so they fit as closely as
possible the measured object similarities)

Using a random initialization, NMS uses


multiple iterations to find a robust pattern
(Goodness of fit is measured using stress,
which relates distances between objects in
reduced space with their dissimilarities)

NMS TheGood
Being based on ranked distances, it tends to linearize
relationship between environmental / species distances
Can deal with any distance measure, data normalization,
and data transformation
Can handle non-metric, semiquantitative and subjective
data (e.g., good / bad, beaufort sea state)
Solves zero truncation problem and some missing data
Empirical studies have shown that:
- Use of ranks makes NMS robust even if relationships
between distances and dissimilarities are not linear
- Provides appropriate distance summary with small
number of dimensions

NMS TheBad
Computationally intensive
Does not provide loadings (no formula relates the axes to the original variables)
For a given number of dimensions, the solution for a
particular axis is unique. (First dimension in 2-D solution
not the same as first dimension in 3-D or 1-D)
Axis numbers are arbitrary, so the percent of variance on
a given axis does not decrease with increasing axis number
Difficulties in detecting discontinuities
May fail to find the global solution (minimum global stress)
because of multiple local minima.
Need to account for random start of iterative process
(e.g., repeat analysis to see if random start matters)

NMS Approach
1. Calculate dissimilarity matrix ($\Delta$) of real data.
2. Assign sample units to starting configuration in the k-space
(define initial X). Starting locations (scores on
axes) are assigned with a random number generator.
3. Normalize X by subtracting axis means for each axis
l and dividing by overall standard deviation of scores:

$$ \text{normalized } x_{il} = \frac{x_{il} - \bar{x}_l}{\sqrt{\sum_{l=1}^{k} \sum_{i=1}^{n} (x_{il} - \bar{x}_l)^2 / (n \cdot k)}} $$

(n = samples, k = dimensions)
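
Step 3 (subtract each axis mean, then divide by the overall standard deviation of all scores) can be sketched in plain Python. The function name and the starting coordinates are hypothetical; this illustrates the normalization as described on the slide and is not PC-ORD's code.

```python
import math


def normalize_configuration(points):
    """Center each axis on zero and scale so the overall SD of all n * k scores is 1."""
    n, k = len(points), len(points[0])
    means = [sum(p[l] for p in points) / n for l in range(k)]
    centered = [[p[l] - means[l] for l in range(k)] for p in points]
    overall_sd = math.sqrt(sum(v * v for row in centered for v in row) / (n * k))
    return [[v / overall_sd for v in row] for row in centered]


# Hypothetical starting configuration: 3 sample units in 2-D space
start = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]]
for row in normalize_configuration(start):
    print([round(v, 3) for v in row])
```

Because a single overall standard deviation is used, the relative spread of the axes is preserved; only the overall size of the configuration is standardized.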

NMS Approach
4. Calculate D using the Euclidean distances between
sample units in k-space.
5. Rank elements of $\Delta$ in ascending order.
6. Put the elements of D in the same order as $\Delta$.
7. Calculate $\hat{D}$ (with elements $\hat{d}_{ij}$, created by replacing
elements of D which do not meet monotonicity).
Software creates a plot of sample
pair-wise dissimilarities (y axis)
versus distances in k-space (x axis)
We compute distance in k-space

NMS Approach

Plot of distance in ordination space ($d_{ij}$, horizontal axis)
vs. dissimilarity in original p-dimensional space ($\delta_{ij}$, vertical
axis). Points are labeled with the ranked distance
(dissimilarity) in the original space.

NMS Approach

Calculate $\hat{d}$ terms: shifts in k-dimensional distances (x axis)
to reach monotonic change in distances in original data

NMS Stress
8. Calculate raw stress, S*:

$$ S^* = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \hat{d}_{ij})^2 $$

Note: S* measures the departure from monotonicity.
If S* = 0, the relationship is perfectly monotonic.

NMS Stress
9. Because raw stress is altered if the configuration of points
changes (e.g., point locations, number of dimensions) it is
necessary to standardize ("normalize") stress.
Kruskal's stress formula one:

$$ S = S^* \Big/ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}^2 $$

PC-ORD reports S_R, the square root of scaled stress:
Analogous to standard deviation, then multiplied by 100 to
rescale the result from zero to 100:

$$ S_R = 100 \sqrt{S} $$
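
Steps 4 to 9 (distances in k-space, monotone fit, raw stress S*, and the rescaled stress S_R) can be sketched in plain Python. This is an illustration under stated assumptions: the monotone fit uses the pool-adjacent-violators procedure, ties in the dissimilarities are left in their input order, and all function names and data are hypothetical. It is not PC-ORD's implementation.

```python
import math


def pairwise_distances(points):
    """Euclidean distances d_ij for all pairs i < j, in a fixed pair order."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n - 1) for j in range(i + 1, n)]


def monotone_fit(values):
    """Pool adjacent violators: closest non-decreasing sequence (least squares)."""
    blocks = []                      # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last block mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress(dissimilarities, points):
    """Return (raw stress S*, S_R = 100 * sqrt(S* / sum of squared distances))."""
    d = pairwise_distances(points)
    order = sorted(range(len(d)), key=lambda idx: dissimilarities[idx])  # rank the dissimilarities
    d_sorted = [d[idx] for idx in order]                                 # distances in that order
    d_hat = monotone_fit(d_sorted)
    raw = sum((a - b) ** 2 for a, b in zip(d_sorted, d_hat))
    scaled = raw / sum(a * a for a in d_sorted)
    return raw, 100.0 * math.sqrt(scaled)


# Hypothetical 1-D configuration of three sample units; pair order is (0,1), (0,2), (1,2)
config = [[0.0], [1.0], [3.0]]
print(kruskal_stress([1.0, 3.0, 2.0], config))   # dissimilarities agree in rank with the distances
print(kruskal_stress([3.0, 1.0, 2.0], config))   # rank order disagrees, so stress is above zero
```

In a full NMS run this calculation sits inside the iterative loop: the configuration is moved, the stress is recomputed, and the process repeats until the stopping rule is met.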

NMS Approach
10. Now the program tries to minimize S by changing the
configuration of the sample units in the k-space.
Calculate "negative gradient of stress" for each point i.
11. The amount of movement in direction of the negative
gradient is set by the step length, α, which is about 0.2 initially.
The step size is recalculated after each step such that the step
size gets smaller as reductions in stress become smaller.
12. Iterate (go to step 3) until either:
- a set maximum number of iterations is reached OR
- a criterion of stability is met

NMS Approach
Crawling through the landscape in search of the optimum

Stress Landscape
Changing
positions of
the samples

Axis 1

Axis 2

The goal is to minimize stress


(to end up in a valley)

Some landscapes are


trickier than others

NMS Approach
The starting configuration can influence the result
Beware of local minima (pits)
Avoid unstable solutions (saddle points)
The starting configuration can be selected in two ways:
Use a random starting configuration
Use coordinates from another ordination method
Recommendation: Use a random start
A high number of random starting configurations often
provides a solution with lower stress
This approach avoids having to decide on what other
method to use, and so avoids losing the great benefits of NMS

NMS Approach
Possible to evaluate whether NMS is extracting stronger
axes than expected by chance
Statistical Significance Based on Randomization Test
(Monte Carlo approach):
p = (1+n) / (1+N)
n = number of randomized runs with final stress
less than or equal to the observed minimum stress
(one-tailed test)
N = number of randomized runs
Recommendation: Use a large number of runs
This is a computationally intensive method that can
take a great deal of time (even if runs = 20)
We need to have a large enough number of runs to
calculate the p value with the desired resolution
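
The randomization p value follows directly from the formula p = (1 + n) / (1 + N). The sketch below is a minimal illustration; the function name and the stress values are hypothetical.

```python
def randomization_p(observed_stress, randomized_stresses):
    """p = (1 + n) / (1 + N), with n = randomized runs whose final stress <= the observed stress."""
    n = sum(1 for s in randomized_stresses if s <= observed_stress)
    return (1 + n) / (1 + len(randomized_stresses))


# Hypothetical final stress values: real data vs. 19 runs with randomized data
observed = 13.4
randomized = [20.0 + 0.5 * i for i in range(19)]   # all higher than the observed stress
print(randomization_p(observed, randomized))        # smallest attainable p with N = 19
```

The smallest attainable p value is 1 / (1 + N), so the number of randomized runs sets the resolution: 19 runs cannot give p below 0.05, and 99 runs cannot give p below 0.01.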

NMS Approach
Statistical Significance Based on Randomization Test
( p value: p = (1+n) / (1+N) )
(20 runs)

(50 runs)

Stress declines with increasing dimensions


Real data have lower stress than randomized data

NMS Approach
Stress Interpretation:
Real Data:
Declines with increasing
dimensions (from 1 to 5)
Randomized Data:
Real data below the
distribution of
randomized data
(for dimensions 1 to 5)

NMS AutopilotMode
The automatic procedure determines
most appropriate dimensionality,
assigns statistical significance with
randomizations, and avoids local
minima (using random iterations)
Advantages: Uses default settings
and decides number of axes for you

Disadvantages: User may want


additional output products. Number of
axes based on additional considerations

NMS AutopilotMode
The autopilot NMS mode
provides three settings
Speed vs Thoroughness
Quick and Dirty
Medium
Slow and Thorough

NMS Results
Examine Results.txt file: Settings / Options

Up to 6 dimensions (for sake of interpretation)


Random start (to avoid local minima)
Reduction in dimensionality (D: 6,5,4,3,2,1)

NMS Results
Examine Results.txt file: Settings / Options (all Dimensions)

Cannot monitor changing stress


Cannot assess linearity of distances / dissimilarities
Cannot see scores for all the runs, just for final run
Cannot see scores for species, just for final run

NMS Results
Examine Results.txt file: Results for best result
Stress

P values

Scores

NMS Results
Examine Results.txt file: Results for best result
Scores

NMS Results
Examine Results.txt file: Plotting Stress vs Iteration

Note: This graph provided only for best answer (3-D)

NMS Results
Examine Results.txt file: Interpret Stress (Clarke 1993)

NMS Results
Examine Results.txt file: Run Log

Random data: 0 = not randomized, 1 = randomized


Start file: 0 = random starting coordinates, 1 = read from file
Seeds: initial seeds for random number generator
* Stability criterion not met.

NMS Results

Examine Results.txt file: Run Log

**To run single NMS ordination repeating best result, use this
file as starting configuration, rather than using random start.
Save this file with new name, to avoid overwriting it with next
NMS test. To do this, open file using File | Open | Graph Row
file, then File | Save as | Graph Row file (specify new name).
.

NMS Results
Examine graphs: Species scores
Select Weighted Average Scores

Species as Vectors

Species as Points

NMS Results
Examine graphs: 2D Ordination plots

Tau: non parametric correlation

NMS Results
Correlations with Matrices:
Tau (rank correlation)
DO NOT use r² value
Percent of Variance:
Use same distance metric
used for NMS analysis

NMS Results
Coefficient of Determination (% of Variance):
For each axis and for all axes together
FINE to use r² value


Orthogonality:
Measure of independence of the three axes

NonmetricMultidimensionalScaling
(NMS)

Objectives:

Go over settings and results of Manual Analysis
Discuss constraints when deciding on number of axes

LearningOutcomes:
Understand what results need to be reported

NMS SuggestedProcedure
This suggested procedure
determines the most appropriate
dimensionality, assigns statistical
significance with randomizations,
and avoids local minima.

Recommendation: Request a 6-dimensional solution,


stepping down to a 1-dimensional solution, with instability
criterion of 0.0005 (or smaller), 200-500 iterations, 20-50
runs with real data, and 20-50 runs of randomized data
(for randomization tests of statistical significance)

NMS SuggestedProcedure:Step1
First, pick distance measure

Second, set up parameters

Step
Down

Relative Sorensen

Dimensions (max = 6)

Relative Euclidean

200 Iterations, 10 Runs

NMS SuggestedProcedure
Third, pick the output options
Write final
configuration
Run Log
Plot Stress
vs. Iteration
Provides scores

Statistics
Dimensionality

Plot distance
vs. dissimilarity
Randomization
Statistical Test
Species Scores
(for plotting)

NMS SuggestedProcedure
1. Preliminary runs: Stress Test determines dimensionality
Use time of day random seed

Graph messages

NMS Results
Examine Results.txt file: Settings / Options

NMS Results
Examine Results.txt file: Results for each run / dimension
Stress
Scores

NMS Results
Differences in Real Space

Examine Results.txt file: Shepard Diagram


(6 - D)

(2 - D)

Final
Stress:
4.137

Distances in 6-D space

Final
Stress:
23.138
Distances in 2-D space

NMS Results
Examine Results.txt file: Plotting Stress vs. Iteration

Note: This process is repeated for each run

NMS Results
Examine Results.txt file: Stress

13.4178 = final stress

0.0031 = final instability

NMS SuggestedProcedure:Step2
Goal: Select the Best Solution:
Plot stress vs. number of dimensions
How: Just after running NMS
Do this in PC-ORD by selecting
Graph | NMS Scree Plot
If the stress
increases with
additional
dimensions,
the model is
over-fitted

NMS SuggestedProcedure:
PC-ORD uses the following criteria (for reference):

- Compare the final stress values among the best
solutions, one best solution for each dimensionality.
- Additional dimensions are considered useful if they reduce
final stress by 5 or more (on a scale of 0-100). PC-ORD
selects the highest dimensionality that meets this criterion.
- At that dimensionality, the final stress must be lower
than that for 95% of the randomized runs (i.e., p <= 0.05).
- If this criterion is not met, PC-ORD does not accept that
solution and chooses a lower-dimensional solution,
provided that it passes the randomization test.
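
One possible reading of these criteria is sketched below. It is an interpretation for illustration only, not PC-ORD's actual routine: the function name, the step-down behaviour when the randomization test fails, and the stress and p values are all assumptions.

```python
def choose_dimensionality(stress, p_values, min_drop=5.0, alpha=0.05):
    """Pick the highest dimensionality whose added axis lowers stress by >= min_drop,
    then step down until the randomization test (p <= alpha) is passed."""
    dims = sorted(stress)
    chosen = dims[0]
    for lower, higher in zip(dims, dims[1:]):
        if stress[lower] - stress[higher] >= min_drop:
            chosen = higher
    for k in reversed([d for d in dims if d <= chosen]):
        if p_values[k] <= alpha:
            return k
    return None   # no dimensionality passed the randomization test


# Hypothetical best final stress (0-100 scale) and p value for 1 to 6 dimensions
stress = {1: 40.0, 2: 25.0, 3: 15.0, 4: 12.0, 5: 11.0, 6: 10.5}
p_values = {k: 0.02 for k in stress}
print(choose_dimensionality(stress, p_values))   # third axis is the last to cut stress by 5 or more
```

Any automated choice of this kind still needs to be checked against the scree plot, the stability of the solution, and the interpretability of the axes.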

NMS SuggestedProcedure:
Other metrics for selecting number of dimensions:
marginal change in stress
p values

*
*

If stress does
not increase,
computer
considers
marginal
decline with
added Ds
Consider the
p values

NMS SuggestedProcedure:
Check for a better-than-random solution by using the
results of the Randomization test.
Limitations: Helpful but not foolproof.
The most common problems are:
Strong outliers, single super-abundant species,
small data sets (e.g.,<10 SUs), many zeros
Note: The first axis with randomized community data is
often nearly as strong as or stronger than the real data,
even when the pattern in the real data is strong. The
randomization creates rows with unequal abundances:
some rows can have higher or lower totals than the real data.
Thus a 1-D NMS solution from shuffled data tends to
describe variation in row totals. Interpret carefully

NMS SuggestedProcedure:
Goal: Select number of dimensions beyond which
additional dimensions provide only small stress reductions
Suggestion: Follow PC-ORD's recommendation
but check for some safeguards
Note:
No firm fixed criterion for selecting an appropriate
number of dimensions (Kruskal and Wish 1978)
Axis scores depend on the number of axes. The first
dimension on a 2-D and a 3-D result will be different

NMS SuggestedProcedure:
Trade-Offs:
Do not trust results with large stress values (> 20)
Final stress decreases and the proportion of the variance
represented increases with more axes
Pick as few dimensions as possible based on stress
reductions but if in doubt, add an extra dimension
Beware of unstable results (stress wiggles with iterations)
Consult the instability of the final answer

NMS SuggestedProcedure:
Check the following: (a) plot of stress vs. iteration for
stability of the solution at the selected number of
dimensions; and (b) final instability value for the chosen
solution, as listed in the numerical output from NMS
Look for smooth curves
Figure: plots of Stress (vertical axis) against Step (horizontal axis, 50-200) for a Stable and an Unstable solution
Strive for instability < 10^-4 (< 0.0001)

NMS SuggestedProcedure:
Use Data Exploration to decrease stress of NMS analysis
Figure, left panel: Final Stress (%) against Number of Sample Units (0-50)
Dependence of stress on sample size, by subsampling rows of a
matrix of 50 units by 29 species
Figure, right panel: Final Stress (%) and Species remaining (count) against Criterion for species retention (% of SU's, 0-100)
Dependence of stress on progressive removal of
rare species from data set

NMS WhattoReport
Samples / Species Considered
Data Transformations
Distance Metric and Software Used
Did you use a random starting point?
Number of runs with real / random data
Number of dimensions considered
How did you select the dimensions?
Final stress / instability of best solution
Monte Carlo test results (p values)
Proportion of variance explained by each axis (r²)
Overlays (env. data / species)
Correlations of env. data / species with axes (Tau)

NMS References
PC-ORD uses the following algorithms:
Mather, P. M. 1976. Computational methods of multivariate
analysis in physical geography. J. Wiley & Sons, London.
532 pp.
Kruskal, J. B. 1964. Multidimensional scaling by optimizing
goodness of fit to a nonmetric hypothesis. Psychometrika
29:1-27.
For a review of NMS, cite:
Clarke, K.R. 1993. Non-parametric multivariate analyses of
changes in community structure. Australian Journal of
Ecology 18:117-143.
Kenkel, N.C., Orlóci, L. 1986. Applying metric and nonmetric
multidimensional scaling to ecological studies: some new
results. Ecology 67:919-923.

NMS ExamplesI
Seabird communities of the Indian Ocean
We selected an
observation day as
the sampling unit for
the community-level
analysis because we
regarded the daily
transects as discrete
samples, separated
by night time periods
with no survey effort.
(Hyrenbach et al. 2007)

Our sample size was a matrix of 16 transects and 46 taxa.


We standardized the samples using the relative abundance of taxa.
To ensure each daily sample was weighted equally in the analysis,
we used the relative Sorensen (Bray-Curtis) index (Manly, 1994).
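
As a rough illustration of what the relative Sorensen index does, the sketch below relativizes each sample by its total and then computes the Sorensen (Bray-Curtis) distance. The function names and the three daily samples are invented for the example, and it assumes no sample is empty (total of zero).

```python
def relativize(row):
    """Convert raw abundances to proportions of the sample total."""
    total = sum(row)
    return [x / total for x in row]


def sorensen_distance(a, b):
    """Sorensen (Bray-Curtis) distance: sum|a - b| / sum(a + b)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den


def relative_sorensen(a, b):
    """Sorensen distance after relativizing each sample by its total."""
    return sorensen_distance(relativize(a), relativize(b))


# Invented daily samples (counts of 4 taxa)
day1 = [10, 0, 5, 5]
day2 = [100, 0, 50, 50]   # same composition as day1, ten times the birds
day3 = [0, 20, 0, 0]      # shares no taxa with day1

print(sorensen_distance(day1, day2))   # large: driven by total abundance
print(relative_sorensen(day1, day2))   # zero: identical proportions
print(relative_sorensen(day1, day3))   # one: no taxa in common
```

Relativizing removes differences in total abundance, which is why each daily sample ends up weighted equally.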

NMS ExamplesI
Seabird communities of the Indian Ocean
The NMDS selected 3 habitat axes, which accounted for
73.4 % of variance observed in the seabird community
- The first axis (r² = 0.15) described lat gradients associated with
concurrent SST decrease and CHL increase.
- The second axis (r² = 0.41) illustrated concurrent lat / long changes in
wind speed, depth, CHL, SST, and gradients in ocean depth and SST.
- The third axis (r² = 0.17) captured the influence of onshore-offshore
gradients in CHL, irrespective of lat and long.

Because axes 2 and 3 explained a higher proportion of the
observed variability, we plotted the survey transects and
species distributions in 2 dimensions
(Hyrenbach et al. 2007)

NMS ExamplesI
Seabird community structure in the Indian Ocean
[Figure: TRANSECTS (numbered 1 to 16) plotted on Axis 2 (North / South) and Axis 3 (Shallow / Deep)]
Three seabird assemblages:
sub-Antarctic, subtropical offshore, subtropical nearshore
(Hyrenbach et al. 2007)

NMS ExamplesII
Seabirds and subsurface predators around Oahu
69 seabird foraging
observations recorded
Presence of subsurface
predators was not
ascertained in 7 schools
In 2 of remaining 62
observations, no
subsurface predators
were present

(Hebshi et al. 2008)

NMS ExamplesII
Seabirds and subsurface predators around Oahu
The NMDS analysis relied on a similarity matrix created using the
Sorensen (Bray-Curtis) index from the raw seabird counts and 13
explanatory variables describing:
- type of fishing (commercial vs. sport)
- subsurface predator (skipjack tuna, mahimahi, spotted dolphin,
false killer whale, yellowfin tuna, unknown),
- geographic location around Oahu (Waianae, Penguin Bank,
Kaena Point, other *).
* Only those locations contributing at least 10%
(7 or more) observations considered in analysis.
(Hebshi et al. 2008)

NMS ExamplesII
Seabirds and subsurface predators around Oahu
NMS identified 2 highly (99.3%)
orthogonal axes (r = 0.082),
which explained 67.9% of the
cumulative observed variance
axis 1: r² = 0.502
axis 2: r² = 0.178
The NMS stress was 17.873,
suggesting that the test
performance was fair
(McCune & Grace 2002)

(Hebshi et al. 2008)
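
The figures on this slide can be checked with a little arithmetic, assuming percent orthogonality is defined as 100 * (1 - r²), where r is the correlation between the two sets of axis scores. The function name is made up for the example.

```python
def percent_orthogonality(r):
    """Percent orthogonality of two axes, assuming 100 * (1 - r^2)."""
    return 100.0 * (1.0 - r ** 2)


# Values reported on the slide
axis_r2 = {"axis 1": 0.502, "axis 2": 0.178}
orth = percent_orthogonality(0.082)
cumulative = 100.0 * sum(axis_r2.values())

print(round(orth, 1))        # percent orthogonality of the two axes
print(round(cumulative, 1))  # cumulative percent of variance represented
```

The orthogonality comes out at about 99.3%, matching the slide. Summing the two rounded r² values gives about 68.0%, close to the reported 67.9%, presumably because the reported total was computed from unrounded values.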

NMS ExamplesII
Seabirds and subsurface
predators around Oahu
The seabird community was
influenced by the presence of
wedge-tailed shearwaters, brown
noddies, and sooty terns
The first axis captured the
differences between commercial
and sport fishing vessels, while
the second axis captured variability
across geographic locations
This analysis also revealed
significant correlations with the first
axis for 2 subsurface predators:
mahimahi (+) and skipjack tuna (-)

(Hebshi et al. 2008)

TakeHomeMessages
NMS is a flexible and powerful tool
This inherent flexibility makes this technique difficult
to interpret (how many meaningful axes are there?)
Yet, NMS allows the integration of different datasets into
multivariate patterns
Data exploration will help you use NMS most efficiently,
by carefully choosing the sample sizes and species /
variables to include in your analyses.
Use NMS to tell ecological stories that balance noise
against statistical significance

PCAExaminationKey
This exam is worth 10 points (two homeworks).
Just like in the homeworks, make sure you explain what
you are doing and how you are getting the answers. This
way, I can give you partial credit for incomplete answers.
In particular, explicitly state what PC-ORD command you
used to obtain the various figures / results.
You will turn in a ppt file with your images and text
inserted into the body of the presentation. To copy text
from PC-ORD screen, use CONTROL + Print Screen
When answering the questions, back up your responses
with figures / tables / numbers. An image / table is worth
1000 words!!!

Dataset
Data file: PCA1M.wk1 (main matrix)
96 samples and 5 variables
Samples are monthly values (Jan. 97 - Dec. 04)
Variables:
Time: decimal year
MEI: El Niño Multivariate Index (positive: warm, negative: cold)
PDO: Pacific Decadal Oscillation (positive: warm, negative: cold)
Up36: upwelling at 36° N (positive: upwelling, negative: downwelling)
Up39: upwelling at 39° N (positive: upwelling, negative: downwelling)

DataExploration
Use scatterplot matrix to make a
plot of all possible pair-wise
combinations of the 5
environmental variables

DataExploration Correlograms
Time Trends ?
Regional Indices
(PDO / MEI)

Local Indices
(up36 / up39)

DataExploration Advisor
Rows Skewed
Columns Not Skewed

Outliers: Samples
Look out for these
in the plot results

DataExploration VariableYear

-0.2330E-07 = skewness

DataExploration Skewness

0.22 = skewness

0.94 = skewness

0.60 = skewness

0.24 = skewness

StatisticalResults WithYear

StatisticalResults WithYear
Important Axes:
Eigenvalue: 1,2,3,4 Broken-stick: 1 P-values: 1,4
Interpretation:
Loadings > 0.5 highlighted (arbitrary)
up36 / up39: Together
up36 / up39: Opposite
Time
MEI / PDO: Together
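
For reference, the broken-stick criterion compares each observed eigenvalue with the value expected if the total variance were split at random among the axes. The sketch below assumes a PCA on the correlation matrix (so the eigenvalues sum to the number of variables). The observed eigenvalues are invented, not the ones from this data set.

```python
def broken_stick(p):
    """Expected eigenvalues under the broken-stick model for p variables.

    Scaled for a correlation-matrix PCA, so the values sum to p:
    b_k = sum over j = k..p of 1 / j.
    """
    return [sum(1.0 / j for j in range(k, p + 1)) for k in range(1, p + 1)]


def axes_passing(eigenvalues):
    """Axes whose observed eigenvalue exceeds the broken-stick expectation."""
    expected = broken_stick(len(eigenvalues))
    keep = []
    for axis, (obs, exp) in enumerate(zip(eigenvalues, expected), start=1):
        if obs > exp:
            keep.append(axis)
    return keep


# Invented eigenvalues for 5 variables (they sum to 5)
observed = [2.45, 1.20, 0.75, 0.40, 0.20]
print([round(b, 3) for b in broken_stick(5)])
print(axes_passing(observed))
```

With these invented values only axis 1 exceeds its broken-stick expectation, the same kind of outcome as "Broken-stick: 1" above.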

DataTransformation TimesinceStart
Transformation:
Subtract 1997 (first sample year)
Recode as Time Since Start

Similar skewness
No more outliers
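
Subtracting a constant shifts a variable without changing its shape, which is why the skewness stays the same. A minimal check, assuming the moment coefficient of skewness (PC-ORD's exact formula may include a sample-size correction) and a start year of 1997 taken from the Jan. 97 - Dec. 04 sample range:

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


start_year = 1997  # assumed from the Jan. 97 - Dec. 04 sample range
decimal_year = [start_year + m / 12 for m in range(96)]   # 96 monthly samples
time_since_start = [t - start_year for t in decimal_year]

print(skewness(decimal_year))       # essentially zero: evenly spaced values
print(skewness(time_since_start))   # unchanged by the shift
```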

DataExploration Skewness

-0.2330E-07 = skewness

StatisticalResults WithTime

StatisticalResults WithTime
Important Axes:
Eigenvalue: 1,2,3,4 Broken-stick: 1 P-values: 1,4
Interpretation:
Loadings > 0.5 highlighted (arbitrary)
up36 / up39: Together
up36 / up39: Opposite
Time
MEI / PDO: Together

DataTransformation RemoveTime
Remove Column:
No Time

Less Skewness
(for rows)
Still No Outliers

StatisticalResults RemoveTime

StatisticalResults WithoutTime
Important Axes:
Eigenvalue: 1,2,3 Broken-stick: 1 P-values: 1,3
Interpretation:
Loadings > 0.5 highlighted (arbitrary)
up36 / up39: Together
up36 / up39: Opposite
MEI / PDO: Together

DataExploration WithTime

Independent
(orthogonal) variables

DataExploration Time

No correlation with axis 1 or 2

Positive correlation with axis 3

DataExploration Time

DataExploration WithoutTime

Independent
(orthogonal) variables

DataExploration WithTime
Axis 1:
Su02

Big:
More Upwelling
Small:
Less Upwelling
Axis 3:

SU98 WI98
Su99

Small: Warm
Big: Cool

SU97

WI97

DataExploration WithoutTime

PDO axis 1

MEI axis 1

DataExploration WithoutTime

Upwelling 39 axis 1

Upwelling 36 axis 1

Conclusions
Number of eigenvalues = Number of variables
Eigenvalues / loadings did not change
- even after transforming YEAR data
Broken-stick results did not vary: YEAR / TIME
Randomization results did vary: YEAR / TIME

Removing time (linear variable)
- one less eigenvalue
- highlighted upwelling / PDO / MEI influence

PolarOrdination/MRPP
Objectives:
Discuss general approaches of these two methods
Go over settings and results for these two methods

LearningOutcomes:
Understand how to perform these analyses
Be familiar with what results need to be reported

PolarOrdination Applications
Bray-Curtis Ordination
(Polar Ordination) arranges samples
with respect to poles (also termed
end points or reference points)
according to a distance matrix
These endpoints are two samples
with the highest ecological distance
between them (objective approach),
OR two samples suspected of being
at opposite ends of an important
gradient (subjective approach)
Recommendation: This procedure is especially useful for
investigating ecological change (e.g., succession, recovery).

PolarOrdination Pros/Cons
Advantages:
Ideal for evaluating problems with discrete endpoints:
conceptually (arctic sample / tropical sample) or
practically (before disturbance / climax community)
Polar Ordination ideal for testing specific hypotheses
(e.g., reference condition or experimental design) by
subjectively selecting the end points
Disadvantages:
This technique does not provide a general-purpose
description of the community (perspective is biased)
Very sensitive to outliers (by definition end points)

PolarOrdination HowitWorks
Setting Up:
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
Calculate sum of squares of the
distances for calculation of the
variance represented by each axis
Select two points, A and B, as reference points for axis 1
Define End Points Subjectively OR Use Objective Method
3 Objective Methods: Recommend Variance-Regression
- find point with largest variance in pairwise distances
- select as second point the one with the most negative slope when its distances are regressed against those of the first point

PolarOrdination HowitWorks
Selecting End Points:
Variance-Regression:
(Beals 1984)
Selects points at edges of main
cloud of points (Recommended)

Original (Bray & Curtis 1957):
Selects outliers looking for two
points with largest distances
Minimum-Centroid (Deviation):
Must pick geometry subjective, rarely used (DO NOT USE)

PolarOrdination HowitWorks
Once you have the first axis (g) linking the two points:
Calculate position (xgi) of each point i on the axis g. Point i
is projected onto axis g between reference points A and B

For Reference: Equation for projection onto the axis is:

PolarOrdination HowitWorks
Calculate variance represented by axis g as a percentage
of the original variance (Vg %). The residual sum of
squares has the same form as the original sum of squares and
represents the amount of variation remaining from the original distance matrix
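
The projection and variance steps can be sketched as follows. This assumes Euclidean geometry, where the score of point i on the axis through A and B is (D_AB² + D_Ai² - D_Bi²) / (2 D_AB), and it takes the variance represented as 100 * (1 - residual SS / original SS). Both formulas are assumptions for this illustration (the city-block option uses a different projection), and the four points are invented.

```python
import math


def project_on_axis(D, a, b):
    """Scores of every point on the axis through endpoints a and b
    (Euclidean geometry): x_i = (D_ab^2 + D_ai^2 - D_bi^2) / (2 * D_ab)."""
    d_ab = D[a][b]
    return [(d_ab ** 2 + D[a][i] ** 2 - D[b][i] ** 2) / (2 * d_ab)
            for i in range(len(D))]


def residual_distances(D, scores):
    """Distances left once the axis is removed:
    R_ih = sqrt(D_ih^2 - (x_i - x_h)^2), floored at zero."""
    n = len(D)
    return [[math.sqrt(max(D[i][h] ** 2 - (scores[i] - scores[h]) ** 2, 0.0))
             for h in range(n)] for i in range(n)]


def sum_of_squares(D):
    """Sum of squared distances over all pairs of points."""
    n = len(D)
    return sum(D[i][h] ** 2 for i in range(n) for h in range(i + 1, n))


# Four invented points in a plane; points 0 and 1 are the chosen endpoints
points = [(0.0, 0.0), (4.0, 0.0), (1.0, 1.0), (3.0, -1.0)]
D = [[math.dist(p, q) for q in points] for p in points]

scores = project_on_axis(D, 0, 1)
R = residual_distances(D, scores)
percent_variance = 100.0 * (1.0 - sum_of_squares(R) / sum_of_squares(D))

print([round(s, 3) for s in scores])
print(round(percent_variance, 2))
```

Because the endpoints lie along the horizontal direction, the scores recover the horizontal coordinates and the residual distances hold what is left in the vertical direction.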

PO SuggestedProcedure:Step1
First, pick distance measure
Distance: Sorensen
Second, select End Points
Let's try Subjective
Third, Geometry / Residuals
Recommend City-Block
NOTE: # Axes only changes reported results not solution.
Always try more than 1. Set List Residual Matrix = 0

PO SuggestedProcedure
Next, pick number of subjective axes
Note: Possible that
objective axes capture
more variation than the
subjective axis selected

Method for determining the
(remaining) objective axes

PO Results
Examine Results.txt file: Settings / Options

Ordination axes in order of variance explained
showing endpoints chosen and sample scores

PO Results
Examine scores on axis 1: Results.txt file

Examine performance of axes 2 and 3: Results.txt file

End Points: 88 & 92
Variance: 55.79%
End Points: 86 & 02
Variance: 16.05%

PO WhattoReport
Distance Metric Used (Use Sorensen)
Method for selecting End Points: what are they?
Use subjective for axis 1
Select City block Distance / Residuals: similar to NMS
Use Variance-regression method for additional axes
Number of dimensions considered
Always use more than 1
Proportion of variance explained by each axis (r²)
Bi-plot ordination plots
Correlations of env. data / species with axes (Tau)
Orthogonality of axes: May be lower than NMS / PCA

NOTE: No randomizations (p values): working on it.

MRPP Applications
Multi-response Permutation Procedure
(MRPP) is a non-parametric approach for
testing the hypothesis of no differences
between two or more groups of entities
(species, variables). Related methods: MRBP, ANOSIM, Qb
These pre-existing groups can be
defined using groups of samples on the
basis of categorical data:
The presence / absence of given species
Categories of environmental variables
(e.g., early vs. late)
Recommendation: This procedure yields a p value;
interpretation requires further exploration (e.g., indicator species)

MRPP Pros/Cons
Advantages:
Ideal for evaluating specific hypotheses:
differences between groups of samples

Disadvantages:
Cannot investigate interaction terms
Interpretation: difficult to determine which species are
contributing to these differences in community
composition; requires additional exploration of the data
Recommend: Use Indicator Species Analysis

MRPP HowitWorks
Setting Up:
Include a Grouping Variable in the Main / Second matrix
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
within each of the pre-defined groups we are testing

[Figure: samples assigned to Group 1 and Group 2]
Shuffle data and
recalculate distances,
for all possible
arrangements of
samples into groups

MRPP HowitWorks
Calculate distance matrix, D
Calculate average distance xi within each group i
Calculate delta (weighted mean within-group distance):
delta = C1 * x1 + C2 * x2 + ... + Cg * xg
Note: for g groups, where Ci is a weight that
depends on the number of items in the groups
(Ci = ni / N, where ni is the number of items in
group i and N is the total number of items)
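
A minimal sketch of the delta calculation, using the weights Ci = ni / N given above. The function names and the toy data (six samples on a single gradient, with absolute differences standing in for the Sorensen distance) are invented, and every group is assumed to contain at least two samples.

```python
from itertools import combinations


def mean_within_distance(D, members):
    """Average distance x_i over all pairs of samples within one group."""
    pairs = list(combinations(members, 2))
    return sum(D[i][j] for i, j in pairs) / len(pairs)


def mrpp_delta(D, groups):
    """delta = sum over groups of C_i * x_i, with C_i = n_i / N."""
    n_total = len(groups)
    delta = 0.0
    for label in sorted(set(groups)):
        members = [i for i, g in enumerate(groups) if g == label]
        weight = len(members) / n_total
        delta += weight * mean_within_distance(D, members)
    return delta


# Toy data: six samples on one gradient, two groups of three
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(a - b) for b in values] for a in values]
groups = [1, 1, 1, 2, 2, 2]

delta = mrpp_delta(D, groups)
print(round(delta, 4))
```

Tight groups give a small delta. Mixing the two clusters into each group would raise it.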

MRPP HowitWorks
Permutations:

M = N!/(n1! * n2!)

Determine probability of a delta this small or smaller

[Figure: Species by SU data matrix next to a Groups column assigning each SU to a group (1, 1, 2, 2, 3, etc.)]

MRPP HowitWorks
Calculating the p value:
Determine probability of a delta as small or smaller

MRPP HowitWorks
Output:
Test Statistic T: measures separation between groups
A: chance-corrected within-group agreement (effect size)

BEWARE: DO NOT over-interpret T and A: Ongoing Discussion

P-value: Null Hypothesis:
within-group distance the same as across-group distances
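
For a very small data set the whole procedure can be enumerated exactly. The sketch below lists all M = N!/(n1! * n2!) arrangements of six toy samples into two groups of three, then computes the p value, A = 1 - observed delta / expected delta, and T = (observed delta - expected delta) / s.d. of delta. The data and function name are invented, and exhaustive enumeration is an assumption made for illustration: it is only feasible for tiny N, and software typically approximates the distribution of delta instead.

```python
from itertools import combinations
from math import factorial
from statistics import mean, pstdev


def delta_two_groups(D, group1, group2):
    """Weighted mean within-group distance for two groups (C_i = n_i / N)."""
    n = len(group1) + len(group2)
    total = 0.0
    for members in (group1, group2):
        pairs = list(combinations(members, 2))
        x = sum(D[i][j] for i, j in pairs) / len(pairs)
        total += (len(members) / n) * x
    return total


# Toy data: six samples on one gradient; observed groups are the two clusters
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(a - b) for b in values] for a in values]
observed_g1 = (0, 1, 2)
n1, N = len(observed_g1), len(values)

# Number of possible arrangements for two groups
M = factorial(N) // (factorial(n1) * factorial(N - n1))

# Delta for every possible arrangement of samples into the two groups
deltas = []
for g1 in combinations(range(N), n1):
    g2 = tuple(i for i in range(N) if i not in g1)
    deltas.append(delta_two_groups(D, g1, g2))

observed = delta_two_groups(D, observed_g1, (3, 4, 5))
p_value = sum(d <= observed + 1e-12 for d in deltas) / len(deltas)
A = 1.0 - observed / mean(deltas)                  # within-group agreement
T = (observed - mean(deltas)) / pstdev(deltas)     # separation between groups

print(M, round(p_value, 3), round(A, 3), round(T, 3))
```

With only 20 possible arrangements the smallest attainable p value is 0.1 (the observed split and its mirror image), which shows why very small samples cannot reach p < 0.05 even when the groups are clearly separated.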

MRPP SuggestedProcedure:Step1
First, pick
distance measure
Distance: Sorensen
Second, select
Weights of Groups
Recommend:
n / sum (n)
Third, use Ranks
Useful for very heterogeneous data
More comparable to NMS

MRPP Results
Examine Results.txt file: Distribution of samples into groups

MRPP Results
Examine Results.txt file: T & A Statistics
Smaller
observed delta
A>0
(more similar
within groups)
Significant result: p < 0.05
Fairly small Output: NO bi-plots, NO variance explained

MRPP WhattoReport
Distance Metric Used (Use Sorensen)
How groups were defined: relate back to hypothesis
Chance-corrected within-group agreement (A)
Associated p value

PO/MRPP References
Polar Ordination:
Bray, J. R. and J. T. Curtis. 1957. An ordination of upland
forest communities of southern Wisconsin. Ecological
Monographs 27: 325-349.
Beals, E. W. 1984. Bray-Curtis ordination: an effective
strategy for analysis of multivariate ecological data. Advances
in Ecological Research 14: 1-55.
MRPP:
Mielke, P. W., Jr. 1991. The application of multivariate
permutation methods based on distance functions in the earth
sciences. Earth-Science Reviews 31:55-71.
Zimmerman, G. M., H. Goetz, and P. W. Mielke, Jr. 1985. Use
of an improved statistical method for group comparisons to
study effects of prairie fire. Ecology 66: 606-611.

ForthePeer Review
Look for a gradient (one axis): Polar Ordination
Compare groups: MRPP

Suggestions:
- If you have a categorical value (in canyon / outside): MRPP
- If you have continuously changing samples (across latitude or
depth): you can test for N / S OR shallow / deep gradients
- If you have the diet or habitat of multiple species, you can use
them as groups
