Objectives:
Introduce the theory of Multivariate Analysis
Discuss the nature of community composition data
Learning Outcomes:
Understand the principles underlying Multivariate Analysis
Become familiar with matrix data and their use in PC-ORD
Multivariate Data Analysis
Why are species data cross-correlated ?
If there are no specific niches:
Species are still using a finite
amount of resources
Community Analysis: Introduction
classification versus ordination
Classification is the placement of species and / or
sample units into (discrete) groups
Ordination is the arrangement or ordering of species
and / or sample units along gradients
What is an Ecological Community?
Two views have dominated the debate over the nature of
ecological communities since the 1920s:
Clements: discrete unit
Gleason: loose assemblage of species
What is an Ecological Community?
Clements' Perspective:
Discrete entities with recognizable boundaries
The community is fully integrated functionally
Species have coevolved, enhancing their interdependence
Gleason's Perspective:
Community is a chance association of species with
similar adaptations and ecological requirements
No distinct boundaries where communities meet
Ecological Communities: Scale Matters
An ecozone or biogeographic realm is the largest-scale biogeographic
division of the earth's surface, based on historic and evolutionary
distribution patterns of organisms
What do Community Data Look Like?
[Figure: log-normal distribution of species abundances; axes: number of species, abundance of species]
Properties of Community Data
The number of important factors is small:
A few factors can explain the majority of the
explainable variation (variance)
Especially if we create synthetic variables:
combinations of multiple collinear variables
There is substantial noise in the data:
Even under ideal circumstances, replicate
samples vary from each other due to stochastic
events, and potentially through observer error
We can use this overlap (or smear) to quantify
the degree of sample similarity along gradients
Properties of Community Data
There is much redundant information:
species often share similar distributions
replicate samples vary from each other
This redundancy is at the core of multivariate statistics
DATA
Environmental
Variables
Species
Sample Units
Patterns
Associations Between:
Species * Environmental
Variables * Sample Units
Format of Community Data
Multivariate methods operate on a community data matrix
(or species-by-sample matrix).
A community data matrix is often written with taxa
(species) as rows and samples as columns.
PC-ORD instead expects the transposed layout, which it
terms a NORMAL matrix:
Rows: Sample Units
Columns: Species
Format of Community Data
Normal Matrix:
          Spp1   Spp2   Spp3
Sample1   10     2      1
Sample2   0.50   0.40   0.10
Sample3   1      0      1

          Char1  Char2  Char3
Sample1   Oahu   Kauai  Oahu
Sample2   0.53   0.47   0.12
Sample3   25     37     56
Format of Community Data
In linear algebra, the transpose of a matrix $A$ is another
matrix $A^{T}$ (also written $A^{tr}$, ${}^{t}A$, or $A'$) created by:
writing the rows of $A$ as the columns of $A^{T}$
writing the columns of $A$ as the rows of $A^{T}$
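The transpose can be sketched in a few lines of Python. This is only an illustration: the small matrix reuses the Spp / Sample values shown earlier, and the function name `transpose` is an assumption for this sketch, not a PC-ORD command.

```python
# Illustrative "normal" matrix: rows = sample units, columns = species.
samples = ["Sample1", "Sample2", "Sample3"]
species = ["Spp1", "Spp2", "Spp3"]
normal = [
    [10, 2, 1],
    [0.50, 0.40, 0.10],
    [1, 0, 1],
]

def transpose(matrix):
    """Write the rows of the matrix as columns (and the columns as rows)."""
    return [list(column) for column in zip(*matrix)]

# Transposed matrix: rows = species, columns = sample units.
transposed = transpose(normal)
for name, row in zip(species, transposed):
    print(name, row)
```

Transposing twice returns the original matrix, which is a quick check that no values were lost when switching between the two layouts.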
PC-ORD: Install the Software
Open Help and Read:
Getting Started
Using PC-ORD
Introduction
Using PC-ORD for Multivariate Data Analysis
PC-ORD was written for community ecologists
and includes procedures best suited for species data
This class will help you in three areas:
1. to recognize which techniques are appropriate for
specific data types and analysis objectives
2. to become familiar with how to use PC ORD within
the context of a proper analysis process
3. to understand what PC ORD does and when you
might need specific features
Using PC-ORD for Multivariate Data Analysis
In this class we will cover the following issues / techniques:
General use of multivariate analysis
using PC ORD
summarizing data / results
Data analysis flow
data transformations
selecting a distance measure
Ordination techniques in PC-ORD:
PCA, NMS, polar ordination
Grouping techniques in PC-ORD:
MRPP, indicator species analysis, Mantel tests
PC-ORD: Example of Datasets
PC-ORD works with multiple datasets arranged in matrices:
a MAIN matrix and a SECONDARY matrix
PC-ORD: File Types
Data Matrices: Main / Secondary
Main = Data of Interest; Secondary = Associated Data
(*.wk1 spreadsheet files) Created with Excel
Can be exported from Excel in the WK1 (1-2-3) worksheet format
Main Matrix Format
There are certain conventions:
List number of sample
units and species
Name of sample units
(list them on column 1)
Describe variable
types and names
(list them on rows)
Add data on columns
(species abundance)
Second Matrix Format
There are certain conventions:
List number of sample
units and variables
Name of sample units
(list them on column 1)
Describe variable
types and names
(list them on rows)
Add data on columns
(environmental data)
Main / Secondary Matrix Format
1. Cells A1 and B1 indicate the number of entities (rows) and
what the entities are (8-character maximum).
2. Cells A2 and B2 indicate the number of attributes (columns)
and what the attributes are (8-character maximum).
3. Row 3 contains single letters which indicate the variable
type for each column. There are three acceptable values:
Q = Quantitative C = Categorical M = Mixed
4. Row 4 contains the name for each column variable. ONLY
THE FIRST 8 characters are used.
5. Rows 5 and below contain a row name in the first cell,
followed by numeric values for each column attribute. Row
names, like the column names, are limited to 8 characters.
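The five conventions above can be sketched as a small Python helper that lays out spreadsheet rows before export. This is a hedged illustration only: the function name `build_matrix_rows`, the example labels, and the choice to leave the first cell of rows 3 and 4 empty are assumptions, not documented PC-ORD behaviour.

```python
def build_matrix_rows(row_label, col_label, col_types, col_names, data):
    """Lay out spreadsheet rows following the conventions listed above.

    data maps each row name to its list of numeric values.
    The empty first cell in rows 3 and 4 is an assumption of this sketch.
    """
    if len(col_types) != len(col_names):
        raise ValueError("one type letter is needed per column")
    if any(t not in ("Q", "C", "M") for t in col_types):
        raise ValueError("type letters must be Q, C or M")
    rows = [
        [len(data), row_label[:8]],               # A1, B1: number and kind of entities
        [len(col_names), col_label[:8]],          # A2, B2: number and kind of attributes
        [""] + list(col_types),                   # row 3: variable type per column
        [""] + [name[:8] for name in col_names],  # row 4: names, first 8 characters only
    ]
    for name, values in data.items():
        if len(values) != len(col_names):
            raise ValueError(f"row {name!r} has the wrong number of values")
        rows.append([name[:8]] + list(values))    # rows 5+: row name, then values
    return rows

rows = build_matrix_rows(
    "plots", "species", ["Q", "Q", "Q"],
    ["Spp1", "Spp2", "LongSpeciesName"],
    {"Sample1": [10, 2, 1], "Sample2": [0.5, 0.4, 0.1], "Sample3": [1, 0, 1]},
)
for row in rows:
    print(row)
```

Truncating names in the helper makes the 8-character limit visible before import, so that two long names do not silently collapse into the same label.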
Import Options
Several formats allowed from Excel
NOTE: The same formatting rules hold for all import file types
PC-ORD: Other File Types
Graph Row / Graph Column = Ordination scores for rows
and columns (*.gph text files) created by PC-ORD
Result = Procedure output (*.txt text file) created by PC-ORD
Dendrograms / Plots = (*.den) (*.str) created by PC-ORD
Project = After saving a main, second, graph and result file,
they can be grouped together (*.prj file) specific to PC-ORD
Sample Dataset: Bryophyte mats from Oregon
Datasets posted on the course web-page:
aMoss1M = normal matrix of average relative % volume
of epiphytic bryophytes
aMoss2M = environmental matrix with measurements from
the same sample units where bryophytes were sampled
Objectives:
Investigate association of different species in space
Study environmental correlates of species distributions
Start Exploring PC-ORD
Objectives:
Discuss species-environment associations
Explore the nature of community composition data
Learning Outcomes:
Understand the value of bivariate plots and species / env. plots
Become familiar with graphing options and formatting in PC-ORD
Species on Environmental Gradients: Ideal
Robert H. Whittaker (Ed), Classification of Plant Communities,
1978 (Handbook of Vegetation Science), Kluwer
Academic Publishers
Ideal species distributions across environmental gradients:
Gaussian Response: Smooth normal curves
Characterized by: mean + SD, peak
Linear Response:
Smooth lines
Characterized by range, slope
Gaussian (Normal) Distribution
Linear Distribution
Again: A Matter of Scale
Study 1 (4-6 km depth): Species prefers shallow habitat
Study 2 (0-3 km depth): Species prefers deep habitat
Study 3 (3-4 km depth): Species shows no preference
[Figure labels: Gaussian, Linear]
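The scale effect can be sketched numerically: one Gaussian response, sampled over different depth windows, looks like a different "preference" in each study. The curve parameters (peak 100, optimum 3.5 km, SD 1 km) and the three depth windows (0-3, 3-4 and 4-6 km) are assumptions chosen for illustration, not values from the studies.

```python
import math

def gaussian_response(x, peak, mean, sd):
    """Ideal Gaussian species response: abundance at gradient position x."""
    return peak * math.exp(-((x - mean) ** 2) / (2 * sd ** 2))

def ols_slope(xs, ys):
    """Slope of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def slope_over(lo, hi, steps=21):
    """Fit a straight line to the same Gaussian curve, seen only from lo to hi."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ys = [gaussian_response(x, peak=100.0, mean=3.5, sd=1.0) for x in xs]
    return ols_slope(xs, ys)

for label, (lo, hi) in {"0-3 km": (0, 3), "3-4 km": (3, 4), "4-6 km": (4, 6)}.items():
    print(label, round(slope_over(lo, hi), 2))
```

The shallow window gives a positive slope (abundance rises with depth), the deep window a negative slope, and the window centred on the optimum a slope near zero, even though the underlying response never changes.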
Species on Environmental Gradients: Real
3 important issues to consider:
Zero-truncation Problem:
Solid Curves:
Complex Curves:
Points represent species abundance
Species on Environmental Gradients: Real
[Figure: species abundance along environmental gradients; panel annotations: Linear Regression, Fitted Envelope, r = 0.30, r = 0.34, r = 0.21, r = 0.44, Abrupt Ranges (Boundaries), Multiple Modes, Linear Responses, Peaks (Optima)]
Community Analysis: Bivariate Plots
It is more fruitful to explore how pairs of species
abundances are related, using bivariate plots.
[Figure: abundance of Species B against Species A; joint absences at (0, 0), joint occurrences at (lots, lots); panels: perfect correlation, no correlation]
Community Analysis: Bivariate Plots
Bivariate plots from pairs of species responses
to the same environmental gradients
[Figure: positively associated and negatively associated species pairs; both have joint absence (0, 0) data; joint occurrence versus single occurrence]
Community Analysis: Correlations
Normal Clouds
r should be positive
Dust Bunnies
r should be negative
Community Analysis: Correlations
Beware:
As we sample beyond their
habitats, we record more
and more joint absences
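The effect of joint absences on a correlation coefficient can be sketched with a few made-up numbers. The two abundance vectors below are hypothetical: within their shared habitat the species are perfectly negatively related, and the extra samples taken outside the habitat record both species as absent.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical abundances of two species in 5 sample units inside their habitat.
species_a = [5, 4, 3, 2, 1]
species_b = [1, 2, 3, 4, 5]
r_inside = pearson_r(species_a, species_b)

# The same data plus 20 sample units beyond the habitat: joint absences (0, 0).
joint_absences = [0] * 20
r_with_absences = pearson_r(species_a + joint_absences, species_b + joint_absences)

print(round(r_inside, 3), round(r_with_absences, 3))
```

In this sketch r moves from -1 to roughly +0.57 purely because of the added (0, 0) points, which is why the sampled extent needs to be considered before interpreting a species correlation.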
Summary
Plots of species responses (abundance) to single
environmental variables are informative:
unimodal multimodal
linear / normal
peak (optimum)
Bivariate plots (sp1 vs sp2) are more informative, since
they allow us to integrate all the possible responses
of the species to the environmental variables
Typical responses: normal cloud and dust bunnies
PC-ORD: Graphing Tools
PC-ORD: Scatterplot Figures
Examples of Scatterplots:
Single (2 species)
Multiple (Matrix)
PC-ORD: Dominance Curves
Dominance curves: To study distribution
of abundance among species in sample.
How are individuals distributed across species?
Program creates result.txt file (with all species)
Species - name
RankAbun - ranking abundance (1: most numerous)
Log(SumAbund) - log (base 10) of total sum
Sum - sum of all abundance data for species
RankFreq - ranking frequency (1: most frequent)
Freq - frequency (number of non-zero counts)
Mean - average
S.Dev. - SD
CV% - 100 * (SD / Mean)
V/M - Variance / Mean ratio
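The columns of this summary table can be reproduced with a short Python sketch. The abundance values are hypothetical, ties in the rankings are not handled, and the use of the sample standard deviation (n - 1 denominator) is an assumption; PC-ORD's exact formulas are not stated in the source.

```python
import math
import statistics

# Hypothetical abundance data: species -> abundance in each sample unit.
abundance = {
    "SppA": [10, 0, 4, 6],
    "SppB": [1, 1, 0, 0],
    "SppC": [0, 0, 0, 8],
}

def summarize(values):
    """Per-species summary statistics named after the result-file columns."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
    return {
        "Sum": sum(values),
        "Log(SumAbund)": math.log10(sum(values)),
        "Freq": sum(1 for v in values if v != 0),
        "Mean": mean,
        "S.Dev.": sd,
        "CV%": 100 * sd / mean,
        "V/M": sd ** 2 / mean,
    }

table = {name: summarize(values) for name, values in abundance.items()}

def add_rank(table, key, rank_name):
    """Rank species from largest (rank 1) to smallest value of `key`."""
    ordered = sorted(table, key=lambda name: table[name][key], reverse=True)
    for rank, name in enumerate(ordered, start=1):
        table[name][rank_name] = rank

add_rank(table, "Sum", "RankAbun")
add_rank(table, "Freq", "RankFreq")

for name, stats in table.items():
    print(name, stats["RankAbun"], stats["Sum"], stats["RankFreq"], stats["Freq"])
```

Plotting Log(SumAbund) against RankAbun for all species gives the dominance curve itself.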
PC-ORD: Dominance Curves
PC-ORD: Distributions
[Figure: distribution plots comparing the observed data with reference data]
PC-ORD: Distributions
The Main Matrix Distributions and Second Matrix
Distributions ask the following:
Variable (Choose one variable from the matrix)
Distribution type (Discrete or Continuous)
Curve steps (number of increments along the axis )
Constant for lognormal distribution (leave default)
Your Task
Start Exploring PC-ORD using bryophyte matrices (1, 2)
Load data (import and open)
Format and export figures
Make scatterplots (species pairs & species vs. env. data)
Find a good example of a dust bunny distribution
Find a good example of a normal cloud distribution
Try other plots: scatterplot matrix, dominance, distribution
Create dominance curve for matrix 1
Pick one species and make distribution plot
Quantifying Distances
Objectives:
Discuss Similarity and Dissimilarity
Go over the rules for selecting a Distance Measure
Learning Outcomes:
Understand the value of Distance metrics
Become familiar with the approach for calculating Distance metrics
Distance Measures
The first step of most multivariate analyses involves creating a
matrix of distances or similarities for all pairs of samples
This step is extremely important:
If information is ignored, it will not be included in the results
If noise / outliers are exaggerated, they add distorting influences
There are a myriad of indices and lots of details.
Let's begin with the general principles:
Distance = Difference
Distance Concepts
Resemblance can be quantified in two ways:
dissimilarity or similarity
These two metrics can be translated:
Similarity = 1 - Dissimilarity
Distance metrics can be applied to a variety of data:
Quantitative (discrete / continuous)
Binary (Presence / Absence)
Distances calculated among two types of objects:
either the rows or the columns of the primary data matrix
Sample unit distances -> sample space
Species distances -> species space
Species Space vs Sample Space
[Figure: the same data plotted two ways. Species space: sample units (SU A, SU B) plotted on axes Species 1 and Species 2. Sample space: species (e.g., Sp 2) plotted on axes Sample Unit A and Sample Unit B.]
Types of Distance Metrics
Three types: Metric, Semimetric, Non-Metric
A good metric needs to meet four rules:
Minimum distance value is 0 (e.g., for identical samples)
When two items differ, distance > 0
Distances are symmetrical:
from A to B = from B to A
Triangle Axiom: Whenever we have 3 objects
Any one pair-wise distance
(between any two objects)
CANNOT be larger than the
sum of the other two distances
Types of Distance Metrics
Rules:
Minimum Distance = 0
Distance > 0
Symmetrical Distances
Triangle Axiom
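The four rules can be checked mechanically for any distance function. The sketch below is illustrative: the helper `check_metric_rules` and the three test points are assumptions. It compares Euclidean distance (a metric) with a Sorensen-type distance, 1 - 2w / (A + B) with w the shared abundance, which breaks the triangle axiom for these points and therefore behaves as a semimetric.

```python
import math
from itertools import combinations, permutations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sorensen(a, b):
    """Sorensen-type distance: 1 - 2w / (A + B), w = shared abundance."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))

def check_metric_rules(dist, points, tol=1e-12):
    """Test the four rules on every pair / triple of the given points."""
    rules = {"minimum is 0": True, "different > 0": True,
             "symmetric": True, "triangle axiom": True}
    for p in points:
        if abs(dist(p, p)) > tol:
            rules["minimum is 0"] = False
    for p, q in combinations(points, 2):
        if p != q and dist(p, q) <= tol:
            rules["different > 0"] = False
        if abs(dist(p, q) - dist(q, p)) > tol:
            rules["symmetric"] = False
    for p, q, r in permutations(points, 3):
        # No pair-wise distance may exceed the sum of the other two.
        if dist(p, q) > dist(p, r) + dist(r, q) + tol:
            rules["triangle axiom"] = False
    return rules

# Three hypothetical sample units described by two species.
points = [(1, 0), (0, 1), (1, 1)]
print(check_metric_rules(euclidean, points))
print(check_metric_rules(sorensen, points))
```

For these points the Sorensen distance between (1, 0) and (0, 1) is 1, while each is only 1/3 away from (1, 1), so the direct distance exceeds the sum of the other two.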
Examples: Distance of species
Examples: Distance of samples
We have 2 sets:
A = {2,3,4,6,7}
B = {1,4,5,7,8}
Selecting a Distance Metric
Two things to consider:
Input: Acceptable domain of input data
Presence / Absence vs Count Data
Positive / Negative values
Output: Range of output distances
Within bounded range
Meet triangle axiom
Selecting a Metric: Proportional Overlap
The Jaccard index measures similarity
between samples, and is defined as the size
of the intersection divided by the size of the
union of the sample sets (Jaccard 1901):
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
The Jaccard distance, which measures
dissimilarity between samples, is obtained
by dividing the difference of the sizes of the
union and the intersection of two sets by the
size of the union:
$J_{\delta}(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
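As a worked example, the two sets given earlier (A = {2,3,4,6,7} and B = {1,4,5,7,8}) can be run through these definitions. The function names are assumptions of this sketch; the sets share 2 elements out of 8 in their union.

```python
a = {2, 3, 4, 6, 7}
b = {1, 4, 5, 7, 8}

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """(|union| - |intersection|) / |union|, i.e. 1 - similarity."""
    return 1 - jaccard_similarity(a, b)

print(sorted(a & b))             # [4, 7]
print(jaccard_similarity(a, b))  # 0.25
print(jaccard_distance(a, b))    # 0.75
```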
Selecting a Metric: Proportional Overlap
JS = w / (A + B - w)
JD = 1 - JS = (A + B - 2w) / (A + B - w)
where w is the abundance shared by the two samples, and A and B are their total abundances
Properties:
Proportion of combined
abundance not shared
The Sorensen index (Sorensen 1948), also known as Sorensen's
similarity coefficient (QS), is defined as:
$QS = \frac{2C}{A + B}$
Where:
A and B are the number of species in samples A and B,
respectively, and C is the number of shared species.
Selecting a Metric: Proportional Overlap
SS (Bray-Curtis) = 2w / (A + B)
SD = 1 - SS = (A + B - 2w) / (A + B)
Properties:
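Selecting a Metric: Continuous Distance
Euclidean Distance (widely used):
$ED_{ih} = \sqrt{\sum_{j} (a_{ij} - a_{hj})^{2}}$
where $a_{ij}$ is the abundance (A) of species j in sample unit i,
and i and h are the two sample units being compared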
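The proportional-overlap and Euclidean measures can be compared on the same pair of sample units. This is a minimal sketch: the two abundance vectors are illustrative, the function names are assumptions, and w is taken as the sum of the smaller abundance of each species in the two samples.

```python
import math

def shared_abundance(x, y):
    """w: abundance shared by two samples (sum of species-wise minima)."""
    return sum(min(a, b) for a, b in zip(x, y))

def jaccard_distance(x, y):
    w = shared_abundance(x, y)
    return 1 - w / (sum(x) + sum(y) - w)

def sorensen_distance(x, y):
    w = shared_abundance(x, y)
    return 1 - 2 * w / (sum(x) + sum(y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative abundances of three species in two sample units.
sample_1 = [10, 2, 1]
sample_3 = [1, 0, 1]

print(round(jaccard_distance(sample_1, sample_3), 3))
print(round(sorensen_distance(sample_1, sample_3), 3))
print(round(euclidean_distance(sample_1, sample_3), 3))
```

The two overlap measures stay between 0 and 1, whereas the Euclidean distance is unbounded and is dominated here by the single large difference in the first species.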
Summary: Distance Metrics
Many distance indices are available. How to select?
Consider 4 rules: metric / semimetric
Look for proportionality (scaled from 0 to 1)
Think about what makes intuitive sense to you
Check what indices are compatible with given tests
The choice of a distance metric is based on empirical
evidence (e.g., methodological studies, previous literature)
Recommendations: (According to PC ORD)
Sorensen index shown to be effective for assessing
species and sample similarity (community data)
Euclidean distance well suited for environmental data
Summary: Recommendations
Sorensen:
Quantifies proportion shared abundance among species
Works well for community data (empirically)
Relative Sorensen:
Includes general relativization (by totals)
Summary: Recommendations
Euclidean:
Sensitive to outliers
Bad with community data (lots of 0s)
Relativized Euclidean:
Euclidean after scaling abundances to %s
Focuses on relative abundance among species
Summary: Distance Metrics
Additional considerations (dataset specific):
Are the data very noisy?
Is there a lot of variation in the data?
Are there many 0s in the data?
Are the environmental responses not normal?
PC-ORD: Advisor Tools
PC-ORD: Current Profile
PC-ORD: Profile Warnings
Notes are written at the end of the profile, if certain conditions
are encountered. These are listed below.
1. If fields are filled with asterisks, then they could not be
calculated -- incompatible with the data.
2. Negative numbers in main matrix are incompatible with CV
3. Negative numbers in second matrix are incompatible with CV
4. Warning: one or more CV (coefficient of variation) could not
be calculated; replaced with missing value indicator: 99999.99
5. Negative numbers are present in main matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.
6. Negative numbers are present in 2nd matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.
Ordination Methods: PCA
Objectives:
Discuss PCA within context of Ordination Methods
Go over the output of Ordination Methods
Learning Outcomes:
Become familiar with the approach for doing a PCA
Ordination
Arranging items (samples / species) along one or more axes
Graphical summarization of complex relationships
Extracting one or more dominant patterns % variance
Synthesis (reduction) of large datasets into fewer variables
These variables are then related to environmental variables
Components are
independent
from each other
Ordination Diagrams
Typically, a 2-dimensional plot of samples / species in terms
of synthetic axes (combinations of variables)
Ideally, the distance between points in ordination space is
proportional to the underlying distance measures
NOT LIKE A REGRESSION
[Figure: example ordination diagrams of sample units (SU1-SU19; STAND1-STAN19) on Axis 1 vs Axis 2, with overlays for Festuca idahoensis and topographic class (draw, flat, slope, ridge)]
Ordination Results
How many axes? How many discrete signals are in the dataset?
Different rules for selecting number of axes
Significance assigned to axes and contributing variables
Yet, most studies select 2 or 3 axes
Interpretation of results: overlays, correlations with axes
Ordination Results
How many axes? Number of discrete signals in dataset
Coefficient of determination: % variance represented
vs absolute (PCA eigenvalues)
Ordination Results
Interpretation of results: overlays, correlations with axes
Conclusions:
Species ALSA negatively correlated with Axis 1. Variance explained = 25%
[Figure annotations: r = +0.031, tau = +0.045; r = -0.534, tau = -0.327]
Ordination Results
Beware when interpreting correlation coefficients:
outliers can have strong influence
coefficients are meaningless if relationships are not linear
correlation coefficients are invalid with binary data
Graphical representation of environmental correlates using
Joint Plots: the angle and length of each vector indicate the
direction and strength of an environmental variable's
association with the ordination axes
[Figure: joint plot on Axis 1 vs Axis 2 with vectors for Bryo%, Age, Last burn, Canopy, Radiation, Topo position, UnderRad]
Ordination Results
External Evaluation: Correlations with second matrix
(e.g., how are environmental variables related to the axes?)
Beware of biases: proving the expected results
Comparisons with null model: Comparing results of
real data with those from randomized data is promising, but
can yield no clear results if strong outliers cause spurious
significant patterns with the random dataset. Beware
Principal Components (PCA)
Describing a system of points in multiple dimensions
using best-fit straight lines (Pearson 1901)
$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots$
Start with cloud of n points
in p-dimensional space
Center the axes in the
point cloud (centroid)
Rotate axes to maximize
the variance along axes
As rotation angle changes,
the variance changes
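The centre-and-rotate idea can be sketched for two variables without any matrix library. This is an illustration under stated assumptions: the six data points are hypothetical, and the brute-force scan over rotation angles stands in for the eigenanalysis that PCA software actually performs.

```python
import math

# Hypothetical data: two variables measured in 6 sample units.
points = [(2.0, 1.0), (3.0, 2.5), (4.0, 3.0), (5.0, 4.5), (6.0, 5.0), (7.0, 6.5)]

def center(points):
    """Move the origin of the axes to the centroid of the point cloud."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(x - mx, y - my) for x, y in points]

def variance_along(centered, angle):
    """Variance of the points projected onto an axis rotated by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    scores = [x * c + y * s for x, y in centered]
    return sum(v * v for v in scores) / (len(scores) - 1)

centered = center(points)

# Rotate the axis in 0.1 degree steps and keep the angle with maximum variance.
angles = [math.radians(d / 10) for d in range(0, 1800)]
best = max(angles, key=lambda a: variance_along(centered, a))

var_pc1 = variance_along(centered, best)                # first axis
var_pc2 = variance_along(centered, best + math.pi / 2)  # perpendicular axis
total = variance_along(centered, 0.0) + variance_along(centered, math.pi / 2)

print(round(math.degrees(best), 1), round(var_pc1 / total, 3))
```

Rotation redistributes the variance between the axes but does not change the total, so the share captured by the first axis can be read directly as variance explained.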
Principal Components (PCA): When to Use
Normality: Ideal for normal data with approximately linear
relationships amongst variables. This is rarely the case for community data
Beware of heterogeneous community data
Critical to justify the use of this linear approach
Sample size: Need a good estimate of correlation structure
Stronger patterns require smaller sample sizes
Rule of thumb: 5 sample units per variable
(Tabachnick and Fidell 1989)
Increasing the number of variables strengthens results
(Pillar 1999)
Principal Components (PCA): Normality
Assessed with skewness (asymmetry) / kurtosis (peakiness)
skew= 0
(normal data)
skew > 0 (right tail too long) skew < 0 (left tail too long)
kurtosis = 0
(normal data)
kurtosis > 0 (more peaky)
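Skewness and kurtosis can be computed from central moments. The sketch below uses the simple moment-based definitions (kurtosis reported as excess kurtosis, so that normal data give 0); this is an assumption, and PC-ORD or other packages may apply small-sample corrections. The two data vectors are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment (population form, divided by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """0 for symmetric data; > 0 when the right tail is too long."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """0 for normal data; > 0 for more peaked distributions."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 1, 1, 2, 10]  # many zeros and one large value

print(round(skewness(symmetric), 3), round(excess_kurtosis(symmetric), 3))
print(round(skewness(right_tailed), 3), round(excess_kurtosis(right_tailed), 3))
```

Community abundance data typically resemble the second vector, which is why skewness is worth checking before running a PCA.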
Principal Components (PCA): Linearity
Use bivariate scatterplots to assess linear relationships
Solutions: Transform the data
[Figure annotation: r = -0.96]
Principal Components (PCA): Reporting
Which type of cross-correlation matrix did you use?
Correlation or Covariance - Use Euclidean distance
PrincipalComponents(PCA) Example
Setup:
PCA uses only Euclidean
Distances (real metric)
Matrix can be calculated
in three ways:
Correlation: very susceptible
to outliers (DO NOT USE)
Variance / Covariance: Less
sensitive to outliers (USE)
Non-centered: Experimental
(DO NOT USE)
Principal Components (PCA): Example
Setup II:
Scores for species can be calculated in two ways:
Distance-based: Relates species and samples to each axis;
represents species as vectors from the centroid
(Standard) USE
Weighted Average: Species
represented as points outliers
(DO NOT USE)
PrincipalComponents(PCA) Example
Setup III:
Output Options:
Cross-Product Matrix:
Shows pair-wise distances
(USE)
Randomization tests: Use
bootstrap to assess
significance of the results
(USE)
PrincipalComponents(PCA) Example
Variance
Distance-based
Cross-product
Matrix
Randomization
Principal Components (PCA): Example
Setting up the Randomization Test:
Seed: make multiple tests comparable by using the same
sequence of random numbers (supply seed)
Runs: number of permutations used in the test
(determines the smallest p-value the test can report)
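The roles of the seed and the number of runs can be sketched with a small randomization test. Everything here is an assumption made for illustration: the data are hypothetical, only two variables are used (so the first eigenvalue of the correlation matrix is simply 1 + |r|), values are shuffled within one variable, and the p-value is computed as (1 + number of randomized eigenvalues at least as large as the observed one) / (1 + runs).

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def first_eigenvalue(xs, ys):
    """Largest eigenvalue of the 2 x 2 correlation matrix [[1, r], [r, 1]]."""
    return 1 + abs(pearson_r(xs, ys))

def randomization_test(xs, ys, runs, seed):
    """Shuffle ys `runs` times and compare eigenvalues with the observed one."""
    rng = random.Random(seed)  # same seed -> same sequence of shuffles
    observed = first_eigenvalue(xs, ys)
    shuffled = list(ys)
    as_large = 0
    for _ in range(runs):
        rng.shuffle(shuffled)
        if first_eigenvalue(xs, shuffled) >= observed - 1e-12:
            as_large += 1
    return (1 + as_large) / (1 + runs)

# Hypothetical, strongly related variables in 8 sample units.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

p_first = randomization_test(x, y, runs=999, seed=42)
p_again = randomization_test(x, y, runs=999, seed=42)
print(p_first, p_first == p_again)
```

With 999 runs the smallest p-value that can be reported is 1 / 1000, so more runs are needed to claim smaller values, and supplying the same seed makes the result reproducible.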
PrincipalComponents(PCA) Example
PrincipalComponents(PCA) Example
Results: Covariance matrix species distances
Principal Components (PCA): Example
Results: Eigenvalues - Variance explained (up to 10 axes)
Eigenvalues are proportional to variance explained
Broken-stick eigenvalues are those expected by chance
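The broken-stick comparison can be sketched as follows. The formula used is the standard broken-stick expectation for p variables, with the k-th value equal to the sum of 1/j for j from k to p; it assumes eigenvalues on the correlation-matrix scale (summing to p). The observed eigenvalues below are hypothetical, and the function names are assumptions.

```python
def broken_stick(p):
    """Expected eigenvalues under the broken-stick model for p variables."""
    return [sum(1 / j for j in range(k, p + 1)) for k in range(1, p + 1)]

def axes_to_keep(eigenvalues):
    """Count leading axes whose eigenvalue exceeds the broken-stick value."""
    expected = broken_stick(len(eigenvalues))
    keep = 0
    for observed, chance in zip(eigenvalues, expected):
        if observed <= chance:
            break
        keep += 1
    return keep

# Hypothetical eigenvalues from a PCA of 4 variables (they sum to 4).
observed = [2.6, 0.9, 0.4, 0.1]

for obs, exp in zip(observed, broken_stick(len(observed))):
    print(obs, round(exp, 3))
print("axes to interpret:", axes_to_keep(observed))
```

Eigenvalues from a variance / covariance matrix would need to be rescaled to the same total before making this comparison.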
PrincipalComponents(PCA) Example
Results: Never explain 100% of variance (axes = variables)
Observed
Variance Explained
Expected
PrincipalComponents(PCA) Example
Results: Randomization tests
PrincipalComponents(PCA) FinalResult
Correlation with Axes
PrincipalComponents(PCA) Example
Results: Graphs (for Eigenvalues > 100)
PrincipalComponents(PCA) Example
Results: Graphs
Samples: points
Species: vectors
PrincipalComponents(PCA) Example
Results: Graphs
Samples: points
Species: vectors
PrincipalComponents(PCA) Example
Results: Species Response Graphs
Loadings STLI
Loadings NEDO
Axis 1: +0.51
Axis 2: -0.74
PrincipalComponents(PCA) Example
Variance
Weight-average
Cross-product
Matrix
Randomization
PrincipalComponents(PCA) Example
Setting up the Randomization Test:
Seed: make
multiple tests
comparable by
using the same
sequence of
random numbers
(supply seed)
Runs: number
of permutations
used in the test
(determines p
value of statistic)
PrincipalComponents(PCA) Example
PrincipalComponents(PCA) Example
Results: Covariance matrix species distances
PrincipalComponents(PCA) Example
Results: Eigenvalues
PrincipalComponents(PCA) Example
Results: Graphs
Samples Labeled
Species Labeled
PrincipalComponents(PCA) Example
Display Recommendation:
PC ORD recommends displaying
species as vectors / samples as points
Rotation:
Useful to show patterns you are interested in.
Need to keep track and report in results
PrincipalComponents(PCA) Example
Rotation by NEDO
We stretch the graph
along direction of most
variation of the species
Loadings of
NEDO
Axis 1: +0.51
Axis 2: -0.74
PCA: Application & Examples
Objectives:
Showcase PCA analysis in PC-ORD and the literature
Learning Outcomes:
Become familiar with the output / results of PCA
PrincipalComponents(PCA) Example
Results: Covariance matrix species distances
PrincipalComponents(PCA) Example
Results: Eigenvalues - Variance explained (up to 10 axes)
Eigenvalues are
proportional to
variance explained
Broken-stick
eigenvalues are
produced by chance
PrincipalComponents(PCA) Example
Results: Never explain 100% of variance (axes = variables)
Observed
Variance Explained
Expected
PrincipalComponents(PCA) Example
Results: Species Loadings onto the PC Axes
Use
PrincipalComponents(PCA) Example
Results: Randomization tests
PrincipalComponents(PCA) FinalResult
Results: Correlation with Axes
PrincipalComponents(PCA) Example
Results: Graphs
Samples: points
Species: vectors
PrincipalComponents(PCA) Example
Results: Graphs
PC ORD recommends
displaying
species as vectors /
samples as points
Samples: points
Species: vectors
PrincipalComponents(PCA) Example
Results: Graphs
Samples Labeled
Species Labeled
PrincipalComponents(PCA) Example
Results: Species Response Graphs
Loadings STLI
Loadings NEDO
Axis 1: +0.51
Axis 2: -0.74
PrincipalComponents(PCA) Example
Rotation: Highlights certain patterns. Report in results
NEDO Axes
Correlations
Axis 1: +0.51
Axis 2: -0.74
Rotation by NEDO
Stretch plot along
direction of most
variation for species
PrincipalComponents(PCA) PaperI
Published Example: Ainley, D.G. et al. (2005).
Objective: Relate densities of the 12 most abundant
species of seabirds to 12 habitat variables:
5 biological, 4 oceanographic, 3 geographic (spatial)
PrincipalComponents(PCA) Paper I
Oceanographic variables examined:
sea-surface temperature / salinity, thermocline depth / strength
Chl
Max
Acoustic
Biomass
PrincipalComponents(PCA) PaperI
Data Manipulations To Avoid Biases:
Densities log-transformed to meet normality assumptions
Nevertheless, residuals generated in the regressions for
some species did not meet those assumptions (Skewness /
Kurtosis Test for Normality of Residuals, P < 0.05)
Least-squares regression analysis (ANOVA), however,
is a very robust procedure with respect to non-normality
(Seber, 1977, Kleinbaum et al., 1988)
Yet, while these analyses yield the best linear unbiased
estimator in the absence of normally distributed residuals,
P-values near 0.05 must be viewed with caution (Seber, 1977)
PrincipalComponents(PCA) PaperI
To avoid double-absences:
Only 15-min transects in which any given species was
recorded were analyzed
The total sample size for the 12 species was 1209
Is this an adequate sample size ?
Rule of thumb:
5 samples per variable (Tabachnick and Fidell 1989)
1209 / 12 ~ 100 samples per variable
PrincipalComponents(PCA) PaperI
Analysis Methods:
Principal components analysis (PCA), in combination
with Sidak multiple comparison tests, used to assess
differences in habitat selection among 12 seabird species
To test for significant differences in habitat affinities
among seabird species, we used two one-way ANOVAs:
In the first, we tested for differences among PC1 scores of
each species; in the second, we compared the PC2 scores
Considered differences between two species to be
significant if either one or both of the PC1 or PC2 scores
differed significantly
PrincipalComponents(PCA) PaperI
Community-Wide Result: The first and second PC axes
explain 60% of variance in habitat use by 12 seabird species
PrincipalComponents(PCA) PaperI
Species-specific Results:
Salty, Green
Pair-wise associations
denoted by circles
Zoop
Prey
Fish Prey
PrincipalComponents(PCA) PaperII
Published Example: Weichler et al. (2004).
Objective: Relate seabird densities to seven
environmental parameters:
(1) water depth, (2) distance to nearest land, (3) number
of trawlers within a radius of 5 km, (4) sea surface
temperature, (5) water temperature difference (0-10 m),
(6) water temperature difference (0-30 m), and (7) water
temperature difference (10-50 m)
Did Not Report Cross-correlations of Habitat Variables
PrincipalComponents(PCA) PaperII
Data Manipulations To Avoid Biases:
Species densities were selected as variables and 10 min
intervals (samples), were selected as cases
Only species seen in at least five counting intervals were
included, an arbitrary choice that allowed covering a wide
spectrum of species while ignoring those with few occurrences
Only commoner species with numbers exceeding 1% of all
individuals counted were included in the analysis
Dataset of 46 sections of the cruise tracks. Each section
comprised a hydrographic station approximately midway and
10 min intervals in two opposite directions (4-8 km away)
Sample Size: 46 samples / 7 variables: Ratio of about 6.6
PrincipalComponents(PCA) PaperII
Community-Wide Result: Six principal eigenvalues (> 1),
showing % of variation explained and ecological interpretation
PrincipalComponents(PCA) PaperII
Community-Wide Result:
Loadings for the 11 seabird
species and 7 variables on
the six principal eigenvalues
3 principal components:
50 % of variance
6 principal components:
78 % of variance
PrincipalComponents(PCA) Comparisons
Number of Axes:
- Selected 2 easy to interpret (Ainley et al. 2005)
- Selected 6 based on eigenvalues > 1 (Weichler et al. 2004)
Display of Results:
- Plot and table of eigenvalues (Ainley et al. 2005)
- Eigenvalues and interpretation (loadings) (Weichler et al. 2004)
Significance Tests:
- Pairwise species comparisons (ANOVA) (Ainley et al. 2005)
- Correlations with selected variables (Weichler et al. 2004)
PrincipalComponents(PCA) Tools
Percent of pattern
explained in original
distance matrix
PrincipalComponents(PCA) Tools
Principal Components (PCA): Examples
Ainley DG, Spear LB, Tynan CT, Barth JA, Pierce SD, Ford RG, Cowles
TJ (2005). Physical and biological variables affecting seabird distributions
during the upwelling season of the northern California Current. Deep-Sea
Research II 52: 123-143
Weichler T, Garthe S, Luna-Jorquera G, Moraga J (2004). Seabird
distribution on the Humboldt Current in northern Chile in relation to
hydrography, productivity, and fisheries. ICES J. Marine Science
61 (1): 148-154
Disclaimer: References
Seber, G.A.F. (Ed.), 1977. Linear Regression Analysis. Wiley, New York.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., 1988. Applied Regression
Analysis and other Multivariable Methods. PWS-KENT Publishing Company,
Boston.
Tabachnick, B.G. and L.S. Fidell. 1989. Using Multivariate Statistics. 2nd ed.
New York: Harper and Row.
Data Screening and Transformations
Objectives:
Discuss Steps for Analysis: Data Screening, Data Manipulation
Go over the principles of data exploration
Learning Outcomes:
Be ready to plan your analysis: Develop Metadata and Analysis Log
Be able to screen and manipulate your data with PC-ORD
Data Exploration: Documenting Flow
Flow diagram: sequence of changes / analysis
Analysis log: input, output, results
Save all input and output files and data edits
Metadata: data about data
[Figure: flow diagram linking lists of errors to clean data]
Data Exploration: Documenting Flow
[Figure: flow of files through PC-ORD, recording file names, file contents, connections, and links to other software (GIS, Stats)]
Products:
- figures
- tables
- results
Data Exploration: Documenting Flow
Use clear, descriptive titles (dated)
Save all output files
Keep a flowchart or dated record
Record WHY you did it; you will forget!
DataExploration DocumentingFlow
Screening:
Cleaning:
Fix typos
Erase / Correct incomplete data
Check effects of corrections
Transformations:
Look up assumptions of test
Check data distributions
Make transformations (re-check)
DataExploration DataScreening
Metadata
96 samples
5 variables
Data type?
Explanation
Data Exploration: Current Profile
% zeros: species data
Lowest / highest value: typos (errors)
Skewness: non-normality (-1 < SK < 1)
Outliers (in SD units): 2 SD -> ~95%
DataExploration SummaryI
Data Summary:
Mean
SD
Range
Diversity
Data Exploration: Summary II
Skewness:
Steps to Fix Skewness:
Taking the log or square root works for data with moderate skewness
Data Exploration: 1-D Outliers
Frequency distribution with a univariate outlier falling
5.5 standard deviations above the mean
DataExploration 1DOutliers
Describe the distribution:
In graph and tabular form
DataExploration 1DOutliers
Discrete Distribution
Continuous Distribution
Test for significance: off PC-ORD
Data Exploration: 2-D Outliers
[Figure: bivariate scatterplot of Sp1 vs Sp2 used to spot 2-D outliers]
Data Exploration: Outliers
[Figure: frequency histogram of average distance used to spot multivariate outliers]
Data Manipulations
You can manipulate data directly in PC-ORD
Modify / Append Data
Delete Columns / Rows
Multiply / Add Constant
Randomly Sample
Shuffle Data
Note: Beals smoothing is experimental (DO NOT USE)
Data Transformations
What are the two reasons for data transformations?
Statistical:
Meet assumptions (normality, linearity, variances, ...)
Express variables in the same units (km, km/hr)
Ecological:
Make distance measures work better
Reduce influence of total quantity (sample totals)
Deal with importance of rare / common species
Identify informative species
Data Transformations: Nomenclature
Monotonic: Element values are changed, but
ranks stay the same (e.g., change unit from km to m)
Relativization: Adjusts matrix elements by one
column / row standard (e.g., total, maximum)
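Relativization can be sketched on a small matrix of sample units (rows) by species (columns). The values and function names are illustrative assumptions; the two standards shown, row totals and column maxima, are the examples named above.

```python
# Illustrative matrix: rows = sample units, columns = species.
matrix = [
    [10, 2, 1],
    [0.5, 0.4, 0.1],
    [1, 0, 1],
]

def relativize_by_row_total(matrix):
    """Divide each element by its row (sample unit) total."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result

def relativize_by_column_max(matrix):
    """Divide each element by its column (species) maximum."""
    maxima = [max(column) for column in zip(*matrix)]
    return [[v / m if m else 0.0 for v, m in zip(row, maxima)] for row in matrix]

by_row = relativize_by_row_total(matrix)
by_col = relativize_by_column_max(matrix)

for row in by_row:
    print([round(v, 3) for v in row])
for row in by_col:
    print([round(v, 3) for v in row])
```

Relativizing by row totals removes differences in total abundance among sample units, while relativizing by column maxima puts rare and common species on the same 0 to 1 scale.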
DataTransformation
DataTransformation
Monotonic transformations retain ranks, but change values
P/A
(x)
f(x)
DataTransformations ExampleI
Logarithmic transformation: f(x) = ln(x) OR log(x)
DataTransformations ExampleII
Arcsine / Arcsine-squareroot transformation
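Both example transformations are easy to sketch. Two assumptions are made here that the slides do not state: a constant of 1 is added before taking the log so that zeros stay defined, and the arcsine-squareroot result is multiplied by 2/pi so that it runs from 0 to 1.

```python
import math

def log_transform(x, constant=1.0):
    """log10(x + constant); the added constant (an assumption) keeps zeros defined."""
    return math.log10(x + constant)

def arcsine_sqrt(p):
    """(2 / pi) * arcsin(sqrt(p)) for a proportion p between 0 and 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("arcsine-squareroot expects a proportion between 0 and 1")
    return (2.0 / math.pi) * math.asin(math.sqrt(p))

counts = [0, 9, 99, 999]
proportions = [0.0, 0.1, 0.5, 0.9, 1.0]

logged = [log_transform(c) for c in counts]
spread = [arcsine_sqrt(p) for p in proportions]

# Both transformations are monotonic: values change, ranks do not.
assert logged == sorted(logged)
assert spread == sorted(spread)
```

The arcsine-squareroot transformation stretches both ends of the 0 to 1 scale, which is why it is commonly applied to proportion data.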
DataTransformation Howto
Note: Need to accept TEMP file
DataRelativization
Relativization
DataRelativization
DataRelativization
DataRelativization Howto
Note: Need to accept TEMP file
DataExploration Summary
Create naming convention for your files (metadata record)
(DATE_AREA_SP_suffix)
9710_Oahu_WTSH_raw
Create a data flow archive in your analysis notebook
Check assumptions of statistical tests / approaches
(PCA: normality of data, linear relationships)
Visually inspect your data: 1-D, 2-D, many-D.
Look for missing data and outliers in individual datasets
Inspect
DataManipulation Summary
Add missing data and fix typos
Ensure variables expressed in the same units (km / m)
Select the number and identity of species
(Rare species that occur in a single sample
contribute virtually no information, but add noise)
Look for and deal with outliers: (Remove OR Transform)
Deal with confounding factors, such as the different
magnitude of environmental variables (e.g., depth in m or km)
and the proportional representation of different species
(Relativize your data)
CHAPTER 9
Data Transformations
Criteria
Always
Standard deviation    Degree of problem
------------------    -----------------
< 2                   no problem
2 - 2.3               weak outlier
2.3 - 3               moderate outlier
> 3                   strong outlier
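The outlier thresholds in this table can be applied to the average distance of each sample to all other samples. The sketch below is not PC-ORD's implementation: it assumes Euclidean distance, the sample standard deviation, and made-up data, and it treats the boundary values (exactly 2, 2.3 and 3) in one of several possible ways.

```python
import math
import statistics

def classify_outlier(sd_units):
    """Map a deviation (in SD units above the mean) to the degree of problem."""
    if sd_units < 2.0:
        return "no problem"
    if sd_units < 2.3:
        return "weak outlier"
    if sd_units <= 3.0:
        return "moderate outlier"
    return "strong outlier"

def average_distances(rows):
    """Average Euclidean distance from each sample to every other sample."""
    n = len(rows)
    return [
        sum(math.dist(rows[i], rows[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def outlier_report(rows):
    """Classify each sample by how far its average distance lies above the mean."""
    avg = average_distances(rows)
    mean = statistics.mean(avg)
    sd = statistics.stdev(avg)
    return [classify_outlier((a - mean) / sd) for a in avg]

# Hypothetical data: ten similar samples and one very different sample (last row).
samples = [
    [1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],
    [1, 1.5], [2, 1.5], [1.5, 1], [1.5, 2], [1.2, 1.8],
    [30, 30],
]
report = outlier_report(samples)
```

Only positive deviations are flagged, because a sample with an unusually small average distance sits in the middle of the data cloud rather than outside it.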
Environmental data
Table 9.4. Suggested procedure for data adjustments of quantitative variables in environmental data matrices.
Action to be considered
Criteria
Always
3. Column relativizations
NonmetricMultidimensionalScaling
(NMS)
Objectives:
Discuss steps for analysis: advantages / disadvantages
Go over output and interpretation of Autopilot analysis
Learning Outcomes:
Understand what an NMS analysis does and tells you
Be able to do an NMS analysis with PC-ORD
NMS Whatisit?
Non-metric: Non-parametric data analysis (ranks)
(Relationships between object pair-wise
distances and dissimilarities are not linear)
Output:
NMS Howdoesitwork?
NMS searches for best position of n objects on k
dimensions (axes) to minimize stress of k-d configuration
Compares the pair-wise distances (difference) of the
objects in reduced space (expressed in terms of the axes)
and the dissimilarity of the objects in the real world
(expressed in terms of the samples / species / variables):
The Real World
(e.g., 3D)
Reduced Space
(e.g., 1D)
NMS Howdoesitwork?
Approach:
Mechanics:
Iterative procedure
(Manipulates the coordinates of pairs of
observations so they fit as closely as
possible the measured object similarities)
NMS TheGood
Being based on ranked distances, it tends to linearize
relationship between environmental / species distances
Can deal with any distance measure, data normalization,
and data transformation
Can handle non-metric, semiquantitative and subjective
data (e.g., good / bad, Beaufort sea state)
Solves zero truncation problem and some missing data
Empirical studies have shown that:
- Use of ranks makes NMS robust even if relationships
between distances and dissimilarities are not linear
- Provides appropriate distance summary with small
number of dimensions
NMS TheBad
Computationally intensive
Does not provide formula loadings
For a given number of dimensions, the solution for a
particular axis is unique. (First dimension in 2-D solution
not the same as first dimension in 3-D or 1-D)
Axis numbers are arbitrary, so the percent of variance on
a given axis does not decrease with increasing axis number
Difficulties in detecting discontinuities
Can fail to find the global solution (minimum global stress)
because of multiple local minima.
Need to account for random start of iterative process
(e.g., repeat analysis to see if random start matters)
NMS Approach
1. Calculate dissimilarity matrix ($\Delta$) of real data.
2. Assign sample units to a starting configuration in the k-space (define initial X). Starting locations (scores on
axes) are assigned with a random number generator.
3. Normalize X by subtracting axis means for each axis
l and dividing by overall standard deviation of scores:
$$\text{normalized } x_{il} = \frac{x_{il} - \bar{x}_l}{\sqrt{\sum_{l=1}^{k} \sum_{i=1}^{n} (x_{il} - \bar{x}_l)^2 / (n k)}}$$
(n = samples, k = dimensions)
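The normalization in step 3 can be sketched as follows. The function name and the example coordinates are hypothetical; the arithmetic follows the step as stated: subtract each axis mean, then divide by the overall standard deviation of all scores.

```python
import math

def normalize_configuration(X):
    """Center each axis on zero and scale by the overall SD of all scores."""
    n = len(X)      # number of sample units
    k = len(X[0])   # number of dimensions (axes)
    means = [sum(row[l] for row in X) / n for l in range(k)]
    sum_sq = sum((X[i][l] - means[l]) ** 2 for i in range(n) for l in range(k))
    overall_sd = math.sqrt(sum_sq / (n * k))
    return [[(X[i][l] - means[l]) / overall_sd for l in range(k)] for i in range(n)]

# Hypothetical starting configuration: 3 sample units in 2-D space.
start = [[1.0, 5.0], [2.0, 7.0], [6.0, 0.0]]
normalized = normalize_configuration(start)
```

After this step every axis has a mean of zero and the scores as a whole have unit variance, so stress values are comparable between iterations.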
NMS Approach
4. Calculate D using the Euclidean distances between
sample units in k-space.
5. Rank elements of $\Delta$ in ascending order.
6. Put the elements of D in the same order as $\Delta$.
7. Calculate $\hat{D}$ (with elements $\hat{d}_{ij}$,
created by replacing elements of D
which do not meet monotonicity).
Software creates a plot of sample
pair-wise dissimilarities (y axis)
versus distances in k-space
We compute distance in k-space
NMS Approach
NMS Approach
NMS Stress
8. Calculate raw stress, S*:
$$S^* = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \hat{d}_{ij})^2$$
NMS Stress
9. Because raw stress is altered if the configuration of points
changes (e.g., point locations, number of dimensions), it is
necessary to standardize ("normalize") stress.
Kruskal's stress formula one:
$$S = S^* \Big/ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}^2$$
$$S_R = 100 \sqrt{S}$$
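Steps 5 to 9 can be sketched for a handful of sample pairs. This is an illustration under stated assumptions, not PC-ORD's code: the monotone fit uses the pool-adjacent-violators method, ties in the dissimilarities get no special treatment, and the final value is assumed to be 100 times the square root of the scaled stress, which matches the 0 to 100 scale of the stress values quoted elsewhere in these slides.

```python
import math

def monotone_fit(values):
    """Pool-adjacent-violators: closest non-decreasing sequence (least squares)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block mean exceeds the mean of the block after it.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def kruskal_stress(dissimilarities, distances):
    """Return (raw stress, stress on a 0-100 scale) for matching lists of pairs."""
    # Steps 5-6: order the k-space distances by ascending dissimilarity.
    order = sorted(range(len(dissimilarities)), key=lambda p: dissimilarities[p])
    d = [distances[p] for p in order]
    # Step 7: replace distances that break monotonicity.
    d_hat = monotone_fit(d)
    # Step 8: raw stress.
    raw = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    # Step 9: scale by the sum of squared distances, then take 100 * square root.
    return raw, 100.0 * math.sqrt(raw / sum(a * a for a in d))

# Hypothetical values for three sample pairs.
raw, stress = kruskal_stress([0.1, 0.5, 0.9], [1.0, 3.0, 2.0])
print(raw, round(stress, 1))  # 0.5 18.9
```

When the distances already increase with the dissimilarities, the fitted values equal the distances and the stress is zero.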
NMS Approach
10. Now the program tries to minimize S by changing the
configuration of the sample units in the k-space.
Calculate "negative gradient of stress" for each point i.
11. The amount of movement in direction of the negative
gradient is set by the step length, α, which is about 0.2 initially.
The step size is recalculated after each step such that the step
size gets smaller as reductions in stress become smaller.
12. Iterate (go to step 3) until either:
- a set maximum number of iterations is reached OR
- a criterion of stability is met
NMS Approach
Crawling through the landscape in search of the optimum
Stress Landscape
Changing
positions of
the samples
Axis 1
Axis 2
NMS Approach
The starting configuration can influence the result
Beware of local minima (pits)
Avoid unstable solutions (saddle points)
The starting configuration can be selected in two ways:
Use a random starting configuration
Use coordinates from another ordination method
Recommendation: Use a random start
A high number of random starting configurations often
provides a solution with lower stress
This approach avoids having to decide which other
method to use, and avoids losing the benefits of NMS
NMS Approach
Possible to evaluate whether NMS is extracting stronger
axes than expected by chance
Statistical Significance Based on Randomization Test
(Monte Carlo approach):
p = (1+n) / (1+N)
n = number of randomized runs with final stress
less than or equal to the observed minimum stress
(one tailed test) N = number of randomized runs
Recommendation: Use a large number of runs
This is a computationally intensive method that
takes a great deal of time (even if runs = 20)
We need to have a large enough number of runs to
calculate the p value with the desired resolution
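The link between the number of randomized runs and the resolution of the p value follows directly from p = (1+n) / (1+N). The sketch below uses made-up stress values; the function name is hypothetical.

```python
def randomization_p_value(observed_stress, randomized_stresses):
    """One-tailed Monte Carlo p value: p = (1 + n) / (1 + N)."""
    N = len(randomized_stresses)
    # n = randomized runs with final stress less than or equal to the observed minimum
    n = sum(1 for s in randomized_stresses if s <= observed_stress)
    return (1 + n) / (1 + N)

observed = 12.5                       # hypothetical minimum stress from real data
runs_20 = [20.0 + i for i in range(20)]   # hypothetical stresses, all above observed
runs_50 = [20.0 + i for i in range(50)]

p_20 = randomization_p_value(observed, runs_20)   # 1 / 21, about 0.048
p_50 = randomization_p_value(observed, runs_50)   # 1 / 51, about 0.020
```

Even when no randomized run beats the real data, the smallest p value that can be reported is 1 / (1 + N), so 20 runs cannot give a p value below about 0.048.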
NMS Approach
Statistical Significance Based on Randomization Test
( p value: p = (1+n) / (1+N) )
(20 runs)
(50 runs)
NMS Approach
Stress Interpretation:
Real Data:
Declines with increasing
dimensions (from 1 to 5)
Randomized Data:
Real data below the
distribution of
randomized data
(for dimensions 1 to 5)
NMS AutopilotMode
The automatic procedure determines
most appropriate dimensionality,
assigns statistical significance with
randomizations, and avoids local
minima (using random iterations)
Advantages: Uses default settings
and decides number of axes for you
NMS AutopilotMode
The autopilot NMS mode
provides three settings
Speed vs Thoroughness
Quick and Dirty
Medium
Slow and Thorough
NMS AutopilotMode
The autopilot NMS mode provides three settings
NMS Results
Examine Results.txt file: Settings / Options
NMS Results
Examine Results.txt file: Settings / Options (all Dimensions)
NMS Results
Examine Results.txt file: Results for best result
Stress
P values
Scores
NMS Results
Examine Results.txt file: Results for best result
Scores
NMS Results
Examine Results.txt file: Plotting Stress vs Iteration
NMS Results
Examine Results.txt file: Interpret Stress (Clarke 1993)
NMS Results
Examine Results.txt file: Run Log
NMS Results
**To run single NMS ordination repeating best result, use this
file as starting configuration, rather than using random start.
Save this file with new name, to avoid overwriting it with next
NMS test. To do this, open file using File | Open | Graph Row
file, then File | Save as | Graph Row file (specify new name).
NMS Results
Examine graphs: Species scores
Select Weighted Average Scores
Species as Vectors
Species as Points
NMS Results
Examine graphs: 2D Ordination plots
NMS Results
Correlations with Matrices:
Tau (rank correlation)
DO NOT use r² value
Percent of Variance:
Use same distance metric
used for NMS analysis
NMS Results
Coefficient of Determination (% of Variance):
For each axis together
NonmetricMultidimensionalScaling
(NMS)
Objectives:
Go over settings and results of Manual analysis
Discuss constraints when deciding on number of axes
Learning Outcomes:
Understand what results need to be reported
NMS SuggestedProcedure
This suggested procedure
determines the most appropriate
dimensionality, assigns statistical
significance with randomizations,
and avoids local minima.
NMS SuggestedProcedure:Step1
First, pick distance measure
Step
Down
Relative Sorensen
Dimensions (max = 6)
Relative Euclidean
NMS SuggestedProcedure
Third, pick the output options
Write final
configuration
Run Log
Plot Stress
vs. Iteration
Provides scores
Statistics
Dimensionality
Plot distance
vs. dissimilarity
Randomization
Statistical Test
Species Scores
(for plotting)
NMS SuggestedProcedure
1. Preliminary runs: Stress Test determines dimensionality
Use time of day random seed
Graph messages
NMS Results
Examine Results.txt file: Settings / Options
NMS Results
Examine Results.txt file: Results for each run / dimension
Stress
Scores
NMS Results
Differences in Real Space
(2 - D)
Final
Stress:
4.137
Final
Stress:
23.138
Distances in 2-D space
NMS Results
Examine Results.txt file: Plotting Stress vs. Iteration
NMS Results
Examine Results.txt file: Stress
NMS SuggestedProcedure:Step2
Goal: Select the Best Solution:
Plot stress vs. number of dimensions
How: Just after running NMS
Do this in PC-ORD by selecting
Graph | NMS Scree Plot
If the stress
increases with
additional
dimensions,
the model is
over-fitted
NMS SuggestedProcedure:
PC-ORD uses the following criteria (for reference):
NMS SuggestedProcedure:
Other metrics for selecting number of dimensions:
marginal change in stress
p values
If stress does
not increase,
computer
considers
marginal
decline with
added Ds
Consider the
p values
NMS SuggestedProcedure:
Check for a better-than-random solution by using the
results of the Randomization test.
Limitations: Helpful but not foolproof.
The most common problems are:
Strong outliers, single super-abundant species,
small data sets (e.g., < 10 SUs), many zeros
Note: The first axis with randomized community data is
often nearly as strong or stronger than the real data,
even when the pattern in the real data is strong. The
randomization creates rows with unequal abundances:
some rows can have higher or lower totals than the real data.
Thus a 1-D NMS solution from shuffled data tends to
describe variation in row totals. Interpret carefully
NMS SuggestedProcedure:
Goal: Select number of dimensions beyond which
additional dimensions provide only small stress reductions
Suggestion: Follow PC-ORD's recommendation
but check for some safeguards
Note:
No firm fixed criterion for selecting an appropriate
number of dimensions (Kruskal and Wish 1978)
Axis scores depend on the number of axes. The first
dimension on a 2-D and a 3-D result will be different
NMS SuggestedProcedure:
Trade-Offs:
Do not trust results with large stress values (> 20)
Final stress decreases and the proportion of the variance
represented increases with more axes
Pick as few dimensions as possible based on stress
reductions but if in doubt, add an extra dimension
Beware of unstable results (stress wiggles with iterations)
Consult the instability of the final answer
NMS SuggestedProcedure:
Check the following: (a) plot of stress vs. iteration for
stability of the solution at the selected number of
dimensions; and (b) final instability value for the chosen
solution, as listed in the numerical output from NMS
Look for smooth curves
[Figure: two panels, Stable and Unstable, each plotting Stress (y axis) against Step (x axis, 0 to 200)]
NMS SuggestedProcedure:
Use Data Exploration to decrease stress of NMS analysis
[Figure: two panels plotted against the criterion for species retention (% of SUs)]
NMS WhattoReport
Samples / Species Considered
Data Transformations
NMS References
PC-ORD uses the following algorithms:
Mather, P. M. 1976. Computational methods of multivariate
analysis in physical geography. J. Wiley & Sons, London.
532 pp.
Kruskal, J. B. 1964. Multidimensional scaling by optimizing
goodness of fit to a nonmetric hypothesis. Psychometrika
29:1-27.
For a review of NMS, cite:
Clarke, K.R. 1993. Non-parametric multivariate analyses of
changes in community structure. Australian Journal of
Ecology 18:117-143.
Kenkel, N. C. and L. Orloci. 1986. Applying metric and nonmetric
multidimensional scaling to ecological studies: some new
results. Ecology 67:919-923.
NMS ExamplesI
Seabird communities of the Indian Ocean
We selected an
observation day as
the sampling unit for
the community-level
analysis because we
regarded the daily
transects as discrete
samples, separated
by night time periods
with no survey effort.
(Hyrenbach et al. 2007)
NMS ExamplesI
Seabird communities of the Indian Ocean
The NMDS selected 3 habitat axes, which accounted for
73.4 % of variance observed in the seabird community
- The first axis (r² = 0.15) described lat gradients associated with
concurrent SST decrease and CHL increase.
- The second axis (r² = 0.41) illustrated concurrent lat / long changes in
wind speed, depth, CHL, SST, and gradients in ocean depth and SST.
- The third axis (r² = 0.17) captured the influence of onshore-offshore
gradients in CHL, irrespective of lat and long.
NMS ExamplesI
Seabird community structure in the Indian Ocean
[Figure: NMS ordination of transects 1 to 16 on Axis 2 (x axis) versus Axis 3 (y axis); labelled gradients: North / South and Shallow / Deep]
NMS ExamplesII
Seabirds and subsurface predators around Oahu
69 seabird foraging
observations recorded
Presence of subsurface
predators was not
ascertained in 7 schools
In 2 of remaining 62
observations, no
subsurface predators
were present
NMS ExamplesII
Seabirds and subsurface predators around Oahu
The NMDS analysis relied on a similarity matrix created using the
Sorensen (Bray-Curtis) index from the raw seabird counts and 13
explanatory variables describing:
- type of fishing (commercial vs. sport)
- subsurface predator (skipjack tuna, mahimahi, spotted dolphin,
false killer whale, yellowfin tuna, unknown),
- geographic location around Oahu (Waianae, Penguin Bank,
Kaena Point, other *).
* Only those locations contributing at least 10%
(7 or more) observations considered in analysis.
(Hebshi et al. 2008)
NMS ExamplesII
Seabirds and subsurface predators around Oahu
NMS identified 2 highly (99.3%)
orthogonal axes (r = 0.082),
which explained 67.9% of the
cumulative observed variance
axis1, r2: 0.502
axis2: r2: 0.178
The NMS stress was 17.873,
suggesting that the test
performance was fair
(McCune & Grace 2002)
NMS ExamplesII
Seabirds and subsurface
predators around Oahu
The seabird community was
influenced by the presence of
wedge-tailed shearwaters, brown
noddies, and sooty terns
The first axis captured the
differences between commercial
and sport fishing vessels, while
the second axis captured variability
across geographic locations
This analysis also revealed
significant correlations with the first
axis for 2 subsurface predators:
mahimahi (+) and skipjack tuna (-)
TakeHomeMessages
NMS is a flexible and powerful tool
This inherent flexibility makes this technique difficult
to interpret (how many meaningful axes are there ?)
Yet, NMS allows the integration of different datasets into
multivariate patterns
Data exploration will help you use NMS most efficiently,
by carefully choosing the sample sizes and species /
variables to include in your analyses.
Use NMS to tell ecological stories that balance noise
against statistical significance
PCAExaminationKey
This exam is worth 10 points (two homeworks).
Just like in the homeworks, make sure you explain what
you are doing and how you are getting the answers. This
way, I can give you partial credit for incomplete answers.
In particular, explicitly state what PC-ORD command you
used to obtain the various figures / results.
You will turn in a ppt file with your images and text
inserted into the body of the presentation. To copy text
from PC-ORD screen, use CONTROL + Print Screen
When answering the questions, back up your responses
with figures / tables / numbers. An image / table is worth
1000 words!!!
Dataset
Data file: PCA1M.wk1 (main matrix)
DataExploration
Use scatterplot matrix to make a
plot of all possible pair-wise
combinations of the 5
environmental variables
DataExploration Correlograms
Time Trends ?
Regional Indices
(PDO / MEI)
Local Indices
(up36 / up39)
DataExploration Advisor
Rows Skewed
Columns Not Skewed
Outliers: Samples
Look out for these
in the plot results
DataExploration VariableYear
DataExploration Skewness
0.22 = skewness
0.94 = skewness
0.60 = skewness
0.24 = skewness
StatisticalResults WithYear
StatisticalResults WithYear
Important Axes:
Eigenvalue: 1,2,3,4 Broken-stick: 1 P-values: 1,4
Interpretation:
Opposite
Time
MEI / PDO
Together
DataTransformation TimesinceStart
Transformation:
Subtract 1970 (first year sample)
Recode as Time Since Start
Similar skewness
No more outliers
DataExploration Skewness
StatisticalResults WithTime
StatisticalResults WithTime
Important Axes:
Eigenvalue: 1,2,3,4
Interpretation:
Opposite
Time
MEI / PDO
Together
DataTransformation RemoveTime
Remove Column:
No Time
Less Skewness
(for rows)
Still No Outliers
StatisticalResults RemoveTime
StatisticalResults WithoutTime
Important Axes:
Eigenvalue: 1,2,3
Interpretation:
Opposite
MEI / PDO
Together
DataExploration WithTime
Independent
(orthogonal) variables
DataExploration Time
DataExploration Time
DataExploration WithoutTime
Independent
(orthogonal) variables
DataExploration WithTime
Axis 1:
Su02
Big:
More Upwelling
Small:
Less Upwelling
Axis 3:
SU98 WI98
Su99
Small: Warm
Big: Cool
SU97
WI97
DataExploration WithoutTime
PDO axis 1
MEI axis 1
DataExploration WithoutTime
Upwelling 39 axis 1
Upwelling 36 axis 1
Conclusions
Number of eigenvalues = Number of variables
Eigenvalues loadings did not change
- even after transforming YEAR data
Broken-stick results did not vary: YEAR / TIME
Randomization results did vary: YEAR / TIME
PolarOrdination/MRPP
Objectives:
Discuss general approaches of these two methods
Go over settings and results for these two methods
Learning Outcomes:
Understand how to perform these analyses
Be familiar with what results need to be reported
PolarOrdination Applications
Bray-Curtis Ordination
(Polar Ordination) arranges samples
with respect to poles (also termed
end points or reference points)
according to a distance matrix
These endpoints are two samples
with the highest ecological distance
between them (objective approach),
OR two samples suspected of being
at opposite ends of an important
gradient (subjective approach)
Recommendation: This procedure is especially useful for
investigating ecological change (e.g., succession, recovery).
PolarOrdination Pros/Cons
Advantages:
Ideal for evaluating problems with discrete endpoints:
conceptually (arctic sample / tropical sample) or
practically (before disturbance / climax community)
Polar Ordination is ideal for testing specific hypotheses
(e.g., reference condition or experimental design) by
subjectively selecting the end points
Disadvantages:
PolarOrdination HowitWorks
Setting Up:
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
Calculate sum of squares of the
distances for calculation of the
variance represented by each axis
Select two points, A and B, as reference points for axis 1
Define End Points Subjectively OR Use Objective Method
3 Objective Methods: Recommend Variance-Regression
- find point with largest variance in pairwise distances
- select point which minimizes regression of distances
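The first setting-up step, a matrix of Sorensen (Bray-Curtis) distances between all pairs of sample units, can be sketched as below. The abundance values are made up and the function names are hypothetical; the formula is the usual quantitative Sorensen distance, the sum of absolute differences divided by the sum of both sample totals.

```python
def sorensen_distance(a, b):
    """Quantitative Sorensen (Bray-Curtis) distance between two sample units."""
    shared_total = sum(a) + sum(b)
    if shared_total == 0:
        return 0.0  # two empty samples: treated as identical in this sketch
    return sum(abs(x - y) for x, y in zip(a, b)) / shared_total

def distance_matrix(rows):
    """Symmetric matrix D of distances between all pairs of sample units."""
    n = len(rows)
    return [[sorensen_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical abundance data: 3 sample units (rows) x 3 species (columns).
abundance = [
    [10, 0, 5],
    [5, 5, 5],
    [0, 8, 0],
]
D = distance_matrix(abundance)
```

The distance is 0 for identical samples and 1 for samples that share no species, which is one reason it is a common choice for community data with many zeros.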
PolarOrdination HowitWorks
Selecting End Points:
Variance-Regression:
(Beals 1984)
Selects points at edges of main
cloud of points (Recommended)
PolarOrdination HowitWorks
Once you have the first axis (g) linking the two points:
Calculate position (xgi) of each point i on the axis g. Point i
is projected onto axis k between reference points A and B
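The projection of each point onto the axis between reference points A and B can be sketched with the classic Euclidean formula, x_i = (D_AB² + D_Ai² - D_Bi²) / (2 D_AB). This is an assumption about geometry: the city-block option recommended later in these slides uses a different calculation, which is not shown here.

```python
def polar_axis_scores(D, a, b):
    """Scores of all points on the axis running from end point a to end point b."""
    d_ab = D[a][b]
    return [
        (d_ab ** 2 + D[a][i] ** 2 - D[b][i] ** 2) / (2 * d_ab)
        for i in range(len(D))
    ]

# Hypothetical distance matrix for three samples that lie on a line at 0, 3 and 10.
D = [
    [0, 3, 10],
    [3, 0, 7],
    [10, 7, 0],
]
scores = polar_axis_scores(D, a=0, b=2)
print(scores)  # [0.0, 3.0, 10.0]
```

End point A always scores 0 and end point B always scores D_AB, so all other samples are placed relative to the two poles.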
PolarOrdination HowitWorks
Calculate variance represented by axis k as a percentage
of the original variance (V k %). The residual sum of
squares has same form as original sum of squares and
represents amount of variation from original distance matrix
PO SuggestedProcedure:Step1
First, pick distance measure
Distance: Sorensen
Second, select End Points
Let's try Subjective
Third, Geometry / Residuals
Recommend City-Block
NOTE: # Axes only changes reported results not solution.
Always try more than 1. Set List Residual Matrix = 0
PO SuggestedProcedure
Next, pick number of subjective axes
Note: Possible that
objective axes capture
more variation than the
subjective axis selected
PO Results
Examine Results.txt file: Settings / Options
PO Results
Examine scores on axis 1: Results.txt file
PO WhattoReport
Distance Metric Used (Use Sorensen)
Method for selecting End Points: What are they?
Use subjective for axis 1
Select City block Distance / Residuals similar to NMS
Use Variance-regression method for additional axes
Number of dimensions considered
Always use more than 1
MRPP Applications
Multi-response Permutation Procedure
(MRPP) is a non-parametric approach for
testing the hypothesis of no differences
between two or more groups of entities
(species, variables): MRBP, ANOSIM, Qb
These pre-existing groups can be
defined using groups of samples on the
basis of categorical data:
The presence / absence of given species
Categories of environmental variables
(e.g., early vs. late)
Recommendation: This procedure yields a p value and
interpretation requires further exploration: indicator species
MRPP Pros/Cons
Advantages:
Ideal for evaluating specific hypotheses:
differences between groups of samples
Disadvantages:
MRPP HowitWorks
Setting Up:
Include a Grouping Variable in the Main / Second matrix
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
within each of the pre-defined groups we are testing
Group 1
Group 2
MRPP HowitWorks
Calculate distance matrix, D
Calculate average distance xi within each group i
Calculate delta (weighted mean within-group distance)
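The three calculations above can be sketched for a tiny data set. Assumptions: Euclidean distance is used for brevity (the slides recommend Sorensen), each group is weighted by n / sum(n), and every group has at least two sample units.

```python
import math
from itertools import combinations

def mean_within_group_distance(points, members):
    """Average distance over all pairs of sample units inside one group."""
    pairs = list(combinations(members, 2))
    return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

def mrpp_delta(points, groups):
    """Weighted mean within-group distance, with weights n_i / sum(n)."""
    n_total = len(groups)
    delta = 0.0
    for label in sorted(set(groups)):
        members = [i for i, g in enumerate(groups) if g == label]
        weight = len(members) / n_total
        delta += weight * mean_within_group_distance(points, members)
    return delta

# Hypothetical data: four sample units described by one variable, in two groups.
points = [[0.0], [1.0], [10.0], [12.0]]
groups = [1, 1, 2, 2]

delta = mrpp_delta(points, groups)
print(delta)  # 1.5
```

A small delta means that sample units within the same group are close to one another.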
MRPP HowitWorks
Permutations:
M = N!/(n1! * n2!)
SU
Groups
SU
1
1
2
2
3
etc.
MRPP HowitWorks
Calculating the p value:
Determine probability of a delta as small or smaller
MRPP HowitWorks
Output:
Test Statistic T: measures effect size
A: within-group agreement
BEWARE:
DO NOT
over-interpret
T and A:
Ongoing
Discussion
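For a very small data set, the permutation step can be carried out exactly by listing every one of the M = N!/(n1! * n2!) ways of splitting the sample units into two groups. The sketch below is an assumption-laden illustration, not PC-ORD's method: it enumerates partitions instead of using an approximation, takes the expected delta as the mean over all partitions, computes A as 1 - observed delta / expected delta, and does not compute T.

```python
import math
from itertools import combinations

def weighted_delta(D, group_a, group_b):
    """Weighted mean within-group distance for two groups (weights n_i / N)."""
    n_total = len(group_a) + len(group_b)
    delta = 0.0
    for members in (group_a, group_b):
        pairs = list(combinations(members, 2))
        mean_dist = sum(D[i][j] for i, j in pairs) / len(pairs)
        delta += len(members) / n_total * mean_dist
    return delta

def mrpp_exact(D, group_a):
    """Exact two-group MRPP: returns (observed delta, p value, A, M)."""
    units = range(len(D))
    observed = weighted_delta(D, list(group_a), [i for i in units if i not in group_a])
    deltas = []
    for candidate in combinations(units, len(group_a)):
        rest = [i for i in units if i not in candidate]
        deltas.append(weighted_delta(D, list(candidate), rest))
    M = len(deltas)
    expected = sum(deltas) / M
    # p value: proportion of partitions with a delta as small or smaller.
    p = sum(1 for d in deltas if d <= observed + 1e-12) / M
    A = 1.0 - observed / expected
    return observed, p, A, M

# Hypothetical data: six sample units on one gradient, in two clear groups.
values = [0, 1, 2, 10, 11, 12]
D = [[abs(x - y) for y in values] for x in values]
observed, p, A, M = mrpp_exact(D, group_a=(0, 1, 2))
```

With only six sample units there are just 20 partitions, so the p value cannot fall below 0.1 here even though the groups are perfectly separated; this is the same resolution limit that applies to any permutation test.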
MRPP SuggestedProcedure:Step1
First, pick
distance measure
Distance: Sorensen
Second, select
Weights of Groups
Recommend:
n / sum (n)
Third, use Ranks
MRPP Results
Examine
MRPP Results
Examine Results.txt file: T & A Statistics
Smaller
observed delta
A>0
(more similar
within groups)
Significant result: p < 0.05
Fairly small Output: NO bi-plots, NO variance explained
MRPP WhattoReport
Distance Metric Used (Use Sorensen)
How groups were defined: Relate back to Hypothesis
Chance corrected within-group agreement (A)
Associated p value
PO/MRPP References
Polar Ordination:
Bray, J. R. and J. T. Curtis. 1957. An ordination of upland
forest communities of southern Wisconsin. Ecological
Monographs 27: 325-349.
Beals, E. W. 1984. Bray-Curtis ordination: an effective
strategy for analysis of multivariate ecological data. Advances
in Ecological Research 14: 1-55
MRPP:
Mielke, P. W., Jr. 1991. The application of multivariate
permutation methods based on distance functions in the earth
sciences. Earth-Science Reviews 31:55-71.
Zimmerman, G. M., H. Goetz, and P. W. Mielke, Jr. 1985. Use
of an improved statistical method for group comparisons to
study effects of prairie fire. Ecology 66: 606-611.
ForthePeer Review
Look for a gradient (one axis):
Polar Ordination
Compare groups:
MRPP
Suggestions:
- If you have a categorical value (in canyon / outside): MRPP
- If you have continuously changing samples (across latitude or
depth): you can test for N / S OR shallow /deep gradients
- If you have the diet or habitat of multiple species, you can use
them as groups