PC-ORD - Example Environmental Analysis

Multivariate Analysis
Objectives:
Introduce the theory of multivariate analysis
Discuss the nature of community composition data
Learning Outcomes:
Understand the principles underlying multivariate analysis
Become familiar with matrix data and their use in PC-ORD
Multivariate Data Analysis Background

The need for multi-variate analysis arises whenever more
than one characteristic is measured on a number of individuals,
and relationships among the characteristics make it necessary
for them to be studied simultaneously (Krzanowski 1972)
Community data are multi-variate because each sample
unit is characterized by:
the abundance (or presence / absence) of a number of
intercorrelated species
a set of (cross-correlated) environmental factors affecting
species distributions
Multivariate Data Analysis
Why are species data cross-correlated?
If there are no specific niches:
Species are still using a finite amount of resources
If there are specific niches:
Species are affecting each other's distributions / abundances
Community Analysis Introduction
Community ecologists analyze the effects of multiple
environmental factors (variables) on large numbers of
co-occurring species (simultaneously), and deal with
substantial statistical errors (measurement / structural).
Community analysis techniques fall into two groups:
classification and ordination
Community Analysis Introduction
Classification versus ordination
Classification is the placement of species and / or
sample units into (discrete) groups
Ordination is the arrangement or ordering of species
and / or sample units along gradients
What is an Ecological Community?
Two views have dominated the debate over the nature of
ecological communities since the 1920s:
Clements: discrete unit
Gleason: loose assemblage of species
What is an Ecological Community?
Clements' Perspective:
Discrete entities with recognizable boundaries
The community is fully integrated functionally
Species have coevolved, enhancing their interdependence
Gleason's Perspective:
Community is a chance association of species with
similar adaptations and ecological requirements
No distinct boundaries where communities meet
Ecological Communities Scale Matters
An ecozone or biogeographic realm is the largest-scale
biogeographic division of the earth's surface, based on historic
and evolutionary distribution patterns of organisms
(Spalding et al. 2007)
Ecological Communities Scale Matters
Ecozones consist of clusters of adjacent
ecoregions that span several habitat types,
but have a strong biogeographic affinity
(Spalding et al. 2007)
Small areas where physical / biological properties change
What do Community Data Look Like?
[Figure: number of species vs. abundance of species, following a log-normal distribution; abundance axis transformed (log scale)]
Properties of Community Data
Most species infrequent:
The majority of species are present in a minority of locations,
and contribute little to the overall abundance
Some species abundant:
Few species dominate; some are very numerous
Data tend to be sparse:
A large portion of entries are zeros (species absent)
[Figure axes: number of species vs. abundance of species]
The number of important factors is small:
A few factors can explain the majority of the
explainable variation (variance)
Especially if we create synthetic variables:
combinations of multiple collinear variables
There is substantial noise in the data:
Even under ideal circumstances, replicate
samples vary from each other due to stochastic
events - and potentially through observer error
We can use this overlap (or smear) to quantify
the degree of sample similarity along gradients
There is much redundant information:
species often share similar distributions
replicate samples vary from each other
This redundancy is at the core of multi-variate statistics
DATA: Environmental Variables, Species, Sample Units
Patterns: Associations between Species * Environmental Variables * Sample Units
Format of Community Data
Multivariate methods operate on a community data matrix
(or species by sample matrix).
A community data matrix may have taxa (species) as rows
and samples as columns.
PC-ORD, however, terms the opposite orientation a NORMAL matrix:
Rows: Sample Units
Columns: Species
Plant Community Composition (Normal Matrix)
          Spp1    Spp2    Spp3
Sample1   10      2       1
Sample2   0.50    0.40    0.10
Sample3   1       0       1
Samples refer to the basic unit of observation
The units are quadrats, transects, grid cells
Environmental Matrix
          Char1   Char2   Char3
Sample1   Oahu    Kauai   Oahu
Sample2   0.53    0.47    0.12
Sample3   25      37      56
In linear algebra, the transpose of a matrix $A$ is another
matrix $A^T$ (also written $A^{tr}$, ${}^tA$, or $A'$) created by:
writing the rows of $A$ as the columns of $A^T$
writing the columns of $A$ as the rows of $A^T$
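The transpose rule can be sketched in a few lines of Python. This is only an illustration using the plant community example from this section; the function name `transpose` and the variable names are assumptions, and PC-ORD performs this operation through its own menus.

```python
# Normal matrix: rows = sample units, columns = species.
samples = ["Sample1", "Sample2", "Sample3"]
species = ["Spp1", "Spp2", "Spp3"]
normal = [
    [10.0, 2.0, 1.0],
    [0.50, 0.40, 0.10],
    [1.0, 0.0, 1.0],
]


def transpose(matrix):
    """Write the rows of `matrix` as the columns of the result."""
    return [list(column) for column in zip(*matrix)]


# Transposed matrix: rows = species, columns = sample units.
transposed = transpose(normal)
for name, row in zip(species, transposed):
    print(name, row)
```

Transposing twice returns the original matrix, which is a quick sanity check after converting between the two orientations.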
PC-ORD Install the Software
Open Help and Read:
Getting Started
Using PC-ORD
Introduction
Using PC-ORD for Multivariate Data Analysis
PC-ORD was written for community ecologists
and includes procedures best suited for species data
This class will help you in three areas:
1. to recognize which techniques are appropriate for
specific data types and analysis objectives
2. to become familiar with how to use PC-ORD within
the context of a proper analysis process
3. to understand what PC-ORD does and when you
might need specific features
Using PC-ORD for Multivariate Data Analysis
In this class we will cover the following issues / techniques:
General use of multivariate analysis
using PC-ORD
summarizing data / results
Data analysis flow
data transformations
selecting a distance measure
Ordination techniques in PC-ORD:
PCA, NMS, polar ordination
Grouping techniques in PC-ORD:
MRPP, indicator species analysis, Mantel tests
PC-ORD Example of Datasets
PC-ORD works with multiple datasets arranged in matrices:
Main Matrix
Secondary Matrix
PC-ORD File Types
Data Matrices: Main / Secondary
Main = Data of Interest; Secondary = Associated Data
Stored as *.wk1 spreadsheet files, created with Excel
Can be exported from Excel in the worksheet format WK1 (1-2-3)
Main Matrix Format
There are certain conventions:
List number of sample units and species
Name of sample units (list them in column 1)
Describe variable types and names (list them on rows)
Add data in columns (species abundance)
Second Matrix Format
There are certain conventions:
List number of sample units and variables
Name of sample units (list them in column 1)
Describe variable types and names (list them on rows)
Add data in columns (environmental data)
Main / Secondary Matrix Format
1. Cells A1 and B1 indicate the number of entities (rows) and
what the entities are (8-character maximum).
2. Cells A2 and B2 indicate the number of attributes (columns)
and what the attributes are (8-character maximum).
3. Row 3 contains single letters which indicate the variable
type for each column. There are three acceptable values:
Q = Quantitative C = Categorical M = Mixed
4. Row 4 contains the name for each column variable. ONLY
THE FIRST 8 characters are used.
5. Rows 5 and below contain a row name in the first cell,
followed by numeric values for each column attribute. Row
names, like the column names, are limited to 8 characters.
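The five layout rules above can be sketched as a small Python helper that builds the spreadsheet rows before they are typed or pasted into Excel. The function name `build_pcord_rows`, the labels "plots" and "species", and the blank first cell in rows 3 and 4 are assumptions made for illustration; check the PC-ORD help for the exact layout your version expects.

```python
def build_pcord_rows(row_label, col_label, col_types, col_names, data):
    """Return spreadsheet rows laid out per the five conventions above.

    `data` maps each row name to its list of values (one per column).
    """
    if len(col_types) != len(col_names):
        raise ValueError("need one variable type per column name")
    if any(t not in ("Q", "C", "M") for t in col_types):
        raise ValueError("variable types must be Q, C, or M")
    for name, values in data.items():
        if len(values) != len(col_names):
            raise ValueError(f"row {name!r} has the wrong number of values")

    rows = [
        [len(data), row_label[:8]],         # row 1: cells A1 and B1
        [len(col_names), col_label[:8]],    # row 2: cells A2 and B2
        [""] + list(col_types),             # row 3: blank first cell is assumed
        [""] + [n[:8] for n in col_names],  # row 4: first 8 characters only
    ]
    for name, values in data.items():       # rows 5 and below
        rows.append([name[:8]] + list(values))
    return rows


rows = build_pcord_rows(
    "plots", "species", ["Q", "Q", "Q"], ["Spp1", "Spp2", "Spp3"],
    {"Sample1": [10, 2, 1], "Sample2": [0.5, 0.4, 0.1], "Sample3": [1, 0, 1]},
)
for row in rows:
    print(row)
```

Truncating names to 8 characters in the helper mirrors what the rules say PC-ORD does on import, so duplicate names after truncation become visible before the file is loaded.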
Import Options
Several formats allowed from Excel
NOTE: The same formatting rules hold for all import file types
PC-ORD Other File Types
Graph Row / Graph Column = Ordination scores for rows
and columns (*.gph text files) created by PC-ORD
Result = Procedure output (*.txt text file) created by PC-ORD
Dendrograms / Plots = (*.den) (*.str) created by PC-ORD
Project = After saving a main, second, graph and result file,
they can be grouped together (*.prj file) specific to PC-ORD
Sample Dataset: Bryophyte mats from Oregon
Datasets posted on the course web-page:
aMoss1M = normal matrix of average relative % volume
of epiphytic bryophytes
aMoss2M = environmental matrix with measurements from
the same sample units where bryophytes were sampled
Objectives:
Investigate association of different species in space
Study environmental correlates of species distributions
Start Exploring PC-ORD
Objectives:
Discuss species / environmental associations
Explore the nature of community composition data
Learning Outcomes:
Understand the value of bivariate plots and species / env. plots
Become familiar with graphing options and formatting in PC-ORD
Species on Environmental Gradients Ideal
Robert H. Whittaker (Ed.), Classification of Plant Communities,
1978 (Handbook of Vegetation Science), Kluwer Academic Publishers
Ideal species distributions across environmental gradients:
Gaussian Response: Smooth normal curves
Characterized by: mean + SD, peak
Linear Response: Smooth lines
Characterized by: range, slope
Gaussian (Normal) Distribution
Figure 5.1. Hypothetical species abundance in response
to an environmental gradient. Lettered curves represent
different species. Figure adapted from Whittaker (1954).
Linear Distribution
Figure 5.2. Hypothetical linear responses of species
abundance to an environmental gradient. Lettered lines
represent different species.
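The two idealized response shapes can be written down as functions of the gradient. The sketch below is a minimal illustration, assuming a bell-shaped curve parameterized by its optimum (mean), SD, and peak, and a straight line restricted to a range; the function names and all parameter values are hypothetical.

```python
import math


def gaussian_response(x, optimum, sd, peak):
    """Abundance at gradient position x for a bell-shaped response."""
    return peak * math.exp(-((x - optimum) ** 2) / (2 * sd ** 2))


def linear_response(x, intercept, slope, x_min, x_max):
    """Abundance at gradient position x for a linear response in a range."""
    if not x_min <= x <= x_max:
        return 0.0  # species absent outside its range
    return max(0.0, intercept + slope * x)


gradient = [i * 0.5 for i in range(21)]  # positions 0.0, 0.5, ..., 10.0
species_a = [gaussian_response(x, optimum=3.0, sd=1.0, peak=40.0) for x in gradient]
species_b = [linear_response(x, 5.0, 2.0, x_min=0.0, x_max=10.0) for x in gradient]

for x, a, b in zip(gradient, species_a, species_b):
    print(f"{x:4.1f}  {a:6.2f}  {b:6.2f}")
```

Plotting several such curves with different optima reproduces the kind of idealized figure described in the captions above.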
Again: A Matter of Scale
Study 1 (4-6 km depth): Species prefers shallow habitat
Study 2 (0-3 km depth): Species prefers deep habitat
Study 3 (3-4 km depth): Species shows no preference
Response shapes shown: Gaussian, Linear
Sometimes the answer is scale-dependent
Species on Environmental Gradients Real
3 important issues to consider:
Zero-truncation Problem:
Solid Curves:
Complex Curves:
Observation: Species are often below their optimal
abundance, given the environmental factor: Why?
Other limiting factors
(e.g., other environmental factors,
other species, life-history, chance)
[Figure: points represent species abundance]
Species on Environmental Gradients Real
[Figure: observed species responses along gradients. Panel labels: Linear Regression, Fitted Envelope, Abrupt Ranges (Boundaries), Multiple Modes, Linear Responses, Peaks (Optima); r values shown: 0.30, 0.34, 0.21, 0.44]
Community Analysis Bivariate Plots
More fruitful to explore how pairs of species
abundances are related with bivariate plots.
[Figure: Species A vs. Species B, marking joint absences (0, 0) and joint occurrences (lots, lots); examples of perfect correlation and no correlation]
Bivariate plots from pairs of species responses
to the same environmental gradients
[Figure: positively associated and negatively associated species pairs; both have joint absence (0, 0) data; joint occurrence versus single occurrence]
Community Analysis Correlations
Normal Clouds: r should be positive
Dust Bunnies: r should be negative
Community Analysis Correlations
Consider two species with different habitat responses
(a negative association)
Beware:
As we sample beyond their habitats, we record more
and more joint absences
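The effect of joint absences on a correlation coefficient can be illustrated with a small, made-up example. In the sketch below the abundance values are hypothetical, and `pearson_r` is a hand-written helper rather than a PC-ORD routine: two species that are perfectly negatively associated within their shared habitat appear positively correlated once many (0, 0) samples from outside that habitat are added.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical abundances inside the habitat range of both species.
species_a = [5, 4, 3, 2, 1]
species_b = [1, 2, 3, 4, 5]
r_within = pearson_r(species_a, species_b)

# Sampling beyond their habitats adds joint absences (0, 0).
extra = 20
r_beyond = pearson_r(species_a + [0] * extra, species_b + [0] * extra)

print(f"r within habitat:       {r_within:+.3f}")
print(f"r with joint absences:  {r_beyond:+.3f}")
```

The sign of r flips even though the relationship between the species where they occur has not changed, which is why joint absences deserve attention before correlations are interpreted.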
Summary
Plots of species responses (abundance) to single
environmental variables are informative:
unimodal / multimodal
linear / normal
peak (optimum)
Bivariate plots (sp1 vs sp2) are more informative, since
they allow us to integrate all the possible responses
of the species to the environmental variables
Typical responses: normal cloud and dust bunnies
PC-ORD Graphing Tools
Scatterplot: two variables plotted at once (from either matrix)
Jittering moves points randomly to allow users to see overlap
Scatterplot matrix: plots pair-wise combinations
(up to ten variables) at once (from either matrix)
PC-ORD Scatterplot Figures
Examples of Scatterplots:
Single (2 species)
Multiple (Matrix)
PC-ORD Dominance Curves
Dominance curves: To study distribution
of abundance among species in sample.
How are individuals distributed across species?
Program creates result.txt file (with all species)
Species - name
RankAbun - ranking abundance (1: most numerous)
Log(SumAbund) - log (base 10) of total sum
Sum - sum of all abundance data for species
RankFreq - ranking frequency (1: most frequent)
Freq - frequency (number of non-zero counts)
Mean - average
S.Dev. - SD
CV% - 100 * (SD / Mean)
V/M - Variance / Mean ratio
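The columns of the result file can be reproduced by hand for a small matrix, which helps when checking the output. The sketch below is an approximation under stated assumptions: it uses the sample standard deviation (n - 1 denominator), breaks rank ties by input order, and the function name `species_summary` is hypothetical; PC-ORD's own conventions may differ.

```python
import math
import statistics


def species_summary(abundance_by_species):
    """Summarize each species column of a sample-by-species matrix."""
    table = {}
    for name, values in abundance_by_species.items():
        total = sum(values)
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD (n - 1): an assumption
        table[name] = {
            "Sum": total,
            "Log(SumAbund)": math.log10(total) if total > 0 else float("nan"),
            "Freq": sum(1 for v in values if v != 0),  # non-zero counts
            "Mean": mean,
            "S.Dev.": sd,
            "CV%": 100 * sd / mean if mean else float("nan"),
            "V/M": sd ** 2 / mean if mean else float("nan"),
        }
    # Rank 1 = most numerous / most frequent; ties keep input order.
    by_sum = sorted(table, key=lambda s: table[s]["Sum"], reverse=True)
    by_freq = sorted(table, key=lambda s: table[s]["Freq"], reverse=True)
    for rank, name in enumerate(by_sum, start=1):
        table[name]["RankAbun"] = rank
    for rank, name in enumerate(by_freq, start=1):
        table[name]["RankFreq"] = rank
    return table


summary = species_summary({
    "Spp1": [10, 0.5, 1],
    "Spp2": [2, 0.4, 0],
    "Spp3": [1, 0.1, 1],
})
for name, stats in summary.items():
    print(name, stats["RankAbun"], round(stats["Sum"], 2), stats["Freq"])
```

A dominance curve is then the plot of Log(SumAbund) against RankAbun.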
PC-ORD Distributions
Distributions:
How are your data distributed across environmental conditions?
Depends on whether a variable is continuous or discrete.
Discrete variables are easily summarized with a frequency
distribution counting the number of occurrences of each
discrete value in the variable.
Continuous variables are summarized using smooth density
distributions - the frequency of observations along a scale.
PC-ORD uses several methods to represent distributions,
along with some classic distributions for comparison
(normal, lognormal, Poisson, binomial, negative binomial).
The program estimates moments from the observed data,
and uses them as parameters in theoretical distributions.
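The idea of estimating moments and plugging them into a theoretical distribution can be sketched as follows. This is a generic method-of-moments illustration with made-up observations, not a description of PC-ORD's internal code; the function names are hypothetical.

```python
import math
import statistics


def fit_normal(values):
    """Estimate the normal parameters (mean, SD) from the data."""
    return statistics.mean(values), statistics.stdev(values)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given moments."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))


def poisson_pmf(k, rate):
    """Probability of count k for a Poisson with the given mean."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


observed = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # hypothetical counts
mean, sd = fit_normal(observed)

# Reference curves evaluated with the estimated moments as parameters.
reference_normal = [(x, normal_pdf(x, mean, sd)) for x in range(0, 9)]
reference_poisson = [(k, poisson_pmf(k, mean)) for k in range(0, 9)]

for (x, dens), (_, prob) in zip(reference_normal, reference_poisson):
    print(f"{x}  normal={dens:.3f}  poisson={prob:.3f}")
```

Overlaying such reference curves on the observed histogram shows at a glance which classic distribution the variable resembles.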
PC-ORD Distributions
[Figure: the observed data plotted with reference data]
The user can change the number / size of bins
Other curves included to show reference distributions
PC-ORD Distributions
The Main Matrix Distributions and Second Matrix
Distributions ask the following:
Variable (Choose one variable from the matrix)
Distribution type (Discrete or Continuous)
Curve steps (number of increments along the axis)
Constant for lognormal distribution (leave default)
Note: Kernels are Smoothing Parameters
(Read Instructions in PC-ORD help)
Your Task
Start Exploring PC-ORD using bryophyte matrices (1, 2)
Load data (import and open)
Format and export figures
Make scatterplots (species pairs & species / env data)
Find a good example of a dust bunny distribution
Find a good example of a normal cloud distribution
Try other plots: scatterplot matrix, dominance, distribution
Create dominance curve for matrix 1
Pick one species and make distribution plot
Quantifying Distances
Objectives:
Discuss Similarity and Dissimilarity
Go over the rules for selecting a Distance Measure
Learning Outcomes:
Understand the value of Distance metrics
Become familiar with the approach for calculating Distance metrics
Distance Measures
The first step of most multivariate analyses involves creating a
matrix of distances or similarities for all pairs of samples
This step is extremely important:
If information is ignored, it will not be included in results
If noise / outliers are exaggerated, they add distorting influences
There are a myriad of indices and lots of details.
Let's begin with the general principles:
Distance = Difference
Distance Concepts
Resemblance can be quantified in two ways:
dissimilarity or similarity
These two metrics can be translated:
Similarity = 1 - Dissimilarity
Distance metrics can be applied to a variety of data:
Quantitative (discrete / continuous)
Binary (Presence / Absence)
Distances calculated among two types of objects:
either the rows or the columns of the primary data matrix
Sample unit distances -> sample space
Species distances -> species space
Species Space vs Sample Space
[Figure: data matrix of Samples by Species]
Sample Space: Compare species across pairs of samples
Species Space: Compare samples across pairs of species
Species Space vs Sample Space
Species abundance shown as points in sample space
[Figure: Sp 1 and Sp 2 plotted on axes Sample Unit A and Sample Unit B]
Sample unit composition as points in species space
[Figure: SU A and SU B plotted on axes Species 1 and Species 2]
Types of Distance Metrics
Three types: Metric, Semimetric, Nonmetric
A good metric needs to meet four rules:
Minimum distance value is 0 (e.g., for identical samples)
When two items differ, distance > 0
Distances are symmetrical:
from A to B = from B to A
Triangle Axiom: Whenever we have 3 objects,
any one pair-wise distance (between any two objects)
CANNOT be larger than the sum of the other two distances
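The four rules can be checked numerically for any candidate distance function. The sketch below is illustrative only: `check_metric_rules` is a hypothetical helper, and a check on a handful of objects can reveal a violation but cannot prove that a measure is a metric in general. Squared Euclidean distance is used as a familiar example of a measure that fails only the triangle axiom.

```python
import itertools
import math


def check_metric_rules(objects, distance, tol=1e-12):
    """Report which of the four rules hold for `distance` on these objects."""
    names = list(objects)
    d = {(a, b): distance(objects[a], objects[b]) for a in names for b in names}
    pairs = list(itertools.combinations(names, 2))
    return {
        "minimum_is_zero": all(abs(d[a, a]) <= tol for a in names)
        and all(v >= -tol for v in d.values()),
        "positive_when_different": all(
            d[a, b] > tol for a, b in pairs if objects[a] != objects[b]
        ),
        "symmetric": all(abs(d[a, b] - d[b, a]) <= tol for a, b in pairs),
        "triangle_axiom": all(
            d[a, c] <= d[a, b] + d[b, c] + tol
            for a, b, c in itertools.permutations(names, 3)
        ),
    }


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def squared_euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.0, 0.0)}
print(check_metric_rules(points, euclidean))
print(check_metric_rules(points, squared_euclidean))
```

For the three collinear points, the squared distance from A to C is 4 while the two shorter legs sum to 2, so the triangle axiom fails even though the other three rules hold.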
Types of Distance Metrics
Rules compared across Metrics, Semimetrics, and Nonmetrics:
Minimum Distance = 0
Distance > 0
Symmetrical Distances
Triangle Axiom
Both metrics and semimetrics are used in ecology
Watch out: if an index violates any rule other than the triangle axiom, it is a Nonmetric
Do NOT use Nonmetrics
Examples Distance of species
Because each object is represented by 4 variables,
we say that these objects have 4 dimensions.
The coordinate of Apple is (1,1,1,1)
The coordinate of Banana is (0,1,0,0).
Jaccard's coefficient between Apple and Banana is 1/4
Shared: 0 + 1 + 0 + 0
Total: 4 variables (union of banana and apple)
Jaccard's distance between Apple and Banana is 3/4
Examples Distance of samples
We have 2 sets:
A = {2,3,4,6,7}
B = {1,4,5,7,8}
Union between the sets is: {1,2,3,4,5,6,7,8}
Intersection between the sets is: {4,7}
Jaccard's coefficient can be computed as the number of
elements in the intersection set divided by the number of
elements in the union set: 2 / 8 = 0.25
Jaccard's distance: 1 - Jaccard's coefficient = 0.75
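Both worked examples can be reproduced with Python's built-in set operations. This is a minimal sketch; the function name `jaccard_coefficient` is an assumption, and the binary vectors are converted to sets of the positions where the value is 1.

```python
def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(union)


# Set example from the slide.
A = {2, 3, 4, 6, 7}
B = {1, 4, 5, 7, 8}
coefficient = jaccard_coefficient(A, B)  # 2 shared elements / 8 in the union
distance = 1 - coefficient

# Binary example: positions where each object has a 1.
apple = (1, 1, 1, 1)
banana = (0, 1, 0, 0)
apple_set = {i for i, v in enumerate(apple) if v == 1}
banana_set = {i for i, v in enumerate(banana) if v == 1}
fruit_distance = 1 - jaccard_coefficient(apple_set, banana_set)

print(coefficient, distance, fruit_distance)
```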
Selecting a Distance Metric
Two things to consider:
Input: Acceptable domain of input data
Presence / Absence vs Count Data
Positive / Negative values
Output: Range of output distances
Within bounded range
Meet triangle axiom
Selecting a Metric Proportional Overlap
Jaccard Index Binary Data Example
Two species (A, B). For each sample, values are 0 or 1.
Total number of each combination, as follows:
M11 = number of attributes where A = B = 1 (Joint Presence)
M01 = number of attributes where A = 0, B = 1 (only B present)
M10 = number of attributes where A = 1, B = 0 (only A present)
M00 = number of attributes where A = B = 0 (Joint Absence)
Each sample falls into one of these four combinations:
M11 + M01 + M10 + M00 = n (sample size)
Jaccard similarity coefficient: $J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$
Jaccard distance: $d_J = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}} = 1 - J$
The Jaccard index measures similarity
between samples, and is defined as the size
of the intersection divided by the size of the
union of the sample sets (Jaccard 1901):
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
The Jaccard distance, which measures
dissimilarity between samples, is obtained
by dividing the difference of the sizes of the
union and the intersection of two sets by the
size of the union:
$d_J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = 1 - J(A, B)$
Jaccard distance is complementary to the Jaccard coefficient
and is obtained by subtracting the Jaccard index from 1
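JS = w / (A + B - w)
JD = 1 - JS = (A + B - 2w) / (A + B - w)
where w is the abundance shared by the two samples, and A and B are their total abundances
Properties:
Proportion of combined abundance not shared
Works with binary data or quantitative data (counts)
Output is metric (does meet triangle axiom)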
Sorensen index (Sorensen 1948), also known as Sorensen's
similarity coefficient (QS), is defined as:
$QS = \frac{2C}{A + B}$
Where:
A and B are the number of species in sample A and B,
respectively, and C is the number of shared species.
Sorensen Distance is complementary to the QS metric, and is
obtained by subtracting the QS index from 1
SS (Bray-Curtis) = 2w / (A + B)
SD = 1 - SS = (A + B - 2w) / (A + B)
Properties:
Proportion of shared abundance
(divided by total abundance)
Works with binary data or quantitative data (counts)
Output is semimetric (does not meet triangle axiom)
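The abundance-based versions of both indices can be sketched directly from the quantities w, A, and B, with each distance taken as one minus the corresponding similarity. The sketch assumes that w is the sum of the smaller of the two abundances for each species and that A and B are the sample totals; the function names are hypothetical.

```python
def shared_and_totals(sample_a, sample_b):
    """Return (w, A, B): shared abundance and the two sample totals."""
    w = sum(min(x, y) for x, y in zip(sample_a, sample_b))
    return w, sum(sample_a), sum(sample_b)


def jaccard_distance(sample_a, sample_b):
    w, a, b = shared_and_totals(sample_a, sample_b)
    return 1 - w / (a + b - w)


def sorensen_distance(sample_a, sample_b):
    w, a, b = shared_and_totals(sample_a, sample_b)
    return 1 - 2 * w / (a + b)


# Quantitative data: two sample units with three species each.
sample1 = [10, 2, 1]
sample3 = [1, 0, 1]
print(jaccard_distance(sample1, sample3))   # 1 - 2/13
print(sorensen_distance(sample1, sample3))  # 1 - 4/15

# Binary data: the Apple / Banana example gives a Jaccard distance of 3/4.
print(jaccard_distance([1, 1, 1, 1], [0, 1, 0, 0]))
```

With presence / absence data, w reduces to the number of shared species and A and B to the species counts, so the same functions cover both data types.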
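Selecting a Metric Continuous Distance
Euclidean Distance:
$ED_{ih} = \sqrt{\sum_{j=1}^{p} (a_{ij} - a_{hj})^2}$
where a_ij is the abundance (A) of species j in sample unit i,
i and h are the two sample units being compared,
and p is the number of dimensions (axes used)
Widely Used:
For both species abundances and environmental conditions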
Summary Distance Metrics
Many distance indices are available. How to select one?
Consider the 4 rules: metric / semimetric
Look for proportionality (scaled from 0 to 1)
Think about what makes intuitive sense to you
Check what indices are compatible with given tests
The choice of a distance metric is based on empirical
evidence (e.g., methodological studies, previous literature)
Recommendations: (According to PC-ORD)
Sorensen index shown to be effective for assessing
species and sample similarity (community data)
Euclidean distance well suited for environmental data
Summary Recommendations
Sorensen:
Quantifies proportion shared abundance among species
Works well for community data (empirically)
Relative Sorensen:
Includes general relativization (by totals)
Summary Recommendations
Euclidean:
Sensitive to outliers
Bad with community data (lots of 0s)
Relativized Euclidean:
Euclidean after scaling abundances to %s
Focuses on relative abundance among species
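The difference between raw and relativized Euclidean distance can be seen with two hypothetical samples that have the same composition but different totals. The sketch follows the description above (scaling abundances to percentages of the sample total before computing the distance); this is an assumption about the relativization, and PC-ORD's own definition may differ. Function names are hypothetical.

```python
import math


def euclidean_distance(sample_i, sample_h):
    """Square root of the summed squared differences across species."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sample_i, sample_h)))


def relativize_by_total(sample):
    """Express each abundance as a percentage of the sample total."""
    total = sum(sample)
    if total == 0:
        raise ValueError("cannot relativize an empty sample")
    return [100 * value / total for value in sample]


def relativized_euclidean(sample_i, sample_h):
    return euclidean_distance(
        relativize_by_total(sample_i), relativize_by_total(sample_h)
    )


# Same species composition, tenfold difference in total abundance.
sparse = [10, 0, 0]
dense = [100, 0, 0]
print(euclidean_distance(sparse, dense))     # driven by total abundance
print(relativized_euclidean(sparse, dense))  # identical relative composition
```

The raw distance treats the two samples as far apart, while the relativized version treats them as identical, which matches the focus on relative abundance noted above.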
Summary Distance Metrics
Additional, dataset-specific considerations:
Are the data very noisy?
Is there a lot of variation in the data?
Are there many 0s in the data?
Are the environmental responses not normal?
PC-ORD Advisor Tools
Advisor menu provides two tools to help you decide what
transformations or analyses might be appropriate for your data
Use Show Current Profile to generate summary statistics
describing some of the important properties of your data
The Wizard is a decision tree to help you decide what data
adjustments or analyses to use
PC-ORD Current Profile
PC-ORD Profile Warnings
Notes are written at the end of the profile, if certain conditions
are encountered. These are listed below.
1. If fields are filled with asterisks, then they could not be
calculated -- incompatible with the data.
2. Negative numbers in main matrix are incompatible with CV
3. Negative numbers in second matrix are incompatible with CV
4. Warning: one or more CV (coefficient of variation) could not
be calculated; replaced with missing value indicator: 99999.99
5. Negative numbers are present in main matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.
6. Negative numbers are present in 2nd matrix, so Sorensen
distance could not be used for outlier analysis or calculation of
average half changes.
Ordination Methods PCA
Objectives:
Discuss PCA within context of Ordination Methods
Go over the output of Ordination Methods
Learning Outcomes:
Become familiar with the approach for doing a PCA
Ordination
Arranging items (samples / species) along one or more axes
Graphical summarization of complex relationships
Extracting one or more dominant patterns (% variance)
Synthesis (reduction) of large datasets into fewer variables
These variables are then related to environmental variables
Components are independent from each other
Ordination Diagrams
Typically, a 2-dimensional plot of samples / species in terms
of synthetic axes (combinations of variables)
Ideally, the distance between points in ordination space is
proportional to the underlying distance measures
NOT LIKE A REGRESSION (Axes uncorrelated, by definition)
Plot samples / species
If possible, use points for samples, overlays for species
Also can code samples by habitat types (using a key
environmental variable)
[Figure: ordination plots on Axis 1 vs. Axis 2 showing stands (STAND1 to STAN19) and sample units (SU1 to SU19), with an overlay for Festuca idahoensis and samples coded by Topo Class (draw, flat, slope, ridge)]
Ordination Results
How many axes? How many discrete signals in dataset
Different rules for selecting number of axes
Significance assigned to axes and contributing variables
Yet, most studies select 2 or 3 axes
Interpretation of results: overlays, correlations with axes
Ordination Results
How many axes? Number of discrete signals in dataset
Coefficient of determination: % variance represented
Rules of thumb: relative (% variance vs NMDS axes)
vs absolute (PCA eigenvalues)
Yet, most studies select 2 (or 3) axes: Intuitive explanation
Strength assigned to axes and contributing variables (r, r^2)
Ordination Results
Interpretation of results: overlays, correlations with axes
Conclusions: Species ALSA negatively correlated with
Axis 1. Variance explained = 25%
Correlations shown: r = +0.031, tau = +0.045; r = -0.534, tau = -0.327
Interpretation of time change: successional vectors
Ordination Results
Beware when interpreting correlation coefficients:
outliers can have strong influence
coefficients meaningless if relationships not linear
correlation coefficients invalid with binary data
Graphical representation of environmental correlates using
Joint Plots: the angles / lengths indicate strength and direction
of environmental variable association with ordination axes
[Figure: joint plot on Axis 1 vs. Axis 2 with vectors for Bryo%, Age, Last burn, Canopy, Radiation, Topo position, UnderRad]
Ordination Results
External Evaluation: Correlations with second matrix
(e.g., how are environmental variables related to the axes?)
Beware of biases: proving the expected results
Comparisons with null model: Comparing results of
real data with those from randomized data is promising, but
can yield no clear results if strong outliers cause spurious
significant patterns with the random dataset. Beware
Principal Components (PCA)
Finding the best-fit straight lines to describe a system of points
in multiple dimensions (Pearson 1901)
$Y = a_0 + a_1 X_1 + a_2 X_2 + \dots$
Start with cloud of n points
in p-dimensional space
Center the axes in the
point cloud (centroid)
Rotate axes to maximize
the variance along axes
As rotation angle changes,
the variance changes
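For two variables, the center-and-rotate description can be worked through by hand. The sketch below is a two-dimensional illustration only, with hypothetical names and data: it centers the cloud on its centroid, then uses the closed-form rotation angle that maximizes the variance along the first axis. Real PCA on p variables uses an eigenanalysis of the cross-products matrix instead.

```python
import math


def principal_axis_2d(points):
    """Center a 2-D cloud and rotate it so axis 1 has maximum variance."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]  # axes at the centroid

    # Variances and covariance of the centered cloud.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle at which the variance along the rotated first axis peaks.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    scores = [(x * c + y * s, -x * s + y * c) for x, y in centered]
    var1 = sxx * c * c + 2 * sxy * s * c + syy * s * s
    var2 = sxx * s * s - 2 * sxy * s * c + syy * c * c
    return theta, (var1, var2), scores


cloud = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
theta, (var1, var2), scores = principal_axis_2d(cloud)
print(math.degrees(theta), var1, var2)
```

For points lying on the line y = x, the first axis rotates by 45 degrees and carries all of the variance, leaving none for the second axis.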
Principal Components (PCA) When to Use
Normality: Ideal for normal data with approximately linear
relationships amongst variables. Rarely suited to community data
Beware of heterogeneous community data
Critical to justify the use of this linear approach
Sample size: Need a good estimate of correlation structure
Stronger patterns require smaller sample sizes
Rule of thumb: 5 sample units per variable
(Tabachnick and Fidell 1989)
Increasing the number of variables strengthens results
(Pillar 1999)
Principal Components (PCA) Normality
Assessed with skewness (asymmetry) / kurtosis (peakiness)
skew = 0 (normal data)
skew > 0 (right tail too long)
skew < 0 (left tail too long)
kurtosis = 0 (normal data)
kurtosis > 0 (more peaky)
kurtosis < 0 (less peaky)
Rule of thumb (McCune & Grace 2002):
-1 < Skew < 1
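Skewness and kurtosis can be computed from the central moments of a variable and compared against the rule of thumb. The sketch below uses the simple moment-based estimators (no small-sample correction), with kurtosis reported as excess kurtosis so that normal data give 0; PC-ORD may apply a different correction. The function names and data are hypothetical.

```python
def skew_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def passes_skew_rule(values):
    """Rule of thumb: -1 < skew < 1."""
    skew, _ = skew_and_kurtosis(values)
    return -1 < skew < 1


symmetric = [1, 2, 3, 4, 5]
sparse_species = [0, 0, 0, 0, 0, 0, 0, 0, 0, 50]  # mostly zeros, one big count

print(skew_and_kurtosis(symmetric), passes_skew_rule(symmetric))
print(skew_and_kurtosis(sparse_species), passes_skew_rule(sparse_species))
```

The sparse, zero-heavy variable has a long right tail and fails the rule, which is the typical situation for community data.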
Principal Components (PCA) Linearity
Use bivariate scatterplots to assess linear relationships
Solutions: Data transformations
Beware of outliers: they can change cross-correlations
Solutions: Transform the data; Remove those data
[Figure: example scatterplots with r = 0.92 and r = -0.96]
Principal Components (PCA) Reporting
What type of cross-correlation matrix did you use?
Correlation or Covariance - Use Euclidean distance
If used with community data, justify using this linear
model for species data
Were assumptions of linearity / normality met?
How many axes were interpreted, and what proportion
of variance was explained by these axes?
Describe the axes and the individual / cumulative variance
Principal eigenvectors - Test of significance?
Not necessary, but an option using randomization tests
Rotation of the solution? Use of interpretation aids?
Explain overlays and correlations of variables with axes
Principal Components (PCA) Example
Setup:
PCA uses only Euclidean Distances (real metric)
Matrix can be calculated in three ways:
Correlation: very susceptible to outliers (DO NOT USE)
Variance / Covariance: Less sensitive to outliers (USE)
Non-centered: Experimental (DO NOT USE)
Setup II:
Scores for species can be calculated in two ways:
Distance-based: Relates species and samples to each axis;
represents species as vectors from centroid (Standard) (USE)
Weighted Average: Species represented as points outliers
(DO NOT USE)
Setup III:
Output Options:
Cross-Product Matrix: Shows pair-wise distances (USE)
Randomization tests: Use bootstrap to assess
significance of the results (USE)
Options selected: Variance, Distance-based, Cross-product Matrix, Randomization
Setting up the Randomization Test:
Seed: make multiple tests comparable by using the same
sequence of random numbers (supply seed)
Runs: number of permutations used in the test
(determines p value of statistic)
Enter a descriptive explanation to document the
analysis; this label will be added to results
Results: Covariance matrix, species distances
Results: Eigenvalues - Variance explained (up to 10 axes)
Eigenvalues are proportional to variance explained
Broken-stick eigenvalues are produced by chance
Results: Never explain 100% of variance (axes = variables)
[Figure: observed vs. expected variance explained]
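The comparison of observed eigenvalues with broken-stick expectations can be sketched as follows. The code assumes the standard broken-stick model, in which the expected proportion of variance for axis k out of p is (1/p) times the sum of 1/j for j from k to p; the eigenvalues in the example are hypothetical, and the function names are not PC-ORD's.

```python
def broken_stick_proportions(p):
    """Expected proportion of variance for each of p axes under chance."""
    return [sum(1.0 / j for j in range(k, p + 1)) / p for k in range(1, p + 1)]


def axes_exceeding_broken_stick(eigenvalues):
    """Flag axes whose observed share of variance beats the expectation."""
    total = sum(eigenvalues)
    expected = broken_stick_proportions(len(eigenvalues))
    return [ev / total > exp for ev, exp in zip(eigenvalues, expected)]


expected = broken_stick_proportions(3)
print([round(e, 4) for e in expected])

# Hypothetical eigenvalues from a PCA of three variables.
eigenvalues = [2.4, 0.4, 0.2]
print(axes_exceeding_broken_stick(eigenvalues))
```

An axis whose observed share of variance exceeds the broken-stick share explains more than would be expected by chance, which is one common rule for deciding how many axes to interpret.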
Results: Randomization tests
Principal Components (PCA) Final Result
Correlation with Axes
Results: Graphs (for Eigenvalues > 100)
Results: Graphs
Samples: points
Species: vectors
Results: Species Response Graphs
Loadings NEDO: Axis 1: +0.51, Axis 2: -0.74
Loadings STLI: Axis 1: 0.00, Axis 2: 0.00
Options selected: Variance, Weighted average, Cross-product Matrix, Randomization
Same Result - GOOD
Results: Eigenvalues
Same Result - GOOD
Results: Graphs
Samples Labeled
Species Labeled
Display Recommendation:
PC-ORD recommends displaying
species as vectors / samples as points
Rotation:
Useful to show patterns you are interested in.
Need to keep track and report in results
Rotation by NEDO
We stretch the graph along the direction of most
variation of the species
Loadings of NEDO: Axis 1: +0.51, Axis 2: -0.74
PCA Application & Examples
Objectives:
Showcase PCA analysis in PC-ORD and the literature
Learning Outcomes:
Become familiar with the output / results of PCA
Results: Species Loadings onto the PC Axes
Use the scaled eigenvectors
Rotation: Highlights certain patterns. Report in results
NEDO Axes Correlations: Axis 1: +0.51, Axis 2: -0.74
Rotation by NEDO: Stretch plot along direction of most
variation for species
Principal Components (PCA) Paper I
Published Example: Ainley, D.G. et al. (2005).
Objective: Relate densities of the 12 most abundant
species of seabirds to 12 habitat variables:
5 biological, 4 oceanographic, 3 geographic (spatial)
Oceanographic variables examined:
sea-surface temperature / salinity, thermocline depth / strength
Other variables shown: Date, Distance to Fronts, Chl Max, Acoustic Biomass
Data Manipulations To Avoid Biases:
Densities log-transformed to meet normality assumptions
Nevertheless, residuals generated in the regressions for
some species did not meet those assumptions (Skewness /
Kurtosis Test for Normality of Residuals, P < 0.05)
Least-squares regression analysis (ANOVA), however,
is a very robust procedure with respect to non-normality
(Seber, 1977, Kleinbaum et al., 1988)
Yet, while these analyses yield the best linear unbiased
estimator in the absence of normally distributed residuals,
P-values near 0.05 must be viewed with caution (Seber, 1977)
To avoid double-absences:
Only 15-min transects in which any given species was
recorded were analyzed
The total sample size for the 12 species was 1209
Is this an adequate sample size?
Rule of thumb:
5 samples per variable (Tabachnick and Fidell 1989)
1209 / 12 ~ 100 samples per variable
Analysis Methods:
Principal components analysis (PCA), in combination
with Sidak multiple comparison tests, used to assess
differences in habitat selection among 12 seabird species
To test for significant differences in habitat affinities
among seabird species, we used two one-way ANOVAs:
In the first, we tested for differences among PC1 scores of
each species; in the second, we compared the PC2 scores
Considered differences between two species to be
significant if either one or both of the PC1 or PC2 scores
differed significantly
Community-Wide Result: The first and second PC axes
explain 60% of variance in habitat use by 12 seabird species
Species-specific Results:
Species mapped onto two (independent) dimensions
Pair-wise associations denoted by circles
[Figure: axis labels include Salty, Green; Near Fronts; Zoop Prey; Fish Prey]
Principal Components (PCA) Paper II
Published Example: Weichler et al. (2004).
Objective: Relate seabird densities to seven
environmental parameters:
(1) water depth, (2) distance to nearest land, (3) number
of trawlers within a radius of 5 km, (4) sea surface
temperature, (5) water temperature difference (0-10 m),
(6) water temperature difference (0-30 m), and (7) water
temperature difference (10-50 m)
Did Not Report Cross-correlations of Habitat Variables
Data Manipulations To Avoid Biases:
Species densities were selected as variables and 10 min
intervals (samples), were selected as cases
Only species seen in at least five counting intervals were
included, an arbitrary choice that allowed covering a wide
spectrum of species while ignoring those with few occurrences
Only commoner species with numbers exceeding 1% of all
individuals counted were included in the analysis
Dataset of 46 sections of the cruise tracks. Each section
comprised a hydrographic station approximately midway and
10 min intervals in two opposite directions (4-8 km away)
Sample Size: 46 samples / 7 variables: Ratio of about 6.6
Community-Wide Result: Six principal eigenvalues (> 1),
showing % of variation explained and ecological interpretation
Community-Wide Result:
Loadings for the 11 seabird
species and 7 variables on
the six principal eigenvalues
3 principal components:
50 % of variance
6 principal components:
78 % of variance
PrincipalComponents(PCA) Comparisons
Number of Axes:
- Selected 2 easy to interpret (Ainley et al. 2005)
- Selected 6 based on eigenvalues > 1 (Weichler et al. 2004)
Display of Results:
- Plot and table of eigenvalues (Ainley et al. 2005)
- Eigenvalues and interpretation (loadings) (Weichler et al. 2004)
Significance Tests:
- Pairwise species comparisons (ANOVA) (Ainley et al. 2005)
- Correlations with selected variables (Weichler et al. 2004)
PrincipalComponents(PCA) Tools
Percent of pattern
explained in original
distance matrix
Orthogonality of PCA axes
PrincipalComponents(PCA) Tools
Ranking of species scores according to Axis 1

Showing Presence / Absence of species on samples
Categorizing samples by a categorical variable
PrincipalComponents(PCA) Examples
Ainley DG, Spear LB, Tynan CT, Barth JA, Pierce SD, Ford RG, Cowles
TJ (2005). Physical and biological variables affecting seabird distributions
during the upwelling season of the northern California Current. Deep-Sea
Research II 52: 123-143
Weichler T, Garthe S, Luna-Jorquera G, Moraga J (2004). Seabird
distribution on the Humboldt Current in northern Chile in relation to
hydrography, productivity, and fisheries. ICES J. Marine Science
61 (1):148-154
References
Seber, G.A.F. (Ed.), 1977, Linear Regression Analysis. Wiley, New York.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., 1988. Applied Regression
Analysis and other Multivariable Methods. PWS-KENT Publishing Company,
Boston.
Tabachnick, B.G. and L.S. Fidell. 1989. Using Multivariate Statistics. 2nd ed.
New York: Harper and Row.
DataScreeningandTransformations
Objectives:
DiscussStepsforAnalysis:DataScreening,DataManipulation
Goovertheprinciplesofdataexploration
LearningOutcomes:
Bereadytoplanyouranalysis:DevelopMetadataandAnalysisLog
BeabletoscreenandmanipulateyourdatawithPCORD
DataExploration DocumentingFlow
Flow diagram: sequence of changes / analysis
Analysis log: input, output, results
Save all input and output files and data edits
Metadata: data about data
Data flow: list of errors -> clean data
Record for each file: file names, file contents, connections
PC-ORD: links to other software (GIS, Stats)
Products: figures, tables, results
Use clear, descriptive titles (dated)
Save all output files
Keep a flowchart or dated record
Record WHY you did it (you will forget!)
Screening:
Are column / row means and ranges reasonable?
Are the sample sizes correct?
Are there missing data / outliers?
Cleaning:
Fix typos
Erase / Correct incomplete data
Check effects of corrections
Transformations:
Look up assumptions of test
Check data distributions
Make transformations (re-check)
DataExploration DataScreening
Metadata
96 samples
5 variables
Data type?
Explanation
Show Current Profile

% zeros, data ranges, skewness
DataExploration CurrentProfile
% zeros:
species data
Lowest / highest value:
typos (errors)
Skewness:
non-normality
-1 < SK < 1
Outliers:
(in SD units)
2 SD -> about 95%
DataExploration SummaryI
Data Summary:
Mean
SD
Range
Diversity
S = Richness = number of non-zero elements in row
E = Evenness = H / ln (Richness)
H = Diversity = - sum (Pi*ln(Pi)) = Shannon's diversity index
D = Simpson's diversity index for infinite population = 1 - sum (Pi*Pi)
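The four summary measures above can be sketched in a few lines of Python. This is a minimal illustration, not PC-ORD's own code; the function name `diversity_summary` and the example row are hypothetical.

```python
import math

def diversity_summary(row):
    """Return (S, H, E, D) for one sample row of abundances.

    S = richness, H = Shannon diversity, E = evenness, D = Simpson diversity
    (infinite population form), following the formulas on the slide.
    """
    total = sum(row)
    if total == 0:
        # Empty sample: no species present, all indices set to zero
        return 0, 0.0, 0.0, 0.0
    # Proportions Pi of the non-zero species only
    props = [x / total for x in row if x > 0]
    richness = len(props)
    shannon = -sum(p * math.log(p) for p in props)
    # Evenness is undefined for a single species (ln(1) = 0); report 0 here
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    simpson = 1.0 - sum(p * p for p in props)
    return richness, shannon, evenness, simpson

# Hypothetical row: four equally abundant species and one absent species
S, H, E, D = diversity_summary([10, 10, 10, 10, 0])
```

With four equally abundant species, evenness reaches its maximum of 1 and Simpson's index equals 1 - 4 * (0.25 * 0.25) = 0.75.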
DataExploration SummaryII
Skewness:
Steps to Fix
Skewness:
Taking the log
or square root
works for data
with moderate
skewness
DataExploration 1DOutliers
Frequency
distribution of
a univariate
outlier falling
5.5 standard
deviations
above mean
Describe the distribution:
In graph and tabular form
Discrete Distribution
Continuous Distribution
Test for significance: off PC-ORD
[Figure: scatterplot of Sp2 (vertical axis, 0-25) against Sp1 (horizontal axis, 0-25)]
A bivariate outlier that is not a univariate outlier
for either of the two variables Sp1 and Sp2
DataExploration Outliers
[Figure: histogram of Frequency against Average Distance]
Frequency distribution of average relative Euclidean
distances to a sample unit, given a sample size of
25. The sample marked with the red circle is 3.2 SD
units above the mean of the average distances
DataManipulations
You can manipulate data directly in PC-ORD
Modify / Append Data
Delete Columns / Rows
Multiply / Add Constant
Randomly Sample
Shuffle Data
Note: Beals smoothing is experimental. DO NOT USE.
DataTransformations
What are the two reasons for data transformations?
Statistical:
Meet assumptions (normality, linearity, variances, ...)
Express variables in the same units (km, km/hr)
Ecological:
Make distance measures work better
Reduce influence of total quantity (sample totals)
Deal with importance of rare / common species
Identify informative species
DataTransformations Nomenclature
Monotonic: Element values are changed, but
ranks stay the same (e.g., change unit from km to m)
Relativization: Adjusts matrix elements by one
column / row standard (e.g., total, maximum)
Note: Not all transformations are reasonable or

feasible with all types of data (e.g., negative, P/A)
DataTransformation
Monotonic transformations retain ranks, but change values
[Figure: transformed value f(x) plotted against x for several power exponents and for P/A]
Power exponents: 1/2 (square root), 2 (squared), 3 (cubed)
Note: an exponent of 0 is used to recode data as Presence / Absence (0 / 1)
DataTransformations ExampleI
Logarithmic transformation f(x) = ln(x) OR log(x)
This transformation is useful when there is a:
high degree of variation within attributes (e.g., Chl Conc.)
high degree of variation among attributes within a sample
It also helps if there are large outliers and lots of zeros
Note: to log-transform data containing zeros, a
small number should be added to all data points.
With count data, add one, so that: f(0) = log(0+1) = 0
With density data, add a constant smaller than the smallest
possible sample, so that: f(0) = log(0+0.001) = -3
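The two zero-handling cases above can be illustrated with a short Python sketch. The function name `log_transform` and the example values are hypothetical, and base-10 logarithms are assumed because the slide's example gives log(0.001) = -3.

```python
import math

def log_transform(values, constant):
    """Return log10(x + constant) for each value.

    The added constant keeps zeros defined: log10(0 + constant).
    """
    return [math.log10(x + constant) for x in values]

# Count data: add one, so that zeros map to log10(1) = 0
counts = log_transform([0, 9, 99], constant=1)

# Density data: add a constant smaller than the smallest possible sample,
# so that zeros map to log10(0.001) = -3
densities = log_transform([0, 0.009], constant=0.001)
```

Note that the choice of constant sets how far zeros sit from the smallest non-zero values after transformation, so it is worth recording in the analysis log.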
DataTransformations ExampleII
Arcsine / Arcsine-squareroot transformation
This transformation is useful when:

normalizing proportion data (e.g., Percent Cover)
Note: data must range between zero and one, inclusive.
If they are not, you should relativize (general relativization
or relativization by maximum) before selecting this option.
The constant 2 / pi scales the result of arcsin(x) [in
radians] to range from 0 to 1, assuming that 0 <= x <= 1.
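As a rough sketch of the arcsine-squareroot version of this transformation, the Python below applies the 2 / pi scaling to arcsin(sqrt(p)). The function name `arcsine_sqrt` is hypothetical, and the explicit range check is an added assumption that mirrors the note about relativizing first.

```python
import math

def arcsine_sqrt(p):
    """Return (2 / pi) * arcsin(sqrt(p)) for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        # Data outside [0, 1] should be relativized before this step
        raise ValueError("proportion must lie between 0 and 1, inclusive")
    return (2.0 / math.pi) * math.asin(math.sqrt(p))

# Hypothetical percent cover values expressed as proportions
transformed = [arcsine_sqrt(p) for p in (0.0, 0.5, 1.0)]
```

The end points 0 and 1 are unchanged, while intermediate proportions near the extremes are spread out relative to those near 0.5.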
DataTransformation Howto
Note:
Need to
accept
TEMP file
DataRelativization
Relativization
re-scales data using some criterion / standard.
When it is done by columns (e.g., species), variation across
plots is retained, but variation across species is standardized.
Two approaches:
General Relativization: (by totals or sums) makes area
under each species distribution response curve = 1.
(input: x > 0; output: from 0 to 1)
Relativization by Maximum: (by max for column or row)
equalizes the heights of the peaks along the gradient
DataRelativization
Deviations: Value - Mean
Z scores: (Value - Mean) / SD
Binary response: Above (1) / Below (0)
Ranks: Assigns ranks, with tied values sharing the average rank
(e.g., 0, 0, 6, 9 would receive the ranks 1.5, 1.5, 3, 4)
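The column relativizations described on these slides can be sketched as follows. All function names are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and the code is an illustration rather than PC-ORD's implementation.

```python
def relativize_by_total(col):
    """General relativization: each value divided by the column total."""
    total = sum(col)
    return [x / total for x in col]

def relativize_by_max(col):
    """Relativization by maximum: each value divided by the column maximum."""
    peak = max(col)
    return [x / peak for x in col]

def z_scores(col):
    """(Value - Mean) / SD, using the sample SD (assumed n - 1 denominator)."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((x - mean) ** 2 for x in col) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in col]

def ranks(col):
    """Assign ranks, giving tied values the average of their positions."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    result = [0.0] * len(col)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and col[order[j + 1]] == col[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are zero-based
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

column = [0, 0, 6, 9]  # the example column from the slide
by_total = relativize_by_total(column)
by_max = relativize_by_max(column)
standardized = z_scores(column)
ranked = ranks(column)
```

For the slide's example column, `ranked` reproduces the ranks 1.5, 1.5, 3, 4.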
DataRelativization Howto
Note:
Need to
accept
TEMP file
DataExploration Summary
Create naming convention for your files (metadata record)
(DATE_AREA_SP_suffix)
9710_Oahu_WTSH_raw
Create a data flow archive in your analysis notebook
Check assumptions of statistical tests / approaches
(PCA: normality of data, linear relationships)
Visually inspect your data: 1-D, 2-D, many-D.
Look for missing data and outliers in individual datasets
Inspect
relationships between variables (pairs, multiple)
DataManipulation Summary
Add missing data and fix typos
Ensure variables expressed in the same units (km / m)
Select the number and identity of species
(Rare species that occur in a single sample
contribute virtually no information, but add noise)
Look for and deal with outliers: (Remove OR Transform)
Deal with confounding factors, such as the different
magnitude of environmental variables (e.g., depth in m or km)
and the proportional representation of different species
(Relativize your data)
CHAPTER 9
Data Transformations
Tables, Figures, and Equations
From: McCune, B. & J. B. Grace. 2002. Analysis of

Ecological Communities. MjM Software Design,
Gleneden Beach, Oregon http://www.pcord.com
A general procedure for data adjustments

Species data
Table 9.3. Suggested procedure for data adjustments of species data matrices.
Action to be considered | Criteria
1. Calculate descriptive statistics. Repeat this after each step below. (In PC-ORD run Row & column summary) | Always
   Statistics: beta diversity (community data sets); average skewness of columns; coefficient of variation (CV, %) of row totals; CV of column totals
2. Delete rare species (< 5% of sample units) | Usually applied to community data sets, unless contrary to study goals
Species data, cont.

3. Monotonic transformation (if applied to species,
then usually applied uniformly to all of them, so that
all are scaled the same)
A. Average skewness of columns (species)

B. Data range over how many orders of magnitude?
(Count and biomass data often are extreme.)
C. Beta diversity. (Consider presence/absence
transformation for community data when beta diversity is high.)
Species data, cont.
4. Row or column relativizations
What is the question?

Are units for all variables the same?
Is relativization built into the subsequent analysis?
CV of row totals
CV of column totals
What distance measure do you intend to use?
Note: regardless of your decision to relativize or not,
you should state your decision and justify it briefly on
biological grounds.
Species data, cont.

5. Check for outliers based on the average distance of
each point from all other points. Calculate standard
deviation of these average distances. Describe
outliers and take steps to reduce influence, if
necessary
standard deviation | degree of problem
< 2 | no problem
2 - 2.3 | weak outlier
2.3 - 3 | moderate outlier
> 3 | strong outlier
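Step 5 can be sketched in Python as below. This is a minimal illustration under stated assumptions: plain Euclidean distance is used (PC-ORD lets you choose the distance measure), the sample standard deviation is assumed, and the function names and example matrix are hypothetical.

```python
import math

def average_distances(rows):
    """Average Euclidean distance from each sample unit to all others."""
    n = len(rows)
    return [
        sum(math.dist(rows[i], rows[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def outlier_sd_units(rows):
    """Express each average distance in SD units above the mean."""
    avgs = average_distances(rows)
    n = len(avgs)
    mean = sum(avgs) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in avgs) / (n - 1))
    if sd == 0:
        return [0.0] * n  # all sample units equally distant
    return [(a - mean) / sd for a in avgs]

def degree_of_problem(sd_units):
    """Classify using the thresholds from the table above."""
    if sd_units < 2:
        return "no problem"
    if sd_units < 2.3:
        return "weak outlier"
    if sd_units <= 3:
        return "moderate outlier"
    return "strong outlier"

# Hypothetical matrix: ten similar sample units plus one distant unit
samples = [[i * 0.1, 0.0] for i in range(10)] + [[100.0, 100.0]]
scores = outlier_sd_units(samples)
flagged = scores.index(max(scores))  # index of the most distant sample unit
```

Because the scores are standardized against the same set of distances, a single extreme unit in a small matrix cannot exceed roughly (n - 1) / sqrt(n) SD units, which is worth keeping in mind for small data sets.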
Environmental data
Table 9.4. Suggested procedure for data adjustments of quantitative variables in environmental data matrices.
Action to be considered | Criteria
1. Calculate descriptive statistics for quantitative variables. Repeat this after each step below. (In PC-ORD run Row & column summary.) Skewness and range for each variable (column) | Always
2. Monotonic transformation (applied to individual variables, depending on need) | Consider log or square root transformation for variables with skewness > 1 or ranging over several orders of magnitude. Consider arcsine squareroot transformation for proportion data.
Environmental data
3. Column relativizations
Consider column relativization (by norm or standard deviates) if

environmental variables are to be used in a distance-based
analysis that does not automatically relativize the variables (for
example, using MRPP to answer the question: do groups of
sample units defined by species differ in environmental space?).
Column relativization is not necessary for analyses that use the
variables one at a time (e.g., ordination overlays) or for analyses
with built-in standardization (e.g., PCA of a correlation matrix).
4. Check for univariate outliers and

take corrective steps if necessary.
Examine scatterplots or frequency distributions or relativize by

standard deviates (z scores) and check for high absolute
values.
NonmetricMultidimensionalScaling
(NMS)
Objectives:
DiscussStepsforAnalysis:Advantages/Disadvantages
GooveroutputandinterpretationofAutopilotAnalysis
LearningOutcomes:
UnderstandwhatanNMSanalysisdoesandtellsyou
BeabletodoaNMSanalysiswithPCORD
NMS Whatisit?
Non-metric: Non-parametric data analysis (ranks)
(Relationships between object pair-wise
distances and dissimilarities are not linear)
Output:
Representation of relationships between

objects (samples, species) and descriptors
(environmental variables) in a reduced
number of dimensions (axes)
Axes do not correspond to eigenvectors
(User cannot deduce contribution of various
descriptors / objects to described axes)
NMS Howdoesitwork?
NMS searches for best position of n objects on k
dimensions (axes) to minimize stress of k-d configuration
Compares the pair-wise distances (difference) of the
objects in reduced space (expressed in terms of the axes)
and the dissimilarity of the objects in the real world
(expressed in terms of the samples / species / variables):
The Real World
(e.g., 3D)
Reduced Space
(e.g., 1D)
NMS Howdoesitwork?
Approach:
Mechanics:
Iterative procedure
(Manipulates the coordinates of pairs of
observations so they fit as closely as
possible the measured object similarities)
Using a random initialization, NMS uses

multiple iterations to find a robust pattern
(Goodness of fit is measured using stress,
which relates distances between objects in
reduced space with their dissimilarities)
NMS TheGood
Being based on ranked distances, it tends to linearize
relationship between environmental / species distances
Can deal with any distance measure, data normalization,
and data transformation
Can handle non-metric, semiquantitative and subjective
data (e.g., good / bad, beaufort sea state)
Solves zero truncation problem and some missing data
Empirical studies have shown that:
- Use of ranks makes NMS robust even if relationships
between distances and dissimilarities are not linear
- Provides appropriate distance summary with small
number of dimensions
NMS TheBad
Computationally intensive
Does not provide formula loadings
For a given number of dimensions, the solution for a
particular axis is unique. (First dimension in 2-D solution
not the same as first dimension in 3-D or 1-D)
Axis numbers are arbitrary, so the percent of variance on
a given axis does not decrease with increasing axis number
Difficulties in detecting discontinuities
May fail to find the global solution (minimum global stress)
because of multiple local minima.
Need to account for random start of iterative process
(e.g., repeat analysis to see if random start matters)
NMS Approach
1. Calculate dissimilarity matrix ($\Delta$) of real data.
2. Assign sample units to starting configuration in the k-space
(define initial X). Starting locations (scores on
axes) are assigned with a random number generator.
3. Normalize X by subtracting axis means for each axis
l and dividing by overall standard deviation of scores:
$$\text{normalized } x_{il} = \frac{x_{il} - \bar{x}_l}{\sqrt{\sum_{l=1}^{k} \sum_{i=1}^{n} (x_{il} - \bar{x}_l)^2 / (n \cdot k)}}$$
(n = samples, k = dimensions)
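The normalization in step 3 can be sketched as below, assuming the configuration X is stored as a list of n rows with k scores each. The function name `normalize_configuration` and the example scores are hypothetical.

```python
import math

def normalize_configuration(X):
    """Center each axis on its mean, then divide all scores by the
    overall standard deviation computed across n samples and k axes."""
    n = len(X)
    k = len(X[0])
    axis_means = [sum(row[l] for row in X) / n for l in range(k)]
    centered = [[row[l] - axis_means[l] for l in range(k)] for row in X]
    overall_sd = math.sqrt(
        sum(v * v for row in centered for v in row) / (n * k)
    )
    return [[v / overall_sd for v in row] for row in centered]

# Hypothetical starting configuration: 3 sample units in 2 dimensions
start = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = normalize_configuration(start)
```

After this step every axis has mean zero and the scores have unit overall variance, so stress values are comparable between iterations.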
NMS Approach
4. Calculate D using the Euclidean distances between
sample units in k-space.
5. Rank elements of $\Delta$ in ascending order.
6. Put the elements of D in the same order as $\Delta$.
7. Calculate $\hat{D}$ (with elements $\hat{d}_{ij}$), created by replacing
elements of D which do not meet monotonicity.
Software creates a plot of sample
pair-wise dissimilarities (y axis)
versus distances in k-space
We compute distance in k-space
NMS Approach
Plot of distance in ordination space ($d_{ij}$, horizontal axis)
vs. dissimilarity in original p-dimensional space ($\delta_{ij}$, vertical
axis). Points are labeled with the ranked distance
(dissimilarity) in the original space.
NMS Approach
Calculate $\hat{d}_{ij}$ terms: shifts in k-dimensional distances (x axis)
to reach monotonic change in distances in original data
NMS Stress
8. Calculate raw stress, S*
$$S^* = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \hat{d}_{ij})^2$$
Note: S* measures the departure from monotonicity.
If S* = 0, the relationship is perfectly monotonic.
NMS Stress
9. Because raw stress is altered if the configuration of points
changes (e.g., point locations, number of dimensions) it is
necessary to standardize ("normalize") stress.
Kruskal's stress formula one:
$$S = S^* \Big/ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}^2$$
PC-ORD reports S_R, the square root of scaled stress
(analogous to a standard deviation), multiplied by 100 to
rescale the result from zero to 100:
$$S_R = 100 \sqrt{S}$$
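Steps 5 to 9 can be sketched in Python as follows. This is an illustration, not PC-ORD's algorithm: the fitted distances are obtained with a simple pool-adjacent-violators pass, ties among dissimilarities are not treated specially, and the function names and example values are hypothetical.

```python
import math

def monotone_fit(values):
    """Least-squares non-decreasing fit (pool adjacent violators).

    Neighbouring values that break monotonicity are replaced by their mean.
    """
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def nms_stress(dissimilarities, distances):
    """Return (raw stress S*, reported stress 100 * sqrt(S)).

    Both inputs list the same sample pairs in the same order.
    """
    # Put the k-space distances in ascending order of dissimilarity
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    d = [distances[i] for i in order]
    d_hat = monotone_fit(d)
    raw = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    scaled = raw / sum(a * a for a in d)
    return raw, 100.0 * math.sqrt(scaled)

# Hypothetical example: four sample pairs, one pair out of rank order
raw_stress, reported_stress = nms_stress([0.1, 0.2, 0.3, 0.4],
                                         [1.0, 3.0, 2.0, 4.0])
```

In this example the two out-of-order distances (3 and 2) are both replaced by 2.5, so the raw stress is 0.25 + 0.25 = 0.5; a perfectly monotonic set of distances gives zero stress.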
NMS Approach
10. Now the program tries to minimize S by changing the
configuration of the sample units in the k-space.
Calculate "negative gradient of stress" for each point i.
11. The amount of movement in direction of the negative
gradient is set by the step length, α, which is about 0.2 initially.
The step size is recalculated after each step such that the step
size gets smaller as reductions in stress become smaller.
12. Iterate (go to step 3) until either:
- a set maximum number of iterations is reached OR
- a criterion of stability is met
NMS Approach
Crawling through the landscape in search of the optimum
Stress Landscape
Changing
positions of
the samples
Axis 1
Axis 2
The goal is to minimize stress

(to end up in a valley)
Some landscapes are

trickier than others
NMS Approach
The starting configuration can influence the result
Beware of local minima (pits)
Avoid unstable solutions (saddle points)
The starting configuration can be selected in two ways:
Use a random starting configuration
Use coordinates from another ordination method
Recommendation: Use a random start
A high number of random starting configurations often
provides a solution with lower stress
This approach avoids having to decide on what other
method to use and losing the great benefits of NMS
NMS Approach
Possible to evaluate whether NMS is extracting stronger
axes than expected by chance
Statistical Significance Based on Randomization Test
(Monte Carlo approach):
p = (1+n) / (1+N)
n = number of randomized runs with final stress
less than or equal to the observed minimum stress
(one tailed test) N = number of randomized runs
Recommendation: Use a large number of runs
This is a computationally intensive method that can
take a great deal of time (even if runs = 20)
We need to have a large enough number of runs to
calculate the p value with the desired resolution
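The randomization p value and its dependence on the number of runs can be illustrated with a short sketch. The function name `randomization_p_value` and the stress values are hypothetical.

```python
def randomization_p_value(observed_stress, randomized_stresses):
    """p = (1 + n) / (1 + N), where n counts randomized runs whose final
    stress is less than or equal to the observed minimum stress."""
    N = len(randomized_stresses)
    n = sum(1 for s in randomized_stresses if s <= observed_stress)
    return (1 + n) / (1 + N)

# Hypothetical case: no randomized run does as well as the real data
p_20_runs = randomization_p_value(12.0, [25.0] * 20)  # 1 / 21
p_50_runs = randomization_p_value(12.0, [25.0] * 50)  # 1 / 51
```

The smallest attainable p value is 1 / (1 + N), so 20 randomized runs cannot resolve p below about 0.048, while 50 runs can reach about 0.020.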
NMS Approach
Statistical Significance Based on Randomization Test
( p value: p = (1+n) / (1+N) )
(20 runs)
(50 runs)
Stress declines with increasing dimensions

Real data have lower stress than randomized data
NMS Approach
Stress Interpretation:
Real Data:
Declines with increasing
dimensions (from 1 to 5)
Randomized Data:
Real data below the
distribution of
randomized data
(for dimensions 1 to 5)
NMS AutopilotMode
The automatic procedure determines
most appropriate dimensionality,
assigns statistical significance with
randomizations, and avoids local
minima (using random iterations)
Advantages: Uses default settings
and decides number of axes for you
Disadvantages: User may want

additional output products. Number of
axes based on additional considerations
NMS AutopilotMode
The autopilot NMS mode
provides three settings
Speed vs Thoroughness
Quick and Dirty
Medium
Slow and Thorough
NMS Results
Examine Results.txt file: Settings / Options
Up to 6 dimensions (for sake of interpretation)

Random start (to avoid local minima)
Reduction in dimensionality (D: 6,5,4,3,2,1)
NMS Results
Examine Results.txt file: Settings / Options (all Dimensions)
Cannot monitor changing stress

Cannot assess linearity of distances / dissimilarities
Cannot see scores for all the runs, just for final run
Cannot see scores for species, just for final run
NMS Results
Examine Results.txt file: Results for best result
Stress
P values
Scores
NMS Results
Examine Results.txt file: Results for best result
Scores
NMS Results
Examine Results.txt file: Plotting Stress vs Iteration
Note: This graph provided only for best answer (3-D)
NMS Results
Examine Results.txt file: Interpret Stress (Clarke 1993)
NMS Results
Examine Results.txt file: Run Log
Random data: 0 = not randomized, 1 = randomized

Start file: 0 = random starting coordinates, 1 = read from file
Seeds: initial seeds for random number generator
* Stability criterion not met.
NMS Results
Examine Results.txt file: Run Log
**To run single NMS ordination repeating best result, use this
file as starting configuration, rather than using random start.
Save this file with new name, to avoid overwriting it with next
NMS test. To do this, open file using File | Open | Graph Row
file, then File | Save as | Graph Row file (specify new name).
.
NMS Results
Examine graphs: Species scores
Select Weighted Average Scores
Species as Vectors
Species as Points
NMS Results
Examine graphs: 2D Ordination plots
Tau: non parametric correlation
NMS Results
Correlations with Matrices:
Tau (rank correlation)
DO NOT use r² value
Percent of Variance:
Use same distance metric
used for NMS analysis
NMS Results
Coefficient of Determination (% of Variance):
For each axis together
FINE to use r² value

Orthogonality:
Measure of independence of the three axes
NonmetricMultidimensionalScaling
(NMS)
Objectives:
GooversettingsandresultsofManualAnalysis
Discussconstraintswhendecidingonnumberofaxes
LearningOutcomes:
Understandwhatresultsneedtobereported
NMS SuggestedProcedure
This suggested procedure
determines the most appropriate
dimensionality, assigns statistical
significance with randomizations,
and avoids local minima.
Recommendation: Request a 6-dimensional solution,

stepping down to a 1-dimensional solution, with instability
criterion of 0.0005 (or smaller), 200-500 iterations, 20-50
runs with real data, and 20-50 runs of randomized data
(for randomization tests of statistical significance)
NMS SuggestedProcedure:Step1
First, pick distance measure
Second, set up parameters
Step
Down
Relative Sorensen
Dimensions (max = 6)
Relative Euclidean
200 Iterations, 10 Runs
Third, pick the output options
Write final
configuration
Run Log
Plot Stress
vs. Iteration
Provides scores
Statistics
Dimensionality
Plot distance
vs. dissimilarity
Randomization
Statistical Test
Species Scores
(for plotting)
1. Preliminary runs: Stress Test determines dimensionality
Use time of day random seed
Graph messages
NMS Results
Examine Results.txt file: Results for each run / dimension
Stress
Scores
NMS Results
Examine Results.txt file: Shepard Diagram
[Figure: two Shepard diagrams plotting differences in real space against distances in ordination space]
6-D solution: final stress 4.137 (distances in 6-D space)
2-D solution: final stress 23.138 (distances in 2-D space)
NMS Results
Examine Results.txt file: Plotting Stress vs. Iteration
Note: This process is repeated for each run
NMS Results
Examine Results.txt file: Stress
13.4178 = final stress
0.0031 = final instability
NMS SuggestedProcedure:Step2
Goal: Select the Best Solution:
Plot stress vs. number of dimensions
How: Just after running NMS
Do this in PC-ORD by selecting
Graph | NMS Scree Plot
If the stress
increases with
additional
dimensions,
the model is
over-fitted
NMS SuggestedProcedure:
PC-ORD uses the following criteria (for reference):
Comparing the final stress values among the best

solutions, one best solution for each dimensionality.
Additional dimensions considered useful if they reduce
final stress by 5 or more (on a scale of 0-100).
PC-ORD selects the highest dimensionality that meets this
criterion.
At that dimensionality, the final stress must be lower

than that for 95% of the randomized runs (i.e. p <=
0.05).
If this criterion is not met, PC-ORD does not accept that

solution and chooses a lower-dimensional solution,
provided that it passes the randomization test.
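One possible reading of these criteria is sketched below. It is an interpretation for illustration only, not PC-ORD's actual code: the function name `choose_dimensionality`, the step-down rule when the randomization test fails, and all stress and p values are assumptions.

```python
def choose_dimensionality(best_stress, p_values, min_drop=5.0, alpha=0.05):
    """Pick a dimensionality from the best final stress per dimension.

    best_stress and p_values are dicts keyed by number of dimensions.
    Returns None if no candidate passes the randomization test.
    """
    dims = sorted(best_stress)
    chosen = dims[0]
    # Highest dimensionality whose addition lowers stress by min_drop or more
    for previous, current in zip(dims, dims[1:]):
        if best_stress[previous] - best_stress[current] >= min_drop:
            chosen = current
    # Assumed step-down: fall back to lower dimensions until p <= alpha
    for k in sorted((d for d in dims if d <= chosen), reverse=True):
        if p_values[k] <= alpha:
            return k
    return None

# Hypothetical best final stress for 1 to 6 dimensions
stress = {1: 45.0, 2: 25.0, 3: 14.0, 4: 11.0, 5: 9.5, 6: 9.0}
p_all_pass = {k: 0.02 for k in stress}
selected = choose_dimensionality(stress, p_all_pass)
```

Here the drop from 3 to 4 dimensions is only 3 units of stress, so the sketch settles on the 3-dimensional solution.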
Other metrics for selecting number of dimensions:
marginal change in stress
p values
If stress does not increase, computer considers
marginal decline with added dimensions
Consider the p values
Check for a better-than-random solution by using the
results of the Randomization test.
Limitations: Helpful but not foolproof.
The most common problems are:
Strong outliers, single super-abundant species,
small data sets (e.g.,<10 SUs), many zeros
Note: The first axis with randomized community data is
often nearly as strong or stronger than the real data,
even when the pattern in the real data is strong. The
randomization creates rows with unequal abundances:
some rows can have higher or lower totals than the real data.
Thus a 1-D NMS solution from shuffled data tends to
describe variation in row totals. Interpret carefully.
Goal: Select number of dimensions beyond which
additional dimensions provide only small stress reductions
Suggestion: Follow PC-ORD's recommendation
but check for some safeguards
Note:
No firm fixed criterion for selecting an appropriate
number of dimensions (Kruskal and Wish 1978)
Axis scores depend on the number of axes. The first
dimension on a 2-D and a 3-D result will be different
Trade-Offs:
Do not trust results with large stress values (> 20)
Final stress decreases and the proportion of the variance
represented increases with more axes
Pick as few dimensions as possible based on stress
reductions but if in doubt, add an extra dimension
Beware of unstable results (stress wiggles with iterations)
Consult the instability of the final answer
Check the following: (a) plot of stress vs. iteration for
stability of the solution at the selected number of
dimensions; and (b) final instability value for the chosen
solution, as listed in the numerical output from NMS
Look for smooth curves
[Figure: two plots of Stress against Step (0-200), one labeled Stable and one labeled Unstable]
Strive for instability < 10^-4 (< 0.0001)
Use Data Exploration to decrease stress of NMS analysis
[Figure, left panel: Final Stress (%) against Number of Sample Units]
Dependence of stress on sample size, by subsampling rows of a
matrix of 50 units by 29 species
[Figure, right panel: Final Stress and Species remaining (count) against Criterion for species retention (% of SU's)]
Dependence of stress on progressive removal of
rare species from data set
NMS WhattoReport
Samples / Species Considered
Data Transformations
Distance Metric and Software Used

Did you use a random starting point
Number of runs with real / random data
Number of dimensions considered
How did you select the dimensions

Final stress / instability of best solution
Monte Carlo tests results (p values)
Proportion of variance explained by each axis (r²)
Overlays (env. data / species)
Correlations of env. data / species with axes (Tau)
NMS References
PC-ORD uses the following algorithms:
Mather, P. M. 1976. Computational methods of multivariate
analysis in physical geography. J. Wiley & Sons, London.
532 pp.
Kruskal, J. B. 1964. Multidimensional scaling by optimizing
goodness of fit to a nonmetric hypothesis. Psychometrika
29:1-27.
For a review of NMS, cite:
Clarke, K.R. 1993. Non-parametric multivariate analyses of
changes in community structure. Australian Journal of
Ecology 18:117-143.
Kenkel, N.C., Orlóci, L., 1986. Applying metric and nonmetric
multidimensional scaling to ecological studies: some new
results. Ecology 67, 919-923.
NMS ExamplesI
Seabird communities of the Indian Ocean
We selected an
observation day as
the sampling unit for
the community-level
analysis because we
regarded the daily
transects as discrete
samples, separated
by night time periods
with no survey effort.
(Hyrenbach et al. 2007)
Our sample size was a matrix of 16 transects and 46 taxa.

We standardized the samples using the relative abundance of taxa.
To ensure each daily sample was weighted equally in the analysis,
we used the relative Sorensen (Bray- Curtis) index (Manly, 1994).
NMS ExamplesI
Seabird communities of the Indian Ocean
The NMDS selected 3 habitat axes, which accounted for
73.4 % of variance observed in the seabird community
- The first axis (r² = 0.15) described lat gradients associated with
concurrent SST decrease and CHL increase.
- The second axis (r² = 0.41) illustrated concurrent lat / long changes in
wind speed, depth, CHL, SST, and gradients in ocean depth and SST.
- The third axis (r² = 0.17) captured the influence of onshore-offshore
gradients in CHL, irrespective of lat and long.
Because axes 2 and 3 explained a higher proportion of the
observed variability, we plotted the survey transects and
species distributions in two dimensions
NMS ExamplesI
Seabird community structure in the Indian Ocean
[Figure: ordination of the 16 transects (numbered 1-16) on Axis 2 (horizontal) and Axis 3 (vertical); gradient labels: Shallow / Deep and North / South]
Three seabird assemblages:
sub-Antarctic, subtropical offshore, subtropical nearshore
NMS ExamplesII
Seabirds and subsurface predators around Oahu
69 seabird foraging
observations recorded
Presence of subsurface
predators was not
ascertained in 7 schools
In 2 of remaining 62
observations, no
subsurface predators
were present
(Hebshi et al. 2008)
NMS ExamplesII
The NMDS analysis relied on a similarity matrix created using the
Sorensen (Bray-Curtis) index from the raw seabird counts and 13
explanatory variables describing:
- type of fishing (commercial vs. sport)
- subsurface predator (skipjack tuna, mahimahi, spotted dolphin,
false killer whale, yellowfin tuna, unknown),
- geographic location around Oahu (Waianae, Penguin Bank,
Kaena Point, other *).
* Only those locations contributing at least 10%
(7 or more) observations considered in analysis.
NMS ExamplesII
NMS identified 2 highly (99.3%)
orthogonal axes (r = 0.082),
which explained 67.9% of the
cumulative observed variance
axis 1: r² = 0.502
axis 2: r² = 0.178
The NMS stress was 17.873,
suggesting that the test
performance was fair
(McCune & Grace 2002)
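The orthogonality figure above is consistent with the correlation between the two axes if orthogonality is taken as 100 * (1 - r²). That formula is an assumption here, stated for illustration; the function name `orthogonality_percent` is hypothetical.

```python
def orthogonality_percent(r):
    """Orthogonality of two axes as 100 * (1 - r^2), with r the
    correlation between their scores (assumed definition)."""
    return 100.0 * (1.0 - r * r)

# Values reported for the Oahu example
orthogonality = orthogonality_percent(0.082)   # about 99.3
cumulative_r2 = 0.502 + 0.178                  # about 0.68
```

The per-axis r² values sum to about 0.68, in line with the reported 67.9% once rounding of the individual values is allowed for.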
NMS ExamplesII
Seabirds and subsurface
predators around Oahu
The seabird community was
influenced by the presence of
wedge-tailed shearwaters, brown
noddies, and sooty terns
The first axis captured the
differences between commercial
and sport fishing vessels, while
the second axis captured variability
across geographic locations
This analysis also revealed
significant correlations with the first
axis for 2 subsurface predators:
mahimahi (+) and skipjack tuna (-)
TakeHomeMessages
NMS is a flexible and powerful tool
This inherent flexibility makes this technique difficult
to interpret (how many meaningful axes are there ?)
Yet, NMS allows the integration of different datasets into
multivariate patterns
Data exploration will help you use NMS most efficiently,
by carefully choosing the sample sizes and species /
variables to include in your analyses.
Use NMS to tell ecological stories that balance noise
against statistical significance
PCAExaminationKey
This exam is worth 10 points (two homeworks).
Just like in the homeworks, make sure you explain what
you are doing and how you are getting the answers. This
way, I can give you partial credit for incomplete answers.
In particular, explicitly state what PC-ORD command you
used to obtain the various figures / results.
You will turn in a ppt file with your images and text
inserted into the body of the presentation. To copy text
from PC-ORD screen, use CONTROL + Print Screen
When answering the questions, back up your responses
with figures / tables / numbers. An image / table is worth
1000 words!!!
Dataset
Data file: PCA1M.wk1 (main matrix)
96 samples and 5 variables

Samples are monthly values (Jan. 97 - Dec. 04)
Variables:
Time: decimal year
MEI: El Niño Multivariate Index (positive: warm, negative: cold)
PDO: Pacific Decadal Oscillation (positive: warm, negative: cold)
Up36: upwelling at 36° N (positive: upwelling, negative: downwelling)
Up39: upwelling at 39° N (positive: upwelling, negative: downwelling)
DataExploration
Use scatterplot matrix to make a
plot of all possible pair-wise
combinations of the 5
environmental variables
DataExploration Correlograms
Time Trends ?
Regional Indices
(PDO / MEI)
Local Indices
(up36 / up39)
DataExploration Advisor
Rows Skewed
Columns Not Skewed
Outliers: Samples
Look out for these
in the plot results
DataExploration VariableYear
-0.2330E-07 = skewness
DataExploration Skewness
0.22 = skewness
0.94 = skewness
0.60 = skewness
0.24 = skewness
StatisticalResults WithYear
StatisticalResults WithYear
Important Axes:
Eigenvalue: 1, 2, 3, 4
Broken-stick: 1
P-values: 1, 4
Interpretation:
Loadings > 0.5 highlighted (arbitrary cutoff)
up36 / up39: together
up36 / up39: opposite
Time
MEI / PDO: together
DataTransformation TimesinceStart
Transformation:
Subtract 1997 (first year sampled)
Recode as Time Since Start
Similar skewness
No more outliers
DataExploration Skewness
-0.2330E-07 = skewness
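The unchanged skewness is expected: subtracting a constant shifts a variable without changing its shape, and an evenly spaced time variable is symmetric, so its skewness is essentially zero. A minimal sketch, assuming a simple moment-based skewness (PC-ORD's exact formula may differ) and 96 evenly spaced monthly time stamps:

```python
def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# 96 evenly spaced monthly samples expressed as decimal years (illustrative)
decimal_year = [1997 + month / 12 for month in range(96)]

# Recode as time since the first sample
time_since_start = [t - decimal_year[0] for t in decimal_year]

skew_year = skewness(decimal_year)
skew_time = skewness(time_since_start)
```

Both values are zero up to floating-point noise, which is consistent with the tiny value reported on the slide.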
StatisticalResults WithTime
StatisticalResults WithTime
Important Axes:
Eigenvalue: 1, 2, 3, 4
Broken-stick: 1
P-values: 1, 4
Interpretation:
up36 / up39: together
up36 / up39: opposite
Time
MEI / PDO: together
DataTransformation RemoveTime
Remove Column:
No Time
Less Skewness
(for rows)
Still No Outliers
StatisticalResults RemoveTime
StatisticalResults WithoutTime
Important Axes:
Eigenvalue: 1, 2, 3
Broken-stick: 1
P-values: 1, 3
Interpretation:
up36 / up39: together
up36 / up39: opposite
MEI / PDO: together
DataExploration WithTime
Independent
(orthogonal) variables
DataExploration Time
No correlation with axis 1 or 2
Positive correlation with axis 3
DataExploration Time
DataExploration WithoutTime
Independent
(orthogonal) variables
DataExploration WithTime
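[Figure: ordination plots; labelled samples include Su02, SU98, WI98, Su99, SU97, WI97]
Axis 1: Big = more upwelling; Small = less upwelling
Axis 3: Small = warm; Big = cool
[Figure panels: PDO vs. axis 1; MEI vs. axis 1; Upwelling 39 vs. axis 1; Upwelling 36 vs. axis 1]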
Conclusions
Number of eigenvalues = Number of variables
Eigenvalues and loadings did not change
- even after transforming YEAR data
Broken-stick results did not vary: YEAR / TIME
Randomization results did vary: YEAR / TIME
Removing time (linear variable):
- one less eigenvalue
- highlighted upwelling / PDO / MEI influence
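The broken-stick criterion can be sketched in a few lines. The sketch assumes a PCA on the correlation matrix, so that the eigenvalues sum to the number of variables p, and the usual rule that an axis is retained when its observed eigenvalue exceeds the broken-stick expectation. The function name is illustrative.

```python
def broken_stick(p):
    """Expected eigenvalues under the broken-stick model for p variables.

    The k-th expectation is the sum of 1/j for j = k..p, so the
    expectations sum to p (assumes correlation-matrix PCA).
    """
    return [sum(1.0 / j for j in range(k, p + 1)) for k in range(1, p + 1)]


with_time = broken_stick(5)     # Time, MEI, PDO, Up36, Up39
without_time = broken_stick(4)  # after removing the Time column

for axis, expected in enumerate(with_time, start=1):
    print(f"axis {axis}: broken-stick eigenvalue = {expected:.3f}")
```

The expectations depend only on the number of variables, not on the data values, so they are the same for the YEAR and TIME versions of the matrix and change only when a column is removed.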
PolarOrdination/MRPP
Objectives:
Discuss general approaches of these two methods
Go over settings and results for these two methods
Learning Outcomes:
Understand how to perform these analyses
Be familiar with what results need to be reported
PolarOrdination Applications
Bray-Curtis Ordination
(Polar Ordination) arranges samples
with respect to poles (also termed
end points or reference points)
according to a distance matrix
These endpoints are two samples
with the highest ecological distance
between them (objective approach),
OR two samples suspected of being
at opposite ends of an important
gradient (subjective approach)
Recommendation: This procedure is especially useful for
investigating ecological change (e.g., succession, recovery).
PolarOrdination Pros/Cons
Advantages:
Ideal for evaluating problems with discrete endpoints:
conceptually (arctic sample / tropical sample) or
practically (before disturbance / climax community)
Polar Ordination ideal for testing specific hypotheses
(e.g., reference condition or experimental design) by
subjectively selecting the end points
Disadvantages:
This technique does not provide a general-purpose
description of the community (perspective is biased)
Very sensitive to outliers (which are, by definition, end points)
PolarOrdination HowitWorks
Setting Up:
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
Calculate sum of squares of the
distances for calculation of the
variance represented by each axis
Select two points, A and B, as reference points for axis 1
Define End Points Subjectively OR Use Objective Method
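3 Objective Methods: Recommend Variance-Regression
- first end point: the point with the largest variance in its pairwise distances
- second end point: the point that gives the lowest (most negative) regression
coefficient when its distances are regressed against those from the first end point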
Selecting End Points:
Variance-Regression:
(Beals 1984)
Selects points at edges of main
cloud of points (Recommended)
Original (Bray & Curtis 1957):
Selects the two points with the largest distance
between them, which tends to pick outliers
Minimum-Centroid (Deviation):
Geometry must be picked subjectively; rarely used (DO NOT USE)
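The difference between the original method and the first step of variance-regression can be sketched as follows. The data are hypothetical one-dimensional positions with one outlier, distances are simple absolute differences rather than Sorensen distances, and the function names are illustrative; only the first end point of variance-regression is shown.

```python
from itertools import combinations
from statistics import pvariance


def farthest_pair(D):
    """Original method: the two points with the largest distance between them."""
    n = len(D)
    return max(combinations(range(n), 2), key=lambda pair: D[pair[0]][pair[1]])


def highest_variance_point(D):
    """Variance-regression, first end point: the point whose distances
    to all other points have the largest variance."""
    n = len(D)
    return max(range(n),
               key=lambda i: pvariance([D[i][j] for j in range(n) if j != i]))


# Hypothetical samples: four close together plus one outlier (index 4)
positions = [0.0, 1.0, 2.0, 3.0, 10.0]
D = [[abs(a - b) for b in positions] for a in positions]

original_end_points = farthest_pair(D)
first_end_point = highest_variance_point(D)
```

In this toy example the original method picks a pair that includes the outlier, while the highest-variance point is a sample at the edge of the main group.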
Once you have the first axis (g) linking the two points:
Calculate position (x_gi) of each point i on the axis g. Point i
is projected onto axis g between reference points A and B
For Reference: Equation for projection onto the axis is:
$x_{gi} = \frac{D_{AB}^2 + D_{Ai}^2 - D_{Bi}^2}{2 D_{AB}}$
Calculate variance represented by axis g as a percentage
of the original variance (V_g %). The residual sum of
squares has the same form as the original sum of squares and
represents the amount of variation remaining from the original distance matrix
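The projection and variance steps can be sketched as below. This is a sketch under stated assumptions, not PC-ORD's implementation: it uses the standard law-of-cosines projection, Euclidean residual distances (the slides recommend the city-block option in PC-ORD), and takes the variance represented as 100 x (1 - residual SS / original SS). The points and names are hypothetical.

```python
import math
from itertools import combinations


def polar_axis(D, a, b):
    """Scores on the axis through end points a and b, and % variance represented."""
    d_ab = D[a][b]
    # Projection: x_i = (D_ab^2 + D_ai^2 - D_bi^2) / (2 * D_ab)
    scores = [(d_ab ** 2 + D[a][i] ** 2 - D[b][i] ** 2) / (2 * d_ab)
              for i in range(len(D))]
    ss_original = 0.0
    ss_residual = 0.0
    for i, h in combinations(range(len(D)), 2):
        ss_original += D[i][h] ** 2
        # Euclidean residual: squared distance not accounted for by the axis
        # (clamped at zero to guard against rounding error)
        ss_residual += max(D[i][h] ** 2 - (scores[i] - scores[h]) ** 2, 0.0)
    percent = 100.0 * (1.0 - ss_residual / ss_original)
    return scores, percent


# Hypothetical samples in two dimensions; end points are samples 0 and 1
points = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0)]
D = [[math.dist(p, q) for q in points] for p in points]
scores, percent_variance = polar_axis(D, 0, 1)
```

The end points land at 0 and at their mutual distance, the third sample falls halfway between them, and the small off-axis offset of that sample is what remains in the residual sum of squares.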
PO SuggestedProcedure:Step1
First, pick distance measure
Distance: Sorensen
Second, select End Points
Let's try Subjective
Third, Geometry / Residuals
Recommend City-Block
NOTE: The number of axes only changes the reported results, not the solution.
Always try more than 1. Set List Residual Matrix = 0
PO SuggestedProcedure
Next, pick number of subjective axes
Note: It is possible that
objective axes capture
more variation than the
subjective axis selected
Method for determining the
(remaining) objective axes
PO Results
Ordination axes in order of variance explained,
showing endpoints chosen and sample scores
PO Results
Examine scores on axis 1: Results.txt file
Examine performance of axes 2 and 3: Results.txt file
End Points: 88 & 92

Variance: 55.79%
End Points: 86 & 02
Variance: 16.05%
PO WhattoReport
Distance Metric Used (Use Sorensen)
Method for selecting End Points, and what they are
Use subjective for axis 1
Select City-Block Distance / Residuals: similar to NMS
Use Variance-Regression method for additional axes
Number of dimensions considered
Always use more than 1
Proportion of variance explained by each axis (r²)
Bi-plot ordination plots
Correlations of env. data / species with axes (Tau)
Orthogonality of axes: May be lower than NMS / PCA
NOTE: No randomizations (p values) yet; this is being worked on.
MRPP Applications
Multi-response Permutation Procedure
(MRPP) is a non-parametric approach for
testing the hypothesis of no differences
between two or more groups of entities
(species, variables). Related methods: MRBP, ANOSIM, Qb
These pre-existing groups can be
defined using groups of samples on the
basis of categorical data:
The presence / absence of given species
Categories of environmental variables
(e.g., early vs. late)
Recommendation: This procedure yields a p value, and
interpretation requires further exploration (e.g., indicator species)
MRPP Pros/Cons
Advantages:
Ideal for evaluating specific hypotheses about
differences between groups of samples
Disadvantages:
Cannot investigate interaction terms
Interpretation is difficult: determining which species are
contributing to these differences in community
composition requires additional exploration of the data
Recommend: Use Indicator Species Analysis
MRPP HowitWorks
Setting Up:
Include a Grouping Variable in the Main / Second matrix
Select a distance measure (usually Sorensen Index) and
calculate matrix of distances (D) between all pairs of points
within each of the pre-defined groups we are testing
[Figure: samples assigned to Group 1 and Group 2]
Shuffle data and recalculate distances
for all possible arrangements of
samples into groups
MRPP HowitWorks
Calculate distance matrix, D
Calculate average distance x_i within each group i
Calculate delta (weighted mean within-group distance):
$\delta = \sum_{i=1}^{g} C_i x_i$
Note: for g groups, where C_i is a weight that
depends on the number of items in the groups
(C_i = n_i / N, where n_i is the number of items in
group i and N is the total number of items)
MRPP HowitWorks
Permutations:
M = N!/(n1! * n2!)
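The number of arrangements grows very quickly with sample size, which the following sketch illustrates. It generalizes the two-group formula above to any number of groups by dividing by the factorial of each group size; the function name is illustrative.

```python
from math import factorial


def n_arrangements(group_sizes):
    """Number of distinct ways to assign N items to groups of the given sizes:
    N! / (n1! * n2! * ...)."""
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)
    return total


small = n_arrangements([3, 3])      # 6 samples in two groups of 3
medium = n_arrangements([10, 10])   # 20 samples in two groups of 10
large = n_arrangements([30, 32])    # 62 samples in two groups

print(small, medium, large)
```

Even for moderate sample sizes the count is far too large to enumerate, which is why the distribution of delta generally has to be approximated or sampled rather than listed in full.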
Determine the probability of a delta this small or smaller
[Figure: data matrix (sample units x species) with a group membership column (1, 1, 2, 2, 3, etc.)]
MRPP HowitWorks
Calculating the p value:
Determine the probability of a delta as small or smaller than the observed delta
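For a very small data set, the p value can be obtained by listing every arrangement, as in this sketch for two groups. It is an illustration of the logic only: the data are hypothetical, the names are illustrative, and for realistic sample sizes full enumeration is not feasible.

```python
from itertools import combinations


def mean_within(D, members):
    """Average pairwise distance among the members of one group."""
    pairs = list(combinations(members, 2))
    return sum(D[i][j] for i, j in pairs) / len(pairs)


def delta_two_groups(D, group1):
    """Delta for two groups, weighting each group by n_i / N."""
    n = len(D)
    group2 = [i for i in range(n) if i not in group1]
    return ((len(group1) / n) * mean_within(D, group1)
            + (len(group2) / n) * mean_within(D, group2))


def exact_p_value(D, observed_group1):
    """Proportion of all arrangements with delta as small or smaller than observed."""
    observed = delta_two_groups(D, observed_group1)
    deltas = [delta_two_groups(D, g1)
              for g1 in combinations(range(len(D)), len(observed_group1))]
    as_small = sum(1 for d in deltas if d <= observed + 1e-12)
    return as_small / len(deltas)


values = [1, 2, 3, 10, 11, 12]              # hypothetical samples
D = [[abs(a - b) for b in values] for a in values]
p_value = exact_p_value(D, (0, 1, 2))       # observed group 1 = first three samples
```

With only 20 possible arrangements the smallest attainable p value here is 0.1, so even a perfect separation cannot reach p < 0.05; this is one reason why sample size matters for permutation tests.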
MRPP HowitWorks
Output:
Test statistic T: describes the separation between groups
A: chance-corrected within-group agreement (effect size)
BEWARE: DO NOT over-interpret T and A (ongoing discussion)
P-value: tests the null hypothesis that the
within-group distance is the same as across-group distances
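The two output statistics can be sketched from the observed delta and the distribution of delta under the null hypothesis, using the standard definitions T = (observed delta - expected delta) / standard deviation of delta, and A = 1 - observed delta / expected delta. The null deltas below are made-up numbers, and the function name is illustrative.

```python
from statistics import mean, pstdev


def mrpp_statistics(observed_delta, null_deltas):
    """Return (T, A) given the observed delta and the deltas under the null."""
    expected = mean(null_deltas)   # expected delta under the null hypothesis
    spread = pstdev(null_deltas)   # standard deviation of delta under the null
    T = (observed_delta - expected) / spread
    A = 1.0 - observed_delta / expected
    return T, A


# Made-up null distribution of delta, for illustration only
null_deltas = [1.0, 3.0, 5.0, 3.0]
T, A = mrpp_statistics(1.0, null_deltas)
```

A more negative T indicates stronger separation between groups. A equals 1 when all items within groups are identical, is near 0 when within-group heterogeneity matches chance expectation, and is negative when groups are more heterogeneous than expected by chance.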
MRPP SuggestedProcedure:Step1
First, pick
distance measure
Distance: Sorensen
Second, select
Weights of Groups
Recommend:
n / sum (n)
Third, use Ranks
Useful for very heterogeneous data

More comparable to NMS
MRPP Results
Examine
Results.txt file: Distribution of samples into groups
MRPP Results
Examine Results.txt file: T & A Statistics
Smaller
observed delta
A>0
(more similar
within groups)
Significant result: p < 0.05
Fairly small Output: NO bi-plots, NO variance explained
MRPP WhattoReport
Distance Metric Used (Use Sorensen)
How groups were defined (relate back to hypothesis)
Chance corrected within-group agreement (A)
Associated p value
PO/MRPP References
Polar Ordination:
Bray, J. R. and J. T. Curtis. 1957. An ordination of upland
forest communities of southern Wisconsin. Ecological
Monographs 27: 325-349.
Beals, E. W. 1984. Bray-Curtis ordination: an effective
strategy for analysis of multivariate ecological data. Advances
in Ecological Research 14: 1-55.
MRPP:
Mielke, P. W., Jr. 1991. The application of multivariate
permutation methods based on distance functions in the earth
sciences. Earth-Science Reviews 31: 55-71.
Zimmerman, G. M., H. Goetz, and P. W. Mielke, Jr. 1985. Use
of an improved statistical method for group comparisons to
study effects of prairie fire. Ecology 66: 606-611.
ForthePeer Review
Look for a gradient (one axis):
Polar Ordination
Compare groups:
MRPP
Suggestions:
- If you have a categorical variable (in canyon / outside): MRPP
- If you have continuously changing samples (across latitude or
depth): you can test for N / S or shallow / deep gradients
- If you have the diet or habitat of multiple species, you can use
them as groups
