597–609
© 2012 SETAC
(Submitted 30 January 2009; Returned for Revision 1 April 2009; Accepted 19 June 2012)
EDITOR’S NOTE
This article represents 1 of 6 papers describing development and evaluation of a sediment quality assessment framework to
support implementation of California’s new sediment quality objectives for bays and estuaries, which became effective in 2009.
Over thirty scientists collaborated on this effort by the California State Water Resources Control Board, which resulted in the
establishment of one of the first statewide programs in the US to fully incorporate the sediment quality triad for regulatory
applications.
Special Series
ABSTRACT
A number of sediment quality guidelines (SQGs) have been developed for relating chemical concentrations in sediment to
their potential for effects on benthic macroinvertebrates, but there have been few studies evaluating the relative effectiveness of
different SQG approaches. Here we apply 6 empirical SQG approaches to assess how well they predict toxicity in California
sediments. Four of the SQG approaches were nationally derived indices that were established in previous studies: effects range
median (ERM), logistic regression model (LRM), sediment quality guideline quotient 1 (SQGQ1), and Consensus. Two
approaches were variations of nationally derived approaches that were recalibrated to California-specific data (CA LRM and CA
ERM). Each SQG approach was applied to a standardized set of matched chemistry and toxicity data for California and an index
of the aggregate magnitude of contamination (e.g., mean SQG quotient or maximum probability of toxicity) was calculated. A
set of 3 thresholds for classification of the results into 4 categories of predicted toxicity was established for each SQG approach
using a statistical optimization procedure. The performance of each SQG approach was evaluated in terms of correlation and
categorical classification accuracy. Each SQG index had a significant, but low, correlation with toxicity and was able to correctly
classify the level of toxicity for up to 40% of samples. The CA LRM had the best overall performance, but the magnitude of
differences in classification accuracy among the SQG approaches was relatively small. Recalibration of the indices using
California data improved performance of the LRM, but not the ERM. The LRM approach is more amenable to revision than other
national SQGs, which is a desirable attribute for use in programs where the ability to incorporate new information or chemicals
of concern is important. The use of a consistent threshold development approach appeared to be a more important factor than
type of SQG approach in determining SQG performance. The relatively small change in classification accuracy obtained with
regional calibration of these SQG approaches suggests that further calibration and normalization efforts are likely to have
limited success in improving classification accuracy associated with biological effects. Fundamental changes to both SQG
components and conceptual approach are needed to obtain substantial improvements in performance. These changes include
updating the guideline values to include current use pesticides, as well as developing improved approaches that account for
changes in contaminant bioavailability. Integr Environ Assess Manag 2012;8:597–609. © 2012 SETAC
specific contaminant types. In addition, some of the parameters needed to apply these guidelines (e.g., sediment acid
volatile sulfides and simultaneously extracted metals) are rarely collected in current routine monitoring programs.
Second, the more widely used empirical SQGs are derived from statistical association of matched sediment chemistry
and biological effects data. Multiple kinds of empirical SQGs that are based on different statistical approaches have been
developed. Examples of empirical SQG approaches for the marine environment include effects range median (ERM),
probable effects level (PEL), apparent effects threshold (AET), SQGQ1, and LRM (Barrick et al. 1988; Fairey et al.
2001; Field et al. 2002; Long et al. 1995; MacDonald et al. 1996). Consensus guidelines, which aggregate several
different SQGs having a similar narrative intent (e.g., median effect), are an evolution of the empirical approach. Marine
consensus SQGs have been developed for some constituents, including metals, polychlorinated biphenyls (PCBs), and
polycyclic aromatic hydrocarbons (PAHs) (MacDonald et al. 2000; Swartz 1999; Vidal and Bay 2005).
It is unclear which empirical SQG approach is most effective for describing the potential for biological effects
associated with chemical contamination. Numerous studies have shown that each SQG approach has some degree of
predictive ability with respect to biological effects, but most studies have generally been limited to examination of just 1 or
2 approaches and often use variable methods to measure performance (Wenning et al. 2005). Long et al. (2000)
applied ERMs and PELs to several data sets and observed different patterns in predictive ability. Vidal and Bay (2005)
compared 5 SQG approaches using a common data set and found large differences in predictive ability among some
approaches, however, their study did not include the LRM approach. Vidal and Bay (2005) also observed that
comparisons of SQG performance can be strongly influenced by the selection of thresholds used to classify the results.
Existing studies are inadequate for comparing the performance of empirical SQGs because of their limited scope, lack of
comparability in methods, and lack of thresholds derived using a consistent methodology.
It is also unclear whether performance of SQGs is improved when they are calibrated to local conditions. The
predictive ability of SQGs has been shown to vary when the same guidelines are applied to data from different regions
(Fairey et al. 2001; Long et al. 1998, 2006; O’Connor et al. 1998; Vidal and Bay 2005). These variations in performance
may be due to differences in the chemical mixtures between sites or regions, variations in bioavailability due to
geochemical factors, or differences in the sensitivity of methods used to measure biological effects. Variation in SQG
performance among studies creates uncertainty in determining the threshold of SQG exceedance associated with adverse
impacts on sediment quality. The use of SQGs and interpretation thresholds that are derived or calibrated
relative to site-specific conditions has been recommended as a way to reduce the uncertainty of SQG interpretation
(Fairey et al. 2001; Long et al. 2006; Vidal and Bay 2005).
This study applied 6 empirical SQG approaches to a large California data set of paired sediment chemistry and toxicity
measurements to assess: 1) which national SQG approach best classifies the toxicity of California sediments, 2) whether
the relationship of national SQGs to sediment toxicity is improved when the SQGs are recalibrated to California data,
and 3) if performance further improves when the SQGs are recalibrated to 2 subregions within California.

METHODS
The study assessed the performance of 6 empirical SQG approaches by applying them to matched chemistry and
toxicity data for California and calculating an index of overall contamination based on the mean SQG quotient or the
maximum probability of toxicity. Performance of the SQG indices was evaluated in terms of correlation with magnitude
of biological response and categorical classification accuracy (Figure 1). Four of the SQG approaches were derived in
previous national studies (ERM, LRM, SQGQ1, Consensus) and 2 were variations of nationally derived SQGs that were
recalibrated to California-specific data (CA LRM and CA ERM). Thresholds relating each SQG index to toxicity
response categories were derived using a standardized statistical approach. Each SQG index was evaluated by
determining 3 measures of association between the calculated effect categories and the observed toxicity responses:
correlation, weighted kappa, and percent agreement. SQG calibration and performance evaluations were conducted at
2 scales to investigate the influence of regional variations in sediment characteristics: statewide (all California data) and
regional (separate northern and southern California data sets).

Figure 1. Schematic of data analyses. [Flow chart: matched chemistry and toxicity data compilation; data standardization
and categorization of biological effects; split into calibration data set (2/3) and validation data set (1/3); regional LRM
and ERM calibration; statewide and regional threshold development; SQG index evaluation (kappa, agreement, correlation).]

Data
Paired sediment chemistry and toxicity measurements from California marine embayments were compiled from
151 dredging, monitoring, and research studies conducted between 1984 and 2004. The database included stations from
marine and estuarine embayments located from 41.94°N (Del
National and Regional Sediment Quality Guidelines Comparison—Integr Environ Assess Manag 8, 2012 599
Norte County, CA) to 31.75°N (US–Mexico international border). More information on the studies used to populate
this database can be found at http://www.sccwrp.org/view.php?id=519.
The data were screened to select information that was of high quality and comparable. All stations were from locations
in enclosed bays or harbors at subtidal depths and only data from surficial sediment (top 30 cm or less) were selected.
Toxicity data were limited to information from solid phase 10-day amphipod survival tests using Rhepoxynius abronius or
Eohaustorius estuarius and conducted using standardized methods (USEPA 1994). Overall, 74% of the data were from
tests using E. estuarius. The proportion of tests per species varied regionally, with E. estuarius tests comprising 90% and
60% of the data in the northern California and southern California data sets, respectively. Toxicity data were further
screened to ensure mean negative control survival was at least 90% and overlying water ammonia concentrations (initial
and final, if available) were less than species-specific criteria (USEPA 1994). Sediment grain size was not used as a
toxicity data screening criterion. Screening steps to select chemistry data for analysis included a review of the data
quality assessment from the study authors, use of comparable extraction/digestion methods, and measurement of a
minimum suite of contaminants that included multiple metals and PAHs.
Standardized sums of PAHs, dichlorodiphenyltrichloroethane (DDTs), PCBs, and chlordanes were calculated using
a consistent methodology for all samples. Low molecular weight PAHs (LMW PAH) were calculated as the sum of
acenaphthene, anthracene, biphenyl, naphthalene, 2,6-dimethylnaphthalene, fluorene, 1-methylnaphthalene,
2-methylnaphthalene, 1-methylphenanthrene, and phenanthrene. High molecular weight PAHs (HMW PAH) was the sum
of benzo[a]anthracene, benzo[a]pyrene, benzo[e]pyrene, chrysene, dibenz[a,h]anthracene, fluoranthene, perylene,
and pyrene. Total PAHs was the sum of LMW PAH and HMW PAH values. Total PCBs was calculated from the sum
of congeners 8, 18, 28, 44, 52, 66, 101, 105, 110, 118, 128, 138, 153, 180, 187, and 195. The congener list was a subset of
that used by the NOAA Status and Trends Program; the sum was multiplied by a correction factor of 1.72 to approximate
the value obtained using the larger NOAA list. Total DDTs represented the sum of p,p′-DDT, o,p′-DDT, p,p′-DDE,
o,p′-DDE, p,p′-DDD, and o,p′-DDD. Total chlordane was the sum of α-chlordane (cis-chlordane), oxychlordane,
trans-chlordane, trans-nonachlor, and γ-chlordane.
Data were estimated for values reported as below reporting limits based on multiple regression imputation, taking
advantage of covariation among the many chemical and sediment variables. Imputation produces lesser bias than
conventional approaches for interpreting nondetect data, such as substituting zero or 50% of the reporting limit (Helsel
2005). SAS PROC MI (SAS Institute, Cary, NC) was used to impute values in a sequential stepwise fashion by contaminant
type. Metal data were estimated first, followed in order by pesticides, PAHs, and PCBs. The stepwise manner in which
the groups of data variables were imputed was used because SAS PROC MI could not compute all imputations in a
single step. The stepwise procedure also allowed for better control of the data variables used in the imputations for
each chemical group. Estimated values were constrained to always be less than the study reporting limit. The
imputation method was also used to estimate total organic carbon (TOC) for the purposes of calculating the SQGQ1 and
Consensus quotients. Estimated values were not used in calculations for any other analytes missing in the data sets,
except when needed to calculate standardized sums of PAHs, PCBs, or pesticides. For example, a value for phenanthrene
was estimated for a sample that contained data for other PAHs to use the standardized method to calculate the PAH
sums, but the estimated phenanthrene value was not used individually to calculate summary SQG values for that
sample.
The standardized data set was divided into 2 groups to facilitate investigation of regional differences in chemical
contamination on SQG performance: northern California embayments north of Point Conception and southern
California embayments south of Point Conception. Each regional data set was further divided into 2 portions: a
calibration subset used for index development and threshold calibration, and an independent validation subset used for the
analysis of SQG performance. Approximately one-third of the data were used for validation. The validation samples
were selected by first grouping the data into 1 of 8 subregions based on latitude to ensure even spatial representation. The
samples within each subregion were then ranked by the mean ERM quotient (mERMQ) and one-third of the samples
were systematically sampled from throughout the mERMQ distribution. Additional validation data were obtained from
recent monitoring studies that were not included in the initial data compilation effort. The north and south validation
data sets contained 146 and 249 samples, respectively.

National SQGs
The ERM guideline values are based on the analysis of marine chemistry and biological effects data from throughout
North America (Long et al. 1995). These SQGs use results from a wide range of biological effects measures, including
acute and sublethal sediment toxicity tests of field sediments, spiked sediment experiments, benthic community
assessments, fish pathology, and mechanistic models of sediment toxicity. In general, the chemical concentrations
associated with adverse effects for each study were compiled and sorted in ascending order, with the ERM representing
the median concentration of the data distribution. The index used to represent the ERM approach in the present study
was the mean ERM quotient (mERMQ) developed by Long et al. (2000), which was calculated by dividing each chemical
concentration by its respective ERM and averaging the individual quotients. A subset of 28 ERM values was used
to calculate the mERMQ (Table 1), which was the same as that used in previous mERMQ performance studies (Long
et al. 2000).
The SQGQ1 approach is a composite of chemical guidelines from other approaches that were selected to provide
an improved ability to predict toxicity to amphipods using California data (Fairey et al. 2001). These values are a
combination of consensus values for PAHs and PCBs (Swartz 1999; MacDonald et al. 2000), ERMs, and PELs (probable
effects level) (MacDonald et al. 1996). The index used to represent the SQGQ1 guidelines in the present study was the
mean SQGQ1 quotient, which was calculated by dividing each chemical concentration by its respective SQG (Table 1)
and averaging the individual quotients.
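As a rough illustration of the standardized sums and mean quotient calculations described above, the following Python sketch computes a total PCB value from the 16-congener subset (scaled by the 1.72 correction factor) and a mean SQG quotient. The function names, the dictionary-based sample layout, and all numeric values in the usage lines are hypothetical and are not taken from Table 1. Skipping chemicals that lack either a measurement or a guideline is an assumption of this sketch, not a documented rule of the study.

```python
from statistics import mean

# Congener subset listed in the text; the sum is scaled to approximate
# the value obtained with the larger NOAA congener list.
PCB_CONGENERS = (8, 18, 28, 44, 52, 66, 101, 105, 110, 118,
                 128, 138, 153, 180, 187, 195)
PCB_CORRECTION = 1.72


def total_pcb(congener_conc):
    """Sum the listed congeners and apply the correction factor.

    Assumes every listed congener has a measured or imputed value;
    a missing congener raises KeyError instead of being treated as zero.
    """
    subset_sum = sum(congener_conc[c] for c in PCB_CONGENERS)
    return PCB_CORRECTION * subset_sum


def mean_quotient(sample, guidelines):
    """Average of concentration / guideline over chemicals present in both."""
    quotients = [sample[chem] / guidelines[chem]
                 for chem in guidelines if chem in sample]
    if not quotients:
        raise ValueError("no chemicals in common between sample and guidelines")
    return mean(quotients)


# Hypothetical guideline values and concentrations (illustration only).
guidelines = {"Cu": 200.0, "Zn": 400.0, "PCB_total": 100.0}
congeners = {c: 1.0 for c in PCB_CONGENERS}
sample = {"Cu": 100.0, "Zn": 100.0, "PCB_total": total_pcb(congeners)}

print(round(sample["PCB_total"], 2))
print(round(mean_quotient(sample, guidelines), 4))
```

The same `mean_quotient` helper would apply to any of the quotient-based indices (ERM, SQGQ1, Consensus) by swapping the guideline dictionary.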
600 Integr Environ Assess Manag 8, 2012—SM Bay et al.
Table 1. Chemical values for individual sediment quality guidelines used for data analyses
Chemical Units ERM CA ERM SoCA ERM NorCA ERM SQGQ1 Consensus
CA LRM = California logistic regression model; SoCA LRM = southern California LRM; NorCA LRM = northern California LRM;
p,p′-DDE = 1-chloro-4-[2,2-dichloro-1-(4-chlorophenyl)ethenyl]benzene; DDTs = sum of p,p′-DDT, o,p′-DDT, p,p′-DDE,
o,p′-DDE, p,p′-DDD, o,p′-DDD; PAH = polycyclic aromatic hydrocarbons; PCB = polychlorinated biphenyls.
Values for the effects range median (ERM) were taken from Long et al. (1995); CA ERM, SoCA ERM, and NorCA ERM indicate California-specific ERM values for the
entire state, southern California, and northern California, respectively. Mean sediment quality guideline quotient 1 (SQGQ1) values taken from Fairey et al. (2001).
Consensus midpoint effect concentration values taken from Swartz (1999), MacDonald et al. (2000), and Vidal and Bay (2005). Concentrations are on a dry
weight basis except where noted.
a Organic carbon basis (mg/g).
Consensus SQGs are chemical values based on the integration of multiple SQG approaches in an effort to
obtain guidelines with greater validity. The integration method and types of SQGs used vary, but in general the
consensus SQG represents either the arithmetic or geometric mean of at least 3 different SQGs having a similar intended
application (e.g., to predict probable biological effects). The Consensus SQG values for PAHs and PCBs were midrange
effect concentrations obtained from Swartz (1999) and MacDonald et al. (2000), respectively. Consensus values for
DDTs, dieldrin, As, Cd, Cr, Cu, Pb, Hg, Ni, Ag, and Zn were obtained from Vidal and Bay (2005). The index used to
represent the Consensus SQGs in the present study was the mean Consensus quotient, which was calculated by dividing
each chemical concentration by its respective Consensus SQG (Table 1) and averaging the individual quotients.
The LRM approach uses a suite of regression models to relate chemical concentration to the probability of sediment
toxicity. Chemical-specific models were developed using logistic regression analysis of a large database of marine
amphipod survival data from field studies throughout North America (Field et al. 1999, 2002). The logistic regression
model is described by the following equation:

$$p = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$$

and calculating medians based on the distribution of all data, rather than selected values from each study.
California LRMs for individual chemicals were developed for the statewide and regional California data sets using the
methods described in USEPA (2005b). These models were applied to the California calibration data using
<80% control adjusted amphipod survival as the definition of a toxic sample. The specific models included in the CA
LRM, SoCA LRM, and NorCA LRM approaches were selected from a library of candidate models that included
national models, as well as models derived using the California data sets. The selected models were chosen based
on the suitability of fit with the observed probability of toxicity (Table 2). Models with high false positive rates were
not included.
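The logistic model and the maximum-probability index can be sketched in Python as follows. This is a minimal sketch under stated assumptions: x is taken to be the base-10 logarithm of the chemical concentration, T50 is obtained by solving p = 0.5 (so that T50 = 10^(-b0/b1)), and the coefficients and chemical names in the usage lines are hypothetical values chosen for easy arithmetic, not entries from Table 2.

```python
import math


def lrm_probability(conc, b0, b1):
    """p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), with x = log10(conc) (assumed)."""
    z = b0 + b1 * math.log10(conc)
    # 1 / (1 + exp(-z)) is algebraically identical to the form in the docstring.
    return 1.0 / (1.0 + math.exp(-z))


def t50(b0, b1):
    """Concentration at which p = 0.5, i.e. b0 + b1*log10(T50) = 0."""
    return 10 ** (-b0 / b1)


def p_max(sample, models):
    """Largest single-chemical probability among chemicals that have a model."""
    probs = [lrm_probability(sample[chem], *models[chem])
             for chem in models if chem in sample]
    if not probs:
        raise ValueError("no chemicals in common between sample and models")
    return max(probs)


# Hypothetical (b0, b1) pairs and concentrations, for illustration only.
models = {"chem_A": (-3.0, 1.5), "chem_B": (-2.0, 2.0)}
sample = {"chem_A": 100.0, "chem_B": 1.0}

print(t50(*models["chem_A"]))
print(p_max(sample, models))
```

Taking the maximum rather than the mean means a single chemical with a high predicted probability drives the index, which is the design choice implied by the "maximum probability of toxicity" wording in the text.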
Table 2. Logistic regression parameters for the regional and national models compared in this study
Cadmium mg/kg 0.3 2.5 1.4 0.3 3.2 0.8 0.3 3.2 0.8 1.5 3.4 0.4
Copper mg/kg 5.6 2.6 145.0 6.8 2.8 268.0 6.6 3.8 51.0
Lead mg/kg 5.5 2.8 94.0 4.7 2.8 46.0 8.6 4.8 62.0
Zinc mg/kg 8.0 3.3 245.0 5.1 2.4 132.0 10.0 4.2 234.0 13.8 6.9 100.0
Dieldrin mg/kg 1.2 2.6 2.9 1.8 2.6 5.1 1.2 4.3 2.0
HMW PAH mg/kg 8.2 2.0 12506.0 8.2 2.0 12506.0 4.3 1.5 785.2
LMW PAH mg/kg 6.8 1.9 4127.0 6.8 1.9 4127.0 3.4 1.5 185.2
p,p’-DDD mg/kg 1.9 1.5 19.0 1.8 2.0 7.6 0.8 2.5 2.0
p,p’-DDT mg/kg 3.6 3.3 12.0 1.5 1.6 8.1 0.6 3.3 1.5
PCB, total mg/kg 3.5 1.4 368.0 4.4 1.5 945.0 4.4 1.5 945.0 4.4 1.5 945.0
CA LRM = California logistic regression model; SoCA LRM = southern California LRM; NorCA LRM = northern California LRM;
HMW PAH = high molecular weight polycyclic aromatic hydrocarbons; LMW PAH = low molecular weight polycyclic aromatic
hydrocarbons; o,p′-DDD = 1-chloro-2-[2,2-dichloro-1-(4-chlorophenyl)ethyl]benzene;
p,p′-DDD = 1-chloro-4-[2-chloro-1-(4-chlorophenyl)ethenyl]benzene;
p,p′-DDT = 1-chloro-4-[2,2,2-trichloro-1-(4-chlorophenyl)ethyl]benzene; DDTs = sum of p,p′-DDT, o,p′-DDT, p,p′-DDE,
o,p′-DDE, p,p′-DDD, and o,p′-DDD; PCB = polychlorinated biphenyl. Values for the national logistic
regression model (LRM) were taken from Field et al. (2002); CA LRM, SoCA LRM, and NorCA LRM indicate California-specific
LRM values for the entire state, southern California, and northern California, respectively. B0 = intercept; B1 = slope;
T50 is the calculated concentration corresponding to a toxicity probability of 0.5. Concentrations are on a dry weight basis.
plings of the calibration data. With each subsampling, 40 representatives from each of the 4 toxicity categories were
selected randomly without replacement from the larger data set. This step was necessary due to the greater prevalence of
nontoxic samples in the calibration data set. Using the full data set, potential exists for the lowest threshold to be set
artificially high simply to increase the number of samples classified into the lowest category, consistent with the
majority. However, these thresholds would not be useful for detecting low to moderate levels of toxicity in those water
bodies that did not exhibit the same preponderance of nontoxic samples as that found in our calibration data set.
By selecting our subsamples uniformly across the 4 categories, we could select (and evaluate) thresholds relative to their
ability to discriminate among the 4 toxicity categories equally, without preference for 1 category over the other. In
addition, subsampling 50 times ensured that nearly every sample was included in the analysis at least once.
A set of 3 optimal thresholds was determined for each bootstrap sample by comparing weighted agreement statistics
for a large set of possible candidate thresholds and then choosing the set of 3 thresholds that yielded the largest
weighted agreement. Candidates consisted of all ordered permutations of 3 threshold values, taken at 5% increments of
the SQG’s range. Weighted values for thresholds between the 5% increments were linearly interpolated. To ensure
convergence of optimization and so that optimal thresholds were not too close to one another, distances between individual
thresholds within each set were constrained to be no less than 10% of the chemical range. Taking the median of each
optimal threshold across all 50 subsamples gave the final set of SQG-specific threshold values for the Low (T1), Moderate
(T2), and High (T3) categories.

Evaluation of SQG performance
SQG performance was evaluated by quantifying the strength of association between sediment chemistry and
toxicity in terms of both correlation and categorical classification accuracy. Correlation was measured as the
nonparametric Spearman’s correlation coefficient between the SQG index value (i.e., mean quotient or Pmax) and percent
amphipod mortality (100 − control adjusted survival). Analyses of categorical classification accuracy were based on the
frequency with which the SQG index category (determined by applying the thresholds derived from the calibration data
set) correctly predicted the measured toxicity response category. All analyses were conducted using an independent
validation data set that was not used for threshold development. Two measures of classification accuracy were
calculated: percent agreement and weighted kappa. Percent agreement is the percentage of samples that are correctly
classified, calculated as A = (Nc/Nt) × 100, where A = percent agreement, Nc = number of samples correctly
classified, and Nt = total number of samples.
The weighted kappa statistic (Cohen 1960, 1968) is also a measure of agreement between the SQG predictions and
toxicity, but differs in that a correction for chance is applied and partial credit is given according to the magnitude of
disagreement. Kappa weights were based on the linear weighting scheme of Cicchetti and Allison (1971); a weight
of 1 was assigned to cases of perfect agreement and weights of 1/3, 1/6, and 0 assigned to disagreements of 1, 2, or 3
toxicity categories, respectively (see Supplemental Data). SAS PROC FREQ (SAS Institute) was used to calculate the
weighted kappa (Stokes et al. 2000). Weighted kappa values range between −1 and 1, where 1 is 100% agreement.
Weighted kappa values >0 indicate that the SQG is performing better than is expected by chance alone; weighted
kappa = 0 implies no improvement over chance classification; weighted kappa values <0 indicate a less than chance
expectation of classification accuracy.
A bootstrap resampling approach similar to that used for threshold development was also used in calculation of the
correlation, percent agreement, and weighted kappa values. The reported correlation and classification accuracy values are
the median of 50 resamples. The approach having the highest median values for both correlation and classification accuracy
was selected as the best performing SQG. Those medians that fell below the 10th percentile of the distribution having the
highest median performance (i.e., correlation, percent agreement, and weighted kappa) were deemed statistically different.
Medians above the 10th percentile were characterized as statistically similar. Correlation results were given greater
weight when the rankings were variable among the performance measures to minimize the influence of threshold selection.
Bootstrapping addressed 3 important issues. First, bootstrapping was used to create data subsets with a uniform
distribution of toxicity and thus eliminate prevalence bias due to the relatively high proportion of nontoxic samples in the
validation data set. SQG accuracy then was assessed with respect to all 4 categories equally, without preference to a single
category. Without correction for prevalence, less sensitive SQGs (those that tend to classify samples in the lowest
toxicity category) or SQGs with a stronger correlation with nontoxic or low toxicity samples will tend to perform better
than other SQGs, simply because there are more nontoxic samples to evaluate. In addition, the correction for chance in
the weighted kappa statistic may impose an unfair penalty for greater agreement in the lower categories due to the skewness
of toxicity distribution. For a more thorough examination of the effect of prevalence on performance statistics, see Mouton
et al. (2010), Feinstein and Cicchetti (1990), and Lantz and Nebenzahl (1996). Second, bootstrapping provided a more
robust performance evaluation because SQGs were evaluated across multiple subsamplings, where the relative contribution
of contaminants varied within each of the toxicity categories. Taking the median as a measure of performance removed the
influence of spurious results or outliers. Finally, bootstrapping allowed for statistical comparisons to be made among the
SQGs.

RESULTS
Different patterns of sediment contamination were apparent between the northern and southern California data sets
(Table 3), reflecting different anthropogenic inputs and geochemistry. Median concentrations of most PAH compounds,
Cr, and Ni were greater in the north, whereas the south data set contained higher concentrations of chlordane,
Cu, DDTs, PCBs, and Zn. The southern California data set usually contained the highest concentrations of each
contaminant, which may reflect the larger south data set. An exception was the presence of higher Cr and Ni
concentrations in the north data set, which was likely due to higher naturally occurring concentrations of these elements in
northern California soils.
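The classification statistics and the prevalence-balancing resampling described under "Evaluation of SQG performance" can be sketched in Python as follows. This is an illustrative sketch, not the SAS PROC FREQ procedure used in the study: the function names and example data are hypothetical, toxicity categories are assumed to be coded 0 to 3, and the agreement weights are passed in by size of disagreement so that the weights quoted in the text (1, 1/3, 1/6, 0) can be supplied directly. The number of samples drawn per category is a free parameter here.

```python
import random
from collections import Counter
from statistics import median


def percent_agreement(observed, predicted):
    """A = (Nc / Nt) * 100."""
    n_correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * n_correct / len(observed)


def weighted_kappa(observed, predicted, weights=(1.0, 1 / 3, 1 / 6, 0.0)):
    """Chance-corrected agreement with partial credit by size of disagreement.

    weights[d] is the credit for a disagreement of d categories.
    """
    n = len(observed)
    obs_freq = Counter(observed)
    pred_freq = Counter(predicted)
    # Observed weighted agreement.
    p_obs = sum(weights[abs(o - p)] for o, p in zip(observed, predicted)) / n
    # Weighted agreement expected by chance from the marginal frequencies.
    p_exp = sum(weights[abs(i - j)] * obs_freq[i] * pred_freq[j]
                for i in obs_freq for j in pred_freq) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


def balanced_indices(observed, n_per_category, rng):
    """Draw the same number of samples, without replacement, from each category."""
    chosen = []
    for cat in sorted(set(observed)):
        members = [i for i, o in enumerate(observed) if o == cat]
        chosen.extend(rng.sample(members, n_per_category))
    return chosen


def bootstrap_median_kappa(observed, predicted, n_per_category,
                           n_resamples=50, seed=0):
    """Median weighted kappa over category-balanced subsamples."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(n_resamples):
        idx = balanced_indices(observed, n_per_category, rng)
        kappas.append(weighted_kappa([observed[i] for i in idx],
                                     [predicted[i] for i in idx]))
    return median(kappas)


# Hypothetical observed and predicted categories (illustration only).
observed = [0] * 6 + [1] * 4 + [2] * 4 + [3] * 4
predicted = [0, 0, 0, 1, 1, 0, 1, 1, 2, 0, 2, 2, 3, 1, 3, 3, 2, 0]
print(percent_agreement(observed, predicted))
print(bootstrap_median_kappa(observed, predicted, n_per_category=3))
```

Drawing an equal number of samples from each category before computing kappa mirrors the stated purpose of the bootstrap, which is to keep the abundant nontoxic samples from dominating the performance statistics.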
Table 3. Cumulative distribution of sediment chemistry data for the California samples used in the analyses
DDTs = sum of p,p′-DDT, o,p′-DDT, p,p′-DDE, o,p′-DDE, p,p′-DDD, and o,p′-DDD; PAH = polycyclic aromatic hydrocarbons; PCB = polychlorinated biphenyl.
There was a similar range and distribution of sediment toxicity in the northern and southern California data sets
(Figure 2). The distribution of the data was skewed toward low toxicity; approximately 60% of the samples in each
region had >80% survival and <10% had <40% survival. The similarity in distribution between regions despite differences
in the relative proportion of data from the 2 amphipod species suggests that both species were responding similarly to
the sediment characteristics.
There were large differences in the number of chemicals and their threshold concentrations included in the different
SQG indices (Tables 1 and 2). The number of chemicals varied from 9 for the SQGQ1 to 28 for the CA ERM.
Individual chemical concentrations for the ERM, SQGQ1, and Consensus SQGs were similar because these values were
often derived from similar sources. There were often large differences in individual chemical concentrations between the
national and region-specific versions of the ERM. This was
SQG Approach Index North South State North South State North South State
National ERM Mean Quotient 0.08 0.06 0.07 0.15 0.12 0.13 0.29 0.38 0.33
National LRM Maximum Probability 0.17 0.23 0.230 0.26 0.44 0.35 0.50 0.61 0.55
Consensus Mean Quotient 0.15 0.14 0.14 0.23 0.26 0.25 0.51 0.60 0.55
SQGQ1 Mean Quotient 0.06 0.16 0.160 0.11 0.34 0.19 0.33 0.80 0.52
CA LRM Maximum Probability 0.25 0.42 0.34 0.42 0.58 0.250 0.62 0.72 0.67
CA ERM Mean Quotient 0.15 0.14 0.15 0.23 0.25 0.24 0.68 1.28 0.93
ERM = effects range median; LRM = logistic regression model; SQGQ1 = sediment quality guideline quotient 1; CA LRM = California LRM;
CA ERM = California ERM.
Table 5. Nonparametric Spearman correlation (r) and classification accuracy (weighted kappa) of statewide SQG approaches with amphipod mortality

Region   Approach       Weighted kappa   % Agreement   r
State    CA LRM         0.23             37            0.35
State    National ERM   0.17             32            0.25
State    Consensus      0.17             31            0.25
State    National LRM   0.15             35            0.22
State    CA ERM         0.17             33            0.20
State    SQGQ1          0.12             32            0.16

ERM = effects range median; LRM = logistic regression model; SQGQ1 = sediment quality guideline quotient 1; CA LRM = California LRM; CA ERM = California ERM. Values are the median of the bootstrapped analyses. Shaded cells indicate values that are statistically similar (within the 90th percentile) to the highest value. Analyses were conducted on the combined data for the north and south validation data sets and used thresholds developed using the statewide data set.

ence in performance among many of the indices. This differs from the findings of Vidal and Bay (2005) and probably results from using thresholds that were selected using a consistent methodology and calibration data set. The standardized thresholds allowed each SQG approach to be evaluated on a similar basis, so that differences in performance could be compared without the confounding effect of differences in threshold selection.

Two of the SQG approaches were recalibrated using California data, which had mixed effects. For the CA LRM, there was a substantive improvement in performance, but performance of the mean quotients based on the CA ERM was comparable to that of the national mERMQ. This may have resulted from differences in the SQG calibration process. The CA ERMs consisted entirely of new values that were derived from the California data set. All available CA ERMs were used in the quotient calculations, regardless of their reliability for predicting toxicity. In contrast, predictive ability (relative to the national LRM models) was taken into account when selecting the set of models used for the CA LRM. A similar selection process was not used for the CA ERM because of differences in derivation methodology compared to the national ERMs, which were based on multiple types of toxicity tests and other biological response values (Long et al. 1995).

The improved performance of the CA LRM may also have been due to differences in the composition, magnitude, and bioavailability of sediment contamination in the California data, relative to the data used for national LRM development.
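As an illustration of how the two kinds of aggregate index discussed above are formed, the following minimal Python sketch computes a mean SQG quotient and a maximum probability of toxicity (Pmax). The chemical names, guideline values, and regression coefficients are hypothetical placeholders, not values from this study. The single-chemical logistic form on log10 concentration is an assumption made here, following the general LRM approach of Field et al. (2002).

```python
import math

def mean_quotient(conc, guideline):
    """Mean SQG quotient: average of concentration / guideline over shared chemicals."""
    shared = [chem for chem in conc if chem in guideline]
    if not shared:
        raise ValueError("no chemicals in common between sample and guideline set")
    return sum(conc[chem] / guideline[chem] for chem in shared) / len(shared)

def lrm_probability(concentration, b0, b1):
    """Assumed single-chemical logistic model on log10 concentration."""
    z = b0 + b1 * math.log10(concentration)
    return 1.0 / (1.0 + math.exp(-z))

def p_max(conc, models):
    """Pmax: the highest single-chemical probability of toxicity in the sample."""
    shared = [chem for chem in conc if chem in models]
    if not shared:
        raise ValueError("no chemicals in common between sample and model set")
    return max(lrm_probability(conc[chem], *models[chem]) for chem in shared)

# Hypothetical sample, guideline values, and (b0, b1) coefficients.
sample = {"chem_a": 50.0, "chem_b": 100.0}
guidelines = {"chem_a": 100.0, "chem_b": 50.0}
models = {"chem_a": (-2.0, 1.0), "chem_b": (-4.0, 2.0)}

print(mean_quotient(sample, guidelines))
print(p_max(sample, models))
```

With these placeholder inputs the mean quotient is 1.25 and Pmax is 0.5. Either index would then be compared against the classification thresholds to assign a category of predicted toxicity.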
Table 6. Classification accuracy (weighted kappa) and Spearman correlation (r) of SQG approaches applied to data from each region separately

[Table body not recovered. Column groups: Statewide thresholds; Region-specific thresholds.]

ERM = effects range median; LRM = logistic regression model; SQGQ1 = sediment quality guideline quotient 1; CA LRM = California LRM; CA ERM = California ERM; Nor/SoCA LRM = northern or southern California LRM; Nor/SoCA ERM = northern or southern California ERM. Values are the median of the bootstrapped analyses. Shaded cells indicate values that are statistically similar (within the 90th percentile) to the highest value. Analyses were conducted separately using thresholds developed with statewide and region-specific data sets.
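The two performance measures reported in Tables 5 and 6 can be sketched as follows. This is a minimal, dependency-free Python illustration and not the procedure used in the study. It assumes 4 ordered categories coded 0 to 3, linear disagreement weights (w = 1 - |i - j| / (k - 1)), and average ranks for ties; the weights actually used are described in the Supplemental Data. Function names and example data are hypothetical.

```python
import math

def weighted_agreement_and_kappa(observed, predicted, k=4):
    """Return (weighted agreement, weighted kappa) for categories coded 0..k-1."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Contingency table: rows = observed category, columns = predicted category.
    table = [[0] * k for _ in range(k)]
    for o, p in zip(observed, predicted):
        table[o][p] += 1
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        # Assumed linear weights: full credit on the diagonal, none at maximum distance.
        return 1.0 - abs(i - j) / (k - 1)

    p_obs = sum(weight(i, j) * table[i][j] / n for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return p_obs, (p_obs - p_exp) / (1.0 - p_exp)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

In this sketch the categorical inputs would be the observed and predicted toxicity categories for each sample, and the correlation inputs would be the SQG index value and amphipod mortality.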
Regional differences in contamination and geochemistry have been identified as important factors affecting the predictive accuracy of SQGs (Long et al. 2000; Wenning et al. 2005). Because the values used in empirical SQG approaches are derived from chemistry-toxicity relationships in the calibration data set, regionally calibrated approaches would be expected to have greater predictive accuracy.

The regional SQG results suggest that further improvement in SQG performance could be obtained through further site-specific normalization or the use of mechanistic SQGs. However, normalization of the organics data to TOC and metals data to a reference element (Fe) and use of US Environmental Protection Agency (USEPA) equilibrium partitioning sediment benchmarks were evaluated in preliminary phases of this study and did not result in any improvement in correlation or classification accuracy.

Use of thresholds calibrated to the north and south subregions produced only small increases in performance relative to the statewide thresholds. The relatively small differences in regional performance are probably related to the heterogeneous nature of sediment contamination. Although there are differences in overall pattern and magnitude of contamination in the northern and southern California data sets, contamination patterns within each region are highly diverse due to the presence of multiple water bodies and diverse contaminant sources.

Regional thresholds for the SQGQ1 differed more than for the other SQG approaches, with higher thresholds for the south data set. The cause for this difference was not determined, but it may have been related to the relatively small set of contaminants used in calculating the SQGQ1 index. Data for only 9 contaminants were used in the SQGQ1, whereas the other approaches used 13 to 28 contaminants. Use of fewer components in the SQGQ1 may have made this index more sensitive to variations in regional contamination patterns. However, the apparent greater sensitivity of the SQGQ1 did not result in a higher level of performance relative to the other SQGs.

The limited improvement in classification accuracy obtained with regional calibration suggests that further calibration and normalization efforts are likely to have limited success in improving the association of empirical SQG indices with biological effects. Interlaboratory variation in the chemistry or toxicity analyses may have reduced the SQG performance values as data were compiled from multiple laboratories and over many years. A formal intercalibration was not possible for this study, but the effects of interlaboratory variation are expected to be relatively small as most of the data were compiled from regional monitoring programs that employ robust QA/QC procedures for both chemistry and toxicity data. SQG index performance values were also likely reduced by the analysis of bootstrapped data subsets having a uniform range of toxicity, which reduced the proportion of nontoxic samples that most indices have greater success in predicting.

Because the performance difference among SQG indices was small, characteristics such as history of use, ease of application, types of chemicals included in the constituent array, and feasibility for revision should be considered when selecting the SQG approach to be used. For instance, the Consensus and SQGQ1 approaches incorporate fewer chemicals than the other approaches, and it is difficult to add new contaminants of concern because these SQGs are dependent on the availability of values from other sources. Local calibration is also not feasible for these approaches for the same reason.

The best performing index, CA LRM, is highly amenable to revision as demonstrated by this study. However, LRM approaches are also the most difficult to apply and interpret because a complex set of regressions must be used to determine probabilities of toxicity, rather than comparing chemistry data to a simple table of SQG values. These difficulties can be overcome by incorporating the regression calculations into spreadsheets or other data analysis tools and establishing thresholds for interpreting the Pmax values.

The low levels of correlation and agreement attained in this study represent the maximum likely to be attained when empirical SQG approaches are applied to sediments with the low to moderate levels of contamination characteristic of California bays and estuaries. A higher level of performance might be obtained in regions having higher sediment contamination levels (Long et al. 2006). The SQG values and thresholds developed in this study should not be applied to other regions without validation, as they have been optimized to match California contamination patterns. It is recommended that similar calibration efforts, especially threshold optimization, be conducted before applying SQGs in other regions to maximize SQG index performance.

The high uncertainty associated with the indices underscores their limited usefulness to represent sediment quality when used without supporting lines of evidence (e.g., toxicity and biological assessment). These indices are also ineffective for identifying the cause of sediment toxicity, as they are not based on chemical-specific concentration–response relationships. These limitations of SQGs are well known and are addressed in most sediment quality assessment frameworks by using these approaches in combination with biological effects measures in a multiple lines of evidence approach (Wenning et al. 2005), such as that recently adopted by the state of California (SWRCB 2008).

Substantial improvement in performance beyond that described here will require fundamental changes to both SQG components and conceptual approach. For example, all of the SQG approaches in common use are based on an outdated list of priority and legacy pollutants (e.g., PCBs, trace metals, PAHs, chlorinated pesticides) that does not include current use pesticides. These pesticides, including pyrethroids (e.g., bifenthrin) and organophosphates (e.g., chlorpyrifos), have widespread occurrence in coastal watersheds, bays, and estuaries (Delgado-Moreno et al. 2011; Lao et al. 2012). Pyrethroids in particular have been identified as a dominant cause of sediment toxicity in California streams and estuaries (Bay et al. 2011; Holmes et al. 2008). Although the current list of SQG contaminants is effective for characterizing potential exposure to unmeasured chemicals having similar sources, this assumption may not be valid for current use pesticides and other compounds having different sources and input history. Similarly, recent research indicates that measurement of PAHs may not adequately represent the potential for toxicity from oils (Mount et al. 2009).

Development of SQG approaches that account for changes in contaminant bioavailability is also needed to improve the interpretation of sediment contamination. The bulk chemistry measurements used in current empirical approaches do not address bioavailability and thus are unable to accurately
depict changes in organism exposure resulting from geochemical factors. Progress has been made in developing mechanistic SQG approaches based on equilibrium partitioning theory (USEPA 2005a, 2008), but additional research is needed to develop approaches that perform well for sediment quality assessment under a wide range of conditions. In recent years, new methods for evaluating the bioavailability of sediment contaminants based on passive sampling devices or measures of rapidly desorbing contaminant pools have been developed that show promise for application in sediment quality assessment (Maruya et al. this issue). Incorporation of this technology into sediment quality assessment frameworks, either as a replacement for existing SQG approaches or as an additional line of evidence, holds promise for strengthening the available tools for interpreting the significance of sediment contamination.

SUPPLEMENTAL DATA
Calculation of weighted agreement and weighted kappa statistic.

Acknowledgment: The authors thank Chris Beegan from the California Water Resources Control Board, and Mike Connor and Bruce Thompson of the San Francisco Estuary Institute for their suggestions on the design of this study. Peggy Myre of Exa Data and Mapping compiled and standardized the data sets. Jeff Brown, Diana Young, and Darrin Greenstein assisted with data compilation and statistical analysis. The authors also thank Peter Landrum, Ed Long, Todd Bridges, Tom Gries, Rob Burgess and Bob Van Dolah for their thoughtful review of the ideas contained within the document. Work on this project was funded by the California State Water Resources Control Board under agreement 01-274-250-0.

REFERENCES
Barrick R, Becker S, Brown L, Beller H, Pastorok R. 1988. Sediment Quality Values refinement: 1988 update and evaluation of Puget Sound AET, Volume 1. Bellevue, WA: PTI Environmental Services. 177 p.
Bay SM, Greenstein DJ, Maruya KA, Lao W. 2011. Toxicity identification evaluation of sediment (sediment TIE) in Ballona Creek Estuary. Technical Report 634. Costa Mesa, CA: Southern California Coastal Water Research Project.
Bay SM, Weisberg SB. 2012. A framework for interpreting sediment quality triad data. Integr Environ Assess Manag 8:589–596.
Cicchetti DV, Allison T. 1971. A new procedure for assessing reliability of scoring EEG sleep recordings. Am J EEG Technol 11:101–109.
Cohen J. 1960. A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46.
Cohen J. 1968. Weighted Kappa nominal scale agreement with provision for scale disagreement or partial credit. Psychol Bull 70:213–220.
Delgado-Moreno L, Lin K, Velga-Nascimento R, Gan J. 2011. Occurrence and toxicity of three classes of insecticides in water and sediment in two southern California coastal watersheds. J Agric Food Chem 59:9448–9456.
Fairey R, Long ER, Roberts CA, Anderson BS, Phillips BM, Hunt JW, Puckett HR, Wilson CJ. 2001. An evaluation of methods for calculating mean sediment quality guideline quotients as indicators of contamination and acute toxicity to amphipods by chemical mixtures. Environ Toxicol Chem 20:2276–2286.
Feinstein AR, Cicchetti DV. 1990. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43:6:543–549.
Field LJ, MacDonald DD, Norton SB, Ingersoll CG, Severn CG, Smorong D, Lindskoog R. 2002. Predicting amphipod toxicity from sediments using Logistic Regression Models. Environ Toxicol Chem 9:1993–2005.
Field LJ, MacDonald D, Norton SB, Severn CG, Ingersoll CG. 1999. Evaluating sediment chemistry and toxicity data using logistic regression modeling. Environ Toxicol Chem 18:1311–1322.
Greenstein DJ, Bay SM. 2012. Selection of methods for assessing sediment toxicity in California bays and estuaries. Integr Environ Assess Manag 8:625–637.
Helsel D. 2005. More than obvious: Better methods for interpreting nondetect data. Environ Sci Technol 39:419A–423A.
Holmes RW, Anderson BS, Phillips BM, Hunt JW, Crane DB, Mekebri A, Connor V. 2008. Statewide investigation of the role of pyrethroid pesticides in sediment toxicity in California’s urban waterways. Environ Sci Technol 42:7003–7009.
Lantz CA, Nebenzahl E. 1996. Behavior and interpretation of the kappa statistic: Resolution of the two paradoxes. J Clin Epidemiol 49:431–434.
Lao W, Tiefenthaler L, Greenstein DJ, Maruya KA, Bay SM, Ritter K, Schiff K. 2012. Pyrethroids in Southern California coastal sediments. Environ Toxicol Chem 31:1649–1656.
Long ER, Field JE, MacDonald DD. 1998. Predicting toxicity in marine sediments with numerical sediment quality guidelines. Environ Toxicol Chem 17:714–727.
Long ER, Ingersoll CG, MacDonald DD. 2006. Calculation and uses of mean sediment quality guideline quotients: A critical review. Environ Sci Technol 40:1726–1736.
Long ER, MacDonald DD, Severn CG, Hong CB. 2000. Classifying the probabilities of acute toxicity in marine sediments with empirically derived sediment quality guidelines. Environ Toxicol Chem 19:2598–2601.
Long ER, MacDonald DD, Smith SL, Calder FD. 1995. Incidence of adverse biological effects within ranges of chemical concentrations in marine and estuarine sediments. Environ Manage 19:81–97.
MacDonald DD, Carr RS, Calder FD, Long ER, Ingersoll CG. 1996. Development and evaluation of sediment quality guidelines for Florida coastal waters. Ecotoxicology 5:253–278.
MacDonald DD, Di Pinto LM, Field LJ, Ingersoll CG, Long ER, Swartz RC. 2000. Development and evaluation of consensus-based sediment effect concentrations for polychlorinated biphenyls (PCB). Environ Toxicol Chem 19:1403–1413.
Maruya KA, Landrum PF, Burgess RM, Shine JP. 2012. Incorporating contaminant bioavailability into sediment quality assessment frameworks. Integr Environ Assess Manag 8:659–673.
Mount DR, Heinis LJ, Highland TL, Hockett JR, Hoff DJ, Jenson CT, Norberg-King TJ. 2009. Are PAHs the right metric for assessing toxicity related to oils, tars, creosote, and similar contaminants in sediments? In 5th International Conference on Remediation of Contaminated Sediments, February 2–5, 2009, Jacksonville, FL.
Mouton AM, DeBaets B, Goethals PLM. 2010. Ecological relevance of performance criteria for species distribution models. Ecol Modell 221:1995–2002.
O’Connor TP, Daskalakis KD, Hyl JL, Paul JF, Summers JK. 1998. Comparisons of sediment toxicity with predictions based on chemical guidelines. Environ Toxicol Chem 17:468–471.
[SWRCB] State Water Resources Control Board. 2008. Water quality control plan for enclosed bays and estuaries. Part I: Sediment quality. Sacramento, CA: State Water Resources Control Board.
Stokes ME, Davis CS, Koch GC. 2000. Categorical data analysis using the SAS system. 2nd ed. Cary (NC): SAS Institute.
Swartz RC. 1999. Consensus sediment quality guidelines for PAH mixtures. Environ Toxicol Chem 18:780–787.
[USEPA] United States Environmental Protection Agency. 1994. Methods for assessing the toxicity of sediment-associated contaminants with estuarine and marine amphipods. Washington DC: USEPA. EPA 600-R94-025.
[USEPA] United States Environmental Protection Agency. 2003. Procedures for the derivation of equilibrium partitioning sediment benchmarks (ESBs) for the protection of benthic organisms: PAH mixtures. Washington DC: USEPA. EPA-600-R-02-013.
[USEPA] United States Environmental Protection Agency. 2005a. Procedures for the derivation of equilibrium partitioning sediment benchmarks (ESBs) for the protection of benthic organisms: Metal mixtures (cadmium, Cu, Pb, Ni, Ag, and Zn). Washington DC: USEPA. EPA-600-R-02-011.
[USEPA] United States Environmental Protection Agency. 2005b. Predicting toxicity to amphipods from sediment chemistry (Final Report). Washington DC: USEPA. EPA/600/R-04/030.
[USEPA] United States Environmental Protection Agency. 2008. Procedures for the derivation of equilibrium partitioning sediment benchmarks (ESBs) for the protection of benthic organisms: Compendium of Tier 2 values for nonionic organics. Washington DC: USEPA. EPA-600-R-02-016.
Vidal DE, Bay SM. 2005. Comparative sediment guideline performance for predicting sediment toxicity in southern California, USA. Environ Toxicol Chem 24:3173–3182.
Wenning RJ, Batley GE, Ingersoll CG, Moore DW. (editors). 2005. Use of sediment quality guidelines (SQGs) and related tools for the assessment of contaminated sediments. Pensacola (FL): Society of Environmental Toxicology and Chemistry.