
Sampling, Regression, Experimental Design and Analysis for

Environmental Scientists, Biologists, and Resource Managers


C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca
December 21, 2012
Contents
1 In the beginning... 15
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Effective note taking strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 It's all Greek to me . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Which computer package? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 FAQ - Frequently Asked Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.1 Accessing journal articles from home . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.2 Downloading from the web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.3 Printing 2 pages per physical page and on both sides of the paper . . . . . . . . . . . 28
1.5.4 Is there an on-line textbook? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Introduction to Statistics 30
2.1 TRRGET - An overview of statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Parameters, Statistics, Standard Deviations, and Standard Errors . . . . . . . . . . . . . . . 34
2.2.1 A review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.2 Theoretical example of a sampling distribution . . . . . . . . . . . . . . . . . . . . 39
2.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3.1 A review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.2 Some practical advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.3 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.1 A review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.2 Comparing the population parameter against a known standard . . . . . . . . . . . . 51
2.4.3 Comparing the population parameter between two groups . . . . . . . . . . . . . . 58
2.4.4 Type I, Type II and Type III errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4.5 Some practical advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4.6 The case against hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.7 Problems with p-values - what does the literature say? . . . . . . . . . . . . . . . . 69
Statistical tests in publications of the Wildlife Society . . . . . . . . . . . . . . . . . 69
The Insignificance of Statistical Significance Testing . . . . . . . . . . . . . . . . . 69
Followups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5 Meta-data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5.1 Scales of measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5.2 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5.3 Roles of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6 Bias, Precision, Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.7 Types of missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.8 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.8.2 Conditions under which a log-normal distribution appears . . . . . . . . . . . . . . 80
2.8.3 ln() vs. log() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.8.4 Mean vs. Geometric Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.8.5 Back-transforming estimates, standard errors, and ci . . . . . . . . . . . . . . . . . 82
Mean on log-scale back to MEDIAN on anti-log scale . . . . . . . . . . . . . . . . 82
2.8.6 Back-transforms of differences on the log-scale . . . . . . . . . . . . . . . . . . . . 83
2.8.7 Some additional readings on the log-transform . . . . . . . . . . . . . . . . . . . . 84
2.9 Standard deviations and standard errors revisited . . . . . . . . . . . . . . . . . . . . . . . 95
2.10 Other tidbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.10.1 Interpreting p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.10.2 False positives vs. false negatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.10.3 Specificity/sensitivity/power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3 Sampling 106
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.1.1 Difference between sampling and experimental design . . . . . . . . . . . . . . . . 108
3.1.2 Why sample rather than census? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.3 Principal steps in a survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.4 Probability sampling vs. non-probability sampling . . . . . . . . . . . . . . . . . . 110
3.1.5 The importance of randomization in survey design . . . . . . . . . . . . . . . . . . 111
3.1.6 Model vs. Design based sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.1.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2 Overview of Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2.2 Systematic Surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.2.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.2.4 Multi-stage sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.2.5 Multi-phase designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.2.6 Panel design - suitable for long-term monitoring . . . . . . . . . . . . . . . . . . . 128
3.2.7 Sampling non-discrete objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.2.8 Key considerations when designing or analyzing a survey . . . . . . . . . . . . . . 129
3.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.4 Simple Random Sampling Without Replacement (SRSWOR) . . . . . . . . . . . . . . . . . 131
3.4.1 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.4.2 Estimating the Population Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.3 Estimating the Population Total . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.4 Estimating Population Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.4.5 Example - estimating total catch of fish in a recreational fishery . . . . . . . . . 134
What is the population of interest? . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
What is the frame? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
What is the sampling design and sampling unit? . . . . . . . . . . . . . . . . . . . . 137
© 2012 Carl James Schwarz 2 December 21, 2012
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.5 Sample size determination for a simple random sample . . . . . . . . . . . . . . . . . . . . 141
3.5.1 Example - How many angling-parties to survey . . . . . . . . . . . . . . . . . . . . 144
3.6 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.6.1 Advantages of systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.2 Disadvantages of systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.3 How to select a systematic sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.4 Analyzing a systematic sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.6.5 Technical notes - Repeated systematic sampling . . . . . . . . . . . . . . . . . . . . 149
Example of replicated subsampling within a systematic sample . . . . . . . . . . . . 149
3.7 Stratified simple random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample 154
3.7.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.7.3 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.7.4 Example - sampling organic matter from a lake . . . . . . . . . . . . . . . . . . . . 164
3.7.5 Example - estimating the total catch of salmon . . . . . . . . . . . . . . . . . . . . 168
What is the population of interest? . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
What is the sampling frame? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
What is the sampling design? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
When should the various estimates be used? . . . . . . . . . . . . . . . . . . . . . . 175
3.7.6 Sample Size for Stratified Designs . . . . . . . . . . . . . . . . . . . . . . . . . 177
3.7.7 Allocating samples among strata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.7.8 Example: Estimating the number of tundra swans. . . . . . . . . . . . . . . . . . . 183
3.7.9 Post-stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.7.10 Allocation and precision - revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 189
3.8 Ratio estimation in SRS - improving precision with auxiliary information . . . . . . . . . . 190
3.8.1 Summary of Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.8.2 Example - wolf/moose ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
SAS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Post mortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total . . 201
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Post mortem - a question to ponder . . . . . . . . . . . . . . . . . . . . . . . . . . 209
3.9 Additional ways to improve precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
3.9.1 Using both stratification and auxiliary variables . . . . . . . . . . . . . . . . . . 210
3.9.2 Regression Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.9.3 Sampling with unequal probability - pps sampling . . . . . . . . . . . . . . . . . . 211
3.10 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
3.10.1 Sampling plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.10.2 Advantages and disadvantages of cluster sampling compared to SRS . . . . . . . . . 219
3.10.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.10.4 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.10.5 Example - estimating the density of urchins . . . . . . . . . . . . . . . . . . . . . . 221
Excel Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
SAS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Planning for future experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.10.6 Example - estimating the total number of sea cucumbers . . . . . . . . . . . . . . . 227
SAS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
3.11 Multi-stage sampling - a generalization of cluster sampling . . . . . . . . . . . . . . . . . . 235
3.11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.11.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
3.11.3 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
3.11.4 Example - estimating number of clams . . . . . . . . . . . . . . . . . . . . . . . . 238
Excel Spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
SAS Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.11.5 Some closing comments on multi-stage designs . . . . . . . . . . . . . . . . . . . . 242
3.12 Analytical surveys - almost experimental design . . . . . . . . . . . . . . . . . . . . . . . . 242
3.13 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
3.14 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
3.14.1 Confusion about the definition of a population . . . . . . . . . . . . . . . . . . 247
3.14.2 How is N defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
3.14.3 Multi-stage vs. Multi-phase sampling . . . . . . . . . . . . . . . . . . . . . . . . . 248
3.14.4 What is the difference between a Population and a frame? . . . . . . . . . . . . . . 249
3.14.5 How to account for missing transects. . . . . . . . . . . . . . . . . . . . . . . . . . 249
4 Designed Experiments - Terminology and Introduction 250
4.1 Terminology and Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.1.2 Treatment, Experimental Unit, and Randomization Structure . . . . . . . . . . . . . 252
4.1.3 The Three Rs of Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . 255
4.1.4 Placebo Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
4.1.5 Single and double blinding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
4.1.6 Hawthorne Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
4.2 Applying some General Principles of Experimental Design . . . . . . . . . . . . . . . . . . 258
4.2.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.3 Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.4 Experiment 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.2.5 Experiment 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.3 Some Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.3.1 The Salk Vaccine Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.3.2 Testing Vitamin C - Mistakes do happen . . . . . . . . . . . . . . . . . . . . . . . . 262
4.4 Key Points in Design of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
4.4.1 Designing an Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.4.2 Analyzing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.4.3 Writing the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.5 A Road Map to What is Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.2 Experimental Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.3 Some Common Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
5 Single Factor - Completely Randomized Designs (a.k.a. One-way design) 272
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2 Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
5.2.1 Using a random number table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Assigning treatments to experimental units . . . . . . . . . . . . . . . . . . . . . . 275
Selecting from the population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.2.2 Using a computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Randomly assign treatments to experimental units . . . . . . . . . . . . . . . . . . 277
Randomly selecting from populations . . . . . . . . . . . . . . . . . . . . . . . . . 281
5.3 Assumptions - the overlooked aspect of experimental design . . . . . . . . . . . . . . . . . 285
5.3.1 Does the analysis match the design? . . . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.2 No outliers should be present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.3 Equal treatment group population standard deviations? . . . . . . . . . . . . . . . . 287
5.3.4 Are the errors normally distributed? . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.3.5 Are the errors independent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4 Two-sample t-test- Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.5 Example - comparing mean heights of children - two-sample t-test . . . . . . . . . . . . . . 290
5.6 Example - Fat content and mean tumor weights - two-sample t-test . . . . . . . . . . . . . . 297
5.7 Example - Growth hormone and mean nal weight of cattle - two-sample t-test . . . . . . . 303
5.8 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
5.8.1 Basic ideas of power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
5.8.2 Prospective Sample Size determination . . . . . . . . . . . . . . . . . . . . . . . . 312
5.8.3 Example of power analysis/sample size determination . . . . . . . . . . . . . . . . 313
Using tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Using a package to determine power . . . . . . . . . . . . . . . . . . . . . . . . . . 314
5.8.4 Further Readings on Power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 319
5.8.5 Retrospective Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
5.8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
5.9 ANOVA approach - Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
5.9.1 An intuitive explanation for the ANOVA method . . . . . . . . . . . . . . . . . . . 323
5.9.2 A modeling approach to ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
5.10 Example - Comparing phosphorus content - single-factor CRD ANOVA . . . . . . . . . . . 331
5.11 Example - Comparing battery lifetimes - single-factor CRD ANOVA . . . . . . . . . . . . . 343
5.12 Example - Cuckoo eggs - single-factor CRD ANOVA . . . . . . . . . . . . . . . . . . . . . 353
5.13 Multiple comparisons following ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
5.13.1 Why is there a problem? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
5.13.2 A simulation with no adjustment for multiple comparisons . . . . . . . . . . . . . . 367
5.13.3 Comparisonwise- and Experimentwise Errors . . . . . . . . . . . . . . . . . . . . . 369
5.13.4 The Tukey-Adjusted t-Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.13.5 Recommendations for Multiple Comparisons . . . . . . . . . . . . . . . . . . . . . 372
5.13.6 Displaying the results of multiple comparisons . . . . . . . . . . . . . . . . . . . . 373
5.14 Prospective Power and sample size - single-factor CRD ANOVA . . . . . . . . . . . . 375
5.14.1 Using Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
5.14.2 Using SAS to determine power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
5.14.3 Retrospective Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
5.15 Pseudo-replication and sub-sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
5.16 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.1 What does the F-statistic mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.2 What is a test statistic - how is it used? . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.3 What is MSE? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.4 Power - various questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
What is meant by detecting half the difference? . . . . . . . . . . . . . . . . . . . . 382
Do we use the std dev, the std error, or root MSE in the power computations? . . . . 382
Retrospective power analysis; how is this different from regular (i.e., prospective)
power analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
What does power tell us? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
When to use retrospective and prospective power? . . . . . . . . . . . . . . . . . . 383
When should power be reported . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
What is done with the "total sample size" reported by JMP? . . . . . . . . . . 384
5.16.5 How to compare treatments to a single control? . . . . . . . . . . . . . . . . . . . . 384
5.16.6 Experimental unit vs. observational unit . . . . . . . . . . . . . . . . . . . . . . . . 384
5.16.7 Effects of analysis not matching design . . . . . . . . . . . . . . . . . . . . . . . . 385
5.17 Table: Sample size determination for a two sample t-test . . . . . . . . . . . . . . . . . . . 388
5.18 Table: Sample size determination for a single factor, fixed effects, CRD . . . . . . . . . 390
5.19 Scientific papers illustrating the methods of this chapter . . . . . . . . . . . . . . . . . 393
5.19.1 Injury scores when trapping coyote with different trap designs . . . . . . . . . . . . 393
6 Single factor - pairing and blocking 395
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
6.2 Randomization protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
6.2.1 Some examples of several types of block designs . . . . . . . . . . . . . . . . . . . 399
Completely randomized design - no blocking . . . . . . . . . . . . . . . . . . . . . 400
Randomized complete block design - RCB design . . . . . . . . . . . . . . . . . . . 400
Randomized complete block design - RCB design - missing values . . . . . . . . . . 401
Incomplete block design - not an RCB . . . . . . . . . . . . . . . . . . . . . . . . . 401
Generalized randomized complete block design . . . . . . . . . . . . . . . . . . . . 402
6.3 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
6.3.1 Does the analysis match the design? . . . . . . . . . . . . . . . . . . . . . . . . . . 403
6.3.2 Additivity between blocks and treatments . . . . . . . . . . . . . . . . . . . . . . . 404
6.3.3 No outliers should be present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
6.3.4 Equal treatment group standard deviations? . . . . . . . . . . . . . . . . . . . . . . 406
6.3.5 Are the errors normally distributed? . . . . . . . . . . . . . . . . . . . . . . . . . . 407
6.3.6 Are the errors independent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.4 Comparing two means in a paired design - the Paired t-test . . . . . . . . . . . . . . . . . . 408
6.5 Example - effect of stream slope upon fish abundance . . . . . . . . . . . . . . . . . . . 409
6.5.1 Introduction and survey protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
6.5.2 Using a Differences analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
6.5.3 Using a Matched paired analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
6.5.4 Using a General Modeling analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.5.5 Which analysis to choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
6.5.6 Comments about the original paper . . . . . . . . . . . . . . . . . . . . . . . . . . 420
6.6 Example - Quality check on two laboratories . . . . . . . . . . . . . . . . . . . . . . . . . . 421
6.7 Example - Comparing two varieties of barley . . . . . . . . . . . . . . . . . . . . . . . . . 427
6.8 Example - Comparing prep of mosaic virus . . . . . . . . . . . . . . . . . . . . . . . . . . 432
6.9 Example - Comparing turbidity at two sites . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.9.1 Introduction and survey protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.9.2 Using a Differences analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
6.9.3 Using a Matched paired analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
6.9.4 Using a General Modeling analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 443
6.9.5 Which analysis to choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
6.10 Power and sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
6.11 Single Factor - Randomized Complete Block (RCB) Design . . . . . . . . . . . . . . . . . 449
6.11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
6.11.2 The potato-peeling experiment - revisited . . . . . . . . . . . . . . . . . . . . . . . 449
6.11.3 An agricultural example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
6.11.4 Basic idea of the analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
6.12 Example - Comparing effects of salinity in soil . . . . . . . . . . . . . . . . . . . . . . . . 453
6.12.1 Model building - fitting a linear model . . . . . . . . . . . . . . . . . . . . . . . 455
6.13 Example - Comparing different herbicides . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
6.14 Example - Comparing turbidity at several sites . . . . . . . . . . . . . . . . . . . . . . . . . 468
6.15 Power and Sample Size in RCBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
6.16 Example - BPK: Blood pressure at presyncope . . . . . . . . . . . . . . . . . . . . . . . . . 476
6.16.1 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
6.16.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
6.16.3 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
6.17 Final notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
6.18 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
6.18.1 Difference between pairing and confounding . . . . . . . . . . . . . . . . . . . . . 488
6.18.2 What is the difference between a paired design and an RCB design? . . . . . . . . . 489
6.18.3 What is the difference between a paired t-test and a two-sample t-test? . . . . . . . 489
6.18.4 Power in RCB/matched pair design - what is root MSE? . . . . . . . . . . . . . . . 490
6.18.5 Testing for block effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
6.18.6 Presenting results for blocked experiment . . . . . . . . . . . . . . . . . . . . . . . 491
6.18.7 What is a marginal mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
6.18.8 Multiple experimental units within a block? . . . . . . . . . . . . . . . . . . . . . . 492
6.18.9 How does a block differ from a cluster? . . . . . . . . . . . . . . . . . . . . . . . . 492
7 Incomplete block designs 493
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
7.2 Example: Investigate differences in water quality . . . . . . . . . . . . . . . . . . . . . . . 494
8 Estimating an overall mean with subsampling 501
8.1 Average flagellum length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
8.1.1 Average-of-averages approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
8.1.2 Using the raw measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
8.1.3 Followup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
9 Single Factor - Sub-sampling and pseudo-replication 513
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
9.2 Example - Fat levels in fish - balanced data in a CRD . . . . . . . . . . . . . . . . . . . 514
9.2.1 Analysis based on sample means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
9.2.2 Analysis using individual values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
9.3 Example - fat levels in fish - unbalanced data in a CRD . . . . . . . . . . . . . . . . . . 524
9.4 Example - Effect of UV radiation - balanced data in RCB . . . . . . . . . . . . . . . . . . . 525
9.4.1 Analysis on sample means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
9.4.2 Analysis using individual values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
9.5 Example - Monitoring Fry Levels - unbalanced data with sampling over time . . . . . . . . 535
9.5.1 Some preliminary plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
9.5.2 Approximate analysis of means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
9.5.3 Analysis of raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
9.5.4 Planning for future experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
9.6 Example - comparing mean flagella lengths . . . . . . . . . . . . . . . . . . . . . . . . 547
9.6.1 Average-of-averages approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
9.6.2 Analysis on individual measurements . . . . . . . . . . . . . . . . . . . . . . . . . 562
9.6.3 Followup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
9.7 Final Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
10 Two Factor Designs - Single-sized Experimental units - CR and RCB designs 569
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
10.1.1 Treatment structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Why factorial designs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Why not factorial designs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
Displaying and interpreting treatment effects - profile plots . . . . . . . . . . . . . . 573
10.1.2 Experimental unit structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
10.1.3 Randomization structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
10.1.4 Putting the three structures together . . . . . . . . . . . . . . . . . . . . . . . . . . 582
10.1.5 Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
10.1.6 Fixed or random effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
10.1.7 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
10.1.8 General comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
10.2 Example - Effect of photo-period and temperature on gonadosomatic index - CRD . . . . . . 586
10.2.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
10.2.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
10.2.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
10.2.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
10.2.5 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
10.3 Example - Effect of sex and species upon chemical uptake - CRD . . . . . . . . . . . . . . . 603
10.3.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
10.3.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
10.3.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
c 2012 Carl James Schwarz 8 December 21, 2012
10.3.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
10.4 Power and sample size for two-factor CRD . . . . . . . . . . . . . . . . . . . . . . . . . . 619
10.5 Unbalanced data - Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
10.6 Example - Stream residence time - Unbalanced data in a CRD . . . . . . . . . . . . . . . . 626
Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
10.6.1 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
10.6.2 The Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
10.6.3 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
10.6.4 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
10.7 Example - Energy consumption in pocket mice - Unbalanced data in a CRD . . . . . . . . . 641
10.7.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
10.7.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
10.7.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
10.7.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
10.7.5 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
10.7.6 Adjusting for unequal variances? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8 Example: Use-Dependent Inactivation in Sodium Channel Beta Subunit Mutation - BPK . . 656
10.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8.2 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
10.9 Blocking in two-factor CRD designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
10.10 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
10.10.1 How to determine sample size in two-factor designs . . . . . . . . . . . . . . . . . 669
10.10.2 What is the difference between a block and a factor? . . . . . . . . . . . . . . . 669
10.10.3 If there is evidence of an interaction, does the analysis stop there? . . . . . . . . . . 670
10.10.4 When should you use raw means or LSmeans? . . . . . . . . . . . . . . . . . . . . 671
11 SAS CODE NOT DONE 673
12 Two-factor split-plot designs 674
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
12.2 Example - Holding your breath at different water temperatures - BPK . . . . . . . . . . . . 675
12.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
12.2.2 Standard split-plot analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
12.2.3 Adjusting for body size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
12.2.4 Fitting a regression to temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
12.2.5 Planning for future studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
12.3 Example - Systolic blood pressure before presyncope - BPK . . . . . . . . . . . . . . . . . 698
12.3.1 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
12.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
12.3.3 Power and sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . 707
13 Analysis of BACI experiments 709
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
13.2 Before-After Experiments - prelude to BACI designs . . . . . . . . . . . . . . . . . . . . . 714
13.2.1 Analysis of stream 1 - yearly averages . . . . . . . . . . . . . . . . . . . . . . . . . 717
13.2.2 Analysis of Stream 1 - individual values . . . . . . . . . . . . . . . . . . . . . . . . 719
13.2.3 Analysis of all streams - yearly averages . . . . . . . . . . . . . . . . . . . . . . . . 721
13.2.4 Analysis of all streams - individual values . . . . . . . . . . . . . . . . . . . . . . . 724
13.3 Simple BACI - One year before/after; one site impact; one site control . . . . . . . . . . . . 726
13.4 Example: Change in density in crabs near a power plant - one year before/after; one site
impact; one site control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
13.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
13.5 Simple BACI design - limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
13.6 BACI with Multiple sites; One year before/after . . . . . . . . . . . . . . . . . . . . . . . . 737
13.7 Example: Density of crabs - BACI with Multiple sites; One year before/after . . . . . . . . . 739
13.7.1 Converting to an analysis of differences . . . . . . . . . . . . . . . . . . . . . . . . 741
13.7.2 Using ANOVA on the averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
13.7.3 Using ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
13.7.4 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
13.8 BACI with Multiple sites; Multiple years before/after . . . . . . . . . . . . . . . . . . . . . 752
13.9 Example: Counting fish - Multiple years before/after; One site impact; one site control . . . 754
13.9.1 Analysis of the differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
13.9.2 ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
13.9.3 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
13.10 Example: Counting chironomids - Paired BACI - Multiple-years B/A; One Site I/C . . . . . 764
13.10.1 Analysis of the differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
13.10.2 ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
13.10.3 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
13.11 Example: Fry monitoring - BACI with Multiple sites; Multiple years before/after . . . . . . 771
13.11.1 A brief digression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
13.11.2 Some preliminary plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775
13.11.3 Analysis of the averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
13.11.4 Analysis of the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
13.11.5 Power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
13.12 Closing remarks about the analysis of BACI designs . . . . . . . . . . . . . . . . . . . . . 787
13.13 BACI designs power analysis and sample size determination . . . . . . . . . . . . . . . . . 788
13.13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
13.13.2 Power: Before-After design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
Single Location studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
Multiple Location studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
13.13.3 Power: Simple BACI design - one site control/impact; one year before/after; inde-
pendent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
13.13.4 Power: Multiple sites in control/impact; one year before/after; independent samples . 803
13.13.5 Power: One site in control/impact; multiple years before/after; no subsampling . . . 808
13.13.6 Power: General BACI: Multiple sites in control/impact; multiple years before/after;
subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
14 Comparing proportions - Chi-square (χ²) tests 814
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
14.2 Response variables vs. Frequency Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 816
14.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
14.4 Single sample surveys - comparing to a known standard . . . . . . . . . . . . . . . . . . . . 820
14.4.1 Resource selection - comparison to known habitat proportions . . . . . . . . . . . . 820
14.4.2 Example: Homicide and Seasons . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
14.5 Comparing sets of proportions - single factor CRD designs . . . . . . . . . . . . . . . . . . 830
14.5.1 Example: Elk habitat usage - Random selection of points . . . . . . . . . . . . . . . 830
14.5.2 Example: Ownership and viability . . . . . . . . . . . . . . . . . . . . . . . . . . 834
14.5.3 Example: Sex and Automobile Styling . . . . . . . . . . . . . . . . . . . . . . . . . 839
14.5.4 Example: Marijuana use in college . . . . . . . . . . . . . . . . . . . . . . . . . . . 843
14.5.5 Example: Outcome vs. cause of accident . . . . . . . . . . . . . . . . . . . . . . . 847
14.5.6 Example: Activity times of birds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
14.6 Pseudo-replication - Combining tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
14.7 Simpson's Paradox - Combining tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
14.7.1 Example: Sex bias in admissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
14.7.2 Example: - Twenty-year survival and smoking status . . . . . . . . . . . . . . . . . 858
14.8 More complex designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
14.9 Final notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
14.10 Appendix - how the test statistic is computed . . . . . . . . . . . . . . . . . . . . . . . . 860
14.11 Fisher's Exact Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
14.11.1 Sampling Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.4 Example: Relationship between Aspirin Use and MI . . . . . . . . . . . . . . . . . 867
Mechanics of the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
14.11.5 Avoidance of cane toads by Northern Quolls . . . . . . . . . . . . . . . . . . . . . 870
15 Correlation and simple linear regression 878
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
15.2 Graphical displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
15.2.1 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
15.2.2 Smoothers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 882
15.3 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886
15.3.1 Scatter-plot matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 887
15.3.2 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
15.3.3 Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
15.3.4 Principles of Causation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
15.4 Single-variable regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
15.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
15.4.2 Equation for a line - getting notation straight (no pun intended) . . . . . . . . . . . . 895
15.4.3 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 896
15.4.4 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
Correct scale of predictor and response . . . . . . . . . . . . . . . . . . . . . . . . 897
Correct sampling scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
No outliers or inuential points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
Equal variation along the line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
Normality of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
X measured without error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
15.4.5 Obtaining Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
15.4.6 Obtaining Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 902
15.4.7 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
15.4.8 Example - Yield and fertilizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
15.4.9 Example - Mercury pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
15.4.10 Example - The Anscombe Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . 923
15.4.11 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
15.4.12 Example: Monitoring Dioxins - transformation . . . . . . . . . . . . . . . . . . . . 925
15.4.13 Example: Weight-length relationships - transformation . . . . . . . . . . . . . . . . 937
A non-linear fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945
15.4.14 Power/Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946
15.4.15 The perils of R² . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 947
15.5 A no-intercept model: Fulton's Condition Factor K . . . . . . . . . . . . . . . . . . . . . . 950
15.6 Frequently Asked Questions - FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 957
15.6.1 Do I need a random sample; power analysis . . . . . . . . . . . . . . . . . . . . . . 957
16 SAS CODE NOT DONE 959
17 SAS CODE NOT DONE 960
18 SAS CODE NOT DONE 961
19 Estimating power/sample size using Program Monitor 962
19.1 Mechanics of MONITOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
19.2 How does MONITOR work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
19.3 Incorporating process and sampling error . . . . . . . . . . . . . . . . . . . . . . . . . . . 977
19.4 Presence/Absence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
19.5 WARNING about using testing for temporal trends . . . . . . . . . . . . . . . . . . . . . . 989
20 SAS CODE NOT DONE 991
21 SAS CODE NOT DONE 992
22 SAS CODE NOT DONE 993
23 Logistic Regression - Advanced Topics 994
23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994
23.2 Sacricial pseudo-replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
23.3 Example: Fox-proofing mice colonies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
23.3.1 Using the simple proportions as data . . . . . . . . . . . . . . . . . . . . . . . . . . 997
23.3.2 Logistic regression using overdispersion . . . . . . . . . . . . . . . . . . . . . . . . 999
23.3.3 GLIMM modeling the random effect of colony . . . . . . . . . . . . . . . . . . . . 1000
23.4 Example: Over-dispersed Seeds Germination Data . . . . . . . . . . . . . . . . . . . . . . 1002
24 SAS CODE NOT DONE 1010
25 A short primer on residual plots 1011
25.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012
25.2 ANOVA residual plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
25.3 Logistic Regression residual plots - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015
25.4 Logistic Regression residual plots - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
25.5 Poisson Regression residual plots - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1017
25.6 Poisson Regression residual plots - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . 1019
26 SAS CODE NOT DONE 1021
27 Tables 1022
27.1 A table of uniform random digits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022
27.2 Selected Binomial individual probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 1026
27.3 Selected Poisson individual probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1034
27.4 Cumulative probability for the Standard Normal Distribution . . . . . . . . . . . . . . . . 1037
27.5 Selected percentiles from the t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 1039
27.6 Selected percentiles from the chi-squared-distribution . . . . . . . . . . . . . . . . . . . . 1040
27.7 Sample size determination for a two sample t-test . . . . . . . . . . . . . . . . . . . . . . . 1041
27.8 Power determination for a two sample t-test . . . . . . . . . . . . . . . . . . . . . . . . . . 1043
27.9 Sample size determination for a single factor, fixed effects, CRD . . . . . . . . . . . . . . 1045
27.10 Power determination for a single factor, fixed effects, CRD . . . . . . . . . . . . . . . . . 1049
28 THE END! 1053
28.1 Statisfaction - with apologies to Jagger/Richards . . . . . . . . . . . . . . . . . . . . . . . . 1053
28.2 ANOVA Man with apologies to Lennon/McCartney . . . . . . . . . . . . . . . . . . . . . . 1055
29 An overview of environmental field studies 1057
29.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
29.1.1 Survey Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065
Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065
Systematic Surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1067
Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Multi-stage sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
Multi-phase designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
Summary comparison of designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1078
29.1.2 Permanent or temporary monitoring stations . . . . . . . . . . . . . . . . . . . . . . 1079
29.1.3 Refinements that affect precision . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Auxiliary variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1083
Sampling with unequal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 1083
29.1.4 Sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
29.2 Analytical surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
29.3 Impact Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
29.3.1 Before/After contrasts at a single site . . . . . . . . . . . . . . . . . . . . . . . . . 1088
29.3.2 Repeated before/after sampling at a single site. . . . . . . . . . . . . . . . . . . . . 1088
29.3.3 BACI: Before/After and Control/Impact Surveys . . . . . . . . . . . . . . . . . . . 1089
29.3.4 BACI-P: Before/After and Control/Impact - Paired designs . . . . . . . . . . . . . . 1092
29.3.5 Enhanced BACI-P: Designs to detect acute vs. chronic effects or to detect changes
in variation as well as changes in the mean. . . . . . . . . . . . . . . . . . . . . . . 1094
29.3.6 Designs for multiple impacts spread over time . . . . . . . . . . . . . . . . . . . . . 1096
29.3.7 Accidental Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099
29.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1108
29.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110
29.6 Selected journal articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
29.6.1 Designing Environmental Field Studies . . . . . . . . . . . . . . . . . . . . . . . . 1112
29.6.2 Beyond BACI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1113
29.6.3 Environmental impact assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 1113
29.7 Examples of studies for discussion - good exam questions! . . . . . . . . . . . . . . . . . . 1114
29.7.1 Effect of burn upon salamanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114
Chapter 1
In the beginning...
Contents
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Effective note taking strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 It's all Greek to me . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Which computer package? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 FAQ - Frequently Asked Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.1 Accessing journal articles from home . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.2 Downloading from the web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.3 Printing 2 pages per physical page and on both sides of the paper . . . . . . . . . 28
1.5.4 Is there an on-line textbook? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.1 Introduction
To many students, statistics is synonymous with sadistics and is not a subject that is enjoyable. Obviously, I
think this view is mistaken and hope to present some of the interesting things that can be done with statistics.
Statistics is all about discovery - how to extract information in the face of uncertainty. In the past,
learning about statistics was tedious because of the enormous amount of arithmetic that needed to be done.
Now, we let the computer do the heavy lifting, but it is now vitally important that you UNDERSTAND what
a computer package is doing; after all, these computer packages don't have a conceptual understanding of
what the data are about. They will quite happily compute the average sex (where 0 codes males and 1 codes
females); only you can decide that this is a meaningless statistic to compute.
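To make the point concrete, here is a minimal sketch (hypothetical data, using Python's standard statistics module rather than any package discussed in these notes): the software will cheerfully average a 0/1 sex code, and only the analyst can decide whether the result means anything.

```python
import statistics

# Hypothetical sample: sex coded as 0 = male, 1 = female
sex = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

# The package computes this without complaint...
avg_sex = statistics.mean(sex)
print(avg_sex)  # 0.6

# ...but an "average sex" is meaningless as an average. At best it is
# the proportion of females in the sample, and only you can decide
# whether that interpretation makes sense for your data.
```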
These notes try to operate at a conceptual level. There are many examples which show how a typical
analysis might be performed using a statistical package. There is often no unique answer to a problem;
several good alternatives are always available, so don't let my notes constrain your thinking.
Statistics is fun! Just ask my family:
1.2 Effective note taking strategies
This section is taken from the Tomorrow's Professor Listserv, an email list server on topics of general
interest to higher education, available from Stanford University at http://sll.stanford.edu/.
It will soon become apparent, if it hasn't already, that not all lectures are fascinating and stimulating, and
that not all lecturers are born with a gift for public speaking.
However, the information and ideas that they are trying to impart are just as important, and any notes that
you take in the lectures must be understandable to you, not only five minutes after the lecture has finished,
but in several months' time, when you come to revise from them. The question, then, is how to retain your
concentration and produce a good set of notes.
There are a few misconceptions on the part of students as to what can be expected of a lecture session.
Firstly, that the responsibility for the success of the lecture is entirely the instructor's, and that the student's
role is to sit and listen or to take verbatim notes. Secondly, that the purpose of a lecture is to impart
information which will be needed for an exam question. And thirdly, that attending the lecture, and taking notes,
is an individual, even competitive, activity.
This page aims to correct these ideas, and to help you develop successful note-taking strategies.
BEFORE THE LECTURE
If you know the subject of the lecture, do some background reading beforehand. This way, you will
go into the lecture with a better understanding, and find it easier to distinguish the points worth noting.
Read through the notes of the previous lecture in the series just before the present one begins. This
helps orient your thoughts to the subject in hand, especially if you have just come from a lecture on a
completely different subject.
DURING THE LECTURE
Think of a lecture as an active, learning process, rather than a passive, secretarial exercise. Writing
verbatim what the lecturer says, or copying everything down from overheads, does not involve much
thought, and subsequent reading of these notes often makes little sense.
Pages of continuous prose are the least helpful for revision. Some things said in the lecture are obviously
more important than others, and the notes you take should reflect this. Try to give them some
structure, by using headings and sub-headings, by HIGHLIGHTING or underlining key ideas and
realizing the links between them. Alternative noting forms to linear notes, such as flow diagrams or star
charts, can be used, although these are often more helpful to revise your notes (see After the Lecture).
In some situations, you may be directed in the amount of note-taking necessary. For example, the
lecturer may start off by giving you some references on the subject he/she is to lecture on. A good
strategy to adopt in this case would be to note down carefully the references, then just LISTEN to the
lecture, making brief notes about the main points or specic examples. Taking notes from books is
far easier than in lectures, as you can go at your own pace, stop and think about something, re-read a
section etc. Use the lecture to try and understand the concepts being explained.
Or, the lecturer may give hand-outs to accompany the lecture. In this case, you don't need to make
copious notes of your own. Again, listen to what is being said, and annotate the hand-out with any
extra information. It gives you more time to think, and perhaps raise questions of your own.
On the subject of questions, it is commonly believed by students that lecturers are not to be interrupted
when they are in full flow. You may find that this isn't always the case, and there is nothing wrong
with asking individual lecturers if they mind taking questions during the lecture. It is best to establish
this at the beginning of the course of lectures.
However, there is also the problem of speaking out in front of your peers, perhaps asking something
foolish, or not having the time to frame your question well. In this case, write down the question in
the margin of your notes, to ask the lecturer later, or check with friends or in a textbook. It is far easier
to recall the question you wanted to ask in this way, rather than rely on remembering after the lecture
has finished (or even when you come to revise from your notes!)
AFTER THE LECTURE
The best time to review your lecture notes is immediately following the lecture, although this is not
always possible if, for example, you have to go straight to another one. However, the sooner you do
it, with the lecture still fresh in your mind, the better chance you have to produce a good set of notes
for revision.
Revising your notes does not mean writing them out neatly!
Try swapping notes with a friend, to check the accuracy/omissions of your own, and your understand-
ing of the key points.
If you feel that your notes are incomplete, or if you jotted down any questions during the lecture,
follow this up by asking your tutor, or by reading round the subject.
Transforming your lecture notes by using a different noting form can sometimes make them clearer,
e.g. a flow diagram.
Highlight key points; produce summaries for revision purposes.
Think how this topic relates to previous ones, or to other courses that you are studying, and begin to
recognize themes and relationships.
Meet with a few friends after lectures, to discuss the lecture topic and answer each others questions.
Discussion with your peers often leads to a better understanding of a subject, which in turn makes it
easier to remember. Your group could also establish a reading syndicate, whereby reading lists can be
divided between members, who each take notes on their allotted texts and give copies to the rest of the
group.
STORING YOUR NOTES
A little time spent at this stage in organizing your notes will make life much easier when you come to
revise from them some months later.
Numbering pages, making a contents page, or using dividers in your file will all make your notes more
accessible.
c 2012 Carl James Schwarz 18 December 21, 2012
CHAPTER 1. IN THE BEGINNING...
1.3 It's all Greek to me
There are several common Greek letters and special symbols that are used in Statistics. In this section, we
illustrate the common Greek letters and notation used.
Check that you can read the following symbols and small equations.
α - the Greek letter alpha
β - the Greek letter beta
λ - the Greek letter lambda
μ - the Greek letter mu (looks like a u with a tail in front)
σ - the Greek letter sigma (looks like an o with a top line)
X^2, X_2 - an X with a superscript 2 and then an X with a subscript 2
Ȳ - Y-bar, a Y with a bar across the top
Ŷ - Y-hat, a Y with a circumflex over it
x/y - a fraction x/y in vertical format
Z = (X - μ)/σ - an equation with the X - μ above the Greek letter sigma
√n - square root of n
p̂ - p-hat, a p with a circumflex over it
≠ - a not-equal sign
± - a plus/minus sign
× - a multiplication sign
Σ (i = 1 to n) of something - a summation of something with the index i ranging from 1 to n.
1.4 Which computer package?
Modern Statistics relies heavily upon computing; many would say that many modern statistical methods
would be infeasible without modern software. Rather than spending time on tedious arithmetic, or on trying
to reinvent the wheel, many people rely upon modern statistical packages.
Here are some of the most common packages in use today.
SAS. Available in Windoze and Unix flavours. Modern Macintoshes with the Intel chip can run SAS
under Windoze.[1]
One of the best packages around. SAS can handle nearly any type of data (dates, times, characters,
numbers) with many possible analyses (over 100 different base analyses are currently available) and
allows virtually arbitrary input formats and structures. SAS is extremely flexible and powerful but has
a very steep learning curve. This is the premier statistical package; virtually all statistical analyses
can be done with SAS. This is the package that I, as a Professional Statistician,[2] use regularly in my
job.
Not only does SAS have modern statistical procedures, it is also a premier database management
system. It is designed for heavy duty computing.
Refer to http://www.sas.com for more details on this package. The SAS program includes a
module SAS/INSIGHT that is virtually identical to JMP (see below).
SPSS. This is a fairly powerful package (but not nearly as broad as SAS). It is very popular with Social
Sciences researchers, but I personally prefer SAS.
Refer to http://www.spss.com for more details on this package.
JMP. JMP was originally developed by John Sall, one of the two SAS founders (who were the 68th and
138th richest people in USA/NA in 2003). He developed it originally for the Macintosh
platform and called it John's Macintosh Product, ergo the name JMP.
JMP runs on Macintosh, Linux, and Windoze platforms.
JMP is an easy-to-use and fairly powerful package. You should be able to do most things in this course
in JMP.
JMP does not have the range of procedures found in SAS, nor can it deal with as complex data structures.
However, my guesstimate is that most people can do 80% of their statistical computing using JMP.
Refer to http://www.jmp.com for more details on this package.
SYSTAT. This package has good graphical procedures and a fairly wide range of statistical procedures,
but the package is showing its age. I find SYSTAT clumsy compared to JMP and SAS, and
every time I use it, I quickly get frustrated by its limitations and clumsy operations.
A review of SYSTAT is available in
Hilbe, J. M. (2008). Systat 12.2: An overview. The American Statistician, 62, 177-178.
http://dx.doi.org/10.1198/000313008x299339
Refer to http://www.systat.com for more details.
STATA I have never used STATA but a nice review of the package is found in
Hilbe, J.M. (2005). A review of Stata 9.0. The American Statistician, 59, 335-348.
[1] There is a VERY old version of SAS that runs under older Macintoshes. This is a very old version and should not be used.
[2] The Statistical Society of Canada has undertaken a program to accredit statisticians in Canada. Visit http://www.ssc.ca for
more details. Yours truly proudly bears the title P.Stat. 007.
According to this review, Stata would be of interest to researchers in biostatistics, medical/health outcomes,
econometrics, and social science.
S-PLUS/R. S-PLUS and R are based on the S programming language. As the name implies, S-PLUS
is an extended version of S with a nice graphical interface. R is a freeware version of S with
basically the same functionality and can be freely downloaded from the WWW. It does not have
the nice graphical interface.
These packages are commonly used by statisticians when developing new statistical procedures. They
are very flexible, but require a somewhat steep learning curve.
Refer to http://www.insightful.com/ for information on S-PLUS and http://www.r-project.
org/ for information about R.
Excel. This is the standard spreadsheet program in the MS Office Suite. Excel comes with some basic
statistics but nothing too fancy. While Excel has its uses, you will quickly find that it can't handle
more complex analysis situations and gives wrong results without warning!
Except for very simple statistics, I RECOMMEND AGAINST the use of Excel to do statistical
analyses.
People are wedded to Excel for often spurious reasons:
"It's free." So is R, and you get a much superior product.
"It is easy to use." Yes, and easy to get WRONG answers without warnings.
"It has good graphs." Excel has the largest selection of BAD graphs in the world. Hardly any of
them are useful!
The following articles discuss some of the problems with Excel. They can also be accessed directly
from the web by clicking on their respective links.
J. Cryer from the University of Iowa discusses some of the problems with using Excel for ana-
lyzing data at http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf.
Yet more problems with Excel are discussed in Statistics with Excel? available at
http://www.practicalstats.com/xlsstats/excelstats.html, a copy of which
is included in these notes.
Yet more problems with Excel: Using Excel in Statistics? available at http://www.umass.
edu/acco/statistics/handout/excel.html.
An article by the Statistical Consulting Service at the University of Reading has a brief discussion
of the pros and cons of using Excel for analyzing data at http://www.rdg.ac.uk/
ssc/software/excel/home.html. Basically, the graphs presented by Excel are often
inappropriate for data presentation and you quickly run into limitations of the analysis routines
available.
How to use the basic functions of Excel for Statistics. This page also has a link to discussion
about regression in Excel. Using Excel functions in Statistics available at http://physicsnt.
clemson.edu/chriso/tutorials/excel/stats.html
Spreadsheet Addiction. Some of the problems in Excel and alternatives. Has a long bibliography
on the problems with Excel. Very nice summary document - well worth the read. Available at
http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html.
There are a number of add-ons available for Excel that seem to be reasonably priced and extend the
analyses available.
Nevertheless, the algorithms used in Excel to do the actual computations are flawed and can give
INCORRECT results without any warning that something has gone wrong!
For this reason, I generally use Excel only for simple problems - for anything more complex than a
simple mean, I reach for a package such as JMP or SAS.
Friends don't let friends do statistics in Excel!
10/27/08 10:48 Statistics With Excel
Page 1 of 5 http://www.practicalstats.com/xlsstats/excelstats.html
Is Microsoft Excel an Adequate Statistics Package?
It depends on what you want to do, but for many tasks, the answer is No.
Excel is available to many people as part of Microsoft Office. It contains some statistical functions in its
basic installation. It also comes with statistical routines in the Data Analysis Toolpak, an add-in found
separately on the Office CD. You must install the Toolpak from the CD in order to get these routines on the
Tools menu. Once installed, these routines are at the bottom of the Tools menu, in the "Data Analysis"
command. People use Excel as their everyday statistics software because they have already purchased it.
Excel's limitations, and occasionally its errors, make this a problem. Below are some of the concerns with
using Excel for statistics that are recorded in journals, on the web, and from personal experience.
Limitations of Excel
1. Many statistical methods are not available in Excel.
Excel's biggest problem. Commonly-used statistics and methods NOT available within Excel include:
* Boxplots
* p-values for the correlation coefficient
* Spearman's and Kendall's rank correlation coefficients
* 2-way ANOVA with unequal sample sizes (unbalanced data)
* Multiple comparison tests (post-hoc tests following ANOVA)
* p-values for two-way ANOVA
* Levene's test for equal variance
* Nonparametric tests, including rank-sum and Kruskal-Wallis
* Probability plots
* Scatterplot arrays or brushing
* Principal components or other multivariate methods
* GLM (generalized linear models)
* Survival analysis methods
* Regression diagnostics, such as Mallows' Cp and PRESS (it does compute adjusted r-squared)
* Durbin-Watson test for serial correlation
* LOESS smooths
Excel's lack of functionality makes it difficult to use for more than computing summary statistics and simple
univariate regression. Third-party add-ins to Excel attempt to compensate for these limitations, adding new
functionality to the program (see "A Partial Solution", below).
2. Several Excel procedures are misleading.
Probability plots are a standard way of judging the adequacy of the normality assumption in regression. In
statistics packages, residuals from the regression are easily, or in some cases automatically, plotted on a
normal probability plot. Excel's regression routine provides a Normal Probability Plot option. However, it
produces a probability plot of the Y variable, not of the residuals, as would be expected.
Excel's CONFIDENCE function computes z intervals using 1.96 for a 95% interval. This is valid only if the
population variance is known, which is never true for experimental data. Confidence intervals computed
using this function on sample data will be too small. A t-interval should be used instead.
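As a rough numeric sketch of why this matters (not part of the original article): using hypothetical numbers (n = 10 observations with sample standard deviation 35) and the standard textbook t value for 9 degrees of freedom (about 2.262), the z-based half-width is noticeably narrower than the t-based one, so the z interval overstates the precision of the estimate.

```python
import math

# Hypothetical sample: n = 10 observations with sample std dev 35.
n = 10
s = 35.0
se = s / math.sqrt(n)

z_crit = 1.96    # what Excel's CONFIDENCE function effectively uses
t_crit = 2.262   # textbook t value for 95% confidence, df = n - 1 = 9

z_half_width = z_crit * se   # half-width of the (too narrow) z interval
t_half_width = t_crit * se   # half-width of the correct t interval

print(round(z_half_width, 1), round(t_half_width, 1))  # 21.7 25.0
```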
Excel is inconsistent in the type of P-values it returns. For most functions of probabilities, Excel acts like a
lookup table in a textbook, and returns one-sided p-values. But in the TINV function, Excel returns a 2-
sided p-value. Look carefully at the documentation of any Excel function you use, to be certain you are
getting what you want.
Tables of standard distributions such as the normal and t distributions return p-values for tests, or are used
to construct confidence intervals. With Excel, the user must be careful about what is being returned. To compute a
95% t confidence interval around the mean, for example, the standard method is to look up the t-statistic in
a textbook by entering the table at a value of alpha/2, or 0.025. This t-statistic is multiplied by the standard
error to produce the length of the t-interval on each side of the mean. Half of the error (alpha/2) falls on
each side of the mean. In Excel the TINV function is entered using the value of alpha, not alpha/2, to return
the same number.
For a one-sided t interval at alpha=0.05, standard practice would be to look up the t-statistic in a textbook
for alpha=0.05. In Excel, the TINV function must be called using a value of 2*alpha, or 0.10, to get the
value for alpha=0.05. This nonstandard entry point has led several reviewers to state that Excel's distribution
functions are incorrect. If not incorrect, they are certainly nonstandard. Make sure you read the help menu
descriptions carefully to know what each function produces.
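The standard alpha/2 lookup convention the article describes can be illustrated with Python's standard library. (The stdlib has only the normal distribution, not t, so the normal quantile is used here as a stand-in for the same convention.)

```python
from statistics import NormalDist

alpha = 0.05

# Standard convention: for a two-sided 95% interval, enter the table at
# 1 - alpha/2 = 0.975.  For the normal distribution this gives the
# familiar 1.96; Excel's TINV, by contrast, expects alpha itself.
z = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z, 2))  # 1.96
```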
3. Distributions are not computed with precision.
NEW In reference (1), the authors show that all problems found in Excel 97 are still there in Excel 2000 and
XP. They say that "Microsoft attempted to fix errors in the standard normal random number generator and
the inverse normal function, and in the former case actually made the problem worse." From this, you can
assume that the problems listed below are still there in the current versions of the software.
Statistical distributions used by Excel do not agree with better algorithms for those distributions at the third
digit and beyond. So they are approximately correct, but not as exact as would be desired by an exacting
statistician. This may not be harmful for hypothesis tests unless the third digit is of concern (a p-value of
0.056 versus 0.057). It is of most concern when constructing intervals (multiplying a std dev of 35 times
1.96 gives 68.6; times 1.97 gives 69.0). As summarized in reference 2:
"the statistical distributions of Excel already have been assessed by Knusel (1998), to which we refer the
interested reader. He found numerous defects in the various algorithms used to compute several
distributions, including the Normal, Chi-square, F and t, and summarized his results concisely: So one has
to warn statisticians against using Excel functions for scientific purposes. The performance of Excel in this
area can be judged unsatisfactory."
4. Routines for handling missing data were incorrect.
This was the largest error in Excel, but a 'band-aid' has been added in Office 2000. In earlier versions of
Excel, computations and tests were flat out wrong when some of the data cells contained missing values,
even for simple summary statistics. See (3) , (5), and page 4 of (6). Error messages are now displayed in
Excel 2000 when there are missing values, and no result is given. Although this is still inferior to computing
correct results it is somewhat of an improvement.
In reference to pre-2000,
"Excel does not calculate the paired t-test correctly when some observations have one of the measurements
but not the other." E. Goldwater, ref. (5)
5. Regression routines are incorrect for multicollinear data.
This affects multiple regression. A good statistics package will report errors due to correlations among the X
variables. The Variance Inflation Factor (VIF) is one measure of collinearity. Excel does not compute
collinearity measures, does not warn the user when collinearity is present, and reports parameter estimates
that may be nonsensical. See (6) for an example on data from an experiment. Are multicollinear data of
concern in practical problems? I think so -- I find many examples of collinearity in environmental data
sets.
Excel also requires the X variables to be in contiguous columns in order to input them to the procedure. This
can be done with cut and paste, but is certainly annoying if many multiple regression models are to be built.
6. Ranks of tied data are computed incorrectly.
When ranking data, standard practice is to assign tied ranks to tied observations. The value of these ranks
should equal the median of the ranks that the observations would have had, if they had not been tied. For
example, three observations tied at a value of 14 would have had the ranks of 7, 8 and 9 had they not been
tied. Each of the three values should be assigned the rank of 8, the median of 7, 8 and 9.
Excel assigns the lowest of the three ranks to all three observations, giving each a rank of 7. This would
result in problems if Excel computed rank-based tests. Perhaps it is fortunate none are available.
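The median-of-ranks rule described above can be sketched in a few lines of Python (a hypothetical helper written for illustration, not anything Excel or the article provides):

```python
from statistics import median

def ranks_with_ties(values):
    """Rank data, giving each set of tied observations the median of the
    ranks they would have had if untied (standard practice, as described
    above).  Excel instead assigns the minimum rank to all ties."""
    sorted_vals = sorted(values)
    ranks = []
    for v in values:
        # 1-based positions this value occupies in the sorted order
        positions = [i + 1 for i, sv in enumerate(sorted_vals) if sv == v]
        ranks.append(median(positions))
    return ranks

# The article's example: three values tied at 14 would have had ranks
# 7, 8 and 9, so each receives the median rank, 8.
data = [2, 4, 5, 9, 11, 12, 14, 14, 14]
print(ranks_with_ties(data))  # [1, 2, 3, 4, 5, 6, 8, 8, 8]
```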
7. Many of Excel's charts violate standards of good graphics.
Use of perspective and glitz (donut charts?) violate basic principles of graphics. Excel's charts are more
suitable to USA Today than to scientific reports. This bothers some people more than others.
"Good graphs should.[a list of traits]However, Excel meets virtually none of these criteria. The vast
majority of chart types produced by Excel should never be used!" -- Jon Cryer, ref (3).
"Microsoft Excel is an example of a package that does not allow enough user control to consistently make
readable and concise graphs from tables."
- A. Gelman et al., 2002, The American Statistician 56, p.123.
A partial solution:
Some of these difficulties (parts of 1,2, 6 and 7) can be overcome by using a good set of add-in routines.
One of the best is StatPlus, which comes with an excellent textbook, "Data Analysis with Microsoft Excel".
With StatPlus, Excel becomes an adequate statistical tool, though still not in the areas of multiple regression
and ANOVA for more than one factor. Without this add-in Excel is inadequate for anything beyond basic
summary statistics and simple regression.
Data Analysis with Microsoft Excel by Berk and Carey
published by Duxbury (2000).
Opinion: Get this book if you're going to use Excel for statistics.
(I have no connection with the authors of StatPlus and get no benefit from this recommendation. I'm just a
satisfied user.)
Some advice from others:
"If you need to perform analysis of variance, avoid using Excel, unless you are dealing with extremely
simple problems."
- Statistical Services Centre, Univ. of Reading, U.K. (at A, below)
"Enterprises should advise their scientists and professional statisticians not to use Microsoft Excel for
substantive statistical analysis. Instead, enterprises should look for professional statistical analysis software
certified to pass the (NIST) Statistical Reference Datasets tests to their users' required level of accuracy."
- The Gartner Group
References:
1) On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP
B.D. McCullough and B. Wilson, (2002), Computational Statistics & Data Analysis, 40, pp 713 - 721
(2) On the accuracy of statistical procedures in Microsoft Excel 97
B.D. McCullough and B. Wilson, (1999), Computational Statistics & Data Analysis, 31, pp 27-37
(3) Problems with using Microsoft Excel for statistics [pdf Download]
J.D. Cryer, (2001), presented at the Joint Statistical Meetings, American Statistical Association, 2001,
Atlanta Georgia
[pdf download]
(4) Use of Excel for statistical analysis
Neil Cox, (2000), AgResearch Ruakura
(5) Using Excel for statistical data analysis
Eva Goldwater, (1999), Univ. of Massachusetts Office of Information Technology
[pdf download]
(6) Statistical analysis using Microsoft Excel [pdf download]
Jeffrey Simonoff, (2002)
[pdf download]
(7) Spreadsheet addiction
Patrick Burns
(8) On the Accuracy of Statistical Distributions in Microsoft Excel 97
Leo Knuesel
[pdf download]
(9) Statistical flaws in Excel
Hans Pottel
[pdf download]
Guides to Excel on the web:
(A) A Beginner's Guide to Excel - Univ. of Reading, UK
(B) An Intermediate Guide to Excel - Univ. of Reading, UK
Note: All opinions other than those cited as coming from others are my own.
1.5 FAQ - Frequently Asked Questions
1.5.1 Accessing journal articles from home
I make reference to several journal articles in these notes. These are often available in e-journals so the
link should take you there directly IF you are authorized to access this journal. For example, if you try
and access these articles from a computer with an SFU IP address, you should likely be granted permission
without problems.
However, if you are trying to access these from home, you must go through the SFU library site and ac-
cess the e-journal via the catalogue. You will then be prompted to enter your SFU ACS userid and password
to grant you access to this journal.
1.5.2 Downloading from the web
Whenever I try to download an Excel file, it seems to be corrupted and can't be opened.
Throughout the notes, reference is made to spreadsheets or SAS programs available at my web site.
The SAS programs and listings are simple text files and should transfer to your computer without much
problem.
If you are trying to download an Excel spreadsheet, be sure to specify that the file should be transferred as
a source document rather than as text. If you transfer the sheets as text, you will find that the data are
corrupted.
1.5.3 Printing 2 pages per physical page and on both sides of the paper
The notes look as if I could print 2 per page. Is this possible, and can I print on both sides of the
paper?
Yes, it is possible to print two logical pages per physical page - the text is a bit small, but still readable.
On a Macintosh System with a recent OS, when you select Print, it presents the standard print options
menu. Under the popdown menu is a Layout option. Select 2 logical pages per physical page. This will
work with ALL applications that use the standard print dialogue.
I'm not familiar enough with Windoze machines to offer any advice.
To print on both sides of the paper, you need a printer capable of duplex printing, i.e. on both sides of the
paper. I believe that most printers in the public areas of campus are capable of this. You will have to consult
your own printer manual if you are printing at home. Otherwise, you have to print first the odd pages, then
take the paper, reverse it, and print the even pages - a recipe for disaster.
1.5.4 Is there an on-line textbook?
Are there any online textbooks in statistics?
Yes, there are several - it is easiest to search the web using google. Beware that some of the advice on
the web may be less than perfect.
StatSoft has a highly regarded statistical online text book at http://www.statsoft.com/textbook/
stathome.html.
Chapter 2
Introduction to Statistics
Statistics was spawned by the information age, and has been defined as the science of extracting information
from data. Technological developments have demanded methodology for the efficient extraction of reliable
statistics from complex databases. As a result, Statistics has become one of the most pervasive of all
disciplines.
Theoretical statisticians are largely concerned with developing methods for solving the problems involved
in such a process, for example, finding new methods for analyzing (making sense of) types of data
that existing methods cannot handle. Applied statisticians collaborate with specialists in other fields in
applying existing methodologies to real world problems. In fact, most statisticians are involved in both of these
activities to a greater or lesser extent, and researchers in most quantitative fields of enquiry spend a great
deal of their time doing applied statistics.
The public and private sector rely on statistical information for such purposes as decision making, regu-
lation, control and planning.
Ordinary citizens are exposed to many statistics on a daily basis. For example:
In a poll of 1089 Canadians, 47% were in favor of the constitution accord. This result is accurate to
within 3 percentage points, 19 times out of 20.
The seasonally adjusted unemployment rate in Canada was 9.3%.
Two out of three dentists recommend Crest.
What does this all mean?
Our goal is not to make each student a professional statistician, but rather to give each student a subset
of tools with which they can confidently approach many real world problems and make sense of the numbers.
2.1 TRRGET - An overview of statistical inference
Section summary:
1. Distinguish between a population and a sample
2. Why it is important to choose a probability sample
3. Distinguish among the roles of randomization, replication, and blocking
4. Distinguish between an estimate or a statistic and the parameter of interest.
Most studies can be broadly classied into either surveys or experiments.
In surveys, the researcher is typically interested in describing some population - there is usually no
attempt to manipulate units within the population. In experiments, units from the population are manipulated
in some fashion and a response to the manipulation is observed.
There are four broad phases to the survey or the experiment. These phases define the paradigm of
Statistical Inference. These phases will be illustrated in the context of a political poll of Canadians on some
issue as illustrated in the following diagram.
The four phases are:
1. What is the population of interest and what is the parameter of interest? This formulates the research
question - what is being measured and what is of interest.
In this case, the population of interest is likely all eligible voters in Canada and the parameter of
interest is the proportion of all eligible voters in favor of the accord.
It is conceivable, but certainly impractical, that every eligible voter could be contacted and their opin-
ion recorded. You would then know the value of the parameter exactly and there would be no need
to do any statistics. However, in most real world situations, it is impossible or infeasible to measure
every unit in the population.
Consequently, a sample is taken.
2. Selecting a sample
We would like our sample to be as representative as possible - how is this achieved? We would like
our answer from our sample to be as precise as possible - how is this achieved? And, we may like to
modify our sample selection method to take into account known division of the population - how is
this achieved?
Three fundamental principles of Statistics are randomization, replication and blocking.
Randomization This is the most important aspect of experimental design and surveys. Randomization
makes the sample representative of the population by ensuring that, on average, the
sample contains population units in about the same proportions, for any variable, as
found in the population.
If an experiment is not randomized or a survey is not randomly collected, it rarely (if ever)
provides useful information.
Many people confuse random with haphazard. The latter only means that the sample was
collected without a plan or thought to ensure that the sample obtained is representative of the
population. A truly random sample takes a surprising amount of effort to collect!
E.g. The Gallup poll uses random digit dialing to select at random from all households in Canada
with a phone. Is this representative of the entire voting population? How does the Gallup Poll
account for the different patterns of telephone use among genders within a household?
A simple random sample is an example of an equal probability sample where every unit in the
population has an equal chance of being selected for the sample. As you will see later in the
notes, the assumption of equal probability of selection is not crucial. What is crucial is that every
unit in the population have a known probability of selection, but this probability could vary
among units. For example, you may decide to sample males with a higher probability than
females.
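The distinction between equal- and unequal-probability selection can be sketched with Python's random module (a hypothetical population of 100 labelled units; note that random.choices samples with replacement, which is a simplification of real unequal-probability survey designs):

```python
import random

random.seed(1)  # for reproducibility
population = list(range(1, 101))  # hypothetical population of 100 units

# Simple random sample: every unit has the same chance of selection,
# drawn without replacement.
srs = random.sample(population, 10)

# Unequal-probability selection: units have known but different weights
# (here, later units are deliberately favoured).  random.choices samples
# WITH replacement, a simplification of real unequal-probability designs.
weighted = random.choices(population, weights=population, k=10)

print(len(srs), len(weighted))  # 10 10
```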
Replication = Sample Size This ensures that the results from the experiment or the survey will be
precise enough to be of use. A large sample size does not imply that the sample is representative
- only randomization ensures representativeness.
Do not confuse replication with repeating the survey a second time.
In this example, the Gallup poll interviews about 1100 Canadians. It chooses this number of
people to get a certain precision in the results.
Blocking (or stratification) In some experiments or surveys, the researcher knows of a variable that
strongly influences the response. In the context of this example, there is a strong relationship
between the region of the country and the response.
Consequently, precision can be improved by first blocking or stratifying the population into more
homogeneous groups. Then a separate randomized survey is done in each and every stratum and
the results are combined together at the end.
In this example, the Gallup poll often stratifies the survey by region of Canada. Within each
region of Canada, a separate randomized survey is performed and the results are then combined
appropriately at the end.
3. Data Analysis
Once the survey design is finalized and the survey is conducted, you will have a mass of information
- statistics - collected from the population. This must be checked for errors, transcribed usually into
machine readable form, and summarized.
The analysis is dependent upon BOTH the data collected (the sample) and the way the data was
collected (the sample selection process). For example, if the data were collected using a stratified
sampling design, it must be analyzed using the methods for stratified designs - you can't simply
pretend after the fact that the data were collected using a simple random sampling design.
We will emphasize this point continually in this course - you must match the analysis with the design!
For example, consider a Gallup Poll where 511 out of 1089 Canadians interviewed were in favor of
an issue. Then our statistics is that 47% of our sample respondents were in favor.
4. Inference back to the Population
Despite an enormous amount of money spent collecting the data, interest really lies in the population,
not the sample. The sample is merely a device to gather information about the population.
How should the information from the sample, be used to make inferences about the population?
Graphing A good graph is always preferable to a table of numbers or to numerical statistics. A graph
should be clear, relevant, and informative. Beware of graphs that try to mislead by design or
accident through misleading scales, chart junk, or three dimensional effects.
There are a number of good books on effective statistical graphics - these should be consulted for further
information.[1] Unfortunately, many people rely upon the graphical tools available in spreadsheet
software such as Excel which invariably leads to poor graphs. As a rule of thumb, Excel
has the largest collection of bad graphical designs available in the free world! You may enjoy
the article Using Microsoft Excel to obscure your data and annoy your readers available at
http://www.biostat.wisc.edu/~kbroman/presentations/graphs_uwpath08_
handout.pdf.
Estimation The number obtained from our sample is an estimate of the true, unknown, value of the
population parameter. How precise is our estimate? Are we within 10 percentage points of the
correct answer?
A good survey or experiment will report a measure of precision for any estimate.
In this example, 511 of 1089 people were in favor of the accord. Our estimate of the proportion
of all Canadian voters in favor of the accord is 511/1089 = 47%. These results are accurate to
¹ A perfect thesis defense would be to place a graph of your results on the overhead and then sit down to thunderous applause!
© 2012 Carl James Schwarz 33 December 21, 2012
CHAPTER 2. INTRODUCTION TO STATISTICS
within 3 percentage points, 19 times out of 20, which implies that we are reasonably confident
that the true proportion of voters in favor of the accord is between 47% - 3% = 44% and
47% + 3% = 50%.
Technically, this is known as a 95% confidence interval - the details of which will be explored
later in this chapter.
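The "3 percentage points, 19 times out of 20" figure can be reproduced from the usual formula for the standard error of a sample proportion, se = √(p̂(1 − p̂)/n). Here is a quick illustrative sketch in Python (the notes use SAS; this is only a back-of-the-envelope check):

```python
import math

# Gallup example: 511 of 1089 respondents in favor
n = 1089
p_hat = 511 / n                       # estimated proportion in favor

# standard error of a sample proportion under simple random sampling
se = math.sqrt(p_hat * (1 - p_hat) / n)

# "19 times out of 20" = 95% confidence, multiplier approximately 1.96
margin = 1.96 * se

print(round(p_hat, 3))   # 0.469
print(round(margin, 3))  # 0.03, i.e. about 3 percentage points
```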
(Hypothesis) Testing Suppose that in last months poll (conducted in a similar fashion), only 42% of
voters were in favor. Has the support increased? Because each percentage value is accurate to
about 3 percentage points, it is possible that in fact there has been no change in support!
It is possible to make a more formal test of the hypothesis of no change. Again, this will be
explored in more detail later in this chapter.
2.2 Parameters, Statistics, Standard Deviations, and Standard Errors
Section summary:
1. Distinguish between a parameter and a statistic
2. What does a standard deviation measure?
3. What does a standard error measure?
4. How are estimated standard errors determined (in general)?
2.2.1 A review
DDT is a very persistent pesticide. Once applied, it remains in the environment for many years and tends
to accumulate up the food chain. For example, birds which eat rodents which eat insects which ingest DDT
contaminated plants can have very high levels of DDT and this can interfere with reproduction. [This is
similar to what is happening in the Great Lakes where herring gulls have very high levels of pesticides
or what is happening in the St. Lawrence River where resident beluga whales have such high levels of
contaminants, they are considered hazardous waste if they die and wash up on shore.] DDT has been banned
in Canada for several years, and scientists are measuring the DDT levels in wildlife to see how quickly it is
declining.
The Science of Statistics is all about measurement and variation. If there was no variation, there would be
no need for statistical methods. For example, consider a survey to measure DDT levels in gulls on Triangle
Island off the coast of British Columbia, Canada. If all the gulls on Triangle Island had exactly the same
DDT level, then it would suffice to select a single gull from the island and measure its DDT level.
Alas, the DDT level can vary by the age of the gull, by where it feeds and a host of other unknown and
uncontrollable variables. Consequently the average DDT level over ALL gulls on Triangle Island seems like
a sensible measure of the pesticide load in the population. We recognize that some gulls may have levels
above this average, some gulls below this average, but feel that the average DDT level is indicative of the
health of the population, and that changes in the population mean (e.g. a decline) are an indication of an
improvement.
Population mean and population standard deviation. Conceptually, we can envision a listing of the
DDT levels of each and every gull on Triangle Island. From this listing, we could conceivably compute the
true population average and compute the (population) standard deviation of the DDT levels. [Of course in
practice these are unknown and unknowable.] Statistics often uses Greek symbols to represent the theoretical
values of population parameters. In this case, the population mean is denoted by the Greek letter mu (μ) and
the population standard deviation by the Greek letter sigma (σ). The population standard deviation measures
the variation of individual measurements about the mean in the population.
In this example, μ would represent the average DDT over all gulls on the island, and σ would represent
the variation of values around the population mean. Both of these values are unknown.
Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels
(ppm).
100, 105, 97, 103, 96, 106, 102, 97, 99, 103.
The data is available in the ddt.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual fashion:

data ddt;
   infile 'ddt.csv' dlm=',' dsd missover firstobs=2;
   input ddt;
run;
The raw data are shown below:
Obs ddt
1 100
2 105
3 97
4 103
5 96
6 106
7 102
8 97
9 99
10 103
Sample mean and sample standard deviation The sample average and sample standard deviation could
be computed from these values using a spreadsheet, calculator, or a statistical package.
Proc Univariate provides basic plots and summary statistics in the SAS system.
ods graphics on;
proc univariate data=ddt plots cibasic;
var ddt;
histogram ddt /normal;
qqplot ddt /normal;
ods output Moments=DDTmoments;
ods output BasicIntervals=DDTci;
run;
ods graphics off;
This gives the following plots and summary table:
VarName Statistic Value Statistic Value
ddt N 10.000000 Sum Weights 10.000000
ddt Mean 100.800000 Sum Observations 1008.000000
ddt Std Deviation 3.521363 Variance 12.400000
ddt Skewness 0.035116 Kurtosis -1.429377
ddt Uncorrected SS 101718 Corrected SS 111.600000
ddt Coeff Variation 3.493416 Std Error Mean 1.113553
A different notation is used to represent sample quantities to distinguish them from population parameters.
In this case the sample mean, denoted Ȳ and pronounced "Y-bar", has the value of 100.8 ppm, and the
sample standard deviation, denoted using the letter s, has the value of 3.52 ppm. The sample mean is a
measure of the middle of the sample data and the sample standard deviation measures the variation of the
sample data around the sample mean.
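These summary statistics can be verified with any package or a few lines of code. A minimal sketch in Python (the notes use SAS; this is only an illustrative cross-check of the output above):

```python
import math
import statistics

# the 10 DDT measurements (ppm) from the Triangle Island gulls
ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

y_bar = statistics.mean(ddt)      # sample mean
s = statistics.stdev(ddt)         # sample standard deviation (n-1 divisor)
se = s / math.sqrt(len(ddt))      # estimated standard error of the mean

print(round(y_bar, 1))  # 100.8
print(round(s, 2))      # 3.52
print(round(se, 4))     # 1.1136
```

The values match the SAS output above (Mean 100.8, Std Deviation 3.521363, Std Error Mean 1.113553).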
What would happen if a different sample of 10 gulls was selected? It seems reasonable that the sample
mean and sample standard deviation would also change among samples, and we hope that if our sample is
large enough, that the change in the statistics would not be that large.
Here is the data from an additional 8 samples, each of size 10:
Sample
Set DDT levels in the gulls mean std
1 102 102 103 95 105 97 95 104 98 103 100.4 3.8
2 100 103 99 98 95 98 94 100 90 103 98.0 4.1
3 101 96 106 102 104 95 98 103 108 104 101.7 4.2
4 101 100 99 90 102 99 105 92 100 102 99.0 4.6
5 107 98 101 100 100 98 107 99 104 98 101.2 3.6
6 102 102 101 101 92 94 104 100 101 97 99.4 3.8
7 94 101 100 100 96 101 100 98 94 98 98.2 2.7
8 104 102 97 104 97 99 100 100 109 102 101.4 3.7
Note that the statistics (Ȳ - the sample mean, and s - the sample standard deviation) change from sample
to sample. This is not unexpected as it is highly unlikely that two different samples would give identical results.
What does the variation in the sample mean over repeated samples from the same population tell us? For
example, based on the values of the sample mean above, could the true population mean DDT over all gulls
be 150 ppm? Could it be 120 ppm? Could it be 101 ppm? Why?
If more and more samples were taken, you would end up with a large number of sample means. A
histogram of the sample means over the repeated samples could be drawn. This would be known as the
sampling distribution of the sample mean.
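This hypothetical process of repeated sampling can be mimicked by simulation. In the sketch below (Python, for illustration only), the population of gull DDT levels is assumed, purely for demonstration, to be Normal with mean 100.8 and standard deviation 3.5, roughly matching the observed sample:

```python
import random
import statistics

random.seed(1)

# assumed population, for illustration only
POP_MEAN, POP_SD, N = 100.8, 3.5, 10

# draw many repeated samples of size 10 and record each sample mean
means = []
for _ in range(10000):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    means.append(statistics.mean(sample))

# the spread of the sample means over repeated samples is the
# (empirical) sampling distribution of the sample mean
print(round(statistics.mean(means), 1))   # close to 100.8
print(round(statistics.stdev(means), 2))  # close to 3.5/sqrt(10), about 1.11
```

A histogram of `means` would display the sampling distribution described in the text.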
The latter result is a key concept of statistical inference and is quite abstract because, in practice, you
never see the sampling distribution. The distribution of individual values over the entire population can be
visualized; the distribution of individual values in the particular sample can be examined directly as you
have actual data; the hypothetical distribution of a statistic over repeated samples from the population is
always present, but remains one level of abstraction away from the actual data.
Because the sample mean varies from sample to sample, it is theoretically possible to compute a standard
deviation of the statistic as it varies over all possible samples drawn from the population. This is known as
the standard error (abbreviated SE) of the statistic (in this case it would be the standard error of the sample
mean).
Because we have repeated samples in the gull example, we can compute the actual standard deviation
of the sample mean over the 9 replicates (the original sample, plus the additional 8 samples). This gives an
estimated standard error of 1.40 ppm. This measures the variability of the statistic (Y ) over repeated samples
from the same population.
But - unless you take repeated samples from the same population, how can the standard error ever be
determined?
Now statistical theory comes into play. Every statistic varies over repeated samples. In some cases, it
is possible to derive from statistical theory how much the statistic will vary from sample to sample. In the
case of the sample mean for a sample selected using a simple random sample² from any population, the se
is theoretically equal to:

se(Ȳ) = σ/√n
Note that every statistic will have a different theoretical formula for its standard error and the formula
will change depending on how the sample was selected.
But this theoretical standard error depends upon an unknown quantity (the theoretical population standard
deviation σ). It seems sensible to estimate the standard error by replacing the value of σ by an estimate
- the sample standard deviation s. This gives:

Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm.
This number is an estimate of the variability of Ȳ in repeated samples of the same size selected at random
from the same population.
SAS reports the se in the lower right corner of the summary statistics seen earlier.
A Summary of the crucial points:
Parameter The parameter is a numerical measure of the entire population. Two common parameters
are the population mean (denoted by μ) and the population standard deviation (denoted by σ).
The population standard deviation measures the variation of individual values over all units in the
population. Parameters always refer to the population, never to the sample.
Statistic or Estimate: A statistic or an estimate is a numerical quantity computed from the SAMPLE.
This is only a guess as to the true value of the population parameter. If you took a new sample, your
estimate computed from the second sample would be different than the value computed from the first
sample. Two common statistics are the sample mean (denoted Ȳ), and the sample standard deviation
(denoted s). The sample standard deviation measures the variation of individual values over the units
in the sample. Statistics always refer to the sample, never to the population.
² A simple random sample implies that every unit in the population has the same chance of being selected and that every unit is selected independently of every other unit.
Sampling distribution Any statistic or estimate will change if a new sample is taken. The distribution
of the statistic or estimate over repeated samples from the same population is known as the sampling
distribution.
Theoretical Standard error: The variability of the estimate over all possible repeated samples from
the population is measured by the standard error of the estimate. This is a theoretical quantity and
could only be computed if you actually took all possible samples from the population.
Estimated standard error Now for the hard part - you typically only take a single sample from
the population. But, based upon statistical theory, you know the form of the theoretical standard
error, so you can use information from the sample to estimate the theoretical standard error. Be
careful to distinguish between the standard deviation of individual values in your sample and the
estimated standard error of the statistic as they refer to different types of variation. The formula for
the estimated standard error is different for every statistic and also depends upon the way the sample
was selected. Consequently it is vitally important that the method of sample selection and the
type of estimate computed be determined carefully before using a computer package to blindly
compute standard errors.
The concept of a standard error is the MOST DIFFICULT CONCEPT to grasp in statistics. The reason
that it is so difficult is that there is an extra layer of abstraction between what you observe and what is really
happening. It is easy to visualize variation of individual elements in a sample because the values are there
for you to see. It is easy to visualize variation of individual elements in a population because you can picture
the set of individual units. But it is difficult to visualize the set of all possible samples because typically you
only take a single sample, and the set of all possible samples is so large.
As a final note, please do NOT use the ± notation for standard errors. The problem is that the ± notation
is ambiguous: different papers in the same journal, and different parts of the same paper, use the notation
for different meanings. Modern usage is to write phrases such as "the estimated mean DDT level was 100.8
(SE 1.1) ppm".
2.2.2 Theoretical example of a sampling distribution
Here is a more detailed examination of a sampling distribution where the actual set of all possible samples
can be constructed. It shows that the sample mean is unbiased and that its standard error computed from all
possible samples matches that derived from statistical theory.
Suppose that a population consisted of five mice and we wish to estimate the average weight based on
a sample of size 2. [Obviously, the example is hopelessly simplified compared to a real population and
sampling experiment!]
Normally, the population values would not be known in advance (because then why would you have to
take a sample?). But suppose that the five mice had weights (in grams) of:
33, 28, 45, 43, 47.
The population mean weight and population standard deviation are found as:

μ = (33 + 28 + 45 + 43 + 47)/5 = 39.20 g, and
σ = 7.39 g.
The population mean is the average weight over all possible units in the population. The population standard
deviation measures the variation of individual weights about the mean, over the population units.
Now there are 10 possible samples of size two from this population. For each possible sample, the
sample mean and sample standard deviation are computed as shown in the following table.
Sample units   Sample mean (Ȳ)   Sample std dev (s)
33 28 30.50 3.54
33 45 39.00 8.49
33 43 38.00 7.07
33 47 40.00 9.90
28 45 36.50 12.02
28 43 35.50 10.61
28 47 37.50 13.44
45 43 44.00 1.41
45 47 46.00 1.41
43 47 45.00 2.83
Average 39.20 7.07
Std dev 4.52 4.27
This table illustrates the following:
this is a theoretical table of all possible samples of size 2. Consequently it shows the actual sampling
distribution for the statistics Ȳ and s. The sampling distribution of Ȳ refers to the variation of Ȳ over
all the possible samples from the population. Similarly, the sampling distribution of s refers to the
variation of s over all possible samples from the population.
some values of Ȳ are above the population mean, and some values of Ȳ are below the population
mean. We don't know for any single sample if we are above or below the true value of the population
parameter. Similarly, the value of s (the sample standard deviation) also varies above and below
the population standard deviation.
the average (expected) value of Ȳ over all possible samples is equal to the population mean. We say
such estimators are unbiased. This is the hard concept! The extra level of abstraction is here - the
statistic computed from an individual sample has a distribution over all possible samples, hence the
sampling distribution.
the average (expected) value of s over all possible samples is NOT equal to the population standard
deviation. We say that s is a biased estimator. This is a difficult concept - you are taking the average of
an estimate of the standard deviation. The average is taken over possible values of s from all possible
samples. The latter is an extra level of abstraction from the raw data. [There is nothing theoretically
wrong with using a biased estimator, but most people would prefer to use an unbiased estimator. It
turns out that the bias in s decreases very rapidly with sample size and so is not a concern.]
the standard deviation of Ȳ refers to the variation of Ȳ over all possible samples. We call this the
standard error of a statistic. [The term comes from an historical context that is not important at
this point.] Do not confuse the standard error of a statistic with the sample standard deviation or the
population standard deviation. The standard error measures the variability of a statistic (e.g. Ȳ) over
all possible samples. The sample standard deviation measures variability of individual units in the
sample. The population standard deviation measures variability of individual units in the population.
if the previous formula for the theoretical standard error was used in this example, it would fail to give
the correct answer, i.e.

se(Ȳ) = 4.52 ≠ σ/√n = 7.39/√2 = 5.22

The reason that this formula didn't work is that the sample size was an appreciable fraction of the
entire population. A finite population correction needs to be applied in these cases. As you will see
in later chapters, the se in this case is computed as:

se(Ȳ) = (σ/√n) × √((1 − f) × N/(N − 1)) = (7.39/√2) × √((1 − 2/5) × 5/4) = 4.52
Refer to the chapter on survey sampling for more details.
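Because there are only 10 possible samples, the entire sampling distribution can be enumerated by brute force. The following Python sketch (illustrative only; the notes use SAS) reproduces the summary lines of the table, confirming that Ȳ is unbiased and that the finite-population-corrected formula gives the correct standard error:

```python
import itertools
import math
import statistics

weights = [33, 28, 45, 43, 47]   # the five mice: the entire population
N, n = len(weights), 2

# all 10 possible samples of size 2, and their sample means
all_means = [statistics.mean(s) for s in itertools.combinations(weights, n)]

# the average of Y-bar over all possible samples equals the
# population mean, i.e. Y-bar is unbiased
print(statistics.mean(all_means))              # 39.2

# the SE of Y-bar is the standard deviation of the complete sampling
# distribution (all samples are listed, so use the population form)
print(round(statistics.pstdev(all_means), 2))  # 4.52

# this matches the finite population correction formula
sigma = statistics.pstdev(weights)             # population sd = 7.39
f = n / N                                      # sampling fraction = 2/5
se = sigma / math.sqrt(n) * math.sqrt((1 - f) * N / (N - 1))
print(round(se, 2))                            # 4.52
```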
2.3 Confidence Intervals
Section summary:
1. Understand the general logic of why a confidence interval works.
2. How to graph a confidence interval for a single parameter.
3. How to interpret graphs of several confidence intervals.
4. Effect of sample size upon the size of a confidence interval.
5. Effect of variability upon the size of a confidence interval.
6. Effect of confidence level upon the size of a confidence interval.
2.3.1 A review
The basic premise of statistics is that every unit in a population cannot be measured; consequently, a sample
is taken. But the statistics from a sample will vary from sample to sample and it is highly unlikely that the
value of the statistic will equal the true, unknown value of the population parameter.
Confidence intervals are a way to express the level of certainty about the true population parameter value
based upon the sample selected. The formulae for the various confidence intervals depend upon the statistic
used and how the sample was selected, but are all derived from a general unified theory.
The following concepts are crucial and will be used over and over again in what follows:
Estimate: The estimate is the quantity computed from the SAMPLE. This is only a guess as to the true
value of the population parameter. If you took a new sample, your estimate computed from the second
sample would be different than the value computed from the first sample. It seems reasonable that
if you select your sample carefully, these estimates will sometimes be lower than the theoretical
population parameter and sometimes higher.
Standard error: The variability of the estimate over repeated samples from the population is measured
by the standard error of the estimate. It again seems reasonable that if you select your sample
carefully, the statistics should be close to the true population parameters and that the standard
error should provide some information about the closeness of the estimate to the true population
parameter.
Refer back to the DDT example considered in the last section. Scientists took a random sample of gulls
from Triangle Island (off the coast of Vancouver Island, British Columbia) and measured the DDT levels in
10 gulls. The following values were obtained (ppm):
100, 105, 97, 103, 96, 106, 102, 97, 99, 103.
What does the sample tell us about the true population average DDT level over all gulls on Triangle Island?
We again use Proc Univariate to compute summary statistics in the SAS system.
ods graphics on;
proc univariate data=ddt plots cibasic;
var ddt;
histogram ddt /normal;
qqplot ddt /normal;
ods output Moments=DDTmoments;
ods output BasicIntervals=DDTci;
run;
ods graphics off;
VarName Statistic Value Statistic Value
ddt N 10.000000 Sum Weights 10.000000
ddt Mean 100.800000 Sum Observations 1008.000000
ddt Std Deviation 3.521363 Variance 12.400000
ddt Skewness 0.035116 Kurtosis -1.429377
ddt Uncorrected SS 101718 Corrected SS 111.600000
ddt Coeff Variation 3.493416 Std Error Mean 1.113553
The sample mean, Ȳ = 100.8 ppm, measures the middle of the sample data and the sample standard
deviation, s = 3.52 ppm, measures the spread of the sample data around the sample mean.
Based on this sample information, is it plausible to believe that the average DDT level over ALL gulls
could be as high as 150 ppm? Could it be as low as 50 ppm? Is it plausible that it could be as high as 110
ppm? As high as 101 ppm?
Suppose you had the information from the other 8 samples.
Sample
Set DDT levels in the gulls mean std
1 102 102 103 95 105 97 95 104 98 103 100.4 3.8
2 100 103 99 98 95 98 94 100 90 103 98.0 4.1
3 101 96 106 102 104 95 98 103 108 104 101.7 4.2
4 101 100 99 90 102 99 105 92 100 102 99.0 4.6
5 107 98 101 100 100 98 107 99 104 98 101.2 3.6
6 102 102 101 101 92 94 104 100 101 97 99.4 3.8
7 94 101 100 100 96 101 100 98 94 98 98.2 2.7
8 104 102 97 104 97 99 100 100 109 102 101.4 3.7
Based on this new information, what would you believe to be a plausible value for the true population
mean?
It seems reasonable that because the sample means when taken over repeated samples from the same
population seem to lie between 98 and 102 ppm that this should provide some information about the true
population value. For example, if you saw in the 8 additional samples that the range of the sample means
varied between 90 and 110 ppm would your plausible interval change?
Again statistical theory comes into play. A very famous and important (for statisticians!) theorem, the
Central Limit Theorem, gives the theoretical sampling distribution of many statistics for most common
sampling methods.
In this case, the Central Limit Theorem states that the sample mean from a simple random sample from
a large population should have an approximate normal distribution with se(Ȳ) = σ/√n. The se of Ȳ
measures the variability of Ȳ around the true population mean when different samples of the same size are
taken. Note that the sample mean is LESS variable than individual observations - does this make sense?
Using the properties of a Normal distribution, there is a 95% probability that Ȳ will vary within about
±2se of the true mean (why?). Conversely, there should be about a 95% probability that the true mean
should be within ±2se of Ȳ! This is the crucial step in statistical reasoning.
Unfortunately, σ - the population standard deviation - is unknown so we can't find the se of Ȳ. However,
it seems reasonable to assume that s, the sample standard deviation, is a reasonable estimator of σ, the
population standard deviation. So s/√n should be a reasonable estimator of σ/√n. This is what is reported in
the above output, and we have that the Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm.
This number is an estimate of how variable Ȳ is around the true population mean in repeated samples of the
same size from the same population.
Consequently, it seems reasonable that there should be about a 95% probability that the true mean is
within 2 estimated se of the sample mean, or, we state that an approximate 95% confidence interval is
computed as: Ȳ ± 2(estimated se) or 100.8 ± 2(1.1136) = 100.8 ± 2.2276 = (98.6 to 103.0) ppm.
It turns out that we also have to account for the fact that s is only an estimate of σ (s can also vary from
sample to sample) and so the estimated se may not equal the theoretical standard error. Consequently, the
multiplier (2) has to be increased slightly to account for this.
Proc Univariate also reports the 95% confidence interval if you specify the cibasic option on the Proc
statement:
VarName   Parameter       Estimate    Lower 95% Confidence Limit   Upper 95% Confidence Limit
ddt       Mean            100.80000   98.28097                     103.31903
ddt       Std Deviation   3.52136     2.42212                      6.42864
ddt       Variance        12.40000    5.86665                      41.32737
For large samples (typically greater than 30), the multiplier is very close to 2 (actually has the value of
1.96) and there is virtually no difference in the intervals because then s is a very good estimator of σ and no
additional correction is needed.
We say that we are 95% confident the true population mean (whatever it is) is somewhere in the interval
(98.3 to 103.3) ppm. What does this mean? We are pretty sure that the true mean DDT is not 110 ppm, nor
is it 90 ppm. But we don't really know if it is 99 or 102 ppm. Plausible values for the true mean DDT for
ALL gulls are any value in the range 98.3 to 103.3 ppm.
Note that the interval is NOT an interval for the individual values, i.e. it is NOT CORRECT to say that a
95% confidence interval includes 95% of the raw data. Rather the confidence interval tells you a plausible
range for the true population mean μ. Also, it is not a confidence interval for the sample mean (which you
know to be 100.8) but rather for the unknown population mean μ. These two points are the most common
misinterpretations of confidence intervals.
To obtain a plot of the confidence interval, use Proc Ttest in SAS. Note that Proc Ttest could also have
been used to obtain the basic summary statistics and graphs similar to those from Proc Univariate.
Notice that the upper and lower bars of a box-plot and the upper and lower limits of the confidence
intervals are telling you different stories. Be sure that you understand the difference!
Many packages and published papers don't show confidence intervals, but rather simply show the mean
and then either ±1se or ±2se from the mean as approximate 68% or 95% confidence intervals such as below:
There really isn't any reason to plot ±1se as these are approximate 68% confidence limits, which seems
kind of silly. The reason this type of plot persists is because it is the default option in Excel, which has
the largest collection of bad graphs in the world. [A general rule of thumb - DON'T USE EXCEL FOR
STATISTICS!]
What are the likely effects of changing sample sizes, different amounts of variability, and different
levels of confidence upon the confidence interval width?
It seems reasonable that a larger sample size should be more precise, i.e. have less variation over
repeated samples from the same population. This implies that a confidence interval based on a larger sample
should be narrower for the same level of confidence, i.e. a 95% confidence interval from a sample with
n = 100 should be narrower than a 95% confidence interval from a sample with n = 10 when taken from
the same population.
Also, if the elements in a population are more variable, then the variation of the sample mean should be
larger and the corresponding confidence interval should be wider.
And, why stop at a 95% confidence level - why not find a 100% confidence interval? In order to be 100%
confident, you would have to sample the entire population, which is not practical in most cases. Again,
it seems reasonable that interval widths will increase with the level of confidence, i.e. a 99% confidence
interval will be wider than a 95% confidence interval.
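For large samples, the multipliers themselves come from the normal distribution, so the effect of the confidence level on the interval width can be seen directly. A sketch using Python's standard library (illustrative only; small samples would use a t-distribution instead, as discussed later):

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution

# two-sided multipliers for various confidence levels
for level in (0.68, 0.95, 0.99):
    mult = z.inv_cdf((1 + level) / 2)
    print(f"{level:.0%} confidence: multiplier = {mult:.3f}")

# 68% -> about 1.0, 95% -> about 1.96, 99% -> about 2.58:
# a higher confidence level gives a larger multiplier, and
# hence a wider interval, for the same data
```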
How are several groups of means compared if all were selected using a random sample? For now,
one simple way to compare several groups is through the use of side-by-side confidence intervals. For
example, consider a study that looked at the change in weight of animals when given one of three different
drugs (A, D, or placebo). Here is a side-by-side confidence interval plot:
As you will see in later chapters, the procedure for comparing more than one group's mean depends on
how the data are collected. In many cases, SAS produces plots similar to:
These are known as notched box-plots. The notches on the box-plot indicate the confidence intervals for
the mean. If the notches from two groups do not overlap, then there is evidence that the population means
could differ.
What does this show? Because the 95% confidence intervals for drug A and drug D have considerable
overlap, there doesn't appear to be much of a difference in the population means (the same value could
be common to both groups). However, the small overlap in the confidence intervals of the Placebo and
the other drugs provides evidence that the population means may differ. [Note the distinction between the
sample and population means in the above discussion.]
As another example, consider the following graph of barley yields for three years along with 95% confidence
intervals drawn on the graphs. The data are from a study of crop yields downwind of a coal-fired
generating plant that started operation in 1985. What does this suggest?
Because the 95% confidence intervals for 1984 and 1980 overlap considerably, there really isn't much
evidence that the true mean yields differ. However, because the 95% confidence interval for 1988 does not
overlap the other two groups, there is good evidence that the population mean in 1988 is smaller than in the
previous two years.
In general, if the 95% confidence intervals of two groups do not overlap, then there is good evidence
that the group population means differ. If there is considerable overlap, then the population means of both
groups might be the same.
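This informal overlap rule is easy to state in code. The helper below is purely illustrative: the interval endpoints are hypothetical numbers, not taken from the barley study.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two confidence intervals (lo, hi) overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# hypothetical 95% confidence intervals for three group means
ci_1980 = (120.0, 135.0)
ci_1984 = (118.0, 132.0)
ci_1988 = (100.0, 112.0)

print(intervals_overlap(ci_1980, ci_1984))  # True  -> little evidence of a difference
print(intervals_overlap(ci_1980, ci_1988))  # False -> good evidence the means differ
```

Remember this is only a rough screening device; later chapters give formal procedures for comparing group means.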
2.3.2 Some practical advice
In order for confidence intervals to have any meaning, the data must be collected using a probability
sampling method. No amount of statistical wizardry will give valid inference for data collected in a
haphazard fashion. Remember, haphazard does not imply a random selection.
If you consult statistical textbooks, they are filled with many hundreds of formulae for confidence
intervals under many possible sampling designs. The formulae for confidence intervals are indeed
different for various estimators and sampling designs, but they are all interpreted in a similar fashion.
A rough and ready rule of thumb is that a 95% confidence interval is found as estimate ± 2se and a
68% confidence interval is found as estimate ± 1se. Don't worry too much about the exact formulae
- if a study doesn't show clear conclusions based on these rough rules, then using more exact methods
won't improve things.
The crucial part is finding the se. This depends upon the estimator and sampling design; pay careful
attention that the computer package you are using, and the options within the computer package, match
the actual data collection methods. I can't emphasize this too much! This is the most likely spot
where you may inadvertently use an inappropriate analysis!
Condence intervals are sensitive to outliers because both the sample mean and standard deviation are
sensitive to outliers.
If the sample size is small, then you must also make a very strong assumption about the population
distribution. This is because the central limit theorem only works for large samples. Recent work
using bootstrap and other resampling methods may be an alternative approach.
The condence interval only tells you the uncertainty in knowing the true population parameter
because you only measured a sample from the population
3
. It does not cover potential imprecision
caused by nonresponse, under-coverage, measurement errors etc. In many cases, these can be orders
of magnitude larger - particularly if the data was not collected according to a well dened plan.
2.3.3 Technical details
The formula for a confidence interval for a single mean, when the data are collected using a simple random sample from a population with normally distributed data, is:

   Y̅ ± t(n−1) × se   or   Y̅ ± t(n−1) × s/√n

where t(n−1) refers to values from a t-distribution with (n − 1) degrees of freedom. Values of the t-distribution are tabulated in the tables located at http://www.stat.sfu.ca/~cschwarz/CourseNotes/.
For the above example for gulls on Triangle Island, n = 10, so the multiplier for a 95% confidence interval is t(9) = 2.2622 and the confidence interval was found as: 100.8 ± 2.262(1.1136) = 100.8 ± 2.5192 = (98.28 to 103.32) ppm, which matches the results provided by JMP.
Note that different sampling schemes may not use a t-distribution and most certainly will have different degrees of freedom for the t-distribution.
This formula is useful when the raw data are not given, and only the summary statistics (typically the sample size, the sample mean, and the sample standard deviation) are available and a confidence interval needs to be computed.
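As a sketch (not part of the original notes), the gull calculation can be reproduced from the summary statistics alone; scipy is assumed to be available for the t quantile:

```python
# t-based 95% CI for a mean from summary statistics only:
# n = 10 gulls, sample mean 100.8 ppm, sample sd 3.52136 ppm (from the text).
from math import sqrt
from scipy import stats

n, ybar, s = 10, 100.8, 3.52136
se = s / sqrt(n)                          # 1.1136
t_mult = stats.t.ppf(0.975, df=n - 1)     # 2.2622 for 9 df (upper 2.5% point)
ci = (round(ybar - t_mult * se, 2), round(ybar + t_mult * se, 2))
print(ci)  # (98.28, 103.32)
```

This matches the (98.28 to 103.32) ppm interval reported above.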
What is the effect of sample size? If the above formula is examined, the primary place where the sample size n comes into play is in the denominator of the standard error. So as n increases, the se decreases. However, note that the se decreases as a function of √n, i.e. it takes 4× the sample size to reduce the standard error by a factor of 2. This is sensible because as the sample size increases, Y̅ should be less variable (and usually closer to the true population mean). Consequently, the width of the interval decreases. The confidence level doesn't change - we would still be roughly 95% confident, but the interval is smaller. The sample size also affects the degrees of freedom, which affects the t-value, but this effect is minor compared to the change in the se.

[3] This is technically known as the sampling error.
What is the effect of increasing the confidence level? If you wanted to be 99% confident, the t-value from the table increases. For example, the t-value for 9 degrees of freedom increases from 2.262 for a 95% confidence interval to 3.25 for a 99% confidence interval. In general, a higher confidence level will give a wider confidence interval.
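The two multipliers quoted above can be checked directly (a sketch assuming scipy is available):

```python
# t multipliers for 9 degrees of freedom: a two-sided 95% interval uses the
# upper 2.5% point of the t-distribution, a 99% interval the upper 0.5% point.
from scipy import stats

t95 = stats.t.ppf(0.975, df=9)
t99 = stats.t.ppf(0.995, df=9)
print(round(t95, 3), round(t99, 2))  # 2.262 3.25
```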
2.4 Hypothesis testing
Section summary:
1. Understand the basic paradigm of hypothesis testing.
2. Interpret p-values correctly.
3. Understand Type I, Type II, and Type III errors.
4. Understand the limitation of hypothesis testing.
2.4.1 A review
Hypothesis testing is an important paradigm of Statistical Inference, but has its limitations. In recent years, emphasis has moved away from formal hypothesis testing to more inferential statistics (e.g. confidence intervals), but hypothesis testing still has an important role to play.
There are two common hypothesis testing situations encountered in ecology:
- Comparing the population parameter against a known standard. For example, environmental regulations may specify that the mean contaminant loading in water must be less than a certain fixed value.
- Comparing the population parameter among 2 or more groups. For example, is the average DDT loading the same for male and female birds?
The key steps in hypothesis testing are:
- Formulate the hypothesis of NO CHANGE in terms of POPULATION parameters.
- Collect data using a good sampling design or a good experimental design, paying careful attention to the RRRs.
- Using the data, compute the difference between the sample estimate and the standard, or the difference in the sample estimates among the groups.
- Evaluate if the observed change (or difference) is consistent with NO EFFECT. This is usually summarized by a p-value.
2.4.2 Comparing the population parameter against a known standard
Again consider the example of gulls on Triangle Island introduced in previous sections.
Of interest is the population mean DDT level over ALL the gulls. Let μ represent the average DDT over all gulls on the island. The value of this population parameter is unknown because you would have to measure ALL gulls, which is logistically impossible to do.
Now suppose that the value of 98 ppm is a critical value for the health of the species. Is there evidence that the current population mean level is different than 98 ppm?
Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels:
100, 105, 97, 103, 96, 106, 102, 97, 99, 103.
Proc Ttest in SAS can be used to examine if the true mean is 98. First examine the confidence interval for the mean based on the sample of 10 birds selected earlier. [Proc Univariate was used earlier, but the following output is from Proc Ttest.]

Variable    Mean     Lower Limit of Mean    Upper Limit of Mean
ddt         100.8    98.2810                103.3
First examine the 95% confidence interval presented above. The confidence interval excludes the value of 98 ppm, so one is fairly confident that the population mean DDT level differs from 98 ppm. Furthermore, the confidence interval gives information about what the population mean DDT level could be. Note that the hypothesized value of 98 ppm is just outside the 95% confidence interval.
A hypothesis test is much more formal and consists of several steps:
1. Formulate hypotheses. This is a formal statement of two alternatives. The null hypothesis (denoted as H0 or H) indicates the state of ignorance or no effect. The alternate hypothesis (denoted as H1 or A) indicates the effect that is to be detected if present.
Both the null and alternate hypotheses can be formulated before any data are collected and are always formulated in terms of the population parameters. They are NEVER formulated in terms of the sample statistics, as these would vary from sample to sample.
In this case, the null and alternate hypotheses are:

   H: μ = 98, i.e. the mean DDT level for ALL gulls is 98 ppm.
   A: μ ≠ 98, i.e. the mean DDT level for ALL gulls is not 98 ppm.

This is known as a two-sided test because we are interested in whether the mean is either greater than or less than 98 ppm. (It is possible to construct one-sided tests, where interest lies ONLY in whether the population mean exceeds 98 ppm, or ONLY in whether it is less than 98 ppm; these are rarely useful in ecological work.)
2. Collect data. Again it is important that the data be collected using probability sampling methods and the RRRs. The form of the data collection will influence the next step.
3. Compute a test-statistic and p-value. The test-statistic is computed from the data and measures the discrepancy between the observed data and the null hypothesis, i.e. how far is the observed sample mean of 100.8 ppm from the hypothesized value of 98 ppm?
The JMP output is: [figure not reproduced]

We use the H0=98 option on the Proc Ttest statement in SAS:

ods graphics on;
proc ttest data=ddt dist=normal h0=98;
   title2 'Test if the mean is 98';
   var ddt;
   ods output TTests=Test98;
   ods output ConfLimits=CIMean1;
run;
ods graphics off;
giving:
Variable    t Value    DF    Pr > |t|
ddt         2.51       9     0.0331
The output examines if the data are consistent with the hypothesized value (98 ppm), followed by the estimate (the sample mean) of 100.8 ppm. From the earlier output, we know that the se of the sample mean is 1.11.
How discordant is the sample mean of 100.8 ppm with the hypothesized value of 98 ppm? One discrepancy measure is known as a T-ratio and is computed as:

   T = (estimate − hypothesized value) / estimated se = (100.8 − 98) / (3.52136/√10) = 2.5145

This implies the estimate is about 2.5 se different from the null hypothesis value of 98. This T-ratio is labelled as the Test Statistic in the output.
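The same T-ratio and p-value can be reproduced outside SAS or JMP. The following sketch (not from the original notes; scipy assumed available) runs a one-sample t-test on the ten DDT readings:

```python
# One-sample t-test of H0: mu = 98 on the 10 gull DDT readings.
from scipy import stats

ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]
t_stat, p_value = stats.ttest_1samp(ddt, popmean=98)
print(round(t_stat, 4), round(p_value, 4))  # T-ratio ~2.51, p ~0.033
```

The T-ratio of about 2.51 and p-value of about 0.033 match the SAS output above.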
Note that there are many measures of discrepancy of the estimate with the null hypothesis - JMP also provides a non-parametric statistic suitable when the assumption of normality in the population may be suspect - this is not covered in this course.
How is this measure of discordance between the sample mean (100.8) and the hypothesized value of 98 assessed? The unusualness of the test statistic is measured by finding the probability of observing the current test statistic assuming the null hypothesis is true. In other words, if the hypothesis were true (and the true population mean is 98 ppm), what is the probability of finding a sample mean of 100.8 ppm? [5]
This is denoted the p-value. Notice that the p-value is attached to the data - it measures the probability of the sample mean given the hypothesis is true. Probabilities CANNOT be attached to hypotheses - it would be incorrect to say that there is a 3% chance that the hypothesis is true. The hypothesis is either true or false; it can't be partially true! [6]

[5] In actual fact, the probability is computed of finding a value of 100.8 or more distant from the hypothesized value of 98. This will be explained in more detail later in the notes.
[6] This is similar to asking a small child if they took a cookie. The truth is either yes or no, but often you will get the response of "maybe", which really doesn't make much sense!
It is possible to construct what are known as one-sided tests using Proc Ttest; these are not pursued in this course - contact me for more details.
4. Make a decision. How are the test statistic and p-value used? The basic paradigm of hypothesis testing is that unusual events provide evidence against the null hypothesis. Logically, rare events shouldn't happen if the null hypothesis is true. This logic can be confusing! We will discuss it more in class.
In our case, the p-value of 0.0331 indicates there is an approximate 3.3% chance of observing a sample mean that differs this much from the hypothesized value of 98 if the null hypothesis were true.
Is this unusual? There are no fixed guidelines for the degree of unusualness expected before declaring it to be unusual. Many people use a 5% cut-off value, i.e. if the p-value is less than 0.05, then this is evidence against the null hypothesis; if the p-value is greater than 0.05, then this is not evidence against the null hypothesis. [This cut-off value is often called the α-level.] If we adopt this cut-off value, then our observed p-value of 0.0331 is evidence against the null hypothesis, and we find that there is evidence that the true mean DDT level is different than 98 ppm.
The plot at the bottom of the output that is presented by JMP is helpful in trying to understand what is going on. [No such equivalent plot is readily available in R or SAS.] It tries to give a measure of how unusual the sample mean of 100.8 is relative to the hypothesized value of 98. If the hypothesis were true, and the true population mean was 98, then you would expect the sample means to be clustered around the value of 98. The bell-shaped curve shows the distribution of the SAMPLE MEANS if repeated samples are taken from the same population. It is centered over the true population mean (98) with a variability measured by the se of 1.11. The small vertical tick mark just under the value of 101 represents the observed sample mean of 100.8. You can see that the observed sample mean of 100.8 is somewhat unusual compared to the population value of 98. The shaded areas in the two tails represent the probability of observing a value of the sample mean so far away from the hypothesized value (in either direction) if the hypothesis were true, and represent the p-value.
If you repeated the same steps with a hypothesized value of 80, you would see that the observed sample mean of 100.8 is extremely unusual relative to the population value of 80:
The JMP output is: [figure not reproduced]
We use the H0=80 option on the Proc Ttest statement in SAS:

ods graphics on;
proc ttest data=ddt dist=normal h0=80;
   title2 'Test if the mean is 80';
   var ddt;
   ods output TTests=Test80;
run;
ods graphics off;
giving:
Variable    t Value    DF    Pr > |t|
ddt         18.68      9     <.0001
The p-value is < .0001, indicating that the observed sample mean of 100.8 is very unusual relative to the hypothesis. We have evidence against the hypothesized value of 80.
Again, look at the graph produced by JMP. If the hypothesis were true, i.e. the true population mean (μ) was 80 ppm, then you would expect to see most of the sample means clustered around the value of 80 (the curve shown). The actual value of 100.8 is very unusual (if the hypothesis were true) - the vertical tick mark is very far from what would be expected.
Conversely, repeat the steps with a hypothesized value of 100:
The JMP output is: [figure not reproduced]
We use the H0=100 option on the Proc Ttest statement in SAS:

ods graphics on;
proc ttest data=ddt dist=normal h0=100;
   title2 'Test if the mean is 100';
   var ddt;
   ods output TTests=Test100;
run;
ods graphics off;

giving:

Variable    t Value    DF    Pr > |t|
ddt         0.72       9     0.49
Now we would expect the sample means to cluster around the value of 100, as shown by the curve. The observed value of 100.8 for the sample mean is not very unusual at all. The p-value is .49, which is quite large - there is no evidence that the observed sample mean is unusual relative to the hypothesized mean of 100.
Technical details
The example presented above is a case of testing a population mean against a known value when the population values have a normal distribution and the data are selected using a simple random sample.
The null and alternate hypotheses are written as:

   H: μ = μ0
   A: μ ≠ μ0

where μ0 = 98 is the hypothesized value.
The test statistic:

   T = (estimate − hypothesized value) / estimated se = (Y̅ − μ0) / (s/√n)

then has a t-distribution with n − 1 degrees of freedom.
The observed value of the test statistic is:

   T0 = (100.8 − 98) / (3.52136/√10) = 2.5145

The p-value is computed by finding Prob(|T| > |T0|) and is found to be 0.0331.
In some very rare cases, the population standard deviation is known. In these cases, the true standard error is known, and the test statistic is compared to a normal distribution. This is extremely rare in practise.
The assumption of normality of the population values can be relaxed if the sample size is sufficiently large. In those cases, the central limit theorem indicates that the distribution of the test statistic is known regardless of the underlying distribution of population values.
2.4.3 Comparing the population parameter between two groups
A common situation in ecology is to compare a population parameter in two (or more) groups. For example, it may be of interest to investigate if the mean DDT levels in male and female birds could be the same.
There are now two population parameters of interest. The mean DDT level of male birds is denoted as μm, while the mean DDT level of female birds is denoted as μf. These would be the mean DDT levels of all birds of their respective sex - once again, these cannot be measured, as not all birds can be sampled.
The hypotheses of interest are:

   H: μm = μf   or   H: μm − μf = 0
   A: μm ≠ μf   or   A: μm − μf ≠ 0

Again note that the hypotheses are in terms of the POPULATION parameters. The alternate hypothesis indicates that a difference in means in either direction is of interest, i.e. we don't have an a priori belief that male birds have a smaller or larger population mean compared to female birds.
A random sample is taken from each of the populations using the RRR.
The raw data are read in the usual way:

data ddt2g;
   infile 'ddt2g.csv' dlm=',' dsd missover firstobs=2;
   input sex $ ddt;
run;

giving:

Obs    sex    ddt
1      m      100
2      m      98
3      m      102
4      m      103
5      m      99
6      f      104
7      f      105
8      f      107
9      f      105
10     f      103

Notice there are now two columns. One column identifies the group membership of each bird (the sex) and is nominal or ordinal in scale. The second column gives the DDT reading for each bird. [7]
We start by using Proc SGplot to create side-by-side dot-plots and box plots:

proc sgplot data=ddt2g;
   title2 'Plot of ddt vs. sex';
   scatter x=sex y=ddt;
   xaxis offsetmin=.05 offsetmax=.05;
run;

and

proc sgplot data=ddt2g;
   title2 'Box plots';
   vbox ddt / group=sex notches;
   /* the notches option creates an overlap region to compare if the medians are equal */
run;

which gives: [figures not reproduced]

[7] The columns can be in any order. As well, the data can be in any order, and male and female birds can be interspersed.
Next, compute simple summary statistics for each group.
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=ddt2g;
   title2 'some basic summary statistics';
   class sex;
   var ddt;
   table sex, ddt*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1) / rts=20;
run;

which gives:

       ddt
sex    N    Mean     Std    StdErr    95_LCLM    95_UCLM
f      5    104.8    1.5    0.66      103.0      106.6
m      5    100.4    2.1    0.93      97.8       103.0
The individual sample means and se's for each sex are reported, along with 95% confidence intervals for the population mean DDT of each sex. The 95% confidence intervals for the two sexes have virtually no overlap, which implies that a single plausible value common to both sexes is unlikely to exist.
Because we are interested in comparing the two population means, it seems sensible to estimate the difference in the means. This can be done for this experiment using a statistical technique called (for historical reasons) a t-test. [8]
Proc Ttest is used to perform the test of the hypothesis that the two means are the same:

ods graphics on;
proc ttest data=ddt2g plot=all dist=normal;
   title2 'test of equality of ddts between the two sexes';
   class sex;
   var ddt;
   ods output ttests = TtestTest;
   ods output ConfLimits=TtestCL;
   ods output Statistics=TtestStat;
run;
ods graphics off;

[8] The t-test requires a simple random sample from each group.
The output is voluminous, and selected portions are reproduced below:

Variable    Method           Variances    t Value    DF        Pr > |t|
ddt         Pooled           Equal        3.86       8         0.0048
ddt         Satterthwaite    Unequal      3.86       7.2439    0.0058

Variable    sex           Method           Variances    Mean      Lower Limit of Mean    Upper Limit of Mean
ddt         Diff (1-2)    Pooled           Equal        4.4000    1.7708                 7.0292
ddt         Diff (1-2)    Satterthwaite    Unequal      4.4000    1.7222                 7.0778

Variable    sex           N    Mean      Std Error    Lower Limit of Mean    Upper Limit of Mean
ddt         f             5    104.8     0.6633       103.0                  106.6
ddt         m             5    100.4     0.9274       97.8252                103.0
ddt         Diff (1-2)    _    4.4000    1.1402       1.7708                 7.0292

and a final plot: [figure not reproduced]
The first part of the output estimates the difference in the population means. Because each sample mean is an unbiased estimator for the corresponding population mean, it seems reasonable that the difference in sample means should be unbiased for the difference in population means.
Unfortunately, many packages do NOT provide information on which order the difference in means was computed. Many packages order the groups alphabetically, but this can often be changed. Here the difference is computed as (female − male), and the estimated difference in means is 4.4 ppm. The difference is positive, indicating that the sample mean DDT for the males is less than the sample mean DDT for the females. As usual, a measure of precision (the se) should be reported for each estimate. The se for the difference in means is 1.14 (refer to later chapters on how the se is computed), and the 95% confidence interval for the difference in population means is from 1.72 to 7.08. Because the 95% confidence interval for the difference in population means does NOT include the value of 0, there is evidence that the mean DDT for all males could be different than the mean DDT for all females.
The t-ratio is again a measure of how far the difference in sample means is from the hypothesized value of 0 difference, and is found as the observed difference divided by the se of the difference. The p-value of .0058 indicates that the observed difference in sample means of 4.4 is quite unusual if the hypothesis were true. Because the p-value is quite small, there is strong evidence against the hypothesis of no difference.
The comparison of means (and other parameters) will be explored in more detail in future chapters.
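As a cross-check (a sketch, not from the original notes; scipy is assumed available), the pooled two-sample t-test can be reproduced from the raw readings:

```python
# Pooled (equal-variance) two-sample t-test of H0: mu_f = mu_m,
# using the 5 female and 5 male DDT readings from the data listing.
from scipy import stats

females = [104, 105, 107, 105, 103]
males = [100, 98, 102, 103, 99]
t_stat, p_value = stats.ttest_ind(females, males, equal_var=True)
print(round(t_stat, 2), round(p_value, 4))  # t ~3.86, p ~0.005
```

This reproduces the "Pooled / Equal" row of the SAS output (t = 3.86, p = 0.0048).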
2.4.4 Type I, Type II and Type III errors
Hypothesis testing can be thought of as analogous to a court room trial. The null hypothesis is that the defendant is 'innocent', while the alternate hypothesis is that the defendant is 'guilty'. The role of the prosecutor is to gather evidence that is inconsistent with the null hypothesis. If the evidence is sufficiently unusual (under the assumption of innocence), the null hypothesis is disbelieved.
Obviously the criminal justice system is not perfect. Occasionally mistakes are made (innocent people are convicted or guilty parties are not convicted).
The same types of errors can also occur when testing scientific hypotheses. For historical reasons, the two possible types of errors that can occur are labeled as Type I and Type II errors:
- Type I error. Also known as a false positive. A Type I error occurs when evidence against the null hypothesis is erroneously found when, in fact, the hypothesis is true. How can this occur? Well, the p-value measures the probability that the data could have occurred, by chance, if the null hypothesis were true. We usually conclude that the evidence is strong against the null hypothesis if the p-value is small, i.e. a rare event. However, rare events do occur, and perhaps the data is just one of these rare events. The Type I error rate can be controlled by the cut-off value used to decide if the evidence against the hypothesis is sufficiently strong. If you believe that the evidence is strong enough when the p-value is less than the α = .05 level, then you are willing to accept a 5% chance of making a Type I error.
- Type II error. Also known as a false negative. A Type II error occurs when you believe that the evidence against the null hypothesis is not strong enough, when, in fact, the hypothesis is false. How can this occur? The usual reason for a Type II error is that the sample size is too small to make a good decision. For example, suppose that the confidence interval for the gull example extended from 50 to 150 ppm. There is no evidence that any value in the range of 50 to 150 is inconsistent with the null hypothesis.
There are two types of correct decisions:
- Power or Sensitivity. The power of a hypothesis test is the ability to conclude that the evidence is strong enough against the null hypothesis when in fact it is false, i.e. the ability to detect if the null hypothesis is false. This is controlled by the sample size.
- Specificity. The specificity of a test is the ability to correctly find no evidence against the null hypothesis when it is true.
In any experiment, it is never known if one of these errors or a correct decision has been made. The Type I and Type II errors and the two correct decisions can be placed into a summary table (rows are the true state of nature; columns are the action taken):

                       | p-value < α: evidence           | p-value > α: no evidence
                       | against the null hypothesis     | against the null hypothesis
-----------------------+---------------------------------+---------------------------------
Null hypothesis true   | Type I error = false positive   | Correct decision. Also known
                       | error. Controlled by the        | as the specificity of the test.
                       | α-level used to decide if the   |
                       | evidence is strong enough       |
                       | against the null hypothesis.    |
Null hypothesis false  | Correct decision. Known as the  | Type II error = false negative
                       | power or sensitivity of the     | error. Controlled by sample
                       | test. Controlled by the sample  | size, with a larger sample size
                       | size, with a larger sample size | leading to fewer Type II
                       | having greater power to detect  | errors.
                       | a false null hypothesis.        |
In the context of a monitoring design to determine if there is an environmental impact due to some action, the above table reduces to:

                       | p-value < α: evidence against   | p-value > α: no evidence
                       | the null hypothesis. Impact is  | against the null hypothesis.
                       | apparently observed.            | Impact apparently not observed.
-----------------------+---------------------------------+---------------------------------
Null hypothesis true.  | Type I error = false positive   | Correct decision. No
No environmental       | error. An environmental impact  | environmental impact detected.
impact.                | is "detected" when, in fact,    |
                       | none occurred.                  |
Null hypothesis false. | Correct decision.               | Type II error = false negative
Environmental impact   | Environmental impact detected.  | error. Environmental impact
exists.                |                                 | not detected.
Usually, a Type I error is more serious (convicting an innocent person; falsely detecting an environmental impact and fining an organization millions of dollars), and so we want good evidence before we conclude against the null hypothesis. We measure the strength of the evidence by the p-value. Typically, we want the p-value to be less than about 5% before we believe that the evidence is strong enough against the null hypothesis, but this can be varied depending on the problem. If the consequences of a Type I error are severe, the evidence must be very strong before action is taken, so the α level might be reduced to .01 from the usual .05.
Most experimental studies tend to ignore power (and Type II error) issues. However, these are important - for example, should an experiment be run that only has a 10% chance of detecting an important effect? What are the consequences of failing to detect an environmental impact? What is the price tag of letting a species go extinct without detecting it? We will explore issues of power and sample size in later chapters.
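The role of α as the false-positive rate can be illustrated with a small simulation (a sketch using assumed, arbitrary population values; scipy is used for the test):

```python
# When the null hypothesis is true, about a fraction alpha of all tests
# should (wrongly) report p < alpha: the Type I error rate.
import random
from scipy import stats

random.seed(1)
alpha, n_sims, n = 0.05, 2000, 10
false_positives = 0
for _ in range(n_sims):
    # the population mean really IS the hypothesized 98 ppm,
    # so any "evidence against H0" is a false positive
    sample = [random.gauss(98, 3.5) for _ in range(n)]
    if stats.ttest_1samp(sample, popmean=98).pvalue < alpha:
        false_positives += 1
print(false_positives / n_sims)  # close to alpha = 0.05
```

Raising the bar (reducing α to .01) would reduce the false-positive count, at the cost of more Type II errors for a fixed sample size.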
What is a Type III error? This is more whimsical, as it refers to a correct answer to the wrong question! Too often, researchers get caught up in their particular research project and spend much time and energy in obtaining an answer, but the answer is not relevant to the question of interest.
2.4.5 Some practical advice
- The p-value does NOT measure the probability that the null hypothesis is true. It measures the probability of observing the sample data assuming the null hypothesis were true. You cannot attach a probability statement to the null hypothesis, in the same way you can't be 90% pregnant! The hypothesis is either true or false - there is no randomness attached to a hypothesis. The randomness is attached to the data.
- A rough rule of thumb is that there is sufficient evidence against the hypothesis if the observed test statistic is more than 2 se away from the hypothesized value.
- The p-value is also known as the observed significance level. In the past, you chose a prespecified significance level (known as the α level), and if the p-value was less than α, you concluded against the null hypothesis. For example, α is often set at 0.05 (denoted α = 0.05). If the p-value is < α = 0.05, then you concluded that the evidence was strong against the null hypothesis; otherwise the evidence was not strong enough against the null hypothesis. Scientific papers often reported results using a series of asterisks, e.g. * meant that a result was statistically significant at α = .05; ** meant that a result was statistically significant at α = .01; *** meant that a result was statistically significant at α = .001. This practice reflects a time when it was quite impossible to compute exact p-values, and only tables were available. In this modern era, there is no excuse for failing to report the exact p-value. All scientific papers should report the actual p-value for a test so that the reader can use their own personal significance level.
- Some traditional and recommended nomenclature for the results of hypothesis testing:

p-value               | Traditional                    | Recommended
----------------------+--------------------------------+------------------------------------
p-value < 0.05        | Reject the null hypothesis.    | There is strong evidence against
                      |                                | the null hypothesis.
.05 < p-value < .15   | Barely fail to reject the      | Evidence is equivocal and we
                      | null hypothesis.               | need more data.
p-value > .15         | Fail to reject the null        | There is no evidence against
                      | hypothesis.                    | the null hypothesis.

However, the point at which we conclude that there is sufficient evidence against the null hypothesis (the α level, which was .05 above) depends upon the situation at hand and the consequences of wrong decisions (see later in this chapter).
- It is not good form to state things like:
  - accept the null hypothesis;
  - accept the alternate hypothesis;
  - the null hypothesis is true;
  - the null hypothesis is false.
The reason is that you haven't proved the truthfulness or falseness of the hypothesis; rather, you do or do not have sufficient evidence that contradicts it. It is for the same reason that jury trials return verdicts of 'guilty' (evidence against the hypothesis of innocence) or 'not guilty' (insufficient evidence against the hypothesis of innocence). A jury trial does NOT return an 'innocent' verdict.
- If there is evidence against the null hypothesis, a natural question to ask is: well, what values of the parameter are plausible given this data? This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals rather than doing formal hypothesis testing.
- Carrying out a statistical test of a hypothesis is straightforward with many computer packages. However, using tests wisely is not so simple. Hypothesis testing demands the RRR. Any survey or experiment that doesn't follow the three basic principles of statistics (randomization, replication, and blocking) is basically useless. In particular, non-randomized surveys or experiments CANNOT be used in hypothesis testing or inference. Be careful that 'random' is not confused with 'haphazard'. Computer packages do not know how you collected data. It is your responsibility to ensure that your brain is engaged before putting the package in gear. Each test is valid only in circumstances where the method of data collection adheres to the assumptions of the test. Some hesitation about the use of significance tests is a sign of statistical maturity.
- Beware of outliers or other problems with the data. Be prepared to spend a fair amount of time examining the raw data for spurious points.
2.4.6 The case against hypothesis testing
In recent years, there has been much debate about the usefulness of hypothesis testing in scientic research
(see the next section for a selection of articles). There a number of problems with the uncritical use of
hypothesis testing:
Sharp null hypothesis. The value of 98 ppm as a hypothesized value seems rather arbitrary. Why not
97.9 ppm or 98.1 ppm? Do we really think that the true DDT value is exactly 98.000000000 ppm?
Perhaps it would be more reasonable to ask "How close is the actual mean DDT in the population to
98 ppm?"
© 2012 Carl James Schwarz, December 21, 2012
CHAPTER 2. INTRODUCTION TO STATISTICS
Choice of α. The choice of α-level (i.e. the 0.05 significance level) is also arbitrary. The value of α should
reflect the costs of Type I errors, i.e. the costs of false positive results. In a murder trial, the cost of
sending an innocent person to the electric chair is very large, so we require a very large burden of proof,
i.e. the p-value must be very small. On the other hand, the cost of an innocent person paying for a
wrongfully issued parking ticket is not very large; a lesser burden of proof is required, i.e. a higher
p-value can be used to conclude that the evidence is strong enough against the hypothesis.
A similar analysis should be made for any hypothesis testing situation, but rarely is.
The tradeoffs among Type I and Type II errors, power, and sample size are rarely discussed in this context.
Sharp decision rules. Traditional hypothesis testing says that if the p-value is less than α, you should
conclude that there is sufficient evidence against the null hypothesis, and if the p-value is greater than
α, there is not enough evidence against the null hypothesis. Suppose that α is set at 0.05. Should
different decisions be made if the p-value is 0.0499 or 0.0501? It seems unlikely that extremely minor
differences in the p-value should lead to such dramatic differences in conclusions.
Obvious tests In many cases, hypothesis testing is used when the evidence is obvious. For example,
why would you even bother testing if the true mean is 50 ppm? The data clearly shows that it is not.
Interpreting p-values P-values are prone to mis-interpretation as they measure the plausibility of the
data assuming the null hypothesis is true, not the probability that the hypothesis is true. There is also
the confusion between selecting the appropriate p-value for one- and two-sided tests.
Refer to the Ministry of Forests publication Pamphlet 30 on interpreting the p-value available at
http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/index.html.
Effect of sample size. P-values are highly affected by sample size. With sufficiently large sample
sizes, every effect is statistically significant, but it may be of no biological interest.
Practical vs. statistical significance. Just because you find evidence against the null hypothesis (e.g.
p-value < .05) does not imply that the effect is very large. For example, if you were to test if a
coin were fair and were able to toss it 1,000,000 times, you would find evidence against the null
hypothesis of fairness even if the observed proportion of heads was only slightly above 50% (e.g. 50.1%).
But for all intents and purposes, the coin is fair enough for real use. Statistical significance is not the
same as practical significance. Other examples of this trap are the numerous studies that show cancerous
effects of certain foods. Unfortunately, the estimated increase in risk from these studies is often less than
1/100 of 1%!
The remedy for confusing statistical and practical significance is to ask for a confidence interval for
the actual parameter of interest. This will often tell you the size of the purported effect.
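The coin example can be made concrete with a quick Python sketch (not from the text; the toss counts are hypothetical and a normal-approximation z-test is used): with an enormous number of tosses, even a trivial imbalance produces a tiny p-value, while the confidence interval shows the effect is negligible.

```python
import math

def coin_z_test(heads, n, p0=0.5):
    """Two-sided z-test of H0: P(heads) = p0, with a 95% CI (normal approximation)."""
    p_hat = heads / n
    se0 = math.sqrt(p0 * (1 - p0) / n)          # standard error under the null
    z = (p_hat - p0) / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
    return p_value, ci

# 100 million (hypothetical) tosses with 50.02% heads
p_value, ci = coin_z_test(50_020_000, 100_000_000)
print(p_value)  # tiny: "strong evidence" the coin is not exactly fair
print(ci)       # but the CI shows the bias is at most a few hundredths of a percent
```

The p-value screams "unfair", yet the interval shows the coin is fair for any practical purpose.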
Failing to detect a difference vs. no effect. Just because an experiment fails to find evidence against
the null hypothesis (e.g. p-value > .05) does not mean that there is no effect! A Type II error - a false
negative error - may have been committed. These usually occur when experiments are too small (i.e.,
inadequate sample size) to detect effects of interest.
The remedy for this is to ask for the power of the test to detect the effect of practical interest, or, failing
that, ask for the confidence interval for the parameter. Typically the power will be low, or the confidence
interval will be so wide as to be useless.
Multiple testing. In some experiments, hundreds of statistical tests are performed. However, remember
that the p-value represents the chance that data such as these could have occurred given that the null hypothesis
is true. So a p-value of 0.01 implies that this event could have occurred in about 1% of cases EVEN
IF THE NULL IS TRUE. So finding one or two significant results out of hundreds of tests is not
surprising!
There are more sophisticated analyses available to control this problem, called multiple comparison
techniques; they are covered in more advanced classes.
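A small simulation (a sketch, not from the text) makes the point: run 1000 tests on samples where the null hypothesis is always true, and roughly 5% of them come out "significant" at α = 0.05 purely by chance.

```python
import math
import random

def one_p_value(n=20):
    """Two-sided z-test of H0: mu = 0 on a sample where H0 is actually TRUE."""
    x = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(x) / n) / (1 / math.sqrt(n))       # known sigma = 1
    return math.erfc(abs(z) / math.sqrt(2))

p_values = [one_p_value() for _ in range(1000)]
false_positives = sum(p < 0.05 for p in p_values)
print(false_positives)  # around 50 of the 1000 tests are "significant" anyway
```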
On the other hand, a confidence interval for the population parameter gives much more information.
The confidence interval shows you how precise the estimate is and the range of plausible values that are
consistent with the data collected.
For example consider the following illustration:
All three results on the left would be statistically significant, but your actions would be quite different. On
the extreme left, you detected an effect and it is biologically important: you must do something. In the
second case, you detected an effect but can't yet decide if it is biologically important: more data needs to be
collected. In the third case, you detected an effect, but it is small and not biologically important.
The two right cases are both where the results are not statistically significant. In the fourth case you
failed to detect an effect, but the experiment was so well planned that you are confident that if an effect were
real, it would be small. There actually is NO difference in your conclusions between the 3rd and 4th cases!
The rightmost case is a poor experiment: you failed to detect anything because the experiment was so
small and so poorly planned that you really don't know anything! Try not to be in the rightmost case!
Modern statistical methodology is placing more and more emphasis upon the use of confidence intervals
rather than a blind adherence to hypothesis testing.
2.4.7 Problems with p-values - what does the literature say?
There were two influential papers in the Wildlife Society publications that have affected how people view
the use of p-values.
Statistical tests in publications of the Wildlife Society
Cherry, S. (1998)
Statistical tests in publications of the Wildlife Society
Wildlife Society Bulletin, 26, 947-954.
http://www.jstor.org/stable/3783574.
The 1995 issue of the Journal of Wildlife Management has > 2400 p-values. I believe that is
too many. In this article, I will argue that authors who publish in the Journal and in the Wildlife
Society Bulletin are overusing and misunderstanding hypothesis tests. They are conducting too
many unnecessary tests, and they are making common mistakes in carrying out and interpreting
the results of the tests they conduct. A major cause of the overuse of testing in the Journal and
the Bulletin seems to be the mistaken belief that testing is necessary in order for a study to be
valid or scientific.
What are the problems in the analysis of habitat availability?
What additional information do confidence intervals provide that significance levels do not provide?
When is the assumption of normality critical in testing if the means of two populations are equal?
What does Cherry recommend in lieu of hypothesis testing?
The Insignificance of Statistical Significance Testing
Johnson, D. H. (1999)
The Insignificance of Statistical Significance Testing
Journal of Wildlife Management, 63, 763-772.
http://dx.doi.org/10.2307/3802789 or online at
http://www.npwrc.usgs.gov/resource/methods/statsig/index.htm
Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical
hypothesis tests add very little value to the products of research. Indeed, they frequently
confuse the interpretation of data. This paper describes how statistical hypothesis tests are often
viewed, and then contrasts that interpretation with the correct one. He discusses the arbitrariness
of p-values, conclusions that the null hypothesis is true, power analysis, and distinctions
between statistical and biological significance. Statistical hypothesis testing, in which the null
hypothesis about the properties of a population is almost always known a priori to be false, is
contrasted with scientific hypothesis testing, which examines a credible null hypothesis about
phenomena in nature. More meaningful alternatives are briefly outlined, including estimation
and confidence intervals for determining the importance of factors, decision theory for guiding
actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other
statistical practices.
This is a very nice, readable paper that discusses some of the problems with hypothesis testing. As in the
Cherry paper above, Johnson recommends that confidence intervals be used in place of hypothesis testing.
So why are confidence intervals not used as often as they should be? Johnson gives several reasons:
hypothesis testing has become a tradition;
the advantages of confidence intervals are not recognized;
there is some ignorance of the procedures available;
major statistical packages do not include many confidence interval estimates;
sizes of parameter estimates are often disappointingly small even though they may be very significantly
different from zero;
the wide confidence intervals that often result from a study are embarrassing;
some hypothesis tests (e.g., chi-square contingency table) have no uniquely defined parameter associated
with them; and
recommendations to use confidence intervals often are accompanied by recommendations to abandon
statistical tests altogether, which is unwelcome advice.
These reasons are not valid excuses for avoiding confidence intervals in lieu of hypothesis tests in situations
for which parameter estimation is the objective.
Followups
In
Robinson, D. H. and Wainer, H. W. (2002).
On the past and future of null hypothesis significance testing.
Journal of Wildlife Management 66, 263-271.
http://dx.doi.org/10.2307/3803158.
the authors argue that there is some benefit to p-values in wildlife management, but then
Johnson, D. H. (2002).
The role of hypothesis testing in wildlife science.
Journal of Wildlife Management 66, 272-276.
http://dx.doi.org/10.2307/3803159.
counters many of these arguments. Both papers are very easy to read and are highly recommended.
2.5 Meta-data
Meta-data are data about data: how the data have been collected, what the units are, what the codes used in
the dataset represent, etc. It is good practice to store the meta-data as close as possible to the raw data. For
example, some computer packages (e.g. JMP) allow the user to store information about each variable and
about the data table.
In some cases, data can be classified into broad classifications called scales or roles.
2.5.1 Scales of measurement
Data come in various sizes and shapes, and it is important to know about these so that the proper analysis
can be used on the data.
Some computer packages (e.g. JMP) use the scales of measurement to determine appropriate analyses
of the data. For example, as you will see later in the course, if the response variable (Y) has an interval scale
and the explanatory variable (X) has a nominal scale, then an ANOVA-type analysis comparing means
is performed. If both the Y and X variables have a nominal scale, then a χ²-type analysis comparing
proportions is performed.
There are usually 4 scales of measurement that must be considered:
1. Nominal Data
The data are simply classifications.
The data have no ordering.
The data values are arbitrary labels.
An example of nominal data is sex using codes m and f, or codes 0 and 1. Note that just
because a numeric code is used for sex, the variable is still nominally scaled. The practice of
using numeric codes for nominal data is discouraged (see below).
2. Ordinal Data
The data can be ordered, but differences between values cannot be quantified.
Some examples of ordinal data are:
Ranking political parties on a left-to-right spectrum using labels 0, 1, or 2.
Using a Likert scale to rank your degree of happiness on a scale of 1 to 5.
Giving a restaurant a rating of terrific, good, or poor.
Ranking the size of animals as small, medium, or large, or coded as 1, 2, 3. Again, numeric
codes for ordinal data are discouraged (see below).
3. Interval Data
The data can be ordered and have a constant scale, but have no natural zero.
This implies that differences between data values are meaningful, but ratios are not.
There are really only two common interval-scaled variables: temperature (°C, °F) and dates.
For example, 30°C − 20°C = 20°C − 10°C, but 20°C/10°C is not twice as hot!
4. Ratio Data
Data can be ordered, have a constant scale, and have a natural zero.
Examples of ratio data are height, weight, age, length, etc.
Some packages (e.g. JMP) make no distinction between Interval and Ratio data, calling them both
continuous scaled. However, this is, technically, not quite correct.
Only certain operations can be performed on certain scales of measurement. The following list summarizes
which operations are legitimate for each scale. Note that you can always apply operations from a
lesser scale to any particular data, e.g. you may apply nominal, ordinal, or interval operations to an interval-scaled
datum.
Nominal Scale. You are only allowed to examine if a nominal-scale datum is equal to some particular
value or to count the number of occurrences of each value. For example, gender is a nominal-scale
variable. You can examine if the gender of a person is F or count the number of males in a sample.
Taking the average of nominally scaled data is not sensible (e.g. the average sex is not sensible).
In order to avoid problems with computer packages trying to take averages of nominal data, it is
recommended that alphanumeric codes be used for nominally scaled data, e.g. use M and F for sex
rather than 0 or 1. Most packages can accept alphanumeric data without problems.
Ordinal Scale. You are also allowed to examine if an ordinal-scale datum is less than or greater than
another value. Hence, you can rank ordinal data, but you cannot quantify differences between
two ordinal values. For example, political party is an ordinal datum with the NDP to the left of the
Conservative Party, but you can't quantify the difference. Another example is preference scores, e.g.
ratings of eating establishments where 10 = good and 1 = poor; the difference between an
establishment with a 10 rating and one with an 8 rating can't be quantified.
Technically speaking, averages are not really allowed for ordinal data, e.g. taking the average of small,
medium, and large as data values doesn't make sense. Again, alphanumeric codes are recommended for
ordinal data. Some care should be taken with ordinal data and alphanumeric codes, as many packages
sort values alphabetically, and so the ordering of large, medium, small may not correspond to the
ordering desired. JMP allows the user to specify the ordering of values in the Column Information of
each variable. A simple trick to get around this problem is to use alphanumeric codes such as 1.small,
2.medium, 3.large as the data values, as an alphabetic sort then keeps the values in proper order.
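The sorting trick can be illustrated with a couple of lines of Python (a sketch; any package that sorts labels alphabetically behaves the same way):

```python
# An alphabetic sort scrambles plain ordinal labels...
sizes = ["small", "large", "medium", "small", "large"]
print(sorted(set(sizes)))   # ['large', 'medium', 'small'] -- not the desired order

# ...but prefixing each label with its rank keeps an alphabetic sort correct.
coded = ["1.small", "3.large", "2.medium", "1.small", "3.large"]
print(sorted(set(coded)))   # ['1.small', '2.medium', '3.large']
```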
Interval Scale. You are also allowed to quantify the difference between two interval-scale values, but
there is no natural zero. For example, temperature scales are interval data, with 25°C warmer than
20°C, and a 5°C difference has some physical meaning. Note that 0°C is arbitrary, so that it does not
make sense to say that 20°C is twice as hot as 10°C.
Values for interval scaled variables are recorded using numbers so that averages can be taken.
Ratio Scale. You are also allowed to take ratios among ratio-scaled variables. Physical measurements
of height, weight, and length are typically ratio variables. It is now meaningful to say that 10 m is twice as
long as 5 m. This ratio holds true regardless of the scale in which the object is measured (e.g. meters
or yards). This is because there is a natural zero.
Values for ratio-scaled variables are recorded as numbers so that averages can be taken.
2.5.2 Types of Data
Data can also be classified by type. This is less important than the scale of measurement, as it usually
does not imply a certain type of analysis, but it can have subtle effects.
Discrete data. Only certain specific values are valid; points between these values are not valid. For example,
counts of people (only integer values allowed), or the grade assigned in a course (F, D, C-, C, C+, . . .).
Continuous data. All values in a certain range are valid. For example, height, weight, length, etc. Note that
some packages label interval or ratio data as continuous. This is not always the case.
Continuous but discretized. Continuous data cannot be measured to infinite precision. They must be
discretized, and consequently are technically discrete. For example, a person's height may be measured to
the nearest cm. This can cause problems if the level of discretization is too coarse. For example, what
would happen if a person's height were measured to the nearest meter?
As a rule of thumb, if the discretization is less than 5% of the typical value, then a discretized continuous
variable can be treated as continuous without problems.
2.5.3 Roles of data
Some computer packages (e.g. JMP) also make distinctions about the role of a variable.
Label. A variable whose value serves as an identification of each observation, usually for plotting.
Frequency. A variable whose value indicates how many occurrences of this observation occur. For example,
rather than having 100 lines in a data set to represent 100 females, you could have one line with a count
of 100 in the Frequency variable.
Weight. This is rarely used. It indicates the weight that this observation is to have in the analysis. Usually
used in advanced analyses.
X. Identifies a variable as a predictor variable. [Note that the use of the term independent variable is
somewhat old-fashioned and is falling out of favour.] This will be more useful when actual data
analysis is started.
Y. Identifies a variable as a response variable. [Note that the use of the term dependent variable is somewhat
old-fashioned and is falling out of favour.] This will be more useful when actual data analysis is
started.
2.6 Bias, Precision, Accuracy
The concepts of Bias, Precision, and Accuracy are often used interchangeably in non-technical writing and
speaking. However, they have very specific statistical meanings, and it is important that these be carefully
differentiated.
The first important point about these terms is that they CANNOT be applied in the context of a single
estimate from a single set of data. Rather, they are measurements of the performance of an estimator over
repeated samples from the same population. Recall that a fundamental idea of statistics is that repeated
samples from the same population will give different estimates, i.e. estimates will vary as different samples
are selected. (The standard error of an estimator measures this variation over repeated samples from the
same population.)
Bias is the difference between the average value of the estimator over repeated sampling from the population
and the true parameter value. If the estimates from repeated sampling vary above and below the true
population parameter value so that the average over all possible samples equals the true parameter value, we
say that the estimator is unbiased.
There are two types of bias: systemic and statistical. Systemic bias is caused by problems in the
apparatus or the measuring device. For example, if a scale systematically gave readings that were 10 g too
small, this would be a systemic bias. Or if snorkelers in a stream survey consistently see only 50% of the
available fish, this would also be an example of systemic bias. Statistical bias is related to the choice of
sampling design and estimator. For example, the usual sample statistics in a simple random sample give
unbiased estimates of means, totals, and variances, but not of standard deviations. The ratio estimator of survey
sampling (refer to later chapters) is also biased.
It is not possible to detect systemic biases using the data at hand. The researcher must examine the
experimental apparatus and design very carefully. For example, if repeated surveys were made by snorkeling
over sections of streams, estimates may be very reproducible (i.e. very precise) but could be consistently
WRONG, e.g. divers only see about 60% of the fish (i.e. biased). Systemic bias is controlled by careful
testing of the experimental apparatus etc. In some cases, it is possible to calibrate the method using "known"
populations, e.g. mixing a solution of a known concentration and then having your apparatus estimate the
concentration.
Statistical biases can be derived from statistical theory. For example, statistical theory can tell you
that the sample mean of a simple random sample is unbiased for the population mean; that the sample
VARIANCE is unbiased for the population variance; but that the sample standard deviation is a biased
estimator of the population standard deviation. [Even though the sample variance is unbiased, the sample
standard deviation is a NON-LINEAR function of the variance (i.e. its square root), and non-linear functions
don't preserve unbiasedness.] The ratio estimator is also biased for the population ratio. In many cases, the
statistical bias can be shown to essentially disappear with reasonably large sample sizes.
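This bias is easy to verify by simulation. The following Python sketch (not from the text; the population values are made up) draws many small samples from a population with σ = 10 and averages the two estimators over the repeated samples:

```python
import math
import random

random.seed(1)
SIGMA = 10.0            # true population standard deviation
N, REPS = 5, 20_000     # small samples make the bias easy to see

var_sum = sd_sum = 0.0
for _ in range(REPS):
    sample = [random.gauss(0, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    s2 = sum((x - mean) ** 2 for x in sample) / (N - 1)  # usual sample variance
    var_sum += s2
    sd_sum += math.sqrt(s2)

print(var_sum / REPS)   # close to sigma^2 = 100: unbiased
print(sd_sum / REPS)    # noticeably below sigma = 10: biased low
```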
Precision of an estimator refers to how variable the repeated estimates will be over repeated sampling
from the same population. Again, recall that every different sample from the same population will lead to a
different estimate. If these estimates have very little variation over repeated samples, we say that the estimator
is precise. The standard error (SE) of the estimator measures the variation of the estimator over repeated
sampling from the same population.
The precision of an estimator is controlled by the sample size. In general, a larger sample size leads to
more precise estimates than a smaller sample size.
The precision of an estimator is also determined by statistical theory. For example, the precision (standard
error) of a sample mean selected using a simple random sample from a large population is found using
mathematics to be equal to (pop std dev)/√n. A common error is to use this latter formula for all estimators that
look like a mean; however, the formula for the standard error of any estimator depends upon the way the
data are collected (i.e. is it a simple random sample, a cluster sample, a stratified sample, etc.), the estimator
of interest (e.g. different formulae are used for standard errors of means, proportions, totals, slopes, etc.) and,
in some cases, the distribution of the population values (e.g. do elements from the population come from a
normal distribution, or a Weibull distribution, etc.).
Finally, accuracy is a combination of precision and bias. It measures the average distance of the
estimator from the population parameter. Technically, one measure of the accuracy of an estimator is the
Root Mean Square Error (RMSE), computed as RMSE = √((Bias)² + (SE)²). A precise, unbiased estimator
will be accurate, but not all accurate estimators will be unbiased.
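The RMSE formula can be checked by simulation. The sketch below (not from the text; the population and the deliberately biased estimator are invented for illustration) applies "sample mean plus 1" over repeated samples; its empirical SE matches σ/√n = 2 and its RMSE is about √(1² + 2²) ≈ 2.24:

```python
import math
import random

random.seed(2)
MU, SIGMA, N, REPS = 50.0, 10.0, 25, 10_000

estimates = []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    estimates.append(sum(sample) / N + 1.0)   # deliberately biased: sample mean + 1

avg = sum(estimates) / REPS
bias = avg - MU                               # about +1
se = math.sqrt(sum((e - avg) ** 2 for e in estimates) / (REPS - 1))  # about 2
rmse = math.sqrt(bias ** 2 + se ** 2)         # about 2.24
print(bias, se, rmse)
```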
The relationship between bias, precision, and accuracy can be viewed graphically as shown below. Let *
represent the true population parameter value (say the population mean), and periods (.) represent values of
the estimator (say the sample mean) over repeated samples from the same population.
Precise, Unbiased, Accurate Estimator
*
Pop mean
----------------------------------
.. . .. Sample means
Imprecise, Unbiased, less accurate estimator
*
Pop mean
----------------------------------
... .. . .. ... Sample means
Precise, Biased, but accurate estimator
*
Pop mean
----------------------------------
... Sample means
Imprecise, Biased, less accurate estimator
*
Pop mean
----------------------------------
.. ... ... Sample means
Precise, Biased, less accurate estimator
*
Pop mean
----------------------------------
... Sample means
Statistical theory can tell you whether an estimator is statistically unbiased, and what its precision and
accuracy are, provided a probabilistic sample is taken. If data are collected haphazardly, the properties of an
estimator cannot be determined. Systemic biases caused by poor instruments cannot be detected statistically.
2.7 Types of missing data
Missing data happen frequently. There are three types of missing data, and an important step in any analysis
is to think about the mechanisms that could have caused the missing data.
First, data can be Missing Completely at Random (MCAR). In this case, the missingness is unrelated
to the response variable and to any other variable in the study. For example, in field trials, a hailstorm
destroys a test plot. It is unlikely that the hailstorm location is related to the response variable of the
experiment or any other variable of interest to the experiment. In medical trials, a patient may leave the
study because they win the lottery. It is unlikely that this is related to anything of interest in the study.
If data are MCAR, most analyses proceed unchanged. The design may be unbalanced and the estimates
have poorer precision than if all data were present, but no biases are introduced into the estimates.
Second, data can be Missing at Random (MAR). In this case, the missingness is unrelated to the
response variable, but may be related to other variables in the study. For example, suppose that in a drug
study involving males and females, some females must leave the study because they become pregnant.
Again, as long as the missingness is not related to the response variable, the design is unbalanced and the
estimates have poorer precision, but no biases are introduced into the estimates.
Third, and the most troublesome case, is Informative Missing. Here the missingness is related to the
response. For example, a trial was conducted to investigate the effectiveness of fertilizer on the regrowth
of trees after clear cutting. The added fertilizer increased growth, which attracted deer, which ate all the
regrowth!¹⁰
The analyst must also carefully distinguish between values of 0 and missing values. They are NOT THE
SAME! Here is a little example to illustrate the perils of missing data related to 0-counts. The Department
of Fisheries and Oceans has a program called ShoreKeepers which allows community groups to collect data
on the ecology of ocean shores in a scientific fashion that could be used in later years as part of an
environmental assessment study. As part of the protocol, volunteers randomly place 1 m² quadrats on the
shore and count the number of species of various organisms. Suppose the following data were recorded for
three quadrats:
¹⁰ There is an urban legend about an interview with an opponent of compulsory seat-belt legislation who compared the lengths of stays
in hospitals of auto-accident victims who were or were not wearing seat belts. People who wore seat belts spent longer, on average, in
hospital following the accident than people not wearing seat belts. The opponent felt that this was evidence for not making seat belts
compulsory!
Quadrat   Species   Count
Q1        A             5
          C            10
Q2        B             5
          C             5
Q3        A             5
          B            10
Now, based on the above data, what is the average density of species A? At first glance, it would appear
to be (5 + 5)/2 = 5 per quadrat. However, there was no data recorded for species A in Q2. Does this mean
that the density of species A was not recorded because people didn't look for species A, or was it not
recorded because the density was 0? In the first instance, the value for A is Missing at Random in Q2,
and the correct estimated density of species A is indeed 5. In the second case, the missingness is informative,
and the correct estimated density is (5 + 0 + 5)/3 = 3.33 per quadrat.
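The two computations can be written out explicitly; a Python sketch using the three quadrats above:

```python
# Counts actually recorded for each quadrat (species A missing from Q2).
quadrats = {"Q1": {"A": 5, "C": 10},
            "Q2": {"B": 5, "C": 5},
            "Q3": {"A": 5, "B": 10}}

# Interpretation 1 - missing at random: average only where A was recorded.
recorded = [q["A"] for q in quadrats.values() if "A" in q]
mar_density = sum(recorded) / len(recorded)
print(mar_density)          # 5.0

# Interpretation 2 - missing means zero: treat the absent record as a 0 count.
filled = [q.get("A", 0) for q in quadrats.values()]
zero_density = sum(filled) / len(filled)
print(zero_density)         # 3.33...
```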
The above example may seem simplistic, but many database programs are set up in this fashion to save
storage space by NOT recording zero counts. Unfortunately, one cannot then distinguish between a missing
value implying that the count was zero and a missing value indicating that the data were not collected. Even
worse, many database queries could erroneously treat the missing data as missing at random rather than as zeros,
giving wrong answers to averages!
For example, the Breeding Bird Survey is an annual survey of birds that follows specific routes and
records the number of each type of species encountered. According to the documentation about this survey,¹¹
only the non-zero counts are stored in the database, and some additional information, such as the number of
actual routes run, is required to impute the missing zeroes:
Since only non-zero counts are included in the database, the complete list of years a route is
run allows the times in which the species wasn't seen to be identified and included in the data
analysis.
If this extra step were not done, then you would be in exactly the situation described above for quadrat sampling.
The moral of the story is that 0 is a valid value and should be recorded as such! Computer storage costs
are declining so quickly that the savings from not recording 0s soon vanish when people can't or don't
remember to adjust for the unrecorded 0 values.
If your experiment or survey has informative missing values, you could have a serious problem in the
analysis, and expert help should be consulted.
¹¹ http://www.cws-scf.ec.gc.ca/nwrc-cnrf/migb/stat_met_e.cfm
2.8 Transformations
2.8.1 Introduction
Many of the procedures in this course have an underlying assumption that the data from each group are
normally distributed with a common variance. In some cases this is patently false, e.g. the data are highly
skewed with variances that change, often with the mean.
The most common method to fix this problem is a transformation of the data, and the most common
transformation in ecology is the logarithmic transform, i.e. analyze log(Y) rather than Y. Other transformations
are possible; these will not be discussed in this course, but the material below applies equally
well to them.
If you are unsure of the proper transformation, there are a number of methods that can assist, including
the Box-Cox transform and an application of Taylor's Power Law. These are beyond the scope of this course.
The logarithmic transform is often used when the data are positive and exhibit a pronounced long right
tail. For example, the following are plots of (made-up) data before and after a logarithmic transformation:
There are several things to note in the two graphs.
The distribution of Y is skewed with a long right tail, but the distribution of log(Y) is symmetric.
The mean is to the right of the median in the original data, but the mean and median are the same in the
transformed data.
The standard deviation of Y is large relative to the mean (cv = std dev/mean = 131/421 = 31%), whereas the
standard deviation is small relative to the mean in the transformed data (cv = std dev/mean = 0.3/6.0 = 5%).
The box-plots show a large number of potential outliers in the original data, but only a few in
the transformed data. It can be shown that in the case of a log-normal distribution, about 5% of
observations are more than 3 standard deviations from the mean, compared to less than 1/2 of 1% of
such observations for a normal distribution.
The form of the Y data above occurs quite often in ecology and is often called a log-normal distribution
given that a logarithmic transformation seems to normalize the data.
2.8.2 Conditions under which a log-normal distribution appears
Under what conditions would you expect to see a log-normal distribution? Normal distributions often occur
when the observed variable is the sum of underlying processes. For example, heights of adults (within
a sex) are fit very closely by a normal distribution. The height of a person is determined by the sum of
heights of the shin, thigh, trunk, neck, head and other portions of the body. A famous theorem of statistics
(the Central Limit Theorem) says that data that are formed as the sum of other data will tend to have a
normal distribution.
In some cases, the underlying processes act multiplicatively. For example, the distribution of household
income is often log-normal. You can imagine that factors such as level of education, motivation, and
parental support act to multiply income rather than simply adding a fixed amount of money. Similarly, data
on animal abundance often has a log-normal distribution because factors such as survival act multiplicatively
on the populations.
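The multiplicative story above can be checked with a short simulation. The sketch below uses made-up factors (not data from the text): each observation is built as the product of many independent positive factors, and a scale-free skewness measure is compared before and after taking logs.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical multiplicative process: each observation is the product of
# many independent positive factors (the factors are made up for illustration).
data = []
for _ in range(5000):
    y = 100.0
    for _ in range(20):
        y *= random.uniform(0.9, 1.2)  # factors multiply rather than add
    data.append(y)

logged = [math.log(y) for y in data]

def pearson_skew(values):
    """Scale-free skewness: 3 * (mean - median) / std dev."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)

skew_raw = pearson_skew(data)    # clearly positive: long right tail
skew_log = pearson_skew(logged)  # near zero: roughly symmetric
print(round(skew_raw, 2), round(skew_log, 2))
```

The raw products show the long right tail typical of log-normal data, while their logs (a sum of many small terms, as the Central Limit Theorem suggests) are roughly symmetric.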
2.8.3 ln() vs. log()
There is often much confusion about the form of the logarithmic transformation. For example, many calcu-
lators and statistical packages differentiate between the common logarithm (base 10, or log) and the natural
logarithm (base e, or ln). Even worse, many packages actually use log to refer to natural logarithms
and log10 to refer to common logarithms. IT DOESN'T MATTER which transformation is used, as long
as the proper back-transformation is applied. When you compare the actual values after these transforma-
tions, you will see that ln(Y) = 2.3 log10(Y), i.e. the log-transformed values differ by a fixed multiplicative
constant. When the anti-logs are applied, this constant will disappear.
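A quick numerical check of this equivalence (the value 421 is just an arbitrary example):

```python
import math

# The natural and common logs of any value differ only by the fixed
# constant ln(10), which is about 2.3: ln(Y) = 2.3026 * log10(Y).
y = 421.0  # arbitrary example value
ratio = math.log(y) / math.log10(y)
print(round(ratio, 4))  # 2.3026

# The constant disappears once the matching back-transform is applied:
# either log/anti-log pair recovers the original value.
back_natural = math.exp(math.log(y))
back_common = 10 ** math.log10(y)
print(round(back_natural, 6), round(back_common, 6))
```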
In accordance with common convention in statistics and mathematics, the use of log(Y ) will refer to the
natural or ln(Y ) transformation.
2.8.4 Mean vs. Geometric Mean
The simple mean of Y is called the arithmetic mean (or simply the mean) and is computed in the usual
fashion.
The anti-log of the mean of the log(Y ) values is called the geometric mean. The geometric mean of a
set of data is ALWAYS less than the mean of the original data. In the special case of log-normal data, the
geometric mean will be close to the MEDIAN of the original data.
For example, look at the data above. The mean of Y is 421. The mean of log(Y ) is 5.999 and
exp(5.999) = 403 which is close to the median of the original data.
This implies that when reporting results, you will need to be a little careful about how the back-transformed values are interpreted.
It is possible to go from the mean on the log-scale to the mean on the anti-log scale and vice-versa. For
log-normal data,[12] it turns out that

    mean_antilog ≈ exp(mean_log + s_log^2 / 2)

and

    mean_log ≈ log(mean_antilog) - s_antilog^2 / (2 × mean_antilog^2)

In this case:

    mean_antilog ≈ exp(5.999 + (.3)^2 / 2) = 422

and

    mean_log ≈ log(421.81) - 131.2^2 / (2 × 421.8^2) = 5.996
Unfortunately, the formula for the standard deviations is not as straightforward. There is a somewhat
complicated formula available in many reference books, but a close approximation is:

    s_antilog ≈ s_log × exp(mean_log)

and

    s_log ≈ s_antilog / mean_antilog
[12] Other transformations will have a different formula.
For the data above we see that:

    s_antilog ≈ .3 × exp(5.999) = 121

and

    s_log ≈ 131.21 / 421.81 = .311
which is close, but not exactly on the money.
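As a sketch, the four approximation formulas above can be verified directly from the summary statistics quoted for the made-up example (mean and standard deviation of 421.81 and 131.21 on the original scale, 5.999 and .3 on the log-scale):

```python
import math

# Summary statistics quoted in the text for the made-up example
mean_log, s_log = 5.999, 0.3          # on the log scale
mean_raw, s_raw = 421.81, 131.21      # on the original scale

# Log scale -> original scale
mean_raw_approx = math.exp(mean_log + s_log ** 2 / 2)   # approx 422
s_raw_approx = s_log * math.exp(mean_log)               # approx 121

# Original scale -> log scale
mean_log_approx = math.log(mean_raw) - s_raw ** 2 / (2 * mean_raw ** 2)  # approx 5.996
s_log_approx = s_raw / mean_raw                         # approx 0.311

print(round(mean_raw_approx), round(s_raw_approx))
print(round(mean_log_approx, 3), round(s_log_approx, 3))
```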
2.8.5 Back-transforming estimates, standard errors, and ci
Once inference is made on the transformed scale, it is often nice to back-transform and report results on the
original scale.
For example, a study of turbidity (measured in NTU) on a stream in BC gave the following results on
the log-scale:
Statistic                 Value
Mean on log scale          5.86
Std Dev on log scale        .96
SE on log scale            0.27
Upper 95% ci for Mean       6.4
Lower 95% ci for Mean       5.3
How should these be reported on the original NTU scale?
Mean on log-scale back to MEDIAN on anti-log scale
The simplest back-transform goes from the mean on the log-scale to the MEDIAN on the anti-log scale.
The distribution is often symmetric on the log-scale so the mean, median, and mode on the log-scale all
coincide. However, when you take anti-logs, the upper tail gets larger much faster than the lower tail and
the anti-log transform re-introduces skewness into the back-transformed data. Hence, the center point on the
log-scale gets back-transformed to the median on the anti-log scale.

The estimated MEDIAN (or GEOMETRIC MEAN) on the original scale is found by the back-transform
of the mean on the log-scale, i.e.

    median_antilog = exp(mean_log)
estimated median = exp(5.86) = 350 NTU.
The 95% confidence interval for the MEDIAN is found by doing a simple back-transformation on the 95%
confidence interval for the mean on the log-scale, i.e. from exp(5.3) = 196 to exp(6.4) = 632 NTUs. Note
that the confidence interval on the back-transformed scale is no longer symmetric about the estimate.
There is no direct back-transformation of the standard error from the log-scale to the original scale, but
an approximate standard error on the back-transformed scale is found as se_antilog = se_log × exp(mean_log) = .27 × exp(5.86) = 95 NTUs.
If the MEAN on the anti-log scale is needed, recall from the previous section that

    mean_antilog ≈ exp(mean_log + s_log^2 / 2)
                 = exp(mean_log) × exp(s_log^2 / 2)
                 = median × exp(s_log^2 / 2)
                 = median × exp(.96^2 / 2) = median × 1.58
Hence multiply the median, the standard error of the median, and the limits of the 95% confidence interval all by 1.58.
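Putting the whole turbidity example together, here is a minimal sketch of the back-transformations. The log-scale numbers are taken from the table above; small discrepancies from the rounded values in the text are due to rounding of the inputs.

```python
import math

# Log-scale summary statistics from the turbidity example
mean_log = 5.86
sd_log = 0.96
se_log = 0.27
ci_log = (5.3, 6.4)

# The mean on the log scale back-transforms to the MEDIAN (geometric mean).
median_ntu = math.exp(mean_log)
ci_median = tuple(math.exp(v) for v in ci_log)

# Approximate standard error on the back-transformed scale.
se_ntu = se_log * math.exp(mean_log)

# Lifting the median up to the MEAN uses the log-normal correction factor.
factor = math.exp(sd_log ** 2 / 2)   # roughly 1.58
mean_ntu = median_ntu * factor

print(round(median_ntu), [round(v) for v in ci_median], round(se_ntu))
print(round(factor, 2), round(mean_ntu))
```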
2.8.6 Back-transforms of differences on the log-scale
Some care must be taken when back-transforming differences on the log-scale. The general rule of thumb is
that a difference on the log-scale corresponds to the log of a ratio on the original scale.[13] Hence a back-transform
of a difference on the log-scale corresponds to a ratio on the original scale.
For example, here are the results from a study to compare turbidity before and after remediation was
completed on a stream in BC.
Statistic        Value on log-scale
Difference                  -0.8303
Std Err Dif                  0.3695
Upper CL Dif                -0.0676
Lower CL Dif                -1.5929
p-value                      0.0341
A difference of -.83 units on the log-scale corresponds to a ratio of exp(-.83) = .44 in the NTU on
the original scale. In other words, the median NTU after remediation was .44 times that of the median NTU
before remediation. Or, the median NTU before remediation was exp(.83) = 2.29 times that of the median
NTU after remediation. Note that 2.29 = 1/0.44.
[13] Recall that log(Y/Z) = log(Y) - log(Z).
The 95% confidence intervals are back-transformed in a similar fashion. In this case the 95% confidence
interval on the RATIO of median NTUs lies between exp(-1.59) = .20 and exp(-.067) = .93, i.e. the
median NTU after remediation was between .20 and .93 of the median NTU before remediation.
If necessary you could also back-transform the standard error to get a standard error for the ratio on the
original scale, but this is rarely done.
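The remediation example can be reproduced in a few lines; the signs of the log-scale difference and confidence limits are negative because turbidity decreased:

```python
import math

# Log-scale results from the before/after remediation comparison
diff_log = -0.8303          # mean(after) - mean(before) on the log scale
ci_log = (-1.5929, -0.0676)

# A difference of logs is the log of a ratio, so the back-transform
# gives a RATIO of medians on the original NTU scale.
ratio = math.exp(diff_log)                    # after is about 0.44 times before
ci_ratio = tuple(math.exp(v) for v in ci_log)

# The reciprocal expresses the same comparison the other way around.
print(round(ratio, 2), round(1 / ratio, 2))
print(tuple(round(v, 2) for v in ci_ratio))
```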
2.8.7 Some additional readings on the log-transform
Here are some additional readings on the use of the log-transform taken from the WWW. The URL is
presented at the bottom of each page.
Stats: Log transformation 2/18/05 10:51
- 1 (7) -
http://www.cmh.edu/stats/model/linear/log.asp
Dear Professor Mean, I have some data that I need help with analysis. One suggestion is that I use a log
transformation. Why would I want to do this? -- Stumped Susan
Dear Stumped
Think of it as employment security for us statisticians.
Short answer
If you want to use a log transformation, you compute the logarithm of each data value and
then analyze the resulting data. You may wish to transform the results back to the original
scale of measurement.
The logarithm function tends to squeeze together the larger values in your data set and
stretches out the smaller values. This squeezing and stretching can correct one or more of the
following problems with your data:
1. Skewed data
2. Outliers
3. Unequal variation
Not all data sets will suffer from these problems. Even if they do, the log transformation is not
guaranteed to solve these problems. Nevertheless, the log transformation works surprisingly
well in many situations.
Furthermore, a log transformation can sometimes simplify your statistical models. Some
statistical models are multiplicative: factors influence your outcome measure through
multiplication rather than addition. These multiplicative models are easier to work with after a
log transformation.
If you are unsure whether to use a log transformation, here are a few things you should look
for:
1. Is your data bounded below by zero?
2. Is your data defined as a ratio?
3. Is the largest value in your data more than three times larger than the smallest
value?
Squeezing and stretching
The logarithm function squeezes together big data values (anything larger than 1). The
bigger the data value, the more the squeezing. The graph below shows this effect.
The first two values are 2.0 and 2.2. Their logarithms, 0.69 and 0.79 are much closer. The
second two values, 2.6 and 2.8, are squeezed even more. Their logarithms are 0.96 and 1.03.
The logarithm also stretches small values apart (values less than 1). The smaller the values
the more the stretching. This is illustrated below.
The values of 0.4 and 0.45 have logarithms (-0.92 and -0.80) that are further apart. The values
of 0.20 and 0.25 are stretched even further. Their logarithms are -1.61 and -1.39, respectively.
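The squeezing and stretching described above is easy to verify with the same numbers used in the article:

```python
import math

# Values above 1 get squeezed together by the log...
big = [2.0, 2.2, 2.6, 2.8]
big_logs = [round(math.log(v), 2) for v in big]
print(big_logs)    # [0.69, 0.79, 0.96, 1.03]

# ...while values below 1 get stretched apart.
small = [0.20, 0.25, 0.40, 0.45]
small_logs = [round(math.log(v), 2) for v in small]
print(small_logs)  # [-1.61, -1.39, -0.92, -0.8]

# Gaps shrink above 1 and grow below 1 after the transform.
print(big_logs[1] - big_logs[0] < big[1] - big[0])
print(small_logs[1] - small_logs[0] > small[1] - small[0])
```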
Skewness
If your data are skewed to the right, a log transformation can sometimes produce a data
set that is closer to symmetric. Recall that in a skewed right distribution, the left tail (the
smaller values) is tightly packed together and the right tail (the larger values) is widely spread
apart.
The logarithm will squeeze the right tail of the distribution and stretch the left tail, which
produces a greater degree of symmetry.
If the data are symmetric or skewed to the left, a log transformation could actually make
things worse. Also, a log transformation is unlikely to be effective if the data has a narrow
range (if the largest value is not more than three times bigger than the smallest value).
Outliers
If your data has outliers on the high end, a log transformation can sometimes help. The
squeezing of large values might pull that outlier back in closer to the rest of the data. If your
data has outliers on the low end, the log transformation might actually make the outlier worse,
since it stretches small values.
Unequal variation
Many statistical procedures require that all of your subject groups have comparable
variation. If your data has unequal variation, then some of your tests and confidence
intervals may be invalid. A log transformation can help with certain types of unequal variation.
A common pattern of unequal variation is when the groups with the large means also tend
to have large standard deviations. Consider housing prices in several different neighborhoods.
In one part of town, houses might be cheap, and sell for 60 to 80 thousand dollars. In a different
neighborhood, houses might sell for 120 to 180 thousand dollars. And in the snooty part of
town, houses might sell for 400 to 600 thousand dollars. Notice that as the neighborhoods got
more expensive, the range of prices got wider. This is an example of data where groups with
large means tend to have large standard deviations.
With this pattern of variation, the log transformation can equalize the variation. The log
transformation will squeeze the groups with the larger standard deviations more than it
will squeeze the groups with the smaller standard deviations. The log transformation is
especially effective when the size of a group's standard deviation is directly proportional to the
size of its mean.
Multiplicative models
There are two common statistical models, additive and multiplicative. An additive model
assumes that factors that change your outcome measure, change it by addition or
subtraction. An example of an additive model would be when we increase the number of mail
order catalogs sent out by 1,000, and that adds an extra $5,000 in sales.
A multiplicative model assumes that factors that change your outcome measure, change it
by multiplication or division. An example of a multiplicative model would be when an inch of
rain takes half of the pollen out of the air.
In an additive model, the changes that we see are the same size, regardless of whether we are on
the high end or the low end of the scale. Extra catalogs add the same amount to our sales
regardless of whether our sales are big or small. In a multiplicative model, the changes we see
are bigger at the high end of the scale than at the low end. An inch of rain takes a lot of pollen
out on a high pollen day but proportionately less pollen out on a low pollen day.
If you remember your high school algebra, you'll recall that the logarithm of a product is equal
to the sum of the logarithms.
Therefore, a logarithm converts multiplication/division into addition/subtraction. Another way
to think about this in a multiplicative model, large values imply large changes and small values
imply small changes. The stretching and squeezing of the logarithm levels out the changes.
When should you consider a log transformation?
There are several situations where a log transformation should be given special consideration.
Is your data bounded below by zero? When your data are bounded below by zero, you often
have problems with skewness. The bound of zero prevents outliers on the low end, and
constrains the left tail of the distribution to be tightly packed. Also groups with means close to
zero are more constrained (hence less variable) than groups with means far away from zero.
It does matter how close you are to zero. If your mean is within a standard deviation or two of
zero, then expect some skewness. After all, the bell-shaped curve, which spreads out about three
standard deviations on either side, would crash into zero and cause a traffic jam in the left tail.
Is your data defined as a ratio? Ratios tend to be skewed by their very nature. They also tend
to have models that are multiplicative.
Is the largest value in your data more than three times larger than the smallest value? The
relative stretching and squeezing of the logarithm only has an impact if your data has a wide
range. If the maximum of your data is not at least three times as big as your minimum, then the
logarithm can't squeeze and stretch your data enough to have any useful impact.
Example
The DM/DX ratio is a measure of how rapidly the body metabolizes certain types of
medication. A patient is given a dose of dextrometorphan (DM), a common cough medication.
The patient's urine is collected for four hours, and the concentrations of DM and DX (a
metabolite of dextrometorphan) are measured. The ratio of DM concentration to DX is a
measure of how well the CYP 2D6 metabolic pathway functions. A ratio less than 0.3 indicates
normal metabolism; larger ratios indicate slow metabolism.
Genetics can influence CYP 2D6 metabolism. In this set of 206 patients, we have 15 with no
functional alleles and 191 with one or more functional alleles.
The DM/DX ratio is a good candidate for a log transformation since it is bounded below by
zero. It is also obviously a ratio. The standard deviation for this data (0.4) is much larger than
the mean (0.1).
Finally, the largest value is several orders of magnitude bigger than the smallest value.
Skewness
The boxplots below show the original (untransformed) data for the 15 patients with no
functional alleles. The graph also shows the log transformed data. Notice that the untransformed
data shows quite a bit of skewness. The lower whisker and the lower half of the box are tightly
packed, while the upper whisker and the upper half of the box are spread widely.
The log transformed data, while not perfectly symmetric, does tend to have a better balance
between the lower half and the upper half of the distribution.
Outliers
The graph below shows the untransformed and log transformed data for the subset of patients
with exactly two functional alleles (n=119). The original data has two outliers which are almost
7 standard deviations above the mean. The log transformed data are not perfect, and perhaps
there is now an outlier on the low end. Nevertheless, the worst outlier is still within 4 standard
deviations of the mean. The influence of outliers is much less extreme with the log transformed
data.
Unequal variation
When we compute standard deviations for the patients with no functional alleles and the
patients with one or more functional alleles, we see that the former group has a much larger
standard deviation. This is not too surprising. The patients with no functional alleles are further
from the lower bound and thus have much more room to vary.
After a log transformation, the standard deviations are much closer.
Summary
Stumped Susan wants to understand why she should use a log transformation for her data.
Professor Mean explains that a log transformation is often useful for correcting problems with
skewed data, outliers, and unequal variation. This works because the log function squeezes
the large values of your data together and stretches the small values apart. The log
transformation is also useful when you believe that factors have a multiplicative effect. You
should consider a log transformation when your data are bounded below by zero, when your data
are defined as a ratio, and/or when the largest value in your data is at least three times as big as
the smallest value.
Related pages in Stats
Stats: Geometric mean
Further reading
Keene ON. The log transformation is special. Stat Med 1995; 14(8): 811-9. [Medline]
Confidence Intervals Involving Logarithmically Transformed Data 2/18/05 10:52
- 1 (3) -
http://www.tufts.edu/~gdallal/ci_logs.htm
Confidence Intervals Involving Data
to Which a Logarithmic Transformation Has Been Applied
These data were originally presented in Simpson J, Olsen A, and Eden J
(1975), "A Bayesian Analysis of a Multiplicative Treatment effect in
Weather Modification," Technometrics, 17, 161-166, and subsequently
reported and analyzed by Ramsey FL and Schafer DW (1997), The
Statistical Sleuth: A Course in Methods of Data Analysis. Belmont, CA:
Duxbury Press. They involve an experiment performed in southern
Florida between 1968 and 1972. An aircraft was flown through a series
of clouds and, at random, seeded some of them with massive amounts of
silver iodide. Precipitation after the aircraft passed through was measured
in acre-feet.
The distribution of precipitation within group (seeded or not) is positively
skewed (long-tailed to the right). The group with the higher mean has a
proportionally larger standard deviation as well. Both characteristics
suggest that a logarithmic transformation be used to make the data more
symmetric and homoscedastic (more equal spread). The second pair of
box plots bears this out. This transformation will tend to make CIs more
reliable, that is, the level of confidence is more likely to be what is
claimed.
Rainfall (original scale, acre-feet):

                 N    Mean   Std. Deviation   Median
    Not Seeded  26   164.6            278.4     44.2
    Seeded      26   442.0            650.8    221.6

LOG_RAIN (common logarithms):

                 N    Mean   Std. Deviation   Geometric Mean
    Not Seeded  26  1.7330            .7130            54.08
    Seeded      26  2.2297            .6947           169.71

95% Confidence Interval for the Mean Difference, Seeded - Not Seeded (logged data):

                                   Lower    Upper
    Equal variances assumed       0.1046   0.8889
    Equal variances not assumed   0.1046   0.8889
Researchers often transform data back to the original scale when a logarithmic
transformation is applied to a set of data. Tables might include Geometric Means,
which are the anti-logs of the mean of the logged data. When data are positively
skewed, the geometric mean is invariably less than the arithmetic mean. This leads
to questions of whether the geometric mean has any interpretation other than as the
anti-log of the mean of the log transformed data.
The geometric mean is often a good estimate of the original median. The
logarithmic transformation is monotonic, that is, data are ordered the same way in
the log scale as in the original scale. If a is greater than b, then log(a) is greater
than log(b). Since the observations are ordered the same way in both the original
and log scales, the observation in the middle in the original scale is also the
observation in the middle in the log scale, that is,
the log of the median = the median of the logs
If the log transformation makes the population symmetric, then the population
mean and median are the same in the log scale. Whatever estimates the mean also
estimates the median, and vice-versa. The mean of the logs estimates both the
population mean and median in the log transformed scale. If the mean of the logs
estimates the median of the logs, its anti-log--the geometric mean--estimates the
median in the original scale!
The median rainfall for the seeded clouds is 221.6 acre-feet. In the picture, the solid line between the two
histograms connects the median in the original scale to the mean in the log-transformed scale.
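The "median of the logs equals the log of the median" argument can be illustrated with a tiny example (the rainfall values below are hypothetical, not the Simpson et al. data):

```python
import math
import statistics

# Hypothetical rainfall values (acre-feet); any positive data would do.
rain = [4.1, 44.2, 221.6, 345.5, 978.0]
log_rain = [math.log10(v) for v in rain]

# log is monotonic, so the middle observation stays in the middle:
# the log of the median equals the median of the logs.
log_of_median = math.log10(statistics.median(rain))
median_of_logs = statistics.median(log_rain)
print(math.isclose(log_of_median, median_of_logs))  # True

# The geometric mean (anti-log of the mean of the logs) therefore
# estimates the median whenever the logged data are symmetric.
geometric_mean = 10 ** statistics.mean(log_rain)
print(round(geometric_mean, 1))
```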
One property of the logarithm is that "the difference between logs is the log of the ratio", that is, log(x)-
log(y)=log(x/y). The confidence interval from the logged data estimates the difference between the
population means of log transformed data, that is, it estimates the difference between the logs of the
geometric means. However, the difference between the logs of the geometric means is the log of the ratio of
the geometric means. The anti-logarithms of the end points of this confidence interval give a confidence
interval for the ratio of geometric means itself. Since the geometric mean is sometimes an estimate of the
median in the original scale, it follows that a confidence interval for the ratio of geometric means is approximately a
confidence interval for the ratio of the medians in the original scale.
In the (common) log scale, the mean difference between seeded and unseeded clouds is 0.4967. Our best
estimate of the ratio of the median rainfall of seeded clouds to that of unseeded clouds is 10^0.4967 = 3.14.
Our best estimate of the effect of cloud seeding is that it produces 3.14 times as much rain on average as not
seeding.
Even when the calculations are done properly, the conclusion is often misstated.
The difference 0.4967 does not mean seeded clouds produce 0.4967 acre-feet more rain than unseeded
clouds. It is also improper to say that seeded clouds produce 0.4967 log-acre-feet more than unseeded
clouds.
The 3.14 means 3.14 times as much. It does not mean 3.14 times more (which would be 4.14 times as
much). It does not mean 3.14 acre-feet more. It is a ratio and has to be described that way.
The 95% CI for the population mean difference (Seeded - Not Seeded) is (0.1046, 0.8889). For reporting
purposes, this CI should be transformed back to the original scale. A CI for a difference in the log scale
becomes a CI for a ratio in the original scale.
The antilogarithms of the endpoints of the confidence interval are 10^0.1046 = 1.27 and 10^0.8889 = 7.74.
Thus, the report would read: "The geometric mean of the amount of rain produced by a seeded cloud is 3.14
times as much as that produced by an unseeded cloud (95% CI: 1.27 to 7.74 times as much)." If the logged
data have a roughly symmetric distribution, you might go so far as to say, "The median amount of rain...is
approximately..."
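The arithmetic in the cloud-seeding report is a one-line back-transform; note the use of base 10, since common logs were used:

```python
import math

# Mean difference (seeded - unseeded) of the common (base-10) logs,
# with its 95% confidence interval, from the cloud-seeding example.
diff_log10 = 0.4967
ci_log10 = (0.1046, 0.8889)

# Back-transform with 10**x since the logs were base 10.
ratio = 10 ** diff_log10                 # about 3.14 times as much rain
ci_ratio = tuple(10 ** v for v in ci_log10)

print(round(ratio, 2))                        # 3.14
print(tuple(round(v, 2) for v in ci_ratio))   # (1.27, 7.74)
```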
Comment: The logarithm is the only transformation that produces results that can be cleanly expressed in
terms of the original data. Other transformations, such as the square root, are sometimes used, but it is
difficult to restate their results in terms of the original data.
Copyright 2000 Gerard E. Dallal
Last modified: Mon Sep 30 2002 14:15:42.
2.9 Standard deviations and standard errors revisited
The use of standard deviations and standard errors in reports and publications can be confusing. Here are
some typical questions asked by students about these two concepts.
- I am confused about why different graphs in different publications display the mean ± 1 standard deviation; the mean ± 2 standard deviations; the mean ± 1 se; or the mean ± 2 se. When should each graph be used?

- What is the difference between a box-plot; ± 2 se; and ± 2 standard deviations?
The foremost distinction between the use of standard deviation and standard errors can be made as
follows:
Standard deviations should be used when information about INDIVIDUAL observations is to be
conveyed; standard errors should be used when information about the precision of an estimate
is to be conveyed.
There are in fact, several common types of graphs that can be used to display the distribution of the
INDIVIDUAL data values. Common displays from "closest to raw data" to "based on summary statistics"
are:
- dot plots
- stem and leaf plots
- histograms
- box plots
- mean ± 1 std dev. NOTE: this is NOT the same as the estimate ± 1 se
- mean ± 2 std dev. NOTE: this is NOT the same as the estimate ± 2 se
The dot plot is a simple plot of the actual raw data values (e.g. that seen in JMP when the Analyze->Fit
Y-by-X platform is invoked). It is used to check for outliers and other unusual points. Often jittering is used
to avoid overprinting any duplicate data points. It is useful for up to about 100 data points. Here is an example
of a dot plot of air quality data in several years:
Stem and leaf plots and histograms are similar. Both first start by creating bins representing ranges
of the data (e.g. 0-4.9999, 5-9.9999, 10-15.9999, etc.). Then the number of data points in each bin
is tabulated. The display shows the number or the frequency in each bin. The general shape of the data is
examined (e.g. is it symmetrical, or skewed, etc.).
Here are two examples of histograms and stem-and-leaf charts:
The box-plot is an alternate method of displaying the individual data values. The box portion displays
the 25th, 50th, and 75th percentiles[14] of the data. The definition of the extent of the whiskers depends upon
the statistical package, but they generally stretch to show the "typical" range to be expected from the data. Outliers
may be indicated in some plots.
The box-plot is an alternative (and in my opinion a superior) display to a graph showing the mean ± 2
standard deviations because it conveys more information. For example, a box plot will show if the data are
symmetric (25th, 50th, and 75th percentiles roughly equally spaced) or skewed (the median much closer to
one of the 25th or 75th percentiles). The whiskers show the range of the INDIVIDUAL data values. Here is
an example of side-by-side box plots of the air quality data:
[14] The p-th percentile in a data set is the value such that at least p% of the data are less than the percentile and at least (100-p)% of
the data values are greater than the percentile. For example, the median = .5 quantile = 50th percentile is the value such that at least 50%
of the data values are below the median and at least 50% of the data values are above the median. The 25th percentile = .25 quantile =
1st quartile is the value such that at least 25% of the data values are less than the value and at least 75% of the data values are greater
than this value.
The mean ± 1 STD DEV shows a range where you would expect about 68% of the INDIVIDUAL data
VALUES assuming the original data came from a normally distributed population. The mean ± 2 STD DEV
shows a range where you would expect about 95% of INDIVIDUAL data VALUES assuming the original
data came from a normally distributed population. The latter two plots are NOT RELATED to confidence
intervals! This plot might be useful when the intent is to show the variability encountered in the sampling or
the presence of outliers etc. It is unclear why many journals still accept graphs with ± 1 standard deviation,
as most people are interested in the range of the data collected, so ± 2 standard deviations would be more
useful. Here is a plot of the mean ± 1 standard deviation (plots of the mean ± 2 standard deviations are not
available in JMP):
I generally prefer the use of dot plots and box-plots as these are much more informative than stem-and-leaf plots, histograms, or the mean ± some multiple of standard deviations.
Then there are displays showing the precision of estimates. Common displays are:

- mean ± 1 SE
- mean ± 2 SE
- lower and upper bounds of confidence intervals
- diamond plots
These displays do NOT have anything to do with the sample values - they are trying to show the location
of plausible values for the unknown population parameter - in this case, the population mean. A standard
error measures how variable an estimate would likely be if repeated samples/experiments from the same
population were performed. Note that a se says NOTHING about the actual sample values! For example, it
is NOT correct to say that a 95% confidence interval contains 95% of INDIVIDUAL data values.
The mean ± 1 SE display is not very informative as it corresponds to an approximate 68% confidence
interval. The mean ± 2 SE corresponds to an approximate 95% confidence interval IN THE CASE OF
SIMPLE RANDOM SAMPLING. Here is a plot of the mean ± 1 se (a plot of the mean ± 2 se is not
available directly in JMP except as a bar above and below the mean).
Alternatively, JMP can plot confidence interval diamonds.
Graphs showing ± 1 or ± 2 standard errors are showing the range of plausible values for the underlying
population mean. It is unclear why many journals still publish graphs with ± 1 se, as this corresponds
to an approximate 68% confidence interval. I think that a 95% confidence interval, corresponding to ± 2 se,
would be more useful.
Caution. Often these graphs (e.g. created by Excel) use the simple formula for the se of the sample
mean collected under a simple random sample even if the underlying design is more complex! In this case,
the graph is in error and should not be interpreted!
Both the confidence interval and the diamond plots (if computed correctly for a particular sampling
design and estimator) correspond to a 95% confidence interval.
2.10 Other tidbits
2.10.1 Interpreting p-values
I have a question about p-values. I'm confused about the wording used when they explain
the p-value. They say with p = 0.03, "in 3 percent of experiments like this we would observe
sample means as different as or more different than the ones we got, if in fact the null hypothesis
were true." The part that gets me is the "as different as or more different than." I think I'm just
having problems putting it into words that make sense to me. Do you have another way of
saying it?
The p-value measures the unusualness of the data assuming that the null hypothesis is true. The
confusing part is how to measure unusualness.
For example, is a person 7 ft (about 2 m) tall unusual? Yes, because only a small fraction of people are
AS TALL OR TALLER.
Now if the hypothesis is 2-sided, both small and large values of the sample mean (relative to the
hypothesized value) are unusual. For example, suppose that the null hypothesis is that the mean amount in
bottles of pop is 250 mL. We would be very surprised if the sample mean was very small (e.g. 150 mL) or
very large (e.g. 350 mL).
That is why, for a two-sided test, the unusualness is "as different or more different." You aren't just
interested in the probability of getting exactly 150 or 350, but rather in the probability that the sample mean
is < 150 or > 350 (analogous to the probability of being 7 ft or taller).
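The bottle example can be turned into a small simulation (a sketch; the process standard deviation, sample size, and observed sample mean below are all invented). Under the null hypothesis the sample mean varies by chance, and the two-sided p-value is simply the fraction of null-hypothesis sample means at least as far from 250 mL as the one observed:

```python
import random
import statistics

random.seed(2)
NULL_MEAN, SIGMA, N = 250.0, 5.0, 20   # hypothetical bottling process
observed_mean = 247.0                  # hypothetical observed sample mean

# Simulate sample means assuming the null hypothesis (mean really is 250 mL).
diffs = []
for _ in range(10000):
    sim = [random.gauss(NULL_MEAN, SIGMA) for _ in range(N)]
    diffs.append(abs(statistics.mean(sim) - NULL_MEAN))

# Two-sided p-value: how often is a null-hypothesis sample mean AS FAR OR
# FARTHER from 250 mL than the observed mean (|247 - 250| = 3 mL)?
p_value = sum(d >= abs(observed_mean - NULL_MEAN) for d in diffs) / len(diffs)
print(p_value)
```

Both tails count because, before seeing the data, a sample mean of 253 mL would have been just as unusual as one of 247 mL.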
2.10.2 False positives vs. false negatives
What is the difference between a false positive and a false negative?
A false positive (Type I) error occurs if you conclude that the evidence against the hypothesis of interest
is strong, when, in fact, the hypothesis is true. For example, in a pregnancy test, the null hypothesis is that
the person is NOT pregnant. A false positive reading would occur if the test indicates a pregnancy when
in fact the person is not pregnant. A false negative (Type II) error occurs if there is insufficient evidence
against the null hypothesis when, in fact, the hypothesis is false. In the case of a pregnancy test, a false
negative would occur if the test indicates not pregnant, when in fact, the person is pregnant.
2.10.3 Specificity/sensitivity/power
Please clarify the specificity/sensitivity/power of a test. Are they the same?
The power and sensitivity are two terms for the ability to find sufficient evidence against the null
hypothesis when, in fact, the null hypothesis is false. For example, a pregnancy test with 99% sensitivity
correctly identifies a pregnancy when in fact the person is pregnant.
The specificity of a test indicates the ability to NOT find evidence against the null hypothesis when the
null hypothesis is true - the opposite of a Type I error. A pregnancy test would have high specificity if it
rarely declares a pregnancy for a non-pregnant person.
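These definitions translate directly into code. A minimal sketch (the 2 × 2 counts are hypothetical, chosen to match the 99% figures above) computing sensitivity (power), specificity, and the corresponding error rates for a pregnancy test:

```python
# Hypothetical counts from 1000 pregnancy tests (rows: truth, columns: result)
true_pos, false_neg = 198, 2     # truly pregnant:  test + / test -
false_pos, true_neg = 8, 792     # not pregnant:    test + / test -

sensitivity = true_pos / (true_pos + false_neg)    # power: P(test + | pregnant)
specificity = true_neg / (true_neg + false_pos)    # P(test - | not pregnant)
type_I_rate = false_pos / (false_pos + true_neg)   # false positive rate
type_II_rate = false_neg / (true_pos + false_neg)  # false negative rate

print(sensitivity)   # 0.99
print(specificity)   # 0.99
```

Note that sensitivity and the false negative (Type II) rate sum to 1, as do specificity and the false positive (Type I) rate.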
Chapter 3
Sampling
Contents
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.1.1 Difference between sampling and experimental design . . . . . . . . . . . . . . . 108
3.1.2 Why sample rather than census? . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.3 Principal steps in a survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.4 Probability sampling vs. non-probability sampling . . . . . . . . . . . . . . . . . 110
3.1.5 The importance of randomization in survey design . . . . . . . . . . . . . . . . . 111
3.1.6 Model vs. Design based sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.1.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2 Overview of Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2.2 Systematic Surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.2.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.2.4 Multi-stage sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.2.5 Multi-phase designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.2.6 Panel design - suitable for long-term monitoring . . . . . . . . . . . . . . . . . . 128
3.2.7 Sampling non-discrete objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.2.8 Key considerations when designing or analyzing a survey . . . . . . . . . . . . . 129
3.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.4 Simple Random Sampling Without Replacement (SRSWOR) . . . . . . . . . . . . . 131
3.4.1 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.4.2 Estimating the Population Mean . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.3 Estimating the Population Total . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.4 Estimating Population Proportions . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.4.5 Example - estimating total catch of fish in a recreational fishery . . . . . . . . . . 134
3.5 Sample size determination for a simple random sample . . . . . . . . . . . . . . . . . 141
3.5.1 Example - How many angling-parties to survey . . . . . . . . . . . . . . . . . . . 144
3.6 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.6.1 Advantages of systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.2 Disadvantages of systematic sampling . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.3 How to select a systematic sample . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.6.4 Analyzing a systematic sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.6.5 Technical notes - Repeated systematic sampling . . . . . . . . . . . . . . . . . . 149
3.7 Stratified simple random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample . . 154
3.7.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.7.3 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.7.4 Example - sampling organic matter from a lake . . . . . . . . . . . . . . . . . . . 164
3.7.5 Example - estimating the total catch of salmon . . . . . . . . . . . . . . . . . . . 168
3.7.6 Sample Size for Stratified Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 177
3.7.7 Allocating samples among strata . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.7.8 Example: Estimating the number of tundra swans. . . . . . . . . . . . . . . . . . 183
3.7.9 Post-stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.7.10 Allocation and precision - revisited . . . . . . . . . . . . . . . . . . . . . . . . . 189
3.8 Ratio estimation in SRS - improving precision with auxiliary information . . . . . . . 190
3.8.1 Summary of Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.8.2 Example - wolf/moose ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total . 201
3.9 Additional ways to improve precision . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
3.9.1 Using both stratification and auxiliary variables . . . . . . . . . . . . . . . . . . . 210
3.9.2 Regression Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.9.3 Sampling with unequal probability - pps sampling . . . . . . . . . . . . . . . . . 211
3.10 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
3.10.1 Sampling plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.10.2 Advantages and disadvantages of cluster sampling compared to SRS . . . . . . . 219
3.10.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.10.4 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.10.5 Example - estimating the density of urchins . . . . . . . . . . . . . . . . . . . . . 221
3.10.6 Example - estimating the total number of sea cucumbers . . . . . . . . . . . . . . 227
3.11 Multi-stage sampling - a generalization of cluster sampling . . . . . . . . . . . . . . . 235
3.11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.11.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
3.11.3 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
3.11.4 Example - estimating number of clams . . . . . . . . . . . . . . . . . . . . . . . 238
3.11.5 Some closing comments on multi-stage designs . . . . . . . . . . . . . . . . . . 242
3.12 Analytical surveys - almost experimental design . . . . . . . . . . . . . . . . . . . . . 242
3.13 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
3.14 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
3.14.1 Confusion about the definition of a population . . . . . . . . . . . . . . . . . . . 247
3.14.2 How is N defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
3.14.3 Multi-stage vs. Multi-phase sampling . . . . . . . . . . . . . . . . . . . . . . . . 248
3.14.4 What is the difference between a Population and a frame? . . . . . . . . . . . . . 249
3.14.5 How to account for missing transects. . . . . . . . . . . . . . . . . . . . . . . . . 249
3.1 Introduction
Today the word "survey" is used most often to describe a method of gathering information from a sample of
individuals or animals or areas. This "sample" is usually just a fraction of the population being studied.
You are exposed to survey results almost every day. For example, election polls, the unemployment rate,
or the consumer price index are all examples of the results of surveys. On the other hand, some common
headlines are NOT the results of surveys, but rather the results of experiments - for example, whether a
new drug is just as effective as an old drug.
Not only do surveys have a wide variety of purposes, they also can be conducted in many ways, including
over the telephone, by mail, or in person. Nonetheless, all surveys do have certain characteristics in
common. All surveys require a great deal of planning in order that the results are informative.
Unlike a census, where all members of the population are studied, surveys gather information from
only a portion of the population of interest, with the size of the sample depending on the purpose of the
study. Surprisingly to many people, a survey can give better quality results than a census.
In a bona fide survey, the sample is not selected haphazardly. It is scientifically chosen so that each
object in the population will have a measurable chance of selection. This way, the results can be reliably
projected from the sample to the larger population.
Information is collected by means of standardized procedures. The survey's intent is not to describe the
particular objects which, by chance, are part of the sample, but to obtain a composite profile of the population.
3.1.1 Difference between sampling and experimental design
There are two key differences between survey sampling and experimental design.
In experiments, one deliberately perturbs some part of the population to see the effect of the action. In
sampling, one wishes to see what the population is like without disturbing it.
In experiments, the objective is to compare the mean response to changes in levels of the factors. In
sampling the objective is to describe the characteristics of the population. However, refer to the section
on analytical sampling later in this chapter for when sampling looks very similar to experimental
design.
3.1.2 Why sample rather than census?
There are a number of advantages of sampling over a complete census:
reduced cost
greater speed - a much smaller scale of operations is performed
greater scope - if highly trained personnel or equipment is needed
greater accuracy - easier to train small crew, supervise them, and reduce data entry errors
reduced respondent burden
in destructive sampling you can't measure the entire population - e.g. crash tests of cars
3.1.3 Principal steps in a survey
The principal steps in a survey are:
formulate the objectives of the survey - need concise statement
define the population to be sampled - e.g. what is the range of animals or locations to be measured?
Note that the population is the set of final sampling units that will be measured - refer to the FAQ at
the end of the chapter for more information.
establish what data is to be collected - collect a few items well rather than many poorly
what degree of precision is required - examine power needed
establish the frame - this is a list of sampling units that is exhaustive and exclusive
in many cases the frame is obvious, but in others it is not
it is often very difficult to establish a frame - e.g. a list of all streams in the lower mainland.
choose among the various designs; will you stratify? There are a variety of sampling plans some of
which will be discussed in detail later in this chapter. Some common designs in ecological studies are:
simple random sampling
systematic sample
cluster sampling
multi-stage design
All designs can be improved by stratification, so this should always be considered during the design
phase.
pre-test - very important to try out field methods and questionnaires
organization of field work - training, pre-test, etc.
summary and data analysis - easiest part if earlier parts done well
post-mortem - what went well, poorly, etc.
3.1.4 Probability sampling vs. non-probability sampling
There are two types of sampling plans - probability sampling where units are chosen in a random fashion
and non-probability sampling where units are chosen in some deliberate fashion.
In probability sampling
every unit has a known probability of being in the sample
the sample is drawn with some method consistent with these probabilities
these selection probabilities are used when making estimates from the sample
The advantages of probability sampling
we can study biases of the sampling plans
standard errors and measures of precision (confidence limits) can be obtained
Some types of non-probability sampling plan include:
quota sampling - select 50 M and 50 F from the population
less expensive than a probability sample
may be only option if no frame exists
judgmental sampling - select average or typical value. This is a quick and dirty sampling method
and can perform well if there are a few extreme points which should not be included.
convenience sampling - select those readily available. This is useful if it is dangerous or unpleasant to
sample directly. For example, selecting blood samples from grizzly bears.
haphazard sampling (not the same as random sampling). This is often useful if the sampling material
is homogeneous and spread throughout the population, e.g. chemicals in drinking water.
The disadvantages of non-probability sampling include
unable to assess biases in any rational way.
no estimates of precision can be obtained. In particular, the simple use of formulae from probability
sampling is WRONG!
experts may disagree on what is the best sample.
3.1.5 The importance of randomization in survey design
[With thanks to Dr. Rick Routledge for this part of the notes.]
. . . I had to make a cover degree study. . . This involved the use of a Raunkiaer's Circle, a
device designed in hell. In appearance it was all simple innocence, being no more than a big
metal hoop; but in use it was a devil's mechanism for driving sane men mad. To use it, one
stood on a stretch of muskeg, shut one's eyes, spun around several times like a top, and then
flung the circle as far away as possible. This complicated procedure was designed to ensure that
the throw was truly random; but, in the event, it inevitably resulted in my losing sight of the
hoop entirely, and having to spend an unconscionable time searching for the thing.
Farley Mowat, Never Cry Wolf. McClelland and Stewart, 1963.
Why would a field biologist in the early post-war period be instructed to follow such a bizarre-looking
scheme for collecting a representative sample of tundra vegetation? Could she not have obtained a typical
cross-section of the vegetation by using her own judgment? Undoubtedly, she could have convinced herself
that by replacing an awkward, haphazard sampling scheme with one dependent solely on her own judgment
and common sense, she could have been guaranteed a more representative sample. But would others be
convinced? A careful, objective scientist is trained to be skeptical. She would be reluctant to accept any
evidence whose validity depended critically on the judgment and skills of a stranger. The burden of proof
would then rest squarely with Farley Mowat to prove his ability to take representative, judgmental samples.
It is typically far easier for a scientist to use randomization in her sampling procedures than it is to prove her
judgmental skills.
Hovering and Patrolling Bees
It is often difficult, if not impossible, to take a properly randomized sample. Consider, e.g., the problem
faced by Alcock et al. (1977) in studying the behavior of male bees of the species, Centris pallida, in the
deserts of south-western United States. Females pupate in underground burrows. To maximize the presence
of his genes in the next generation, a male of the species needs to mate with as many virgin females as
possible. One strategy is to patrol the burrowing area at a lowaltitude, and nab an emerging female as soon as
her presence is detected. This patrolling strategy seems to involve a relatively high risk of confrontation with
other patrolling males. The other strategy reported by the authors is to hover farther above the burrowing
area, and mate with those females who escape detection by the hoverers. These hoverers appear to be
involved in fewer conicts.
Because the hoverers tend to be less involved in aggressive confrontations, one might guess that they
would tend to be somewhat smaller than the more aggressive patrollers. To assess this hypothesis, the
authors took measurements of head widths for each of the two subpopulations. Of course, they could not
capture every single male bee in the population. They had to be content with a sample.
Sample sizes and results are reported in the Table below. How are we to interpret these results? The
sampled hoverers obviously tended to be somewhat smaller than the sampled patrollers, although it appears
from the standard deviations that some hoverers were larger than the average-sized patroller and vice-versa.
Hence, the difference is not overwhelming, and may be attributable to sampling errors.
Table: Summary of head width measurements on two samples of bees.

Sample        n     Mean (ȳ)   SD
Hoverers      50    4.92 mm    0.15 mm
Patrollers   100    5.14 mm    0.29 mm
If the sampling were truly randomized, then the only sampling errors would be chance errors, whose
probable size can be assessed by a standard t-test. Exactly how were the samples taken? Is it possible that
the sampling procedure used to select patrolling bees might favor the capture of larger bees, for example?
This issue is indeed addressed by the authors. They carefully explain how they attempted to obtain unbiased
samples. For example, to sample the patrolling bees, they made a sweep across the sampling area, attempting
to catch all the patrolling bees that they observed. To assess the potential for bias, one must in the end make
a subjective judgment.
Why make all this fuss over a technical possibility? It is important to do so because lack of attention
to such possibilities has led to some colossal errors in the past. Nowhere are they more obvious than in the
field of election prediction. Most of us never find out the real nature of the population that we are sampling.
Hence, we never know the true size of our errors. By contrast, pollsters' errors are often painfully obvious.
After the election, the actual percentages are available for everyone to see.
Lessons from Opinion Polling
In the 1930s, political opinion polling was in its formative years. The pioneers in this endeavor were
training themselves on the job. Of the inevitable errors, two were so spectacular as to make international
headlines.
In 1936, an American magazine with a large circulation, The Literary Digest, attempted to poll an
enormous segment of the American voting public in order to predict the outcome of the presidential election
that autumn. Roosevelt, the Democratic candidate, promised to develop programs designed to increase
opportunities for the disadvantaged; Landon, the candidate for the Republican Party, appealed more to the
wealthier segments of American society. The Literary Digest mailed out questionnaires to about ten million
people whose names appeared in such places as subscription lists, club directories, etc. They received over
2.5 million responses, on the basis of which they predicted a comfortable victory for Landon. The election
returns soon showed the massive size of their prediction error.
The cumbersome design of this highly publicized survey provided a young, wily pollster with the chance
of a lifetime. Between the time that the Digest announced its plans and released its predictions, George
Gallup planned and executed a remarkable coup. By polling only a small fraction of these individuals, and a
relatively small number of other voters, he correctly predicted not only the outcome of the election, but also
the enormous size of the error about to be committed by The Literary Digest.
Obviously, the enormous sample obtained by the Digest was not very representative of the population.
The selection procedure was heavily biased in favor of Republican voters. The most obvious source of bias
is the method used to generate the list of names and addresses of the people that they contacted. At the time,
only the relatively affluent could afford magazines, telephones, etc., and the more conservative policies of
the Republican Party appealed to a greater proportion of this segment of the American public. The Digest's
sample selection procedure was therefore biased in favor of the Republican candidate.
The Literary Digest was guilty of taking a sample of convenience. Samples of convenience are typically
prone to bias. Any researcher who, either by choice or necessity, uses such a sample has to be prepared
to defend his findings against possible charges of bias. As this example shows, it can have catastrophic
consequences.
How did Gallup obtain his more representative sample? He did not use randomization. Randomization
is often criticized on the grounds that once in a while, it can produce absurdly unrepresentative samples.
When faced with a sample that obviously contains far too few economically disadvantaged voters, it is small
consolation to know that next time around, the error will likely not be repeated. Gallup used a procedure that
virtually guaranteed that his sample would be representative with respect to such obvious features as age,
race, etc. He did so by assigning quotas which his interviewers were to fill. One interviewer might, e.g., be
assigned to interview 5 adult males with specified characteristics in a tough, inner-city neighborhood. The
quotas were devised so as to make the sample mimic known features of the population.
This quota sampling technique suited Gallup's needs spectacularly well in 1936, even though he
underestimated the support for the Democratic candidate by about 6%. His subsequent polls contained the
same systematic error. In 1948, the error finally caught up with him. He predicted a narrow victory for the
Republican candidate, Dewey. A newspaper editor was so confident of the prediction that he authorized the
printing of a headline proclaiming the victory before the official results were available. It turned out that the
Democrat, Truman, won by a narrow margin.
What was wrong with Gallup's sampling technique? He gave his interviewers the final decision as to
who would be interviewed. In a tough inner-city neighborhood, an interviewer had the option of passing
by a house with several motorcycles parked out in front and sounds of a raucous party coming from within.
In the resulting sample, the more conservative (Republican) voters were systematically over-represented.
Gallup learned from his mistakes. His subsequent surveys replaced interviewer discretion with an
objective, randomized scheme at the final stage of sample selection. With the dominant source of systematic
error removed, his election predictions became even more reliable.
Implications for Biological Surveys
The bias in samples of convenience can be enormous. It can be surprisingly large even in what appear to
be carefully designed surveys. It can easily exceed the typical size of the chance error terms. To completely
remove the possibility of bias in the selection of a sample, randomization must be employed. Sometimes
this is simply not possible, as, for example, appears to be the case in the study on bees. When this happens
and the investigators wish to use the results of a nonrandomized sample, then the final report should discuss
the possibility of selection bias and its potential impact on the conclusions.
Furthermore, when reading a report containing the results of a survey, it is important to carefully evaluate
the survey design, and to consider the potential impact of sample selection bias on the conclusions.
Should Farley Mowat really have been content to take his samples by tossing Raunkiaer's Circle to the
winds? Definitely not, for at least two reasons. First, he had to trust that by tossing the circle, he was
generating an unbiased sample. It is not at all certain that some types of vegetation would not be selected with
a higher probability than others. For example, the higher shrubs would tend to intercept the hoop earlier in its
descent than would the smaller herbs. Second, he had no guarantee that his sample would be representative
with respect to the major habitat types. Leaving aside potential bias, it is possible that the circle could, by
chance, land repeatedly in a snowbed community. It seems indeed foolish to use a sampling scheme which
admits the possibility of including only snowbed communities when tundra bogs and fellfields may be equally
abundant in the population. In subsequent chapters, we shall look into ways of taking more thoroughly
randomized surveys, and into schemes for combining judgment with randomization to eliminate both
selection bias and the potential for grossly unrepresentative samples. There are also circumstances in which
a systematic sample (e.g., taking transects every 200 meters along a rocky shoreline) may be justifiable, but
this subject is not discussed in these notes.
3.1.6 Model vs. Design based sampling
Model-based sampling starts by assuming some sort of statistical model for the data in the population and
the goal is to select data to estimate the parameters of this distribution. For example, you may be willing to
assume that the distribution of values in the population is log-normally distributed. The data collected from
the survey are then used along with a likelihood function to estimate the parameters of the distribution.
Model-based sampling is very powerful because you are willing to make strong assumptions about the
data-generating process. However, if your model is wrong, there are big problems. For example, what if you
assume log-normality but the data are not log-normally distributed? In these cases, the estimates of the
parameters can be extremely biased and inefficient.
Design-based sampling makes no assumptions about the distribution of data values in the population.
Rather, it relies upon the randomization procedure to select representative elements of the population.
Estimates from design-based methods are unbiased regardless of the distribution of values in the population,
but in strange populations they can also be inefficient. For example, if a population is highly clustered, a
random sample of quadrats will end up with mostly zero observations and a few large values, and the
resulting estimates will have a large standard error.
Most of the results in this chapter on survey sampling are design-based, i.e. we don't need to make any
assumptions about normality in the population for the results to be valid.
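The design-based claim is easy to demonstrate by simulation. In the sketch below (the log-normal population is invented; nothing is assumed beyond simple random sampling), the average of repeated SRS sample means sits essentially on the true population mean even though the population is strongly skewed:

```python
import random
import statistics

random.seed(3)
# A decidedly non-normal (log-normal) population of 10,000 values.
population = [random.lognormvariate(0, 1) for _ in range(10000)]
true_mean = statistics.mean(population)

# Design-based view: under SRS the sample mean is unbiased for the
# population mean regardless of the shape of the distribution.
estimates = [statistics.mean(random.sample(population, 50)) for _ in range(3000)]

print(round(statistics.mean(estimates), 2))  # close to true_mean
print(round(true_mean, 2))
```

The unbiasedness comes from the randomization itself, not from any model for the data; the spread of the 3000 estimates, however, would be large if the population were highly clustered, which is the inefficiency noted above.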
3.1.7 Software
For a review of packages that can be used to analyze survey data please refer to the article at
http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.
CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES Standard statistical
software packages generally do not take into account four common characteristics of sample survey data:
(1) unequal probability selection of observations, (2) clustering of observations, (3) stratification, and (4)
nonresponse and other adjustments. Point estimates of population parameters are impacted by the value
of the analysis weight for each observation. These weights depend upon the selection probabilities and
other survey design features such as stratification and clustering. Hence, standard packages will yield biased
point estimates if the weights are ignored. The estimated standard errors based on sample survey data
are impacted by clustering, stratification, and the weights. By ignoring these aspects, standard packages
generally underestimate the standard error, sometimes substantially so.
Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement
added to the program code. Use of standard statistical packages with a weighting variable may yield the
same point estimates for population parameters as sample survey software packages. However, the estimated
standard error often is not correct and can be substantially wrong, depending upon the particular program
within the standard software package.
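As a minimal illustration of why the weights matter (all numbers invented), suppose a small stratum is deliberately over-sampled. The unweighted mean is badly pulled toward the over-sampled stratum, while weighting each observation by the number of population units it represents gives a sensible design-weighted estimate:

```python
# Hypothetical stratified sample: stratum A has 900 population units, B has 100.
# We sampled 10 units from each, so each A observation "represents" 90 units
# and each B observation represents 10 units.
sample_A = [5.0, 6.0, 5.5, 4.8, 5.2, 6.1, 5.9, 5.4, 5.6, 5.5]        # invented
sample_B = [20.0, 22.0, 19.5, 21.0, 20.5, 21.5, 19.0, 20.8, 21.2, 20.5]

values = sample_A + sample_B
weights = [90.0] * 10 + [10.0] * 10   # population units per sampled unit

unweighted = sum(values) / len(values)
weighted = sum(w * y for w, y in zip(weights, values)) / sum(weights)

print(round(unweighted, 2))  # 13.05 - pulled toward the over-sampled stratum B
print(round(weighted, 2))    # 7.01  - design-weighted estimate
```

Note this only fixes the point estimate; as the text warns, getting the correct standard error still requires software that knows about the stratified design.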
For further information about the problems of using standard statistical software packages in survey
sampling please refer to the article at
http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.
Fortunately, for simple surveys, we can often do the analysis using standard software as will be shown
in these notes. Many software packages also have specialized software and, if available, these will be
demonstrated.
SAS includes many survey design procedures as shown in these notes.
3.2 Overview of Sampling Methods
3.2.1 Simple Random Sampling
This is the basic method of selecting survey units. Each unit in the population is selected with equal
probability and all possible samples are equally likely to be chosen. This is commonly done by listing all the
members in the population (the set of sampling units) and then choosing units using a random number table.
An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is
divided into 480 one-hectare plots, and a random sample of 24 plots was selected and analyzed using aerial
photos. The map of the units selected might look like:
© 2012 Carl James Schwarz 115 December 21, 2012
CHAPTER 3. SAMPLING

[Figure: map of the 480 plots showing the 24 randomly selected units]
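As a small aside (not from the notes themselves), the selection step can be sketched in Python; the plot labels 1 to 480 are hypothetical:

```python
import random

# Sketch: draw a simple random sample of 24 of the 480 one-hectare plots,
# without replacement, so every subset of 24 plots is equally likely.
random.seed(2012)                    # fixed seed only to make the example reproducible
plots = range(1, 481)                # hypothetical plot labels 1..480
sample = random.sample(plots, k=24)  # selection without replacement
print(sorted(sample))
```

In practice the frame would be the list of actual plot identifiers rather than the labels assumed here.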
Units are usually chosen without replacement, i.e., each unit in the population can only be chosen once.
In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement,
i.e. a unit in the population may potentially be selected more than once. The analysis of a simple random
sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the
population total is obtained by multiplying the sample mean by the number of units in the population. The
sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds
5%, an adjustment (the finite population correction) will result in better estimates of precision (a reduction
in the standard error) to account for the fact that a substantial fraction of the population was surveyed.
A simple random sample design is often hidden in the details of many other survey designs. For
example, many surveys of vegetation are conducted using strip transects where the initial starting point of
the transect is randomly chosen, and then every plot along the transect is measured. Here the strips are the
sampling unit, and are a simple random sample from all possible strips. The individual plots are subsamples
from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand
is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and
the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically
analyzed section-by-section, but it would be incorrect to treat the smaller parts as a simple random sample
from the entire stand.
Note that a crucial element of simple random samples is that every sampling unit is chosen independently
of every other sampling unit. For example, in strip transects plots along the same transect are not chosen
independently - when a particular transect is chosen, all plots along the transect are sampled and so the
selected plots are not a simple random sample of all possible plots. Strip-transects are actually examples of
cluster-samples. Cluster samples are discussed in greater detail later in this chapter.
3.2.2 Systematic Surveys
In some cases, it is logistically inconvenient to randomly select sample units from the population. An
alternative is to take a systematic sample where every k-th unit is selected (after a random starting point);
k is chosen to give the required sample size. For example, if a stream is 2 km long, and 20 samples are
required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point.
A common alternative when the population does not naturally divide into discrete units is grid-sampling.
Here sampling points are located using a grid that is randomly located in the area. All sampling points are a
fixed distance apart.
An example of a systematic sample would be a vegetation survey in a large forest stand. The stand is
divided into 480 one-hectare plots. As a total sample size of 24 is required, this implies that we need to
sample every 480/24 = 20th plot. We pick a random starting point (the 9th plot in the first row), and then
every 20 plots reading across rows. The final plan could look like:
[Figure: map of the 480 plots showing the systematic sample of 24 units]
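The selection rule just described can be sketched in Python (a sketch, not the notes' own code; plots are assumed to be labelled 1 to 480 in reading order, so the random start will generally differ from the 9th plot of the example):

```python
import random

# Sketch: systematic sample of n = 24 plots from N = 480.
random.seed(1)
N, n = 480, 24
k = N // n                             # sampling interval, k = 480/24 = 20
start = random.randint(1, k)           # random starting plot within the first interval
sample = list(range(start, N + 1, k))  # then every k-th plot
print(sample)
```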
If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977,
Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to
directly influence the response variable. A regression-type correction can be incorporated into the analysis.
However, note that this trend must be known from external sources - it cannot be deduced from the survey.
Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample.
However, the true precision of an estimator from a systematic sample can be either worse or better than a
simple random sample of the same size, depending on whether units within the systematic sample are positively or
negatively correlated among themselves. For example, if a systematic sample's sampling interval happens to
match a cyclic pattern in the population, values within the systematic sample are highly positively correlated
(the sampled units may all hit the peaks of the cyclic trend), and the true sampling precision is worse than a
SRS of the same size. What is even more unfortunate is that because the units are positively correlated within
the sample, the sample variance will underestimate the true variation in the population, and if the estimated
precision is computed using the formula for a SRS, a double dose of bias in the estimated precision occurs
(Krebs, 1989, p.227). On the other hand, if the systematic sample is arranged perpendicular to a known
trend to try and incorporate additional variability in the sample, the units within a sample are now negatively
correlated, the true precision is now better than a SRS sample of the same size, but the sample variance now
overestimates the population variance, and the formula for precision from a SRS will overstate the sampling
error. While logistically simpler, a systematic sample is only equivalent to a simple random sample of the
same size if the population units are in random order to begin with. (Krebs, 1989, p. 227). Even worse,
there is no information in the systematic sample that allows the manager to check for hidden trends and
cycles.
Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be
made to the bias in the estimated precision:
it is easier to relocate plots for long term monitoring
mapping can be carried out concurrently with the sampling effort because the ground is systematically
traversed. This is less of an issue now with GPS as the exact position can easily be recorded and the
plots revisited later.
it avoids the problem of poorly distributed sampling units which can occur with a SRS [but this can
also be avoided by judicious stratification.]
Solution: Because of the necessity for a strong assumption of randomness in the original population,
systematic samples are discouraged and statistical advice should be sought before starting such a scheme. If
there are no other feasible designs, a slight variation in the systematic sample provides some protection from
the above problems. Instead of taking a single systematic sample every k-th unit, take 2 or 3 independent
systematic samples of every 2k-th or 3k-th unit, each with a different starting point. For example, rather than
taking a single systematic sample every 100 m along the stream, two independent systematic samples can
be taken, each selecting units every 200 m along the stream starting at two random starting points. The total
sample effort is still the same, but now some measure of the large scale spatial structure can be estimated.
This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
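A sketch of replicated sub-sampling for the stream example, under the assumption that positions are metres along the 2 km stream:

```python
import random

# Sketch: two independent systematic samples every 2k = 200 m instead of
# one systematic sample every k = 100 m; total effort stays the same.
random.seed(4)
length, k = 2000, 100
replicates = []
for _ in range(2):
    start = random.randint(1, 2 * k)                     # each replicate gets its own random start
    replicates.append(list(range(start, length + 1, 2 * k)))
total_units = sum(len(r) for r in replicates)
print(total_units)  # 20 units in total, as with a single sample every 100 m
```

The variation between the two replicates gives the measure of large-scale spatial structure mentioned above.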
3.2.3 Cluster sampling
In some cases, units in a population occur naturally in groups or clusters. For example, some animals
congregate in herds or family units. It is often convenient to select a random sample of herds and then
measure every animal in the herd. This is not the same as a simple random sample of animals because
individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example
in the section on simple random sampling is also a cluster sample; all plots along a randomly selected
transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling
units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected
point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.
The reason cluster samples are used is that costs can be reduced compared to a simple random sample
giving the same precision. Because units within a cluster are close together, travel costs among units are re-
duced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable
simple random sample.
For example, consider the vegetation survey of previous sections. The 480 plots can be divided into 60
clusters of size 8. A total sample size of 24 is obtained by randomly selecting three clusters from the 60
clusters present in the map, and then surveying ALL eight members of the selected clusters. A map of the
design might look like:
[Figure: map of the 480 plots showing the three selected clusters of eight plots each]
Alternatively, clusters are often formed when a transect sample is taken. For example, suppose that the
vegetation survey picked an initial starting point on the left margin, and then flew completely across the
landscape in a straight line measuring all plots along the route. A map of the design might look like:
[Figure: map of the three transects flown across the stand, with clusters of unequal size]
In this case, there are three clusters chosen from a possible 30 clusters and the clusters are of unequal size
(the middle cluster only has 12 plots measured compared to the 18 plots measured on the other two transects).
Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This
is not valid because units within a cluster are typically positively correlated. The effect of this erroneous
analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated
standard error is too small and does not fully reflect the actual imprecision in the estimate.
Solution: In order to be confident that the reported standard error really reflects the uncertainty of
the estimate, it is important that the analytical methods are appropriate for the survey design. The proper
analysis treats the clusters as a random sample from the population of clusters. The methods of simple
random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12).
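This proper analysis can be sketched in Python with made-up plot values for three sampled clusters of eight plots (3 of 60 clusters, as in the vegetation example); the SRS formulas are applied to the cluster means, not to the 24 individual plots:

```python
import statistics

# Made-up plot values for the three selected clusters (8 plots each).
cluster_data = [
    [12.1, 11.8, 12.5, 12.0, 11.9, 12.2, 12.4, 11.7],
    [ 9.5,  9.9, 10.1,  9.8, 10.0,  9.7,  9.6, 10.2],
    [14.0, 13.8, 14.2, 13.9, 14.1, 13.7, 14.3, 13.6],
]
cluster_means = [statistics.mean(c) for c in cluster_data]
est_mean = statistics.mean(cluster_means)      # estimate of the population mean
s2 = statistics.variance(cluster_means)        # variance AMONG the cluster means
n_clus, N_clus = len(cluster_data), 60
f = n_clus / N_clus                            # sampling fraction of clusters
se = (s2 / n_clus * (1 - f)) ** 0.5            # clusters are the sampled units
print(round(est_mean, 3), round(se, 3))
```

Treating the 24 plots as a simple random sample here would use the much smaller plot-to-plot variance and understate the standard error.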
3.2.4 Multi-stage sampling
In many situations, there are natural divisions of the population into several different sizes of units. For
example, a forest management unit consists of several stands, each stand has several cutblocks, and each
cutblock can be divided into plots. These divisions can be easily accommodated in a survey through the use
of multi-stage methods. Selection of units is done in stages. For example, several stands could be selected
from a management area; then several cutblocks are selected in each of the chosen stands; then several plots
are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any stage are selected
at random only from those larger units selected in previous stages.
Again consider the vegetation survey of previous sections. The population is again divided into 60
clusters of size 8. However, rather than surveying all units within a cluster, we decide to survey only two
units within each cluster. Hence, we now sample, at the first stage, a total of 12 clusters out of the 60. In
each cluster, we randomly sample 2 of the 8 units. A sample plan might look like the following where the
rectangles indicate the clusters selected, and the checks indicate the sub-sample taken from each cluster:
[Figure: map showing the 12 selected clusters (rectangles) and the two sub-sampled plots (checks) in each]
The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample
of the same size, primarily through improved logistics. The precision of the results is worse than an equiva-
lent simple random sample, but because costs are less, a larger multi-stage survey can often be done for the
same costs as a smaller simple random sample. This often results in a more precise estimate for the same
cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred
and end up being more cost efficient when costs associated with incorrect decisions are incorporated.
Pitfall: Although random selections are made at each stage, a common error is to analyze these types
of surveys as if they arose from a simple random sample. The plots were not independently selected; if a
particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster
samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and
do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate
than is justified by the survey.
Solution: Again, it is important that the analytical methods are suitable for the sampling design. The
proper analysis of multi-stage designs takes into account that random sampling takes place at each stage
(Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by
the number of first stage units selected. Little is gained by extensive sampling at lower stages.
3.2.5 Multi-phase designs
In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of
units is selected (usually by a simple random sample). Every unit is measured on some variable. Then in
subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the
entire population.
For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple
random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second phase
sample of 6 units (circled below) is then measured using ground based methods.
[Figure: map of the 24 first-phase plots, with the 6 second-phase plots circled]
Multiphase designs are commonly used in two situations. First, it is sometimes difficult to stratify a
population in advance because the values of the stratification variables are not known. The first phase is
used to measure the stratification variable on a random sample of units. The selected units are then stratified,
and further samples are taken from each stratum as needed to measure a second variable. This avoids having
to measure the second variable on every unit when the strata differ in importance. For example, in the first
phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the
amount of damage, and second phase allocation of units concentrates on plots with low insect damage to
measure total usable volume of wood. It would be wasteful to measure the volume of wood on plots with
much insect damage.
The second common occurrence is when it is relatively easy to measure a surrogate variable (related to
the real variable of interest) on selected units, and then in the second phase, the real variable of interest is
measured on a subset of the units. The relationship between the surrogate and desired variable in the smaller
sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example,
managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging
trucks is weighed (which is easy to do), and weight will serve as a surrogate variable for volume. A smaller
sample of trucks (selected from those weighed) is scaled for volume and the relationship between volume
and weight from the second phase sample is used to predict volume based on weight only for the first phase
sample. Another example is the count plot method of estimating volume of timber in a stand. A selection of
plots is chosen and the basal area determined. Then a sub-selection of plots is rechosen in the second phase,
and volume measurements are made on the second phase plots. The relationship between volume and area
in the second phase is used to predict volume from the area measurements made in the first phase.
3.2.6 Panel design - suitable for long-term monitoring
One common objective of long-term studies is to investigate changes over time of a particular population.
There are three common designs.
First, separate independent surveys can be conducted at each time point. This is the simplest design
to analyze because all observations are independent over time. For example, independent surveys can be
conducted at five year intervals to assess regeneration of cutblocks. However, precision of the estimated
change may be poor because of the additional variability introduced by having new units sampled at each
time point.
At the other extreme, units are selected in the rst survey, permanent monitoring stations are established
and the same units are remeasured over time. For example, permanent study plots can be established that
are remeasured for regeneration over time. Ceteris paribus (all else being equal), this design is more
efficient (i.e. has higher power) than the previous design. The advantage of permanent study plots
occurs because in comparisons over time, the effects of that particular monitoring site tend to cancel out and
so estimates of variability are free of additional variability introduced by new units being measured at every
time point. One possible problem is that survey units may become damaged over time, and the sample
size will tend to decline over time resulting in a loss of power. Additionally, an analysis of these types of
designs is more complex because of the need to account for the correlation over time of measurements on
the same sample plot and the need to account for possible missing values when units become damaged and
are dropped from the study.
A compromise between these two designs is the partial replacement design, or panel design. In these
designs, a portion of the survey units are replaced with new units at each time point. For example, 1/5 of
the units could be replaced by new units at each time point - units would normally stay in the study for a
maximum of 5 time periods. This design combines the advantages of repeatedly measuring semi-permanent
monitoring stations with the ability to replace (or refresh) the sample if units become damaged or are lost.
The analysis of these designs is non-trivial, but manageable with modern software.
3.2.7 Sampling non-discrete objects
In some cases, the population does not have natural discrete sampling units. For example, a large section of
land may be arbitrarily divided into 1 m² plots, or 10 m² plots. A natural question to ask is what is the best
size of unit. This has no simple answer and depends upon several factors which must be addressed for each
survey:
Cost. All else being equal, sampling many small plots may be more expensive than sampling fewer
larger plots. The primary difference in cost is the overhead in traveling and setup to measure the unit.
Size of unit. An intuitive feeling is that many smaller plots are better than a few large plots because the
sample size is larger. This will be true if the characteristic of interest is patchy, but surprisingly, it
makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p.
64). Indeed if the characteristic shows avoidance, then larger plots are better. For example, competition
among trees implies they are spread out more than expected if they were randomly located.
Logistic considerations often influence the plot size. For example, if trampling the soil affects the
response, then sample plots must be small enough to measure without trampling the soil.
Edge effects. Because the population does not have natural boundaries, decisions often have to be
made about objects that lie on the edge of the sample plot. In general larger square or circular plots
are better because of smaller edge-to-area ratio. [A large narrow rectangular plot can have more edge
than a similar area square plot.]
Size of object being measured. Clearly a 1 m² plot is not appropriate when counting mature Douglas-fir,
but may be appropriate for a lichen survey.
A pilot study should be carried out prior to a large scale survey to investigate factors that influence the
choice of sampling unit size.
3.2.8 Key considerations when designing or analyzing a survey
Key considerations when designing a survey are
what are the objectives of the survey?
what is the sampling unit? This should be carefully distinguished from the observational unit. For
example, you may sample boats returning from fishing, but interview the individual anglers on the
boat.
What frame is available (if any) for the sampling units? If a frame is available, then direct sampling
can be used where the units can be numbered and the randomization used to select the sampling units.
If no frame is available, then you will need to figure out how to identify the units and how to select
them on the fly. For example, there is no frame of boats returning to an access point, so perhaps a
systematic survey of every 5th boat could be used.
Are all the sampling units the same size? If so, then a simple random sample (or variant thereof)
is likely a suitable design. If the units vary considerably in size, then an unequal probability design
may be more suitable. For example, if your survey units are forest polygons (as displayed on a GIS),
these polygons vary considerably in size with many smaller polygons and fewer larger polygons. A
design that selects polygons with a probability proportional to the size of the polygon may be more
suited than a simple random sample of polygons.
Decide upon the sampling design used (i.e. simple random sample, or cluster sample, or multi-stage
design, etc.) The availability of the frame and the existence of different sized sampling units will often
dictate the type of design used.
What precision is required for the estimate? This (along with the variability in the response) will
determine the sample size needed.
If you are not stratifying your design, then why not? Stratification is a low-cost or no-cost way to
improve your survey.
When analyzing a survey, the key steps are:
Recognize the design that was used to collect the data. Key pointers to help recognize various designs
are:
How were the units selected? A true simple random sample makes a list of all possible items
and then chooses from that list.
Is there more than one size of sampling unit? For example, were transects selected at random,
and then quadrats within samples selected at random? This is usually a multi-stage design.
Is there a cluster? For example, transects are selected, and these are divided into a series of
quadrats - all of which are measured.
Any analysis of the data must use a method that matches the design used to collect the data!
Are there missing values? How did they occur? If the missingness is MCAR, then life is easy and the
analysis proceeds with a reduced sample size. If the missingness is MAR, then some reweighting of
the observed data will be required. If the missingness is IM, seek help - this is a difficult problem.
Use a suitable package to analyze the results (avoid Excel except for the simplest designs!).
Report both the estimate and the measure of precision (the standard error).
c 2012 Carl James Schwarz 130 December 21, 2012
CHAPTER 3. SAMPLING
3.3 Notation
Unfortunately, sampling theory has developed its own notation that is different than that used for design of
experiments or other areas of statistics even though the same concepts are used in both. It would be nice to
adopt a general convention for all of statistics - maybe in 100 years this will happen.
Even among sampling textbooks, there is no agreement on notation! (sigh).
In the table below, I've summarized the usual notation used in sampling theory. In general, large letters
refer to population values, while small letters refer to sample values.
Characteristic       Population value                       Sample value
number of elements   N                                      n
units                Y_i                                    y_i
total                τ = Σ Y_i (i = 1..N)                   y = Σ y_i (i = 1..n)
mean                 μ = (1/N) Σ Y_i (i = 1..N)             ȳ = (1/n) Σ y_i (i = 1..n)
proportion           π = τ/N                                p = y/n
variance             S² = Σ (Y_i − μ)² / (N − 1)            s² = Σ (y_i − ȳ)² / (n − 1)
variance of a prop   S² = N π(1 − π) / (N − 1)              s² = n p(1 − p) / (n − 1)
Note:
The population mean is sometimes denoted as Ȳ in many books.
The population total is sometimes denoted as Y in many books.
Again note the distinction between the population quantity (e.g. the population mean μ) and the
corresponding sample quantity (e.g. the sample mean ȳ).
3.4 Simple Random Sampling Without Replacement (SRSWOR)
This forms the basis of many other more complex sampling plans and is the gold standard against which all
other sampling plans are compared. It often happens that more complex sampling plans consist of a series
of simple random samples that are combined in a complex fashion.
In this design, once the frame of units has been enumerated, a sample of size n is selected without
replacement from the N population units.
Refer to the previous sections for an illustration of how the units will be selected.
3.4.1 Summary of main results
It turns out that for a simple random sample, the sample mean (ȳ) is the best estimator for the population
mean (μ). The population total is estimated by multiplying the sample mean by the POPULATION size.
And, a proportion is estimated by simply coding results as 0 or 1 depending if the sampled unit belongs to
the class of interest, and taking the mean of these 0,1 values. (Yes, this really does work - refer to a later
section for more details).
As with every estimate, a measure of precision is required. We saw in an earlier chapter that the standard
error (se) is such a measure. Recall that the standard error measures how variable the results of our survey
would be if the survey were to be repeated. The standard error for the sample mean looks very similar to that
for a sample mean from a completely randomized design (refer to later chapters) with a common correction
of a finite population factor (the (1 − f) term).
The standard error for the population total estimate is found by multiplying the standard error for the
mean by the POPULATION SIZE.
The standard error for a proportion is found again, by treating each data value as 0 or 1 and applying the
same formula as the standard error for a mean.
The following table summarizes the main results:
Parameter    Population value   Estimator           Estimated se
Mean         μ                  ȳ                   se(ȳ) = √( (s²/n)(1 − f) )
Total        τ = Nμ             τ̂ = N ȳ            se(τ̂) = N √( (s²/n)(1 − f) )
Proportion   π                  p = ȳ(0/1) = y/n    se(p) = √( (p(1 − p)/(n − 1))(1 − f) )
Notes:
Inflation factor: The term N/n is called the inflation factor and the estimator for the total is sometimes
called the expansion estimator or the simple inflation estimator.
Sampling weight Many statistical packages that analyze survey data will require the specication of
a sampling weight. A sampling weight represents how many units in the population are represented
by this unit in the sample. In the case of a simple random sample, the sampling weight is also equal
to N/n. For example, if you select 10 units at random from 150 units in the population, the sampling
weight for each observation is 15, i.e. each unit in the sample represents 15 units in the population.
The sampling weights are computed differently for various designs so won't always be equal to N/n.
Sampling fraction: the term n/N is called the sampling fraction and is denoted as f.
Finite population correction (fpc): the term (1 − f) is called the finite population correction factor
and reflects that if you sample a substantial part of the population, the standard error of the estimator
is smaller than what would be expected from experimental design results. If f is less than 5%, this is
often ignored. In most ecological studies the sampling fraction is usually small enough that all of the
fpc terms can be ignored.
3.4.2 Estimating the Population Mean
The first line of the above table shows the basic results; all the remaining lines in the table can be
derived from this line, as will be shown later.
The population mean (μ) is estimated by the sample mean (ȳ). The estimated se of the sample mean is

   se(ȳ) = √( (s²/n)(1 − f) ) = (s/√n) √(1 − f)

Note that if the sampling fraction (f) is small, then the standard error of the sample mean can be approximated
by:

   se(ȳ) ≈ √( s²/n ) = s/√n
which is the familiar form seen previously. In general, the standard error formula changes depending
upon the sampling method used to collect the data and the estimator used on the data. Every different
sampling design has its own way of computing the estimator and se.
Confidence intervals for parameters are computed in the usual fashion, i.e. an approximate 95% confidence
interval would be found as: estimator ± 2 se. Some textbooks use a t-distribution for smaller sample
sizes, but most surveys are sufficiently large that this makes little difference.
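These formulas can be sketched in Python with made-up data (the notes themselves use packages such as SAS for the analyses); here a SRSWOR of n = 10 units from a hypothetical population of N = 150:

```python
import statistics

# Made-up sample values from a hypothetical population of N = 150 units.
N = 150
y = [12.0, 15.3, 11.8, 14.1, 13.5, 12.9, 14.8, 13.2, 12.4, 15.0]
n = len(y)
ybar = statistics.mean(y)              # estimates the population mean
s2 = statistics.variance(y)            # sample variance s^2 (divisor n - 1)
f = n / N                              # sampling fraction
se = (s2 / n * (1 - f)) ** 0.5         # se(ybar) with the finite population correction
ci = (ybar - 2 * se, ybar + 2 * se)    # approximate 95% confidence interval
print(round(ybar, 2), round(se, 3), tuple(round(v, 2) for v in ci))
```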
3.4.3 Estimating the Population Total
Many students find this part confusing, because of the term population total. This does NOT refer to the
total number of units in the population, but rather the sum of the individual values over the units. For example,
if you are interested in estimating total timber volume in an inventory unit, the trees are the sampling
units. A sample of trees is selected to estimate the mean volume per tree. The total timber volume over all
trees in the inventory unit is of interest, not the total number of trees in the inventory unit.
As the population total is found by τ = Nμ (total population size times the population mean), a natural
estimator is formed by the product of the population size and the sample mean, i.e.

   τ̂ = N ȳ.

Note that you must multiply by the population size not the sample size.
Its estimated se is found by multiplying the estimated se for the sample mean by the population size as
well, i.e.,

   se(τ̂) = N √( (s²/n)(1 − f) )
In general, estimates for population totals in most sampling designs are found by multiplying estimates
of population means by the population size.
Confidence intervals are found in the usual fashion.
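A sketch in Python with made-up data (n = 10 units from a hypothetical population of N = 150):

```python
import statistics

# Made-up data: estimate the population total as N * ybar; its se is N times
# the se of the sample mean.
N = 150
y = [12.0, 15.3, 11.8, 14.1, 13.5, 12.9, 14.8, 13.2, 12.4, 15.0]
n = len(y)
ybar = statistics.mean(y)
se_mean = (statistics.variance(y) / n * (1 - n / N)) ** 0.5
tau_hat = N * ybar                 # multiply by the POPULATION size, not n
se_tau = N * se_mean
print(round(tau_hat, 1), round(se_tau, 1))
```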
3.4.4 Estimating Population Proportions
A standard trick used in survey sampling when estimating a population proportion is to replace the
response variable by a 0/1 code and then treat this coded data in the same way as ordinary data.
For example, suppose you were interested in the proportion of fish in a catch that was of a particular
species. A sample of 10 fish were selected (of course in the real world, a larger sample would be taken), and
the following data were observed (S=sockeye, C=chum):
S C C S S S S C S S
Of the 10 fish sampled, 3 were chum so that the sample proportion of fish that were chum is 3/10 = 0.30.
If the data are recoded using 1=Chum, 0=Sockeye, the sample values would be:
0 1 1 0 0 0 0 1 0 0
The sample average of these numbers gives ȳ = 3/10 = 0.30 which is exactly the proportion seen.
It is not surprising then that by recoding the sample using 0/1 variables, the first line in the summary
table reduces to the last line in the summary table. In particular, s² reduces to n p(1 − p)/(n − 1), resulting
in the se seen above.
Confidence intervals are computed in the usual fashion.
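The recoding trick is easy to verify numerically; a sketch of the fish example above (the variable names are illustrative):

```python
import statistics

# Recode the catch: 1 = Chum, 0 = Sockeye
catch = list("SCCSSSSCSS")
coded = [1 if fish == "C" else 0 for fish in catch]

n = len(coded)
p_hat = statistics.mean(coded)        # 3/10 = 0.30, the sample proportion
s2 = statistics.variance(coded)       # ordinary sample variance of the 0/1 data

# For 0/1 data, s^2 reduces to n*p*(1-p)/(n-1)
same = abs(s2 - n * p_hat * (1 - p_hat) / (n - 1)) < 1e-12
```

The ordinary variance formula applied to the 0/1 codes gives exactly n p̂ (1 − p̂)/(n − 1), which is why no special machinery is needed for proportions.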
3.4.5 Example - estimating total catch of fish in a recreational fishery
This will illustrate the concepts in the previous sections using a very small illustrative example.
For management purposes, it is important to estimate the total catch by recreational fishers. Unfortunately, there is no central registry of fishers, nor is there a central reporting station. Consequently, surveys are often used to estimate the total catch.
There are two common survey designs used in these types of surveys (generically called creel surveys). In access surveys, observers are stationed at access points to the fishery. For example, if fishers go out in boats to catch the fish, the access points are the marinas where the boats are launched and returned. From these access points, a sample of fishers is selected and interviews conducted to measure the number of fish captured and other attributes. Roving surveys are commonly used when there is no common access point and you can move among the fishers. In this case, the observer moves about the fishery and questions anglers as they are encountered. Note that in this last design, the chances of encountering an angler are no longer equal - there is a greater chance of encountering an angler who has a longer fishing episode. And, you typically don't encounter the angler at the end of the episode but somewhere in the middle of the episode. The analysis of roving surveys is more complex - seek help. The following example is based on a real life example from British Columbia. The actual survey is much larger, involving several thousand anglers and sample sizes in the low hundreds, but the basic idea is the same.
An access survey was conducted to estimate the total catch at a lake in British Columbia. Fortunately, access to the lake takes place at a single landing site and most anglers use boats in the fishery. An observer was stationed at the landing site but, because of time constraints, could only interview a portion of the anglers returning; however, the observer was able to get a total count of the number of fishing parties on that day. A total of 168 fishing parties arrived at the landing during the day, of which 30 were sampled. The decision to sample a fishing party was made using a random number table as the boat returned to the dock.
The objectives are to estimate the total number of anglers and their catch, and to estimate the proportion of boat trips (fishing parties) that had sufficient life-jackets for the members on the trip. Here is the raw data
- each line gives the results for one fishing party.
Number Party Sufficient
Anglers Catch Life Jackets?
1 1 yes
3 1 yes
1 2 yes
1 2 no
3 2 no
3 1 yes
1 0 no
1 0 no
1 1 yes
1 0 yes
2 0 yes
1 1 yes
2 0 yes
1 2 yes
3 3 yes
1 0 no
1 0 yes
2 0 yes
3 1 yes
1 0 yes
2 0 yes
1 1 yes
1 0 yes
1 0 yes
1 0 no
2 0 yes
2 1 no
1 1 no
1 0 yes
1 0 yes
What is the population of interest?
The population of interest is NOT the fish in the lake. The Fisheries Department is not interested in estimating the characteristics of the fish, such as mean fish weight or the number of fish in the lake. Rather, the focus is on the anglers and fishing parties. Refer to the FAQ at the end of the chapter for more details.
It would be tempting to conclude that the anglers on the lake are the population of interest. However, note that information is NOT gathered on individual anglers. For example, the number of fish captured by each angler in the party is not recorded - only the total fish caught by the party. Similarly, it is impossible to
say if each angler had an individual life jacket - if there were 3 anglers in the boat and only two life jackets, which angler was without?¹
For this reason, the population of interest is taken to be the set of boats fishing at this lake. The fisheries agency doesn't really care about the individual anglers because if a boat with 3 anglers catches one fish, the actual person who caught the fish is not recorded. Similarly, if there are only two life jackets, does it matter which angler didn't have the jacket?
Under this interpretation, the design is a simple random sample of boats returning to the landing.
What is the frame?
The frame for a simple random sample is a listing of ALL the units in the population. This list is then used
to randomly select which units will be measured. In this case, there is no physical list and the frame is
conceptual. A random number table was used to decide which fishing parties to interview.
What is the sampling design and sampling unit?
The sampling design will be treated as if it were a simple random sample from all boats (fishing parties) returning, but in actual fact it was likely a systematic sample or a variant. As you will see later, this may or may not be a problem.
In many cases, special attention should be paid to identifying the correct sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were selected, not individual anglers. This mistake is often made when the data are presented on an individual basis rather than on a sampling-unit basis. As you will see in later chapters, this is an example of pseudo-replication.
Excel analysis
As mentioned earlier, Excel should be used with caution in statistical analysis. However, for very simple
surveys, it is an adequate tool.
A copy of a sample Excel worksheet called creel is available in the AllofData workbook in the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Here is a condensed view of the spreadsheet within the workbook:
¹ If data were collected on individual anglers, then the anglers could be taken as the population of interest. However, in this case, the design is NOT a simple random sample of anglers. Rather, as you will see later in the course, the design is a cluster sample where a simple random sample of clusters (boats) was taken and all members of the cluster (the anglers) were interviewed. As you will see later in the course, a cluster sample can be viewed as a simple random sample if you define the population in terms of clusters.
The analysis proceeds in a series of logical steps, as illustrated for the number-of-anglers-in-each-party variable.
Enter the data on the spreadsheet
The metadata (information about the survey) is entered at the top of the spreadsheet.
The actual data is entered in the middle of the sheet. One row is used to list the variables recorded for
each angling party.
Obtain the required summary statistics.
At the bottom of the data, the summary statistics needed are computed using the Excel built-in functions.
This includes the sample size, the sample mean, and the sample standard deviation.
Obtain estimates of the population quantity
Because the sample mean is the estimator for the population mean if the design is a simple random sample, no further computations are needed.
In order to estimate the total number of anglers, we multiply the average number of anglers in each fishing party (1.533 anglers/party) by the POPULATION SIZE (the number of fishing parties for the entire day = 168) to get the estimated total number of anglers (257.6).
Obtain estimates of precision - standard errors
The se for the sample mean is computed using the formula presented earlier. The estimated standard error
OF THE MEAN is 0.128 anglers/party.
Because we found the estimated total by multiplying the estimate of the mean number of anglers per boat trip by the number of boat trips (168), the estimated standard error of the POPULATION TOTAL is found by multiplying the standard error of the sample mean by the same value: 0.128 × 168 = 21.5 anglers.
Hence, a 95% confidence interval for the total number of anglers fishing this day is found as 257.6 ± 2(21.5).
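As a cross-check on the spreadsheet, the angler computations can be reproduced from the raw data table with a few lines. This is a sketch, not the course's Excel or SAS code:

```python
import math

# Number of anglers in each of the 30 sampled parties (from the raw data table)
anglers = [1,3,1,1,3,3,1,1,1,1,2,1,2,1,3,1,1,2,3,1,2,1,1,1,1,2,2,1,1,1]
N, n = 168, 30                       # parties that day; parties sampled

ybar = sum(anglers) / n                                   # 1.533 anglers/party
s2 = sum((y - ybar) ** 2 for y in anglers) / (n - 1)
se_mean = math.sqrt(s2 / n * (1 - n / N))                 # 0.128
total = N * ybar                                          # 257.6 anglers
se_total = N * se_mean                                    # about 21.57
ci = (total - 2 * se_total, total + 2 * se_total)         # approximate 95% CI
```

These values match the Excel and SAS results quoted in the text.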
Estimating total catch
A similar procedure is followed in the next column to estimate the total catch.
Estimating proportion of parties with sufficient life-jackets
First, the character values yes/no are translated into 0,1 variables using the IF statement of Excel.
Then the EXACT same formula as used for estimating the total number of anglers or the total catch is
applied to the 0,1 data!
We estimate that 73.3% of boats have sufficient life-jackets, with a se of 7.4 percentage points.
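The same arithmetic applied to the recoded life-jacket data (22 of the 30 sampled parties answered yes, per the raw data table) reproduces the quoted figures; a sketch:

```python
import math

N, n = 168, 30
lifej = [1] * 22 + [0] * 8           # 1 = sufficient life-jackets, 0 = not

p_hat = sum(lifej) / n                                # 0.733
s2 = n * p_hat * (1 - p_hat) / (n - 1)                # s^2 for 0/1 data
se = math.sqrt(s2 / n * (1 - n / N))                  # 0.0744, i.e. 7.4 points
```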
SAS analysis
SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of the sample SAS program
called creel.sas and the output called creel.lst are available from the Sample Program Library at http:
//www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The program starts with the Data step that reads in the data and creates the metadata, so that the purpose of the program, how the data were collected, etc., are not lost.
data creel;   /* read in the survey data */
   input angler catch lifej $;
   enough = 0;
   if lifej = 'yes' then enough = 1;
The first section of code reads the data and computes the 0,1 variable from the life-jacket information. A Proc Print (not shown) lists the data so that it can be verified that it was read correctly.
Most programs for dealing with survey data require that sampling weights be available for each observation.
data creel;
set creel;
sampweight = 168/30;
run;
A sampling weight is the weighting factor representing how many units in the population this observation represents. In this case, each of the 30 parties represents 168/30 = 5.6 parties in the population.
Finally, Proc SurveyMeans is used to estimate the quantities of interest.
proc surveymeans data=creel
     total=168   /* total population size */
     mean clm    /* find estimates of mean, its se, and a 95% confidence interval */
     sum clsum   /* find estimates of total, its se, and a 95% confidence interval */
     ;
   var angler catch lifej;   /* estimate mean and total for numeric variables, proportions for char variables */
   weight sampweight;
   /* Note that it is not necessary to use the coded 0/1 variables in this procedure */
run;
It is not necessary to code any formulas, as these are built into the SAS procedure. So how does the SAS program know this is a simple random sample? This is the default analysis - more complex designs require additional statements (e.g. a CLUSTER statement) to indicate a more complex design. As well, equal sampling weights indicate that all items were selected with equal probability.
Here are portions of the SAS output
Data Summary
Number of Observations 30
Sum of Weights 168
Class Level Information
Class Variable Levels Values
lifej 2 no yes
Statistics
Variable Level Mean Std Error of Mean 95% CL for Mean Sum Std Dev 95% CL for Sum
angler 1.533333 0.128419 1.27068638 1.79598029 257.600000 21.574442 213.475312 301.724688
catch 0.666667 0.139688 0.38097171 0.95236162 112.000000 23.467659 64.003248 159.996752
lifej no 0.266667 0.074425 0.11444970 0.41888363 44.800000 12.503462 19.227550 70.372450
yes 0.733333 0.074425 0.58111637 0.88555030 123.200000 12.503462 97.627550 148.772450
All of the results match that from the Excel spreadsheet.
3.5 Sample size determination for a simple random sample
I cannot emphasize too strongly the importance of planning in advance of the survey.
There are many surveys where the results are disappointing. For example, a survey of anglers may show that the mean catch per angler is 1.3 fish but that the standard error is .9 fish. In other words, a 95% confidence interval stretches from 0 to well over 4 fish per angler, something that is known with near certainty even before the survey was conducted. In many cases, a back-of-the-envelope calculation would have shown that the precision obtained from a survey at the proposed sample size would be inadequate even before the survey was started.
In order to determine the appropriate sample size, you will need to first specify some measure of precision that is required to be obtained. For example, a policy decision may require that the results be accurate to within 5% of the true value.
This precision requirement usually occurs in one of two formats:
- an absolute precision, i.e. you wish to be 95% confident that the sample mean will not vary from the population mean by a pre-specified amount. For example, a 95% confidence interval for the total number of fish captured should be ± 1,000 fish.
- a relative precision, i.e. you wish to be 95% confident that the sample mean will be within 10% of the true mean.
The latter is more common than the former, but the two are equivalent and interchangeable. For example, if the actual estimate is around 200, with a se of about 50, then the 95% confidence interval is ±100 and the relative precision is within 50% of the true answer (±100 / 200). Conversely, a 95% confidence interval that is within 40% of the estimate of 200 turns out to be ±80 (40% of 200), and consequently, the se is around 40 (= 80/2).
A common question is:
What is the difference between se/est and 2se/est? When is the relative standard error divided by 2? Does se/est have anything to do with a 95% ci?
Precision requirements are stated in different ways (replace blah below by mean/total/proportion etc).
Expression                                                Mathematics
- within xxx of the blah                                  se = xxx
- margin of error of xxx                                  2se = xxx
- within xxx of the true value 19 times out of 20         2se = xxx
- within xxx of the true value 95% of the time            2se = xxx
- the width of the 95% confidence interval is xxx         4se = xxx
- within 10% of the blah                                  se/est = .10
- a rse of 10%                                            se/est = .10
- a relative error of 10%                                 se/est = .10
- within 10% of the blah 95% of the time                  2se/est = .10
- within 10% of the blah 19 times out of 20               2se/est = .10
- margin of error of 10%                                  2se/est = .10
- width of 95% confidence interval = 10% of the blah      4se/est = .10
As a rough rule of thumb, the following are often used as survey precision guidelines:
- For preliminary surveys, the 95% confidence interval should be ± 50% of the estimate. This implies that the target rse is 25%.
- For management surveys, the 95% confidence interval should be ± 25% of the estimate. This implies that the target rse is 12.5%.
- For scientific work, the 95% confidence interval should be ± 10% of the estimate. This implies that the target rse is 5%.
Next, some preliminary guess for the standard deviation of individual items in the population (S) needs to be made, along with an estimate of the population size (N) and possibly the population mean (μ) or population total (τ). These are not too crucial and can be obtained by:
- taking a pilot study
- previous sampling of similar populations
- expert opinion
A very rough estimate of the standard deviation can be found as the usual range of the data divided by 4. If the population proportion is unknown, the value of 0.5 is often used, as this leads to the largest sample size requirement and so is a conservative guess.
These are then used with the formulae for the confidence interval to determine the relevant sample size. Many textbooks have complicated formulae to do this - it is much easier these days to simply code the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or use the GOAL SEEKER feature of the spreadsheet to find the appropriate sample size. This will be illustrated in the example.
As an approximate answer, recall that the se usually varies as 1/sqrt(n). Suppose that the present rse is 7.5%. A target rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size.
If the raw data are available, you can also do a bootstrap selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the parameter and the rse, and then increase the sample size until the required rse is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use the Tables → Subset → Random Sample Size menu to get the approximate bootstrap sample. Again compute the estimate and its rse, and increase the sample size until the required precision is obtained.
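The bootstrap procedure described above can be sketched as follows. This is a hypothetical illustration (the catch data and the function are made up, not the SAS or JMP workflow):

```python
import math
import random

def bootstrap_rse(data, n, reps=2000, seed=1):
    """Approximate the rse of the sample mean at sample size n by
    resampling the observed data with replacement."""
    rng = random.Random(seed)
    means = [sum(rng.choices(data, k=n)) / n for _ in range(reps)]
    mbar = sum(means) / reps
    sd = math.sqrt(sum((m - mbar) ** 2 for m in means) / (reps - 1))
    return sd / (sum(data) / len(data))        # rse = se / estimate

# Hypothetical catch data; increase n until the rse meets the target
catch = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
rses = {n: bootstrap_rse(catch, n) for n in (10, 40, 160)}
```

As expected from the 1/sqrt(n) rule above, quadrupling the sample size roughly halves the bootstrap rse.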
The final sample size is not to be treated as the exact sample size but more as a guide to the amount of effort that needs to be expended. Remember that guesses are being made for the standard deviation, the required precision, the approximate value of the estimate, etc. Consequently, there really isn't a defensible difference between a required sample size of 30 and 40. What really is of interest is the order of magnitude of effort required. For example, if your budget allows for a sample size of 20, and the sample size computation shows that a sample size of 200 is required, then doing the survey with a sample size of 20 is a waste of time and money. If the required sample size is about 30, then you may be ok with an actual sample size of 20.
If more than one item is being surveyed, these calculations must be done for each item. The largest sample size needed is then chosen. This may lead to conflict, in which case some response items must be dropped or a different sampling method must be used for the other response variables.
Precision essentially depends only on the absolute sample size, not the relative fraction of the population sampled. For example, a sample of 1000 people taken from Canada (population of 33,000,000) is just as precise as a sample of 1000 people taken from the US (population of 330,000,000)! This is highly counter-intuitive and will be explored more in class.
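This claim is easy to check numerically; a sketch using the se of a proportion with the finite population correction, assuming about 40% in favor:

```python
import math

def se_prop(p, n, N):
    """se of a sample proportion under SRSWOR, with the finite
    population correction (1 - n/N)."""
    s2 = n * p * (1 - p) / (n - 1)          # s^2 for 0/1 data
    return math.sqrt(s2 / n * (1 - n / N))

se_canada = se_prop(0.40, 1000, 33_000_000)   # Canada
se_us     = se_prop(0.40, 1000, 330_000_000)  # US
# the two se's agree to about five decimal places
```

With n = 1000 and populations in the tens or hundreds of millions, the fpc is essentially 1 in both cases, so the population size has virtually no effect on precision.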
3.5.1 Example - How many angling-parties to survey
We wish to repeat the angler creel survey next year.
How many angling-parties should be interviewed to be 95% confident of being within 10% of the true mean catch?
What sample size would be needed to estimate the proportion of boats within 3 percentage points 19 times out of 20? In this case we are asking that the 95% confidence interval be ± 0.03, or that the se = 0.015.
The sample size spreadsheet is available in an Excel workbook called SurveySampleSize.xls which
can be downloaded from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.
A SAS program to compute sample size is also available but, in my opinion, is neither user-friendly nor as flexible for the general user. The code and output are also available in the Sample Program Library referred to above.
Here is a condensed view of the spreadsheet:
First note that the computations for sample size require some PRIOR information about the population size, the population mean, or the population proportion. We will use information from the previous survey to help plan future studies.
For example, about 168 boats returned to the landing last year. The mean catch per angling party was about .667 fish/boat. The standard deviation of the catch per party was .844. These values are entered in the spreadsheet in column C.
A preliminary sample size of 40 (in green in Column C) was tried. This led to a 95% confidence interval of ± 35%, which did not meet the precision requirements.
Now vary the sample size (in green) in column C until the 95% confidence interval (in yellow) is below ± 10%. You will find that you will need to interview almost 135 parties - a very high sampling fraction indeed. The problem for this variable is the very high variation of the individual data points.
If you are familiar with Excel, you can use the Goal Seeker function to speed the search.
Similarly, the proportion of parties with sufficient life-jackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small, as the 95% confidence interval is ± .186 (18 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03. Note that you need to be careful in dealing with percentages - confidence limits are often specified in terms of percentage points rather than percents, to avoid problems where percents are taken of percents. This will be explained further in class.
Try using the spreadsheet to compare the precision of a poll of 1000 people taken from Canada (population 33,000,000) and 1000 people taken from the US (population 330,000,000) if both polls have about 40% in favor of some issue.
Technical notes
If you really want to know how the sample size numbers are determined, here is the lowdown.
Suppose that you wish to be 95% sure that the sample mean is within 10% of the true mean.
We must solve z × S/sqrt(n) × sqrt( (N − n)/N ) = ε μ for n, where z is the multiplier for a particular confidence level (for a 95% c.i. use z = 2) and ε is the closeness factor (in this case ε = 0.10).
Rearranging this equation gives n = N / ( 1 + N ( εμ/(zS) )² )
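The rearranged formula can be coded directly; plugging in last year's creel figures (S = .844, mean catch = .667, N = 168, ε = 0.10, z = 2) reproduces the roughly 135 parties found with the spreadsheet (the function name is illustrative):

```python
import math

def srs_sample_size(N, S, mean, eps=0.10, z=2):
    """Solve z * S/sqrt(n) * sqrt((N-n)/N) = eps*mean for n:
    n = N / (1 + N * (eps*mean / (z*S))**2)."""
    return N / (1 + N * (eps * mean / (z * S)) ** 2)

# Last year's creel figures: N = 168 parties, S = .844, mean catch = .667
n = srs_sample_size(168, 0.844, 0.667)
n_needed = math.ceil(n)     # about 134 parties, matching the spreadsheet
```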
3.6 Systematic sampling
Sometimes, logistical considerations make a true simple random sample not very convenient to administer. For example, in the previous creel survey, a true random sample would require that a random number be generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.
3.6.1 Advantages of systematic sampling
The main advantages of systematic sampling are:
- it is easier to draw units because only one random number is chosen
- it can be used when a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen
- easier instructions for untrained staff
- if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!
- it distributes the sample more evenly over the population. Consequently, if there is a trend, you will get items selected from all parts of the trend.
3.6.2 Disadvantages of systematic sampling
The primary disadvantages of systematic sampling are:
- Hidden periodicities or trends may cause biased results. In such cases, estimates of means and standard errors may be severely biased! See Section 4.2.2 for a detailed discussion.
- Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually make the assumption that the systematic sample is equivalent to a simple random sample without much justification for this.
3.6.3 How to select a systematic sample
There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every kth record, where k is chosen to meet sample size requirements - an example of choosing k will be given in class. All of the following methods are equivalent if k divides N exactly. These are the two most common methods.
Method 1 Choose a random number j from 1 … k. Then choose the j, j + k, j + 2k, … records.
One problem is that different samples may be of different sizes - an example will be given in class where k doesn't divide N exactly. This causes problems in sampling theory, but not too much of a problem if n is large.
Method 2 Choose a random number from 1 … N. Choose every kth item and continue in a circle when you reach the end until you have selected n items. This will always give you the same sized sample; however, it requires knowledge of N.
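The two selection methods can be sketched as follows (unit labels run 1..N; the function names are illustrative):

```python
import random

def systematic_method1(N, k, rng=random):
    """Method 1: random start j in 1..k, then every kth unit.
    Sample size varies when k does not divide N."""
    j = rng.randrange(1, k + 1)
    return list(range(j, N + 1, k))

def systematic_method2(N, k, n, rng=random):
    """Method 2: random start anywhere in 1..N, every kth unit,
    wrapping around the end; always yields exactly n units."""
    start = rng.randrange(1, N + 1)
    return [((start - 1 + i * k) % N) + 1 for i in range(n)]

s1 = systematic_method1(103, 5)       # 20 or 21 units, depending on the start
s2 = systematic_method2(103, 5, 21)   # always exactly 21 units
```

With N = 103 and k = 5, Method 1 returns 20 or 21 units depending on the random start, illustrating the unequal-sample-size problem, while Method 2 always returns exactly n units.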
3.6.4 Analyzing a systematic sample
Most surveys casually assume that the population has been sorted in random order when the systematic sample was selected, and so treat the results as if they had come from a SRSWOR. This is theoretically not correct; if the assumption is false, the results may be biased, and there is no way of examining the biases from the data at hand.
Before implementing a systematic survey or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could potentially turn a survey where nothing can be estimated into a survey that has justifiable results.
3.6.5 Technical notes - Repeated systematic sampling
To avoid many of the potential problems with systematic sampling, a common device is to use repeated
systematic samples on the same population.
For example, rather than taking a single systematic sample of size 100 from a population, you can take
4 systematic samples (with different starting points) of size 25.
An empirical method of obtaining a standard error from a systematic sample is to use repeated systematic sampling. Rather than choosing one systematic subsample of every kth unit, choose m independent systematic subsamples of size n/m. Then estimate the mean of each sub-systematic sample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory. The variation of the estimate among the sub-systematic samples provides an estimate of the standard error (after an appropriate adjustment). This will be illustrated in an example.
Example of replicated subsampling within a systematic sample
A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of
ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the
breeding area was divided into 1000 strips.
What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:
- The population is the set of individual ducks on the study area. However, no frame exists for the individual birds. But a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being strips.
- The population consists of the 1000 strips that cover the study area, and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.
In either case, the analysis is exactly the same and the final estimates are exactly the same.
Approximately 100 of the transects are flown by an aircraft, and spotters on the aircraft count the number of breeding pairs visible from the aircraft.
For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data; it is well known that ducks do not spread themselves randomly throughout the breeding area. After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify each transect as prime or non-prime breeding habitat.
Here is the raw data reporting the number of nests in each set of 10 transects:
                Prime        Non-Prime
                Habitat      Habitat               Prime   Non-prime
     Set        Total   n    Total   n    ALL      mean    mean       Diff
                 (b)                      (a)      (c)     (d)        (e)
      1          123    3     345    7    468      41.0    49.3       -8.3
      2           57    2      36    8     93      28.5     4.5       24.0
      3           85    5      46    5    131      17.0     9.2        7.8
      4           97    2     131    8    228      48.5    16.4       32.1
      5           34    5      43    5     77       6.8     8.6       -1.8
      6           85    3      67    7    152      28.3     9.6       18.8
      7           56    7      64    3    120       8.0    21.3      -13.3
      8           46    2      65    8    111      23.0     8.1       14.9
      9           37    4      43    6     80       9.3     7.2        2.1
     10           93    2     104    8    197      46.5    13.0       33.5
     Avg        71.3                    165.7                        10.97
     s          29.5                    117.0                        16.38
     n            10                       10                           10
     Est total  7130                    16570               mean     10.97
     Est se      885                     3510               se        4.91
Several different estimates can be formed.
1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests
in the breeding area for all types of habitat is of interest. Column (a) in the above table is the data that
will be used. It represents the total number of nests in the 10 transects of each set.
The principle behind the estimator is that the 1000 total transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects - the individual transects are essentially ignored.
Note that this method assumes that the systematic samples are all of the same size. If the systematic
samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a
ratio-estimator (see later sections) would have been a better estimator.
- Compute the total number of nests for each set. This is found in column (a).
- Then the sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as ȳ = (468 + 93 + … + 197)/10 = 165.7 with an estimated se of
se(ȳ) = sqrt( s²/n × (1 − n/100) ) = sqrt( 117.0²/10 × (1 − 10/100) ) = 35.1
- The average number of nests per set is expanded to cover all 100 sets: τ̂ = 100 × ȳ = 16570 and se(τ̂) = 100 × se(ȳ) = 3510
2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a domain. The number of elements in the domain in the whole population (i.e. how many of the 1000 transects are in prime habitat) is unknown, but is not needed. All that you need is the total number of nests in prime habitat in each set - you essentially ignore the non-prime habitat transects within each set.
- The average number of nests per set in prime habitats is found as before: ȳ = (123 + 57 + … + 93)/10 = 71.3 with an estimated se of
se(ȳ) = sqrt( s²/n × (1 − n/100) ) = sqrt( 29.5²/10 × (1 − 10/100) ) = 8.85
- Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se are τ̂ = 100 × ȳ = 7130 with se(τ̂) = 100 × se(ȳ) = 885.
- Note that the total number of transects of prime habitat is not known for the population, and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio-estimator (see later in the notes) could be used to estimate the density.
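The estimates in items 1 and 2 can be reproduced from the set totals in the table; a sketch:

```python
import math

def srs_estimates(set_totals, n_sets_in_pop):
    """Treat the set totals as a SRSWOR of sets; return (ybar, se(ybar),
    total, se(total)) expanded over all sets in the population."""
    n = len(set_totals)
    ybar = sum(set_totals) / n
    s2 = sum((y - ybar) ** 2 for y in set_totals) / (n - 1)
    se = math.sqrt(s2 / n * (1 - n / n_sets_in_pop))
    return ybar, se, n_sets_in_pop * ybar, n_sets_in_pop * se

all_totals   = [468, 93, 131, 228, 77, 152, 120, 111, 80, 197]  # column (a)
prime_totals = [123, 57, 85, 97, 34, 85, 56, 46, 37, 93]        # column (b)

ybar, se, tot, se_tot = srs_estimates(all_totals, 100)      # 165.7, 35.1, 16570, 3510
pbar, pse, ptot, pse_tot = srs_estimates(prime_totals, 100) # 71.3, 8.85, 7130, 885
```

The same function handles both the whole-area estimate and the prime-habitat domain estimate because, in both cases, the set of transects is the sampling unit.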
3. Difference in mean density between prime and non-prime habitats The scientists suspect that the
density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the
data? (refer to columns (c)-(e) above). Here everything must be transformed to the density of nest
per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on
experimental design) is taking place so a difference must be computed for each set and the differences
analyzed, rather than trying to treat the prime and non-prime habitats as independent samples.
Again, this is an example of what is known as domain-estimation.
- Compute the domain means for each type of habitat for each set (columns (c) and (d)). Note that the
totals are divided by the number of transects of each type in each set.
- Compute the difference in the means for each set (column (e)).
- Treat these differences as a simple random sample of size 10 taken from the 100 possible sets of
transects. What does the final estimated mean difference and se imply?
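These domain calculations can be sketched in Python. The first part reproduces the prime-habitat totals quoted above (ȳ = 71.3, s = 29.5); the per-set differences in the second part are made-up values standing in for column (e), since those numbers are not reproduced here:

```python
import math

# (2) Domain estimate of the total nests in prime habitat (values from the notes).
n, N = 10, 100                                # sets sampled, sets in the population
ybar, s = 71.3, 29.5                          # mean and SD of prime-habitat nests per set
se_ybar = math.sqrt(s**2 / n * (1 - n / N))   # se of the mean per set, with fpc
total, se_total = N * ybar, N * se_ybar       # expand to the population of sets
print(round(se_ybar, 2), round(total), round(se_total))

# (3) Paired differences, one per set, treated as an SRS (hypothetical values).
diffs = [2.1, 3.4, -0.5, 1.8, 2.9, 0.7, 1.2, 3.0, -1.1, 2.5]
dbar = sum(diffs) / len(diffs)                            # mean difference
s2 = sum((d - dbar) ** 2 for d in diffs) / (len(diffs) - 1)
se_dbar = math.sqrt(s2 / len(diffs) * (1 - len(diffs) / N))
print(round(dbar, 2), round(se_dbar, 3))
```

If the interval dbar ± 2 se(dbar) excludes zero, there is evidence of a density difference between the habitat types.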
3.7 Stratified simple random sampling
A simple modification to a simple random sample can often lead to dramatic improvements in precision.
This is known as stratification. All survey methods can potentially benefit from stratification (also known
as blocking in the experimental design literature).
Stratification will be beneficial whenever variability in the response variable among the survey units can
be anticipated and strata can be formed that are more homogeneous than the original set of survey units.
All stratified designs will have the same basic steps as listed below regardless of the underlying design.
c 2012 Carl James Schwarz 152 December 21, 2012
CHAPTER 3. SAMPLING
- Creation of strata. Stratification begins by grouping the survey units into homogeneous groups
(strata) where survey units within strata should be similar and strata should be different. For example,
suppose you wished to estimate the density of animals. The survey region is divided into a large
number of quadrats based on aerial photographs. The quadrats can be stratified into high and low
quality habitat because it is thought that the density within the high quality quadrats may be similar
but different from the density in the low quality habitats. The strata do not have to be physically
contiguous; for example, the high quality habitats could be scattered throughout the survey region
and can be grouped into one single stratum.
- Determine total sample size. Use the methods in previous sections to determine the total sample size
(number of survey units) to select. At this stage, some sort of average standard deviation will be
used to determine the sample size.
- Allocate effort among the strata. There are several ways to allocate the total effort among the strata.
  - Equal allocation. In equal allocation, the total effort is split equally among all strata. Equal
    allocation is preferred when equally precise estimates are required for each stratum. [2]
  - Proportional allocation. In proportional allocation, the total effort is allocated to the strata
    in proportion to stratum importance. Stratum importance could be related to stratum size (e.g.
    when allocating effort among the U.S. and Canada, then because the U.S. is 10 times larger than
    Canada, more effort should be allocated to surveying the U.S.). But if density is your measure
    of importance, allocate more effort to higher density strata. Proportional allocation is preferred
    when more precise estimates are required in more important strata.
  - Neyman allocation. Neyman determined that if you also have information on the variability
    within each stratum, then more effort should be allocated to strata that are more important and
    more variable to give you the most precise overall estimate for a given sample size. This rarely
    is performed in ecology because often information on intra-stratum variability is unknown. [3]
  - Cost allocation. In general, effort should be allocated to more important strata, more variable
    strata, or strata where sampling is cheaper to give the best overall precision for the entire survey.
    As in the previous allocation method, ecologists rarely have sufficiently detailed cost information
    to do this allocation method.
- Conduct separate surveys in each stratum. Separate independent surveys are conducted in each
stratum. It is not necessary to use the same survey method in all strata. For example, low density
quadrats could be surveyed using aerial methods, while high density strata may require ground based
methods. Some strata may use simple random samples, while other strata may use cluster samples.
Many textbooks show examples where the same survey method is used in all strata, but this is NOT
required.
The ability to use different sampling methods in the different strata often leads to substantial
cost savings and is a very good reason to use stratified sampling!
- Obtain stratum specific estimates. Use the appropriate estimators to estimate stratum means and the
se for EACH stratum. Then expand the estimated mean to get the estimated total (and se) in the usual
way.
[2] Recall from previous sections that the absolute sample size is one of the drivers for precision.
[3] However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations
based on stratum means often capture this variation as well.
- Rollup. The individual stratum estimates of the TOTAL are then combined to give an overall Grand
Total value for the entire survey region. The se of the Grand Total is found as:

    se(τ̂_GT) = sqrt( se(τ̂_1)² + se(τ̂_2)² + ... )
Finally, if you want the overall grand average, simply divide the grand total (and its se) by the appropriate divisor.
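The rollup step can be sketched in a few lines of Python; the stratum totals and standard errors below are illustrative values only:

```python
import math

# Illustrative stratum totals and their standard errors (any units).
stratum_totals = [80500, 51250]
stratum_ses = [14195, 8492]

grand_total = sum(stratum_totals)                          # sum the stratum totals
se_grand = math.sqrt(sum(se**2 for se in stratum_ses))     # root-sum-of-squares of the ses

print(grand_total, round(se_grand))
```

Note that the standard errors combine as the square root of the sum of squares, not as a simple sum, because the stratum surveys are independent.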
Stratification is normally carried out prior to the survey (pre-stratification), but can also be done after
the survey (post-stratification); refer to a later section for details. Stratification can be used with any type
of sampling design; the concepts introduced here deal with stratification applied to simple random samples
but are easily extended to more complex designs.
The advantages of stratification are:
- standard errors of the mean or of the total will be smaller (i.e. estimates are more precise) when
compared to the respective standard errors from an unstratified design if the units within strata are more
homogeneous (i.e., less variable) compared to the variability over the entire unstratified population.
- different sampling methods may be used in each stratum for cost or convenience reasons. [In the
detail below we assume that each stratum has the same sampling method used, but this is only for
simplification.] This can often lead to reductions in cost as the most appropriate and cost effective
sampling method can be used in each stratum.
- because randomization occurs independently in each stratum, corruption of the survey design due to
problems experienced in the field may be confined.
- separate estimates for each stratum with a given precision can be obtained.
- it may be more convenient to take a stratified random sample for administrative reasons. For example,
the strata may refer to different district offices.
3.7.1 A visual comparison of a simple random sample vs. a stratified simple random
sample
You may find it useful to compare a simple random sample of 24 vs. a stratified random sample of 24 using
the following visual plans:
Select a sample of 24 in each case.
Simple Random Sampling
Describe how the sample was taken.
Stratified Simple Random Sampling
Suppose that there is a gradient in habitat quality across the population. Then a more efficient (i.e. leading
to smaller standard errors) sampling design is a stratified design.
Three strata are defined, consisting of the first 3 rows, the next 5 rows, and finally, the last two rows. In
many cases, the same sample design is used in all strata. For example, suppose it was decided to conduct a
simple random sample within each stratum, with sample sizes of 8, 10, and 6 in the three strata respectively.
[The decision process on allocating samples to strata will be covered later.]
Stratified Sampling with a different method in each stratum
It is quite possible, and often desirable, to use different methods in the different strata. For example, it may
be more efficient to survey desert areas using a fixed-wing aircraft, while ground surveys need to be used in
heavily forested areas.
For example, consider the following design. In the first (top most) stratum, a simple random sample was
taken; in the second stratum a cluster sample was taken; in the third stratum a cluster sample (via transects)
was also taken.
3.7.2 Notation
Common notation is to use h as a stratum index and i or j as unit indices within each stratum.

Characteristic        Population quantities                   Sample quantities
number of strata      H                                       H
stratum sizes         N_1, N_2, ..., N_H                      n_1, n_2, ..., n_H
population units      Y_hj, h = 1,...,H, j = 1,...,N_h        y_hj, h = 1,...,H, j = 1,...,n_h
stratum totals        τ_h                                     y_h
stratum means         μ_h                                     ȳ_h
standard deviation    S_h²                                    s_h²

Population total:  τ = N Σ_{h=1..H} W_h μ_h, where W_h = N_h / N
Population mean:   μ = Σ_{h=1..H} W_h μ_h
3.7.3 Summary of main results
It is assumed that from each stratum, a SRSWOR of size n_h is selected independently of ALL OTHER
STRATA!
The results below summarize the computations that can be more easily thought of as occurring in four steps:
1. Compute the estimated mean and its se for each stratum. In this chapter, we use a SRS design in each
stratum, but it is not necessary to use this design in a stratum and each stratum could have a different
design. In the case of an SRS, the estimate of the mean for each stratum is found as:

    μ̂_h = ȳ_h

with associated standard error:

    se(μ̂_h) = sqrt( s_h² / n_h × (1 − f_h) )

where the subscript h refers to each stratum.
2. Compute the estimated total and its se for each stratum. In many cases this is simply the estimated
mean for the stratum multiplied by the STRATUM POPULATION size. In the case of an SRS in each
stratum this gives:

    τ̂_h = N_h μ̂_h = N_h ȳ_h
    se(τ̂_h) = N_h se(μ̂_h) = N_h sqrt( s_h² / n_h × (1 − f_h) )
3. Compute the grand total and its se over all strata. This is the sum of the individual totals. The se is
computed in a special way.

    τ̂ = τ̂_1 + τ̂_2 + ...

    se(τ̂) = sqrt( se(τ̂_1)² + se(τ̂_2)² + ... )

4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand
total by the total POPULATION sizes:

    μ̂ = τ̂ / (N_1 + N_2 + ...)

    se(μ̂) = se(τ̂) / (N_1 + N_2 + ...)
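The four steps can be collected into one small Python function. This is a sketch that assumes a SRSWOR in every stratum; each stratum is supplied as its population size N_h together with the list of observed sample values:

```python
import math

def stratified_estimates(strata):
    """Stratified estimates assuming SRSWOR in each stratum.

    strata: list of (N_h, sample) pairs, where sample holds the observed
    values for the n_h selected units in stratum h.
    Returns (grand total, se of total, grand mean, se of mean).
    """
    grand_total, se2_total, N = 0.0, 0.0, 0.0
    for N_h, y in strata:
        n_h = len(y)
        ybar = sum(y) / n_h
        s2 = sum((v - ybar) ** 2 for v in y) / (n_h - 1)
        se_mean = math.sqrt(s2 / n_h * (1 - n_h / N_h))  # step 1: stratum mean se
        grand_total += N_h * ybar                        # steps 2-3: stratum totals, summed
        se2_total += (N_h * se_mean) ** 2                # step 3: se's combine in squares
        N += N_h
    se_total = math.sqrt(se2_total)
    return grand_total, se_total, grand_total / N, se_total / N  # step 4: grand mean

# Demonstration with the lake survey data worked later in this section.
lake = [(7.5e8, [37.2, 46.6, 45.3, 38.1, 40.4]),
        (2.5e7, [365, 344, 388, 347, 403])]
total, se_total, mean, se_mean = stratified_estimates(lake)
print(round(mean, 2), round(se_mean, 2))
```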
This can be summarized in a succinct form as follows. Note that the stratum weights W_h are formed as
N_h / N and are often used to derive weighted means etc:

Quantity  Pop value                      Estimator                         se
Mean      μ = Σ W_h μ_h                  μ̂_str = Σ W_h ȳ_h                 sqrt( Σ W_h² se²(ȳ_h) ) = sqrt( Σ W_h² s_h²/n_h × (1 − f_h) )
Total     τ = N Σ W_h μ_h = Σ N_h μ_h    τ̂_str = N Σ W_h ȳ_h = Σ N_h ȳ_h   sqrt( Σ N_h² se²(ȳ_h) ) = sqrt( Σ N_h² s_h²/n_h × (1 − f_h) )

(all sums run over h = 1, ..., H)
Notes
- The estimator for the grand population mean is a weighted average of the individual stratum means
using the POPULATION weights rather than the sample weights. This is NOT the same as the simple
unweighted average of the estimated stratum means unless the n_h / n equal the N_h / N - such a design
is known as proportional allocation in stratified sampling.
- The estimated standard error for the grand total is found as sqrt( se_1² + se_2² + ... + se_H² ), i.e. the square
root of the sum of the individual se² of the strata TOTALS.
- The estimators for a proportion are IDENTICAL to that of the mean except replace the variable of
interest by 0/1 where 1=character of interest and 0=character not of interest.
- Confidence intervals. Once the se has been determined, the usual ±2se will give approximate 95%
confidence intervals if the sample sizes are relatively large in each stratum. If the sample sizes are
small in each stratum some authors suggest using a t-distribution with degrees of freedom determined
using a Satterthwaite approximation - this will not be covered in this course.
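For instance, the large-sample interval is simply the estimate plus or minus two standard errors; the numbers below are illustrative only:

```python
# Approximate large-sample 95% confidence interval: estimate +/- 2 se.
est, se = 52.1, 1.87                       # an illustrative stratified mean and its se
lower, upper = est - 2 * se, est + 2 * se
print(round(lower, 2), round(upper, 2))
```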
3.7.4 Example - sampling organic matter from a lake
[With thanks to Dr. Rick Routledge for this example].
Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just
after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard
fashion and collect a few sample vials of water which you could then take back to the lab. If you knew the
total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by
taking the product of the average concentration in your sample and the total volume of the lake.
The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively
evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample.
(How might you go about doing this?)
Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic
matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then
most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from
around the mouth of the stream would contain much higher concentration levels. That is, there is a real
potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be
trustworthy.
Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the
number of vials sampled from the region close to the stream mouth. This source of variation ought to be
controlled.
Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but
the extent to which this total is concentrated near the mouth of the stream.
You can simultaneously overcome all three deficiencies by taking what is called a stratified random
sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal
strata that naturally form in most lakes, although these natural strata might be used in a more complex
sampling scheme than the one considered here.) In this instance, the lake could be divided into two parts,
one consisting roughly of the area of high concentration close to the stream outlet, the other comprising the
remainder of the lake.
Then if a simple random sample of fixed size were to be taken from within each of these strata, the
results could be used to estimate the total amount of organic matter within each stratum. These subtotals
could then be added to produce an estimate of the overall total for the lake.
This procedure, because it involves constructing separate estimates for each stratum, permits us to assess
the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator
to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in
the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one
can construct are free of the outlier problem that invalidated the confidence limits based on a simple random
sampling scheme.
A randomized sample is to be drawn independently from within each stratum.
How can we use the results of a stratified random sample to estimate the overall total? The simplest way
is to construct an estimate of the totals within each of the strata, and then to sum these estimates. A sensible
estimate of the average within the hth stratum is ȳ_h. Hence, a sensible estimate of the total within the hth
stratum is τ̂_h = N_h ȳ_h, and the overall total can be estimated by

    τ̂ = Σ_{h=1..H} τ̂_h = Σ_{h=1..H} N_h ȳ_h.
If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the
size of the population, N. The resulting estimator is called the stratified random sampling estimator of the
population average, and is given by

    μ̂ = Σ_{h=1..H} N_h ȳ_h / N.
This can be expressed as a fancy average if we adjust the order of operations in the above expression. If,
instead of dividing the sum by N, we divide each term by N and then sum the results, we shall obtain the
same result. Hence,

    μ̂_stratified = Σ_{h=1..H} (N_h / N) ȳ_h = Σ_{h=1..H} W_h ȳ_h,

where W_h = N_h / N. These W_h-values can be thought of as weighting factors, and μ̂_stratified can then
be viewed as a weighted average of the within-stratum sample averages.
The estimated standard error is found as:

    se(μ̂_stratified) = se( Σ_{h=1..H} W_h ȳ_h ) = sqrt( Σ_{h=1..H} W_h² [se(ȳ_h)]² ),

where the estimated se(ȳ_h) is given by the formulas for simple random sampling: se(ȳ_h) = sqrt( s_h² / n_h × (1 − f_h) ).
A Numerical Example
Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata,
and that the following results were obtained. (All readings are in mg per litre.)

Stratum   N_h         n_h   Sample Observations         ȳ_h     s_h
1         7.5 × 10⁸   5     37.2 46.6 45.3 38.1 40.4    41.52   4.23
2         2.5 × 10⁷   5     365 344 388 347 403         369.4   25.7
We begin by computing the estimated mean for each stratum and its associated standard error. The
sampling fraction n_h / N_h is so close to 0 it can be safely ignored. For example, the standard error of the mean
for stratum 1 is found as:

    se(μ̂_1) = sqrt( s_1² / n_1 × (1 − f_1) ) = sqrt( 4.23² / 5 ) = 1.89.

This gives the summary table:

Stratum   n_h   μ̂_h     se(μ̂_h)
1         5     41.52   1.8935
2         5     369.4   11.492
Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean
concentration and se of each stratum by the total volume:

    τ̂_h = N_h μ̂_h        se(τ̂_h) = N_h se(μ̂_h)

For example, the estimated total organic matter in stratum 1 is found as:

    τ̂_1 = N_1 μ̂_1 = 7.5 × 10⁸ × 41.52 = 311.4 × 10⁸

    se(τ̂_1) = N_1 se(μ̂_1) = 7.5 × 10⁸ × 1.89 = 14.175 × 10⁸

This gives the summary table:

Stratum   n_h   μ̂_h     se(μ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸
Next, we total the organic content of the two strata and find the se of the grand total as

    sqrt( 14.175² + 2.873² ) × 10⁸ = 14.46 × 10⁸

to give the summary table:
Stratum   n_h   μ̂_h     se(μ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸
Total                             403.7 × 10⁸   14.46 × 10⁸
Finally, the overall grand mean is found by dividing by the total volume of the lake, 7.75 × 10⁸, to give:

    μ̂ = (403.7 × 10⁸) / (7.75 × 10⁸) = 52.09 mg/L

    se(μ̂) = (14.46 × 10⁸) / (7.75 × 10⁸) = 1.87 mg/L
The calculations required to compute the stratified estimate can also be done using the method of
weighted averages as shown in the following table:

Stratum   N_h          W_h (= N_h/N)   ȳ_h     W_h ȳ_h   se(ȳ_h)   W_h² [se(ȳ_h)]²
1         7.5 × 10⁸    0.9677          41.52   40.180    1.8935    3.3578
2         2.5 × 10⁷    0.0323          369.4   11.916    11.492    0.1374
Totals    7.75 × 10⁸   1.0000                  52.097              3.4952

    se = sqrt(3.4952)
Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is
sqrt(3.4952) = 1.870 mg/L, and an approximate 95% confidence interval is then found in the usual fashion. As
expected these match the previous results.
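The weighted-average arithmetic in the table can be reproduced with a few lines of Python, using the stratum volumes, sample means, and standard errors quoted above:

```python
import math

# Stratum volumes, sample means, and se's from the lake example.
N_h = [7.5e8, 2.5e7]
W = [Nh / sum(N_h) for Nh in N_h]          # stratum weights W_h = N_h / N
ybar = [41.52, 369.4]
se_ybar = [1.8935, 11.492]

mu_hat = sum(w * y for w, y in zip(W, ybar))                        # weighted mean
se_mu = math.sqrt(sum((w, s)[0] ** 2 * (w, s)[1] ** 2 for w, s in zip(W, se_ybar)))

print(round(mu_hat, 3), round(se_mu, 3))
```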
This discussion swept a number of practical difficulties under the carpet. These include (a) estimating
the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each
stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration for each
water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these
difficulties is simple to resolve. Estimating the volume of a portion of a lake, for example, typically involves
taking numerous depth readings and then applying a formula for approximating integrals. This problem is
beyond the scope of these notes.
The standard error in the estimator of the overall average is markedly reduced in this example by the
stratification. The standard error was just estimated for the stratified estimator to be around 2. This result
was for a sample of total size 10. By contrast, for an estimator based on a simple random sample of the
same size, the standard error can be found to be about 20. [This involves methods not covered in this class.]
Stratification has reduced the standard error by an order of magnitude.
It is also possible that we could reduce the standard error even further without increasing our sampling
effort by somehow allocating this effort more efficiently. Perhaps we should take fewer water samples from
the region far from the outlet, and take more from the other stratum. This will be covered later in this course.
One can also read in more comprehensive accounts how to construct estimates from samples that are
stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g.,
you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor
of one sex, you can use this information to build an improved estimate of the quantity of interest through
stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling
some specified number of individuals from each sex (stratum).
Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of
surveys that produced virtually useless results because the researchers failed to develop an appropriate plan.
This should include a statement of your main objective, and detailed descriptions of how you plan to generate
the sample, collect the data, enter them into a computer file, and analyze the results. The plan should contain
discussion of how you propose to check for and correct errors at each stage. It should be tested with a pilot
survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible
improvements. There is no reason to expect that the survey design will be perfect the first time that it is
tried, nor that flaws will all be discovered in the first round. On the other hand, one should expect that after
many years' experience, the researchers will have honed the survey into a solid instrument. George Gallup's
early surveys were seriously biased. Although it took over a decade for the flaws to come to light, once they
did, he corrected his survey design promptly, and continued to build a strong reputation.
One should also be cautious in implementing stratified survey designs for long-term studies. An efficient
stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a
substantially altered configuration of channels and islands. You should anticipate the need to revise your
stratification periodically.
3.7.5 Example - estimating the total catch of salmon
DFO needs to monitor the catch of sockeye salmon as the season progresses so that stocks are not overfished.
The season in one statistical sub-area in a year was a total of 2 days (!) and 250 vessels participated in
the fishery in these 2 days. A census of the catch of each vessel at the end of each day is logistically difficult.
In this particular year, observers were randomly placed on selected vessels and at the end of each day
the observers contacted DFO managers with a count of the number of sockeye caught on that day.
Here is the raw data - each line corresponds to the observer's count for that vessel for that day. On the
second day, a new random sample of vessels was selected. On both days, 250 vessels participated in the
fishery.
Date Sockeye
29-Jul-98 337
29-Jul-98 730
29-Jul-98 458
29-Jul-98 98
29-Jul-98 82
29-Jul-98 28
29-Jul-98 544
29-Jul-98 415
29-Jul-98 285
29-Jul-98 235
29-Jul-98 571
29-Jul-98 225
29-Jul-98 19
29-Jul-98 623
29-Jul-98 180
30-Jul-98 97
30-Jul-98 311
30-Jul-98 45
30-Jul-98 58
30-Jul-98 33
30-Jul-98 200
30-Jul-98 389
30-Jul-98 330
30-Jul-98 225
30-Jul-98 182
30-Jul-98 270
30-Jul-98 138
30-Jul-98 86
30-Jul-98 496
30-Jul-98 215
What is the population of interest?
The population of interest is the set of vessels participating in the fishery on the two days. [The fact that each
vessel likely participated in both days is not really relevant.] The population of interest is NOT the salmon
captured - this is the response variable for each boat whose total is of interest.
What is the sampling frame?
It is not clear how the list of fishing boats was generated. It seems unlikely that the aerial survey actually had
a picture of the boats on the water from which DFO selected some boats. More likely, the observers were
taken onto the water in some systematic fashion, and then the observer selected a boat at random from those
seen at this point. Hence the sampling frame is the set of locations chosen to drop off the observers and the
set of boats visible from these points.
What is the sampling design?
The sampling unit is a boat on a day. The strata are the two days. On each day, a random sample was selected
from the boats participating in the fishery.
This is a stratified design with a simple random sample selected each day.
Note in this survey, it is logistically impossible to do a simple random sample over both the days as the
number of vessels participating really isn't known for any day until the fishery starts. Here, stratification
takes the form of administrative convenience.
Excel analysis
A copy of an Excel spreadsheet is available in the sockeye tab of the AllofData workbook available from the
Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A summary of the page appears below:
The data are listed on the spreadsheet on the left.
Summary statistics
The Excel built-in functions are used to compute the summary statistics (sample size, sample mean, and
sample standard deviation) for each stratum. Some caution needs to be exercised that the range of each
function covers only the data for that stratum. [4]
You will also need to specify the stratum size (the total number of sampling units in each stratum), i.e.
250 vessels on each day.
Find estimates of the mean catch for each stratum
Because the sampling design in each stratum is a simple random sample, the same formulae as in the previous
section can be used.
The mean and its estimated se for each day of the opening are reported in the spreadsheet.
Find the estimates of the total catch for each stratum
The estimated total catch is found by multiplying the average catch per boat by the total number of boats
participating in the fishery. The estimated standard error for the total for that day is found by multiplying
the standard error for the mean by the stratum size as in the previous section.
For example, in the first stratum (29 July), the estimated total catch is found by multiplying the estimated
mean catch per boat (322) by the number of boats participating (250) to give an estimated total catch of
80,500 salmon for the day. The se for the total catch is found by multiplying the se of the mean (56.8) by the
number of boats participating (250) to give the se of the total catch for the day of about 14,200 salmon.
Find estimate of grand total
Once an estimated total is found for each stratum, the estimated grand total is found by summing the
individual stratum estimated totals. The estimated standard error of the grand total is found by the square
root of the sum of the squares of the standard errors in each stratum - the Excel function sumsq is useful for
this computation.
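The same computation can be sketched in Python using the observer counts listed above, with 250 boats participating on each day:

```python
import math

def srs_total(sample, N):
    """Estimated total and its se for a simple random sample of size n from N units."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    se_mean = math.sqrt(s2 / n * (1 - n / N))   # includes the finite population correction
    return N * ybar, N * se_mean

day1 = [337, 730, 458, 98, 82, 28, 544, 415, 285, 235, 571, 225, 19, 623, 180]
day2 = [97, 311, 45, 58, 33, 200, 389, 330, 225, 182, 270, 138, 86, 496, 215]

t1, se1 = srs_total(day1, 250)             # stratum (day) totals
t2, se2 = srs_total(day2, 250)
grand = t1 + t2                            # grand total over both days
se_grand = math.sqrt(se1**2 + se2**2)      # root-sum-of-squares of the day ses

print(round(grand), round(se_grand))
```

This reproduces the spreadsheet results: a grand total of 131,750 sockeye with a standard error of about 16,541.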
Estimates of the overall grand mean
This was not done in the spreadsheet, but is easily computed by dividing the total catch by the total
number of boat-days in the fishery (250+250=500). The se is found by dividing the se of the total catch also
by 500.
Note this is interpreted as the mean number of fish captured per day per boat.
[4] If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary statistics for each stratum. An application
of Pivot-Tables is demonstrated in the analysis of a cluster sample where the cluster totals are needed for the summary statistics.
SAS analysis
As noted earlier, some care must be used when standard statistical packages are used to analyze survey data
as many packages ignore the design used to select the data.
A sample SAS program for the analysis of the sockeye example called sockeye.sas and its output called
sockeye.lst is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.
The program starts with reading in the raw data and the computation of the sampling weights.
data sockeye;                  /* read in the data */
  length date $8.;
  input date $ sockeye;
  /* compute the sampling weight. In general,
     these will be different for each stratum */
  if date = '29-Jul' then sampweight = 250/15;
  if date = '30-Jul' then sampweight = 250/15;
Because the population size and sample size are the same for each stratum, the sampling weights are common
to all boats. In general, this is not true, and a separate sampling weight computation is required for each
stratum.
A separate file is also constructed with the population sizes for each stratum so that estimates of the
population total can be constructed.
data n_boats;   /* you need to specify the stratum sizes if you want stratum totals */
  length date $8.;
  date = '29-Jul'; _total_=250; output;   /* the stratum sizes must be variable _total_ */
  date = '30-Jul'; _total_=250; output;
run;
Proc SurveyMeans then uses the STRATA statement to identify that this is a stratified design. The
default analysis in each stratum is again a simple random sample.

proc surveymeans data=sockeye
     N=n_boats   /* dataset with the stratum population sizes present */
     mean        /* average catch/boat along with standard error */
     sum;        /* request estimates of total */
  strata date / list;   /* identify the stratum variable */
  var sockeye;          /* which variable to get estimates for */
  weight sampweight;
run;
The SAS output is:
Data Summary
Number of Strata 2
Number of Observations 30
Sum of Weights 500
Stratum Information
Stratum Index   date     Population Total   Sampling Rate   N Obs   Variable   N
1               29-Jul   250                6.00%           15      sockeye    15
2               30-Jul   250                6.00%           15      sockeye    15

Statistics
Variable   Mean         Std Error of Mean   Sum      Std Dev
sockeye    263.500000   33.082758           131750   16541
The results are the same as before.
The only thing of interest is to note that SAS labels the precision of the estimated grand mean as a
standard error while it labels the precision of the estimated total as a standard deviation! Both are correct -
a standard error is a standard deviation - not of individual units in the population - but of the estimates over
repeated sampling from the same population. I think it is clearer to label both as standard errors to avoid any
confusion.
If separate analyses are wanted for each stratum, the SURVEYMEANS procedure has to be run a second
time with a BY statement to estimate the means and totals in each stratum.
Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using
SAS.
When should the various estimates be used?
In a stratified sample, there are many estimates that are obtained with different standard errors. It can
sometimes be confusing as to which estimate is used for which purpose.
Here is a brief review of the four possible estimates and the level of interest in each estimate.
Parameter: Stratum mean
Estimator: $\hat{\mu}_h = \overline{Y}_h$
se: $se(\hat{\mu}_h) = \sqrt{\frac{s_h^2}{n_h}\left(1 - f_h\right)}$
Example and interpretation: Stratum 1. Estimate is 322; se of 56.8 (not shown). The estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July.
Who would be interested in this quantity: A fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

Parameter: Stratum total
Estimator: $\hat{\tau}_h = N_h \hat{\mu}_h = N_h \overline{Y}_h$
se: $se(\hat{\tau}_h) = N_h \sqrt{\frac{s_h^2}{n_h}\left(1 - f_h\right)}$
Example and interpretation: Stratum 1. Estimate is 80,500 = 250 × 322; se of 14,195 = 250 × 56.8. The estimated total catch over all boats on 29 July was 80,500 (se 14,195).
Who would be interested in this quantity: DFO, which wishes to estimate the TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

Parameter: Grand total
Estimator: $\hat{\tau} = \hat{\tau}_1 + \hat{\tau}_2$
se: $se(\hat{\tau}) = \sqrt{se(\hat{\tau}_1)^2 + se(\hat{\tau}_2)^2}$
Example and interpretation: Estimate is 131,750 = 80,500 + 51,250; se is $\sqrt{14195^2 + 8492^2} = 16{,}541$. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish).
Who would be interested in this quantity: DFO, which wishes to know the total catch over the entire fishing season so that impacts on the stock can be examined.

Parameter: Grand average
Estimator: $\hat{\mu} = \hat{\tau}/N$
se: $se(\hat{\mu}) = se(\hat{\tau})/N$
Example and interpretation: Grand mean (not shown). N = 500 vessel-days. Estimate is 131,750/500 = 263.5; se is 16541/500 = 33.0. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish).
Who would be interested in this quantity: A fisher who wants to know the average catch per boat per day for the entire season to see if it will meet expenses.
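The rollup above can be reproduced with a short calculation. The sketch below assumes only the stratum summary statistics from the sockeye example (N_h = 250 vessel-days and n_h = 15 sampled boats per day, stratum means 322 and 205, standard deviations 226.8 and 135.7); small differences in the last digit from the quoted 14,195 and 16,541 are due to rounding of the inputs.

```python
import math

# Sockeye example: two strata (days); N_h = 250 vessel-days each, n_h = 15 sampled
strata = [
    {"N": 250, "n": 15, "mean": 322.0, "sd": 226.8},  # 29 July
    {"N": 250, "n": 15, "mean": 205.0, "sd": 135.7},  # 30 July
]

totals, se_totals = [], []
for h in strata:
    f = h["n"] / h["N"]                                  # sampling fraction
    se_mean = math.sqrt(h["sd"] ** 2 / h["n"] * (1 - f)) # se of stratum mean
    totals.append(h["N"] * h["mean"])                    # stratum total
    se_totals.append(h["N"] * se_mean)                   # se of stratum total

grand_total = sum(totals)                                          # 131750
se_grand_total = math.sqrt(sum(se ** 2 for se in se_totals))       # about 16541

N = sum(h["N"] for h in strata)                                    # 500 vessel-days
grand_mean = grand_total / N                                       # 263.5
se_grand_mean = se_grand_total / N                                 # about 33.1
```

The same four quantities (stratum totals, grand total, grand mean and their se's) fall out of the intermediate lists, mirroring the spreadsheet rollup.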
3.7.6 Sample Size for Stratified Designs
As before, the question arises as to how many units should be selected in stratified designs. Two questions need to be answered. First, what is the total sample size required? Second, how should this total be allocated among the strata?
The total sample size can be determined using the same methods as for a simple random sample. I would suggest that you initially ignore the fact that the design will be stratified when finding the initial required total sample size. If stratification proves to be useful, then your final estimate will be more precise than you anticipated (always a nice thing to happen!), but seeing as you are making guesses as to the standard deviations and necessary precision required, I wouldn't worry too much about the extra cost in sampling.
If you must, it is possible to derive formulae for the overall sample sizes when accounting for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet where a single cell is the total sample size and all other cells in the formula depend upon this quantity according to the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:
The sizes (or relative sizes) of each stratum (i.e. the $N_h$ or $W_h$).
The standard deviation of measurements in each stratum. This can be obtained from past surveys, a
literature search, or expert opinion.
The desired precision overall and if needed, for each stratum.
Again refer to the sockeye worksheet.
The standard deviations from this survey will be used as guesses for what might happen next year. As in this year's survey, the total sample size will be allocated evenly between the two days.
In this case, the total sample size must be allocated to the two strata. You will see several methods in a later section to do this, but for now, assume that the total sample will be allocated equally among both strata. Hence the proposed sample size of 75 is split in half to give a proposed sample size of 37.5 in each stratum. Don't worry about the fractional sample size; this is only a planning exercise. We create one cell that has the total sample size, and then use formulae to allocate the total sample size equally to the two strata. The total and the se of the overall total are found as before, and the relative precision (denoted as the relative standard error (rse) and, unfortunately, in some books as the coefficient of variation (cv)) is found as the estimated standard error divided by the estimated total.
Again, this portion of the spreadsheet is set up so that changes in the total sample size are propagated throughout the sheet. If you change the total sample size from 75 to some other number, it is automatically split among the two strata, which then affects the estimated standard error for each stratum, then the estimated standard error for the total, and finally the relative standard error. Again, the proposed total sample size can be varied using trial and error, or the Excel Goal-Seek option can be used.
Here is what happens when a sample size of 75 is used. Don't be alarmed by the fractional sample sizes in each stratum; the goal is again to get a rough feel for the required effort for a certain precision.
Total n=75
Stratum      n   Mean   std dev   vessels   Est total   se(Est total)
29-Jul    37.5    322     226.8       250       80500            8537
30-Jul    37.5    205     135.7       250       51250            5107
Total                                          131750            9948
                                                          rse   7.6%
A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less, as shown below:
Total n=145
Stratum      n   Mean   std dev   vessels   Est total   se(Est total)
29-Jul    72.5    322     226.8       250       80500            5611
30-Jul    72.5    205     135.7       250       51250            3357
Total                                          131750            6539
                                                          rse   5.0%
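The trial-and-error (or Goal-Seek) search can also be sketched in code. This is a sketch, not the course spreadsheet, using the planning guesses above (equal allocation, N_h = 250 vessel-days per day, standard deviations 226.8 and 135.7). Note that an integer search stops at n = 144, essentially the n = 145 found above once fractional per-stratum sizes are allowed.

```python
import math

# Planning guesses taken from this year's sockeye survey
strata = [
    {"N": 250, "mean": 322.0, "sd": 226.8},
    {"N": 250, "mean": 205.0, "sd": 135.7},
]

def rse_of_total(n_total):
    """Relative se of the estimated grand total under equal allocation."""
    est_total, var_total = 0.0, 0.0
    for h in strata:
        n_h = n_total / len(strata)        # equal split (may be fractional)
        f = n_h / h["N"]                   # sampling fraction
        var_total += h["N"] ** 2 * h["sd"] ** 2 / n_h * (1 - f)
        est_total += h["N"] * h["mean"]
    return math.sqrt(var_total) / est_total

print(round(rse_of_total(75), 3))          # about 0.076 with n = 75

# crude "goal seek": smallest integer n giving rse <= 5%
n = 75
while rse_of_total(n) > 0.05:
    n += 1
print(n)                                   # 144, close to the 145 above
```

The function mirrors the spreadsheet: one input cell (the total n) and everything else recomputed from it.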
3.7.7 Allocating samples among strata
There are a number of ways of allocating a sample of size n among the various strata. For example,
1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. $n_h = n/H$. This allocation is best if the variances of the strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey as discussed in previous sections).
2. Proportional allocation. Under proportional allocation, sample sizes are allocated to be proportional to the number of sampling units in the strata, i.e. $n_i = n\frac{N_i}{N} = n\frac{N_i}{\sum N_h} = n\frac{N_i}{N_1 + N_2 + \cdots + N_H} = n W_i$. This allocation is simple to plan and intuitively appealing. However, it is not the best design. This design may waste effort because large strata get large sample sizes, but precision is determined by sample size, not by the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum.
3. Neyman allocation. In Neyman allocation (named after the statistician Neyman), the sample is allocated to minimize the overall standard error for a given total sample size. Tedious algebra shows that the sample should be allocated proportional to the product of the stratum size and the stratum standard deviation, i.e. $n_i = n\frac{W_i S_i}{\sum W_h S_h} = n\frac{N_i S_i}{\sum N_h S_h} = n\frac{N_i S_i}{N_1 S_1 + N_2 S_2 + \cdots + N_H S_H}$. This allocation will be appropriate if the costs of measuring units are the same in all strata. Intuitively, the strata that have the most sampling units should receive more of the sample; strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata.
4. Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs $C_i$ to sample each unit in stratum $i$. Then the total cost of the survey is $C = \sum n_h C_h$. The allocation rule is that sample sizes should be proportional to the product of stratum sizes, stratum standard deviations, and the inverse of the square root of the cost of sampling, i.e. $n_i = n\frac{W_i S_i/\sqrt{C_i}}{\sum_h \left(W_h S_h/\sqrt{C_h}\right)} = n\frac{N_i S_i/\sqrt{C_i}}{\sum_h \left(N_h S_h/\sqrt{C_h}\right)} = n\frac{N_i S_i/\sqrt{C_i}}{N_1 S_1/\sqrt{C_1} + N_2 S_2/\sqrt{C_2} + \cdots + N_H S_H/\sqrt{C_H}}$. This implies that large samples are found in strata that are larger, more variable, or cheaper to sample.
In practice, most of the gain in precision occurs from moving from equal to proportional allocation, while often only small improvements in precision are gained from moving from proportional allocation to Neyman allocation. Similarly, unless cost differences are enormous, there isn't much of an improvement in precision in moving to an allocation based on costs.
Example - estimating the size of a caribou herd

This section is based on the paper:
Siniff, D.B. and Skoog, R.O. (1964). Aerial Censusing of Caribou Using Stratified Random Sampling. The Journal of Wildlife Management, 28, 391-401. http://dx.doi.org/10.2307/3798104
Some of the values have been modified slightly for illustration purposes.
The authors wished to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type. The survey area was divided into six strata based on habitat type. The survey design is to divide each stratum into 4 km² quadrats that will be randomly selected. The number of caribou in the quadrats will be counted from an aerial photograph.
The computations are available in the caribou tab in the Excel workbook ALLofData.xls available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The key point to examining different allocations is to make a single cell represent the total sample size and then make the formula in each of the stratum sample sizes a function of the total. The total sample size can be found by varying the sample total until the desired precision is found.
Results from a previous year's survey: Here are the summary statistics from the survey in a previous year:

Map-squares sampled
Stratum   N_h   n_h    ybar       s   Est total   se(total)
1         400    98    24.1    74.7        9640        2621
2          40    10    25.6    63.7        1024         698
3         100    37   267.6   589.5       26760        7693
4          40     6   179.0   151.0        7160        2273
5          70    39   293.7   351.5       20559        2622
6         120    21    33.2    99.0        3984        2354
Total     770   211                        69127        9172
The estimated size of the herd is 69,127 animals with an estimated se of 9,172 animals.
Equal allocation
What would happen if an equal allocation were used? We now split the total sample size of 211 equally among the 6 strata. In this case, the sample sizes are fractional, but this is OK as we are interested only in planning to see what would have happened. Notice that the estimate of the overall population total would NOT change, but the se changes.

Stratum   N_h    n_h    ybar       s   Est total   se(total)
1         400   35.2    24.1    74.7        9640        4810
2          40   35.2    25.6    63.7        1024         149
3         100   35.2   267.6   589.5       26760        8005
4          40   35.2   179.0   151.0        7160         354
5          70   35.2   293.7   351.5       20559        2927
6         120   35.2    33.2    99.0        3984        1684
Total     770    211                       69127        9938
An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail,
you see that far too many samples are allocated in an equal allocation to strata 2 and 4 and not enough to
strata 1 and 3.
Proportional allocation
What about proportional allocation? Now the sample size is proportional to the stratum population sizes. For example, the sample size for stratum 1 is found as 211 × 400/770. The following results are obtained:

Stratum   N_h     n_h    ybar       s   Est total   se(total)
1         400   109.6    24.1    74.7        9640        2431
2          40    11.0    25.6    63.7        1024         656
3         100    27.4   267.6   589.5       26760        9596
4          40    11.0   179.0   151.0        7160        1554
5          70    19.2   293.7   351.5       20559        4787
6         120    32.9    33.2    99.0        3984        1765
Total     770     211                       69127       11263
This has an even worse standard error! It looks like not enough samples are placed in stratum 3 or 5.
Optimal allocation
What if both the stratum sizes and the stratum variances are to be used in allocating the sample? We create a new column (at the extreme right) which is equal to $N_h S_h$. Now the sample sizes are proportional to these values, i.e. the sample size for the first stratum is now found as 211 × 29866.4/133893.8. Again the estimate of the total doesn't change, but the se is reduced.
Stratum   N_h    n_h    ybar       s   Est total   se(total)    N_h S_h
1         400   47.1    24.1    74.7        9640        4089    29866.4
2          40    4.0    25.6    63.7        1024        1206     2550.0
3         100   92.9   267.6   589.5       26760        1629    58953.9
4          40    9.5   179.0   151.0        7160        1709     6039.6
5          70   38.8   293.7   351.5       20559        2639    24607.6
6         120   18.7    33.2    99.0        3984        2522    11876.4
Total     770    211                       69127        6089   133893.8
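The Neyman allocation and its resulting se can be checked with a short script. This sketch uses the rounded stratum standard deviations from the tables, so the final digit can differ slightly from the 6089 shown (which was computed from unrounded values in the workbook).

```python
import math

# Caribou strata: sizes N_h and guessed standard deviations s_h
N = [400, 40, 100, 40, 70, 120]
s = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]
n_total = 211

weights = [Nh * sh for Nh, sh in zip(N, s)]            # N_h * S_h
n = [n_total * w / sum(weights) for w in weights]      # Neyman allocation

# se of the estimated grand total under this allocation
var_total = sum(
    Nh ** 2 * sh ** 2 / nh * (1 - nh / Nh)             # with the fpc
    for Nh, sh, nh in zip(N, s, n)
)
se_total = math.sqrt(var_total)

print([round(x, 1) for x in n])    # [47.1, 4.0, 92.9, 9.5, 38.8, 18.7]
print(round(se_total))             # about 6090
```

Swapping in the equal or proportional allocations for `n` reproduces the larger se values (9938 and 11263) shown earlier.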
3.7.8 Example: Estimating the number of tundra swans.
The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak.^5 The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than at most other nesting areas. This example is based on one such survey.^6
Tundra swans are highly visible on their nesting grounds, making them easy to monitor during aerial surveys.
The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years' data provide the following information about the strata:
Density     Total           Past      Past
Stratum     Survey Units    Density   Std Dev
High                  60         20        10
Medium                68         10         6
Low                   58          2         3
Total                186
Based on past years' results and budget considerations, approximately 30 survey units can be sampled. The three strata are all approximately the same total area (number of survey units), so allocations based on stratum area would be approximately equal across strata. However, that would place about 1/3 of the effort into the low density stratum, which typically has fewer birds.
^5 Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7
^6 Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.
It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviations, which is often found in biological surveys). Consequently, an allocation based on stratum density was used. The sum of the density values is 20 + 10 + 2 = 32. A proportional allocation would then place about 30 × 20/32 = 18 units in the high density stratum; about 30 × 10/32 = 9 units in the medium density stratum; and the remainder (3 units) in the low density stratum.
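The density-based allocation is a one-line computation per stratum. A minimal sketch, truncating the fractional allocations and giving the remainder to the low density stratum as in the text:

```python
# Allocate ~30 survey units proportional to past stratum density
density = {"high": 20, "medium": 10, "low": 2}
n_total = 30

total_density = sum(density.values())                                 # 32
alloc = {"high": int(n_total * density["high"] / total_density),      # 18
         "medium": int(n_total * density["medium"] / total_density)}  # 9
alloc["low"] = n_total - alloc["high"] - alloc["medium"]              # remainder: 3

print(alloc)
```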
The survey was conducted with the following results:
Survey Unit   Stratum   Area (km²)   Flocks   Single Birds   Pairs   Total Birds
dilai2 h 148 12 6 24
naka41 h 137 13 15 43
naka43 h 137 6 16 38
naka51 h 16 10 3 2 17
nakb32 h 137 10 10 30
nakb44 h 135 6 18 12 48
nakc42 h 83 4 5 6 21
nakc44 h 109 17 15 47
nakd33 h 134 11 11 33
ugac34 h 65 2 10 22
ugac44 h 138 28 15 58
ugad5/63 h 159 9 20 49
dugad56/4 m 102 7 4 15
guad43 m 137 6 4 14
ugad42 m 137 5 11 15 46
low1 l 143 2 2
low3 l 138 1 1
The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that the swans did not adjust their behavior knowing that certain survey units would be sampled on the poor weather days, and so there is no impact of the missing data other than a loss of precision compared to a survey with the full 30 survey units chosen.
Also notice that blanks in the table (missing values) represent zeros and not really missing data.
Finally, not all of the survey units are the same area. This could introduce additional variation into our data, which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample, so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator which could take the area
of each survey unit into account and potentially lead to more precise estimates.
SAS analysis A copy of the SAS program (tundra.sas) is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are read into SAS in the usual fashion with the code fragment:
data swans;
   infile datalines firstobs=3;
   length survey_unit $10 stratum $1;
   input survey_unit $ stratum $ area num_flocks num_single num_pairs;
   num_swans = num_flocks + num_single + 2*num_pairs;
datalines4;
... datalines inserted here ...
The total number of survey units in each stratum is also read into SAS using the code fragment.
data total_survey_units;
   length stratum $1.;
   input stratum $ _total_;   /* must use _total_ as variable name */
datalines;
h 60
m 68
l 58
;;;;
Notice that the variable that has the number of stratum units must be called _total_ as required by the
SurveyMeans procedure.
Next the data are sorted by stratum (not shown), and the number of actual survey units surveyed in each stratum is found using Proc Means:
proc means data=swans noprint;
by stratum;
var num_swans;
output out=n_units n=n;
run;
Most survey procedures in SAS require the use of sampling weights. These are the reciprocal of the probability of selection. In this case, this is simply the number of units in the stratum divided by the number sampled in each stratum:
data swans;
merge swans total_survey_units n_units;
by stratum;
sampling_weight = _total_ / n;
run;
Now the individual stratum estimates are obtained using the code fragment:

/* first estimate the numbers in each stratum */
proc surveymeans data=swans
      total=total_survey_units  /* inflation factors */
      sum clsum mean clm;
   by stratum;                  /* separate estimates by stratum */
   var num_swans;
   weight sampling_weight;
   ods output statistics=IndivEst;
run;
This gives the output:
Obs   stratum     Mean   StdErr   LowerCLMean   UpperCLMean
1     h          35.83     3.43         28.28         43.38
2     l           1.50     0.49         -4.74          7.74
3     m          25.00    10.27        -19.19         69.19

Obs   stratum      Sum   StdDev    LowerCLSum    UpperCLSum
1     h           2150      206          1697          2603
2     l             87       28          -275           449
3     m           1700      698         -1305          4705
The estimates in the L and M strata are not very precise because of the small number of survey units selected. SAS has incorporated the finite population correction factor when estimating the se for the individual stratum estimates.
We estimate that about 2000 swans are present in the H stratum, about 1700 in the M stratum, and fewer than 100 in the L stratum. The grand total is found by adding the estimated totals from the strata: 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way: $\sqrt{206^2 + 28^2 + 698^2} = 729$.
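The hand rollup can be verified directly. Using the rounded se's from the output gives 728 rather than 729; SAS carries the unrounded values.

```python
import math

# Stratum totals and their se's from the SAS output (rounded values)
totals = {"h": 2150, "l": 87, "m": 1700}
ses = {"h": 206, "l": 28, "m": 698}

grand_total = sum(totals.values())                           # 3937
se_grand = math.sqrt(sum(se ** 2 for se in ses.values()))    # about 728

print(grand_total, round(se_grand))
```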
Proc SurveyMeans can be used to estimate the grand total over all strata using the code fragment:

/* now to estimate the grand total */
proc surveymeans data=swans
      total=total_survey_units  /* inflation factors for each stratum */
      sum clsum mean clm;       /* want to estimate grand totals */
   title2 'Estimate total number of swans';
   strata stratum / list;       /* which variable defines the strata */
   var num_swans;               /* which variable to analyze */
   weight sampling_weight;      /* sampling weight for each obs */
   ods output statistics=FinalEst;
run;
This gives the output:
Obs Mean StdErr LowerCLMean UpperCLMean
1 21.17 3.92 12.77 29.57
Obs Sum StdDev LowerCLSum UpperCLSum
1 3937 729 2374 5500
The standard error is larger than desired, mostly because of the very small sample size in the M stratum
where only 3 of the 9 proposed survey units could be surveyed.
3.7.9 Post-stratification
In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling because the value of the variable used for stratification is only available after the unit is sampled. For example,
we wish to stratify a sample of baby births by birth weight to estimate the proportion of birth defects;
we wish to stratify by family size when looking at day care costs;
we wish to stratify by soil moisture, but this can only be measured when the plot is actually visited.
We don't know the birth weight, the family size, or the soil moisture until after the data are collected.
There is nothing formally wrong with post-stratification, and it can lead to substantial improvements in precision.
How would post-stratification work in practice? Suppose that 20 quadrats (each 1 m²) were sampled out of a 100 m² survey area using a simple random sample, and the number of insect grubs counted in each quadrat. When the units were sampled, the soil was classified into high or low quality habitat for these grubs:
Grubs Post-strat
10 h
2 l
3 l
8 h
1 l
3 l
11 h
2 l
2 l
11 h
17 h
1 l
0 l
11 h
15 h
2 l
2 l
4 l
2 l
1 l
The overall mean density is estimated to be 5.40 insects/m² with a se of 1.17 insects/m² (ignoring any fpc). The estimated total number of insects over all 100 m² of the study area is 100 × 5.40 = 540 insects, with a se of 100 × 1.17 = 117 insects.
Now suppose we look at the summary statistics by the post-stratification variable.
If the areas of the post-strata are known (and this is NOT always possible), you can use the standard rollup for a stratified design. Suppose that there were 30 m² of high quality habitat and 70 m² of low quality habitat. Then the roll-up proceeds as before.
Now the estimated total number of grubs is 490 with a se of 40, a substantial improvement over the non-stratified analysis. The difference in the estimates (i.e. 540 vs. 490) is well within the range of uncertainty summarized by the standard errors.
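The post-stratified rollup can be reproduced from the quadrat data above. This sketch assumes the stated stratum areas (30 m² high quality, 70 m² low quality) and, as in the text, ignores the fpc.

```python
import math

# Grub counts and post-stratum labels for the 20 sampled quadrats
counts = [10, 2, 3, 8, 1, 3, 11, 2, 2, 11, 17, 1, 0, 11, 15, 2, 2, 4, 2, 1]
labels = ["h", "l", "l", "h", "l", "l", "h", "l", "l", "h",
          "h", "l", "l", "h", "h", "l", "l", "l", "l", "l"]
area = {"h": 30, "l": 70}   # m² of each habitat in the 100 m² study area

est_total, var_total = 0.0, 0.0
for stratum, A in area.items():
    vals = [c for c, lab in zip(counts, labels) if lab == stratum]
    n = len(vals)
    mean = sum(vals) / n
    s2 = sum((v - mean) ** 2 for v in vals) / (n - 1)   # sample variance
    est_total += A * mean                               # expand mean density to area
    var_total += A ** 2 * s2 / n                        # ignoring the fpc

se_total = math.sqrt(var_total)
print(round(est_total), round(se_total))    # about 490 and 40
```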
There are several potential problems when using post-stratification.
The sample size in each post-stratum cannot be controlled. This implies it is not possible to use any of the allocation methods discussed earlier to improve precision. As well, the survey may end up with a very small sample size in some strata.
The reported se must be increased to account for the fact that the sample size in each stratum is no longer fixed. This introduces an additional source of variation for the estimate, i.e. estimates will vary from sample to sample not only because a new sample is drawn each time, but also because the sample size within a stratum will change. In practice, however, this is rarely a problem because the actual increase in the se is usually small, and this additional adjustment is rarely ever done.
In the above example, the area of each stratum in the ENTIRE study area could be found after the fact. But in some cases, it is impossible to find the area of each stratum in the entire study area, and so the rollup could not be done. In these cases, you could use the results from the post-stratification to also estimate the area of each stratum, but now the expansion factor for each stratum also has a se, and this must also be taken into account. Please consult a standard book on sampling theory for details.
3.7.10 Allocation and precision - revisited
A student wrote:
I'm a little confused about sample allocation in stratified sampling. Earlier in the course, you stated that precision is independent of the population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation, where larger strata get larger sample sizes.
Both statements are correct. If you are interested in estimates for the individual populations, then the absolute sample size is important.
If you wanted equally precise estimates for BOTH Canada and the US, then you would have equal sample sizes from both populations, say 1000 from each, even though their overall population sizes differ by a factor of 10:1.
However, in stratified sampling designs, you may also be interested in the OVERALL estimate over both populations. In this case, a proportional allocation, where sample size is allocated proportional to population size, often performs better. Here, the overall sample of 2000 people would be allocated proportional to the population sizes as follows:
Stratum   Population    Fraction of total population   Sample size
US        300,000,000                            91%   91% × 2000 = 1818
Canada     30,000,000                             9%    9% × 2000 = 181
Total     330,000,000                           100%   2000
Why does this happen? Well, if you are interested in the overall population, then the US results essentially drive everything, and Canada has little effect on the overall estimate. Consequently, it doesn't matter that the Canadian estimate is not as precise as the US estimate.
3.8 Ratio estimation in SRS - improving precision with auxiliary information

An association between the measured variable of interest and a second variable can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A simple random sample of plots is selected, and the height of trees in each sample plot is measured along with the soil nitrogen content in the plot. A regression model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship between the two variables, and the stronger the relationship, the better it will perform. This technique is often called ratio estimation or regression estimation.
Notice that multi-phase designs often use an auxiliary variable, but there the second variable is only measured on a subset of the sample units; this should not be confused with the ratio estimators in this section.
Ratio estimation has two purposes. First, in some cases, you are interested in the ratio of two variables,
e.g. what is the ratio of wolves to moose in a region of the province.
Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.
We define the population ratio as $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$. Here $Y$ is the variable of interest; $X$ is a secondary variable not really of interest. Note that notation differs among books; some books reverse the roles of $X$ and $Y$.
Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratios and the ratio-of-means estimators. Suppose you had the following data for $Y$ and $X$, which represent the counts of animals of species 1 and 2 taken on 3 different days:
Sample
1 2 3
Y 10 100 20
X 3 20 1
The mean-of-ratios estimator would compute the estimated ratio between $Y$ and $X$ as:
$$\hat{R}_{\text{mean-of-ratios}} = \frac{\frac{10}{3} + \frac{100}{20} + \frac{20}{1}}{3} = 9.44$$
while the ratio-of-means estimator would be computed as:
$$\hat{R}_{\text{ratio-of-means}} = \frac{(10 + 100 + 20)/3}{(3 + 20 + 1)/3} = \frac{10 + 100 + 20}{3 + 20 + 1} = 5.42$$
Which is better?
The mean-of-ratios estimator should be used when you wish to give equal weight to each pair of numbers regardless of the magnitude of the numbers. For example, you may have three plots of land, and you measure $Y$ and $X$ on each plot, but because of observer efficiencies that differ among plots, the raw numbers cannot be compared. For example, on a cloudy, rainy day it is hard to see animals (first case), but on a clear, sunny day it is easy to see animals (second case). The actual numbers themselves cannot be combined directly.
The ratio-of-means estimator (considered in this chapter) gives every value of $Y$ and $X$ equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important, as we are interested in the ratio over the entire population of animals. Hence, by adding the values of $Y$ and $X$ first, each animal is given equal weight.
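The two estimators differ only in where the averaging happens, as a small computation with the toy counts shows:

```python
# Toy data from the text: counts of species 1 (Y) and species 2 (X) on 3 days
Y = [10, 100, 20]
X = [3, 20, 1]

r_mean_of_ratios = sum(y / x for y, x in zip(Y, X)) / len(Y)   # average the 3 ratios
r_ratio_of_means = sum(Y) / sum(X)                             # ratio of the totals

print(round(r_mean_of_ratios, 2))   # 9.44
print(round(r_ratio_of_means, 2))   # 5.42 (130/24)
```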
When is a ratio estimator better - what other information is needed? The higher the correlation between $X_i$ and $Y_i$, the better the ratio estimator is compared to a simple expansion estimator. It turns out that the ratio estimator is the best linear estimator if:
the relation between $Y_i$ and $X_i$ is linear through the origin;
the variation around the regression line is proportional to the $X$ value, i.e. the spread around the regression line increases as $X$ increases, unlike an ordinary regression line where the spread is assumed to be constant in all parts of the line.
In practice, plot $y_i$ vs. $x_i$ from the sample and see what type of relation exists.
When can a ratio estimator be used? A ratio estimator requires that another variable (the $X$ variable) be measured on the selected sampling units. Furthermore, if you are estimating the overall mean or total, the total value of the $X$ variable over the entire population must also be known. For example, as seen in the examples to come, the total area must be known to estimate the total number of animals once the density (animals/ha) is known.
3.8.1 Summary of Main results

Quantity: Ratio
Population value: $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$
Sample estimate: $r = \frac{\hat{\tau}_Y}{\hat{\tau}_X} = \frac{\overline{y}}{\overline{x}}$
se: $se(r) = \sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1 - f)}$

Quantity: Total
Population value: $\tau_Y = R\,\tau_X$
Sample estimate: $\hat{\tau}_{Y,ratio} = r\,\tau_X$
se: $se(\hat{\tau}_{Y,ratio}) = \tau_X\sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1 - f)}$

Quantity: Mean
Population value: $\mu_Y = R\,\mu_X$
Sample estimate: $\hat{\mu}_{Y,ratio} = r\,\mu_X$
se: $se(\hat{\mu}_{Y,ratio}) = \mu_X\sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1 - f)}$

Notes
Don't be alarmed by the apparent complexity of the formulae above. They are relatively simple to implement in spreadsheets.
The term $s_{diff}^2 = \frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n - 1}$ is computed by creating a new column $y_i - r x_i$ and finding the square of the sample standard deviation of this new derived variable. This will be illustrated in the examples.
In some cases the $\mu_X^2$ in the denominator may or may not be known, and it or its estimate $\overline{x}^2$ can be used in its place. There doesn't seem to be any empirical evidence that either is better.
The term $\tau_X^2/\mu_X^2$ reduces to $N^2$.
Confidence intervals: Confidence limits are found in the usual fashion. In general, the distribution of $r$ is positively skewed, and so the upper bound is usually too small. This skewness is caused by the variation in the denominator of the ratio. For example, suppose that a random variable $Z$ has a uniform distribution between 0.5 and 1.5, centered on 1. The inverse of the random variable (i.e. $1/Z$) now ranges between 0.667 and 2, which is no longer symmetric around 1. So if a symmetric confidence interval is created, its width will tend not to match the true distribution. This skewness is not generally a problem if the sample size is at least 30 and the relative standard errors of $\overline{y}$ and $\overline{x}$ are both less than 10%.
Sample size determination: The appropriate sample size to obtain a specified size of confidence interval can be found by inverting the formulae for the se of the ratio. This can be done on a spreadsheet using trial and error or the goal-seek feature of the spreadsheet, as illustrated in the examples that follow.
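The formulae in this summary can be sketched in a few lines of code. The small Y/X data set is the toy example from earlier in the section; the population size N and the known X total tau_X are hypothetical values invented purely for illustration, and $\overline{x}$ is used in place of $\mu_X$ in the se.

```python
import math

Y = [10, 100, 20]
X = [3, 20, 1]
N, tau_X = 30, 240      # hypothetical population size and known total of X

n = len(Y)
r = sum(Y) / sum(X)                       # ratio-of-means estimator
xbar = sum(X) / n                         # used in place of mu_X
f = n / N                                 # sampling fraction

d = [y - r * x for y, x in zip(Y, X)]     # derived column y_i - r*x_i
s2_diff = sum(di ** 2 for di in d) / (n - 1)

se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - f))
tau_hat = r * tau_X                       # estimated total of Y
se_tau = tau_X * se_r                     # se of the estimated total

print(round(r, 3), round(se_r, 3))
print(round(tau_hat, 1), round(se_tau, 1))
```

The same derived-column trick (d = y - r*x) is exactly what the spreadsheet examples that follow implement.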
3.8.2 Example - wolf/moose ratio
[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs interchanges the use of x and y in
the ratio.]
Wildlife ecologists interested in measuring the impact of wolf predation on moose populations in BC
obtained estimates by aerial counting of the population size of wolves and moose on 11 sub-areas (all roughly
equal size) selected as SRSWOR from a total of 200 sub-areas in the game management zone.
In this example, the actual ratio of wolves to moose is of interest.
Here are the raw data:
Sub-areas Wolves Moose
1 8 190
2 15 370
3 9 460
4 27 725
5 14 265
6 3 87
7 12 410
8 19 675
9 7 290
10 10 370
11 16 510
What is the population and parameter of interest?
As in previous situations, there is some ambiguity:
The population of interest is the 200 sub-areas in the game-management zone. The sampling units are
the 11 sub-areas. The response variables are the wolf and moose populations in the game management
sub-area. We are interested in the wolf/moose ratio.
The populations of interest are the moose and wolves. If individual measurements were taken of each animal, then this definition would be fine. However, only the total number of wolves and moose within each sub-area are counted - hence a more proper description of this design would be a cluster design. As you will see in a later section, the analysis of a cluster design starts by summing to the cluster level and then treating the clusters as the population and sampling unit, as is done in this case.
Having said this, do the number of moose and wolves measured on each sub-area include young moose
and young wolves or just adults? How will immigration and emigration be taken care of?
What was the frame? Was it complete?
The frame consists of the 200 sub-areas of the game management zone. Presumably these 200 sub-areas
cover the entire zone, but what about emigration and immigration? Moose and wolves may move into and
out of the zone.
What was the sampling design?
It appears to be an SRSWOR design - the sampling units are the sub-areas of the zone.
How did they determine the counts in the sub-areas? Perhaps they simply looked for tracks in the snow in winter - it seems difficult to get estimates from the air in summer when there is lots of vegetation blocking the view.
Excel analysis
A copy of the worksheet to perform the analysis of this data is called wolf and is available in the Allofdata workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Here is a summary shot of the spreadsheet:
Assessing conditions for a ratio estimator
The ratio estimator works well if the relationship between Y and X is linear, through the origin, with
increasing variance with X. Begin by plotting Y (wolves) vs. X (moose).
The data appears to satisfy the conditions for a ratio estimator.
Compute summary statistics for both Y and X
Refer to the screen shot of the spreadsheet. The Excel built-in functions are used to compute the sample size, sample mean, and sample standard deviation for each variable.
Compute the ratio
The ratio is computed using the formula for a ratio estimator in a simple random sample, i.e.

r = ȳ / x̄
Compute the difference variable
Then for each observation, the difference between the observed Y (the actual number of wolves) and the predicted Y based on the number of moose (Ŷ_i = rX_i) is found. Notice that the sum of the differences must equal zero.
The standard deviation of the differences will be needed to compute the standard error for the estimated
ratio.
Estimate the standard error of the estimated ratio
Use the formula given at the start of the section.
Final estimate Our final result is that the estimated ratio is 0.03217 wolf/moose with an estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval would be computed in the usual fashion.
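The spreadsheet steps above can be sketched in plain Python (standard library only; the raw counts are the ones tabulated earlier in this example). This is an illustrative re-implementation of the difference-variable calculation, not the author's code:

```python
import math

# Raw data from the 11 sampled sub-areas (n = 11 of N = 200).
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose  = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
n, N = len(wolves), 200

ybar = sum(wolves) / n
xbar = sum(moose) / n
r = ybar / xbar                     # ratio estimator r = ybar/xbar

# Difference variable diff_i = y_i - r*x_i; its sum is zero by construction.
diff = [y - r * x for y, x in zip(wolves, moose)]
s_diff = math.sqrt(sum(d * d for d in diff) / (n - 1))

# se(r) = sqrt( (1/xbar^2) * (s_diff^2/n) * (1 - n/N) )
se_r = math.sqrt((1 / xbar**2) * (s_diff**2 / n) * (1 - n / N))

print(round(r, 5), round(se_r, 5))  # 0.03217 and 0.00244, matching the text
```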
Planning for future surveys
Our final estimate has an approximate rse of 0.00244/.03217 = 7.5%, which is pretty good. You could try different n values to see what sample size would be needed to get a rse of better than 5%, or perhaps this is too precise and you only want a rse of about 10%.
As an approximate answer, recall that the se usually varies as 1/√n. A rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size, or about n_new = 2.25 × 11 ≈ 25 units (ignoring the fpc).
If the raw data are available, you can also do a bootstrap selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the ratio and its se, and then increase the sample size until the required se is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use the Table→Subset→Random Sample Size menu to get the approximate bootstrap sample. Again compute the ratio and its se, and increase the sample size until the required precision is obtained.
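A minimal stand-in for the SAS/JMP approach can be sketched in plain Python (standard library only; the data are the wolf/moose counts from this example). It resamples sub-area pairs with replacement and watches the spread of the bootstrapped ratios shrink as the hypothetical sample size grows:

```python
import random
import statistics

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose  = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]

def boot_se(n, reps=2000, seed=42):
    """Bootstrap se of the wolf/moose ratio for a hypothetical sample size n."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        # Sample n sub-areas WITH replacement and recompute the ratio.
        idx = [rng.randrange(len(wolves)) for _ in range(n)]
        ratios.append(sum(wolves[i] for i in idx) / sum(moose[i] for i in idx))
    return statistics.stdev(ratios)

se11 = boot_se(11)   # roughly the observed se (no fpc)
se33 = boot_se(33)   # tripling the sample size shrinks the se by about sqrt(3)
```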
If you want to be more precise about this, notice that the formula for the se of a ratio is found as:

se(r) = √[ (1/x̄²) × (s²_diff / n) × (1 − f) ]

From the spreadsheet we extract various values and find that the se of the ratio is

se(r) = √[ (1/395.64²) × (3.29²/n) × (1 − n/200) ]

Different values of n can be tried until the rse is 5%. This gives a sample size of about 24 units.
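The trial-and-error (goal seek) search is easy to sketch in Python (an illustration, not the spreadsheet itself); the constants below are the spreadsheet values quoted in this example (s_diff = 3.29, x̄ = 395.64, r = 0.03217, N = 200), and the loop finds the smallest n whose rse is at most 5%:

```python
import math

s_diff, xbar, r, N = 3.29, 395.64, 0.03217, 200

def rse(n):
    """Relative standard error of the ratio at sample size n (with fpc)."""
    se = math.sqrt((1 / xbar**2) * (s_diff**2 / n) * (1 - n / N))
    return se / r

# Trial and error, as with the spreadsheet's goal seek.
n = 2
while rse(n) > 0.05:
    n += 1
print(n)   # 24 sub-areas, matching the text
```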
If the actual raw data are not available, all is not lost. You would require the approximate MEAN of X (μ_X), the standard DEVIATION of Y, the standard DEVIATION of X, the CORRELATION between Y and X, the approximate ratio (R), and the approximate number of total sample units (N). The correlation determines how closely Y can be predicted from X and essentially determines how much better you will do using a ratio estimator. If the correlation is zero, there is NO gain in precision using a ratio estimator over a simple mean.
The se of r is then found as:

se(r) = √[ (1/μ²_X) × ( V(y) + R²·V(x) − 2R·corr(y,x)·√(V(x)·V(y)) ) / n × (1 − n/N) ]
Different values of n can be tried to obtain the desired rse. This is again illustrated on the spreadsheet.
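As a sketch of this formula in Python (standard library only): the summary values below were computed from the raw wolf/moose data of this example rather than quoted in the text, so treat them as approximate; the result should nonetheless agree with the earlier se of about 0.00244.

```python
import math

# Summary statistics derived from the wolf/moose data (n = 11 of N = 200).
mu_x = 395.64     # approximate mean of X (moose)
sd_y = 6.574      # standard deviation of Y (wolves)
sd_x = 192.62     # standard deviation of X (moose)
corr = 0.869      # correlation between Y and X
R    = 0.03217    # approximate ratio
n, N = 11, 200

Vy, Vx = sd_y**2, sd_x**2
se_r = math.sqrt(
    (1 / mu_x**2)
    * (Vy + R**2 * Vx - 2 * R * corr * math.sqrt(Vx * Vy))
    / n
    * (1 - n / N)
)
print(round(se_r, 5))   # approximately 0.00244
```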
SAS Analysis
The above computations can also be done in SAS with the program wolf.sas available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. It uses Proc SurveyMeans which gives the output contained in wolf.lst.
The SAS program again starts with the DATA step to read in the data.

data wolf;
   input subregion wolf moose;

Because the sampling weights are equal for all observations, it is not necessary to include them when estimating a ratio (the weights cancel out in the formula used by SAS).
The Gplot procedure creates a plot similar to that in the Excel spreadsheet.

proc gplot data=wolf;
   title2 'plot to assess assumptions';
   plot wolf*moose;
run;
Finally, the SurveyMeans procedure does the actual computation:

proc surveymeans data=wolf ratio clm N=200;
   title2 'Estimate of wolf to moose ratio';
   /* ratio clm - request a ratio estimator with confidence intervals */
   /* N=200 specifies total number of units in the population */
   var moose wolf;
   ratio wolf/moose;       /* this statement asks for the ratio estimator */
   ods output Ratio=ratio;
run;
The RATIO statement in the SURVEYMEANS procedure requests the computation of the ratio estimator.
Here is the output:
Obs Numerator Variable Denominator Variable Ratio LowerCL StdErr UpperCL
1 wolf moose 0.032169 0.02673676 0.002438 0.03760148
The results are identical to those from the spreadsheet.
Again, it is easier to do planning in the Excel spreadsheet rather than in the SAS program.
CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:
- Assumes a simple random sample. If your data were NOT collected using a simple random sample, then ordinary regression methods should NOT be used.
- Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.
- Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem. For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.
Using ordinary regression
Because the ratio estimator assumes that the variance of the response increases with the value of X, a
new column representing the inverse of the X variable (i.e. 1/the number of moose) has been created.
We start by plotting the data to assess if the relationship is linear and through the origin. The Y variable
is the number of wolves; the X variable is the number of moose. If the relationship is not through the origin,
then a more complex analysis called Regression estimation is required.
The graph looks like it is linear through the origin, which is one of the assumptions of the ratio estimator.
Now we wish to fit a straight line THROUGH THE ORIGIN. By default, most computer packages include the intercept, which we want to force to zero. We must also specify that the inverse of the X variable (1/X) is the weighting variable.
We see that the estimated ratio (.032 wolves/moose) matches the Excel output, but the estimated standard error (.0026) does not quite match Excel. The difference is a bit larger than can be accounted for by not using the finite population correction factor.
As a matter of interest, if you repeat the analysis WITHOUT using the inverse of the X variable as the weighting variable, you obtain an estimated ratio of .0317 (se .0022). All of these estimates are similar and it likely makes very little difference which is used.
Finding the required sample size is trickier because of the weighted regression approach used by the packages, the slightly different way the se is computed, and the lack of a fpc. The latter two issues are usually not important in determining the approximate sample size, but the first issue is crucial.
Start by REFITTING Y vs. X WITHOUT using the weighting variable. This will give you roughly the same estimate and se, but now it is much easier to extract the necessary information for sample size determination.
When the UNWEIGHTED model is fit, you will see that the Root Mean Square Error has the value of 3.28. This is the value of s_diff that is needed. The approximate se for r (ignoring the fpc) is

se(r) ≈ s_diff / (x̄ √n) = 3.28 / (395.64 √n)

Again, different values of n can be tried to get the appropriate rse. This gives an n of about 25 or 26, which is sufficient for planning purposes.
Post mortem
No population numbers can be estimated using the ratio estimator in this case because of a lack of suitable
data.
In particular, if you had wanted to estimate the total wolf population, you would have to use the simple inflation estimator that we discussed earlier unless you had some way of obtaining the total number of moose that are present in the ENTIRE management zone. This seems unlikely.
However, refer to the next example, where the appropriate information is available.
3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population
total
In some cases, a ratio estimator is used to estimate a population total. In these cases, the improvement in
precision is caused by the close relationship between two variables.
Note that the population total of the auxiliary variable will have to be known in order to use this method.
Grouse Numbers
A wildlife biologist has estimated the grouse population in a region containing isolated areas (called
pockets) of bush as follows: She selected 12 pockets of bush at random, and attempted to count the numbers
of grouse in each of these. (One can assume that the grouse are almost all found in the bush, and for the
purpose of this question, that the counts were perfectly accurate.) The total number of pockets of bush in the
region is 248, comprising a total area of 3015 hectares. Results are as follows:
Area Number
(ha) Grouse
8.9 24
2.7 3
6.6 10
20.6 36
3.7 8
4.1 8
25.8 60
1.8 5
20.1 35
14.0 34
10.1 18
8.0 22
What is the population of interest and parameter to be estimated?
As before, there is some ambiguity:
The population of interest is the pockets of bush in the region. The sampling unit is the pocket of bush. The number of grouse in each pocket is the response variable.
The population of interest is the grouse. These happen to be clustered into pockets of bush. This leads back to the previous case.
What is the frame?
Here the frame is explicit - the set of all pockets of bush. It isn't clear if all grouse will be found in these pockets - will some be itinerant and hence missed? What about movement between pockets while they are being examined?
Summary statistics
Variable n mean std dev
area 12 10.53 7.91
grouse 12 21.92 16.95
Simple inflation estimator ignoring the pocket areas
If we wish to adjust for the sampling fraction, we can use our earlier results for the simple inflation estimator. Our estimate of the total number of grouse is Ŷ = N·ȳ = 248 × 21.92 = 5435.33 with an estimated se of

se(Ŷ) = N × √[ (s²/n) × (1 − f) ] = 248 × √[ (16.95²/12) × (1 − 12/248) ] = 1183.4.
The estimate isn't very precise with a rse of 1183.4/5435.3 = 22%.
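As a cross-check, the inflation-estimator arithmetic can be sketched in plain Python (standard library only) from the raw grouse counts above; this is an illustration, and the last digit of the se may differ slightly from the spreadsheet because of rounding:

```python
import math

grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
n, N = len(grouse), 248

ybar = sum(grouse) / n
s2 = sum((y - ybar) ** 2 for y in grouse) / (n - 1)   # sample variance

total = N * ybar                                # simple inflation estimator
se_total = N * math.sqrt(s2 / n * (1 - n / N))  # with the fpc
rse = se_total / total

print(round(total, 1), round(se_total, 1), round(rse, 2))
```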
Ratio estimator - why?
Why did the inflation estimator do so poorly? Part of the reason is the relatively large standard deviation in
the number of grouse in the pockets. Why does this number vary so much?
It seems reasonable that larger pockets of brush will tend to have more grouse. Perhaps we can do better
by using the relationship between the area of the bush and the number of grouse through a ratio estimator.
Excel analysis
An Excel worksheet is available in the grouse tab in the AllofData workbook from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Preliminary plot to assess if ratio estimator will work
First plot numbers of grouse vs. area and see if this has a chance of succeeding.
The graph shows a linear relationship, through the origin. There is some evidence that the variance is
increasing with X (area of the plot).
Find the ratio between grouse numbers and area
The spreadsheet is set up similarly to the previous example:
The total of the X variable (area) will need to be known.
As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variable, find the standard deviation of the difference variable, and find the se of the estimated ratio.
The estimated ratio is: r = ȳ/x̄ = 21.92/10.53 = 2.081 grouse/ha.
The se of r is found as

se(r) = √[ (1/x̄²) × (s²_diff / n) × (1 − f) ] = √[ (1/10.533²) × (4.7464²/12) × (1 − 12/248) ] = 0.1269 grouse/ha.
Expand ratio by total of X
In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by

Ŷ = τ_X × r = 3015 × 2.081 = 6273.3 grouse.

To estimate the se of the total, multiply the se of r by 3015 as well: se(Ŷ) = τ_X × se(r) = 3015 × 0.1269 = 382.6 grouse.
The precision is much improved compared to the simple inflation estimator. This improvement is due to the very strong relationship between the number of grouse and the area of the pockets.
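The whole chain - the ratio, its se, and the expansion by the known total area - can be sketched in plain Python (standard library only; an illustration using the raw data tabulated above, not the author's spreadsheet):

```python
import math

area   = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
n, N, tau_x = len(area), 248, 3015      # tau_x = known total area (ha)

xbar = sum(area) / n
r = sum(grouse) / sum(area)             # ratio estimator, grouse per ha

# Difference variable and se of the ratio, as in the spreadsheet.
diff = [y - r * x for y, x in zip(grouse, area)]
s_diff = math.sqrt(sum(d * d for d in diff) / (n - 1))
se_r = math.sqrt((1 / xbar**2) * (s_diff**2 / n) * (1 - n / N))

total = tau_x * r                       # expand the ratio by the total of X
se_total = tau_x * se_r

print(round(r, 3), round(se_r, 4), round(total, 1), round(se_total, 1))
```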
Sample size for future surveys
If you wish to investigate different sample sizes, the simplest way would be to modify the cell corre-
sponding to the count of the differences. This will be left as an exercise for the reader.
The final ratio estimate has a rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.
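Rather than trial and error, the se formula can be inverted directly for n. A Python check (using the quantities computed earlier in this example; an illustration, not the spreadsheet) lands a little over 17 pockets, consistent with the "about 17" above:

```python
s_diff, xbar, r, N = 4.7464, 10.5333, 2.081, 248

# Solve rse(n) = 0.05 for n, where
#   se(r) = (s_diff/xbar) * sqrt(1/n - 1/N)  and  rse = se(r)/r.
target_se = 0.05 * r
n_req = 1 / ((target_se * xbar / s_diff) ** 2 + 1 / N)
print(round(n_req, 1))   # a bit over 17 pockets
```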
SAS analysis
The analysis is done in SAS using the program grouse.sas from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The SAS program starts in the usual fashion:
data grouse;
   input area grouse;   /* sampling weights not needed */
The Data step reads in the data. It is not necessary to include a computation of the sampling weight if the data are collected in a simple random sample; for a ratio estimator the weights will cancel out in the formulae used by SAS.
Proc Gplot creates the standard plot of numbers of grouse vs. the area of each grove.

proc gplot data=grouse;
   title2 'plot to assess assumptions';
   plot grouse*area;
run;
The SurveyMeans procedure can estimate the ratio of grouse/ha but cannot directly estimate the population total.

proc surveymeans data=grouse ratio clm N=248;
   /* the ratio clm keywords request a ratio estimator and a confidence interval */
   title2 'Estimation using a ratio estimator';
   var grouse area;
   ratio grouse / area;
   ods output ratio=outratio;   /* extract information so that total can be estimated */
run;
The ODS statement redirects the results from the RATIO statement to a new dataset that is processed further to multiply by the total area of the pockets.

data outratio;   /* compute estimates of the total */
   set outratio;
   Est_total = ratio   * 3015;
   Se_total  = stderr  * 3015;
   UCL_total = uppercl * 3015;
   LCL_total = lowercl * 3015;
   format est_total se_total ucl_total lcl_total 7.1;
   format ratio stderr lowercl uppercl 7.3;
run;
The output is as follows:
Data Summary
Number of Observations 12
Statistics
Variable Mean Std Error of Mean 95% CL for Mean
grouse 21.916667 4.772130 11.4132790 32.4200543
area 10.533333 2.227746 5.6300968 15.4365699
Ratio Analysis
Numerator Denominator Ratio Std Err 95% CL for Ratio
grouse area 2.080696 0.126893 1.80140636 2.35998605
Obs Ratio StdErr LowerCL UpperCL Est total Se total LCL total UCL total
1 2.081 0.127 1.801 2.360 6273.3 382.6 5431.2 7115.4
The results are exactly the same as before.
Again, it is easiest to do the sample size computations in Excel.
We must first estimate the ratio (grouse/hectare), and then expand this to estimate the overall number of grouse.
CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:
- Assumes that a simple random sample was taken. If the sampling design is not a simple random sample, then regular regression cannot be used.
- Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.
- Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem. For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.
Because the ratio estimator assumes that the variance of the response increases with the value of X, a
new column representing the inverse of the X variable (i.e. 1/area of pocket) has been created.
The graph looks like it is linear through the origin, which is one of the assumptions of the ratio estimator. The estimated density is 2.081 (se .123) grouse/hectare. The point estimate is bang on, and the estimated se is within 1% of the correct se.
This now needs to be multiplied by the total area of the pockets (3015 ha), which gives an estimated total number of grouse of 6274 (se 371). [Again the estimated se is slightly smaller because of the lack of a finite population correction.]
The ratio estimator is much more precise than the inflation estimator because of the strong relationship
between the number of grouse and the area of the pocket.
Post mortem - a question to ponder
What if it were to turn out that grouse population size tended to be proportional to the perimeter of a pocket of
bush rather than its area? Would using the above ratio estimator based on a relationship with area introduce
serious bias into the ratio estimate, increase the standard error of the ratio estimate, or do both?
3.9 Additional ways to improve precision
This section will not be examined on the exams or term tests.
3.9.1 Using both stratification and auxiliary variables
It is possible to use both methods to improve precision. However, this comes at a cost of increased computational complexity.
There are two ways of combining ratio estimators in stratified simple random sampling.
1. combined ratio estimate: Estimate the numerator and denominator using stratified random sampling and then form the ratio of these two estimates:

   r_stratified,combined = Ŷ_stratified / X̂_stratified

   and

   Ŷ_stratified,combined = (Ŷ_stratified / X̂_stratified) × τ_X

   We won't consider the estimates of the se in this course, but they can be found in any textbook on sampling.
2. separate ratio estimator: estimate a ratio for each stratum, and form a grand ratio by taking a weighted average of these estimates. Note that we weight by the covariate totals rather than the stratum sizes. We get the following estimators for the grand ratio and grand total:

   r_stratified,separate = (1/τ_X) × Σ_{h=1..H} τ_{X,h} × r_h

   and

   Ŷ_stratified,separate = Σ_{h=1..H} τ_{X,h} × r_h

   Again, we won't worry about the estimates of the se.
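The mechanics of the two estimators can be sketched numerically in Python; the two-stratum figures below are entirely hypothetical, invented here only to show how the same inputs give slightly different answers:

```python
# Hypothetical stratified survey: per-stratum estimated totals of Y and X,
# plus KNOWN stratum totals of X (needed for the separate estimator).
strata = [
    # (Yhat_h, Xhat_h, tau_Xh)
    (2000.0, 5000.0, 5200.0),
    (2000.0, 4000.0, 3900.0),
]
tau_x = sum(t for _, _, t in strata)   # known population total of X

# Combined: form the stratified totals first, then a single ratio.
r_combined = sum(y for y, _, _ in strata) / sum(x for _, x, _ in strata)
Y_combined = r_combined * tau_x

# Separate: one ratio per stratum, weighted by the stratum X totals.
Y_separate = sum(t * (y / x) for y, x, t in strata)
r_separate = Y_separate / tau_x
```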
Why use one over the other?
- You need the stratum totals of X for the separate estimator, but only the population total of X for the combined estimator.
- The combined ratio is less subject to risk of bias (see Cochran, p. 165 and following). In general, the biases in the separate estimator are added together, and if they fall in the same direction there can be trouble. In the combined estimator these biases are reduced through stratification of the numerator and denominator.
- When the ratio estimate is appropriate (regression through the origin and variance proportional to the covariate), the last term vanishes. Consequently, the combined ratio estimator will have a greater standard error than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, as noted above, the bias may be more severe for the separate ratio estimator. You must consider the combined effects of bias and precision, i.e. the MSE.
3.9.2 Regression Estimators
A ratio estimator works well when the relationship between Y_i and X_i is linear, through the origin, with the variance of observations about the ratio line increasing with X. In some cases, the relationship may be linear, but not through the origin.
In these cases, the ratio estimator is generalized to a regression estimator where the linear relationship is no longer constrained to go through the origin.
We won't be covering this in this course.
Regression estimators are also useful if there is more than one X variable.
Whenever you use a regression estimator, be sure to plot y vs. x to assess if the assumptions for a ratio estimator are reasonable.
CAUTION: If ordinary statistical packages are used to do regression analysis on survey data, you could obtain misleading results because the usual packages ignore the way in which the data were collected. Virtually all standard regression packages assume you've collected data under a simple random sample. If your sampling design is more complex, e.g. stratified design, cluster design, multi-stage design, etc., then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.
3.9.3 Sampling with unequal probability - pps sampling
All of the designs discussed in previous sections have assumed that each sample unit was selected with equal
probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they
differ in their contribution to the overall total. This technique can be used with any of the sampling designs
discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. better
precision) for the same total effort compared to an equal probability design. For example, forest stands may
be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected
with twice the probability that a stand of 100 ha in size) because large stands contribute more to the overall
population and it would be wasteful of sampling effort to spend much effort on smaller stands.
The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit's contribution to the total. For example, in probability proportional to prediction (3P sampling), all trees in a small area are visited. A simple, cheap characteristic is measured which is used to predict the value of the tree. A sub-sample of the trees is then selected with probability proportional to the predicted value, remeasured using a more expensive measuring device, and the relationship between the cheap and expensive measurements in the second phase is used with the simple measurement from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase sampling with unequal probability of selection.
Please consult with a sampling expert before implementing or analyzing an unequal probability sampling
design.
3.10 Cluster sampling
In some cases, units in a population occur naturally in groups or clusters. For example, some animals
congregate in herds or family units. It is often convenient to select a random sample of herds and then
measure every animal in the herd. This is not the same as a simple random sample of animals because
individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example
in the section on simple random sampling is also a cluster sample; all plots along a randomly selected
transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling
units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected
point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.
Some examples of cluster samples are:
urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m² along the line.
aerial surveys - a plane flies along a line and observers count the number of animals they see in a strip on both sides of the aircraft.
forestry surveys - often circular plots are located on the ground and ALL trees within that plot are measured.
Pitfall A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.
Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster totals as the response measure.
3.10.1 Sampling plan
In simple random sampling, a frame of all elements was required in order to draw a random sample. Individual units are selected one at a time. In many cases, this is impractical because it may not be possible to list all of the individual units or it may be logistically impossible to do so. In many cases, the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things on an individual quadrat level, but the actual sampling unit is the cluster.
This problem is analogous to pseudo-replication in experimental design - the breaking of the transect into individual quadrats is like having multiple fish within the tank.
A visual comparison of a simple random sample vs. a cluster sample
You may find it useful to compare a simple random sample of 24 vs. a cluster sample of 24 using the following visual plans:
Select a sample of 24 in each case.
Simple Random Sampling
Describe how the sample was taken.
Cluster Sampling
First, the clusters must be defined. In this case, the units are naturally clustered in blocks of size 8. The following units were selected.
Describe how the sample was taken. Note the differences between stratified simple random sampling and cluster sampling!
3.10.2 Advantages and disadvantages of cluster sampling compared to SRS
Advantage It may not be feasible to construct a frame for every elemental unit, but possible to construct a frame for larger units, e.g. it is difficult to locate individual quadrats upon the sea floor, but easy to lay out transects from the shore.
Advantage Cluster sampling is often more economical. Because all units within a cluster are close
together, travel costs are much reduced.
Disadvantage Cluster sampling has a higher standard error than an SRSWOR of the same total size
because units are typically homogeneous within clusters. The cluster itself serves as the sampling
unit. For the same number of units, cluster sampling almost always gives worse precision. This is the
problem that we have seen earlier of pseudo-replication.
Disadvantage A cluster sample is more difficult to analyze, but with modern computing equipment, this is less of a concern. The difficulties are not arithmetic but rather being forced to treat the clusters as the survey unit - there is a natural tendency to think that data are being thrown away.
The perils of ignoring a cluster design The cluster design is frequently used in practice, but often analyzed incorrectly. For example, whenever the quadrats have been gathered using a transect of some sort, you have a cluster sampling design. The key thing to note is that the sampling unit is the cluster, not the individual quadrats.
The biggest danger of ignoring the clustering aspects and treating the individual quadrats as if they came from an SRS is that, typically, your reported se will be too small. That is, the true standard error from your design may be substantially larger than your estimated standard error obtained from an SRS analysis. The precision is (erroneously) thought to be far better than is justified based on the survey results. This has been seen before - refer to the paper by Underwood where the dangers of estimation with positively correlated data were discussed.
3.10.3 Notation
The key thing to remember is to work with the cluster TOTALS.
Traditionally, the cluster size is denoted by M rather than by X but, as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.
© 2012 Carl James Schwarz. December 21, 2012
Attribute              Population value   Sample value
Number of clusters     $N$                $n$
Cluster totals         $\tau_i$           $y_i$
Cluster sizes          $M_i$              $m_i$
Total size             $M$

NOTE: $\tau_i$ and $y_i$ are the cluster $i$ TOTALS.
3.10.4 Summary of main results
The key concept in cluster sampling is to treat the cluster TOTAL as the response variable and ignore all the
individual values within the cluster. Because the clusters are a simple random sample from the population
of clusters, simply apply all the results you had before for a SRS to the CLUSTER TOTALS.
The analysis of a cluster design will require the size of each cluster - this is simply the number of sub-units within each cluster.
If the clusters are roughly equal in size, a simple inflation estimator can be used.
But in many cases there is a strong relationship between the size of the cluster and the cluster total; in these cases a ratio estimator would likely be more suitable (i.e. will give you a smaller standard error), where the X variable is the cluster size. If there is no relationship between cluster size and the cluster total, a simple inflation estimator can be used even in the case of unequal cluster sizes.
You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship
holds.
Extensions of cluster analysis - unequal size sampling In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total, and would be selected with a higher probability.
Computational formulae

Parameter       Population value                                              Estimator                                                         Estimated se
Overall mean    $\mu = \frac{\sum_{i=1}^{N}\tau_i}{\sum_{i=1}^{N}M_i}$        $\widehat{\mu} = \frac{\sum_{i=1}^{n}y_i}{\sum_{i=1}^{n}m_i}$     $\sqrt{\frac{1}{\overline{m}^{2}}\,\frac{s^2_{diff}}{n}\,(1-f)}$
Overall total   $\tau = M\mu$                                                 $\widehat{\tau} = M\widehat{\mu}$                                 $\sqrt{M^{2}\,\frac{1}{\overline{m}^{2}}\,\frac{s^2_{diff}}{n}\,(1-f)}$

You never use the mean per unit within a cluster.
The term $s^2_{diff} = \frac{\sum_{i=1}^{n}(y_i - \widehat{\mu}\,m_i)^2}{n-1}$ is again found in the same fashion as in ratio estimation - create a new variable which is the difference $y_i - \widehat{\mu}\,m_i$, find the sample standard deviation $s_{diff}$ of it, and then square the standard deviation.
Sometimes the ratio of two variables measured within each cluster is required, e.g. you conduct aerial
surveys to estimate the ratio of wolves to moose - this has already been done in an earlier example! In
these cases, the actual cluster length is not used.
Condence intervals
As before, once you have an estimator for the mean and for its se, use the usual ±2se rule. If the number of clusters is small, then some textbooks advise using a t-distribution for the multiplier; this is not covered in this course.
Sample size determination
Again, this is no real problem - except that you will get a value for the number of CLUSTERS, not the
individual quadrats within the clusters.
3.10.5 Example - estimating the density of urchins
Red sea urchins are considered a delicacy and the fishery is worth several million dollars to British Columbia.
In order to set harvest quotas and to monitor the stock, it is important that the density of sea urchins be determined each year.
To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds.
Divers then swim along the transect, rolling a 1 m² quadrat along the transect line and counting the number of legal-sized and sub-legal-sized urchins in each quadrat.
The number of possible transects is so large that the correction for finite population sampling can be ignored.
The dataset contains variables for the transect, the quadrat within each transect, and the number of legal
and sub-legal sized urchins counted in that quadrat.
What is the population of interest and the parameter?
The population of interest is the sea urchins in the harvest area. These happened to be (artificially) clustered into transects which are sampled. All sea urchins within the cluster are measured.
The parameter of interest is the density of legal sized urchins.
What is the frame?
The frame is conceptual - there is no predefined list of all the possible transects. Rather, they pick random points along the shore and then lay the transects out from those points.
What is the sampling design?
The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured
within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent
of each other and are likely positively correlated (why?).
As the points along the shore were chosen using a simple random sample the analysis proceeds as a SRS
design on the cluster totals.
Excel Analysis
An Excel worksheet with the data and analysis is called urchin and is available in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A reduced view appears below:
Summarize to cluster level
The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format.
In many transect studies, there is a tendency NOT to record quadrats with 0 counts as they don't affect the cluster sum. However, you still have to know the correct size of the cluster (i.e. how many quadrats), so you can't simply ignore these missing values. In this case, you could examine the maximum of the quadrat number and the number of listed quadrats to see if these agree (why?).
Preliminary plot
Plot the cluster totals vs. the cluster size to see if a ratio estimator is appropriate, i.e. linear relationship
through the origin with variance increasing with cluster size.
The plot (not shown) shows a weak relationship between the two variables.
Summary Statistics
Compute the summary statistics on the cluster TOTALS. You will need the totals over all sampled clusters
of both variables.
sum(legal) sum(quad) n(transect)
1507 1120 28
Compute the ratio
The estimated density is then
$\widehat{density} = \frac{sum(legal)}{sum(quad)} = 1507/1120 = 1.345536$ urchins/m².
Compute the difference column
To compute the se, create the diff column as in the ratio estimation section and find its standard deviation.
Compute the se of the ratio estimate
The estimated se is then found as:
$se(\widehat{density}) = \sqrt{\frac{s^2_{diff}}{n_{transects}}\,\frac{1}{\overline{quad}^{2}}} = \sqrt{\frac{48.09933^2}{28}\,\frac{1}{40^2}} = 0.2272$ urchins/m².
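The two calculations above can be verified with a short Python sketch (not part of the original notes; the variable names are ours, and the numbers are taken from the summary table):

```python
import math

# Summary statistics from the 28 sampled transects (cluster totals)
total_legal = 1507    # total legal-sized urchins over all transects
total_quads = 1120    # total number of 1 m^2 quadrats over all transects
n_transects = 28
s_diff = 48.09933     # std deviation of the diff column

density = total_legal / total_quads         # ratio estimate, urchins/m^2
mean_quads = total_quads / n_transects      # average cluster size (40 quadrats)

# se of the ratio estimate; the fpc is ignored because N is very large
se_density = math.sqrt(s_diff**2 / n_transects * (1 / mean_quads**2))

print(round(density, 6), round(se_density, 4))   # 1.345536 0.2272
```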
(Optional) Expand final answer to a population total
In order to estimate the total number of urchins in the harvesting area, you simply multiply the estimated
ratio and its standard error by the area to be harvested.
SAS Analysis
SAS v.8 has procedures for the analysis of survey data taken in a cluster design. A program to analyze the
data is urchin.sas and is available from the Sample Program Library at http://www.stat.sfu.ca/
~cschwarz/Stat-650/Notes/MyPrograms.
The SAS program starts by reading in the data at the individual transect level:
data urchin;
   infile urchin firstobs=2 missover;  /* the first record has the variable names */
   input transect quadrat legal sublegal;
   /* no need to specify sampling weights because transects are an SRS */
run;
The total number of urchins and the length of each transect are computed using Proc Means:
proc sort data=urchin; by transect;
proc means data=urchin noprint;
by transect;
var quadrat legal;
output out=check min=min max=max n=n sum(legal)=tlegal;
run;
and then plotted:
proc gplot data=check;
   title2 'plot the relationship between the cluster total and cluster size';
   plot tlegal*n=1;  /* use the transect number as the plotting character */
   symbol1 v=plus pointlabel=("#transect");
run;
Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary
to specify the sampling weights.
The key feature of the SAS program is the use of the CLUSTER statement to identify the clusters in the
data.
proc surveymeans data=urchin;  /* do not specify a pop size as fpc is negligible */
   cluster transect;
   var legal;
run;
The population number of transects was not specified as the finite population correction is negligible.
Here are the results:
Data Summary
Number of Clusters 28
Number of Observations 1120
Statistics
Variable N Mean Std Error of Mean 95% CL for Mean
legal 1120 1.345536 0.227248 0.87926137 1.81181006
The results are identical to those above.
The first step in the analysis when using standard computer packages is to summarize up to the cluster level. You need to compute the total for each cluster and the size of each cluster.
Note that there were no transects numbered 5, 12, 17, 19, or 32. Why are these transects missing? According to the records of the survey, inclement weather caused cancellation of the missing transects. It seems reasonable to treat the missing transects as missing completely at random (MCAR). In this case, there is no problem in simply ignoring the missing data; all that happens is that the precision is reduced compared to the design with all data present.
We compare the maximum quadrat number to the number of quadrat values actually recorded and see that they all match, indicating that no empty quadrats went unrecorded.
Now we are back to the case of a ratio estimator with the Y variable being the number of legal-sized urchins measured on the transect, and the X variable being the size of the transect. As in the previous examples of a ratio estimator, we create a weighting variable equal to 1/X = 1/size of transect.
The estimated density is 1.346 (se 0.216) urchins/m². The se is a bit smaller because of the lack of a finite population correction factor, but is close to the correct se.
Planning for future experiments
The rse of the estimate is 0.2274/1.3455 = 17% - not terric. The determination of sample size is done
in the same manner as in the ratio estimator case dealt with in earlier sections except that the number of
CLUSTERS is found. If we wanted to get a rse near to 5%, we would need almost 320 transects - this is
likely too costly.
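Since the se of a cluster sample shrinks with the square root of the number of clusters, the "almost 320 transects" figure can be sketched as follows (a rough planning computation, not part of the original notes):

```python
import math

n_now = 28                   # transects in the current survey
rse_now = 0.2272 / 1.3455    # current relative standard error, about 17%
rse_target = 0.05

# rse scales as 1/sqrt(number of clusters), so the required number of
# clusters grows with the squared ratio of current to target rse
n_needed = n_now * (rse_now / rse_target) ** 2
print(math.ceil(n_needed))   # 320
```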
3.10.6 Example - estimating the total number of sea cucumbers
Sea cucumbers are considered a delicacy by some, and the fishery is of growing importance.
In order to set harvest quotas and to monitor the stock, it is important that the number of sea cucumbers in a certain harvest area be estimated each year.
The following is an example taken from Griffith Passage in BC, 1994.
To do this, the managers lay out a number of transects across the cucumber harvest area. Divers then
swim along the transect, and while carrying a 4 m wide pole, count the number of cucumbers within the
width of the pole during the swim.
The number of possible transects is so large that the correction for finite population sampling can be ignored.
Here is the summary information up to the transect level (the preliminary raw data is unavailable):

Transect     Sea
Area (m²)    Cucumbers
260 124
220 67
200 6
180 62
120 35
200 3
200 1
120 49
140 28
400 1
120 89
120 116
140 76
800 10
1460 50
1000 122
140 34
180 109
80 48
The total harvest area is 3,769,280 m² as estimated by a GIS system.
The transects were laid out from one edge of the bed; the length of this edge is 51,436 m. Note that because each transect was 4 m wide, the number of possible transects is 1/4 of this value.
What is the population of interest and the parameter?
The population of interest is the sea cucumbers in the harvest area. These happen to be (artificially) clustered into transects which are the sampling unit. All sea cucumbers within the transect (cluster) are measured.
The parameter of interest is the total number of cucumbers in the harvest area.
What is the frame?
The frame is conceptual - there is no predefined list of all the possible transects. Rather, they pick random points along the edge of the harvest area, and then lay out the transect from there.
What is the sampling design?
The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured
within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent
of each other and are likely positively correlated (why?).
Analysis - abbreviated
As the analysis is similar to the previous example, a detailed description of the Excel, SAS, R, or JMP versions will not be given.
The worksheet cucumber, available in the Allofdata workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, illustrates the computations in Excel. Three different surveys are illustrated. It also computes the two estimators when two potential outliers are deleted, and for a second harvest area.
Summarize to cluster level
The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data.
Now this summary table is simply an SRSWOR from the set of all transects. We first estimate the density, and then multiply by the area to estimate the total.
Note that after summarizing up to the transect level, this example proceeds in an analogous fashion to the grouse in pockets of brush example that we looked at earlier.
Preliminary Plot
A plot of the cucumber total vs. the transect size shows a very poor relationship between the two variables. It will be interesting to compare the results from the simple inflation estimator and the ratio estimator.
Simple Inflation Estimator
First, estimate the number ignoring the area of the transects by using a simple inflation estimator.
The summary statistics that we need are:
n 19 transects
Mean 54.21 cucumbers/transect
std Dev 42.37 cucumbers/transect
We compute an estimate of the total as $\widehat{\tau} = N\overline{y} = (51{,}436/4) \times 54.21 = 697{,}093$ sea cucumbers. [Why did we use 51,436/4 rather than 51,436?]
We compute an estimate of the se of the total as:
$se(\widehat{\tau}) = \sqrt{N^{2}\,\frac{s^{2}}{n}\,(1-f)} = \sqrt{(51{,}436/4)^{2} \times \frac{42.37^{2}}{19}} = 124{,}981$ sea cucumbers.
The finite population correction factor is so small we simply ignore it.
This gives a relative standard error (se/est) of 18%.
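As a cross-check, the inflation estimator can be reproduced from the transect counts listed above (a Python sketch, not part of the original notes):

```python
import math
import statistics

# Transect counts from the Griffith Passage table
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(cucumbers)             # 19 transects
N = 51436 / 4                  # number of possible 4 m wide transects

ybar = statistics.mean(cucumbers)   # about 54.21 cucumbers/transect
s = statistics.stdev(cucumbers)     # about 42.37 cucumbers/transect

total_hat = N * ybar                    # simple inflation estimate
se_total = math.sqrt(N**2 * s**2 / n)   # fpc ignored

print(round(total_hat), round(se_total))   # 697093 124981
```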
Ratio Estimator
We use the methods outlined earlier for ratio estimators from SRSWOR to get the following summary table:
area cucumbers
Mean 320.00 54.21 per transect
The estimated density of sea cucumbers is then
$\widehat{density} = \frac{mean(cucumbers)}{mean(area)} = 54.21/320.00 = 0.169$ cucumbers/m².
To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, $s_{diff} = 73.63$. The estimated se of the ratio is then found as:
$se(\widehat{density}) = \sqrt{\frac{s^{2}_{diff}}{n_{transects}}\,\frac{1}{\overline{area}^{2}}} = \sqrt{\frac{73.63^{2}}{19}\,\frac{1}{320^{2}}} = 0.053$ cucumbers/m².
We once again ignore the finite population correction factor.
In order to estimate the total number of cucumbers in the harvesting area, you simply multiply the above by the area to be harvested:
$\widehat{\tau}_{ratio} = area \times \widehat{density} = 3{,}769{,}280 \times 0.169 = 638{,}546$ sea cucumbers.
The se is found as: $se(\widehat{\tau}_{ratio}) = area \times se(\widehat{density}) = 3{,}769{,}280 \times 0.053 = 198{,}983$ sea cucumbers, for an overall rse of 31%.
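The ratio estimator computations can likewise be reproduced from the transect data (a Python sketch, not part of the original notes):

```python
import math
import statistics

areas = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
         120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(areas)
harvest_area = 3_769_280      # m^2, from the GIS system

density = sum(cucumbers) / sum(areas)     # about 0.169 cucumbers/m^2

# diff column: y_i - density * x_i, as in ratio estimation
diffs = [y - density * x for x, y in zip(areas, cucumbers)]
s_diff = statistics.stdev(diffs)          # about 73.63

xbar = statistics.mean(areas)             # 320.0 m^2 per transect
se_density = math.sqrt(s_diff**2 / n * (1 / xbar**2))

total_hat = harvest_area * density        # about 638,500 cucumbers
se_total = harvest_area * se_density      # about 199,000 cucumbers

print(round(density, 3), round(se_density, 3))   # 0.169 0.053
```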
SAS Analysis
The SAS program is available in cucumber.sas and the relevant output in cucumber.lst in the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
However, because only the summary data is available, you cannot use the CLUSTER statement of Proc
SurveyMeans. Rather, as noted earlier in the notes, you form a ratio estimator based on the cluster totals.
The data are read in the Data step already summarized to the cluster level:
data cucumber;
   input area cucumbers;
   transect = _n_;  /* number the transects */
Do a plot to see the relationship between transect area and numbers of cucumbers:
proc gplot data=cucumber;
   title2 'plot the relationship between the cluster total and cluster size';
   plot cucumbers*area=1;  /* use the transect number as the plotting character */
   symbol1 v=plus pointlabel=("#transect");
run;
Because the relationship between the number of cucumbers and transect area is not very strong, a simple inflation estimator will be tried first. The sampling weights must be computed; each weight is equal to the total number of possible transects divided by the number of transects taken:
/* First compute the sampling weight and add to the dataset                      */
/* The sampling weight is simply the total pop size / # sampling units in an SRS */
/* In this example, transects were an SRS from all possible transects            */
proc means data=cucumber n mean std;
   var cucumbers;
   /* get the total number of transects */
   output out=weight n=samplesize;
run;

data cucumber;
   merge cucumber weight;
   retain samplingweight;
   if samplesize > . then samplingweight = 51436/4 / samplesize;
run;
And then the simple inflation estimator is computed via Proc SurveyMeans:

proc surveymeans data=cucumber mean clm sum clsum cv;
   /* N not specified as we ignore the fpc in this problem              */
   /* mean clm  - find estimate of mean and confidence intervals       */
   /* sum clsum - find estimate of grand total and confidence intervals */
   title2 'Simple inflation estimator using cluster totals';
   var cucumbers;
   weight samplingweight;
run;
Data Summary
Number of Observations 19
Sum of Weights 12859
Statistics
Variable Mean Std Error of Mean 95% CL for Mean Coeff of Variation Sum Std Dev 95% CL for Sum
cucumbers 54.210526 9.719330 33.7909719 74.6300807 0.179289 697093 124981 434518.107 959668.208
Now for the ratio estimator. First use Proc SurveyMeans to compute the density, and then inflate the density by the total area of the cucumber bed:
proc surveymeans data=cucumber ratio clm;
   /* the ratio clm keywords request a ratio estimator and a confidence interval */
   title2 'Estimation using a ratio estimator';
   var cucumbers area;
   ratio cucumbers / area;
   ods output ratio=outratio;  /* extract information so that total can be estimated */
run;

data outratio;
   /* compute estimates of the total */
   set outratio;
   cv = stderr / ratio;  /* the relative standard error of the estimate */
   Est_total = ratio   * 3769280;
   Se_total  = stderr  * 3769280;
   UCL_total = uppercl * 3769280;
   LCL_total = lowercl * 3769280;
   format est_total se_total ucl_total lcl_total 7.1;
   format cv 7.2;
   format ratio stderr lowercl uppercl 7.3;
run;
This gives the final results:
Obs Ratio StdErr cv LowerCL UpperCL
1 0.169 0.053 0.31 0.058 0.280
Obs Est total Se total LCL total UCL total
1 638546 198983 220498 1056593
Comparing the two approaches
Why did the ratio estimator do worse than the simple inflation estimator in Griffith Passage?
The plot of the number of sea cucumbers vs. the area of the transect shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator.
In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variations (std dev/mean). Computation shows that half of the ratio of their relative variations is 0.732, while the correlation between the two variables is 0.041. Hence the ratio estimator will not do well.
The Excel worksheet also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than using an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best-laid plans can go astray - always plot the data.
A third worksheet in the workbook analyzes the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide margin.
3.11 Multi-stage sampling - a generalization of cluster sampling
Not part of Stat403/650. Please consult with a sampling expert before implementing or analyzing a multi-stage design.
3.11.1 Introduction
All of the designs considered above select a sampling unit from the population and then do a complete measurement upon that unit. In the case of cluster sampling, this is facilitated by dividing the sampling unit into smaller observational units, but all of the observational units within the sampled cluster are measured.
If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster were identical, only a single observational unit from the cluster would need to be selected in order to estimate (without any error) the cluster total. Suppose then that the observational units within a cluster were not identical, but had some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect?
This method is called two-stage sampling. In the first stage, larger sampling units are selected using some probability design. In the second stage, smaller units within the selected first-stage units are selected according to a probability design. The design used at each stage can be different, e.g. first-stage units selected using a simple random sample, but second-stage units selected using a systematic design as proposed for the urchin survey above.
This sampling design can be generalized to multi-stage sampling.
Some examples of multi-stage designs are:
Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of polygons is selected using a probability-proportional-to-size design. In the selected polygons, ground measurement stations are selected on a 100 m grid and crews measure standing timber at these selected ground stations.
Urchin survey Transects are selected using a simple random sample design. Every second or third
quadrat is measured after a random starting point.
Clam surveys Beaches are divided into 1 ha sections. A random sample of sections is selected and a series of 1 m² quadrats are measured within each section.
Herring spawn biomass Schweigert et al. (1985, CJFAS, 42, 1806-1814) used a two-stage design to estimate herring spawn in the Strait of Georgia.
Georgia Strait Creel Survey The Georgia Strait Creel Survey uses a multi-stage design to select landing sites within strata, times of day to interview at these selected sites, and which boats to interview in a survey of angling effort on the Georgia Strait.
Some consequences of simple two-stage designs are:
If the selected first-stage units are completely enumerated, then complete cluster sampling results.
If every first-stage unit in the population is selected, then a stratified design results.
A complete frame is required for all first-stage units. However, a frame of second-stage and lower-stage units need only be constructed for the selected upper-stage units.
The design is very flexible, allowing (in theory) different selection methods to be used at each stage, and even different selection methods within each first-stage unit.
A separate randomization is done within each first-stage unit when selecting the second-stage units.
Multi-stage designs are less precise than a simple random sample of the same number of final sampling units, but more precise than a cluster sample of the same number of final sampling units. [Hint: think of what happens if the second-stage units are very similar.]
Multi-stage designs are cheaper than a simple random sample of the same number of final sampling units, but more expensive than a cluster sample of the same number of final sampling units. [Hint: think of the travel costs in selecting more transects or measuring quadrats within a transect.]
As in all sampling designs, stratification can be employed at any level, and ratio and regression estimators are available. As expected, the theory becomes more and more complex the more "variations" are added to the design.
The primary incentives for multi-stage designs are that
1. frames of the final sampling units are typically not available
2. it often turns out that most of the variability in the population occurs among first-stage units. Why spend time and effort measuring lower-stage units that are relatively homogeneous within the first-stage unit?
3.11.2 Notation
A sample of n first-stage units (FSU) is selected from a total of N first-stage units. Within the $i$th first-stage unit, $m_i$ second-stage units (SSU) are selected from the $M_i$ units available.
Item                  Population value       Sample value
First-stage units     $N$                    $n$
Second-stage units    $M_i$                  $m_i$
SSUs in population    $M = \sum M_i$
Value of SSU          $Y_{ij}$               $y_{ij}$
Total of FSU $i$      $\tau_i$               $\widehat{\tau}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij}$
Total in pop          $\tau = \sum \tau_i$
Mean in pop           $\mu = \tau/M$
3.11.3 Summary of main results
We will only consider the case where simple random sampling occurs at both stages of the design.
The intuitive explanation for the results is that a total is estimated for each FSU selected (based on the SSUs selected). These estimated totals are then used in a similar fashion to a cluster sample to estimate the grand total.
Parameter   Population value       Estimate                                                        Estimated se
Total       $\tau = \sum \tau_i$   $\widehat{\tau} = \frac{N}{n}\sum_{i=1}^{n}\widehat{\tau}_i$    $se(\widehat{\tau}) = \sqrt{N^{2}(1-f_1)\frac{s_1^2}{n} + \frac{N^{2}f_1}{n^{2}}\sum_{i=1}^{n}M_i^{2}(1-f_{2i})\frac{s_{2i}^2}{m_i}}$
Mean        $\mu = \frac{\tau}{M}$  $\widehat{\mu} = \frac{\widehat{\tau}}{M}$                     $se(\widehat{\mu}) = \sqrt{\frac{se^{2}(\widehat{\tau})}{M^{2}}}$

where
$s_1^2 = \frac{\sum_{i=1}^{n}\left(\widehat{\tau}_i - \overline{\widehat{\tau}}\right)^2}{n-1}$,
$s_{2i}^2 = \frac{\sum_{j=1}^{m_i}(y_{ij} - \overline{y}_i)^2}{m_i - 1}$,
$\overline{\widehat{\tau}} = \frac{1}{n}\sum_{i=1}^{n}\widehat{\tau}_i$,
$f_1 = n/N$ and $f_{2i} = m_i/M_i$.
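To make the formulas concrete, here is a sketch on a tiny hypothetical two-stage sample (the population of N = 4 FSUs and all data values are invented purely for illustration; this is not part of the original notes):

```python
import math
import statistics

# Hypothetical two-stage sample: N = 4 FSUs in the population, n = 2 sampled.
# Each tuple is (M_i, list of the m_i sampled second-stage values).
N = 4
sample = [(4, [3, 5]),        # FSU 1: M_i = 4, m_i = 2
          (6, [2, 4, 6])]     # FSU 2: M_i = 6, m_i = 3
n = len(sample)
f1 = n / N

# Estimated FSU totals: tau_i = M_i/m_i * sum(y_ij)
tau = [M / len(y) * sum(y) for M, y in sample]     # [16.0, 24.0]
total_hat = N / n * sum(tau)                       # 80.0

# Two variance components: among FSU totals, and within each FSU
s1_sq = statistics.variance(tau)
second = sum(M**2 * (1 - len(y) / M) * statistics.variance(y) / len(y)
             for M, y in sample)
var_total = N**2 * (1 - f1) * s1_sq / n + N**2 * f1 / n**2 * second
se_total = math.sqrt(var_total)

print(total_hat, round(se_total, 2))   # 80.0 13.86
```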
Notes:
There are two contributions to the estimated se - variation among first-stage totals ($s_1^2$) and variation among second-stage units ($s_{2i}^2$).
If the FSUs vary considerably in size, a ratio estimator (not discussed in these notes) may be more appropriate.
Confidence Intervals The usual large-sample confidence intervals can be used.
3.11.4 Example - estimating the number of oysters
A First Nation wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds.
This example looks at the estimate of oyster numbers from a survey conducted in 1994.
The survey was conducted by running a line through the oyster bed; the total length of the line was 105 m. Several random locations were chosen along the line. At each randomly chosen location, the width of the bed was measured, and about 3 random locations along the perpendicular transect at that point were taken. A 1 m² quadrat was applied, and the number of oysters of various sizes was counted in the quadrat.
Location  transect  width (m)  quadrat  seed  xsmall  small  med  large  total count  net weight (kg)
Lloyd 5 17 3 18 18 41 48 14 139 14.6
Lloyd 5 17 5 6 4 30 9 4 53 5.2
Lloyd 5 17 10 15 21 44 13 11 104 8.2
Lloyd 7 18 5 8 10 14 5 3 40 6.0
Lloyd 7 18 12 10 38 36 16 4 104 10.2
Lloyd 7 18 13 0 15 12 3 3 33 4.6
Lloyd 18 14 1 11 8 5 9 19 52 7.8
Lloyd 18 14 5 13 23 68 18 11 133 12.6
Lloyd 18 14 8 1 29 60 2 1 93 10.2
Lloyd 30 11 3 17 1 13 13 2 46 5.4
Lloyd 30 11 8 12 16 23 22 14 87 6.6
Lloyd 30 11 10 23 15 19 17 1 75 7.0
Lloyd 49 9 3 10 27 15 1 0 53 2.0
Lloyd 49 9 5 13 7 14 11 4 49 6.8
Lloyd 49 9 8 10 25 17 16 11 79 6.0
Lloyd 76 21 4 3 3 11 7 0 24 4.0
Lloyd 76 21 7 15 4 32 26 24 101 12.4
Lloyd 76 21 11 2 19 14 19 0 54 5.8
Lloyd 79 18 1 14 13 7 9 0 43 3.6
Lloyd 79 18 4 0 32 32 27 16 107 12.8
Lloyd 79 18 11 16 22 43 18 8 107 10.6
Lloyd 84 19 1 14 32 25 39 7 117 10.2
Lloyd 84 19 8 25 43 42 17 3 130 7.2
Lloyd 84 19 15 5 22 61 30 13 131 14.2
Lloyd 86 17 8 1 19 32 10 8 70 8.6
Lloyd 86 17 11 8 17 13 10 3 51 4.8
Lloyd 86 17 12 7 22 55 11 4 99 9.8
Lloyd 95 20 1 17 12 20 18 4 71 5.0
Lloyd 95 20 8 32 4 26 29 12 103 11.6
Lloyd 95 20 15 3 34 17 11 1 66 6.0
These multi-stage designs are complex to analyze. Rather than trying to implement the various formulae by hand, I would suggest that a proper sampling package (such as SAS or R) be used.
If using simpler packages, the first step is to move everything up to the primary sampling unit level. We need to estimate the total for each primary sampling unit, and to compute some components of the variance from the second stage of sampling.
You will need to add some columns to estimate the total for each FSU and the contribution of the second-stage sampling to the overall variance. In Excel, these columns are created with worksheet formulas.
First, the formula for the FSU total, i.e. the estimated total weight for the entire transect. This is simply the average weight per quadrat times the width of the strip.
Second, we compute the component of variance for the second stage. [Typically, if the first-stage sampling fraction is small, this can be ignored.]
Now summarize up to the population level. We compute the mean transect total and expand by the number of transects. The variance component from the first stage of sampling is then found, and the final overall se is found by combining the first-stage variance component and the second-stage variance components from each transect.
Our final estimate is a total biomass of 14,070 kg with an estimated se of 1484 kg.
A similar procedure can be used for the other variables.
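As a cross-check on the spreadsheet arithmetic, the two-stage formulas of the previous section can be applied directly to the net-weight data (a Python sketch, not part of the original notes):

```python
import math
import statistics

# Net weight (kg) per 1 m^2 quadrat, grouped by transect, with bed width M_i
# (the number of possible quadrats on that transect).  N = 105 possible
# transect locations along the 105 m reference line; n = 10 were sampled.
transects = {
     5: (17, [14.6, 5.2, 8.2]),
     7: (18, [6.0, 10.2, 4.6]),
    18: (14, [7.8, 12.6, 10.2]),
    30: (11, [5.4, 6.6, 7.0]),
    49: ( 9, [2.0, 6.8, 6.0]),
    76: (21, [4.0, 12.4, 5.8]),
    79: (18, [3.6, 12.8, 10.6]),
    84: (19, [10.2, 7.2, 14.2]),
    86: (17, [8.6, 4.8, 9.8]),
    95: (20, [5.0, 11.6, 6.0]),
}
N, n = 105, len(transects)
f1 = n / N

# Estimated transect totals: tau_i = M_i/m_i * sum of sampled quadrat weights
tau = [M / len(y) * sum(y) for M, y in transects.values()]
total_hat = N / n * sum(tau)            # estimated total biomass, kg

s1_sq = statistics.variance(tau)        # among-transect component
second = sum(M**2 * (1 - len(y) / M) * statistics.variance(y) / len(y)
             for M, y in transects.values())
var_total = N**2 * (1 - f1) * s1_sq / n + N**2 * f1 / n**2 * second
se_total = math.sqrt(var_total)

print(round(total_hat), round(se_total))   # 14070 1484
```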
Excel Spreadsheet
The above computations can also be done in Excel as shown in the wildoyster worksheet in the ALLof-
Data.xls workbook from the Sample Program Library.
As in the case of a pure cluster sample, the PivotTable feature can be used to compute summary statistics
needed to estimate the various components.
SAS Program
SAS can also be used to analyze the data as shown in the program wildoyster.sas and output wildoyster.lst in
the Sample Program Library.
The data are read in the usual way:
data oyster;
   infile datalines firstobs=3;
   input loc $ transect width quad seed xsmall small med large total weight;
   sampweight = 105/10 * width/3;  /* sampling weight = product of the inverse sampling fractions */
The sample weight is computed as the product of the sampling fraction at the rst stage and the second stage.
Proc SurveyMeans is used directly with the two-stage design. The cluster statement identifies the first
stage of the sampling.
/* estimate the total biomass on the oyster bed */

/* Note that SurveyMeans only uses the first stage variance in its
   computation of the standard error. As the first stage sampling
   fraction is usually quite small, this will tend to give only
   slight underestimates of the true standard error of the estimate */

proc surveymeans data=oyster
     total=105          /* length of first reference line */
     sum ;              /* interested in total biomass estimate */
     cluster transect;  /* identify the perpendicular transects */
     var weight;
     weight sampweight;
run;
Note that Proc SurveyMeans computes the se using only the first stage standard errors. As the first
stage sampling fraction is usually quite small, this will tend to give only slight underestimates of the true
standard error of the estimate.
The final results are:
Data Summary

Number of Clusters          10
Number of Observations      30
Sum of Weights            1722

Statistics

Variable      Sum        Std Dev
weight      14070    1444.919931
3.11.5 Some closing comments on multi-stage designs
The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite
complex and the formulae for the estimates and estimated standard errors fearsome. If you have to analyze
such a design, it is likely better to invest some time in learning one of the statistical packages designed for
surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand.
There are also several important design decisions for multi-stage designs.
Two-stage designs have reduced costs of data collection because units within the FSU are easier to
collect, but also have poorer precision compared to a simple random sample with the same number of
final sampling units. However, because of the reduced cost, it often turns out that more units can be
sampled under a multi-stage design, leading to improved precision for the same cost as a simple
random sample design. There is a tradeoff between sampling more first stage units and taking a small
sub-sample in the secondary stage. An optimal allocation strategy can be constructed to decide upon
the best strategy; consult some of the reference books on sampling for details.
As with ALL sampling designs, stratification can be used to improve precision. The stratification
usually takes place at the first sampling unit stage, but can take place at all stages. The details of
estimation under stratification can be found in many sampling texts.
Similarly, ratio or regression estimators can also be used if auxiliary information is available that is
correlated with the response variable. This leads to very complex formulae!
One very nice feature of multi-stage designs is that if the first stage is sampled with replacement, then
the formulae for the estimated standard errors simplify considerably to a single term regardless of the
design used in the lower stages! If there are many first stage units in the population and if the sampling
fraction is small, the chances of selecting the same first stage unit twice are very small. Even if this occurs,
a different set of second stage units will likely be selected so there is little danger of having to measure the
same final sampling unit more than once. In such situations, the design at second and lower stages is very
flexible as all that you need to ensure is that an unbiased estimate of the first-stage unit total is available.
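That simplification can be sketched as follows (illustrative Python, not from the text; the function name and toy values are hypothetical). With first-stage units drawn with replacement, the standard error depends only on the variability among the expanded per-FSU estimates, a single term, no matter what design produced each FSU estimate.

```python
import math

def pwr_total(fsu_estimates, N):
    """Total and SE when first-stage units are drawn with replacement.

    fsu_estimates: unbiased estimates of the total of each sampled FSU
                   (any design may be used at the lower stages).
    N:             number of FSUs in the population.
    """
    n = len(fsu_estimates)
    expanded = [N * t for t in fsu_estimates]  # each expands to a total
    total = sum(expanded) / n
    # single-term variance: spread of the expanded estimates
    s2 = sum((e - total) ** 2 for e in expanded) / (n - 1)
    return total, math.sqrt(s2 / n)
```

Compare this with the without-replacement estimator earlier in the section, which needs a separate variance component from each stage.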
3.12 Analytical surveys - almost experimental design
In descriptive surveys, the objective was to simply obtain information about one large group. In observational
studies, two deliberately chosen sub-populations are selected and surveyed, but no attempt is made
to generalize the results to the whole population. In analytical studies, sub-populations are selected and
sampled in order to generalize the observed differences among the sub-populations to this and other similar
populations.
As such, there are similarities between analytical and observational surveys and experimental design.
The primary difference is that in experimental studies, the manager controls the assignment of the explanatory
variables while measuring the response variables, while in analytical and observational surveys, neither
set of variables is under the control of the manager. [Refer back to Examples B, C, and D in the earlier
chapters.] The analysis of complex surveys for analytical purposes can be very difficult (Kish, 1984, 1987;
Rao, 1973; Sedransk, 1965a, 1965b, 1966).
As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables
(similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided
into three categories depending on the type of stratification:
the population is pre-stratified by the explanatory variables and surveys are conducted in each stratum
to measure the outcome variables;
the population is surveyed in its entirety, and post-stratified by the explanatory variables; or
the explanatory variables can be used as auxiliary variables in ratio or regression methods.
[It is possible that all three types of stratification take place; these are very complex surveys.]
The choice among the categories is usually made by the ease with which the population can be pre-stratified
and the strength of the relationship between the response and explanatory variables. For example,
sample plots can be easily pre-stratified by elevation or by exposure to the sun, but it would be difficult to
pre-stratify by soil pH.
Pre-stratification has the advantage that the manager has control over the number of sample points collected
in each stratum, whereas in post-stratification, the numbers are not controllable, and may lead to very
small sample sizes in certain strata just because they form only a small fraction of the population.
For example, a manager may wish to investigate the difference in regeneration (as measured by the
density of new growth) as a function of elevation. Several cut blocks will be surveyed. In each cut block, the
sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in
each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal.
The density of new growth will be measured on each selected sample plot. On the other hand, suppose that
the regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must
take a simple random sample over the entire stand, measure the density of new growth and the soil pH at
each sampling unit, and then post-stratify the data based on measured pH. The number of sampling units in
each pH class is not controllable; indeed it may turn out that certain pH classes have no observations.
If explanatory variables are treated as auxiliary variables, then there must be a strong relationship
between the response and explanatory variables. Additionally, we must be able to measure the auxiliary
variable precisely for each unit. Then, methods like multiple regression can also be used to investigate
the relationship between the response and the explanatory variable. For example, rather than classifying
elevation into three broad elevation classes or soil pH into broad pH classes, the actual elevation or soil
pH must be measured precisely to serve as an auxiliary variable in a regression of regeneration density vs.
elevation or soil pH.
If the units have been selected using a simple random sample, then the analysis of the analytical surveys
proceeds along similar lines as the analysis of designed experiments (Kish, 1987; also refer to Chapter
2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a
hypothetical super-population of which the current conditions are just one realization. In the above example,
cut blocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample
plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on
mean density of regeneration occurs as if this were a planned experiment.
Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for
analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were
collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial
underestimates of the true standard error, i.e., the precision is thought to be far better than is justified based on
the survey results. Consequently, the manager may erroneously detect differences more often than expected
(i.e., make a Type I error) and make decisions based on erroneous conclusions.
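A tiny numerical illustration of this pitfall (illustrative Python with made-up data, not from the text): when observations within clusters are similar, the simple-random-sample formula applied to the pooled observations understates the standard error of the mean relative to the correct analysis that treats the cluster means as the independent observations.

```python
import statistics

# hypothetical cluster sample: 4 clusters, 3 similar units per cluster
clusters = [[10, 11, 12], [2, 3, 2], [20, 21, 19], [5, 6, 5]]
flat = [y for c in clusters for y in c]

# WRONG: pretend the 12 observations are a simple random sample
se_naive = statistics.stdev(flat) / len(flat) ** 0.5

# BETTER: the 4 cluster means are the independent observations
means = [statistics.mean(c) for c in clusters]
se_cluster = statistics.stdev(means) / len(means) ** 0.5

# for these data, se_naive is noticeably smaller than se_cluster
```

The more similar the units within a cluster, the worse the naive formula's underestimate becomes.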
Solution: As in experimental design, it is important to match the analysis of the data with the survey
design used to collect it. The major difficulties in the analysis of analytical surveys are:
1. Recognizing and incorporating the sampling method used to collect the data in the analysis. The
survey design used to obtain the sampling units must be taken into account in much the same way as
the analysis of the collected data is influenced by the actual experimental design. A table of equivalences
between terms in a sample survey and terms in experimental design is provided in Table 1.
Table 1
Equivalences between terms used in surveys and in experimental design.

Survey Term            Experimental Design Term
Simple Random Sample   Completely randomized design
Cluster Sampling       (a) Clusters are random effects; units within a
                       cluster treated as sub-samples; or
                       (b) Clusters are treated as main plots; units within
                       a cluster treated as sub-plots in a split-plot
                       analysis.
Multi-stage sampling   (a) Nested designs with units at each stage nested
                       in units in higher stages. Effects of units at each
                       stage are treated as random effects; or
                       (b) Split-plot designs with factors operating at
                       higher stages treated as main plot factors and
                       factors operating at lower stages treated as
                       sub-plot factors.
Stratification         Fixed factor or random block depending on the
                       reasons for stratification.
Sampling Unit          Experimental unit or treatment unit
Sub-sample             Sub-sample
There is no quick easy method for the analysis of complex surveys (Kish, 1987). The super-population
approach seems to work well if the selection probabilities of each unit are known (these are used to
weight each observation appropriately) and if random effects corresponding to the various strata or
stages are employed. The major difficulty caused by complex survey designs is that the observations
are not independent of each other.
2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors).
This typically occurs if post-stratification is used to classify units by the explanatory variables
but can also occur in pre-stratification if the manager decides not to allocate equal effort in each stratum.
The analysis of unbalanced data is described by Milliken and Johnson (1984).
3. Missing cells, i.e., certain combinations of explanatory variables may not occur in the survey. The
analysis of such surveys is complex, but refer to Milliken and Johnson (1984).
4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside
of the observed range is not recommended.
More sophisticated techniques can also be used in analytical surveys. For example, correspondence
analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for
post-hoc associations among measured variables that may give rise to hypotheses for further investigation.
Unfortunately, most of these methods assume that units have been selected independently of each other
using a simple random sample; extensions where units have been selected via a complex sampling design
have not yet been developed. Simpler designs are often highly preferred to avoid erroneous conclusions based on
inappropriate analysis of data from complex designs.
Pitfall: While the analysis of analytical surveys and designed experiments are similar, the strength of the
conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship
in an analytical survey may be the result of a common response to a third, unobserved variable. For example,
consider the two following experiments. In the first experiment, the explanatory variable is elevation (high or
low). Ten stands are randomly selected at each elevation. The amount of growth is measured and it appears
that stands at higher elevations have less growth. In the second experiment, the explanatory variable is
the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The
amount of growth is measured and it appears that stands that receive a higher dose of fertilizer have greater
growth. In the first experiment, the manager is unable to say whether the differences in growth are a result
of differences in elevation or amount of sun exposure or soil quality, as all three may be highly related. In
the second experiment, all uncontrolled factors are present in both groups and their effects will, on average,
be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor
that differs (on average) among the groups.
As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of the techniques
for survey sampling when conducting analytical surveys. Otherwise they are likely to be subject to biases
of one sort or another. Experience and judgment are very important in evaluating the prospects for bias, and
in attempting to find ways to control and account for these biases. The most common source of bias is the
selection of survey units, and the most common pitfall is to select units based on convenience rather than
on a probabilistic sampling design. The potential problems that this can lead to are analogous to those that
occur when it is assumed that callers to a radio phone-in show are representative of the entire population.
3.13 References
Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
One of the standard references for survey sampling. Very technical.
Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical
Report of Fisheries and Aquatic Sciences 2270.
A very nice summary of using sampling methods to estimate clam numbers.
Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New York: American Chemical
Society.
A series of papers on sampling, mainly for environmental contaminants in ground and surface water,
soils, and air. A detailed discussion on sampling for pattern.
Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys mostly from a social science perspective.
Kish, L. (1984). On Analytical Statistics from complex samples. Survey Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical surveys.
Kish, L. (1987). Statistical designs for research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in analytical surveys. Very
technical.
Krebs, C. (1989). Ecological Methodology.
A collection of methods commonly used in ecology, including a section on sampling.
Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal
bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214.
An overview of how to use surveys for assessing intertidal bivalves; more technical than Gillespie
and Kronlund (1999).
Myers, W.L. and Shelton, R.L. (1980). Survey methods for ecosystem management. New York:
Wiley.
Good primer on how to measure common ecological data using direct survey methods, aerial photography,
etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology,
and human influences.
Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical
Society, Series B, 27, 264-278.
Thompson, S.K. (1992). Sampling. New York: Wiley.
A good companion to Cochran (1977). Has many examples of using sampling for biological
populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive
sampling.
3.14 Frequently Asked Questions (FAQ)
3.14.1 Confusion about the definition of a population
What is the difference between the "population total" and the "population size"?
Population size normally refers to the number of final sampling units in the population. Population
total refers to the total of some variable over these units.
For example, if you wish to estimate the total family income of families in Vancouver, the final
sampling units are families, the population size is the number of families in Vancouver, the response variable
is the income for each family, and the population total is the total family income over all families in
Vancouver.
Things become a bit confusing when sampling units differ from final units, i.e. the final units are clustered
and you are interested in estimates of the number of final units. For example, in the grouse/pocket brush
example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is
the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you must expand the
estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the
number of pockets (248), the variable of interest for a cluster is the number of grouse in each pocket, and the
population total is the number of grouse over all pockets.
Similarly, for the oysters on the lease: the population is the oysters on the lease, but you don't randomly
sample individual oysters; you randomly sample quadrats, which are clusters of oysters. The expansion
factor is now the number of quadrats.
In the salmon example, the boats are surveyed. The fact that the number of salmon was measured is
incidental - you could have measured the amount of food consumed, etc.
In the angling survey problem, the boats are the sampling units. The fact that they contain anglers or that
they caught fish is what is being measured, but the set of boats that were at the lake that day is of interest.
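The expansion-by-clusters rule can be sketched as follows (illustrative Python; the function name and the sampled counts in the example are made up, while the 248 pockets come from the grouse example above):

```python
def cluster_total_estimate(cluster_values, n_clusters_in_pop):
    """Estimate a population total from a simple random sample of clusters.

    cluster_values:    total of the variable of interest in each SAMPLED
                       cluster (e.g. grouse counted in each sampled pocket).
    n_clusters_in_pop: number of clusters in the population (e.g. 248
                       pockets of brush). Expansion is by the number of
                       CLUSTERS, never by the number of final units.
    """
    n = len(cluster_values)
    return n_clusters_in_pop * sum(cluster_values) / n

# e.g. 3 sampled pockets holding 2, 4, and 6 grouse, out of 248 pockets
```

The same function applies to the oyster example with quadrats as the clusters and the number of quadrats on the lease as the expansion factor.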
3.14.2 How is N defined?
How is N (the expansion factor) defined? What is the best way to find this value?
This can get confusing in the case of cluster or multi-stage designs as there are different Ns at each
stage of the design. It might be easier to think of N as an expansion factor.
The expansion factor will be known once the frame is constructed. In some cases, this can only be done
after the fact; for example, when surveying angling parties, the total number of parties returning in a day is
unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order
to estimate the sample size. If this is impossible, just choose some arbitrarily large number; the estimated
future sample size will be an overestimate (by a small amount) but close enough. Of course, once the survey
is finished, you would then use the actual value of N in all computations.
3.14.3 Multi-stage vs. Multi-phase sampling
What is the difference between multi-stage sampling and multi-phase sampling?
In multi-stage sampling, the selection of the final sampling units takes place in stages. For example,
suppose you are interested in sampling angling parties as they return from fishing. The region is first divided
into different landing sites. A random selection of landing sites is selected. At each landing site, a random
selection of angling parties is selected.
In multi-phase sampling, the units are NOT divided into larger groups. Rather, a first phase selects some
units and they are measured quickly. A second phase takes a sub-sample of the first phase and measures it
more intently. Returning to the angling survey: a multi-phase design would select angling parties. All
of the selected parties would fill out a brief questionnaire. A week later, a sample of the questionnaires is
selected, and the angling parties are RECONTACTED for more details.
The key difference is that in multi-phase sampling, some units are measured TWICE; in multi-stage
sampling, there are different sizes of sampling units (landing sites vs. angling parties), but each sampling
unit is only selected once.
3.14.4 What is the difference between a population and a frame?
Frame = list of sampling units from which a sample will be taken. The sampling units may not be the same
as the final units that are measured. For example, in cluster sampling, the frame is the list of clusters, but
the final units are the objects within the cluster.
Population = list of all final units of interest. Usually the final units are the actual things measured
in the field, i.e. the final object upon which a measurement is taken.
In some cases, the frame doesn't match the population, which may cause biases, but in ideal cases, the
frame covers the population.
3.14.5 How to account for missing transects
What do you do if an entire cluster is missing?
Missing data can occur at various parts in a survey and for various reasons. The easiest data to handle
is data missing completely at random (MCAR). In this situation, the missing data provide no information
about the problem that is not already captured by other data points, and the missingness is also
non-informative. In this case, and if the design was a simple random sample, the data point is just ignored. So if
you wanted to sample 80 transects, but were only able to get 75, only the 75 transects are used. If some of
the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample, so
the estimation formulae change slightly.
If data are not MCAR, this is a real problem - welcome to a Ph.D. in statistics in how to deal with it!
Chapter 4
Designed Experiments - Terminology
and Introduction
Contents
4.1 Terminology and Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.1.2 Treatment, Experimental Unit, and Randomization Structure . . . . . . . . . . . . 252
4.1.3 The Three Rs of Experimental Design . . . . . . . . . . . . . . . . . . . . . . . 255
4.1.4 Placebo Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
4.1.5 Single and double blinding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
4.1.6 Hawthorne Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
4.2 Applying some General Principles of Experimental Design . . . . . . . . . . . . . . . 258
4.2.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.3 Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.2.4 Experiment 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.2.5 Experiment 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.3 Some Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.3.1 The Salk Vaccine Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.3.2 Testing Vitamin C - Mistakes do happen . . . . . . . . . . . . . . . . . . . . . . 262
4.4 Key Points in Design of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
4.4.1 Designing an Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.4.2 Analyzing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.4.3 Writing the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.5 A Road Map to What is Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.2 Experimental Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5.3 Some Common Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
250
4.1 Terminology and Introduction
This chapter contains definitions and a general introduction that forms a foundation for the rest of the course
work. These concepts may seem abstract at first, but they will become more meaningful as they are applied
in later chapters.
4.1.1 Definitions
Experimental design and analysis has a standardized terminology that is, unfortunately, different from that
used in survey sampling (refer to the section on Analytical Surveys in the chapter on Sampling for an
equivalence table).
factor - one of the variables under the control of the experimenter that is varied over different
experimental units. If a variable is kept constant over all experimental units, then it is not a factor because
we cannot discern its influence on the response (e.g. if the water temperature in an experiment is
held constant and equal for all tanks, it is impossible to determine a temperature effect). Factors are
sometimes called explanatory variables. A factor has 2 or more levels.
levels - values of the factor used in the experiment. For example, in an experiment to assess the effects
of different amounts of UV radiation upon the growth rate of smolt, the UV radiation was held at
normal, 1/2 normal, and 1/5 normal levels. These would be the three levels for this factor.
treatment - the combination of factor levels applied to an experimental unit. If an experiment has a
single factor, then each treatment would correspond to one of the levels. If an experiment had two or
more factors, then the combination of levels from each factor applied to an experimental unit would
be the treatment. For example, with two factors having 2 and 3 levels respectively, there are 6 possible
treatment combinations.
response variable - the outcome being measured. For example, in an experiment to measure
smolt growth in response to UV levels, the response variable for each smolt could be final weight after
30 days.
experimental unit - the unit to which the treatment is applied. For example, several smolt could
be placed into a tank and the tank exposed to different amounts of UV radiation. The tank is the
experimental unit.
observational unit - the unit on which the response is measured. In some cases, the observational
unit may be different from the experimental unit - be careful!
CAUTION: A common mistake in the analysis of experimental data is to confuse the experimental
and observational unit. This leads to pseudo-replication as discussed in the very nice paper by
Hurlbert (1984).
For example, consider an experiment to investigate the effects of UV levels on the growth of smolt.
Two tanks are prepared; one tank has high levels of UV light, the second tank has no UV light. Many
fish are placed in each tank. At the end of the experiment, the individual fish are measured. In this
experiment, the observational unit is the smolt, but the experimental unit is the tank. The treatments are
NOT individually administered to single fish - a whole group of fish is simultaneously exposed to the
UV radiation. Here any tank effect, e.g. closeness of the tank to a window, is completely confounded
with the experimental treatment and cannot be separated. You CANNOT analyze this data at the
individual fish level.
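The arithmetic of treatment combinations is just a cross product of the factor levels. A small sketch (illustrative Python; the factor names are borrowed from this chapter's examples):

```python
from itertools import product

lighting = ["high", "low"]                    # factor 1: 2 levels
uv = ["normal", "1/2 normal", "1/5 normal"]   # factor 2: 3 levels

# the treatments are all combinations of one level from each factor:
# 2 x 3 = 6 possible treatment combinations
treatments = list(product(lighting, uv))
```

With a single factor, the treatments are simply its levels; each extra factor multiplies the number of treatment combinations by its number of levels.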
Identify the factor, its levels, the treatments, the response variable, the experimental unit, and the
observational unit in the following situations:
An agricultural experimental station is going to test two varieties of wheat. Each variety will be
planted on 3 fields, and the yield from each field will be measured.
An agricultural experimental station is going to test two varieties of wheat. Each variety will be tested
with two types of fertilizers. Each combination will be applied to two plots of land. The yield will be
measured for each plot.
Fish farmers want to study the effect of an anti-bacterial drug on the amount of bacteria in fish gills.
The drug is administered at three dose levels (none, 20, and 40 mg/100L). Each dose is administered
to a large controlled tank through the filtration system. Each tank has 100 fish. At the end of the
experiment, the fish are killed, and the amount of bacteria in the gills of each fish is measured.
4.1.2 Treatment, Experimental Unit, and Randomization Structure
Every experiment can be decomposed into three components:
Treatment Structure This describes the relationships among the factors. In this course, you will only
see factorial experiments where every treatment combination appears in the experiment.
Experimental Unit Structure This describes how the experimental units are arranged among them-
selves. In this course you will see three types of experimental unit structures - independent, blocked,
or split-plotted.
Randomization Structure This describes how treatments are assigned to experimental units. In this
course you will see completely randomized designs and blocked designs.
By looking at these three structures, it is possible to correctly and reliably analyze any experimental
design without resorting to cookbook methods. This philosophy is known as No-Name Experimental
Design and is exemplified by the book of the same name by Lorenzen and Anderson (1993).
The raw data cannot, by itself, provide sufficient information to decide on the experimental design used
to collect the data. For example, consider an experiment to investigate the influence of lighting level (High or
Low) and moisture level (Wet or Dry) upon the growth of plants grown in pots.
(Footnote: This is a popular pastime in British Columbia!)
Four possible experimental designs are shown below:
In each case, plants are potted, the treatment applied, and the resulting growth of the plants' leaves (say,
final total biomass) is measured.
Treatment structure. In all four designs, the treatment structure is the same. There are 2 factors
(lighting level and moisture level), each with 2 levels, giving a total of 4 treatments. All treatments appear
in all of the experiments (giving what is known as a factorial treatment structure).
Experimental Unit structure. In Design A, the experimental and observational unit is the pot. The
treatments (HD, HW, LD, LW) are assigned to individual pots, and a single measurement is obtained on the
single plant in each pot.
In Design B, the pots are first grouped into two houses. As before, the experimental and observational
units are the pots.
In Design C, the experimental unit is the pot (which contains two plants) but the observational unit is the
plant.
In Design D, four growth chambers are obtained. The lighting level (H or L) is assigned to a growth
chamber. Two pots are placed in each growth chamber. The moisture level is assigned to the individual pot.
The observational unit is the single plant in each pot.
Randomization Structure. In Design A, there is complete randomization of treatments to experimental
units. In Design B, there is a restricted randomization. Each house receives all four treatments (exactly
once in each house) and the treatments are randomized to pots independently in each house. In Design C,
there is complete randomization of treatments to the pot. Both plants in a pot receive the same treatment.
Finally, in Design D, the lighting levels are randomized to the growth chambers, while the moisture level is
randomized to the individual pots.
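The randomization structures of Designs A and B can be made concrete with a short sketch. This is illustrative Python, not part of the original experiment; the pot and house labels and the random seed are assumptions:

```python
import random

random.seed(42)  # reproducible illustration only

treatments = ["HD", "HW", "LD", "LW"]  # High/Low light crossed with Dry/Wet

# Design A (completely randomized): shuffle two copies of each treatment
# over the 8 pots.
pots = treatments * 2
random.shuffle(pots)
design_a = {f"pot{i + 1}": trt for i, trt in enumerate(pots)}

# Design B (randomized complete block): each house is a block, and the four
# treatments are randomized to pots independently within each house.
design_b = {}
for house in ("house1", "house2"):
    for i, trt in enumerate(random.sample(treatments, k=4)):
        design_b[f"{house}-pot{i + 1}"] = trt

print(design_a)
print(design_b)
```

Design A randomizes over all eight pots at once, while Design B randomizes the four treatments separately within each house; that restriction on the randomization is exactly what makes Design B a blocked design.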
All four designs give rise to 8 observations (the growth of each plant). However, the four designs MUST
BE ANALYZED DIFFERENTLY(!) because the experimental design is different.
Design A is known as a completely randomized design (CRD). This is the default design assumed by
most computer packages.
Design B is an example of a Randomized Complete Block (RCB) design. Each house is a block (stratum)
and each block has every treatment (is complete). It would be incorrect to analyze Design B using the
default settings of most computer packages.
Design C is an example of a pseudo-replicated design (Hurlbert, 1984). While there are two plants in each
pot, each plant did not receive its own treatment, but they received treatments in pairs. There is no valid
analysis of this experiment because there are no replicated treatments, i.e. there is only one pot for each
combination of lighting level and moisture level. It would be incorrect to analyze Design C using the
default settings of most computer packages.
Finally, Design D is a combination of two different designs operating at two different sizes of experimental
units. The assignment of lighting levels to growth chambers looks like a completely randomized design at the
growth chamber level. The assignment of moisture level to the two pots within a growth chamber looks like a
blocked design with growth chambers serving as blocks, and the individual pots serving as the experimental
units for moisture level. This is an illustration of what is known as a split-plot (from its agricultural heritage)
design with main plots (growth chambers) being assigned to lighting levels and split-plots (the pots) being
assigned to moisture level. You could, of course, reverse the roles of lighting level and moisture level to
get yet another split-plot design. It would be incorrect to analyze Design D using the default settings of
most computer packages.
The moral of the story is that you must carefully consider how the data were collected before using a
computer package. This is the most common error in the analysis of experimental data - the unthinking use
of a computer package where the actual design of the experiment does not match the analysis being done. It
is ALWAYS helpful to draw a picture of the actual experimental design to try and decide upon the treatment,
experimental, and randomization structure.
Be sure that the brain is engaged, before putting a package into gear!
4.1.3 The Three Rs of Experimental Design
It is important to
randomize because it averages out the effect of all other lurking variables. Note that randomization
doesn't remove their effects, but makes, on average, their effects equal in all groups.
replicate because (1) some estimate of natural variation is required in order to know if any observed
difference can be attributed to the factor or just random chance; and (2) the experiment should have
sufficient power to detect an effect of a biologically meaningful size.
stratify (block) to account for and remove the effects of a known extraneous variable.
What do the above mean in practice?
Consider the very simple agricultural problem of comparing two varieties of tomatoes. The purpose of
the comparison is to find the variety which produces the greater quantity of marketable quality fruit from a
given area for large scale commercial planting. What should we do?
A simple approach would be to plant a tract of land in each variety and measure the total weight of
marketable fruit produced. However, there are some obvious difficulties. The variety that cropped most
heavily may have done so simply because it was growing in better soil. A number of factors affect growth:
soil fertility, soil acidity, irrigation and drainage, wind exposure, exposure to sunlight (e.g. shading, north-
facing or south-facing hillside). Unfortunately no one knows exactly to what extent changes in these factors
affect growth.
So unless the two tracts of land are comparable with respect to all of these features, we won't be able to
conclude that the more heavily producing variety is better as it may just be planted in a tract that is better
suited to growth. If it was possible (and it never will be) to find two tracts of land that were identical in these
respects, using just those two tracts for comparison would result in a fair comparison but the differences
found might be so special to that particular combination of growing conditions that the results obtained
would not be a good guide to full scale agricultural production anyway.
Why randomize? Let us think about it another way. Suppose we took a large tract of land and
subdivided it into smaller plots by laying down a rectangular grid. By using some sort of systematic design
to decide what variety to plant in each plot, we may come unstuck if there is a feature of the land like an
unknown fertility gradient. We may still end up giving one variety better plots on average. Instead, let's
do it randomly by numbering the plots and randomly choosing half of them to receive the first variety. The
rest receive the second variety. We might expect the random assignment to ensure that both varieties were
planted in roughly the same numbers of high fertility and low fertility plots, high pH and low pH plots, well
drained and poorly drained plots etc. In that sense we might expect the comparison of yields to be fair.
Moreover, although we have thought of some factors affecting growth, there will be many more that we, and
even the specialist, will not have thought of. And we can expect the random assignment of treatments to
ensure some rough balancing of those as well!
Why replicate? Random sampling gives representative samples, on average. However, in small samples,
it may occur, just by chance, that your sample may be a bit weird. Unfortunately, we can only expect the
random allocation of treatments to lead to balanced samples (e.g. a fair division of the more and less fertile
plots) if we have a large number of experimental units to randomize. In many experiments this is not true
(e.g. using 6 plots to compare 2 varieties) so that in any particular experiment there may well be a lack of
balance on some important factor. Random assignment still leads to fair or unbiased comparisons, but only
in the sense of being fair or unbiased when averaged over a whole sequence of experiments. This is one of
the reasons why there is such an emphasis in science on results being repeatable.
A more important reason for adequate replication is that large experiments have a greater chance of
detecting important differences. This seems self evident, but in practice can be hard to achieve. One com-
mon error often made is to confuse pseudo-replication with real replication - refer to your assignments for
examples of this.
Why replicate at all? Why not just measure one unit under each treatment? Without replication, it is
impossible to determine whether a difference is caused by random chance or by the factor.
Why block? Partly because random assignment of treatments does not necessarily ensure a fair compar-
ison when the number of experimental units is small, more complicated experimental designs are available
to ensure fairness with respect to those factors which we believe to be very important. Suppose with our
tomato example that, because of the small variation in the fertility of the land we were using, the only thing
that we thought mattered greatly was drainage. We could then try and divide the land into two blocks, one
well drained and one badly drained. These would then be subdivided into smaller plots, say 10 plots per
block. Then in each block, 5 plots are assigned at random to the first variety and the remaining 5 plots to
the second variety. We would then only compare the two varieties within each block so that well drained
plots are only compared with well drained plots, and similarly for badly drained plots. This idea is called
blocking. By allocating varieties to plots within a block at random we would provide some protection against
other extraneous factors.
Another application of this idea is in the comparison of the effects of two pain relief drugs on people.
If the situation allowed, we would like to try both drugs on each person and look at the differential effects
within people. To protect ourselves against biases caused by a drift in day to day pain levels or other physical
factors within a person that might affect the response, the order in which each person received the drugs
would be randomly chosen. Here the block is the person and we have two treatments (drugs) per person
applied in a random order. The randomization could be done in such a way that the same number received
each drug first so that the design is balanced with regard to order. In the words of Box, Hunter and Hunter
[1978, page 103] the best general strategy for experimentation is, "block what you can and randomize what
you cannot."
4.1.4 Placebo Effects
There are psychological effects which complicate the effects of treatments on people. In medicine, there is a
tendency for people to get better because they think that something is being done about their complaint. So
when you give someone a pain relief drug there are two effects at work, a chemical effect of the drug and
a psychological boost. To give you some idea of the strength of these psychological effects, approximately
35% of patients have been shown to respond favorably to a placebo (or inert dummy treatment) for a wide
range of conditions including alleviation of pain, depression, blood pressure, infantile asthma, angina and
gastro-intestinal problems (Grimshaw and Jaffe [1982]).
Consequently, any control treatment must look and feel as similar as possible to any of the active treatments
being applied.
4.1.5 Single and double blinding
In evaluating the effect of the drug, we want to filter out the real effect of the drug from the effect of the
patients own psychology. The standard method is to compare what happens to patients who are given the
drug (the treatment group) with what happens to patients given an inert dummy treatment called a placebo
(the control group). This will only work if the patients do not know whether they are getting the real drug
or the placebo. Similarly, when we are comparing two or more treatments the subjects (patients) should not
know which treatment they are receiving, if at all possible. This idea is called single blinding the subjects.
The results are not then contaminated by any preconceived ideas about the relative effectiveness of the
treatments. For example, consider a comparative study of two asthma treatments, one in liquid form and one
in powder form. To blind the subjects, each subject had to receive both a powder and a liquid. One was a
real treatment, the other a placebo. The idea of using a control or control group to evaluate the effect of an
experimental intervention is basic to all experimentation.
Blinding, to whatever extent is possible, tends to be desirable in any experiment involving human re-
sponses (e.g. medical, psychological and educational experiments).
It is often advisable to have the experimenter blinded as to the treatment applied as well. An experiment
was performed in England to evaluate the effect of providing free milk to school children. There was a
random allocation of children to the group who received milk (the treatment group) and the control group
which received no milk. However, because the study designers were afraid that random assignment may
not necessarily have balanced the groups on general health and social background at the classroom
level, teachers were allowed to switch children between treatment and control to equalize the groups. It is
easy to imagine the effect this must have had. Most teachers are caring people who would be unhappy to
watch affluent children drinking free school milk while malnourished children go without. We would expect
the sort of interchanging of students between groups that went on would result in too many malnourished
children receiving the milk thus biasing the study, perhaps severely. To protect against this sort of thing,
medical studies are made double blind whenever possible. In addition to the subjects not knowing what
treatment they are getting, the people administering the treatments don't know either! The people evaluating
the results, e.g. deciding whether tissue samples are cancerous or not, should also be blinded.
4.1.6 Hawthorne Effect
In studies of people, the actual process of measuring or observing people changes their behavior. According
to Wikipedia²,
The Hawthorne effect was not named after a researcher but rather refers to the factory where
the effect was first thought to be observed and described: the Hawthorne works of the Western
Electric Company in Chicago, 1924-1933. The phrase was coined by Landsberger in 1955. One
definition of the Hawthorne effect is:
An experimental effect in the direction expected but not for the reason expected; i.e., a
significant positive effect that turns out to have no causal basis in the theoretical motivation
for the intervention, but is apparently due to the effect on the participants of knowing
themselves to be studied in connection with the outcomes measured.
For example, would you change your viewing habits on TV if you knew that your viewing was being moni-
tored?
4.2 Applying some General Principles of Experimental Design
Does taking Vitamin C reduce the incidence of colds? This is a popular theory. How would you design an
experiment to test this research question using students as guinea pigs?
[We will ignore for a moment the whole problem of how representative students are of the population
in general. This is a common problem with many drug studies that are often conducted on a single gender
and age group and then the drug company wishes to generalize the results to the other gender and other age
groups.]
Assume that you have a group of 50 students available for the experiment.
First: What are the experimental units, the factor, its levels, and the treatments?
² http://en.wikipedia.org/wiki/Hawthorne_effect Accessed on 2007-10-01.
4.2.1 Experiment 1
Have all 50 students take Vitamin C supplements for 6 months and record the number of colds. After 6
months, the data were collected, and on average, the group of students had 1.4 colds per subject.
What is the flaw in this design?
there is no control group that the results can be compared against. Is an average of 1.4 colds in 6
months unusually low?
4.2.2 Experiment 2
The students are divided into two groups. All of the males receive Vitamin C supplements for 6 months
while all of the females receive nothing. Each student records the number of colds incurred in the 6 month
period. At the end of the 6 months, the males had an average of 1.4 colds per subject; the females had an
average of 1.9 colds per subject.
What is the flaw in this design?
there is a control group, but any effect of the Vitamin C is confounded with gender, i.e., we can't
tell if the difference in the average number of colds is due to the Vitamin C or to the gender of the
students.
4.2.3 Experiment 3
The students are randomly assigned to each of the two groups. This is accomplished by putting 25 slips of
paper into a hat marked Vitamin C, and 25 slips of paper into a hat marked Control. The slips of paper
are mixed. Each student then selects a slip of paper. Those with the slips marked Vitamin C are given
Vitamin C supplements for 6 months, while those with the slips marked Control are not given anything.
Each student records the number of colds incurred in the 6 month period. At the end of the 6 months, the
Vitamin C group had an average of 1.4 colds per subject; the control group had an average of 1.9 colds per
subject.
What is the flaw in this design?
the study is not double-blinded. Because the students knew if they were taking Vitamin C or not,
they may modify their behavior accordingly. For example, those taking Vitamin C may feel that they
don't have to wash their hands as often as those not taking Vitamin C under the belief that the Vitamin
C will protect them.
4.2.4 Experiment 4
The researcher makes up 50 identical pill bottles. In half, the researcher places Vitamin C tablets. In the
other half, the researcher puts identical looking and tasting sugar pills (a placebo). The researcher then asks
an associate to randomly assign the jars and to keep track of which jar has which tablet, but the associate is
not to tell the researcher. The key is put into a sealed envelope. The numbered jars are then placed in a box,
and mixed up. Each student then selects a jar and tells the researcher the number on the jar. Each student
takes the pill for the next six months. Each student records the number of colds incurred in the 6 month
period. After the data are recorded, the associate opens the sealed envelope and each student is then labeled
as having taken a Vitamin C or a placebo. The Vitamin C group had an average of 1.4 colds per subject; the
control group had an average of 1.9 colds per subject.
What is the flaw in this design?
None! This is a properly conducted experiment. It has a control group so that comparisons can be
made between two groups. Students are randomly assigned to each group so that any lurking variable
will be roughly present in both groups and its effects will be roughly equal. The control group
gets an identical tasting placebo so the subjects don't know what treatment they are receiving. The
researcher doesn't know what the students are taking until after the experiment is complete so can't
unconsciously influence students' behavior. [This is known as double-blinding; neither the subject nor
the researcher knows the treatment until the experiment is over.]
This experiment can be diagrammed as follows: [diagram not reproduced]
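The allocation step of this protocol can also be sketched in code. The seed and labels are illustrative assumptions; the essential feature is that only the sealed key links bottle numbers to contents:

```python
import random

random.seed(11)  # reproducible illustration only

# The associate fills 50 numbered bottles, half with Vitamin C and half with
# placebo, shuffles the contents, and seals the key away from the researcher.
contents = ["vitamin_c"] * 25 + ["placebo"] * 25
random.shuffle(contents)
sealed_key = {bottle: pill for bottle, pill in enumerate(contents, start=1)}

# During the study the researcher sees only the bottle numbers:
visible_to_researcher = sorted(sealed_key)  # 1..50, no treatment information

# Only after the colds are recorded is the envelope opened and each student
# labelled as Vitamin C or placebo.
vitamin_group = [b for b, pill in sealed_key.items() if pill == "vitamin_c"]
placebo_group = [b for b, pill in sealed_key.items() if pill == "placebo"]
print(len(vitamin_group), len(placebo_group))
```

Until the envelope is opened, neither the researcher nor the students can connect a bottle number to a treatment, which is what makes the study double-blind.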
4.2.5 Experiment 5
The researcher was unable to recruit enough students from a single class, and so has to recruit students from
two different classes. Because the contact patterns of students may differ among the classes (e.g. one class
may be in the morning and students have to cram together on the buses to get to class on time), the researcher
would like to control for any class effect. Consequently, the two classes are treated as separate blocks of
students, and within each class, the researcher makes up identical pill bottles. In half of the bottles, the
researcher places Vitamin C tablets. In the other half, the researcher puts identical looking and tasting sugar pills (a
placebo). The researcher then asks an associate to randomly label the jars and to keep track of which jar has
which tablet, but the associate is not to tell the researcher. The key matching the bottles to the pill type is put
into a sealed envelope. The numbered jars are then placed in a box, and mixed up. Each student then selects
a jar and tells the researcher the number on the jar. Each student takes the pill for the next six months. Each
student records the number of colds incurred in the 6 month period. After the data are recorded, the associate
opens the sealed envelope and each student is then labeled as having taken a Vitamin C or a placebo. The
Vitamin C group from Class A had an average of 1.4 colds per subject; the control group in Class A had an
average of 1.9 colds per subject. The Vitamin C group from class B had an average of 1.2 colds per subject;
the control group had an average of 1.5 colds per subject.
What is the flaw in this design?
None! This is an even better experiment! There are two mini-experiments being conducted simulta-
neously in the two blocks or strata - the two classes. Blocking or stratification is useful when there
may be a known lurking variable that is strongly suspected to affect the results. It is not necessary to
stratify or to block, but statistical theory can show that in most cases, you can get a more powerful
experiment, i.e. one able to detect smaller differences for the same sample size, by suitably chosen blocking
variables.
4.3 Some Case Studies
4.3.1 The Salk Vaccine Experiment
Many of the issues above arise in the famous story of the 1954 Salk vaccine trial. Polio plagued many parts
of the world in the first half of the twentieth century. It struck in epidemics causing deformity and death,
particularly to children. In 1954 the US Public Health Service began a field trial of a vaccine developed by Jonas Salk.
Two million children were involved, and of these, roughly half a million were vaccinated, half a million
were left unvaccinated and the parents of a million refused the vaccination. The trial was conducted on the
most vulnerable group (in grades 1, 2 and 3) in some of the highest risk areas of the country. The National
Foundation for Infantile Paralysis (NFIP) put up a design in which all grade 2 children whose parents would
consent were vaccinated, while all grades 1 and 3 children then formed the controls. What flaws are apparent
here?
To begin with, polio is highly infectious, being passed on by contact. It could therefore easily sweep
through grade 2 in some districts, thus biasing the study against the vaccine, or it could appear in much
greater numbers in grade 1 or grade 3, making the vaccine look better than it is. Also the treatment group
consisted only of children whose parents agreed to the vaccination whereas no consent was required for the
control group. At that time, well-educated, higher-income parents were more likely to agree to vaccination,
so that treatment and control groups were unbalanced on social status. Such an imbalance could be important
because higher income children tended to live in more hygienic surroundings and, surprisingly, this increases
the incidence of polio. Children from less hygienic surroundings were more likely to contract mild cases
when still young enough to be protected by antibodies from their mothers. They then generated their own
antibodies which protected them later on. We might expect this effect to bias the study against the vaccine.
Some school districts saw these problems with the NFIP design and decided to do it a better way. You
guessed it. A double-blind, controlled, randomized study. The placebo was an injection of slightly salty
water; the treatment and control groups were obtained by randomly assigning to the vaccine or placebo only
those children whose parents consented to the vaccination. Whose results would you believe?
4.3.2 Testing Vitamin C - Mistakes do happen
Even the best laid plans can go awry, however. Mosteller et al. [1983: page 242] describe a Toronto study
conducted to test the theory of two-time Nobel Prize winner Linus Pauling that large doses of vitamin C tend
to prevent the common cold. Patients were randomized with regard to receiving vitamin C or the placebo
and blinded as to the treatment they were receiving. However some participants tried to find out whether
they were getting vitamin C by opening their capsules and tasting! Ostensibly, the trial showed significantly
fewer colds in the vitamin group than the control group. However, when those who tasted and correctly
guessed what they were getting were eliminated from the analysis, any difference between the groups was
small enough to be due to chance variation.
4.4 Key Points in Design of Experiments
This section is adapted from the article:
Tingley, M.A. (1996). Some Practical Advice for Experimental Design and Analysis. Liaison, 10(2).
4.4.1 Designing an Experiment
1. Write down the single most important result that you expect to conclude from this experiment. The
statement should be clear and concise.
2. Describe the experimental protocol. It is often difficult to describe, in simple language, how the data
will be collected. The following questions may help in describing the experiment:
(a) Demonstrate the experimental apparatus. How does it work? Where will the experiment be run?
(b) What will the data look like (e.g. continuous between 10 and 50; integers between 0 and 10;
categories such as low, medium, or high)?
(c) What environmental conditions could affect the measurements?
(d) Which factors will be controlled? At what levels?
(e) There must be a control group otherwise the results can't be compared against anything. The
control group should receive treatment as identical as possible to the experimental groups to
avoid the placebo effect.³ Also be aware of the Hawthorne effect where the simple act of
measurement causes a change in the response.
(f) Avoid experimenter bias. The experiment, if possible, should be double-blinded. This is to
avoid introducing bias from the researcher into the experiment. For example, the researcher
may discourage students who are in the control group from walking outside on cold days. In
some cases, the experimental treatment is so different from the control that it is impossible to
blind it. Great care must be taken in these experiments.
in a trial of a post-heart attack drug, the researcher assigned the drug to the more severe
cases so that they would have a better chance of survival!
(g) Which external inuences are not controlled? Can they be blocked? Consider blocking or
stratifying the experiment if you suspect there is an external lurking variable that will influence
the results. The distinction between a blocking variable and a second experimental factor is not
always clear cut. Some typical blocking or stratification factors are:
location - is the climate the same in different locations in the province?
cage position - experimental animals may respond differently on the edges.
plots of land
(h) Which combination of controlled factors will produce the lowest response? Which will produce
the highest response?
(i) What would be a statistically significant result?
(j) What would be a practically significant result? This is often HARD to determine.
(k) Make a guess as to the noise level, i.e., variation in the data.
(l) What are the sampling costs or the cost to take one measurement?
(m) Estimate the sample size requirements to detect important differences and compare the required
sample size to the sample size afforded by your budget.
³ Recall that a placebo effect occurs when the actual process involved in the experiment causes the results regardless of the treatment.
For example, if students know they are in a study and are receiving Vitamin C, they may change their eating habits, which could affect
the results regardless of whether they are receiving the treatment or not.
(n) Describe the randomization. The experimental units should be randomly assigned to the treat-
ments. Lurking variables are always present and in many cases, their effects are unknown. By
randomly assigning experimental units to treatments, both the experimental and control groups
should have roughly the same distribution of lurking variables, and their effects should be
roughly equal in both groups. Hence in any comparison between groups, the effect of the lurk-
ing variables should cancel out. Note that randomization doesn't eliminate the effect of lurking
variables on the individual subjects; it only tries to ensure that the total effects on both groups
are about the same. Some typical problems caused by failure to randomize:
cage position having an effect on rats' behavior. Cages on the edge get more light.
fields differ in fertility and so variety trials may be confounded with field fertility differences.
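For step (m), a rough sample-size calculation can be done with the standard normal-approximation formula for comparing two means. The function below is a sketch, and the numbers plugged in (a difference of 0.5 colds, standard deviation 1.2) are purely illustrative; exact calculations should use power software or tables:

```python
import math

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a mean difference `delta`
    between two groups with common standard deviation `sd`, using the usual
    normal-approximation formula n = 2 (z_alpha/2 + z_beta)^2 (sd/delta)^2."""
    z = {0.05: 1.96, 0.01: 2.576}[alpha]        # two-sided critical value
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]  # power quantile
    return math.ceil(2 * (z + z_beta) ** 2 * (sd / delta) ** 2)

# E.g. to detect a 0.5-cold difference when cold counts vary with sd ~ 1.2:
print(n_per_group(delta=0.5, sd=1.2))
```

Because the required n grows with (sd/delta)², halving the detectable difference roughly quadruples the sample size, which is why step (j), deciding what difference is practically important, matters so much.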
4.4.2 Analyzing the data
1. Look at the data. Use plots and basic descriptive statistics to check that the data look sensible. Are
there any outliers? Decide what to do with outliers.
2. Can you see the main result directly from the data?
3. Draw a picture of how the data were actually collected; it may differ from the plan that was proposed.
4. Think about transformations. In some cases, the correct form of a variable is not obvious, e.g. should
fuel consumption be specied in km/L or L/100 km?
5. Try a preliminary analysis. The analysis MUST match the design, i.e. an RCB analysis must be used
to analyze data that were collected using a blocked design. This is the most crucial step of the
analysis!
6. Plot residuals from the fit. Plot the residuals against fitted values, all predictors, time, or any other
measurement that you can get your hands on (e.g. change in lab technicians). Check for outliers.
7. Which factors appear to be unimportant? Which appear to be most important?
8. Fit a simple sub-model that describes the basic characteristics of the data.
9. Check residuals again.
10. Check to see that the final model is sensible, e.g. if interactions are present, all main effects contained
in the interaction must also be present.
11. Multiple comparisons?
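The warning in step 5, that the analysis must match the design, can be seen with made-up numbers from a simple blocked design: resting heart rates measured on the same eight (hypothetical) people before and after exercise, so each person is their own block. Treating the paired data as two independent samples throws away the blocking:

```python
import math
import statistics

# Hypothetical before/after resting heart rates for 8 people.
before = [68, 75, 80, 62, 71, 77, 65, 73]
after = [74, 82, 85, 66, 78, 84, 70, 80]

# Correct (paired) analysis: work with within-person differences.
diffs = [a - b for a, b in zip(after, before)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Incorrect (two independent samples) analysis ignores the blocking and
# lets person-to-person variation inflate the standard error.
n = len(before)
se = math.sqrt(statistics.variance(before) / n + statistics.variance(after) / n)
t_unpaired = (statistics.mean(after) - statistics.mean(before)) / se

# The paired analysis yields a far larger t statistic from the same numbers.
print(round(t_paired, 1), round(t_unpaired, 1))
```

The estimated difference is identical in both analyses; only the standard error changes. The unpaired analysis would fail to detect an effect that the correct blocked analysis finds overwhelmingly clear.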
4.4.3 Writing the Report
1. State clearly the major findings of the analysis.
2. Construct a suitable graph to display the results.
3. Never report naked p-values. Remember, statistical significance is not the same as biologically
important. Conversely, failure to detect a difference does not mean no difference.
4. Estimate effect sizes. Never report naked estimates; be sure to report a measure of precision (a
standard error or a confidence interval).
5. Put the results in a biological context. A statistically significant result may not be biologically important.
6. Put relevant technical stuff into an appendix. Then halve the size of the appendix.
4.5 A Road Map to What is Ahead
4.5.1 Introduction
Experiments are usually done to compare the average response or the proportion of responses among two or
more groups. For example, you may wish to compare the mean weight gain of two different feeds given to
livestock in order to decide which is better. In these cases, there is no particular value of interest for each
feed-type; rather the difference in performance is of interest.
Other examples of experimental situations:
has the level of popular support for a particular political party changed since the last poll?
is a new drug more effective in curing a disease than the standard drug?
is a new diet more effective in weight loss than a standard diet?
is a new method of exercise more beneficial than a standard method?
4.5.2 Experimental Protocols
There are literally hundreds of different experimental designs that can be used in research. One of the jobs
of a statistician is to be able to recognize the wide variety of designs and to understand how to apply very
general methods to help analyze these experiments.
We shall start with a particular type of experiment - namely, a single-factor experiment with two or more
levels. For example:
we may be testing types of drugs (the factor) at three levels (a placebo, a standard drug, and a new
drug).
we may be examining the level of support over time (the factor) at three particular points in a cam-
paign.
we may be interested in comparing brands of batteries (the factor) and looking at two particular brands
(the levels) - Duracell or Eveready.
The same methods can also be used when surveys are taken from two different populations. For example,
is the level of support for a political party the same for males and females? Clearly, gender cannot be
randomized (not yet) to individual people, so you must take a survey of males and females and then make
a decision about the levels of support for each gender. Here gender would be the factor with 2 levels (male
and female).
When the experiment is performed, you must pay attention to the RRR of statistical design (randomization,
replication, and blocking). Randomization ensures that the influence of other, uncontrollable factors
will be roughly equal in all treatment groups. Adequate replication ensures that the results will be precise
(confidence intervals narrow) and that hypothesis tests are powerful. Blocking is a method of trying to control the
suspected effects of a particular variable in the experiment.
Blocking is a very powerful tool. For example:
you know that the resting heart rate varies considerably among people. Consequently, you may decide
to measure a person before and after an exercise to see the change in heart rate, rather than measuring
one person before and a second person after the exercise. By taking both measurements on the same
person, the difference in heart rate should be less variable than the difference between two different
people.
you know that driving habits vary considerably among drivers. Consequently, you may decide to compare
the durability of two tire brands by mounting both brands on the same car and doing a direct comparison
under the same driving conditions, rather than using different cars with different drivers and (presumably)
different driving conditions for each brand.
It is important that you recognize when such pairing or blocking takes place. In general, you can rec-
ognize it when repeated measurements are taken on the same experimental unit (a person, a car) rather than
measurements on different experimental units. There is a restriction in the randomization because you have
to ensure that every treatment occurs on every unit (both brands of tire; both time points on a person).
If blocking has not taken place, we say that the experiment was completely randomized or that it is a
case of independent samples. The latter term is used, because no unit is measured twice, and consequently,
measurements are independent of each other. For example:
separate surveys of males and females
people randomized into one and only one of placebo or vitamin C.
If blocking has taken place, we say that the experiment was blocked and in the case of two levels (e.g.
before/after measurements on a person; two brands of tire on each car) we use the term paired experiment.
For example:
measuring every person using two different walking styles
measuring every person before and after a drug is administered.
panel surveys where the same set of people is surveyed more than once.
Johnson, D. H. (2002). The importance of replication in wildlife studies. Journal of Wildlife Management 66, 919-932.
Johnson (2002) has a very nice discussion about the role of replication in wildlife studies, particularly the role of meta-replication,
where entire studies are repeated. It is recommended that this paper be added to your pile to be
read!
4.5.3 Some Common Designs
The most common analyses gather information on population means, proportions, and slopes. The analyses
for each of these types of parameters can be summarized as follows (not all will be covered in this course):
Table 4.1: Some common experimental designs and analyses

| Type of Parameter | Type of Experimental Design or Survey | Number of levels | Name of Analysis | Example |
|---|---|---|---|---|
| Mean | Completely randomized design or independent surveys | Two | Two independent sample t-test assuming equal variances | 20 patients are randomized to placebo or new drug. Time until death is measured. |
| Mean | Completely randomized design or independent surveys | Two or more | One-way Analysis of Variance; Single factor - CRD - ANOVA | 30 patients are randomized to placebo or 2 other drugs. Time until death is measured. |
| Mean | Blocked design or panel surveys | Two | Paired t-test | Two brands of tires are randomly assigned to either the left or right of each car. Every car has both tires. |
| Mean | Blocked design or panel surveys | Two or more | Blocked Analysis of Variance; Single factor - RCB - ANOVA | Four brands of tires are randomly assigned to the 4 tire positions of each car. Every car has all four brands. |
| Mean | Two factors in a completely randomized design | Two or more for each factor | Two-way ANOVA; Multi-factor - CRD - ANOVA | The effects of both UV radiation and water temperature are randomly assigned to tanks to measure the growth of smolt. |
| Mean | Two factors in a blocked design | Two or more for each factor | Two-way RCB; Multi-factor - RCB - ANOVA | The effects of both UV radiation and water temperature are randomly assigned to tanks to measure the growth of smolt. The experiment is performed at two different laboratories. |
| Mean | Two factors in a split-plot design | Two or more for each factor | Split-plot design | The effects of both UV radiation and water temperature are assigned to tanks to measure the growth of smolt. Each tank has a UV bulb suspended above it, but several tanks in series are connected to water of the same temperature. |
| Proportion | Completely randomized design or independent surveys | Two or more | Chi-square test for independence | Political preference for three parties is measured at two points in time in two separate polls. Different people are used in both polls. |
| Proportion | Blocked design or panel surveys | Two or more | VERY COMPLEX ANALYSIS - not covered in this class | Political preference for three parties is measured at two points in time by asking the same people in both polls (a panel study). |
| Slopes | Completely randomized design or independent surveys | Two or more | Analysis of Covariance (not covered in this class) | Plots of land are randomized to two brands of fertilizer. The dose is varied and interest lies in comparing the increase in yield per change in fertilizer for both brands. |
| Slopes | Blocked design or panel surveys | Two or more | VERY COMPLEX ANALYSIS! | Two drugs are to be compared to see how the heart rate reduction varies by dose. Each person is given multiple doses of the same drug, one week apart, and the heart rate determined. |
Notice that the two-sample t-test and the paired t-test are special cases of more general methods called the Analysis of Variance or ANOVA
for short, to be explored in the following chapter.
Chapter 5
Single Factor - Completely Randomized
Designs (a.k.a. One-way design)
Contents
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2 Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
5.2.1 Using a random number table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.2.2 Using a computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.3 Assumptions - the overlooked aspect of experimental design . . . . . . . . . . . . . . 285
5.3.1 Does the analysis match the design? . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.2 No outliers should be present . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.3 Equal treatment group population standard deviations? . . . . . . . . . . . . . . . 287
5.3.4 Are the errors normally distributed? . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.3.5 Are the errors independent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4 Two-sample t-test- Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.5 Example - comparing mean heights of children - two-sample t-test . . . . . . . . . . . 290
5.6 Example - Fat content and mean tumor weights - two-sample t-test . . . . . . . . . . 297
5.7 Example - Growth hormone and mean nal weight of cattle - two-sample t-test . . . . 303
5.8 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
5.8.1 Basic ideas of power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
5.8.2 Prospective Sample Size determination . . . . . . . . . . . . . . . . . . . . . . . 312
5.8.3 Example of power analysis/sample size determination . . . . . . . . . . . . . . . 313
5.8.4 Further Readings on Power analysis . . . . . . . . . . . . . . . . . . . . . . . . 319
5.8.5 Retrospective Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
5.8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
5.9 ANOVA approach - Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
5.9.1 An intuitive explanation for the ANOVA method . . . . . . . . . . . . . . . . . . 323
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
5.9.2 A modeling approach to ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 328
5.10 Example - Comparing phosphorus content - single-factor CRD ANOVA . . . . . . . . 331
5.11 Example - Comparing battery lifetimes - single-factor CRD ANOVA . . . . . . . . . . 343
5.12 Example - Cuckoo eggs - single-factor CRD ANOVA . . . . . . . . . . . . . . . . . . 353
5.13 Multiple comparisons following ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . 366
5.13.1 Why is there a problem? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
5.13.2 A simulation with no adjustment for multiple comparisons . . . . . . . . . . . . . 367
5.13.3 Comparisonwise- and Experimentwise Errors . . . . . . . . . . . . . . . . . . . 369
5.13.4 The Tukey-Adjusted t-Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.13.5 Recommendations for Multiple Comparisons . . . . . . . . . . . . . . . . . . . . 372
5.13.6 Displaying the results of multiple comparisons . . . . . . . . . . . . . . . . . . . 373
5.14 Prospective Power and sample size - single-factor CRD ANOVA . . . . . . . . . . 375
5.14.1 Using Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
5.14.2 Using SAS to determine power . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
5.14.3 Retrospective Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
5.15 Pseudo-replication and sub-sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
5.16 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.1 What does the F-statistic mean? . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.2 What is a test statistic - how is it used? . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.3 What is MSE? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
5.16.4 Power - various questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
5.16.5 How to compare treatments to a single control? . . . . . . . . . . . . . . . . . . . 384
5.16.6 Experimental unit vs. observational unit . . . . . . . . . . . . . . . . . . . . . . 384
5.16.7 Effects of analysis not matching design . . . . . . . . . . . . . . . . . . . . . . . 385
5.17 Table: Sample size determination for a two sample t-test . . . . . . . . . . . . . . . . 388
5.18 Table: Sample size determination for a single factor, fixed effects, CRD . . . . . . . . 390
5.19 Scientific papers illustrating the methods of this chapter . . . . . . . . . . . . . . . . 393
5.19.1 Injury scores when trapping coyote with different trap designs . . . . . . . . . . . 393
5.1 Introduction
This is the most basic experimental design and the default design that most computer packages assume that
you have conducted. A single factor is varied over two or more levels. Levels of the factor (treatments)
are completely randomly assigned to experimental units, and a response variable is measured from each
unit. Interest lies in determining if the mean response differs among the treatment levels in the respective
populations.
Despite its simplicity, this design can be used to illustrate a number of issues common to all experi-
ments. It also is a key component of more complex designs (such as the split-plot design).
NOT ALL EXPERIMENTS ARE OF THIS TYPE! Virtually all computer packages (even Excel) can
correctly analyze experiments of this type. Unfortunately, not all experiments are single-factor CRD. In my
experience in reviewing reports, it often occurs that a single-factor CRD analysis is applied to designs where it
is inappropriate. Therefore, don't blindly apply a computer package to experimental data before verifying
that you understand the design. Be sure to draw a picture of the design as illustrated in previous chapters.
An experiment MUST have the following attributes before it should be analyzed using single-factor
CRD methods:
1. Single factor with two or more levels. Is there a single experimental factor that is being manipulated
over the experimental units? [In more advanced courses, this can be relaxed somewhat.]
2. Experimental unit = observational unit. Failure to have the observational and experimental unit be
the same is the most common error made in the design and analysis of experiments - refer to the publication
by Hurlbert (1984). Look at the physical unit being measured in the experiment - could each
individual observational unit be individually randomized to a treatment level? Common experiments
that fail this test are sub-sampling designs (e.g. fish in a tank).
3. Complete randomization. Could each experimental unit be randomized without restriction to any
of the treatment levels? In many designs there are restrictions on randomization (e.g. blocking) that
restrict the complete randomization of experimental units to treatments.
As noted in the chapter on Survey Sampling, analytical surveys can be conducted with similar aims. In
this case, there must be separate and complete randomization in the selection of experimental units with
equal probability of selection for each unit before using these methods.
These designs are commonly analyzed using two seemingly different, but equivalent procedures:
Two-sample t-test - used with exactly two levels in the factor.
Single factor CRD ANOVA - used with two or more levels in the factor.
The two-sample t-test is a special case of the more general single-factor CRD ANOVA and the two analyses
give identical results for this case. Because of the large numbers of experiments that fall into this design,
both types of analyses will be explored.
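The equivalence of the two analyses can be checked numerically: for two groups, the one-way ANOVA F-statistic is exactly the square of the pooled two-sample t-statistic. A minimal sketch in Python (not one of the packages used in these notes; the data values are invented purely for illustration):

```python
# Verify numerically that for two groups the one-way ANOVA F-statistic
# equals the square of the pooled two-sample t-statistic.

def pooled_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)            # pooled variance
    return (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    all_obs = [x for g in groups for x in g]
    n, k = len(all_obs), len(groups)
    grand = sum(all_obs) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))     # MS(between) / MS(within)

placebo = [5.1, 6.0, 4.8, 5.5, 6.2]              # fabricated responses
drug    = [6.8, 7.1, 6.0, 7.4, 6.5]
t = pooled_t(placebo, drug)
F = anova_f([placebo, drug])
print(abs(t * t - F) < 1e-9)                     # F = t^2 for two groups
```

The identity F = t^2 holds exactly (up to floating-point rounding), which is why the two analyses give identical p-values for two groups.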
5.2 Randomization
There are two types of experiments or surveys that are treated as Completely Randomized Designs (CRD).
First, in a true experiment, there is a complete randomization of treatments to the experimental units.
Second, in some cases, assignment of treatments to experimental units is not feasible; e.g. it is currently
impossible to randomly assign sex to an animal! These latter experiments are more properly called Analytical
Surveys, and units need to be chosen at random from the populations forming each treatment group.
The basic method of randomizing the assignment of treatments to experimental units is analogous to
placing slips of paper with the treatment levels into one hat, placing slips of paper with a list of the exper-
imental units into another hat, mixing the slips of paper in both hats, and then sequentially drawing slips
from each hat to match the treatment with the experimental unit.
In the case where treatments cannot be randomly assigned to experimental units (e.g. you can't randomly
assign sex to rats), you must select experimental units from the relevant population using a simple random
sample. For example, if the factor is Sex with levels male and female, you must select the experimental
units (people) from the population of males and the population of females using a simple random
sample, as seen in the chapter on Survey Sampling.
In practice, randomization is done using either a random number table or generating random numbers
on the computer.
5.2.1 Using a random number table
Many textbooks contain random number tables. Some tables are also available on the web, e.g. http://
ts.nist.gov/WeightsAndMeasures/upload/AppenB-HB133-05-Z.pdf. Most tables are
arranged in a similar fashion. They contain a list of random one-digit numbers (from 0 to 9 inclusive)
that have been arbitrarily grouped into sets of 5 digits and arbitrarily divided into groups of 10 rows. [The row
number is NOT part of the table.]
Each entry in the table is equally likely to be one of the values from 0 to 9; each pair of digits in the table
is equally likely to be one of the values from 00 to 99; each triplet in the table is equally likely to be one of
the values 000 to 999; and so on.
Assigning treatments to experimental units
Suppose that you wanted to randomly assign 50 experimental units to two treatments. Consequently, each
treatment must be assigned to 25 experimental units.
1. Label the experimental units from 1 to 50.
2. Enter the table at an arbitrary row and position in the row, and pick off successive two-digit groups.
Each two-digit group will select one of the experimental units. [Ignore 00 and 51-99.] For example,
suppose that you enter the table at row 48. The random digits from the table are:
48 46499 94631 17985 09369 19009 51848 58794 48921 22845 55264
and so the two-digit groups from this line are:
46 49 99 46 31 17 98 50 93 69 19 00 95 18 48
c 2012 Carl James Schwarz 275 December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
The first 25 distinct two-digit groups that are between 01 and 50 (inclusive) are used to select the
units for the first treatment. From the above table, experimental units 46, 49, 31, 17, 50, 19, 18, 48,
. . . belong to the treatment 1 group. Note that the value 46 occurred twice - it is only used once.
3. The remainder of the experimental units belong to the second treatment.
This can be extended to many treatment groups, by choosing first those experimental units that belong to
the first group, then the experimental units that belong to the second group, etc. An experimental unit cannot
be assigned to more than one treatment group, so it belongs to the first group it is assigned to.
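The table lookup above is easy to mechanize. This Python sketch (not one of the packages used in these notes) parses the quoted digits from row 48 into two-digit groups and keeps the first distinct values between 01 and 50, reproducing the selections listed in the text:

```python
# Reproduce the worked example: split the random digits from row 48 into
# two-digit groups and select distinct labels in 01-50 for treatment 1;
# all remaining units go to treatment 2.
row48 = "46499 94631 17985 09369 19009 51848 58794 48921 22845 55264"
digits = row48.replace(" ", "")

pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]

treatment1 = []
for p in pairs:
    if 1 <= p <= 50 and p not in treatment1:   # skip 00, 51-99, and repeats
        treatment1.append(p)
    if len(treatment1) == 25:
        break

# One row supplies only 11 usable labels; in practice you would continue
# with the next rows of the table until 25 units are chosen.
treatment2 = [u for u in range(1, 51) if u not in treatment1]
print(treatment1[:8])   # [46, 49, 31, 17, 50, 19, 18, 48] as in the text
```

Note how the duplicate 46 and the out-of-range groups (99, 98, 93, 69, 00, 95, . . .) are skipped automatically, exactly as in the manual procedure.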
Selecting from the population
Suppose you wish to select 10 units from each of two populations of animals representing males and females.
There are 50 animals of each sex.
The following is repeated twice, once for males and once for females.
1. Label the units in the population from 1 to 50.
2. Enter the table at an arbitrary row and position in the row, and pick off successive two-digit groups.
Each two-digit group will select one of the units from the population. Continue until 10 are selected.
For example, suppose that you enter the table at row 48. The random digits from the table are:
48 46499 94631 17985 09369 19009 51848 58794 48921 22845 55264
and so the two-digit groups from this line are:
46 49 99 46 31 17 98 50 93 69 19 00 95 18 48
The first 10 distinct two-digit groups that are between 01 and 50 (inclusive) are used to select the units
for the first treatment group. From the above table, units 46, 49, 31, 17, 50, 19, 18, 48, . . . are
selected from the first treatment population. Note that the value 46 occurred twice - it is only used
once.
This can be extended to many treatment groups, by choosing first those experimental units that belong
to the first group, then the experimental units that belong to the second group, etc.
5.2.2 Using a computer
A computer can be used to speed the process.
Randomly assign treatments to experimental units
JMP has a Design of Experiments module that is helpful in doing the randomization of treatments to experi-
mental units.
Start by selecting the Full Factorial Design under the DOE menu item:
For a single factor CRD, click on the Categorical button in the Factors part of the dialogue box and
specify the number of levels in the factor. For example, for a factor with 3 levels, you would specify:
If you have more than one factor, you would add the new factors in the same way.
For each factor, click and change the name of the factor (the default name for the first factor is X1) and the
names of the levels (the default labels for the levels of the first factor are L1, L2, etc.). Suppose the factor of
interest is the dose of the drug and the three levels are Control (0 dose), 1 mg/kg, and 2 mg/kg. The final dialogue
box would look like:
Press the Continue button when all factors and their levels have been specified. It is not necessary that
all factors have the same number of levels (e.g. one factor could have three levels and a second factor could
have two levels).
Finally, specify the total number of experimental units available by changing the Number of replicates
box. JMP labels the second set of experimental units as the first replicate, etc. For example, if you have
12 experimental units to be assigned to the 3 levels of the factor, there are 4 experimental units to be assigned
to each level, which JMP treats as 3 replicates.
Then press the Make Table button to generate an experimental randomization:
Assign a dose of 2 mg/kg to the first animal, treat the second animal as a control, assign a dose of 2 mg/kg
to the 3rd animal, etc.
You may wish to change the name of the response to something more meaningful than simply Y . Once
the experiment is done, enter the data in the Y column.
The above randomization assumes that you want equal numbers of experimental units for each level.
This is NOT a requirement for a valid design; there are cases where you may wish more experimental units
for certain levels of the experiment (e.g. it may be important to have smaller standard errors for the 2 mg/kg
dose than for the 1 mg/kg dose). Such designs can also be developed on the computer; ask me for details.
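If JMP is not available, the same randomization can be produced in a few lines of a general-purpose language. A Python sketch (the factor name, levels, and unit count mirror the drug-dose example above; the seed is arbitrary and is fixed only so the plan can be reproduced):

```python
import random

# Build a completely randomized design: each level of the factor is
# replicated equally, and the list of treatments is shuffled so that the
# i-th experimental unit receives the i-th entry.
levels = ["Control", "1 mg/kg", "2 mg/kg"]
n_units = 12                      # 12 animals -> 4 replicates per level

assignments = levels * (n_units // len(levels))
random.seed(2012)                 # arbitrary seed, for a reproducible plan
random.shuffle(assignments)       # the complete randomization step

for unit, dose in enumerate(assignments, start=1):
    print(f"animal {unit:2d}: {dose}")
```

As with the JMP table, the printed plan is read top to bottom: the first animal gets the first listed dose, and so on.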
Randomly selecting from populations
As noted earlier, it is sometimes impossible to randomize treatments to experimental units (e.g. it is not
feasible to randomly assign sex to animals). In these Analytical Surveys, the key assumption is that the
experimental units used in the experiment are a random sample from the relevant population.
While the random selection of units from a population is a method commonly used in survey sampling,
and there are plenty of methods to ensure this selection is done properly in a survey sampling context, the
experimental design context is fraught with many difficulties that make much of the randomization moot.
For example, in experiments with animals and sex as a factor, the experimenter has no way to ensure that
the animals available are a random sample from all animals of each sex. Typically, for small rodents such as
mice, the animals are ordered from a supply company, housed in an animal care facility not under the direct
control of the experimenter, and are supplied on an as-needed basis. All that can typically be done is hope
for the best. What is the actual population of interest? All animals of that sex? All animals of that sex born
in a particular year? All animals of that sex born in that year for that particular supply company?
Suppose you have an experiment that will examine differences in growth rates between the two sexes of
animals. You clearly cannot assign sex to individual animals. But you have a supply of 20 animals of each
sex, numbered from 1 to 20, and you need to select 10 animals of each sex. You would repeat the
following procedure twice:
Here is the list of male animals which are housed in separate cages:
Use the Table Subset option:
and specify the number of animals to be selected:
This will generate a random sample of size 10 from the original table.
Repeat the above procedure for the female animals.
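Outside JMP, the Table > Subset step amounts to a simple random sample without replacement. A Python sketch (the animal numbering follows the example above; the seed is arbitrary and only makes the selection reproducible):

```python
import random

# Select 10 of the 20 male animals by simple random sampling without
# replacement; the identical step is then repeated for the females.
male_ids = list(range(1, 21))     # male animals numbered 1..20
random.seed(42)                   # arbitrary seed, for reproducibility
chosen_males = sorted(random.sample(male_ids, 10))
print(chosen_males)
```

`random.sample` draws without replacement, so no animal can be selected twice, matching the distinct-selection rule of the table method.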
5.3 Assumptions - the overlooked aspect of experimental design
Each and every statistical procedure makes a number of assumptions about the data that should be verified as the
analysis proceeds. Some of these assumptions can be examined using the data at hand. Other assumptions,
often the most important ones, can only be assessed using the meta-data about the experiment.
The set of assumptions for the single factor CRD are also applicable for the most part to most other
ANOVA situations. In subsequent chapters, these will be revisited and those assumptions that are specic to
a particular design will be highlighted.
The assumptions for single factor CRD are as follows. The reader should refer to the examples in each
chapter for details on how to assess each assumption in actual practice using your statistical package.
5.3.1 Does the analysis match the design?
THIS IS THE MOST CRUCIAL ASSUMPTION!
In this case, the default assumption of most computer packages is that the data were collected under a
single factor Completely Randomized Design (CRD).
It is not possible to check this assumption by simply looking at the data and you must spend some time
examining exactly how the treatments were randomized to experimental units, and if the observational unit is
the same as the experimental unit (i.e. the meta-data about the experiment). This comes down to the RRRs
of statistics - how were the experimental units randomized, what are the numbers of experimental units, and
are there groupings of experimental units (blocks)?
Typical problems are lack of randomization and pseudo-replication.
Was randomization complete? If you are dealing with an analytical survey, then verify that the samples
are true random samples (not merely haphazard samples). If you are dealing with a true experiment, ensure
that there was a complete randomization of treatments to experimental units.
What is the true sample size? Are the experimental units the same as the observational units? In
pseudo-replication (to be covered later), the experimental and observational units are different. An example
of pseudo-replication is an experiment with fish in tanks where the tank is the experimental unit (e.g.
chemicals added to the tank) but the fish are the observational units.
No blocking present? This is similar to the question about complete randomization. The experimental
units should NOT be grouped into more homogeneous units with restricted randomizations within each
group. The distinction between CRD and blocked designs will be come more apparent in later chapters. The
simplest case of a blocked design which is NOT a CRD is a paired design where each experimental object
gets both treatments (e.g. both doses of a drug in random order).
5.3.2 No outliers should be present
As you will see later in the chapter, the idea behind the tests for equality of means is, ironically, to compare
the relative variation among means to the variation within each group. Outliers can severely distort estimates
of the within-group variation and severely distort the results of the statistical test.
Construct side-by-side scatterplots of the individual observations for each group. Check for any outliers
- are there observations that appear to be isolated from the majority of observations in the group? Try to
find a cause for any outliers. If the cause is easily corrected, and not directly related to the treatment effects
(like a data recording error), then alter the value. Otherwise, include a discussion of the outliers and their
potential significance to the interpretation of the results in your report. One direct way to assess the potential
impact of an outlier is to do the analysis with and without the outlier. If there is no substantive difference in
the results - be happy!
A demonstration of the effect of outliers in a completely randomized design is available in the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
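The with-and-without comparison can be scripted directly. A Python sketch (not the document's packages; the group data are fabricated, with 42.0 playing the suspected outlier):

```python
from statistics import mean, stdev

# Sensitivity check: summarize a treatment group with and without a
# suspected outlier to see how much the mean and SD move.
group = [10.2, 11.1, 9.8, 10.6, 42.0, 10.9]   # 42.0 is the suspect point
suspect = 42.0
trimmed = [x for x in group if x != suspect]

print(f"with outlier:    mean={mean(group):.2f}  sd={stdev(group):.2f}")
print(f"without outlier: mean={mean(trimmed):.2f}  sd={stdev(trimmed):.2f}")
```

A large shift in the mean or standard deviation, as here, signals that any conclusion is driven by the single point and must be discussed in the report.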
5.3.3 Equal treatment group population standard deviations?
Every procedure for comparing means that is a variation of ANOVA assumes that all treatment groups have
the same population standard deviation. [1] This can be informally assessed by computing the sample standard
deviation for each treatment group to see if they are approximately equal. Because the sample standard
deviation is quite variable over repeated samples from the same population, exact equality of the sample
standard deviations is not expected. In fact, unless the ratio of the sample standard deviations is extreme (e.g.
more than a 5:1 ratio between the smallest and largest value), the assumption of equal population standard
deviations is likely satised.
More formal tests of the equality of population standard deviations can be constructed (e.g. Levene's test
is recommended), but these are not covered in this course.
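For readers curious what such a formal test looks like in practice, here is a minimal sketch of Levene's test in Python (the data are invented for illustration; this is not part of the original notes):

```python
# Sketch: Levene's test for equality of population variances.
# The three samples below are invented for illustration.
from scipy import stats

g1 = [10.1, 9.8, 10.4, 10.0, 9.9]
g2 = [12.2, 11.9, 12.5, 12.1, 12.0]
g3 = [15.3, 14.8, 15.6, 15.1, 15.0]

# center='median' gives the robust (Brown-Forsythe) variant
stat, p = stats.levene(g1, g2, g3, center='median')
print(f"Levene W={stat:.2f}, p={p:.3f}")
# A large p-value gives no evidence against equal population standard
# deviations; a small p-value suggests heteroscedasticity.
```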
Often you can anticipate an increase in the amount of chance variation with an increase in the mean.
For example, traps with an ineffective bait will typically catch very few insects. The numbers caught may
typically range from 0 to under 10. By contrast, a highly effective bait will tend to pull in more insects, but
also with a greater range. Both the mean and the standard deviation will tend to be larger.
If you have equal or approximately equal numbers of replicates in each group, and not too many
groups, heteroscedasticity (unequal population standard deviations) will not cause serious problems with an
Analysis of Variance. However, heteroscedasticity does cause problems for multiple comparisons (covered
later in this section). By pooling the data from all groups to estimate a common σ, you can introduce
serious bias into the denominator of the t-statistics that compare the means for those groups with larger
standard deviations. In fact, you will underestimate the true standard errors of these means, and could easily
mistake a large chance error for a real, systematic difference.
I recommend that you start by constructing the side-by-side dot plots comparing the observations for
each group. Does the scatter seem similar for all groups? Then compute the sample standard deviations of
each group. Is there a wide range in the standard deviations? [I would be concerned if the ratio of the largest
to the smallest standard deviation is 5:1 or more.] Plot the standard deviation of each treatment group against
the mean of each treatment group. Does there appear to be a relationship between the standard deviation and
the mean? (see footnote 2)

Footnote 1: The sample standard deviations are estimates of the population standard deviations, but you don't really care about the sample
standard deviations, and testing if the sample standard deviations are equal is nonsensical.
Sometimes, transformations can be used to alleviate some of the problems. For example, if the response
variable is a count, often a log or square-root transform makes the standard deviations approximately equal in all
groups.

If all else fails, procedures are available that relax this assumption (e.g. the two-sample t-test using
the Satterthwaite approximation, or bootstrap methods). (see footnote 3) CAUTION: despite their name, non-parametric
methods often make similar assumptions about equal variation in the populations. It is a common fallacy
that non-parametric methods have NO assumptions - they just have different assumptions. (see footnote 4)
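The count-data situation described above (standard deviation growing with the mean, tamed by a square-root transform) can be illustrated numerically. A sketch in Python using simulated Poisson counts as a stand-in for the insect-trap example (an illustration, not from the notes):

```python
# Sketch: for Poisson counts the standard deviation grows with the mean
# (sd = sqrt(mean)), but after a square-root transform the standard
# deviation is roughly constant across groups.
import numpy as np

rng = np.random.default_rng(seed=1)
low  = rng.poisson(lam=4,  size=5000)   # ineffective bait: few insects
high = rng.poisson(lam=64, size=5000)   # effective bait: many insects

# Raw scale: standard deviations differ by roughly sqrt(64/4) = 4
print(np.std(low), np.std(high))

# Square-root scale: standard deviations are roughly equal (about 0.5)
print(np.std(np.sqrt(low)), np.std(np.sqrt(high)))
```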
Special case for CRD with two groups. It turns out that it is not necessary to make the assumption of
equal group standard deviations in the special case of a single factor CRD design with exactly 2 levels. In
this special case, a variant of the standard t-test can be used which is robust against inequality of standard
deviations. This will be explored in the examples.
5.3.4 Are the errors normally distributed?
The procedures in this and later chapters for testing hypotheses about equality of means in their respective
populations assume that observations WITHIN each treatment group are normally distributed. It is a common
misconception that the assumption of normality applies to the pooled set of observations; the
assumption of normality applies WITHIN each treatment group. However, because ANOVA estimates
treatment effects using sample averages, the assumption of normality is less important when sample sizes
within each treatment group are reasonably large. Conversely, when sample sizes are very small in each
treatment group, any formal tests for normality will have low power to detect non-normality. Consequently,
this assumption is most crucial in cases when you can do least to detect it!
I recommend that you construct side-by-side dot-plots or boxplots of the individual observations for each
group. Does the distribution about the mean seem skewed? Find the residuals after the model is fit and
examine normal probability plots. Sometimes problems can be alleviated by transformations. If the sample
sizes are large, non-normality really isn't a problem.

Again, if all else fails, a bootstrap procedure or a non-parametric method (but see the cautions above) can
be used.
Footnote 2: Taylor's Power Law is an empirical rule that relates the standard deviation and the mean. By fitting Taylor's Power Law to these
plots, the appropriate transform can often be determined. This is beyond the scope of this course.

Footnote 3: These will not be covered in this course.

Footnote 4: For example, the rank-based methods, where the data are ranked and the ranks used in the analysis, still assume that the populations
have equal standard deviations.
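The residual-based normality check described above can be sketched numerically as well. A Python illustration with invented data (the Shapiro-Wilk call is one possible formal test; remember its low power at small sample sizes):

```python
# Sketch: check normality WITHIN groups by pooling residuals
# (each observation minus its own group mean).  Data are invented.
import numpy as np
from scipy import stats

groups = {
    "a": np.array([4.1, 3.9, 4.4, 4.0, 4.2, 4.3]),
    "b": np.array([5.0, 5.3, 4.9, 5.1, 5.2, 4.8]),
}

# Residuals: subtract each group's own mean, NOT the grand mean
resid = np.concatenate([y - y.mean() for y in groups.values()])

# Shapiro-Wilk test on the residuals (low power for small n!)
w, p = stats.shapiro(resid)
print(f"Shapiro-Wilk W={w:.3f}, p={p:.3f}")

# For the visual check, stats.probplot(resid, plot=ax) draws the
# normal probability plot described in the text.
```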
c 2012 Carl James Schwarz 288 December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
5.3.5 Are the errors independent?
Another key assumption is that experimental units are independent of each other. For example, the response
of one experimental animal does not affect the response of another experimental animal.
This is often violated by not paying attention to the details of the experimental protocol. For example,
technicians get tired over time and give less reliable readings. Or the temperature in the lab increases
during the day because of sunlight pouring in from a nearby window and this affects the response of the
experimental units. Or multiple animals are housed in the same pen, and the dominant animals affect the
responses of the sub-dominant animals.
If the chance errors (residual variations) are not independent, then the reported standard errors of the
estimated treatment effects will be incorrect and the results of the analysis will be INVALID! In particular, if
different observations from the same group are positively correlated (as would be the case if the replicates
were all collected from a single location, and you wanted to extend your inference to other locations), then
you could seriously underestimate the standard error of your estimates, and generate artificially significant
p-values. This sin is an example of a type of spatial pseudo-replication (Hurlbert, 1984).
I recommend that you plot the residuals in the order the experiment was performed. The residual plot
should show a random scatter about 0. A non-random looking pattern in the residual plot should be
investigated. If your experiment has large non-independence among the experimental units, seek help.
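Beyond eyeballing the run-order plot, a quick informal screen is the lag-1 autocorrelation of the residuals taken in run order. A hypothetical sketch in Python (the residual values are invented to show a drift, e.g. a tiring technician or a warming lab):

```python
# Sketch: informal check of independence via the lag-1 autocorrelation
# of residuals taken in run order.  The residuals are invented; a
# drifting technician or a warming lab shows up as a trend.
import numpy as np

resid_in_run_order = np.array(
    [0.5, 0.4, 0.6, 0.3, 0.1, 0.0, -0.1, -0.2, -0.4, -0.5, -0.3, -0.4]
)  # a clear downward drift over the day

r1 = np.corrcoef(resid_in_run_order[:-1], resid_in_run_order[1:])[0, 1]
print(f"lag-1 autocorrelation: {r1:.2f}")
# Values far from 0 (here strongly positive) suggest non-independent
# errors - seek help before trusting the reported standard errors.
```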
5.4 Two-sample t-test - Introduction
This is the famous Two-sample t-test, first developed in the early 1900s. It is likely the most widely used
methodology in research studies, followed (perhaps) by the slightly more general single factor CRD ANOVA
(Analysis of Variance) methodology.

The basic approach in hypothesis testing is to: formulate a hypothesis in terms of population parameters,
collect some data, and then see if the data are unusual if the hypothesis were true. If so,
then there is evidence against the null hypothesis. However, as seen earlier in this course, the
role of hypothesis testing in research studies is debatable. Modern approaches to this problem play down
the role of hypothesis testing in favor of estimation (confidence intervals).
Many books spend an inordinate amount of time worrying about one- or two-sided hypothesis tests. My
personal view on this matter is that one of three outcomes usually occurs:
1. The results are so different in the two groups that using a one- or two-sided test makes no difference.
2. The results are so similar in the two groups that neither test will detect a statistically significant
difference.
3. The results are borderline, and I'd worry about other problems in the experiment such as violations of
assumptions.
c 2012 Carl James Schwarz 289 December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
Consequently, I don't worry too much about one- or two-sided hypotheses - I'd much rather see a confidence
interval, more attention spent on verifying the assumptions of the design, and good graphical methods
displaying the results of the experiment. We will therefore always conduct two-sided tests, i.e. our
alternate hypothesis will always be looking for differences in the two means in either direction.
5.5 Example - comparing mean heights of children - two-sample t-test
It is well known that adult males and females are, on average, different heights. But is this true at 12 years
of age?
A sample of 63 children (12 years old) was measured in a school and the height and weight recorded.
The data are available in the htwt12.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data height;
   length gender $4;
   infile 'htwt12.csv' dlm=',' dsd missover firstobs=2;
   input gender $ height weight;
run;
Part of the raw data are shown below:
Obs gender height weight
1 f 62.3 105.0
2 f 63.3 108.0
3 f 58.3 93.0
4 f 58.8 89.0
5 f 59.5 78.5
6 f 61.3 115.0
7 f 56.3 83.5
8 f 64.3 110.5
9 f 61.3 94.0
10 f 59.8 84.5
The first question to resolve before any analysis is attempted is to verify that this indeed is a single-factor
CRD. The single factor is sex, with two levels (males and females). The treatments are the sexes.

This is clearly NOT a true experiment, as sex cannot be randomized to children - it is an analytical
survey. What is the population of interest, and are children (the experimental units) randomly selected from
each population? The population is presumably all 12 year-olds in this part of the city. Do children that
come to this school represent a random selection from all students in the area?
The experimental unit and the observational unit are equal (there is only one measurement per child). (see footnote 5)

There doesn't seem to be any grouping of children into more homogeneous sets (blocks, as discussed in
later chapters) that could account for some of the height differences (again ignoring twins).

Hence, it appears that this is indeed a single-factor CRD design.
Let:

μ_m and μ_f represent the population mean height of males and females, respectively.

n_m and n_f represent the sample sizes for males and females, respectively.

Ȳ_m and Ȳ_f represent the sample mean height of males and females, respectively.

s_m and s_f represent the sample standard deviation of heights for males and females, respectively.
The two-sample test proceeds as follows.
1. Specify the hypotheses.
We are not really interested in the particular value of the mean heights for the males or the females.
What we are really interested in is comparing the mean heights of the two populations, or, equivalently,
in examining the difference in the mean heights between the two populations. At the moment, we
don't really know if males or females are taller, and we are interested in detecting differences in either
direction. Consequently, the hypotheses of interest are:

H: μ_f = μ_m, or μ_f − μ_m = 0
A: μ_f ≠ μ_m, or μ_f − μ_m ≠ 0

Note again that the hypotheses are in terms of population parameters and we are interested in testing
if the difference is 0. [A difference of 0 would imply no difference in the mean heights.] We
are interested in both whether males are higher (on average) or lower (on average) in height compared to
females. (see footnote 6)
2. Collect data. Because the samples are independent and no person is measured twice, each person has a
separate row in the table.

Footnote 5: We shall ignore the small number of twin children. In these cases, what is the experimental unit? The family or the child?

Footnote 6: This is technically called a two-sided test.
The data does not need to be sorted in any particular order, but must be entered in this stacked format
with one column representing the factor and one column representing the data.
We start by using Proc SGplot to create side-by-side dot-plots and box plots:
proc sgplot data=height;
   title2 'Plot of height vs. gender';
   scatter x=gender y=height;
   xaxis offsetmin=.05 offsetmax=.05;
run;
which gives
And then:
proc sgplot data=height;
   title2 'Box plots';
   vbox height / group=gender notches;
   /* the notches option creates an overlap region to compare if the medians are equal */
run;
which gives
Proc Tabulate is used to construct a table of means and standard deviations:
proc tabulate data=height;
   title2 'some basic summary statistics';
   class gender;
   var height;
   table gender, height*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1) / rts=20;
run;
which gives:

                          height
             N    Mean   Std   StdErr   95_LCLM   95_UCLM
gender
  f         30    59.5   3.0     0.55      58.4      60.6
  m         33    58.9   3.1     0.54      57.8      60.0
The side-by-side dot plots show roughly comparable scatter for each group, and the sample standard
deviations of the two groups are roughly equal. The assumption of equal standard deviations in each
treatment group appears to be tenable. [Note that in the two-sample CRD experiment, the assumption
of equal standard deviations is not required if the unequal variance t-test is used.] As noted earlier, in
the special case of a single factor CRD with two levels, this assumption can be relaxed.
There are no obvious outliers in either group.
The basic summary statistics show that there doesn't appear to be much of a difference between the means
of the two groups. You could compute a confidence interval for the mean of each group using each
group's own data (the sample mean and estimated standard error) using methods described earlier.
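For example, the 95% confidence interval for the female group reported by Proc Tabulate above (n = 30, mean = 59.5, sd = 3.0) can be reproduced by hand; a sketch in Python (an illustration, not part of the notes):

```python
# Sketch: reproduce the 95% confidence interval for the mean height of
# the female group from its summary statistics (n=30, mean=59.5, sd=3.0,
# as reported by Proc Tabulate above).
import math
from scipy import stats

n, mean, sd = 30, 59.5, 3.0
se = sd / math.sqrt(n)                  # standard error of the mean
tmult = stats.t.ppf(0.975, df=n - 1)    # 95% t multiplier on 29 df
lcl, ucl = mean - tmult * se, mean + tmult * se
print(f"se={se:.2f}, 95% ci = ({lcl:.1f}, {ucl:.1f})")  # (58.4, 60.6)
```

The result matches the 95_LCLM and 95_UCLM columns of the table.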
3. Compute a test-statistic, a p-value and make a decision.
Proc Ttest is used to perform the test of the hypothesis that the two means are the same:
ods graphics on;
proc ttest data=height plot=all dist=normal;
   title2 'test of equality of heights between males and females';
   class gender;
   var height;
   ods output ttests     = TtestTest;
   ods output ConfLimits = TtestCL;
   ods output Statistics = TtestStat;
run;
ods graphics off;
The output is voluminous, and selected portions are reproduced below:
Variable   Method          Variances   t Value   DF        Pr > |t|
height     Pooled          Equal       0.82      61        0.4171
height     Satterthwaite   Unequal     0.82      60.714    0.4165

Variable   gender       Method          Variances   Mean     Lower Limit of Mean   Upper Limit of Mean
height     Diff (1-2)   Pooled          Equal       0.6252   -0.9049               2.1552
height     Diff (1-2)   Satterthwaite   Unequal     0.6252   -0.9029               2.1532

Variable   gender       N    Mean      Std Error   Lower Limit of Mean   Upper Limit of Mean
height     f            30   59.5100   0.5455      58.3943               60.6257
height     m            33   58.8848   0.5351      57.7950               59.9747
height     Diff (1-2)   _    0.6252    0.7652      -0.9049               2.1552

and a final plot of:
This table has LOTS OF GOOD STUFF!

The Unequal Variance t-test (also known as the Welch test, or the Satterthwaite test) does NOT
assume equal standard deviations in both groups. Most statisticians recommend ALWAYS using
this procedure rather than the traditional two-sample equal-variance (also known as the pooled-variance)
t-test, even if the two sample standard deviations are similar, which would indicate that
the latter procedure would also be valid.

The estimated difference in the population mean heights (male average - female average) is
estimated by Ȳ_m − Ȳ_f = .625 inches, with a standard error of the estimated difference in the
means of 0.764 inches. [You don't have to worry about the formula for the estimated standard
error, but you can get it by dividing the difference in means by the t-statistic.] A 95%
confidence interval for the difference in the population mean heights is also shown and ranges
from −0.91 to 2.15 inches. [An approximate 95% confidence interval can be computed as
estimate ± 2(se of estimate).] Because the 95% c.i. for the difference in the population mean
heights includes zero, there is no evidence that the population means (see footnote 7) are unequal. Depending
on the package used, the difference in the means may be computed in the other direction, with
the corresponding confidence interval reversed appropriately.
At this point, the confidence interval really provides all the information you need to make a decision
about your hypothesis. The confidence interval for the difference in mean heights includes zero, so
there is no evidence of a difference in the population mean heights.

Notice that the confidence intervals (see the plot) cover the value of 0, indicating no evidence of a
difference.
We continue with the formal hypothesis testing:
The test-statistic is T = .818. This is a measure of how unusual the data are compared to the
(null) hypothesis of no difference in the population means, expressed as a fraction of standard
errors. In this case, the observed difference in the means is about 0.8 standard errors away from
the value of 0 (representing no difference in the means). [It is not necessary to know how to
compute this value.] Test statistics are hold-overs from the BC (before computer) era when
the test statistic was compared to a statistical table to see if it was statistically significant
or not. In modern days, the test-statistic really doesn't serve any useful purpose and can usually
be ignored. Similarly, the line labeled as the DF (degrees of freedom) is not really needed
when computers are doing the heavy lifting.

The p-value is 0.416. The p-value is a measure of how consistent the data are with the null
hypothesis. It DOES NOT MEASURE the probability that the hypothesis is true! (see footnote 8) The p-value
is attached to the data, not to the hypothesis.

Because the p-value is large (a rough rule of thumb is to compare the p-value to the value of
0.05), we conclude that there is no evidence against the hypothesis that the average height of
male and female children is equal.
Of course, we haven't proven that both genders have the same mean height. All we have shown is
that, based on our sample of size 63, there is not enough evidence to conclude that the mean heights
in the population are different. Maybe our experiment was too small? Most good statistical packages
have extensive facilities to help you plan future experiments. This will be discussed in the section on
Statistical Power later in this chapter.
SAS also provided information for the equal-variance two-sample t-test in the above output.
Because the two sample standard deviations are so similar, the results are virtually identical between the two
variants of the t-test.
Footnote 7: Why do we say the population means? Why isn't the sentence in terms of sample means?

Footnote 8: Hypotheses must be true or false; they cannot have a probability of being true. For example, suppose you ask a child if he/she took
a cookie. It makes no sense to say that there is a 47% chance the child took the cookie - either the child took the cookie or the child
didn't take the cookie.
Modern Statistical Practice recommends that you ALWAYS use the unequal variance t-test (the
first test), as it always works properly regardless of whether the standard deviations are approximately equal or not.
The latter equal-variance t-test is of historical interest, but is a special case of the more general Analysis
of Variance methods which will be discussed later in this chapter.

The formulae to compute the test statistic and df are available in many textbooks and on the web, e.g.
http://en.wikipedia.org/wiki/Students_t-test, and are not repeated here, as they provide
little insight into the logic of the process.

Similarly, many textbooks show how to look up the test statistic in a table to find the p-value, but this is
pointless now that most computers can compute the p-value directly.
5.6 Example - Fat content and mean tumor weights - two-sample t-test
Recent epidemiological studies have shown that people who consume high fat diets have higher cancer rates
and more severe cancers than people who consume low fat diets.

Rats were randomized to one of two diets, one low in fat and the other high in fat. [Why and how was
randomization done?] At the end of the study, the rats were sacrificed, the tumors excised, and the weight of
the tumors found.
Here are the raw data:
Low fat High fat
12.2 12.3
9.7 10.2
9.2 11.8
8.2 11.7
11.2 11.1
9.5 14.6
8.4 11.9
9.3 9.8
11.1 11.3
10.8 10.3
The data are available in the fattumor.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data weight;
   length diet $10.;
   infile 'fattumor.csv' dlm=',' dsd missover firstobs=2;
   input diet $ weight;
run;
Part of the raw data are shown below:
Obs diet weight
1 High Fat 12.3
2 High Fat 10.2
3 High Fat 11.8
4 High Fat 11.7
5 High Fat 11.1
6 High Fat 14.6
7 High Fat 11.9
8 High Fat 9.8
9 High Fat 11.3
10 High Fat 10.3
First verify that a single-factor CRD analysis is appropriate. What is the factor? How many levels? What
are the treatments? How were treatments assigned to experimental units? Is the experimental unit the same
as the observational unit?
Let:

μ_L and μ_H represent the true mean weight of tumors in all rats under the two diets.

Ȳ_L and Ȳ_H, etc. represent the sample statistics.
1. Formulate the hypotheses.
H: μ_H = μ_L, or μ_H − μ_L = 0
A: μ_H ≠ μ_L, or μ_H − μ_L ≠ 0

Note that we have formulated the alternate hypothesis as a two-sided hypothesis (we are interested in
detecting differences in either direction). It is possible to formulate the hypothesis so that only changes in
a single direction are detected (i.e. does a higher fat diet lead to larger (on average) tumor weights),
but this is not done in this course.
2. Collect data and look at summary statistics.
The data should be entered into most packages in a case-by-variable structure, i.e. each row should
contain the data for a SINGLE experimental unit and each column should represent a different variable. Most
packages will require one column to be the experimental factor and a second column to be the response
variable.

The order of the data rows is not important, nor do the data for one group have to be entered before the
data for the second group.

It is important that the factor variable have the factor or character attribute. The response variable
should be a continuous scaled variable.
We start by using Proc SGplot to create side-by-side dot-plots and box plots:
proc sgplot data=weight;
   title2 'Plot of weight vs. diet';
   scatter x=diet y=weight;
   xaxis offsetmin=.05 offsetmax=.05;
run;
which gives
And then:
proc sgplot data=weight;
   title2 'Box plots';
   vbox weight / group=diet notches;
   /* the notches option creates an overlap region to compare if the medians are equal */
run;
which gives
Proc Tabulate is used to construct a table of means and standard deviations:
proc tabulate data=weight;
   title2 'some basic summary statistics';
   class diet;
   var weight;
   table diet, weight*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1) / rts=20;
run;
which gives:

                            weight
              N    Mean   Std   StdErr   95_LCLM   95_UCLM
diet
  High Fat   10    11.5   1.4     0.43      10.5      12.5
  Low Fat    10    10.0   1.3     0.41       9.0      10.9
From the dot plot, we see that there are no obvious outliers in either group.
We notice that the sample standard deviations are about equal in both groups, so the assumption of
equal population standard deviations is likely tenable. There is some, but not a whole lot of, overlap
between the confidence intervals for the individual group means, which would indicate some evidence
that the population means may differ.
3. Find the test-statistic and the p-value, and make a decision. Proc Ttest is used to perform the test of the
hypothesis that the two means are the same:
ods graphics on;
proc ttest data=weight plot=all dist=normal;
   title2 'test of equality of weights between the two diets';
   class diet;
   var weight;
   ods output ttests     = TtestTest;
   ods output ConfLimits = TtestCL;
   ods output Statistics = TtestStat;
run;
ods graphics off;
The output is voluminous, and selected portions are reproduced below:
Variable   Method          Variances   t Value   DF        Pr > |t|
weight     Pooled          Equal       2.58      18        0.0190
weight     Satterthwaite   Unequal     2.58      17.967    0.0190

Variable   diet         Method          Variances   Mean     Lower Limit of Mean   Upper Limit of Mean
weight     Diff (1-2)   Pooled          Equal       1.5400   0.2844                2.7956
weight     Diff (1-2)   Satterthwaite   Unequal     1.5400   0.2843                2.7957

Variable   diet         N    Mean      Std Error   Lower Limit of Mean   Upper Limit of Mean
weight     High Fat     10   11.5000   0.4315      10.5238               12.4762
weight     Low Fat      10   9.9600    0.4134      9.0247                10.8953
weight     Diff (1-2)   _    1.5400    0.5976      0.2844                2.7956

and a final plot of:
As noted in the previous example, there are two variants of the t-test: the equal- and unequal-variance
t-tests. Modern statistical practice is to use the unequal-variance t-test (the one selected above), as it performs
well under all circumstances without worrying if the standard deviations are equal among groups.
The estimated difference (high − low) in the true (unknown) mean weights is 1.54 g (se 0.60 g),
with a 95% confidence interval that doesn't cover 0. [Depending on your package, the signs may be
reversed in the estimates, and the confidence intervals will also be reversed.] There is evidence, then,
that the low fat diet has a lower mean tumor weight than the high fat diet.

The confidence interval provides sufficient information to answer the research question, but a formal
hypothesis test can also be conducted. Notice that the confidence intervals (see the plot) do not cover the
value of 0, indicating evidence of a difference.
The formal test-statistic has the value of 2.577. In order to compute the p-value, the test-statistic will
be compared to a t-distribution with 18 df. [It is not necessary to know how to compute this statistic,
nor the df.]

The two-sided p-value is 0.0190. If the alternate hypothesis were one-sided, i.e. if we were interested
only in whether the high fat diet increased tumor weights (on average) over low fat diets, then the sided=L or sided=U
option on the Proc Ttest statement could be used.

Because the p-value is small, we conclude that there is evidence against the hypothesis that the mean
weights of the tumors from the two diets are equal. Furthermore, there is evidence that the high fat diet
gives tumors with a larger average weight than the low fat diet.
We have not proved that the high fat diet gives heavier (on average) tumors than the low fat diet.
All that we have shown is that, if there were no difference in the means, then the observed data would be very
unusual.
Note that while it is possible to conduct a one-sided test of the hypothesis, these are rarely useful. The
paper:

Hurlbert, S. H. and Lombardi, C. M. (2012).
Lopsided reasoning on lopsided tests and multiple comparisons.
Australian and New Zealand Journal of Statistics, **, ***-****.
http://dx.doi.org/10.1111/j.1467-842X.2012.00652.x

discusses the problems with one-sided tests and recommends that they be rarely used. The basic problem
is: what do you do if you happen to find a result that is in the opposite direction from the alternative
hypothesis? Do you simply ignore this interesting finding? About the only time a one-tailed
test is justified is in situations where you are testing for compliance against a known standard. For
example, in quality control, you want to know if the defect rate is more than an acceptable value.
Another example would be water quality testing, where you want to ensure that the level of a chemical
is below an acceptable maximum value. In all other cases, two-sided tests should be used. For the rest
of this chapter (and the entire set of notes that is available on-line) we only use two-sided tests. Note
that the whole question of one- or two-sided tests is irrelevant once you have more than two treatment
groups, as will be noted later.
SAS also provided information for the equal-variance two-sample t-test in the above output.
In this experiment, the sample standard deviations are approximately equal, so the equal-variance t-test
gives virtually the same results and either could be used. Because the unequal-variance t-test can be used in
both circumstances, it is the recommended test to perform for a two-sample CRD experiment.
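Because the full raw data for this example are listed above, the unequal-variance analysis can be replicated outside SAS as a check; a sketch in Python using scipy (an illustration, not part of the notes):

```python
# Sketch: replicate the unequal-variance (Welch/Satterthwaite) t-test
# for the fat/tumor example using the raw data listed above.
from scipy import stats

low_fat  = [12.2, 9.7, 9.2, 8.2, 11.2, 9.5, 8.4, 9.3, 11.1, 10.8]
high_fat = [12.3, 10.2, 11.8, 11.7, 11.1, 14.6, 11.9, 9.8, 11.3, 10.3]

# equal_var=False requests the Welch test, matching the
# "Satterthwaite Unequal" line of the SAS output (t=2.58, p=0.0190)
t, p = stats.ttest_ind(high_fat, low_fat, equal_var=False)
diff = sum(high_fat) / 10 - sum(low_fat) / 10   # 11.50 - 9.96 = 1.54
print(f"diff={diff:.2f} g, t={t:.2f}, p={p:.4f}")
```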
5.7 Example - Growth hormone and mean final weight of cattle - two-sample t-test

Does feeding growth hormone increase the final weight of cattle prior to market?
Cattle were randomized to one of two groups - either a placebo or the group that received injections of
the hormone. In this experiment, the sample sizes were not equal (there was a reason that is not important
for this example).
Here are the raw data:
Hormone Placebo (Control)
1784 2055
1757 2028
1737 1691
1926 1880
2054 1763
1891 1613
1794 1796
1745 1562
1831 1869
1802 1796
1876
1970
The data are available in the hormone.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS and then
stacked so that one column is the treatment variable and one column is the response variable:
data hormone;
   infile 'hormone.csv' dlm=',' dsd missover firstobs=2;
   input hormone control;
   /* now convert to the usual structure for analysis - stack the variables */
   trt = 'hormone'; weight = hormone; output;
   trt = 'control'; weight = control; output;
   keep trt weight;
run;
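The same wide-to-long stacking step can be sketched with pandas (a hypothetical equivalent of the SAS data step above, using the raw data listed earlier; the two missing placebo values become NaN and are dropped):

```python
# Sketch: stack a two-column (wide) layout into the one-factor,
# one-response (long) layout that the analysis expects, mirroring the
# SAS data step above.  Values are the raw data listed in the text.
import pandas as pd

wide = pd.DataFrame({
    "hormone": [1784, 1757, 1737, 1926, 2054, 1891,
                1794, 1745, 1831, 1802, 1876, 1970],
    "control": [2055, 2028, 1691, 1880, 1763, 1613,
                1796, 1562, 1869, 1796, None, None],  # 2 missing values
})

long = (wide.melt(var_name="trt", value_name="weight")
            .dropna()                    # drop the missing controls
            .reset_index(drop=True))
print(long["trt"].value_counts())        # 12 hormone, 10 control rows
```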
Part of the raw data are shown below:
Obs trt weight
1 hormone 1784
2 control 2055
3 hormone 1757
4 control 2028
5 hormone 1737
6 control 1691
7 hormone 1926
8 control 1880
9 hormone 2054
10 control 1763
Does this experiment satisfy the criteria for a single factor CRD? What is the factor? What are the
levels? What are the treatments? How were treatments assigned to the experimental units? Are the
experimental units the same as the observational units? Was there some grouping of experimental units that we
should be aware of (e.g. pairs of animals kept in pens)?
Let:

1. μ_H and μ_C represent the true mean weight of cattle receiving hormone or placebo (control) injections.

2. Ȳ_H and Ȳ_C, etc. represent the sample statistics.
1. Formulate the hypotheses:
Our hypotheses are:

H: μ_C = μ_H, or μ_C − μ_H = 0
A: μ_C ≠ μ_H, or μ_C − μ_H ≠ 0

As in a previous example, we have formulated the alternate hypothesis in terms of a two-sided
alternative; it is possible (but not part of this course) to express the alternate hypothesis as a one-sided
alternative, i.e. interest may lie only in cases where the weight has increased (on average) after
injection of the hormone.
2. Collect data and look at summary statistics.
Notice that the data format is different from the previous examples. The raw data file has two columns,
one corresponding to the Hormone group and the second corresponding to the Placebo group.
We first notice that there are two missing values for the Placebo group. SAS uses a period (.) to
represent missing values. Whenever data are missing, it is important to consider why the data are missing.
If the data are Missing Completely at Random, then the missingness is completely unrelated to the
treatment or the response, and there is usually no problem in ignoring the missing values. All that
happens is that the precision of estimates is reduced and the power of your experiment is also reduced.
If data are Missing at Random, then the missingness may be related to the treatment, but not the
response, i.e. for some reason, only animals in one group are missing, but within the group the
missingness occurs at random. This is usually again not a problem.
If data are not missing at random, you may have a serious problem on your hands.⁹ In such cases, seek
experienced help; it is a difficult problem.
We start by using Proc SGplot to create side-by-side dot plots and box plots:
proc sgplot data=hormone;
   title2 'Plot of weight vs. trt';
   scatter x=trt y=weight;
   xaxis offsetmin=.05 offsetmax=.05;
run;
which gives
And then:
proc sgplot data=hormone;
   title2 'Box plots';
   /* the notches option creates an overlap region to compare if the medians are equal */
   vbox weight / group=trt notches;
run;
which gives
⁹ An interesting case of data not missing at random occurs if you look at the length of hospital stays after car accidents for people
wearing or not wearing seat belts. It is quite evident that people who wear seat belts spend more time, on average, in hospital than
people who do not wear seat belts.
Proc Tabulate is used to construct a table of means and standard deviations:
proc tabulate data=hormone;
   title2 'some basic summary statistics';
   class trt;
   var weight;
   table trt, weight*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1) / rts=20;
run;
which gives:
               weight
trt        N    Mean    Std   StdErr  95_LCLM  95_UCLM
control    10   1805   160.8   50.86   1690.3   1920.3
hormone    12   1847    98.5   28.43   1784.7   1909.8
The sample standard deviations appear to be quite different. This does NOT cause a problem in the
two-sample CRD case, as the unequal-variance t-test performs well in these circumstances. A formal
statistical test for the equality of population standard deviations could be performed, but my
recommendation is that unless the ratio of the sample standard deviations is more than 5:1, the
equal-variance t-test also performs reasonably well. Formal tests for equality of standard deviations
have very poor performance characteristics, i.e. poor power and poor robustness against failure of
the underlying assumptions.
The confidence intervals for the respective population means appear to have considerable overlap, so
it would be surprising if a statistically significant difference were to be detected.
3. Find the test-statistic, the p-value and make a decision.
Proc Ttest is used to perform the test of the hypothesis that the two means are the same:
ods graphics on;
proc ttest data=hormone plot=all dist=normal;
   title2 'test of equality of weights between the two trts';
   class trt;
   var weight;
   ods output ttests     = TtestTest;
   ods output ConfLimits = TtestCL;
   ods output Statistics = TtestStat;
run;
ods graphics off;
The output is voluminous, and selected portions are reproduced below:
Variable  Method         Variances  t Value  DF      Pr > |t|
weight    Pooled         Equal      -0.75    20      0.4608
weight    Satterthwaite  Unequal    -0.72    14.355  0.4831
Variable  trt         Method         Variances  Mean      Lower Limit  Upper Limit
                                                          of Mean      of Mean
weight    Diff (1-2)  Pooled         Equal      -41.9500  -158.3        74.4078
weight    Diff (1-2)  Satterthwaite  Unequal    -41.9500  -166.6        82.7209

Variable  trt         N   Mean      Std Error  Lower Limit  Upper Limit
                                               of Mean      of Mean
weight    control     10  1805.3    50.8575    1690.3       1920.3
weight    hormone     12  1847.3    28.4256    1784.7       1909.8
weight    Diff (1-2)  _   -41.9500  55.7813    -158.3        74.4078
and a final plot of:
The estimated difference in mean weight between the two groups is −41.95 (se 58) lbs. Here the
estimate of the difference in the means is negative, indicating that the hormone group had a larger
sample mean than the control group. But the confidence interval contains zero, so there is no evidence
that the means are unequal. The confidence interval provides all the information we need to make a
decision, but a formal hypothesis test can still be computed. Notice that the confidence intervals (see
the plot) cover the value of 0, indicating no evidence of a difference.
The test statistic is −0.72, to be compared to a t-distribution with 14.4 df, but in this age of computers
these values don't have much use. The two-sided p-value is 0.48. The one-sided p-value could be
computed using the sided=L or sided=U option on the Proc Ttest statement, but is not of interest
in this experiment. Because the p-value is large, there is no evidence (based on our small experiment)
against our hypothesis of no difference.
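For completeness, the Satterthwaite (unequal-variance) statistic reported above can be reproduced directly from the raw data. This is a sketch in Python rather than SAS, using only the standard library; the function name is mine, not from the notes.

```python
import math

# Raw weights from the table at the start of this example
hormone = [1784, 1757, 1737, 1926, 2054, 1891, 1794, 1745, 1831, 1802, 1876, 1970]
control = [2055, 2028, 1691, 1880, 1763, 1613, 1796, 1562, 1869, 1796]

def welch_t(a, b):
    """Unequal-variance (Satterthwaite) t statistic and approximate df."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sea2, seb2 = va / na, vb / nb                   # squared standard errors
    t = (ma - mb) / math.sqrt(sea2 + seb2)
    df = (sea2 + seb2) ** 2 / (sea2 ** 2 / (na - 1) + seb2 ** 2 / (nb - 1))
    return t, df

t, df = welch_t(control, hormone)   # control - hormone, as in SAS's Diff (1-2)
print(round(t, 2), round(df, 3))    # prints -0.72 14.355, matching the SAS output
```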
In this experiment, the standard deviations are not as similar as in previous examples. In this case, the
equal-variance t-test gives slightly different answers, but the overall conclusion is identical. Either test could
be used, but modern practice is to always use the unequal-variance t-test shown earlier. SAS also provided
information for the equal-variance two-sample t-test in the above output.
5.8 Power and sample size
Would you do an experiment that had less than a 50% chance of succeeding? Yet many researchers
embark upon an experimental plan using inadequate sample sizes to detect important, biologically meaningful
results.
The statistical analysis of any experimental data usually involves a test of some (null) hypothesis that is
central to the investigation. For example, is there a difference in the mean final weight of cattle between a
control group that receives a placebo and a treatment group that is injected with growth hormone?
Because experimental data are invariably subject to random error, there is always some uncertainty
about any decision about the null hypothesis on the basis of a statistical test. There are two reasons for this
uncertainty. First, there is the possibility that the data might, by chance, be so unusual that we believe
we have evidence against the null hypothesis even though it is true. For example, there may be no effect of
the growth hormone, but the evidence against the hypothesis occurs by chance. This is a Type I error and
is controlled by the α level of the test. In other words, if a statistical test is performed and the hypothesis
will only be doubted if the observed p-value is less than 0.05 (the α level), then the researcher is willing to
accept a 5% chance that this statistically significant result is an error (a false positive result).
The other source of uncertainty is often not even considered. It is the possibility that, given the available
data, we may fail to find evidence against the null hypothesis (a false negative result). For example, the
growth hormone may give a real increase in the mean weight, but the variability of the data is so large that
we fail to detect this change. This is a Type II error and is controlled by the sample size.
Related to the Type II error rate is the power of a statistical test. The power of a statistical test is the
probability that, when the null hypothesis is false, the test will find sufficient evidence against the null
hypothesis. A powerful test is one that has a high success rate in detecting even small departures from the
null hypothesis. In general, the power of a test depends on the adopted level of significance, the inherent
variability of the data, the degree to which the true state of nature departs from the null hypothesis, and the
sample size. Computation of this probability for one or more combinations of these factors is referred to as
a power analysis.
Considerations of power are important at two stages of an experiment.
First, at the design stage, it seems silly to waste time and effort on an experiment that doesn't have a
fairly good chance of detecting a difference that is important to detect. Hence, a power analysis is performed
to give the researcher some indication of the likely sample sizes needed to be relatively certain of detecting
a difference that is important to the research hypothesis.
Second, after the analysis is finished, it often occurs that you failed to find sufficient evidence against the
null hypothesis. Although a retrospective power analysis is fraught with numerous conceptual difficulties,
it is often helpful to try and figure out why things weren't detected. For example, if a retrospective power
analysis showed that the experiment had reasonably large power to detect small differences, and you failed
to detect a difference, then one has some evidence that the actual effect must be fairly small. However, this
is no substitute for a consideration of power before the experiment is started.
5.8.1 Basic ideas of power analysis
The power of a test is defined as the probability that you will find sufficient evidence against the null
hypothesis when the null hypothesis is false and an effect exists. The power of a test will depend upon the
following:
α level. This is the largest value for the p-value of the test at which you will decide that the evidence
is sufficiently strong to have doubts about the null hypothesis. Usually, most experiments use α = 0.05,
but this is not an absolute standard. The smaller the α level, the more difficult it is to declare that
the evidence is sufficiently strong against the null hypothesis, and hence the lower the power.
Effect size. The effect size is the actual size of the difference that is to be detected. This will depend
upon economic and biological criteria. For example, in the growth hormone example, there is an extra
cost associated with administering the hormone, and hence there is a minimum increase in the mean
weight that will be economically important to detect. It is easier to detect a larger difference, and hence
power increases with the size of the difference to be detected.
THIS IS THE HARDEST DECISION IN CONDUCTING A POWER ANALYSIS. There is no
easy way to decide what effect size is biologically important; it needs to be based on the consequences
of failing to detect an effect, the variability in the data, etc. Many studies use a rough rule of
thumb that a one standard deviation change in the mean is a biologically important difference, but this
has no scientific basis.
Natural variation (noise). All data have variation. If there is a large amount of natural variation in
the response, then it will be more difficult to detect a shift in the mean, and power will decline as
variability increases. When planning a study, some estimate of the natural variation may be obtained
from pilot studies, literature searches, etc. In retrospective power analysis, this is available from the
statistical analysis in the Root Mean Square Error term of the output. The MSE term is the estimate
of the VARIANCE within groups in the experiment, and the estimated standard deviation is simply the
square root of the estimated variance.
Sample size. It is easier to detect differences with larger sample sizes and hence power increases with
sample size.
5.8.2 Prospective Sample Size determination
Before a study is started, interest is almost always on the necessary sample size required to be reasonably
certain of detecting an effect of biological or economic importance.
There are five key elements required to determine the appropriate sample size:
Experimental design. The estimation of power/sample size depends upon the experimental design
in the sense that the computations for a single factor completely randomized design are different than
for a two-factor split-plot design. Fortunately, a good initial approximation to the proper sample
size/power can often be found by using the computations designed for a single-factor completely
randomized design.
α level. The accepted standard is to use α = .05 but this can be changed in certain circumstances.
Changing the α level would require special justification.
biologically important difference. This is hard! Any experiment has some goal to detect some
meaningful change over the current state of ignorance. The size of the meaningful change is often
hard to quantify, but is necessary in order to determine the sample size required. It is not sufficient to
simply state that any difference is important. For example, is a .000002% difference in the means a
scientifically meaningful result? The biologically important difference can be expressed either as an
absolute number (e.g. a difference of 0.2 cm in the means), or as a relative percentage (e.g. a 5%
change in the mean). In the latter case, some indication of the absolute mean is required in order to
convert the relative change to an absolute change (e.g. a 5% change in the mean when the mean is
around 50 cm implies an absolute change of 5% × 50 = 2.5 cm).
variation in individual results. If two animals were exposed to exactly the same experimental treatment,
how variable would their individual results be? Some measure of the standard deviation of
results when repeated on replicate experimental units (e.g. individual animals) is required. This can be
obtained from past studies or from expert opinion on the likely variation to be expected. Note that the
standard ERROR is NOT the correct measure of variability from previous experiments, as this does
NOT measure individual variation.
desired power. While a 50% chance of success seems low, is 70% sufficient? Is 90% sufficient? This
is a bit arbitrary, but a general consensus is that power should be at least 80% before attempting an
experiment; even then, it implies that the researcher is willing to accept a 1/5 chance of not detecting a
biologically important difference! The higher the power desired, the greater the sample size required.
Two common choices for the desired power are an 80% power when testing at α = .05 or a 90% power
when testing at α = .10. These are customary values and have been chosen so that a standardized
assessment of power can proceed.
The biologically important difference and the standard deviation of individual animal results will require
some documentation when preparing a research proposal.
There are a number of ways of determining the necessary sample sizes:
Computational formulae such as presented in Zar (Biostatistical Analysis, Prentice Hall).
Tables that can be found in some books or on the web at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/PDF/Tables.pdf and are attached at the end of this document.
Two sets of tables are attached to this document. The first table is appropriate to a single factor
completely randomized design with only two levels (such data would be analyzed using a two-sample
t-test). The second table is appropriate for a single factor completely randomized design with two or
more levels (such data are often analyzed using a one-way ANOVA).
Computer programs such as in JMP, R, SAS or those available on the web. For example, the Java
applets by Russ Lenth at http://www.cs.uiowa.edu/~rlenth/Power/ provide nice interactive
power computations for a wide variety of experimental designs. Lenth also has good advice on
power computations in general on his web page.
Unfortunately, there is no standard way of expressing the quantities needed to determine sample size,
and so care must be taken whenever a new table, program, or formula is used to be sure that it is being used
correctly. All should give you the same results.
How to increase power This is just a brief note to remind you that power can be increased not only by
increasing the sample size, but also by decreasing the unexplained variation (the value of σ) in the data. This
can often be done by a redesign of an experiment, e.g. by blocking, or by more careful experimentation.
5.8.3 Example of power analysis/sample size determination
When planning a single-factor CRD experiment with two levels, you will need to decide upon the α-level
(usually 0.05), the approximate size of the difference of the means to be detected (µ1 − µ2) (either from
expert opinion or past studies), and some guess at the standard deviation (σ) of units in the population (from
expert opinion or past studies). A very rough guess for a standard deviation can be formed by thinking of
the range of values to be seen in the population and dividing by 4. This rule-of-thumb occurs because many
populations have an approximately normal distribution of the variable of interest, and in a normal distribution,
about 95% of observations are within 2 standard deviations of the mean. Consequently, the approximate
range of observations is about 4 standard deviations.
Suppose that a study is to be conducted to investigate the effects of injecting growth hormone into cattle. A
set of cattle will be randomized either to the control group or to the treatment group. At the end, the increase
in weight will be recorded. Because of the additional costs of the growth hormone, the experimental results
are only meaningful if the increase is at least 50 units. The standard deviation of the individual cattle changes
in weight is around 100 units (i.e. two identical cattle given the same treatment could have differences in
weight gain that are quite variable).
Using tables
The first table is indexed on the left margin by the ratio of the biological effect to the standard deviation, i.e.

   δ = |µ1 − µ2| / σ = 50/100 = .5

Reading across the table at δ = 0.5, in the middle set of columns corresponding to α = .05, and the specific
column labeled 80% power, the required sample size is 64 in EACH treatment group.
Note the effect of decreasing values of δ, i.e. as the biologically important difference becomes smaller,
larger sample sizes are required. The table can be extended to cases where the required sample size is
greater than 100, but such experiments are often impractical to run; expert help should be sought to perhaps
redesign the experiment.
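The table entries can also be approximated without the table. A common normal-approximation formula gives the sample size per group as 2((z_{1−α/2} + z_{power})σ/δ)² where δ is the difference to be detected; the Python sketch below (mine, not part of the notes) slightly understates the exact t-based table answer of 64 per group.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means (a t correction adds 1-2 units)."""
    z = NormalDist().inv_cdf
    za = z(1 - alpha / 2)          # 1.96 for alpha = 0.05
    zb = z(power)                  # 0.8416 for 80% power
    return ceil(2 * ((za + zb) * sigma / delta) ** 2)

print(n_per_group(50, 100))   # 63 by the normal approximation; the t-based table gives 64
```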
Using a package to determine power
The standard deviation chosen is between the two individual standard deviations that we saw in the previous
example; the difference to detect was specified as 50 lbs. The choice of α level (0.05) and the target
power (0.80 = 80%) are traditional choices made to balance the chances of a Type I error (the α level)
and the ability to detect biologically important differences (the power). Another popular choice is to use
α = .10 and aim for a target power of .90. These choices are used to reduce the amount of arguing among
the various participants in a study.
The sample size required to detect a 50 lbs difference in the mean if the standard deviation is 100 is
found as follows.
SAS has several methods to determine power. Proc Power computes power for relatively simple designs
with a single random error (such as the ones in this chapter). SAS also has a stand-alone program for power
analysis.
Refer to http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms for
links to many examples of power analysis in SAS.
For the Hormone example in the previous section, we use Proc Power:
proc power;
   title 'Power analysis for hormone example';
   twosamplemeans
      test=diff      /* indicates that you wish to test for differences in the mean */
      meandiff=50    /* size of difference to be detected */
      stddev=100     /* the standard deviation within each group */
      power=.80      /* target power of 80% */
      alpha=.05      /* alpha level for the test */
      sides=2        /* a two sided test for difference in the mean should be done */
      ntotal=.       /* solve for the total sample size assuming equal sample sizes in both groups */
   ;                 /* end of the twosamplemeans statement - DON'T FORGET THIS */
   ods output Output=Power10;
run;
This gives the output:

Alpha  MeanDiff  StdDev  Sides  NullDiff  NominalPower  ActualPower  NTotal
0.05   50        100     2      0         0.8           0.801        128
So almost 130 animals (i.e. 65 in each group) would be needed! Depending on the power program used,
the results may give the sample size for EACH group, or the TOTAL sample over both groups. So a reported
sample size of 128 in TOTAL or 64 PER GROUP are equivalent.
Most power packages assume that you want equal sample sizes in both groups. You can show mathematically
that this maximizes the power to detect effects. There are many resources available on the web and for purchase
that allow you the flexibility of having unequal sample sizes. For example, the power and sample size pages
available from Russ Lenth at http://www.stat.uiowa.edu/~rlenth/Power/index.html
are very flexible in specifying the sample sizes in each group.
It is often of interest to plot the power as a function of the sample size or effect size, or in general to plot
how two of the four variables in a power analysis trade off.
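Before turning to Proc Power, the shape of this trade-off can be previewed with the same normal approximation used above. The Python sketch below is mine, not from the notes; its values run slightly above the t-based powers SAS reports further down, by roughly 0.01.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(meandiff, stddev, ntotal, alpha=0.05):
    """Normal-approximation power for a two-sided, two-sample comparison
    with ntotal subjects split equally between the groups."""
    Phi = NormalDist().cdf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = stddev * sqrt(2 / (ntotal / 2))     # se of the difference in means
    shift = meandiff / se
    return Phi(shift - z) + Phi(-shift - z)

# Power rises with total sample size for a fixed difference of 50 and sd of 100
for n in (50, 100, 150, 200):
    print(n, round(approx_power(50, 100, n), 3))
```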
We can generate tables and plots to show how power varies over different effect sizes and sample sizes.
For example, here I use Proc Power to investigate power for a range of differences in the population means:
ods graphics on;
proc power;
   title 'Power analysis for hormone example with various sized differences';
   /* We vary the size of the difference to see what sample size is needed */
   twosamplemeans
      test=diff                /* indicates that you wish to test for differences in the mean */
      meandiff=30 to 150 by 10 /* size of difference to be detected */
      stddev=100               /* the standard deviation within each group */
      power=.80                /* target power of 80% */
      alpha=.05                /* alpha level for the test */
      sides=2                  /* a two sided test for difference in the mean should be done */
      ntotal=.                 /* solve for the total sample size assuming equal sample sizes in both groups */
   ;                           /* end of the twosamplemeans statement - DON'T FORGET THIS */
   plot x=effect xopts=(ref=50 crossref=yes); /* plot the sample size vs effect size and draw ref lines at effect=50 */
   ods output output=power20;
run;
ods graphics off;
This gives the output:
Obs Alpha MeanDiff StdDev Sides NullDiff NominalPower Power NTotal
1 0.05 30 100 2 0 0.8 0.801 352
2 0.05 40 100 2 0 0.8 0.804 200
3 0.05 50 100 2 0 0.8 0.801 128
4 0.05 60 100 2 0 0.8 0.804 90
5 0.05 70 100 2 0 0.8 0.812 68
6 0.05 80 100 2 0 0.8 0.807 52
7 0.05 90 100 2 0 0.8 0.812 42
8 0.05 100 100 2 0 0.8 0.807 34
9 0.05 110 100 2 0 0.8 0.828 30
10 0.05 120 100 2 0 0.8 0.802 24
11 0.05 130 100 2 0 0.8 0.826 22
12 0.05 140 100 2 0 0.8 0.841 20
13 0.05 150 100 2 0 0.8 0.848 18
which can be plotted:
Now we use Proc Power to investigate power for a range of different sample sizes:
ods graphics on;
proc power;
   title 'Power analysis for hormone example with various sample sizes';
   /* We vary the total sample size to see what power is obtained */
   twosamplemeans
      test=diff              /* indicates that you wish to test for differences in the mean */
      meandiff=50            /* size of difference to be detected */
      stddev=100             /* the standard deviation within each group */
      power=.                /* solve for power */
      alpha=.05              /* alpha level for the test */
      sides=2                /* a two sided test for difference in the mean should be done */
      ntotal=50 to 200 by 10 /* total sample size assuming equal sample sizes in both groups */
   ;                         /* end of the twosamplemeans statement - DON'T FORGET THIS */
   plot x=n yopts=(ref=.80 crossref=yes); /* plot the power as a function of sample size and draw ref line at 80% power */
   ods output output=power30;
run;
ods graphics off;
This gives the output:
Obs Alpha MeanDiff StdDev Sides NTotal NullDiff Power
1 0.05 50 100 2 50 0 0.410
2 0.05 50 100 2 60 0 0.478
3 0.05 50 100 2 70 0 0.541
4 0.05 50 100 2 80 0 0.598
5 0.05 50 100 2 90 0 0.650
6 0.05 50 100 2 100 0 0.697
7 0.05 50 100 2 110 0 0.738
8 0.05 50 100 2 120 0 0.775
9 0.05 50 100 2 130 0 0.808
10 0.05 50 100 2 140 0 0.836
11 0.05 50 100 2 150 0 0.860
12 0.05 50 100 2 160 0 0.882
13 0.05 50 100 2 170 0 0.900
14 0.05 50 100 2 180 0 0.916
15 0.05 50 100 2 190 0 0.929
16 0.05 50 100 2 200 0 0.940
which can be plotted:
5.8.4 Further Readings on Power analysis
The following papers have a good discussion of the role of power analysis in wildlife research.
Steidl, R. J., Hayes, J. P., and Shauber, E. (1997).
Statistical power analysis in wildlife research.
Journal of Wildlife Management 61, 270-279.
Available at: http://dx.doi.org/10.2307/3802582
1. What are the four interrelated components of statistical hypothesis testing?
2. What is the difference between biological and statistical signicance?
3. What are the advantages of a paired (blocked) design over that of a completely randomized
design? What implications does this have for power analysis?
4. What is the most serious problem with retrospective power analyses?
5. Under what conditions could a retrospective power analysis be useful?
6. What are the advantages of confidence intervals?
7. What are the consequences of Type I and Type II errors?
Nemec, A.F.L. (1991).
Power Analysis Handbook for the Design and Analysis of Forestry Trials.
Biometrics Information Handout 02.
available at: http://www.for.gov.bc.ca/hfd/pubs/Docs/Bio/Bio02.htm.
Peterman, R. M. (1990).
Statistical power analysis can improve fisheries research and management.
Canadian Journal of Fisheries and Aquatic Sciences, 47: 1-15.
Available at: http://dx.doi.org/10.1139/f90-001.
The Peterman paper is a bit technical, but has good coverage of the following issues:
1. Why are Type II errors often more of a concern in fisheries management?
2. What four variables affect the power of a test? Be able to explain their intuitive consequences.
3. What is the difference between an a-priori and a-posteriori power analysis?
4. What are the implications of ignoring power in impact studies?
5. What are some of the costs of Type II errors in fisheries management?
6. What are the implications of reversing the burden of proof?
5.8.5 Retrospective Power Analysis
This is, unfortunately, often conducted as a post mortem - the experiment failed to detect anything and you
are trying to salvage anything possible from it.
There are serious limitations to a retrospective power analysis! A discussion of some of these issues
is presented by
Gerard, P., Smith, D.R., and Weerakkody, G. 1998.
Limits of retrospective power analysis.
Journal of Wildlife Management, 62, 801-807.
Available at: http://dx.doi.org/10.2307/3802357.
which is a bit technical and repeats the advice in Steidl, Hayes, and Shauber (1997) discussed in the previous
section.
Their main conclusions are:
Estimates of retrospective power are usually biased (e.g. if you fail to find sufficient evidence against
the null hypothesis, the calculated retrospective power using the usual power formulae can never
exceed 50%) and are usually very imprecise. This is not to say that the actual power must always be
less than 50%; rather, the usual prospective power/sample size formulae are not appropriate
for estimating the retrospective power and give incorrect estimates. Some packages have tried to
implement the corrected formulae for retrospective power, but you have to be sure to select the
proper options.
The proper role of power analysis is in research planning. It is sensible to use the results of a current
study (e.g. estimates of variability and standard deviations) for help in planning future studies, but
be aware that typically estimates of variation are very imprecise. Use a range of standard deviation
estimates when examining prospective power.
A confidence interval on the final effect size will tell you much more than a retrospective power
analysis. It indicates where the estimate lies relative to biologically significant effects, and its width
gives an indication of the precision of the estimate.
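As an illustration with the hormone example, the Satterthwaite confidence interval reported earlier can be recovered from the estimate and its standard error. The sketch below is Python, not SAS; the critical value 2.14 (95%, roughly 14.4 df) is read from a t table, since the Python standard library has no t quantile function.

```python
# Effect estimate and its (Satterthwaite) standard error from the SAS output
est, se = -41.95, 58.26
tcrit = 2.14                              # t(0.975, df ~ 14.4), from a t table
lower, upper = est - tcrit * se, est + tcrit * se
print(round(lower, 1), round(upper, 1))   # close to SAS's -166.6 and 82.7
```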
5.8.6 Summary
As can be seen from the past examples, the actual determination of the sample size required to detect
biologically important differences can be relatively painless.
However, the hard part of the process lies in determining the size of a biologically important difference.
This requires a good knowledge of the system being studied and of past work. A statement that any difference
is important really is not that meaningful, because a simple retort of "Is a difference of .0000000001%
biologically and/or scientifically meaningful?" exposes the fallacy of believing that any effect size is
relevant.
Similarly, determining the variation in response among experimental units exposed to the same experimental
treatment is also difficult. Often past studies can provide useful information. In some cases, expert
opinion can be sought, and questions such as "what are some typical values that you would expect to see over
replicated experimental units exposed to the same treatment?" will provide enough information to get started.
It should be kept in mind that because the biologically meaningful difference and the variation over
replicated experimental units are NOT known with absolute certainty, sample sizes are only approximations.
Don't get hung up on whether the proper sample size is 30 or 40 or 35. Rather, the point of the exercise is
to know if the sample size required is 30, 300 or 3000! If the required sample size is in the 3000 area and
there are only sufficient resources to use a sample size of 30, why bother doing the experiment? It has a
high probability of failure.
The above are simple examples of determining sample size in simple experiments that look at changes in
means. Often the computations will be sufficient for planning purposes. However, in more complex designs,
the sample size computations are more difficult and expert advice should be sought.
Similarly, sample size/power computations can be done for other types of parameters, e.g. proportions
live/dead, LD50s, survival rates from capture-recapture studies, etc. Expert help should be sought in these
cases.
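For the simplest case - comparing two means with a two-sided test - the normal-approximation formula n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2 per group is enough to get the order of magnitude. The following Python sketch illustrates the computation; the values chosen for the standard deviation sigma and the biologically important difference delta are purely hypothetical:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group to detect a difference `delta` in two means
    when the within-group standard deviation is `sigma` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical planning values: sigma = 12, delta = 10
print(n_per_group(delta=10, sigma=12))   # about 23 per group
print(n_per_group(delta=5, sigma=12))    # about 91 per group
```

Note how halving the detectable difference roughly quadruples the required sample size - which is why the 30 vs. 300 vs. 3000 distinction matters far more than 30 vs. 35.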
c 2012 Carl James Schwarz 321 December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
5.9 ANOVA approach - Introduction
ANOVA is a generalization of the two-sample t-test (assuming equal population standard deviations) to the
case of two or more populations. [It turns out, as you saw in the earlier example, that an ANOVA on two
groups under a CRD provides the same information (p-values and confidence intervals) as the two-sample
t-test assuming equal population standard deviations.] The formal name for this procedure is Single Factor
- Completely Randomized Design - Analysis of Variance.
While the name ANOVA conjures up analyzing variances, the technique is a test of the equality of population
means through the comparison of sample variations.
The ANOVA method is one of the most powerful and general techniques for the analysis of data. It can
be used in a variety of experimental situations. We are only going to look at it applied to a few experimental
designs. It is extremely important that you understand the experimental design before applying the
appropriate ANOVA technique. The most common problem that we see as statistical consultants is the
inappropriate use of a particular analysis method for the experiment at hand because of failure to recognize
the experimental setup.
The Single-Factor Completely Randomized Design (CRD)-ANOVA is also often called the one-way
ANOVA. This is the generalization of the two independent samples experiment that we saw previously.
Data can be collected in one of two ways:
1. Independent surveys are taken from two or more populations. Each survey must follow the RRR
outlined earlier and should be a simple random sample from the corresponding population. For exam-
ple, a survey could be conducted to compare the average household incomes among the provinces of
Canada. A separate survey would be conducted in each province to select households.
2. A set of experimental units is randomized to one of the treatments in the experiment. Each experi-
mental unit receives one and only one experimental treatment. For example, an experiment could be
conducted to compare the mean yields of several varieties of wheat. The field plots (experimental
units) are randomly assigned to one of the varieties of wheat.
Here are some examples of experiments or surveys which should NOT be analyzed using the Single-
Factor-CRD-ANOVA methods:
Animals are given a drug and measured at several time points in the future. In this experiment, each
animal is measured more than once which violates the assumption of a simple CRD. This experiment
should be analyzed using a Repeated-Measures-ANOVA.
Large plots of land are prepared using different fertilizers. Each large plot is divided into smaller plots
which receive different varieties of wheat. In this experiment, there are two sizes of experimental units
- large plots receiving fertilizers and smaller plots receiving variety. This violates the assumption of
a CRD that there is only one size of experimental unit. This experiment should be analyzed using a
Split-Plot-ANOVA (which is discussed in a later chapter).
Honey bee colonies are arranged on pallets, three per pallet. Interest lies in comparing methods
of killing bee mites. Three methods are used, and each pallet receives all three methods. In this
experiment, there was not complete randomization because each pallet has to receive all three treat-
ments which violates one of the assumptions of a CRD. This experiment should be analyzed using a
Randomized-Block-ANOVA (which is discussed in a later chapter).
Three different types of honey bees (hygienic, non-hygienic, or a cross) are to be compared for sweet-
ness of the honey. Five hives of each type are sampled and two samples are taken from each hive. In
this experiment, two sub-samples are taken from each hive. This violates the assumption of a CRD
that a single observation is taken from each experimental unit. This experiment should be analyzed
using a Sub-sampling ANOVA (which is discussed in a later chapter).
The key point is that there are many thousands of experimental designs. Every design can be analyzed
using a particular ANOVA model designed for that experimental design. One of the jobs of a statistician is
to be able to recognize these various experimental designs and to help clients analyze the experiments using
appropriate methods.
5.9.1 An intuitive explanation for the ANOVA method
Consider the following two experiments to examine the yields of three different varieties of wheat.
In both experiments, nine plots of land were randomized to three different varieties (three plots for each
variety) and the yield was measured at the end of the season.
The two experiments are being used just to illustrate how ANOVA works and to compare two
possible outcomes where, in one case, you find evidence of a difference in the population means and,
in the other case, you fail to find evidence of a difference in the population means. In an actual
experiment, you would only do a single experiment. These are not real data - they were designed to
show you how the method works!
Here are the raw data:
         Experiment I           Experiment II
           Method                  Method
         A    B    C            A    B    C
         -------------          -------------
         65   84   75           80   100  60
         66   85   76           65   85   75
         64   86   74           50   70   90
         -------------          -------------
Average  65   85   75           65   85   75
The data are available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Which experiment has better evidence of a difference in the true mean yield among the varieties?
Let's look at dot plots for both experiments:
It seems that in Experiment I, it is easy to tell differences among the means of the three levels (a, b, or
c) of the factor (variety) because the results are so consistent. In Experiment II, it is not so easy to tell the
differences among the means of the three levels (a, b, or c) because the results are less consistent.
In fact, what people generally look at is the variability within each group as compared to the variability
among the group means to ascertain if there is evidence of a difference in the group population means.
In Experiment I, the variability among the group means is much larger than the variability of individual
observations within each single group. In Experiment II, the variability among the group means is not very
different from the variability of individual observations within each single group.
This is the basic idea behind the Analysis of Variance (often abbreviated as ANOVA). The technique
examines the data for evidence of differences in the corresponding population means by looking at the ratio
of the variation among the group sample means to the variation of individual data points within the
groups. If this ratio is large, there is evidence against the hypothesis of equal group population means.
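This among-versus-within comparison can be made concrete with a few lines of code. The following Python sketch (purely illustrative - the course's own analyses use JMP, R, or SAS) computes the standard deviation among the three group means and the typical within-group standard deviation for each experiment:

```python
from statistics import mean, stdev

# Toy data from the two experiments above
exp1 = {"A": [65, 66, 64], "B": [84, 85, 86], "C": [75, 76, 74]}
exp2 = {"A": [80, 65, 50], "B": [100, 85, 70], "C": [60, 75, 90]}

def signal_vs_noise(groups):
    """Return (variability among group means, typical variability within groups)."""
    group_means = [mean(g) for g in groups.values()]
    among = stdev(group_means)                          # the "signal"
    within = mean(stdev(g) for g in groups.values())    # the "noise"
    return among, within

print(signal_vs_noise(exp1))  # (10.0, 1.0): means vary far more than the noise
print(signal_vs_noise(exp2))  # (10.0, 15.0): means vary less than the noise
```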
This ratio (called the F-ratio) can be thought of as a signal-to-noise ratio. Large ratios imply the signal
(difference among the means) is large relative to the noise (variation within groups) and so there is evidence
of a difference in the means. Small ratios imply the signal (difference among the means) is small relative to
the noise (variation within groups) and so there is no evidence that the means differ.
Let's look at those two experiments in more detail and apply an analysis.
1. Formulate the hypothesis:
The null and alternate hypotheses are:

H: μ1 = μ2 = μ3, or "all means are equal"

A: not all the means are equal, or "at least one mean is different from the rest"

This is a generalization of the two-sample t-test hypothesis to the case of two or more groups. Note
that the null hypothesis is that all of the population means are equal while the alternate hypothesis is
very vague - at least one of the means is different from the others, but we don't know which one.
The following specifications for the alternate hypothesis are NOT VALID specifications:

A: μ1 ≠ μ2 ≠ μ3. This implies that every mean is unequal to every other mean. It may turn out
that the first two means are equal but the third unequal to the first two.

A: every mean is different. Same as above.

The concept of a one-sided or two-sided hypothesis does not exist when there are three or more groups,
unlike when there are only two groups.
2. Collect some data and compute summary statistics
Here are the dot plots for both experiments and summary statistics:
This confirms our earlier impression that the variation among the sample means in Experiment I
is much larger than the variation (standard deviation) within each group, but in Experiment II, the
variation among the sample means is about the same magnitude as the variation within each group.
3. Find a test statistic and p-value.
The computations in ANOVA are at best tedious and at worst impossible to do by hand. Virtually
no one computes them by hand anymore, nor should they. As well, don't be tempted to program a
spreadsheet to do the computations yourself - this is a waste of time, and many of the numerical
methods in spreadsheets are not stable and will give wrong results!
There are many statistical packages available at a very reasonable cost (e.g. JMP, R, or SAS) that can
do all of the tedious computations. What statistical packages cannot do is apply the correct model to
your data! It is critical that you understand the experimental design before doing any analysis!
The idea behind the ANOVA is to partition the total variation in the data (why aren't all of the numbers
from your experiment identical?) into various sources. In this case, we will partition total variation
into variation due to different treatments (the varieties) and variation within each group (often called
error for historical reasons).
These are arranged in a standard fashion called the ANOVA table. Here are the two ANOVA tables
from the two experiments.
The actual computations of the quantities in the above tables are not important - let the computers do
the arithmetic. In fact, for more complex experiments, many of the concepts such as sums of squares
are an old-fashioned way to analyze the experiment and better methods (e.g. REML) are used!
The first three columns (entitled Source, DF, and Sum of Squares) are a partitioning of the total variation
(the C Total row) into two components: that due to treatments (varieties), entitled Model, and the within-group
variation, entitled Error.
The DF (degrees of freedom) column measures the amount of information available. There are a total
of 9 observations, and the df for Total is always the total number of observations - 1. The df for Model
is the number of treatments - 1 (in this case 3 - 1 = 2). The df for Error is obtained by subtraction
(in this case 8 - 2 = 6). The df can be fractional in some complex experiments.
The Sum of Squares column (the SS) measures the variation present in the data. The total SS is
partitioned into two sources. In both experiments, the variation among sample means (among the
means for each variety) was the same and so the SS(Model) for both experiments is identical. The
SS(Error) measures the variation of individual values within each group. Notice that the variation
of individual values within groups for Experiment I is much smaller than the variation of individual
values within groups for Experiment II.
The Mean Square column is an intermediate step in finding the test-statistic. Each mean square is the
ratio of the corresponding sum of squares to its df. For example:

MS(Model) = SS(Model) / df(Model)
MS(Error) = SS(Error) / df(Error)
MS(Total) = SS(Total) / df(Total)
Finally, the test-statistic, denoted the F-statistic (named after the famous statistician Fisher), is
computed as:

F = MS(Model) / MS(Error)
This is the signal-to-noise ratio which is used to examine if the data are consistent with the hypothesis
of equal means.
In Experiment I, the F-statistic is 300. This implies that the variation among sample means is much
larger than the variation within groups. In Experiment II, the F-statistic is only 1.333. The variation
among sample means is on the same order of magnitude as the variation within groups.
Unfortunately, there is no simple rule of thumb to decide if the F-ratio is sufficiently large to provide
evidence against the hypothesis. The F-ratio is compared to an F-distribution (which we won't
examine in this course) to find the p-value. The p-value for Experiment I is < .0001 while that for
Experiment II is 0.332.
4. Make a decision
The p-value is interpreted in exactly the same way as in previous chapters, i.e. small p-values are
strong evidence that the data are NOT consistent with the hypothesis and so you have evidence against
the hypothesis.
Once the p-value is determined, the decision is made as before. In Experiment I, the p-value is very
small. Hence, we conclude that there is evidence against all population means being equal. In Experiment
II, the p-value is large. We conclude there is no evidence that the population means differ.
If the hypothesis is in doubt, you still don't know where the differences in the population means could
have occurred. All that you know (at this point) is that not all of the means are equal. You will need to
use multiple comparison procedures (see below) to examine which population means appear to differ
from which other population means.
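The entire partition described in the steps above can be reproduced by hand for the two toy experiments. The following Python sketch (for illustration only; it is not one of the course's SAS or R programs) rebuilds the ANOVA table - Source, DF, SS, MS, and the F-ratio - and recovers the F-statistics of 300 and 1.333 quoted above:

```python
def anova_table(groups):
    """Partition total variation into Model (among means) and Error (within groups)
    for a single-factor CRD, print the ANOVA table, and return the F-ratio."""
    all_obs = [y for g in groups for y in g]
    grand = sum(all_obs) / len(all_obs)
    means = [sum(g) / len(g) for g in groups]
    ss_model = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_model, df_error = len(groups) - 1, len(all_obs) - len(groups)
    ms_model, ms_error = ss_model / df_model, ss_error / df_error
    f = ms_model / ms_error
    print(f"{'Source':8}{'DF':>4}{'SS':>10}{'MS':>10}{'F':>10}")
    print(f"{'Model':8}{df_model:>4}{ss_model:>10.1f}{ms_model:>10.1f}{f:>10.3f}")
    print(f"{'Error':8}{df_error:>4}{ss_error:>10.1f}{ms_error:>10.1f}")
    print(f"{'C Total':8}{df_model + df_error:>4}{ss_model + ss_error:>10.1f}")
    return f

f1 = anova_table([[65, 66, 64], [84, 85, 86], [75, 76, 74]])   # F = 300
f2 = anova_table([[80, 65, 50], [100, 85, 70], [60, 75, 90]])  # F = 1.333
```

Both experiments have the same SS(Model) of 600 (the group means are identical); only the SS(Error) - 6 versus 1350 - differs, which is exactly the signal-to-noise idea described above.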
5.9.2 A modeling approach to ANOVA
[The following description has been fabricated solely for the entertainment and education of the reader. Any
resemblance between the characters described and real individuals is purely coincidental.]
Just before Thanksgiving, Professor R. decided to run a potato-peeling experiment in his class. The nom-
inal purpose was to compare the average speeds with which students could peel potatoes with a specialized
potato peeler vs. a paring knife, and with the peeler held in the dominant hand vs. the potato held in the
dominant hand. In the jargon of experimental design, these three different methods are called treatments.
Here, there were three treatment groups, those using the peeler in the dominant hand (PEELERS), the knife
in dominant hand (KNIFERS), and the potato in dominant hand (ODDBALLS).
Twelve volunteers were selected to peel potatoes. These 12 were randomly divided into three groups
of 4 individuals. Groups were labeled as above. The experimental subjects were then each given a potato in
turn and asked to peel it as fast as they could. The only restrictions were that individuals in Group PEELERS
were to use the potato peeler in their dominant hand, etc.
Why was randomization used?
Times taken to peel each potato were recorded as follows:
                 Replicate
Group        1    2    3    4
PEELERS     44   69   37   38
KNIFERS     42   49   32   37
ODDBALLS    50   58   78  102
These data were analyzed with SAS using the program potato.sas, with output potato.pdf,
available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Do these times demonstrate that the average time taken to peel a potato depends on the tool used and the
hand in which it is held? Obviously, we ought to begin by computing the average for each group:
                 Replicate
Group        1    2    3    4   Mean
PEELERS     44   69   37   38     47
KNIFERS     42   49   32   37     40
ODDBALLS    50   58   78  102     72
The mean for the group holding the potato in the dominant hand is over one and one half times as great
as the mean for each of the other two. This strategy appears to take the longest; the other two strategies
seem more comparable, but with the knife having a slight advantage over the peeler.
In an intuitive sense, some of the variation among the 12 times needed to peel the potatoes comes from
the treatments applied, i.e. the three methods of peeling.
This experimental design leaves open the possibility that the observed differences are attributable to
chance fluctuations - chance fluctuations generated by the random assignment of individuals to groups, and
of potatoes to individuals, etc. This possibility can be assessed by a statistical test of significance. The null
hypothesis is that there are no systematic differences in the population mean time taken to peel potatoes with
the three different methods. The observed differences are then just chance differences.
To perform the statistical test, you must consider what any good scientist would consider. You must
imagine repeating the experiment to see if the results are reproducible. You would not, of course, expect to
obtain identical results. However, if the differences were real, you would expect to see equally convincing
results quite often. If not, then you would expect to see such substantial differences as these only rarely. The
p-value that we are about to calculate will tell us how frequently we ought to expect to see such apparently
strong evidence of differences between the group means when they are solely chance differences.
To proceed, we shall need a model for the variation in the time to peel a potato seen in the data. This
is the key aspect of any statistical analysis - formulating a mathematical model that we hope is a reasonable
approximation to reality. Then we apply the rules of probability and statistics to determine if our model is
a reasonable fit to the data and then, based upon our fitted model, examine hypotheses about the model
parameters which match a real-life hypothesis of interest.
The models are developed by examining the treatment, experimental unit, and restricted randomization
structures in the experiment. In this course, it is always assumed that complete randomization is done as
much as possible and so the effects of restricted randomization are assumed not to exist.
The treatment structure consists of the factors in the experiment and any interactions among them if
there is more than one factor (to be covered in later chapters). The experimental unit structure consists of
variation among identical experimental units within the same treatment group. If there were no difference
in the mean time to peel a potato, then there would be NO treatment effect. It is impossible to get rid of
experimental unit effects as this would imply that different experimental units would behave identically in
all respects.
The standard deviations for the groups are assumed to be identical. [It is possible to relax this assumption,
but this is beyond the scope of this course.]
A highly simplified syntax is often used to specify models for experimental designs. To the left of the
equals sign, the response variable is specified. To the right of the equals sign, the treatment and experimental
units are specified. For this example, the model is

Time = Method PEOPLE(R)

This expression is NOT a mathematical equality - it has NO meaning as a mathematical expression. Rather,
it is interpreted as: the variation in the response variable Time is affected by the treatments (Method) and by the
experimental units (People). The effect of experimental units is random and cannot be predicted in advance
(the (R) term). In general, there will be a random component for each type of experimental unit in the study
- in this case there is only one type of experimental unit, the people peeling the potatoes.
It turns out that most statistical packages and textbooks drop the experimental unit terms UNLESS
THERE IS MORE THAN ONE SIZE OF EXPERIMENTAL UNIT (such as in split-plot designs to be
covered later in this course). Hence, the model is often expressed as:

Time = Method

with the effect of the experimental unit dropped for convenience.
You may have noticed that these models are similar in form to the regression models you have seen in
a previous course. This is not an accident - regression and analysis of variance are all part of a general
method called Linear Models. If you look at the SAS program potato.sas you will see that the MODEL
statement in PROC GLM follows closely the syntax above. If you look at the R program potato.r you will
see that the formula in the aov() function closely follows the syntax above. Both programs are available from
the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
(Note that this model syntax is not completely uniform among packages. For example, in R the equals sign is
replaced by the ~; in JMP (Analyze->Fit Model platform), the response variable is entered in a different area
from the effects.)
The ANOVA methods must partition the total variation in the data into its constituent parts.
First, consider the average response for each treatment. In statistical jargon, these are called the treatment
means, corresponding to the three different treatments applied to the three groups. These can be estimated
by the corresponding group sample means. [This will not always be true - so please don't memorize this
rule.] Similarly, the overall mean can be estimated by the overall mean of the observed results. Here, this is
53. [Again, this will not always be true - it only works here because the design is balanced, i.e. has an equal
number of observations in each treatment group.]
The treatment effects are estimated by the difference between the sample mean for each group and the
overall grand mean.
The estimates of the experimental unit effects are found from the deviations of the individual observations
from their corresponding group sample means. For example, the effect of the first potato in the first treatment
group is found as 44 - 47 = -3, i.e. this experimental unit was peeled faster than the average potato in this
group. You may recognize that these terms look very similar to the residuals from a regression setting. This
is no accident. These residuals are important for assessing model fit and adequacy as will be explored later.
At this point, tedious arithmetic takes place as outlined earlier and will not be covered in this class. The
end product is the ANOVA table where each term in the model is represented by a line in the table.
The total variation (the left of the equals sign) appears at the bottom of the table. The variation attributable
to the treatment structure appears in a separate line in the table. The variation attributable to experimental
unit variation is represented by the Error line in the table. (The term Error here is a historical artifact and
does NOT represent mistakes in the data.) The degrees of freedom column represents a measure of the
information available, the sums of squares column represents a measure of the variation attributable to each
effect, and the columns labeled Mean Square and F-statistic are the intermediate computations used to arrive
at the final p-value.
Most computer packages will compute the various sums of squares automatically and correctly and so
we won't spend too much time on these. Similarly, most computer packages will automatically compute
p-values for the test-statistic and we again won't spend much time on this. It is far more important for you
to get a feeling for the rationale behind the method than to worry about the details of the computations.
The p-value is found to be 0.0525. This implies that if the null hypothesis were true, there is only about a 5%
chance of observing this set of data (or a more extreme set) by chance. Hence, the differences between the
treatment means cannot reasonably be attributed to chance alone.
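The arithmetic sketched above for the potato experiment can be checked directly. This Python fragment (for illustration; the course's own analysis is in potato.sas and potato.r) computes the grand mean, the treatment effects, and the F-ratio; the p-value of 0.0525 quoted above comes from referring this F-ratio to an F distribution with 2 and 9 degrees of freedom:

```python
# Potato-peeling times from the table above
times = {"PEELERS": [44, 69, 37, 38],
         "KNIFERS": [42, 49, 32, 37],
         "ODDBALLS": [50, 58, 78, 102]}
all_times = [t for g in times.values() for t in g]
grand_mean = sum(all_times) / len(all_times)         # 53.0
group_means = {g: sum(v) / len(v) for g, v in times.items()}
effects = {g: m - grand_mean for g, m in group_means.items()}
# Treatment effects: PEELERS -6, KNIFERS -13, ODDBALLS +19
ss_model = sum(len(times[g]) * e ** 2 for g, e in effects.items())
ss_error = sum((t - group_means[g]) ** 2 for g, v in times.items() for t in v)
f_ratio = (ss_model / 2) / (ss_error / 9)            # df: 3 - 1 = 2, 12 - 3 = 9
print(effects)
print(round(f_ratio, 2))                             # 4.16; F(2, 9) gives p = 0.0525
```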
5.10 Example - Comparing phosphorus content - single-factor CRD
ANOVA
A horticulturist is examining differences in the phosphorus content of tree leaves from three varieties.
She randomly selects five trees from each variety within a large orchard, and takes a sample of leaves
from each tree. The phosphorus content is determined for each tree.
Here are the raw data:
Variety
Var-1 Var-2 Var-3
0.35 0.65 0.60
0.40 0.70 0.80
0.58 0.90 0.75
0.50 0.84 0.73
0.47 0.79 0.66
The data are available in the phosphor.csv file in the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data phosphor;
infile 'phosphor.csv' dlm=',' dsd missover firstobs=2;
input phosphor variety $;
run;
Part of the raw data are shown below:
Obs phosphor variety
1 0.35 var1
2 0.40 var1
3 0.58 var1
4 0.50 var1
5 0.47 var1
6 0.65 var2
7 0.70 var2
8 0.90 var2
9 0.84 var2
10 0.79 var2
1. Think about the design aspects. What is the factor? What are the levels? What are the treatments?
Can treatments be randomized to experimental units? If not, how were experimental units selected?
What are the experimental and observational units? Why is only one value obtained for each tree?
Why were five trees of each variety taken - why not just take 5 samples of leaves from one tree? Is the
design a single-factor CRD?
2. Statistical Model. The statistical model must contain effects for the treatment structure, the experi-
mental unit structure, and the randomization structure. As this is a CRD, the last component does not
exist. The treatment consists of a single factor Variety. The experimental units are the Trees. As this
is a CRD, the effect of non-complete randomization does not exist. Hence our statistical model says
that the variation in the response variable (Phosphorus) depends upon the effects of the treatments and
variations among individual trees within each level of the treatment.
Most statistical packages require that you specify only the treatment effects unless there is more than
one size of experimental unit (e.g. a split-plot design to be covered later in the course). They assume that any
leftover variation after accounting for the effects specified must be experimental unit variation. Hence, a
simplified syntax for the model that represents the treatment, experimental, and randomization structure
for this experiment could be written as:
Phosphorus = Variety
which indicates that the variation in phosphorus levels can be attributable to the different varieties
(treatment structure) and any variation left over must be experimental unit variation.
3. Formulate the hypothesis of interest.
We are interested in examining if all three varieties have the same mean phosphorus content.
The hypotheses are:

H: μVar-1 = μVar-2 = μVar-3

A: not all means are equal, i.e., at least one mean is different from the others.
4. Collect data and summarize.
The data must be entered in stacked column format with two columns, one for the factor (the variety)
and one for the response (the phosphor level). Each line must represent an individual subject. Note
that every subject has one measurement: the multiple leaves from a single tree are composited
into one sample and one concentration is found.
This is a common data format for complex experimental designs where each observation is in a dif-
ferent row and the different columns represent different variables.
The dataset contains two variables, one of which is the factor variable and the other the response
variable. You will find it easiest to code factor variables with alphanumeric codes as was done in this
study.
We start using Proc SGplot to create a side-by-side dot-plot to check for outliers:
proc sgplot data=phosphor;
title2 'dot plot of raw data';
scatter x=variety y=phosphor;
run;
Then Proc Tabulate is used to construct a table of means and standard deviations:
proc tabulate data=phosphor;
title2 'summary statistics';
class variety;
var phosphor;
table variety, phosphor*(n*f=5.0 mean*f=6.2 std*f=6.2) / rts=20;
run;
which gives:
                    phosphor
                N     Mean     Std
variety
  var1          5     0.46     0.09
  var2          5     0.78     0.10
  var3          5     0.71     0.08
We note that the sample standard deviations are similar in each group and there do not appear to be
any outliers or unusual data values. The assumption of equal standard deviations in each treatment
group appears to be tenable.
5. Find the test-statistic and compute a p-value.
A single-factor CRD can be analyzed using either Proc GLM or Proc Mixed. Both will give the same
results, and it is a matter of personal preference which is used. We will demonstrate the output from
both procedures.
First, Proc Glm:
ods graphics on;
proc glm data=phosphor plots=all;
title2 'ANOVA using GLM';
class variety;
model phosphor = variety;
lsmeans variety / adjust=tukey pdiff cl stderr lines;
ods output LSmeanDiffCL = GLMdiffs;
ods output LSmeans = GLMLSmeans;
ods output LSmeanCL = GLMLSmeansCL;
ods output LSMlines = GLMlines;
ods output ModelANOVA = GLManova;
run;
ods graphics off;
Proc GLM computes various test-statistics which, in the case of a single-factor CRD, are all the same.
In general, you are interested in the Type III Tests from GLM.
Dependent  HypothesisType  Source   DF  Type III SS  Mean Square  F Value  Pr > F
phosphor   3               variety  2   0.27664000   0.13832000   16.97    0.0003

(I prefer Proc Mixed because it is more general than GLM. Contact me for details about the Type I, II,
III, and IV sums-of-squares and tests.)
Second, Proc Mixed:
ods graphics on;
proc mixed data=phosphor plots=all;
title2 'ANOVA using Mixed';
class variety;
model phosphor=variety;
lsmeans variety / adjust=tukey diff cl;
ods output tests3 =MixedTest;
ods output lsmeans=MixedLsmeans;
ods output diffs =MixedDiffs;
run;
ods graphics off;
which gives:
Effect    Num DF  Den DF  F Value  Pr > F
variety   2       12      16.97    0.0003
In Proc Mixed, the concept of Type I, II, and III tests does not exist, and there is only one table of test
statistics produced. (Indeed, Proc Mixed does away with Sums-of-Squares and all that jazz and follows
a REML procedure - contact me for more details.)
The F-statistic is 16.97. The p-value is 0.0003.
6. Make a decision.
Because the p-value is small, we conclude that there is evidence that not all the population means are
equal. At this point, we still don't know which means may differ from each other, but the summary
statistics give us a good indication of which varieties appear to have means that differ from the rest.
Once again, we have not proved that the means are not all the same. We have only collected good
evidence against them being the same. We may have made a Type I error, but the chances of it are
rather small (this is what the p-value measures).
We start by finding the estimates of the marginal means along with the standard errors and confidence
intervals. These are obtained from the LSmeans statement in both Proc GLM and Proc Mixed, but
beware of the slightly different syntaxes in the two procedures.
The corresponding outputs for Proc GLM are:
c 2012 Carl James Schwarz 336 December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS
(A.K.A. ONE-WAY DESIGN)
Effect   Dependent  variety  phosphor LSMEAN  Standard Error  Pr > |t|  LSMEAN Number
variety  phosphor   var1     0.46000000       0.04037326      <.0001    1
variety  phosphor   var2     0.77600000       0.04037326      <.0001    2
variety  phosphor   var3     0.70800000       0.04037326      <.0001    3
Effect   Dependent  variety  LowerCL   phosphor LSMEAN  UpperCL
variety  phosphor   var1     0.372034  0.460000         0.547966
variety  phosphor   var2     0.688034  0.776000         0.863966
variety  phosphor   var3     0.620034  0.708000         0.795966
and for Proc Mixed are:
Effect   variety  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower   Upper
variety  var1     0.4600    0.04037         12  11.39    <.0001    0.05   0.3720  0.5480
variety  var2     0.7760    0.04037         12  19.22    <.0001    0.05   0.6880  0.8640
variety  var3     0.7080    0.04037         12  17.54    <.0001    0.05   0.6200  0.7960
The standard errors for the estimated means use the pooled estimate of the within-group variation and so are better standard errors. Because the group sizes are all equal, the standard errors are also all equal. For large group sizes, there won't be much of a difference between the standard errors reported here and those reported earlier.
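The confidence limits in these tables are just estimate ± t(0.975, error df) × SE, with 12 error degrees of freedom here. A small Python check of the var1 row (values taken from the listings above; this is an illustration, not part of the SAS code):

```python
from scipy import stats

se = 0.04037326          # pooled SE of each group mean (from the listing)
df_error = 12            # 15 observations minus 3 varieties
mean_var1 = 0.460

half = stats.t.ppf(0.975, df_error) * se      # t multiplier, about 2.179
lower, upper = mean_var1 - half, mean_var1 + half
print(round(lower, 3), round(upper, 3))       # 0.372 0.548, matching SAS
```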
7. If you find sufficient evidence against the null hypothesis, use a multiple comparison procedure, which is discussed in later sections.

If you find sufficient evidence against the null hypothesis, you still don't know which population means appear to be different from the other population means.

In order to investigate this further, you will need to do a multiple comparison procedure. There are thousands of possible multiple comparison procedures, and there is still controversy among statisticians about which (if any) procedure is best, so proceed cautiously. We will discuss some of the problems later in the course.
One common multiple comparison procedure is the Tukey multiple comparison procedure.
The adjust=tukey option in the LSmeans statement is how you request the Tukey multiple comparison procedure.

There are several types of output, all of which tell the same story in various ways. The choice of which method to use is based on your familiarity with the output and the purposes for which it is needed.

In both Proc GLM and Proc Mixed we can request the Tukey multiple comparison procedure using the LSmeans statement with the adjust=tukey option. Note that the syntax is slightly different in the two procedures.

Proc GLM will generate a plot of the individual comparisons and a joined-lines plot; Proc Mixed requires you to use the pdmix800.sas macro to get the joined-lines plot.

%include 'pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
Here is the joined-lines output from GLM:

Effect   Dependent  Line1  variety  phosphor LSMEAN  EqLS1  EqLS2  EqLS3  LSMEAN Number
variety  phosphor   A      var2     0.776            1      1      0      2
variety  phosphor   A      var3     0.708            1      1      0      3
variety  phosphor   B      var1     0.460            0      0      1      1
and the corresponding output from Proc Mixed:
Effect=variety Method=Tukey(P<0.05) Set=1
Obs variety Estimate Standard Error Alpha Lower Upper Letter Group
1 var2 0.7760 0.04037 0.05 0.6880 0.8640 A
2 var3 0.7080 0.04037 0.05 0.6200 0.7960 A
3 var1 0.4600 0.04037 0.05 0.3720 0.5480 B
The sample means are first sorted from largest to smallest. In this case, Variety 2 had the largest sample mean, Variety 3 has the next largest mean, and Variety 1 has the smallest mean. Then, starting with Variety 2, all means that cannot be distinguished from that of Variety 2 are joined to it by the same letter (in this case, the letter A). (The actual letters are not important, only which groups are joined by the same letter.) This indicates that there is no evidence in the data to distinguish the mean of Variety 2 from that of Variety 3. Note that the mean of Variety 2 is NOT joined with that of Variety 1 by any letter. This indicates that there is sufficient evidence to conclude that the mean of Variety 2 may be different from the mean of Variety 1.

Next, look at the second largest mean (that of Variety 3) and repeat the process. In this case, the mean of Variety 3 is again NOT joined with that of Variety 1, indicating that there is evidence that the mean of Variety 3 could differ from that of Variety 1.

Finally, look at the smallest mean (that of Variety 1). No other group has a mean that could be equal to that of Variety 1, and so no other group is joined to it by the letter B.
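The joined-lines display can be reconstructed mechanically: sort the means in descending order, then give one letter to every maximal run of means whose pairwise comparisons are all non-significant. Here is a sketch in Python rather than SAS (the function is illustrative, not part of pdmix800); the adjusted p-values are the ones SAS reports for this example.

```python
def joined_lines(names, means, pvals, alpha=0.05):
    """Assign joined-lines letters: with means sorted descending, each maximal
    run of means that are pairwise NOT significantly different shares a letter."""
    order = sorted(range(len(names)), key=lambda i: -means[i])

    def nonsig(i, j):
        return pvals[frozenset((names[i], names[j]))] > alpha

    runs = []
    for a in range(len(order)):
        b = a
        while b + 1 < len(order) and all(
                nonsig(order[i], order[j])
                for i in range(a, b + 2) for j in range(i + 1, b + 2)):
            b += 1
        runs.append((a, b))
    # keep only maximal runs (drop any run contained in a longer run)
    maximal = [r for r in runs
               if not any(s != r and s[0] <= r[0] and r[1] <= s[1] for s in runs)]
    letters = {names[i]: "" for i in order}
    for letter, (a, b) in zip("ABCDEFGH", maximal):
        for k in range(a, b + 1):
            letters[names[order[k]]] += letter
    return letters

# Tukey-adjusted p-values reported by SAS for the phosphor example
pvals = {frozenset(("var1", "var2")): 0.0004,
         frozenset(("var1", "var3")): 0.0025,
         frozenset(("var2", "var3")): 0.4805}
print(joined_lines(["var1", "var2", "var3"], [0.460, 0.776, 0.708], pvals))
# {'var2': 'A', 'var3': 'A', 'var1': 'B'}
```

This reproduces the letter groups shown above: var2 and var3 share the letter A, and var1 stands alone with the letter B.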
It is also useful to compute all the estimates of the pairwise differences in means along with adjusted confidence intervals for the differences in the means. The set of all pairwise differences in the means can also be produced. In Proc GLM, these consist of two tables, one with the means and one with the estimated differences:
Effect   Dependent  variety  phosphor LSMEAN  Standard Error  Pr > |t|  LSMEAN Number
variety  phosphor   var1     0.46000000       0.04037326      <.0001    1
variety  phosphor   var2     0.77600000       0.04037326      <.0001    2
variety  phosphor   var3     0.70800000       0.04037326      <.0001    3
Effect   Dependent  i  j  LowerCL    Difference Between Means  UpperCL    variety  _variety
variety  phosphor   1  2  -0.468324  -0.316000                 -0.163676  var1     var2
variety  phosphor   1  3  -0.400324  -0.248000                 -0.095676  var1     var3
variety  phosphor   2  3  -0.084324  0.068000                  0.220324   var2     var3
In Proc Mixed, one table is produced:

variety  _variety  Estimate  Standard Error  Adjustment  Adj P   Adj Low   Adj Upp
var1     var2      -0.3160   0.05710         Tukey       0.0004  -0.4683   -0.1637
var1     var3      -0.2480   0.05710         Tukey       0.0025  -0.4003   -0.09568
var2     var3      0.06800   0.05710         Tukey       0.4805  -0.08432  0.2203
The estimated difference in means between Variety 2 and Variety 1 is 0.316 with a 95% confidence interval ranging from (0.16, 0.47), which does NOT include the value of zero. There is evidence that the two means could be unequal. The standard error for the difference in means and the p-value for the hypothesis test of no difference in means are also presented.[17] There is evidence that the means for Variety 1 and Variety 2 differ.

The final line fails to find evidence of a difference in the means between Variety 2 and Variety 3. This is again consistent with the joined-lines plot above.
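In this balanced design, the standard error of a difference between two group means is sqrt(MSE(1/n + 1/n)). A quick Python check of the 0.05710 reported above; the MSE is recovered from the pooled standard error of a single mean (an inference, since the MSE itself is not printed in this excerpt):

```python
import math

n = 5                       # plants per variety
se_mean = 0.04037326        # pooled SE of a single group mean (from the listing)
mse = se_mean**2 * n        # recover the error mean square, since SE = sqrt(MSE/n)

diff = 0.776 - 0.460        # var2 minus var1 sample means
se_diff = math.sqrt(mse * (1/n + 1/n))
print(round(diff, 3), round(se_diff, 4))   # 0.316 0.0571, matching the table
```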
Proc GLM also creates a confidence-difference plot:

[17] This p-value has been adjusted for the multiple comparisons as explained later in this section.
In this plot, follow the light grey lines beside a pair of varieties to where they intersect on either a solid blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence of a difference in the means.
8. If you fail to find sufficient evidence against the null hypothesis, remember that this is not evidence that all the means are equal. It could be that you had a poor experiment with insufficient power to detect anything meaningful.
SAS also provides diagnostic plots to check the residuals. Here are the two plots from Proc GLM and Proc Mixed:

Neither of the diagnostic panels shows any evidence of problems.
5.11 Example - Comparing battery lifetimes - single-factor CRD ANOVA
Is there a difference in battery life by brand? Here are the results of a study conducted in the Schwarz
household during Christmas 1995.
We compare four brands of batteries when used in radio-controlled cars for kids. A selection of batteries of each brand was bought and used in random order. The total time the car functioned before the batteries were exhausted was recorded to the nearest 1/2 hour.
Here are the raw data:
Life time for each brand

Brand1  Brand2  Brand3  Brand4
5.5     7.0     4.0     4.5
5.0     7.5     3.5     4.0
        6.5     4.5
        7.0
The data is available in the battery.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:

data lifetime;
   infile 'battery.csv' dlm=',' dsd missover firstobs=2;
   input brand $ lifetime;
run;
Part of the raw data are shown below:
Obs brand lifetime
1 Brand1 5.5
2 Brand1 5.0
3 Brand2 7.0
4 Brand2 7.5
5 Brand2 6.5
6 Brand2 7.0
7 Brand3 4.0
8 Brand3 3.5
9 Brand3 4.5
10 Brand4 4.5
1. Think about the experimental design. Does this experiment satisfy the conditions for a single-factor CRD ANOVA? This is a single-factor experiment (the factor is brand) with 4 levels (the actual brands of battery). It consists of 4 independent samples from the population of all batteries of each brand. The experimental units (batteries) were used in the vehicle in random order.

2. Statistical Model. What is the model? Interpret the various terms.

3. Formulate the hypotheses.

We are interested in testing if mean lifetime differs by brand.
H: µ(Brand 1) = µ(Brand 2) = µ(Brand 3) = µ(Brand 4)
A: not all the means are equal, i.e., at least one differs from the rest.
4. Collect data and preliminary sample statistics.

The assumption of no outliers and approximately equal population standard deviations in all groups must be checked.

Proc SGplot creates a side-by-side dot-plot:

proc sgplot data=lifetime;
   title2 'dot plot of rawdata';
   scatter x=brand y=lifetime;
run;
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=lifetime;
   title2 'summary statistics';
   class brand;
   var lifetime;
   table brand, lifetime*(n*f=5.0 mean*f=6.2 std*f=6.2) / rts=20;
run;
which gives:

lifetime
         N  Mean  Std
Brand1   2  5.25  0.35
Brand2   4  7.00  0.41
Brand3   3  4.00  0.50
Brand4   2  4.25  0.35
The group standard deviations are roughly equal.
There is no evidence of outliers or any other unusual problems.
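The entries in this summary table are easy to verify from the raw data. A small Python sketch (outside SAS, for illustration) computing each brand's n, mean, and standard deviation:

```python
from statistics import mean, stdev

# Raw battery lifetimes (hours) from the table above
data = {"Brand1": [5.5, 5.0],
        "Brand2": [7.0, 7.5, 6.5, 7.0],
        "Brand3": [4.0, 3.5, 4.5],
        "Brand4": [4.5, 4.0]}

summary = {b: (len(x), round(mean(x), 2), round(stdev(x), 2))
           for b, x in data.items()}
for brand, (n, m, s) in summary.items():
    print(brand, n, m, s)
# The largest and smallest SDs (0.50 vs 0.35) differ by far less than the
# common informal rule-of-thumb ratio of 2, so equal spread is plausible.
```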
5. Compute a test statistic and p-value.
A single-factor CRD can be analyzed using either Proc GLM or Proc Mixed. Both will give the same results, and it is a matter of personal preference which is used.[18] We will demonstrate the output from both procedures.
First, Proc GLM:

ods graphics on;
proc glm data=lifetime plots=all;
   title2 'ANOVA using GLM';
   class brand;
   model lifetime = brand;
   lsmeans brand / adjust=tukey pdiff cl lines stderr;
   ods output LSmeanDiffCL = GLMdiffs;
   ods output LSmeans = GLMLSmeans;
   ods output LSmeansCL = GLMLSmeansCL;
   ods output LSMlines = GLMlines;
   ods output ModelANOVA = GLManova;
run;
ods graphics off;
[18] I prefer Proc Mixed because it is more general than GLM.
Proc GLM computes various test statistics which, in the case of a single-factor CRD, are all the same.
In general, you are interested in the Type III tests from GLM.[19]

Dependent  HypothesisType  Source  DF  Type I SS    Mean Square  F Value  Pr > F
lifetime   3               brand   3   18.79545455  6.26515152   35.08    0.0001
Second, Proc Mixed:

ods graphics on;
proc mixed data=lifetime plots=all;
   title2 'ANOVA using Mixed';
   class brand;
   model lifetime=brand;
   lsmeans brand / adjust=tukey diff cl;
   ods output tests3 =MixedTest;
   ods output lsmeans=MixedLsmeans;
   ods output diffs =MixedDiffs;
run;
ods graphics off;
which gives:

Effect  Num DF  Den DF  F Value  Pr > F
brand   3       7       35.08    0.0001
In Proc Mixed, the concept of Type I, II, and III tests does not exist, and there is only one table of test statistics produced.[20]

The F-statistic is 35.08 and the p-value is 0.0001.
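As a cross-check outside SAS, the same F-test can be reproduced from the raw battery data with scipy's one-way ANOVA routine; this is an illustration, not part of the course code:

```python
from scipy.stats import f_oneway

# Raw battery lifetimes (hours), one list per brand
brand1 = [5.5, 5.0]
brand2 = [7.0, 7.5, 6.5, 7.0]
brand3 = [4.0, 3.5, 4.5]
brand4 = [4.5, 4.0]

F, p = f_oneway(brand1, brand2, brand3, brand4)
print(round(F, 2), round(p, 4))   # 35.08 0.0001, matching the SAS output
```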
6. Make a decision.

Because the p-value is small, we conclude that there is evidence of a difference among the population means, i.e., there is evidence of a difference among the mean lifetimes of the brands of batteries.

[19] Contact me for details about the Type I, II, III, and IV sums-of-squares and tests.
[20] Indeed, Proc Mixed does away with Sums-of-Squares and all that jazz and follows a REML procedure; contact me for more details.
7. Because we doubt the null hypothesis, do a multiple comparison procedure.

In both Proc GLM and Proc Mixed we can request the Tukey multiple comparison procedure using the LSmeans statement with the adjust=tukey option. Note that the syntax is slightly different in the two procedures.

Proc GLM will generate a plot of the individual comparisons and a joined-lines plot; Proc Mixed requires you to use the pdmix800.sas macro to get the joined-lines plot.

%include 'pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
Here is the joined-lines output from GLM:

Effect  Dependent  Line1  brand   lifetime LSMEAN  EqLS1  EqLS2  EqLS3  EqLS4  LSMEAN Number
brand   lifetime   A      Brand2  7.00             1      0      0      0      2
brand   lifetime   B      Brand1  5.25             0      1      1      1      1
brand   lifetime   B      Brand4  4.25             0      1      1      1      4
brand   lifetime   B      Brand3  4.00             0      1      1      1      3
and the corresponding output from Proc Mixed:
Effect=brand Method=Tukey-Kramer(P<0.05) Set=1
Obs brand Estimate Standard Error Alpha Lower Upper Letter Group
1 Brand2 7.0000 0.2113 0.05 6.5004 7.4996 A
2 Brand1 5.2500 0.2988 0.05 4.5434 5.9566 B
3 Brand4 4.2500 0.2988 0.05 3.5434 4.9566 B
4 Brand3 4.0000 0.2440 0.05 3.4231 4.5769 B
In the joined-lines plot, the sample means are first sorted from largest to smallest. In this case, Brand 2 had the largest sample mean, Brand 1 has the next largest mean, etc. Then, starting with Brand 2, all means that cannot be distinguished from that of Brand 2 are joined to it by the same letter (in this case, the letter A). (The actual letters are not important, only which groups are joined by the same letter.) Because no other brands are joined by the letter A with Brand 2, this indicates that there is evidence that the mean of Brand 2 is different from the means of all the other brands.

Next, look at the second largest mean (that of Brand 1) and repeat the process. In this case, the mean of Brand 1 is joined by the letter B with the means of Brand 4 and Brand 3. This indicates that there is no evidence that the means of these three brands are unequal.
The set of all pairwise differences in the means can also be produced. In Proc GLM, these consist of two tables, one with the means and one with the estimated differences:

Effect  Dependent  brand   lifetime LSMEAN  LSMEAN Number
brand   lifetime   Brand1  5.25000000       1
brand   lifetime   Brand2  7.00000000       2
brand   lifetime   Brand3  4.00000000       3
brand   lifetime   Brand4  4.25000000       4
Effect  Dependent  i  j  LowerCL    Difference Between Means  UpperCL    brand   _brand
brand   lifetime   1  2  -2.961379  -1.750000                 -0.538621  Brand1  Brand2
brand   lifetime   1  3  -0.026905  1.250000                  2.526905   Brand1  Brand3
brand   lifetime   1  4  -0.398780  1.000000                  2.398780   Brand1  Brand4
brand   lifetime   2  3  1.931664   3.000000                  4.068336   Brand2  Brand3
brand   lifetime   2  4  1.538621   2.750000                  3.961379   Brand2  Brand4
brand   lifetime   3  4  -1.526905  -0.250000                 1.026905   Brand3  Brand4
In Proc Mixed, one table is produced:

brand   _brand  Estimate  Standard Error  Adjustment    Adj P   Adj Low   Adj Upp
Brand1  Brand2  -1.7500   0.3660          Tukey-Kramer  0.0084  -2.9614   -0.5386
Brand1  Brand3  1.2500    0.3858          Tukey-Kramer  0.0547  -0.02691  2.5269
Brand1  Brand4  1.0000    0.4226          Tukey-Kramer  0.1717  -0.3988   2.3988
Brand2  Brand3  3.0000    0.3227          Tukey-Kramer  0.0002  1.9317    4.0683
Brand2  Brand4  2.7500    0.3660          Tukey-Kramer  0.0006  1.5386    3.9614
Brand3  Brand4  -0.2500   0.3858          Tukey-Kramer  0.9130  -1.5269   1.0269
The estimated difference in means between Brand 2 and Brand 3 is 3 hours with a 95% confidence interval ranging from (1.93, 4.07) hours, which does NOT include the value of zero. There is evidence that the two means could be unequal. The standard error for the difference in means and the p-value for the hypothesis test of no difference in means are also presented. As seen in the joined-lines plot, there is evidence that the means for Brand 2 and Brand 3 differ.

The confidence diamonds seem to indicate that the mean for Brand 1 may be different from the means for Brands 3 and 4, yet the joined-lines plot indicates that there is no evidence that they differ. This seems contradictory. However, if you examine the confidence intervals for the differences closely, you see that the confidence interval for the difference in means between Brand 1 and Brand 3 just barely includes zero.
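With unequal group sizes, the standard error of a difference becomes sqrt(MSE(1/n_i + 1/n_j)), which is why each row of the table has its own standard error. A Python check (outside SAS, for illustration) of the Brand 2 vs Brand 3 row, with the MSE pooled from the raw data:

```python
import math
from statistics import mean

# Raw battery lifetimes (hours) by brand
groups = {"Brand1": [5.5, 5.0], "Brand2": [7.0, 7.5, 6.5, 7.0],
          "Brand3": [4.0, 3.5, 4.5], "Brand4": [4.5, 4.0]}

# Pooled error mean square across all four groups
ss_within = sum(sum((x - mean(g))**2 for x in g) for g in groups.values())
df_error = sum(len(g) for g in groups.values()) - len(groups)   # 11 - 4 = 7
mse = ss_within / df_error

diff = mean(groups["Brand2"]) - mean(groups["Brand3"])
se_diff = math.sqrt(mse * (1/len(groups["Brand2"]) + 1/len(groups["Brand3"])))
print(round(diff, 1), round(se_diff, 4))   # 3.0 0.3227, matching the table
```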
Proc GLM also creates a confidence-difference plot:

In this plot, follow the light grey lines beside a pair of brands to where they intersect on either a solid blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence of a difference in the mean lifetimes.
SAS also provides diagnostic plots to check the residuals. Here are the two plots from Proc GLM and Proc Mixed:

Neither of the diagnostic panels shows any evidence of problems.
5.12 Example - Cuckoo eggs - single-factor CRD ANOVA
Reference: L.H.C. Tippett, The Methods of Statistics, 4th Edition, Williams and Norgate Ltd (London),
1952, p. 176.
L.H.C. Tippett (1902-1985) was one of the pioneers in the field of statistical quality control. These data on the lengths of cuckoo eggs found in the nests of other birds (drawn from the work of Latter, O.M. 1902. The egg of Cuculus canorus. Biometrika 1, 164) are used by Tippett in his fundamental text.

Cuckoos are known to lay their eggs in the nests of other (host) birds. The eggs are then adopted and hatched by the host birds.

It was already known in 1892 that cuckoo eggs differed in characteristics depending upon the locality where found. A study by E.B. Chance in 1940 called The Truth About the Cuckoo demonstrated that cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Further, cuckoos appear to mate only within their territory. Therefore, geographical sub-species develop, each with a dominant foster-parent species, and natural selection has ensured the survival of cuckoos most fit to lay eggs that would be adopted by a particular foster-parent.
The data is available in the cuckoo.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS and then stacked in the usual way:

data length;
   /* this illustrates the flexibility of SAS to read data in a
      variety of formats and to transform it into a standard
      case-by-variable format */
   infile 'cuckoo.csv' dlm=',' dsd missover firstobs=2;
   input sp1 sp2 sp3 sp4 sp5 sp6; /* read in 6 species. A . implies a missing value */
   length species $15.;
   species = 'Meadow Pipit ';  length = sp1; output; /* convert to standard format */
   species = 'Tree Pipit   ';  length = sp2; output;
   species = 'Hedge Sparrow';  length = sp3; output;
   species = 'Robin        ';  length = sp4; output;
   species = 'Pied Wagtail ';  length = sp5; output;
   species = 'Wren         ';  length = sp6; output;
   label length = 'Egg length (mm)';
   keep species length; /* which variables to retain */
run;
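The same wide-to-long stacking can be sketched outside SAS. Here is an illustration in Python with pandas, using a small made-up stand-in for cuckoo.csv (the real file is not reproduced here; column names sp1-sp3 and the two rows are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for cuckoo.csv: one column per host species,
# None for a missing cell (SAS's '.'), two rows for illustration.
wide = pd.DataFrame({"sp1": [22.0, 22.4],
                     "sp2": [23.9, None],
                     "sp3": [20.85, 21.65]})
names = {"sp1": "Meadow Pipit", "sp2": "Tree Pipit", "sp3": "Hedge Sparrow"}

# Stack to the standard case-by-variable format: one (species, length) per row
long = (wide.rename(columns=names)
            .melt(var_name="species", value_name="length")
            .dropna())          # drop the empty cells, as MISSOVER+'.' does in SAS
print(long)
```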
Part of the raw data are shown below:
Obs species length
1 Hedge Sparrow 20.85
2 Hedge Sparrow 21.65
3 Hedge Sparrow 22.05
4 Hedge Sparrow 22.85
5 Hedge Sparrow 23.05
6 Hedge Sparrow 23.05
7 Hedge Sparrow 23.05
8 Hedge Sparrow 23.05
9 Hedge Sparrow 23.45
10 Hedge Sparrow 23.85
Does this study satisfy the design requirements for a single-factor CRD? What is the factor? What are its levels? What are the treatments? What are the experimental units and the observational units? How is randomization introduced in the study?

We start by looking at dot-plots to check for outliers, using Proc SGplot to create a side-by-side dot-plot:

proc sgplot data=length;
   title2 'dot plot of rawdata';
   scatter x=species y=length;
run;
There is no evidence of any outliers or problem points.
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=length;
   title2 'summary statistics';
   class species;
   var length;
   table species, length*(n*f=5.0 mean*f=6.2 std*f=6.2) / rts=20;
run;
which gives:

Egg length (mm)
               N   Mean   Std
Hedge Sparrow  14  23.12  1.07
Meadow Pipit   45  22.30  0.92
Pied Wagtail   15  22.90  1.07
Robin          16  22.58  0.68
Tree Pipit     15  23.09  0.90
Wren           15  21.13  0.74
The standard deviations are roughly comparable across all groups.

The model for this experiment is again built by considering the treatment, experimental unit, and randomization structures. The simplified model syntax[21] is

   Length = Species

We interpret this to read that variation in Length (the variable on the left side of the equals sign) is attributable to Species effects + random noise (not shown, but implicitly present).

A single-factor CRD can be analyzed using either Proc GLM or Proc Mixed. Both will give the same results, and it is a matter of personal preference which is used.[22] We will demonstrate the output from both procedures.

[21] What happened to the experimental-unit effects and the randomization-structure effects in the model syntax?
[22] I prefer Proc Mixed because it is more general than GLM.

First, Proc GLM:
ods graphics on;
proc glm data=length plots=all;
   title2 'ANOVA using GLM';
   class species;
   model length = species;
   lsmeans species / adjust=tukey pdiff cl stderr lines;
   estimate 'Special Contrast' species 0 .5 0 -1 .5 0;
   ods output LSmeanDiffCL = GLMdiffs;
   ods output LSmeans = GLMLSmeans;
   ods output LSmeanCL = GLMLSmeansCL;
   ods output LSMlines = GLMlines;
   ods output ModelANOVA = GLManova;
   ods output Estimates = GLMest;
run;
ods graphics off;
Proc GLM computes various test statistics which, in the case of a single-factor CRD, are all the same. In general, you are interested in the Type III tests from GLM.[23]

Dependent  HypothesisType  Source   DF  Type I SS    Mean Square  F Value  Pr > F
length     3               species  5   42.93965079  8.58793016   10.39    <.0001
Second, Proc Mixed:

ods graphics on;
proc mixed data=length plots=all;
   title2 'ANOVA using Mixed';
   class species;
   model length=species;
   lsmeans species / adjust=tukey diff cl;
   estimate 'Special Contrast' species 0 .5 0 -1 .5 0;
   ods output tests3 =MixedTest;
   ods output lsmeans=MixedLsmeans;
   ods output diffs =MixedDiffs;
   ods output estimates=MixedEst;
run;
ods graphics off;

[23] Contact me for details about the Type I, II, III, and IV sums-of-squares and tests.
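The Estimate statement defines a linear contrast over the species means in class-level (alphabetical) order: 0(Hedge Sparrow) + 0.5(Meadow Pipit) + 0(Pied Wagtail) - 1(Robin) + 0.5(Tree Pipit) + 0(Wren), i.e., it compares the Robin mean with the average of the two Pipit means. A Python sketch (outside SAS) of the point estimate and its standard error, using the least-squares means and group sizes reported for this example; the MSE is recovered approximately from the reported Wren standard error, so the SE here is an approximation:

```python
import math

# (lsmean, n) for each host species, taken from the SAS listings
lsm = {"Hedge Sparrow": (23.1214, 14), "Meadow Pipit": (22.2989, 45),
       "Pied Wagtail": (22.9033, 15), "Robin": (22.5750, 16),
       "Tree Pipit": (23.0900, 15), "Wren": (21.1300, 15)}
coef = {"Meadow Pipit": 0.5, "Robin": -1.0, "Tree Pipit": 0.5}

est = sum(c * lsm[sp][0] for sp, c in coef.items())
mse = 0.2347680**2 * 15          # approximate MSE recovered from the Wren SE
se = math.sqrt(mse * sum(c**2 / lsm[sp][1] for sp, c in coef.items()))
print(round(est, 3), round(se, 3))
```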
which gives:

Effect   Num DF  Den DF  F Value  Pr > F
species  5       114     10.39    <.0001
In Proc Mixed, the concept of Type I, II, and III tests does not exist, and there is only one table of test statistics produced.[24]

There appears to be strong evidence of differences in the mean egg size among the host-species nests (F = 10.39, p < .0001). At this point, we don't know which means could differ from which other means, so we need to perform a (Tukey) multiple-comparison procedure.

We start by finding the estimates of the marginal means along with the standard errors and confidence intervals. These are obtained from the LSmeans statement in both Proc GLM and Proc Mixed, but beware of the slightly different syntaxes in the two procedures.
The corresponding outputs for Proc GLM are:
Effect   Dependent  species        length LSMEAN  Standard Error  Pr > |t|  LSMEAN Number
species  length     Hedge Sparrow  23.1214286     0.2430079       <.0001    1
species  length     Meadow Pipit   22.2988889     0.1355433       <.0001    2
species  length     Pied Wagtail   22.9033333     0.2347680       <.0001    3
species  length     Robin          22.5750000     0.2273131       <.0001    4
species  length     Tree Pipit     23.0900000     0.2347680       <.0001    5
species  length     Wren           21.1300000     0.2347680       <.0001    6

[24] Indeed, Proc Mixed does away with Sums-of-Squares and all that jazz and follows a REML procedure; contact me for more details.
Effect   Dependent  species        LowerCL    length LSMEAN  UpperCL
species  length     Hedge Sparrow  22.640032  23.121429      23.602825
species  length     Meadow Pipit   22.030379  22.298889      22.567399
species  length     Pied Wagtail   22.438260  22.903333      23.368407
species  length     Robin          22.124695  22.575000      23.025305
species  length     Tree Pipit     22.624926  23.090000      23.555074
species  length     Wren           20.664926  21.130000      21.595074
and for Proc Mixed are:

Effect   species        Estimate  Standard Error  DF   t Value  Pr > |t|  Alpha  Lower    Upper
species  Hedge Sparrow  23.1214   0.2430          114  95.15    <.0001    0.05   22.6400  23.6028
species  Meadow Pipit   22.2989   0.1355          114  164.51   <.0001    0.05   22.0304  22.5674
species  Pied Wagtail   22.9033   0.2348          114  97.56    <.0001    0.05   22.4383  23.3684
species  Robin          22.5750   0.2273          114  99.31    <.0001    0.05   22.1247  23.0253
species  Tree Pipit     23.0900   0.2348          114  98.35    <.0001    0.05   22.6249  23.5551
species  Wren           21.1300   0.2348          114  90.00    <.0001    0.05   20.6649  21.5951
In both Proc GLM and Proc Mixed we can request the Tukey multiple comparison procedure using the LSmeans statement with the adjust=tukey option. Note that the syntax is slightly different in the two procedures.

Proc GLM will generate a plot of the individual comparisons and a joined-lines plot; Proc Mixed requires you to use the pdmix800.sas macro to get the joined-lines plot.

%include 'pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
Here is the joined-lines output from GLM:

Effect   Dependent  Line1  Line2  species        length LSMEAN  EqLS1  EqLS2  EqLS3  EqLS4  EqLS5  EqLS6  LSMEAN Number
species  length     A             Hedge Sparrow  23.12143       1      1      1      1      0      0      1
species  length     A             Tree Pipit     23.09000       1      1      1      1      0      0      5
species  length     B      A      Pied Wagtail   22.90333       1      1      1      1      1      0      3
species  length     B      A      Robin          22.57500       1      1      1      1      1      0      4
species  length     B             Meadow Pipit   22.29889       0      0      1      1      1      0      2
species  length     C             Wren           21.13000       0      0      0      0      0      1      6
and the corresponding output from Proc Mixed:
Effect=species Method=Tukey-Kramer(P<0.05) Set=1
Obs species Estimate Standard Error Alpha Lower Upper Letter Group
1 Hedge Sparrow 23.1214 0.2430 0.05 22.6400 23.6028 A
2 Tree Pipit 23.0900 0.2348 0.05 22.6249 23.5551 A
3 Pied Wagtail 22.9033 0.2348 0.05 22.4383 23.3684 AB
4 Robin 22.5750 0.2273 0.05 22.1247 23.0253 AB
5 Meadow Pipit 22.2989 0.1355 0.05 22.0304 22.5674 B
6 Wren 21.1300 0.2348 0.05 20.6649 21.5951 C
Now the interpretation of the joined-lines plots is not as straightforward as in the previous examples. The key problem in interpretation is the lack of transitivity among the joined lines. For example, according to the joined-lines plot, you are unable to distinguish among the mean egg sizes laid in Hedge Sparrow, Tree Pipit, Pied Wagtail, and Robin nests (they are all joined by the letter A); you are unable to distinguish among the mean egg sizes laid in Pied Wagtail, Robin, and Meadow Pipit nests (they are all joined by the letter B); but you are able to distinguish between the mean egg sizes of eggs laid in Hedge Sparrow nests and Meadow Pipit nests (they are not joined by the same letter).

This lack of transitivity can be explained using an analogy. Suppose you are comparing the colors of three paint chips. The colors on the first two chips may be so close that you cannot readily distinguish between them; those on the second and third chips may also be so close that you cannot readily distinguish between them; but the colors on the first and third chips are different enough that you can distinguish between the two. The same thing happens here: statistical equality is not the same as mathematical equality. By joining means with the same letter, you are only saying that the sample means are not far enough apart that you can readily distinguish between the respective population means.
The set of all pairwise differences in the means can also be produced. In Proc GLM, these consist of two tables, one with the means and one with the estimated differences:

Effect   Dependent  species        length LSMEAN  Standard Error  Pr > |t|  LSMEAN Number
species  length     Hedge Sparrow  23.1214286     0.2430079       <.0001    1
species  length     Meadow Pipit   22.2988889     0.1355433       <.0001    2
species  length     Pied Wagtail   22.9033333     0.2347680       <.0001    3
species  length     Robin          22.5750000     0.2273131       <.0001    4
species  length     Tree Pipit     23.0900000     0.2347680       <.0001    5
species  length     Wren           21.1300000     0.2347680       <.0001    6
Effect   Dependent  i  j  LowerCL    Difference Between Means  UpperCL    species        _species
species  length     1  2  0.015946   0.822540                  1.629134   Hedge Sparrow  Meadow Pipit
species  length     1  3  -0.761369  0.218095                  1.197560   Hedge Sparrow  Pied Wagtail
species  length     1  4  -0.418146  0.546429                  1.511003   Hedge Sparrow  Robin
species  length     1  5  -0.948036  0.031429                  1.010893   Hedge Sparrow  Tree Pipit
species  length     1  6  1.011964   1.991429                  2.970893   Hedge Sparrow  Wren
species  length     2  3  -1.390264  -0.604444                 0.181375   Meadow Pipit   Pied Wagtail
species  length     2  4  -1.043292  -0.276111                 0.491070   Meadow Pipit   Robin
species  length     2  5  -1.576931  -0.791111                 -0.005291  Meadow Pipit   Tree Pipit
species  length     2  6  0.383069   1.168889                  1.954709   Meadow Pipit   Wren
species  length     3  4  -0.618938  0.328333                  1.275605   Pied Wagtail   Robin
species  length     3  5  -1.149096  -0.186667                 0.775762   Pied Wagtail   Tree Pipit
species  length     3  6  0.810904   1.773333                  2.735762   Pied Wagtail   Wren
species  length     4  5  -1.462272  -0.515000                 0.432272   Robin          Tree Pipit
species  length     4  6  0.497728   1.445000                  2.392272   Robin          Wren
species  length     5  6  0.997571   1.960000                  2.922429   Tree Pipit     Wren
In Proc Mixed, one table is produced:
species        _species      Estimate  Standard Error  Adjustment    Adj P   Adj Low   Adj Upp
Hedge Sparrow  Meadow Pipit  0.8225    0.2783          Tukey-Kramer  0.0429   0.01595  1.6291
Hedge Sparrow  Pied Wagtail  0.2181    0.3379          Tukey-Kramer  0.9872  -0.7614   1.1976
Hedge Sparrow  Robin         0.5464    0.3328          Tukey-Kramer  0.5726  -0.4181   1.5110
Hedge Sparrow  Tree Pipit    0.03143   0.3379          Tukey-Kramer  1.0000  -0.9480   1.0109
Hedge Sparrow  Wren          1.9914    0.3379          Tukey-Kramer  <.0001   1.0120   2.9709
Meadow Pipit   Pied Wagtail  -0.6044   0.2711          Tukey-Kramer  0.2325  -1.3903   0.1814
Meadow Pipit   Robin         -0.2761   0.2647          Tukey-Kramer  0.9022  -1.0433   0.4911
Meadow Pipit   Tree Pipit    -0.7911   0.2711          Tukey-Kramer  0.0475  -1.5769   -0.00529
Meadow Pipit   Wren          1.1689    0.2711          Tukey-Kramer  0.0005   0.3831   1.9547
Pied Wagtail   Robin         0.3283    0.3268          Tukey-Kramer  0.9155  -0.6189   1.2756
Pied Wagtail   Tree Pipit    -0.1867   0.3320          Tukey-Kramer  0.9932  -1.1491   0.7758
Pied Wagtail   Wren          1.7733    0.3320          Tukey-Kramer  <.0001   0.8109   2.7358
Robin          Tree Pipit    -0.5150   0.3268          Tukey-Kramer  0.6160  -1.4623   0.4323
Robin          Wren          1.4450    0.3268          Tukey-Kramer  0.0003   0.4977   2.3923
Tree Pipit     Wren          1.9600    0.3320          Tukey-Kramer  <.0001   0.9976   2.9224
Proc GLM also creates a confidence-difference plot:
In this plot, follow the light grey lines beside a pair of groups to where they intersect on either a solid
blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval
for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence
that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the
difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence
of a difference in the means.
Sometimes specific contrasts among the means are of interest. For example, suppose you wished to
estimate the difference between the average egg length for the Meadow and Tree Pipits (combined) vs.
the average egg length of Robins.
These contrasts can be estimated using the Estimate statement in Proc GLM and Proc Mixed (for one-degree-of-freedom
contrasts) and the LSMEstimate statement in Proc Mixed. In order to use either statement,
you need to know the order in which SAS has sorted the levels of the factor species; in most cases this is
alphabetical, but the output from the class statement at the start of the procedure will show the order.
The output from both procedures is the same:
Dependent  Parameter         Estimate    Standard Error  t Value  Pr > |t|
length     Special Contrast  0.11944444  0.26465684      0.45     0.6526

Label             Estimate  Standard Error  DF   t Value  Pr > |t|
Special Contrast  0.1194    0.2647          114  0.45     0.6526
SAS also provides diagnostic plots to check the residuals. Here are the two plots from Proc GLM and
Proc Mixed:
Neither of the diagnostic panels shows any evidence of problems.
5.13 Multiple comparisons following ANOVA
5.13.1 Why is there a problem?
The general question of multiple comparisons has generated more papers in statistical theory than virtually
any other topic! Unfortunately, this has led to more smoke than enlightenment.
We used the Analysis of Variance procedure to assess the possibility that the differences among three (or
more) sample means could be attributed to chance fluctuations. In the potato-peeling example, there was a
barely reasonable probability that chance alone could produce such substantial variation between the sample
means (p = 5.25%). In the cuckoo experiment there was very little probability that chance alone could
produce such substantial variation (p < 0.0001).
However, in neither case does the Analysis of Variance provide a complete analysis. If we accept, for
example, the marginal evidence in the first example as being significant, then all that we can formally conclude
from the Analysis of Variance is that there are systematic differences in the times taken to peel potatoes
using the three different techniques. The Analysis of Variance has not told us where these differences lie.
We cannot automatically conclude, for example, that people take, on average, more time to peel a potato
with a potato peeler than with a paring knife. All that we can conclude is that at least two of the techniques
differ in their mean time to peel potatoes. Maybe the use of the wrong hand slows a person down on average,
but it makes no difference whether a knife or a specialized peeler is used.
Similarly in the cuckoo example, the primary interest is in differences between the mean lengths of the
eggs laid in the nests of the various host species. The Analysis of Variance tells us that there is evidence of
some differences in the population means, but provides no direct information on which means could differ
from which other means.
To make these specific comparisons, we could try using individual t-tests. In the first instance above, we
are comparing 3 means. Thus there are three pairs of means to test (1 vs. 2, 2 vs. 3, and 3 vs. 1). If we were
comparing k means, there would be k(k − 1)/2 possible comparisons to be made, each with its own t-test.
This number increases rapidly. For example, amongst the k = 6 means in the cuckoo example, there are 15
possible t-tests. With k = 20 means, there are 190 possible comparisons.
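As a quick illustration (a Python sketch rather than the SAS/JMP used elsewhere in these notes), the number of possible pairwise t-tests grows quadratically in the number of means:

```python
def n_pairs(k):
    # number of distinct pairwise comparisons among k means: k choose 2
    return k * (k - 1) // 2

print([n_pairs(k) for k in (3, 6, 20)])  # [3, 15, 190]
```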
When you are performing all these t-tests, chances are that one of them will produce a significant p-value
just by random chance, even if there are no systematic differences amongst the groups.
5.13.2 A simulation with no adjustment for multiple comparisons
For example, the following table illustrates the problem. It contains the results of 100 simulations of each
of the 10 possible t-tests for comparing a group of 5 means when there were NO systematic differences
among the 5 population means!
For each of the 100 simulations, a 10-character vector (the Pairs column) represents the results of the
10 pairwise tests (mean 1 vs. mean 2, mean 1 vs. mean 3, . . . , mean 4 vs. mean 5). A period (.) in column x
indicates that no statistically significant result was detected at α = 0.05. An asterisk (*) in column x represents a
p-value under 5%, i.e. a Type I error has been committed and a difference has been declared to exist among
the five means when, in fact, there really is no difference.
The column labeled Any represents ANY difference detected among the 10 possible comparisons.
Output from the simulation study comparing 5 means when no
real difference exists. Each comparison at 0.05 level.
Sim Pairs      Any   Sim Pairs      Any   Sim Pairs      Any
  1 .......... .       2 .......... .       3 .......... .
  4 .......... .       5 .......... .       6 .......... .
  7 .......... .       8 .......... .       9 .......... .
 10 .......... .      11 .......... .      12 .......... .
 13 .......... .      14 .......... .      15 .......... .
 16 ....*..*.. *      17 ..*......* *      18 .......... .
 19 ......*... *      20 .......... .      21 .......... .
 22 .......... .      23 ..*....... *      24 .......... .
 25 .......... .      26 .......... .      27 .......... .
 28 .......... .      29 .......*.. *      30 .......... .
 31 .....*.... *      32 .......... .      33 .......... .
 34 .......... .      35 .......... .      36 ....*.*... *
 37 .......... .      38 .......... .      39 .......... .
 40 .......... .      41 .......... .      42 .......... .
 43 .......... .      44 .......... .      45 ...*...... *
 46 .......... .      47 .......... .      48 .*.*...... *
 49 .......... .      50 .......... .      51 .....*.*.. *
 52 .......... .      53 .......... .      54 .......... .
 55 .......... .      56 ...*....** *      57 .......... .
 58 .......... .      59 .......... .      60 .......... .
 61 .......... .      62 .......... .      63 .......... .
 64 .......... .      65 .......... .      66 .......... .
 67 .......... .      68 ...*...... *      69 .......... .
 70 ...*...... *      71 .......... .      72 .......... .
 73 .......... .      74 *.**...... *      75 .......... .
 76 .......... .      77 .......... .      78 .......... .
 79 .......**. *      80 .........* *      81 .......... .
 82 ......*... *      83 .......... .      84 .......... .
 85 .......... .      86 .......... .      87 .*......*. *
 88 .......... .      89 .......... .      90 .......... .
 91 .......... .      92 .......... .      93 ..*....... *
 94 .......... .      95 .......... .      96 ..*....... *
 97 ......**.* *      98 .......... .      99 .......... .
100 .......... .
Fix your attention on any given pair of means (i.e., going down a particular column in the Pairs vector).
About 5% of the tests erroneously point to a significant difference. In fact, the total number of statistically
significant differences detected when, in fact, there was no real difference was:
Total significant results
when each pair tested at the .05 level.
Pair 1 2 3 4 5 6 7 8 9 10 Any
------------------------------------
Sig Diff 1 2 5 6 2 2 4 5 3 4 21
These fluctuate around 5% as expected (why?).
But now cast your eye across the entire table and look at the Any column. This indicates if any of the 10
pairwise comparisons were declared statistically significant for each simulation. The significant test results
(*'s) do not all occur together in the same simulations. They are more or less scattered about. In almost
1/5 of the simulations, at least one significant difference was found which was NOT REAL, as the data were
purposely generated with all the means equal.
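The simulation above is easy to mimic. The sketch below (Python rather than the SAS used in these notes, and using a normal approximation to the two-sample t-test so that only the standard library is needed) generates five groups with identical population means and counts how often the unadjusted pairwise tests cry wolf:

```python
import math
import random

def approx_p_value(x, y):
    # two-sided p-value for equal means; normal approximation to the two-sample t-test
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(2012)
n_sims, n_per_group = 1000, 30
any_error, pair_errors = 0, 0
for _ in range(n_sims):
    # five groups with IDENTICAL population means -- every rejection is a false positive
    groups = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(5)]
    sig = [approx_p_value(groups[i], groups[j]) < 0.05
           for i in range(5) for j in range(i + 1, 5)]
    pair_errors += sum(sig)
    any_error += any(sig)

comparisonwise = pair_errors / (10 * n_sims)   # fluctuates around 0.05
experimentwise = any_error / n_sims            # much larger, roughly 0.2 to 0.3
print(comparisonwise, experimentwise)
```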
5.13.3 Comparisonwise- and Experimentwise Errors
The previous simulation illustrates the two types of comparison errors that can be made. First is the
comparison-wise error rate: the probability that a particular comparison will erroneously declare a positive
result when, in fact, none exists. This is controlled by the α level and, as seen in the above simulation, is
about 5% as expected.
However, in any ANOVA, more than one comparison is being made. In the above example, there were
10 possible comparisons among the 5 sample means. The experiment-wise error rate is the probability that
a false positive will occur somewhere in the entire set of comparisons. As shown by the above simulation,
this is quite high even if each individual comparison error rate is held to 5%.
The key idea behind all of the multiple comparison procedures is to control the experiment-wise error rate, i.e.
the probability that you will make at least one Type I error somewhere among the entire set (family) of
comparisons examined. The simple t-test that you used above controls the comparison-wise error rate,
the probability of a Type I error in each comparison. As you saw above, even if this error rate is low, there
can be a large experiment-wise error rate; almost 1/5 of the simulated results had a false positive
somewhere in the experiment!
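If the comparisons were independent (pairwise t-tests on the same data are not, so this is only a rough approximation), the experiment-wise error rate for m tests would be 1 − (1 − α)^m, which grows quickly with m:

```python
alpha = 0.05
for m in (3, 10, 15, 190):
    # probability of at least one false positive among m independent tests
    fwer = 1 - (1 - alpha) ** m
    print(m, round(fwer, 3))
```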
There are literally hundreds of multiple comparison procedures; which one is best is a difficult question
to answer. Don't get hung up on deciding among multiple comparison procedures. Most of the problems arise in
situations where the means are just on the borderline of being declared statistically significant, where I would
be more concerned with violations of assumptions having an effect upon my results. Also, many of the
problems are moot if your experiment only has two or three treatments - another reason for doing simple,
straightforward experiments.
5.13.4 The Tukey-Adjusted t-Tests
The Tukey multiple comparison procedure looks at all the possible pairwise comparisons among the
group means while keeping the experiment-wise or family-wise error rate controlled at the α (usually
0.05) level.
This means that under the Tukey multiple comparison procedure, there is at most a 5% chance of finding
a false positive among ALL of the pairwise comparisons, rather than a 5% chance of a false positive for EACH
comparison.
The way this procedure works is that each comparison is done at a slightly lower alpha rate to ensure
that the overall error rate is controlled. Many books have the gory details on the exact computations for the
tests.
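Tukey's critical values come from the studentized range distribution, whose details are beyond these notes, but the flavour of testing each comparison at a smaller level can be seen in a Python sketch of the simpler (and more conservative) Bonferroni correction. The raw p-values below are hypothetical, not taken from the cuckoo data:

```python
raw_p = [0.0429, 0.0001, 0.2325, 0.9022]  # hypothetical unadjusted pairwise p-values
m = 15                                    # number of pairwise comparisons among k = 6 means
# Bonferroni: inflate each p-value by m (capped at 1) so that the
# family-wise error rate is held at or below the nominal alpha
adj_p = [min(1.0, p * m) for p in raw_p]
print(adj_p)
```

After adjustment, only the comparisons with very small raw p-values remain significant at the 0.05 level.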
Let us now repeat the earlier simulation, except now we will use the Tukey procedure to control the error
rates.
Output from the simulation study comparing 5 means when no
real difference exists using the Tukey procedure.
Sim Pairs Any Sim Pairs Any Sim Pairs Any
1 .......... . 2 .......... . 3 .......... .
4 .......... . 5 .......... . 6 .......... .
7 .......... . 8 .......... . 9 .......... .
10 .......... . 11 .......... . 12 .......... .
13 .......... . 14 .......... . 15 .......... .
16 .......... . 17 .......... . 18 .......... .
19 .......... . 20 .......... . 21 .......... .
22 .......... . 23 .......... . 24 .......... .
25 .......... . 26 .......... . 27 .......... .
28 .......... . 29 .......... . 30 .......... .
31 .......... . 32 .......... . 33 .......... .
34 .......... . 35 .......... . 36 .......... .
37 .......... . 38 .......... . 39 .......... .
40 .......... . 41 .......... . 42 .......... .
43 .......... . 44 .......... . 45 .......... .
46 .......... . 47 .......... . 48 .......... .
49 .......... . 50 .......... . 51 .......... .
52 .......... . 53 .......... . 54 .......... .
55 .......... . 56 .......... . 57 .......... .
58 .......... . 59 .......... . 60 .......... .
61 .......*.. * 62 .......... . 63 .......... .
64 .......... . 65 .......... . 66 .......... .
67 .......... . 68 .......... . 69 .......... .
70 .......... . 71 .......... . 72 .......... .
73 .......... . 74 .......... . 75 .......... .
76 .......... . 77 .......... . 78 .......... .
79 .......... . 80 .......... . 81 .......... .
82 .......... . 83 .......... . 84 .......... .
85 ..*....... * 86 .......... . 87 .......... .
88 .......... . 89 .......... . 90 .......... .
91 .......... . 92 .......... . 93 .......... .
94 .......... . 95 .......... . 96 .......... .
97 .......... . 98 .......... . 99 .......... .
100 .......... .
We now see that each individual comparison is declared statistically significant at a much smaller rate
and the experiment-wise error rate is also reduced.
A summary of the above is
Total significant results
when all pairs tested using the Tukey procedure.
Pair 1 2 3 4 5 6 7 8 9 10 Any
------------------------------------
Sig Diff 0 0 1 0 0 0 0 1 0 0 2
We see that the comparison-wise error rates are very small, and that even the experiment-wise error rate
is less than 5%.
5.13.5 Recommendations for Multiple Comparisons
Don't get hung up on this topic. Understand why these tests are needed, and take care if you are doing
experiments with many, many treatments.
Virtually every statistical text has recommendations and these are often at odds with each other!
You may also wish to read the article,
Day, R.W. and Quinn, G. P. (1989).
Comparisons of Treatments after an analysis of variance in Ecology.
Ecological Monographs, 59, 433-463.
http://dx.doi.org/10.2307/1943075.
[This paper is NOT part of this course and is not required reading.]
In this paper, Day and Quinn give an exhaustive (!) review of multiple comparison methods with lots of
technical details. Their final recommendations are:
- Enough information should be presented in papers to allow the readers to judge the results for themselves. Present tables of means, sample sizes, standard errors, and confidence intervals for differences.
- All assumptions for statistical tests should be considered carefully.
- Be aware of the problems of multiple testing.
- A list of recommended procedures is given. Most computer packages implement many of their recommendations.
My advice is that rather than worrying about minute differences that may or may not be detectable with
different multiple comparison procedures, try to keep the experiment as simple as possible and present results using
good graphs and confidence intervals.
5.13.6 Displaying the results of multiple comparisons
There are many methods for displaying the results of multiple comparisons.
Many statistical packages can produce joined-line plots to indicate which means have and have not been
declared equal as a result of the multiple comparison procedure.
The joined-line multiple comparison plot is created as follows:
1. Find the estimated means (usually the LSMeans; see footnote 25).
2. Sort the estimated means from smallest to largest.
3. Plot the estimated means on a number line (see footnote 26).
4. Compute the LSMeans comparison table using an appropriate multiple comparison procedure to adjust
the p-values. This can be done automatically by most packages following a model fit.
5. Starting with the smallest mean, draw a line joining this mean with any other mean that is not declared
statistically significantly different.
6. Repeat for the next smallest mean, etc.
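The joining steps above can be sketched in code. This illustrative Python fragment uses the species means and the pairs NOT declared different by the Tukey procedure in the cuckoo example (transcribed from the output below), and assumes significance is monotone in the gap between sorted means, which holds for Tukey comparisons with roughly equal group sizes:

```python
means = {"Hedge": 23.12, "Tree": 23.09, "Wagtail": 22.90,
         "Robin": 22.57, "Meadow": 22.29, "Wren": 21.13}
# pairs NOT declared significantly different by the Tukey procedure
same = {("Hedge", "Tree"), ("Hedge", "Wagtail"), ("Hedge", "Robin"),
        ("Tree", "Wagtail"), ("Tree", "Robin"),
        ("Wagtail", "Robin"), ("Wagtail", "Meadow"), ("Robin", "Meadow")}

def not_diff(a, b):
    return (a, b) in same or (b, a) in same

order = sorted(means, key=means.get, reverse=True)  # largest mean first
lines = []
for i, sp in enumerate(order):
    # extend a line from this mean through consecutive means not declared different
    j = i
    while j + 1 < len(order) and not_diff(sp, order[j + 1]):
        j += 1
    # keep the line only if it is not enclosed in one already drawn
    if not any(set(order[i:j + 1]) <= set(line) for line in lines):
        lines.append(order[i:j + 1])
print(lines)
```

Each list in the result corresponds to one joined line (one letter group) in the plot.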
For example, refer back to the cuckoo example. The estimated means are first sorted by species as Hedge, Tree,
Wagtail, Robin, Meadow, and finally Wren (JMP lists them from largest to smallest). Then refer to the LSMeans difference
table reproduced from JMP below (other packages produce similar output):
Footnote 25: In simple balanced designs, the LSMeans are equal to the sample means. However, this is not always true; refer to examples in the two-factor designs for details.
Footnote 26: Many packages simply list the means using (misleading) equally spaced intervals. This is not recommended.
- Begin with the first mean in the sorted list, that belonging to the Hedge species. Find the comparison with each
of the other means in turn. There is no evidence of a difference in the mean
length of eggs between the Hedge species and the Tree, Wagtail, and Robin species, but clear evidence of
a difference between the Hedge and the Meadow and Wren species. Hence draw a line under the
LSMeans corresponding to the Hedge, Tree, Wagtail, and Robin species. This corresponds to the
line of A's in the final result below.
- Next consider the next mean in the sorted list, that belonging to the Tree species. Again, use the table and compare
this mean to the means that follow it in the list. No evidence of a difference is found between the Tree
species and the Wagtail and Robin species. Hence draw a line connecting
the Tree and Robin species. Because this line is completely enclosed in an existing line, it is not
necessary to draw it (again refer to the final result below).
- Now refer to the Wagtail mean. We find no difference in the mean between the Wagtail species and
the Robin and Meadow species. A line is drawn joining these three species. As it is NOT included in
an existing line, it is the next line in the plot below (the line of B's).
- Looking at the mean for the Robin species, its line is completely enclosed in the above line of B's and so
is not drawn.
- Finally, the mean for the Wren species appears to be different from all the other means.
The final joined-line multiple comparison plot following a Tukey HSD multiple comparison procedure
produced by JMP (other packages follow similar conventions) is:
Level Mean
Hedge A 23.12
Tree A 23.09
Wagtail A B 22.90
Robin A B 22.57
Meadow B 22.29
Wren C 21.13
Levels not connected by same letter are significantly different
Note that the above plot does NOT plot the means on a number line. If you do plot the actual means on a
number line, the differences in the group means become more apparent.
This indicates that the mean egg size for the eggs laid in the nests of the Hedge, Tree, Wagtail, and Robin species was not
declared significantly different; nor were the mean egg sizes for eggs laid in the nests of the Wagtail, Robin, and Meadow
species; etc.
Note the non-transitivity of the comparisons. Even though the mean for Hedge is not found to be different
from the mean for Robin, and the mean for Robin is not found to be different from the mean for Meadow,
the means for Hedge and Meadow are declared to be different.
5.14 Prospective power and sample size - single-factor CRD ANOVA
Determination of sample sizes for planning purposes and power of existing designs proceeds in a fashion
similar to that for the two-sample t-test.
As before, the power of the test will depend upon the following:
α level. This is the largest value for the p-value of the test at which you will conclude that you have
sufficient evidence against the null hypothesis. Usually, most experiments use α = 0.05, but this is
not an absolute standard. The smaller the α level, the more difficult it is to conclude that you have
sufficient evidence against the null hypothesis, and hence the lower the power.
Effect size. The effect size is the actual size of the difference that is to be detected and is biologically
important. This will depend upon economic and biological criteria. It is easier to detect a larger
difference and hence power increases with the size of the difference to be detected. This is the
hardest part of a prospective power analysis!
Natural variation (noise). All data have variation. If there is a large amount of natural variation in the
response, then it will be more difficult to detect a shift in the mean, and power will decline as variability
increases. When planning a study, some estimate of the natural variation may be obtained from pilot
studies, literature searches, etc. If you have a previous experiment, you could use the average of the
within-treatment-group standard deviations. If you have the ANOVA table only, this is equivalent to
the √MSE entry.
Sample size. It is easier to detect differences with larger sample sizes and hence power increases with
sample size.
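For example, if a previous one-way ANOVA reported only an MSE (the value below is hypothetical), the standard deviation to use in a prospective power analysis is recovered as its square root:

```python
import math

mse = 0.8448  # hypothetical MSE from a previous ANOVA table
sd_estimate = math.sqrt(mse)  # the 'noise' input for a prospective power analysis
print(round(sd_estimate, 2))
```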
As before, there are a number of ways of determining the necessary sample sizes:
- Computational formulae such as presented in Zar.
- Tables as presented in some books.
- Computer programs such as in JMP, R, SAS, or on the web. For example, refer to the Java applets by
Russ Lenth at http://www.cs.uiowa.edu/~rlenth/Power/.
Unfortunately, in the case of more than two treatments, the computations are not straightforward, as
different configurations of the population means will lead to different powers. For example, if there are 4
treatments, then power will differ if the 4 means are equally spaced or if the 4 means are in 2 groups of 2
at each extreme. Fortunately, one can show that power is minimized if there are two means, one at each
extreme, and the remaining means are all situated at the middle value.
Suppose that there are 6 treatment groups, with means ranging from about 21 to 23 mm, and that the standard
deviation of the individual observations is about 1.0 mm.
5.14.1 Using Tables
The second set of tables at the end of this chapter is indexed by r, the number of treatment groups, and

    Δ = [max(μ) − min(μ)] / σ = 2 / 1 = 2

the relative difference between the largest and smallest mean.
Looking for r = 6, Δ = 2, α = 0.05, and a power of 80%, the table indicates that a sample size of 8 will
be required in EACH of the six treatment groups, for a total of 48 experimental units.
Again note the effect of decreasing values of Δ.
These tables are designed to give the worst possible sample size. They implicitly assume that the 6
group means are 21, 22, 22, 22, 22, and 23 mm, i.e. one group at each extreme and all the other groups in
the middle. Notice that the groups in the middle really don't contribute to detecting the difference of 2 mm.
5.14.2 Using SAS to determine power
SAS has several methods to determine power. Proc Power computes power for relatively simple designs with
a single random error (such as the ones in this chapter). SAS also has a stand-alone program for power
analysis. Refer to http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms for
links to examples of power analysis in SAS.
This procedure requires information on the within-group VARIANCE (the square of the standard deviation)
and the configuration of means that you expect to see. For the cuckoo example, the mean lengths range
from about 21 to 23 mm and the standard deviation of the lengths is about 1.0 mm.
For planning purposes, enter 6 means between 21 and 23, with 4 of the means having the value of 22, i.e.
use 21, 22, 22, 22, 22, 23 as the six values of the means. As noted above, this leads to the lowest possible
power and will give conservative sample size numbers.
The standard deviation should take a value of around 1.0.
The following code fragment then estimates the sample size:
proc power;
   title2 'Using a base difference of 2 mm';
   onewayanova
      groupmeans = 21 | 22 | 22 | 22 | 22 | 23  /* list the group means */
      stddev = 1                                /* what is the standard deviation */
      alpha = .05                               /* what is the alpha level */
      power = .80                               /* target power */
      ntotal = .                                /* solve for the total sample size */
      ;                                         /* end of the onewayanova statement - don't forget it */
   ods output Output=Power10;
   footnote 'This configuration has the worst power and so the largest possible sample size';
run;
This gives:
Alpha  Mean1  Mean2  Mean3  Mean4  Mean5  Mean6  Std Dev  Nominal Power  Actual Power  N Total
0.05   21     22     22     22     22     23     1        0.8            0.840         48
for a total sample size requirement of 8(6) = 48 nests.
If, however, you believe that the means are much more separated, i.e. three groups around 21 and three
groups around 23,
proc power;
   title2 'Using a base difference of 2 mm with a different configuration of means';
   onewayanova
      groupmeans = 21 | 21 | 21 | 23 | 23 | 23  /* list the group means */
      stddev = 1                                /* what is the standard deviation */
      alpha = .05                               /* what is the alpha level */
      power = .80                               /* target power */
      ntotal = .                                /* solve for the total sample size */
      ;                                         /* end of the onewayanova statement - don't forget it */
   ods output Output=Power20;
   footnote 'This configuration has the best power and so the smallest possible sample size';
run;
you get
Alpha  Mean1  Mean2  Mean3  Mean4  Mean5  Mean6  Std Dev  Nominal Power  Actual Power  N Total
0.05   21     21     21     23     23     23     1        0.8            0.923         24
the total sample size required is around 24 units. This shows the effect of a different configuration of the
means upon required sample sizes - the first case, with the remaining means in the middle, is the worst-case
scenario.
5.14.3 Retrospective Power Analysis
See earlier comments about the dangers of a retrospective power analysis.
5.15 Pseudo-replication and sub-sampling
The following articles discuss the issue of pseudo-replication.
Hurlbert, S. H. (1984).
Pseudo-replication and the design of ecological field experiments.
Ecological Monographs 54, 187-211.
http://dx.doi.org/10.2307/1942661.
Heffner, R.A., Butler, M.J., and Reilly, C. K. (1996).
Pseudo-replication revisited.
Ecology, 77, 2558-2562.
http://dx.doi.org/10.2307/2265754.
Hurlbert (1984) has become one of the most widely cited papers in the biological literature; it has been
awarded Citation Classic status. There is no more devastating review of a report than a simple one-liner
indicating that the researcher has fallen prey to pseudo-replication.
What is pseudo-replication? Hurlbert (1984) defines it as follows:
Pseudo-replication is defined as the use of inferential statistics to test for treatment effects with
data from experiments where either treatments are not replicated (though samples may be) or
replicates are not statistically independent.
As an example of pseudo-replication, consider an experiment to investigate the effects of a chemical
upon the growth of fish. The researcher originally plans to conduct the experiment by randomly assigning a
single fish to each of 10 tanks. Five of the tanks will have pure water; the other five will have the chemical
added.
When setting up the experiment, the researcher decides instead to put all five fish for each treatment into a
single tank: one tank with pure water and the other with the chemical.
Both experiments use 10 fish; both have five measurements of fish growth. Yet the second experiment
DOES NOT YIELD ANY USEFUL RESULTS; the second experiment is pseudo-replicated.
Why is the second experiment a poor choice? The key point is that the experimental unit is the tank but
the observational unit is the fish. In the first experiment, there are five replicates of the experimental unit for
each treatment; in the second experiment there is but a single experimental unit for each treatment. The first
experiment allows the experimenter to estimate experimental error; the second experiment does not.
No one would dream of putting a single fish into a tank, measuring the single fish five times, and
thinking these are five independent replicates. Yet putting five fish into a single tank is similar. Any tank
effect will operate on all five fish simultaneously; the readings of the five fish will not be independent of
each other.
What does the second experiment tell us? We can apply a valid statistical test to test if the mean growth
in tank 1 is the same as in tank 2. If we find sufficient evidence against the null hypothesis, all that we know
is that there appears to be a difference in the mean growth between these two specific tanks; we cannot
extrapolate to say that the effect is caused by treatment differences.
In some cases, pseudo-replication is acceptable. The most obvious case is that of environmental impact
studies, where interest does lie in comparison of the specific spot where the impact occurred to other, control,
sites.
Hurlbert (1984) is a very nice paper and is required reading for anyone designing experiments. I particularly
enjoyed his Table 1, where he outlines sources of experimental error and ways to control or minimize
their effect. If anyone succeeds in getting a research proposal accepted by a research ethics committee using
his seventh listed source of experimental error and suggested remedy, please let me know!
His survey of the literature is sobering. Of over 500 experiments reviewed in scientific journals, fewer
than 40% adequately described their design and almost 50% of all papers committed obvious
pseudo-replication.
Hurlbert (1984) identified four types of pseudo-replication (refer to his Figure 5 and the text):
1. Simple pseudo-replication. This is the most common type of pseudo-replication and takes place
when there are only two experimental units and multiple measurements are taken from the same
experimental unit and erroneously treated as real replicates. The fish example above is such an example.
Many field trials where a single site has many study plots are likely pseudo-replicated. Environmental
impact studies are pseudo-replicated if multiple study plots are selected at the impact and at a single
control site. As noted earlier in the notes, it is often advantageous to have replicated control sites.
2. Sacrificial pseudo-replication. This is a generalization of the above where there are true replicates of experimental units, but multiple samples or observations are still taken from each unit. The data are then pooled. This pooling ignores the structure of the experimental and observational units.
3. Temporal pseudo-replication. This differs from simple pseudo-replication only in that multiple samples are not taken simultaneously from each experimental unit but rather sequentially over time. Dates are taken to represent replicated treatments when in fact they are not.
4. Implicit pseudo-replication. Here the authors recognize that they performed pseudo-replication but then continue to discuss results as if it hadn't occurred, hoping that no one would notice.
Twelve years after Hurlbert (1984), Heffner, Butler and Reilly (1996) again reviewed the ecological literature. Despite over 600 references to Hurlbert's paper by this time, they found that the incidence of pseudo-replication had declined to only 20% of studies - a rate still unacceptable to the authors.
5.16 Frequently Asked Questions (FAQ)
5.16.1 What does the F-statistic mean?
The F-statistic; what do we do with it; how do we interpret it?
The F-statistic is simply a number (a statistic) that is an intermediate step in finding the p-value. Before
the advent of computers, the F-statistic was used to look up the approximate p-value from tables. It has
no intrinsic meaning other than large F-statistics indicate that the variation among means is much greater
than the variation within groups. One could think of it as a signal-to-noise ratio - higher values of the
F-statistic indicate that the signal is very strong relative to background noise, i.e., there is good evidence
that the means may not be equal. For historical reasons it is still reported, but has no real usefulness for
anything else.
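The notes use JMP throughout; purely as an illustration of the signal-to-noise idea, here is a small sketch in Python with made-up data (the group values are hypothetical, not from the course):

```python
# Illustrative sketch only; the data for the three treatment groups are made up.
from scipy import stats

group1 = [23, 25, 28, 24, 26]
group2 = [30, 32, 29, 31, 33]
group3 = [22, 24, 23, 25, 21]

# f_oneway returns the F-statistic and the p-value computed from it;
# the F-statistic itself is only the intermediate step.
F, p = stats.f_oneway(group1, group2, group3)
print(F, p)   # a large F (strong signal relative to noise) gives a small p
```

Because the group means here (25.2, 31, 23) differ by much more than the within-group spread, the F-statistic is large and the p-value is small, exactly the signal-to-noise interpretation above.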
5.16.2 What is a test statistic - how is it used?
I am confused about the definition of a test statistic.
A test statistic is a statistic computed from the sample data used to test a hypothesis. There are many
different test statistics, but the basic idea is a measure of how far the data lie from the null hypothesis.
In the case of two independent samples, the test statistic is a T-value computed using a two-sample t-test.
In the case of paired data (next chapter) the test statistic is also a T-value but computed in a different way.
In most ANOVA, the test-statistic is an F-statistic computed as the ratio of mean-squares. In two sample experiments, either a T-statistic or an F-statistic could be reported - the F-value is usually the square of the T-value, i.e. F = T^2.
There are many types of test statistics and they depend upon the analysis chosen. In the cases above, I've shown the typical test-statistics, but there are lots of other possibilities as well that you don't typically see in this course. There is no intrinsic meaning behind a test statistic other than it is a number for which a p-value can be determined. In some cases, for example, two-sample experiments, either a T or an F value could be reported.
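The relation F = T^2 for two groups is easy to verify numerically. A sketch in Python with hypothetical data (the pooled-variance two-sample t-test matches the one-way ANOVA exactly, and so do their p-values):

```python
from scipy import stats

# Hypothetical two-group data for illustration.
a = [12.1, 11.5, 13.0, 12.4, 11.8]
b = [10.2, 10.9, 9.8, 10.5, 11.1]

t, p_t = stats.ttest_ind(a, b)   # pooled-variance two-sample t-test
F, p_F = stats.f_oneway(a, b)    # one-way ANOVA on the same two groups

print(t**2, F)                   # the F-value is the square of the T-value
```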
5.16.3 What is MSE?
MSE - what does it mean and why use it? What is root mean square error; where does it come from; what does it mean?
A fundamental assumption of the Analysis of Variance method is that the POPULATION standard deviation in every treatment group is equal. That is why, when initially examining the data, one of the first steps is to see if the SAMPLE standard deviations of each treatment group are about equal. Now suppose that
this assumption appears to be tenable. It seems reasonable that if the POPULATION standard deviations are equal, then you should be able to somehow pool the information from all the treatment groups' sample standard deviations to get a better estimate of the common value. For example, if one group had a SAMPLE standard deviation of 10 and the other group had a SAMPLE standard deviation of 12, then a potential estimate of the COMMON POPULATION standard deviation would be the average of 10 and 12, or 11. This pooling is performed in the ANOVA table. The line corresponding to ERROR contains information on the best estimate of the common variation. The Mean Square Error (MSE) is an estimate of the VARIANCE of the observations. The Root Mean Square Error (RMSE) is an estimate of the common standard deviation of the treatment groups. There is no simple computation formula available.
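The pooling described above can be sketched numerically. In this illustration (hypothetical numbers, not course data), the MSE computed from the error sum of squares is exactly the degrees-of-freedom-weighted pool of the within-group sample variances; note that it is the variances that are pooled, so averaging the standard deviations themselves is only a rough approximation:

```python
import numpy as np

# Hypothetical treatment groups.
groups = [np.array([5.0, 7.0, 6.0, 8.0]),
          np.array([9.0, 11.0, 10.0, 12.0]),
          np.array([4.0, 6.0, 5.0, 7.0])]

# Error sum of squares: deviations of each observation from its own
# group mean, squared and summed over all groups.
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_error = sum(len(g) - 1 for g in groups)   # N - number of groups

mse = sse / df_error      # estimate of the common VARIANCE
rmse = mse ** 0.5         # estimate of the common standard deviation

# MSE equals the pooled within-group sample variance exactly.
pooled = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / df_error
print(mse, rmse, pooled)
```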
5.16.4 Power - various questions
What is meant by detecting half the difference?
What is meant by detecting half the difference?
Suppose that in an experiment, the sample means had the values of 3, 5, 7 and 9. The difference between
the largest and smallest sample mean is 6. Half of this difference is 3. In this case you would nd the sample
size needed to detect a difference of 3 in the population means.
Do we use the std dev, the std error, or root MSE in the power computations?
Do we use the std dev, the std error, or root MSE in the power computations?
The SE is NEVER used. What is needed is an estimate of the variation of INDIVIDUAL data values
in the experiment after removing the effects of treatments, blocks, and any other known causes of variation.
The treatment group standard deviations provide such an estimate in cases where the experiment is a CRD.
Even in this case, because of the implicit assumption of equal population standard deviation in all treatment
groups, there is a better estimate of the common standard deviation within groups - the root MSE (see above).
Root MSE is also the appropriate estimate to use in more complex experiment designs.
Retrospective power analysis; how is this different from regular (i.e., prospective) power analysis?
Retrospective power analysis; how is this different from regular (i.e., prospective) power analy-
sis?
A retrospective power analysis is one conducted after the fact, i.e., a post-mortem often performed when an experiment didn't go well and failed to detect a difference. Unfortunately, just as autopsies are an imperfect way of treating disease, retrospective power analysis is fraught with subtle problems (refer to papers in the notes). In particular, it turns out that if you fail to find sufficient evidence against the null hypothesis, the
computed retrospective power (using the prospective power/sample size formulae) cannot mathematically exceed 50% even though the real power may be much larger. A prospective power analysis's goals are much different. Prospective power analysis is done to prevent problems by ensuring that your experiment is sufficiently large to detect biologically important effects. One of the key benefits of a prospective power analysis is to force the experimenter to define exactly what is an important effect rather than "I have no idea what is important - let's just spend time and money and see what happens." The major difficulty with a prospective power analysis is that you will need estimates of the biological effect size, of the variation among replicate samples, and some idea of the configuration of the means.
What does power tell us?
What does power tell us?
A retrospective power analysis is like a post-mortem - trying to find out what went wrong and why the patient died. Again, everything is tied to the biologically important effect. If retrospective power analysis indicates that your experiment had an 80% chance of detecting this biologically important effect and you, in fact, did not, then you are more confident that the experiment was not a failure but rather that the effect just isn't very big. However, refer to the subtle problem with a retrospective power analysis noted above.
A prospective power analysis is to tell you if your experiment has a reasonable chance of detecting a biologically important effect. If your power is very low, why bother doing the experiment - for example, could you defend spending a million dollars on an experiment with only a 2% chance of success? Remember, if the power is low and you fail to find sufficient evidence against the hypothesis, you haven't learned anything except that you likely committed a Type II error.
When to use retrospective and prospective power?
When is it appropriate to use retrospective power analysis and prospective power analysis?
Prospective power analysis is used when PLANNING a study, i.e. before conducting the experiment.
Retrospective power analysis is used after a study is complete, usually when a study has failed to detect a
biologically meaningful difference. There is nothing theoretically wrong with a retrospective power analysis
- the problem is that most computer packages do not compute retrospective power analyses correctly. As
outlined in the papers in the notes, the formulae used in prospective power analyses are NOT appropriate
for retrospective power analyses. JMP has a feature to try and adjust the power for a retrospective power analysis - when you look at the confidence interval for a retrospective power analysis, you will be surprised to see just how poorly estimated the retrospective power is. An example will be shown in class.
When should power be reported?
Is it common or preferred practice to report the power of a study when results are for or against
the null hypothesis?
A power analysis is not usually done if the study detects a biologically meaningful result. More commonly, the study failed to detect an effect and the question is why. Refer to the previous question for comments about a retrospective power analysis. In my opinion, a confidence interval of the estimated effect size provides sufficient information to determine why a study failed to detect an effect.
What is done with the total sample size reported by JMP?
JMP reports the TOTAL SAMPLE size in a power analysis - how do you determine the number
of replicates?
JMP reports the TOTAL sample size required for the entire experiment. This is then divided by the number of TREATMENT combinations to obtain the sample size for each treatment combination. For example, suppose that JMP reported that a total sample size of 30 is needed. If the experiment has 2 levels of Factor A and 4 levels of Factor B, there are a total of 2 × 4 = 8 treatment combinations. The total sample size of 30 is divided by the 8 treatment combinations to give about 4 replicates per treatment combination.

Other packages may report the sample size requirements differently, i.e., number of replicates per level of each factor. The final required sample size is the same under both methods.
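The arithmetic is simple enough to script; the numbers below mirror the example above (a hypothetical 2 × 4 factorial with a reported total of 30):

```python
import math

total_n = 30                  # total sample size as reported (e.g., by JMP)
levels_A, levels_B = 2, 4     # hypothetical numbers of factor levels
combos = levels_A * levels_B  # 2 x 4 = 8 treatment combinations

# Round up, since you cannot run a fraction of a replicate.
per_combo = math.ceil(total_n / combos)
print(per_combo)              # about 4 replicates per treatment combination
```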
5.16.5 How to compare treatments to a single control?
How to compare treatments to a single control?
This is a specialized multiple comparison procedure. In JMP, select the Dunnett multiple comparison procedure and then specify the control treatment level.
5.16.6 Experimental unit vs. observational unit
I am having trouble identifying the experimental unit and understanding how it differs from
the observational unit.
An experimental unit is the unit to which a treatment level is applied. The observational unit is the unit that is measured. In simple designs, these are often the same. For example, if you are investigating the effect of a drug on blood pressure, one experimental design involves randomizing the drugs to different people. The blood pressure is measured on each person. Both the experimental unit and the observational unit are the person.

In the fish tank experiments, you have 4 tanks each containing 10 fish. You apply a chemical to two tanks and nothing (control) to 2 tanks. You measure the weight gain of each fish. The experimental unit = tank; the observational unit = fish.
In another example, you have 40 tanks each with 1 fish. You apply chemicals to 20 tanks, controls to 20 tanks, and measure the weight gain of each fish. Here the experimental unit is the tank; the observational unit is the fish; but now you can't distinguish the effects of individual tanks vs. individual fish, so you say that either both units are the tank or both units are the fish.
5.16.7 Effects of analysis not matching design
A student wrote:
One thing that you keep saying is to make sure the analysis matches the design. Can you give
an example of when this does not occur and what the detriments are.
Example 1: Recall the abalone example from assignment 2. The design was a cluster sample with individual quadrats measured along each transect.
If you analyze the data as a cluster sample, the results are:
n of        Mean(area)     Mean(count)    Est       se
transects   per transect   per transect   Density   Density
30          405.7          16.1           0.0396    0.0055
Now suppose that a researcher did NOT take into account the clustered structure of the design and treated
each individual quadrat as the sampling unit, and found a ratio estimator based on the individual quadrats
(after inserting any missing zeroes). The following are the results:
n of       Est       se
QUADRATS   Density   Density
1127       0.0396    0.00293
Here the estimates are equal, but the reported standard error is too small, i.e., the estimate appears too precise. This is a typical result - the estimates are often not affected too much (they don't always stay the same), but the reported standard errors are wrong, usually TOO small, but they can also go in the opposite direction (see over). This gives a false impression of the precision of your result.
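The effect can be demonstrated with a small simulation (hypothetical numbers, not the abalone survey): quadrats on the same transect share a transect effect, and treating them as independent sampling units badly understates the standard error:

```python
import numpy as np

rng = np.random.default_rng(12345)

n_transects, quadrats_per = 30, 20
# Each transect gets its own effect, shared by all quadrats on it;
# this shared effect is what creates the cluster structure.
transect_effect = rng.normal(0, 5, size=n_transects)
counts = transect_effect[:, None] + rng.normal(0, 1, size=(n_transects, quadrats_per))

# Correct analysis: the transect is the sampling unit.
transect_means = counts.mean(axis=1)
se_cluster = transect_means.std(ddof=1) / np.sqrt(n_transects)

# Wrong analysis: treat every quadrat as an independent sampling unit.
se_naive = counts.std(ddof=1) / np.sqrt(counts.size)

print(se_cluster, se_naive)   # the naive SE is far too small
```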
Example 2: (Maze experiment from a previous exam)
In this experiment, people were timed on the completion of a maze using their left and right hand. The
actual experiment is a paired design as each person was measured using both hands.
Here is the raw data:
Person   1   2   3   4   5   6
Left    48  54  81  82  37  51
Right   44  44  63  72  31  41
The results if analyzed (WRONGLY) as single-factor CRD are:
Difference t-Test DF Prob>|t|
Estimate 9.66667 0.988 10 0.3463
Std Error 9.78037
Lower 95% -12.1253
Upper 95% 31.45868
The results if analyzed (Correctly) as a paired experiment are:
Difference t-Test DF Prob>|t|
Estimate 9.66667 4.930 5 0.0044
Std Error 1.96073
Lower 95% 4.62646
Upper 95% 14.70687
Here the wrong analysis failed to detect a difference in the mean time to complete the maze between the two hands, while the correct analysis showed strong evidence of a difference in the means.
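The two analyses are easy to reproduce from the raw maze data above; here is a sketch in Python (the course itself uses JMP, but any package gives the same numbers):

```python
from scipy import stats

# Raw data from the maze example (persons 1..6, times for each hand).
left  = [48, 54, 81, 82, 37, 51]
right = [44, 44, 63, 72, 31, 41]

# Wrong: treat the two hands as two independent samples (a CRD analysis).
t_crd, p_crd = stats.ttest_ind(left, right)

# Correct: each person is a block, so use the paired t-test.
t_pair, p_pair = stats.ttest_rel(left, right)

print(round(t_crd, 3), round(p_crd, 4))    # 0.988, 0.3463 as in the tables above
print(round(t_pair, 3), round(p_pair, 4))  # 4.930, 0.0044 as in the tables above
```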
Example 3: (wrong analysis for study design)
Taken from CBC Disclosure http://cbc.ca/disclosure/archives/030114.html#hockey
Recently, a 20-year-old ban on full body contact in kids' hockey was lifted. Now nine-year-olds on ice rinks across Canada can slam each other just like their NHL heroes.
The lifting of the ban came after a university study concluded body checking at a young age wouldn't cause more injuries than hockey without body checking.

That didn't seem quite right to us, and when Mark Kelley investigated, he found some surprising results. Measuring the effects of initiating body checking at the Atom age level [10-11 year olds] - final report to the Ontario Hockey Federation and the Canadian Hockey Association.
One of the errors in the Lakehead Study is found on page 24, in the chart entitled Self-reported Injuries. The O.D.M.H.A. refers to the non-body checking group, and Professor Bill Montelpare originally calculated that there were 9 injuries per 1000 Athletic Exposures. He also counted 6.9 injuries per 1000 Athletic Exposures in the body checking group, the OHF. The math in both these calculations is wrong.
Therefore, the study's conclusion on page 41 is flawed: "Based on the results of this study, there is no significant difference in injury rates between the comparison groups."
After being interviewed by the CBC, Professor Montelpare later recalculated his numbers. He found
that, far from there being fewer injuries in the body checking group, there were nearly four times more.
In year two for example, he found 8.6 injuries per 1000 Athletic Exposures in the body checking group,
compared to 2.1 injuries per 1000 Athletic Exposures in the non-body checking group. In a supplemental
report, Professor Montelpare told the Canadian Hockey Association that he now considered the differences between the groups to be significant.
5.17 Table: Sample size determination for a two sample t-test
Power for a two-sided two-sample t-test at alpha=.05
Delta
n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
2 5 5 5 6 6 7 7 8 9 10 10 11 13 14 15 16 17 19 20 22
3 5 5 6 7 8 9 10 12 14 16 18 21 23 26 29 33 36 39 43 46
4 5 6 7 8 9 11 13 16 19 22 26 30 34 38 43 48 52 57 61 66
5 5 6 7 9 11 13 16 20 24 29 33 39 44 49 55 60 65 70 75 79
6 5 6 8 10 12 16 20 24 29 35 41 47 53 59 65 71 76 80 84 88
7 5 6 8 11 14 18 23 28 34 41 47 54 61 67 73 78 83 87 90 93
8 5 7 9 12 15 20 26 32 39 46 54 61 68 74 80 84 88 92 94 96
9 5 7 9 13 17 22 29 36 43 51 59 67 74 80 85 89 92 95 97 98
10 6 7 10 14 19 25 32 40 48 56 64 72 78 84 89 92 95 97 98 99
12 6 8 11 16 22 29 37 47 56 65 73 80 86 91 94 96 98 99 99 100
14 6 8 12 17 25 33 43 53 63 72 80 86 91 95 97 98 99 100 100 100
16 6 9 13 19 28 38 48 59 69 78 85 91 94 97 98 99 100 100 100 100
18 6 9 14 21 31 42 53 65 75 83 89 94 97 98 99 100 100 100 100 100
20 6 9 15 23 34 46 58 69 79 87 92 96 98 99 100 100 100 100 100 100
25 6 11 18 28 41 55 68 79 88 93 97 99 99 100 100 100 100 100 100 100
30 7 12 21 33 48 63 76 86 93 97 99 100 100 100 100 100 100 100 100 100
35 7 13 24 38 54 70 82 91 96 98 99 100 100 100 100 100 100 100 100 100
40 7 14 26 42 60 75 87 94 98 99 100 100 100 100 100 100 100 100 100 100
45 8 16 29 47 65 80 91 96 99 100 100 100 100 100 100 100 100 100 100 100
50 8 17 32 51 70 84 93 98 99 100 100 100 100 100 100 100 100 100 100 100
55 8 18 34 55 74 88 95 99 100 100 100 100 100 100 100 100 100 100 100 100
60 8 19 37 58 78 90 97 99 100 100 100 100 100 100 100 100 100 100 100 100
65 9 20 40 62 81 92 98 99 100 100 100 100 100 100 100 100 100 100 100 100
70 9 22 42 65 84 94 98 100 100 100 100 100 100 100 100 100 100 100 100 100
75 9 23 45 68 86 95 99 100 100 100 100 100 100 100 100 100 100 100 100 100
80 10 24 47 71 88 96 99 100 100 100 100 100 100 100 100 100 100 100 100 100
85 10 25 49 74 90 97 100 100 100 100 100 100 100 100 100 100 100 100 100 100
90 10 27 52 76 92 98 100 100 100 100 100 100 100 100 100 100 100 100 100 100
95 11 28 54 78 93 98 100 100 100 100 100 100 100 100 100 100 100 100 100 100
100 11 29 56 80 94 99 100 100 100 100 100 100 100 100 100 100 100 100 100 100
Power is in %
Delta = abs(difference in means)/ sigma
This table assumes equal sample sizes in both groups
The table is indexed along the top by the relative effect size defined as

Δ = |μ1 − μ2| / σ

where μ1 and μ2 are the two means to be compared, and σ is the standard deviation of the responses around their respective means. The latter is often a guess-estimate obtained from a pilot study or a literature search. Down the side are several choices of sample size per group; the power is tabulated for α = 0.05. Usually, you would like to be at least 80% sure of detecting an effect. [Note that a 50% power is equivalent to flipping a coin!]

For example, if Δ = .5, then at least 64 in each group is required for an 80% power at α = 0.05.
What does this table indicate? First, notice that as you increase the power required for a given relative effect size, the sample size increases. Similarly, as you decrease the relative effect size to be detected, the sample size increases. And, most important, you need very large experiments to detect small differences!
Power is maximized if the two groups have equal sample sizes, but it is possible to do a power analysis
with unequal sample sizes - consult some of the references listed in the notes.
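The tabled powers can be reproduced from the noncentral t distribution. A sketch in Python (the function name and implementation are mine, not part of the notes):

```python
import numpy as np
from scipy import stats

def power_two_sample(delta, n, alpha=0.05):
    """Power of a two-sided two-sample t-test with n per group,
    where delta = |mu1 - mu2| / sigma (the table's relative effect size)."""
    df = 2 * n - 2
    nc = delta * np.sqrt(n / 2)                 # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    # Probability of rejecting under the alternative hypothesis.
    return 1 - stats.nct.cdf(tcrit, df, nc) + stats.nct.cdf(-tcrit, df, nc)

# Reproduces the worked example: delta = 0.5 needs about 64-65 per group
# for roughly 80% power at alpha = 0.05.
print(power_two_sample(0.5, 65))
```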
5.18 Table: Sample size determination for a single factor, fixed effects, CRD
Power for a single factor, fixed effects, CRD at alpha=0.05
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
2 3 5 5 6 7 8 9 10 12 14 16 18 21 23 26 29 33 36 39 43 46
2 4 5 6 7 8 9 11 13 16 19 22 26 30 34 38 43 48 52 57 61 66
2 5 5 6 7 9 11 13 16 20 24 29 33 39 44 49 55 60 65 70 75 79
2 6 5 6 8 10 12 16 20 24 29 35 41 47 53 59 65 71 76 80 84 88
2 7 5 6 8 11 14 18 23 28 34 41 47 54 61 67 73 78 83 87 90 93
2 8 5 7 9 12 15 20 26 32 39 46 54 61 68 74 80 84 88 92 94 96
2 9 5 7 9 13 17 22 29 36 43 51 59 67 74 80 85 89 92 95 97 98
2 10 6 7 10 14 19 25 32 40 48 56 64 72 78 84 89 92 95 97 98 99
2 15 6 8 12 18 26 35 46 56 66 75 83 89 93 96 98 99 99 100 100 100
2 20 6 9 15 23 34 46 58 69 79 87 92 96 98 99 100 100 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
3 3 5 5 6 6 7 8 9 10 11 13 14 16 18 21 23 26 29 32 35 38
3 4 5 5 6 7 8 9 11 13 15 17 20 23 27 30 34 38 43 47 52 56
3 5 5 6 6 7 9 11 13 16 19 22 26 30 35 40 45 50 55 60 65 70
3 6 5 6 7 8 10 12 15 19 22 27 32 37 43 49 55 60 66 71 76 81
3 7 5 6 7 9 11 14 17 22 26 32 38 44 50 57 63 69 75 80 84 88
3 8 5 6 7 9 12 16 20 25 30 37 43 50 57 64 70 76 81 86 90 92
3 9 5 6 8 10 13 17 22 28 34 41 49 56 64 70 77 82 87 90 93 95
3 10 5 6 8 11 14 19 24 31 38 46 54 62 69 76 82 87 91 94 96 97
3 15 6 7 10 14 20 27 36 46 56 65 74 82 88 92 95 97 99 99 100 100
3 20 6 8 12 18 26 36 47 59 70 79 87 92 96 98 99 100 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
4 3 5 5 6 6 6 7 8 9 10 11 13 14 16 18 20 23 25 28 31 34
4 4 5 5 6 6 7 8 10 11 13 15 17 20 23 26 30 34 38 42 46 50
4 5 5 5 6 7 8 10 11 14 16 19 22 26 30 35 39 44 49 54 60 64
4 6 5 6 6 7 9 11 13 16 19 23 27 32 37 43 49 54 60 65 71 75
4 7 5 6 7 8 10 12 15 19 23 27 33 38 44 51 57 63 69 74 79 84
4 8 5 6 7 9 11 13 17 21 26 32 38 44 51 58 64 71 76 81 86 89
4 9 5 6 7 9 12 15 19 24 29 36 43 50 57 64 71 77 82 87 90 93
4 10 5 6 7 10 12 16 21 26 33 40 47 55 63 70 77 82 87 91 94 96
4 15 5 7 9 12 17 23 31 40 49 59 68 76 83 89 93 96 98 99 99 100
4 20 6 7 11 15 22 31 41 52 64 74 82 89 93 96 98 99 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
5 3 5 5 5 6 6 7 8 8 9 10 12 13 15 17 19 21 23 26 28 31
5 4 5 5 6 6 7 8 9 10 12 14 16 18 21 24 27 30 34 38 42 46
5 5 5 5 6 7 8 9 10 12 15 17 20 23 27 31 36 40 45 50 55 60
5 6 5 5 6 7 8 10 12 14 17 21 25 29 34 39 44 50 55 61 66 71
5 7 5 6 6 8 9 11 14 17 20 24 29 34 40 46 52 58 64 70 75 80
5 8 5 6 7 8 10 12 15 19 23 28 34 40 46 53 60 66 72 78 82 87
5 9 5 6 7 8 11 13 17 21 26 32 38 45 52 60 66 73 79 84 88 91
5 10 5 6 7 9 11 15 19 24 29 36 43 50 58 65 72 78 84 88 92 94
5 15 5 6 8 11 15 21 28 36 45 54 63 72 80 86 91 94 97 98 99 100
5 20 5 7 10 14 20 28 37 48 59 69 78 86 91 95 97 99 99 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
6 3 5 5 5 6 6 7 7 8 9 10 11 12 14 15 17 19 21 24 26 29
6 4 5 5 6 6 7 8 9 10 11 13 15 17 19 22 25 28 32 35 39 43
6 5 5 5 6 6 7 8 10 11 13 16 18 22 25 29 33 37 42 47 52 57
6 6 5 5 6 7 8 9 11 13 16 19 23 27 31 36 41 46 52 57 63 68
6 7 5 6 6 7 9 10 13 15 19 22 27 32 37 43 49 55 61 66 72 77
6 8 5 6 6 8 9 11 14 17 21 26 31 37 43 49 56 62 69 74 79 84
6 9 5 6 7 8 10 12 16 19 24 29 35 42 49 56 63 69 75 81 85 89
6 10 5 6 7 8 11 13 17 22 27 33 40 47 54 62 69 75 81 86 90 93
6 15 5 6 8 11 14 19 25 33 41 50 59 68 76 83 89 93 96 97 99 99
6 20 5 7 9 13 18 25 34 44 55 65 75 83 89 94 97 98 99 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
7 3 5 5 5 6 6 6 7 8 9 9 11 12 13 15 16 18 20 22 25 27
7 4 5 5 6 6 7 7 8 9 11 12 14 16 18 20 23 26 30 33 37 41
7 5 5 5 6 6 7 8 9 11 13 15 17 20 23 27 31 35 39 44 49 54
7 6 5 5 6 7 8 9 11 13 15 18 21 25 29 33 38 43 49 54 60 65
7 7 5 5 6 7 8 10 12 14 17 21 25 29 34 40 46 52 58 63 69 74
7 8 5 6 6 7 9 11 13 16 20 24 29 34 40 46 53 59 65 71 77 82
7 9 5 6 6 8 9 12 15 18 22 27 33 39 46 52 59 66 72 78 83 87
7 10 5 6 7 8 10 13 16 20 25 31 37 44 51 58 65 72 78 83 88 91
7 15 5 6 8 10 13 18 23 30 38 47 56 65 73 81 87 91 94 97 98 99
7 20 5 7 9 12 17 23 31 41 51 62 72 80 87 92 96 98 99 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
8 3 5 5 5 6 6 6 7 8 8 9 10 11 12 14 15 17 19 21 23 26
8 4 5 5 5 6 6 7 8 9 10 11 13 15 17 19 22 25 28 31 35 38
8 5 5 5 6 6 7 8 9 10 12 14 16 19 22 25 29 33 37 42 46 51
8 6 5 5 6 7 7 9 10 12 14 17 20 23 27 31 36 41 46 51 57 62
8 7 5 5 6 7 8 9 11 14 16 20 23 28 32 38 43 49 55 61 66 72
8 8 5 6 6 7 8 10 12 15 19 22 27 32 38 44 50 56 63 69 74 79
8 9 5 6 6 7 9 11 14 17 21 26 31 37 43 50 57 63 70 76 81 85
8 10 5 6 7 8 10 12 15 19 23 29 35 41 48 55 63 69 76 81 86 90
8 15 5 6 7 10 13 17 22 28 36 44 53 62 71 78 85 90 93 96 98 99
8 20 5 6 8 11 16 22 29 39 49 59 69 78 85 91 95 97 99 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
9 3 5 5 5 6 6 6 7 7 8 9 10 11 12 13 15 16 18 20 22 24
9 4 5 5 5 6 6 7 8 9 10 11 12 14 16 18 21 23 26 30 33 37
9 5 5 5 6 6 7 8 9 10 12 13 15 18 21 24 27 31 35 39 44 49
9 6 5 5 6 6 7 8 10 11 13 16 19 22 26 30 34 39 44 49 54 60
9 7 5 5 6 7 8 9 11 13 15 18 22 26 31 36 41 47 52 58 64 69
9 8 5 5 6 7 8 10 12 14 18 21 26 30 36 42 48 54 60 66 72 77
9 9 5 6 6 7 9 11 13 16 20 24 29 35 41 47 54 61 67 73 79 84
9 10 5 6 6 8 9 11 14 18 22 27 33 39 46 53 60 67 73 79 84 88
9 15 5 6 7 9 12 16 21 27 34 42 51 60 68 76 83 88 92 95 97 98
9 20 5 6 8 11 15 21 28 36 46 56 67 76 83 89 94 96 98 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
10 3 5 5 5 6 6 6 7 7 8 9 9 10 12 13 14 16 17 19 21 23
10 4 5 5 5 6 6 7 8 8 9 11 12 14 15 18 20 22 25 28 32 35
10 5 5 5 6 6 7 7 8 10 11 13 15 17 20 23 26 30 34 38 42 47
10 6 5 5 6 6 7 8 9 11 13 15 18 21 24 28 32 37 42 47 52 58
10 7 5 5 6 7 8 9 10 12 15 18 21 25 29 34 39 44 50 56 62 67
10 8 5 5 6 7 8 10 11 14 17 20 24 29 34 40 46 52 58 64 70 75
10 9 5 6 6 7 8 10 13 15 19 23 28 33 39 45 52 58 65 71 77 82
10 10 5 6 6 7 9 11 14 17 21 26 31 37 44 51 58 65 71 77 82 87
10 15 5 6 7 9 12 15 20 25 32 40 49 57 66 74 81 87 91 94 97 98
10 20 5 6 8 11 14 20 26 35 44 54 64 74 82 88 93 96 98 99 100 100
-------------------------------------------------------------------------------
Power is in %
Delta = (max difference in means)/ sigma
r = number of treatments
This table assumes equal sample sizes in all groups
The power tabulated is conservative because it assumes the worst possible configuration
for the means for a given delta and assumes equal sample sizes in all groups
The tables are indexed using

Δ = (max(μ) − min(μ)) / σ

where σ is the standard deviation of units around each population mean.
For example, suppose that an experiment had 6 treatment groups where the largest and smallest mean differed by 2 units with a standard deviation of 1 unit. Then Δ = 2.

Scan the table for r = 6 groups, power = 80%, Δ = 2, α = .05, and it indicates about 8 is needed for each treatment group.
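The conservative power in this table can be reproduced from the noncentral F distribution using the worst-case noncentrality λ = nΔ²/2 (two means at the extremes, the rest at the midpoint). A Python sketch (the function name and implementation are mine, not part of the notes):

```python
from scipy import stats

def power_crd(delta, r, n, alpha=0.05):
    """Conservative power for a single-factor, fixed-effects CRD with r groups
    and n per group, where delta = (max mean - min mean) / sigma.  Uses the
    worst-case configuration of means for the given delta, which gives
    noncentrality lambda = n * delta**2 / 2."""
    df1, df2 = r - 1, r * (n - 1)
    lam = n * delta**2 / 2
    fcrit = stats.f.ppf(1 - alpha, df1, df2)
    return 1 - stats.ncf.cdf(fcrit, df1, df2, lam)

# The worked example above: r = 6 groups and delta = 2 gives roughly 80%
# power with about n = 8 per treatment group.
print(power_crd(2.0, 6, 8))
```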
5.19 Scientic papers illustrating the methods of this chapter
5.19.1 Injury scores when trapping coyote with different trap designs
Darrow, P. A., Skiptstunas, R. T., Carlson, S. W., and Shivik, J. A. (2009). Comparison of injuries to coyotes from 3 types of cable foot restraints. Journal of Wildlife Management, 73, 1441-1444.
Available at: http://dx.doi.org/10.2193/2008-566
This paper uses a single-factor CRD design to compare the mean injury scores when coyotes are trapped using different trap designs.
Some questions to think about when reading this paper:
Draw a picture of the experimental design
Why and how was randomization done?
Why were the veterinarians blinded?
Examine Table 1. Do you understand what is being reported? [At this point, you don't know how to compute the se by hand, but you could get these from JMP. Think of the creel example and the "were there enough lifeboats - yes/no" example.]
Examine Table 2. Do you understand what is being reported? Draw a sample graph showing approximate side-by-side confidence intervals for the ISO score. Do the results from your plot match the results from the formal F-test and subsequent HSD? Why?
What hypotheses were being tested in the various comparisons? [Careful about stating the hypotheses!]
How would you set up the data table to include ALL of the relevant data captured in this study? For example, give a few lines of (hypothetical) data as it would appear in a JMP data file.
Why did they use Tukey's HSD following an ANOVA?
How did they do a comparison of injury scores by specialist?
What method did they use to compare injury scores by sex? (because there are only 2 sexes, could
you use a different statistical procedure?)
Suppose you are planning a future study where differences of 10 in the ISO score are biologically important. How many animals in each trap would you need to have reasonable power to detect this difference (if it existed)? [A good exam question would have you fill in the JMP dialogue box.]
The authors did not use blocking/stratification/pairing in this experiment. How could you modify this design to include a blocking variable?
The paper also used a regression analysis which is covered in a later chapter.
Chapter 6
Single factor - pairing and blocking
Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
6.2 Randomization protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
6.2.1 Some examples of several types of block designs . . . . . . . . . . . . . . . . . . 399
6.3 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
6.3.1 Does the analysis match the design? . . . . . . . . . . . . . . . . . . . . . . . . 403
6.3.2 Additivity between blocks and treatments . . . . . . . . . . . . . . . . . . . . . . 404
6.3.3 No outliers should be present . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
6.3.4 Equal treatment group standard deviations? . . . . . . . . . . . . . . . . . . . . . 406
6.3.5 Are the errors normally distributed? . . . . . . . . . . . . . . . . . . . . . . . . . 407
6.3.6 Are the errors independent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.4 Comparing two means in a paired design - the Paired t-test . . . . . . . . . . . . . . . 408
6.5 Example - effect of stream slope upon fish abundance . . . . . . . . . . . . . . . . 409
6.5.1 Introduction and survey protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 409
6.5.2 Using a Differences analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
6.5.3 Using a Matched paired analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 415
6.5.4 Using a General Modeling analysis . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.5.5 Which analysis to choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
6.5.6 Comments about the original paper . . . . . . . . . . . . . . . . . . . . . . . . . 420
6.6 Example - Quality check on two laboratories . . . . . . . . . . . . . . . . . . . . . . . 421
6.7 Example - Comparing two varieties of barley . . . . . . . . . . . . . . . . . . . . . . 427
6.8 Example - Comparing prep of mosaic virus . . . . . . . . . . . . . . . . . . . . . . . 432
6.9 Example - Comparing turbidity at two sites . . . . . . . . . . . . . . . . . . . . . . . 437
6.9.1 Introduction and survey protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.9.2 Using a Differences analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
6.9.3 Using a Matched paired analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 442
6.9.4 Using a General Modeling analysis . . . . . . . . . . . . . . . . . . . . . . . . . 443
6.9.5 Which analysis to choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
6.10 Power and sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
6.11 Single Factor - Randomized Complete Block (RCB) Design . . . . . . . . . . . . . . . 449
6.11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
6.11.2 The potato-peeling experiment - revisited . . . . . . . . . . . . . . . . . . . . . . 449
6.11.3 An agricultural example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
6.11.4 Basic idea of the analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
6.12 Example - Comparing effects of salinity in soil . . . . . . . . . . . . . . . . . . . . . . 453
6.12.1 Model building - fitting a linear model . . . . . . . . . . . . . . . . . . . . . 455
6.13 Example - Comparing different herbicides . . . . . . . . . . . . . . . . . . . . . . . . 461
6.14 Example - Comparing turbidity at several sites . . . . . . . . . . . . . . . . . . . . . 468
6.15 Power and Sample Size in RCBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
6.16 Example - BPK: Blood pressure at presyncope . . . . . . . . . . . . . . . . . . . . . . 476
6.16.1 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
6.16.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
6.16.3 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
6.17 Final notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
6.18 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
6.18.1 Difference between pairing and confounding . . . . . . . . . . . . . . . . . . . . 488
6.18.2 What is the difference between a paired design and an RCB design? . . . . . . . . 489
6.18.3 What is the difference between a paired t-test and a two-sample t-test? . . . . . . 489
6.18.4 Power in RCB/matched pair design - what is root MSE? . . . . . . . . . . . . . . 490
6.18.5 Testing for block effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
6.18.6 Presenting results for blocked experiment . . . . . . . . . . . . . . . . . . . . . . 491
6.18.7 What is a marginal mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
6.18.8 Multiple experimental units within a block? . . . . . . . . . . . . . . . . . . . . 492
6.18.9 How does a block differ from a cluster? . . . . . . . . . . . . . . . . . . . . . . . 492
6.1 Introduction
In the completely randomized design, a complete randomization of experimental units to treatments was
performed. This randomization ensured that the effects of all possible other variables that might affect the
response are, on average, equal in all treatment groups. Consequently, differences in the group means can
be attributed to the treatments.
In some cases, it is known or suspected in advance that a variable not of primary interest to the
experimenter will affect the results, and it is possible to group experimental units into clusters (or blocks or
strata) where units within a cluster have similar values of this other variable. By changing the experimental
design slightly, it is possible to design a more powerful experiment that adjusts for the potential effects of
this additional explanatory variable.
For example, suppose that an experiment was to be performed to investigate the effect of a drug in
lowering blood pressure. A group of test subjects is available.
In a completely randomized design, 1/2 of the test subjects would be assigned at random to the control
group to receive a placebo, and 1/2 of the test subjects would be assigned to the drug group. By randomizing,
the effects of other, uncontrolled variables such as amount of exercise, metabolism, diet, etc., would be
equal, on average, between the two groups. However, these other uncontrolled variables would result in a
large variation in blood pressure within each group making it harder to detect any changes. [Recall that
power for this type of experiment is related to the ratio of the difference in means to the standard deviation
within groups.]
The design can be improved by treating each subject with both the placebo and the drug (in random
order). Now each subject serves as a control for these other variables, and the difference in blood pressure
readings will be free (we hope) of the effects of these other variables. This is known as a paired design
or, more generally, as a blocked design. This design is not perfect: one still has to worry about carry-over
effects (e.g. the response to the second treatment might be affected by what happened in the first treatment),
and about the interaction of the blocking factor with the treatment (i.e., perhaps people with high blood
pressure react differently to the drug than people with low blood pressure). It is possible to block by more
than one variable, e.g. the subjects could be further grouped by initial blood pressure levels, but this is beyond
the scope of this course.
This example with two treatment levels can be extended to a randomized complete block experiment
where 2 or more levels are randomized within each block.
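The variance-reduction argument behind pairing can be sketched with a small simulation. Everything below (sample size, effect size, variance components) is hypothetical and purely illustrative:

```python
import random
import statistics

# Hypothetical setup: each subject carries a large personal baseline blood
# pressure (driven by exercise, metabolism, diet, etc.); the drug lowers
# blood pressure by a fixed 5 units on top of that baseline.
random.seed(42)
n = 30
baseline = [random.gauss(120, 15) for _ in range(n)]  # between-subject spread

placebo = [b + random.gauss(0, 3) for b in baseline]  # within-subject noise
drug = [b - 5 + random.gauss(0, 3) for b in baseline]

# Two-group (CRD) view: each group's standard deviation is inflated by the
# 15-unit subject-to-subject variation, which swamps the 5-unit drug effect.
sd_group = statistics.stdev(placebo)

# Paired view: differencing within each subject cancels the baseline,
# leaving only the small measurement noise around the drug effect.
diffs = [d - p for d, p in zip(drug, placebo)]
sd_paired = statistics.stdev(diffs)

print(f"group sd = {sd_group:.1f}, paired sd = {sd_paired:.1f}")
```

Because power depends on the ratio of the mean difference to the relevant standard deviation, the much smaller paired standard deviation translates directly into higher power for the same number of subjects.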
In field biology, common blocking variables are site of the experiment or biogeoclimatic zone.
Some other examples of blocked designs are:
Honey bee colonies are stacked on pallets, three per pallet. Investigators wish to determine which of
three brands of a chemical treatment is most effective in killing a bee mite. They randomly assign the
three treatments within each pallet ensuring that each pallet receives all three treatments. The blocks
are the pallets; the factor is the chemical treatment; the levels are the three brands.
Resting heart rate varies considerably among people. Consequently, you may decide to measure a
person before and after exercise to see the change in heart rate. The blocks are people; the factor is
time; the levels are before and after exercise. Notice that in this experiment you can't randomize time -
this can introduce subtle problems into the analysis.[1]
Driving habits vary considerably among drivers. Consequently, you may decide to compare the durability
of different brands of tires by mounting all brands on the same car and doing a direct comparison
under the same driving conditions, rather than using a different car for each brand with different drivers
and (presumably) different driving conditions. The blocks are driver/car; the factor is brand of tire;
the levels are the particular brands chosen in the experiment.
[1] For the technical masochists in the audience, the subtle problem is that the error terms likely no longer have a compound symmetric covariance structure. Measurements that are closer together in time will be more related than measurements that are distant in time.
There are two seemingly different experimental procedures and analyses:
1. Paired design. There are two treatment levels. A paired t-test is used to analyze the data.
2. Blocked design. There are two or more treatment levels. An ANOVA is used to analyze the data.
The paired t-test is a special case of a more general ANOVA approach, and the two approaches give
identical results for designs with exactly 2 treatment levels. In cases with more than 2 levels, the paired
approach cannot be used.
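The equivalence can be verified numerically. The sketch below, with made-up data for six pairs, computes the paired t statistic by hand and then the F statistic from the corresponding RCB ANOVA (blocks = pairs); the two agree exactly, with F = t²:

```python
import statistics

# Made-up paired data: six blocks (pairs), each measured under two treatments
trt1 = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3]
trt2 = [4.9, 5.4, 4.5, 6.2, 5.1, 5.8]
n = len(trt1)

# --- Paired t-test: mean difference over its standard error ---
diffs = [a - b for a, b in zip(trt1, trt2)]
t = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)

# --- RCB ANOVA treating the pairs as blocks ---
grand = statistics.mean(trt1 + trt2)
block_means = [(a + b) / 2 for a, b in zip(trt1, trt2)]
trt_means = [statistics.mean(trt1), statistics.mean(trt2)]

ss_trt = n * sum((m - grand) ** 2 for m in trt_means)    # 1 df
ss_blk = 2 * sum((m - grand) ** 2 for m in block_means)  # n-1 df
ss_tot = sum((y - grand) ** 2 for y in trt1 + trt2)
ss_err = ss_tot - ss_trt - ss_blk                        # n-1 df

f = (ss_trt / 1) / (ss_err / (n - 1))
print(f"F = {f:.4f}, t^2 = {t * t:.4f}")  # identical
```

The identity holds because, with exactly two levels, the treatment sum of squares reduces to a function of the mean difference and the error sum of squares reduces to a function of the variance of the differences.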
The null hypothesis is exactly the same regardless of whether the design is a CRD or an RCB! More generally,
for fixed effects, the hypotheses are always about the mean responses at different levels. Typically the null
hypothesis is H: no effect of factor X upon the mean response, or H: the mean response for each level of
factor X is the same. The alternate hypothesis is that there is some effect of factor X upon the mean response,
or that the mean response differs among the levels of factor X.

The reason that the hypotheses are the same is that hypotheses are concerned with treatment effects. The
treatment structure is quite independent of the experimental unit structure (i.e., an RCB, CRD, sub-sampling,
or split-plot design) and the randomization structure (i.e., whether randomization was complete).
Advantages and Disadvantages of an RCB
Why choose an RCB over a CRD, or vice-versa? Here are the advantages and disadvantages of each
design.
CRD
  Advantages:
  - Easy to construct the design.
  - Easy to analyze even if the number of replications differs in each group.
  - Can be used for any number of treatments.
  Disadvantages:
  - Experimental units are presumed to be homogeneous so that complete randomization is appropriate.

RCB
  Advantages:
  - One source of heterogeneity among experimental units can be accounted for.
  Disadvantages:
  - May be difficult to get large enough blocks if you have a large number of treatments.
  - More complex analysis if doing it by hand; otherwise the design must be properly specified to the computer package.
  - Can be a complicated analysis if many values go missing, but in many cases modern software handles missing values without problems.
In general, it is almost always advantageous to block. If the blocking is successful, you can substantially
increase the power of your design to detect differences; if blocking is unimportant, there is very little
loss of efficiency in the design.
What is the difference between treating blocks as simply blocks and treating blocks as another
factor? In some cases, the creation of blocks is dependent upon a variable that looks like a factor. For
example, blocks could be formed based upon (hypothesized) fertility differences among fields. The key
difference between treating blocks as simply nuisance variables and treating them as a factor is how you go
about measuring the block variable. If blocks are formed upon (hypothesized) differences in fertility among
plots, it is NOT necessary to measure the actual fertility levels, nor is it necessary to restrict the number of
fertility levels to a small number of levels. If you wanted to treat fertility as a factor, you would normally
only have a few levels (e.g. low, medium, high), you would be forced to measure the fertility of each plot
of land, and you would like to have replicates of each level of fertility. You are then in the realm of two-factor
designs, which are discussed in other chapters.
6.2 Randomization protocol
A single factor, randomized complete block design (RCB) has the following attributes:
1. There is a single size of experimental unit.
2. Experimental units can be grouped into clusters (or blocks or strata) and within each block, experimental
units are as similar as possible.
3. Within each block, experimental units are completely randomized to treatments (independently within
each block) such that every treatment occurs once and only once in each block.
The restricted randomization procedure is a key point of blocked designs - the randomization takes place
independently within each block.
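This restricted randomization is easy to carry out; the short sketch below (site and treatment names are hypothetical, echoing the fertilization example that follows) shuffles the full treatment list independently within every block:

```python
import random

treatments = ["none", "low", "high"]
blocks = ["site 1", "site 2", "site 3", "site 4", "site 5", "site 6"]

random.seed(1)
layout = {}
for block in blocks:
    order = treatments[:]  # every treatment appears exactly once per block...
    random.shuffle(order)  # ...but in an independently chosen random order
    layout[block] = order

for block, order in layout.items():
    print(block, order)
```

Each block is guaranteed to be complete (all treatments present), while the plot-to-treatment assignment still varies freely from block to block.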
The most common violation of this protocol is incomplete randomization within each block. For example,
in a drug study comparing the blood pressure on a drug vs. the blood pressure on a placebo,
the placebo may always be given before the drug. In this case, it is impossible to know if the drug caused any
change in mean blood pressure; perhaps the different amount of sunlight between the two occasions caused
the change.
6.2.1 Some examples of several types of block designs
Suppose that the Ministry of Forests is interested in studying the effects of supplemental fertilization on the
growth of seedlings after planting. Three levels of supplemental fertilization will be used (none, low, or high
amounts of fertilization). The experiment will be conducted at six different test sites around the province. At
each location, three recently replanted forest plots are available for use in the experiment. Here are several
possible designs.
Completely randomized design - no blocking
Plot number at each site
Site 1 2 3
---------------------
1 0 high 0
2 low low high
3 high low 0
4 low 0 high
5 low 0 low
6 0 high high
---------------------
The experimental units were completely randomized to the treatments ignoring the site groupings. There
is no guarantee that each site receives each treatment level. Consequently comparisons between treatment
levels (e.g. high vs. low) include extra variation because different locations are used for the two treatments.
Randomized complete block design - RCB design
Plot number at each site
Site 1 2 3
---------------------
1 0 high low
2 low 0 high
3 high low 0
4 low 0 high
5 high 0 low
6 0 high low
---------------------
Within each block, all treatments occur once and only once and within each block randomization was
performed independently of the randomization in other blocks.
Randomized complete block design - RCB design - missing values
Plot number at each site
Site 1 2 3
---------------------
1 0 high low
2 low 0 *
3 high low 0
4 low 0 high
5 high 0 low
6 0 high low
---------------------
Within each block, all treatments occur once and only once, and within each block randomization was performed
independently of the randomization in other blocks. However, during the experiment, some plots
were rendered unusable (e.g. damaged by deer) and some treatments could not be measured (e.g. plot 3 in site
2). In the past (when computations were done by hand), this made the analysis more difficult. With modern
software (e.g. NOT Excel!), this usually doesn't cause many problems unless there is a substantial number of
missing values. You may wish to seek advice on the analysis of blocked designs with missing values.
Incomplete block design - not an RCB
Plot number at each site
Site 1 2
---------------
1 0 high
2 low 0
3 high low
4 low 0
5 high 0
6 0 high
---------------------
In some cases, blocks may not be large enough to contain all treatments. The blocks are incomplete, but
randomization still takes place within each block and independently of other blocks. Seek advice on the
design and analysis of such designs. For example, the above design is not balanced, as not all pairs of
treatments occur equally often (the 0 vs. high pair occurs more often than other pairs).
Generalized randomized complete block design
Plot number at each site
Site 1 2 3 4
---------------------------
1 0 high low low
2 low 0 high 0
3 high low 0 0
4 low 0 high high
5 high 0 low high
6 0 high low low
---------------------------
In some cases, the blocks are large enough for some or all of the treatments to be replicated within each
block. This provides additional information about the variability of treatments within blocks. In addition,
because you now know the variability of responses within blocks for the same treatment, the
experimenter can conduct a formal statistical test to examine whether the block-treatment additivity assumption
holds in this experiment.
For this reason, I recommend the use of the latter design (the Generalized Randomized Complete
Block Design) whenever possible. The design gives you the advantage of blocking - increased precision. As
well, it allows you to empirically test the block-treatment assumption of additivity.
The balanced case of a GRCB design would have an equal number of replicates of each treatment in
each block (e.g. 2 replicates within each block). However, this is not essential if the blocks cannot be made
big enough. While there is no formal rule, it seems sensible to spread the replication over the different
treatments among the blocks, i.e. in some blocks replicate treatment levels a1 and a2, while in other blocks
replicate a2 and a3, then a1 and a3, etc. Seek advice on the design and analysis of such designs.
Two nice references are:
Addelman, S. (1969). The Generalized Randomized Block Design. American Statistician, 23, 35-36. http://dx.doi.org/10.2307/2681737

Gates, C. E. (1995). What really is experimental error in block designs? American Statistician, 49, 362-363. http://dx.doi.org/10.2307/2684574
6.3 Assumptions
As noted earlier, each and every statistical procedure makes a number of assumptions about the data that should
be verified as the analysis proceeds. Some of these assumptions can be examined using the data at hand;
others, often the most important, can only be assessed using the meta-data about the experiment.
The set of assumptions for the single-factor RCB is, for the most part, identical to those for the single-factor
CRD. To make this chapter self-contained, they are repeated in detail below, and the key differences
for an RCB are highlighted.
6.3.1 Does the analysis match the design?
THIS IS THE MOST CRUCIAL ASSUMPTION!
In this chapter, the data were collected under a blocked design.
It is not possible to check this assumption by examining the data; you must spend some time examining
exactly how the treatments were randomized to experimental units, and whether the observational unit is the
same as the experimental unit (i.e. the meta-data about the experiment). This comes down to the RRRs of
statistics - how were the experimental units randomized, what are the numbers of experimental units, and
are there groupings of experimental units (blocks)?
The key features of an RCB design are the restricted randomization of the treatments within each block,
and that the blocks are complete (i.e. every treatment occurs exactly once in every block).[2]
Typical problems are lack of randomization within each block, pseudo-replication, and (ironically) a lack
of blocking.
Was randomization complete? If you are dealing with an analytical survey, then verify that the samples
are true random samples (not merely haphazard samples). If you are dealing with a true experiment, ensure
that there was a complete randomization of treatments to experimental units.
What is the true sample size? Are the experimental units the same as the observational units? In
pseudo-replication (Hurlbert, 1984), the experimental and observational units are different. An example of
pseudo-replication is an experiment with fish in tanks where the tank is the experimental unit (e.g. chemicals
added to the tank) but the fish are the observational units.
Is blocking present? The experimental units should be grouped into more homogeneous units with
restricted randomizations within each group. Note how this differs from a CRD where there is no a-priori
grouping of experimental units and there is complete randomization of treatments to experimental units.
[2] The assumption of a complete block can be relaxed somewhat through either additional replicates in each block or incomplete blocks. In either case, please seek additional help in the design and analysis of such designs.
6.3.2 Additivity between blocks and treatments
THIS IS A CRUCIAL ASSUMPTION that allows the analysis to be interpreted!
A crucial feature in the design and analysis of blocked designs is the assumption of additivity between
treatment effects and block effects upon the response variable.
This assumption states that the difference in the mean response between any two treatments is the same
in all blocks. The overall mean of the responses from each treatment may vary among blocks, but the
differences must be constant.
For example, consider an experiment with three treatments conducted in two blocks. Table 6.1 shows
true population means under the assumption of additivity.
Table 6.1: Population means under assumption of additivity
Treatment
Block a b c
1 10 20 15
2 35 45 40
Notice that the response is generally higher in block 2 than in block 1; in fact, the difference between the
two blocks is a constant value of 25 for each treatment. The mean of treatment b is always 10 units higher than
the mean for treatment a in both blocks. The difference in the means for any pair of treatments is the same
in all blocks.
Note that the above table refers to POPULATION means - the sample means may not show this strict
additivity, as an artifact of the sampling process.
Another name for the assumption of additivity is no interaction between treatments and blocks.
There are two ways in which additivity can fail. First, the units may be measured on the wrong scale.
Consider Table 6.2 of means where additivity does not hold:
Table 6.2: Population means when the assumption of additivity is false but correctable
Treatment
Block a b c
1 10 20 15
2 100 200 150
In Table 6.2, the treatment effects are not the same in both blocks (why?), but notice that the mean for
treatment b is always twice the mean for treatment a in both blocks. This suggests that the effects of treatment
are multiplicative rather than additive, and that the analysis should proceed on the log scale. Indeed, consider
the same values in Table 6.3 after a log-transformation:[3]
Table 6.3: Population means when the assumption of additivity is false but correctable. A log-transform is
applied.
Treatment
Block a b c
1 2.30 3.00 2.71
2 4.61 5.30 5.01
The treatment effects in Table 6.3 are now additive on the log-scale (how can you tell?).
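Using the numbers from Tables 6.2 and 6.3, a few lines of arithmetic confirm that the treatment differences are unequal on the raw scale but constant on the log scale:

```python
import math

# Population means from Table 6.2 (treatments a, b, c in two blocks)
block1 = [10, 20, 15]
block2 = [100, 200, 150]

# Raw scale: the b-minus-a difference changes between blocks
raw_diff_1 = block1[1] - block1[0]  # 10
raw_diff_2 = block2[1] - block2[0]  # 100

# Log scale: the b-minus-a difference is log(2) in BOTH blocks,
# so additivity is restored after the transformation
log_diff_1 = math.log(block1[1]) - math.log(block1[0])
log_diff_2 = math.log(block2[1]) - math.log(block2[0])
print(raw_diff_1, raw_diff_2, round(log_diff_1, 3), round(log_diff_2, 3))
```

The same check applied to any other pair of treatments (e.g. a vs. c) gives a constant log-scale difference as well, which is exactly what "additive on the log scale" means.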
In some cases, no transformation will correct non-additivity, as illustrated in Table 6.4.
Notice that the response in Table 6.4 is generally higher in block 2 than in block 1, but there is no simple
pattern to the differences among means. The difference between the mean response for treatment a and
treatment b changes in the two blocks. In block 1, treatment b has a mean that is 10 units larger than the
mean for treatment a, while in block 2, the mean for b is 5 units lower than the mean for a. This is also
known as an interaction between treatments and blocks.
The discussion of the concepts of interaction and additivity resumes when two-factor designs are examined
in a later chapter.
The assumption of additivity between blocks and treatments can be assessed (roughly) in two ways,
depending on whether the design is a simple paired experiment or a more complex design with more than two
treatments.[4]
In a paired design, the difference between the two treatments is plotted against the sample
average of the two treatments for each pair. The data should show a rough scatter in a band parallel to the
X-axis (this will be demonstrated in the examples).
In blocked designs with more than two treatments, plot the data points against the treatments and join
points from the same block (again, this will be demonstrated in the examples).[5]

Table 6.4: Population means when the assumption of additivity is false and not correctable.
Treatment
Block a b c
1 10 20 15
2 35 30 25

[3] Either the ln or log transformation can be used. The results will differ by a constant.
[4] A more formal test that compares complete additivity to a multiplicative effect is possible. It is called Tukey's test for additivity, but it is rarely more useful than the simple plots outlined in this section.
6.3.3 No outliers should be present
This is the same assumption as in the CRD.
As seen in previous chapters, the idea behind the tests for equality of means is, ironically, to compare
the relative variation among means to the variation within each group. Outliers can severely distort estimates
of the within-group variation and severely distort the results of the statistical test.
Construct side-by-side scatterplots of the individual observations for each group. Check for any outliers
- are there observations that appear to be isolated from the majority of observations in the group? Try to find
a cause for any outliers. If the cause is easily corrected and not directly related to the treatment effects (like
a data recording error), then alter the value. Otherwise, include a discussion of the outliers and their potential
significance to the interpretation of the results in your report. One direct way to assess the potential impact
of an outlier is to do the analysis with and without the outlier.[6] If there is no substantive difference in the
results - be happy!
6.3.4 Equal treatment group standard deviations?
This is the same assumption as in the CRD.
Every procedure for comparing means that is a variation of ANOVA assumes that all treatment groups
have the same standard deviation. This can be informally assessed by computing the sample standard deviation
for each treatment group to see if they are approximately equal.[7] Because the sample standard deviation
is quite variable over repeated samples from the same population, exact equality is not expected. In fact,
unless the ratio of the standard deviations is extreme (e.g. more than a 5:1 ratio between the smallest and
largest value), the assumption is likely satisfied.
More formal tests of the equality of variances can be constructed (e.g. Levene's test is recommended),
but these are not covered in this course.
Often you can anticipate an increase in the amount of chance variation with an increase in the mean.
For example, traps with an ineffective bait will typically catch very few insects. The numbers caught may
typically range from 0 to under 10. By contrast, a highly effective bait will tend to pull in more insects, but
also with a greater range. Both the mean and the standard deviation will tend to be larger.
[5] The Matching Column option in the Analyze->Fit Y-by-X platform gives this.
[6] Note that the analysis of RCB designs with missing values is difficult by hand, but modern software should be able to handle this without difficulty if the number of missing values is small. You may wish to seek help if your design has substantial numbers of missing values.
[7] Note that in an RCB, the sample standard deviation will INCLUDE extra variability from the block effects, but as this extra variability is equal for all treatments, it is still valid to compare the raw sample standard deviations.
If you have equal or approximately equal numbers of replicates in each group, and not too many
groups, heteroscedasticity (unequal standard deviations) will not cause serious problems with an Analysis of
Variance. However, heteroscedasticity does cause problems for multiple comparisons (covered later in this
section). By pooling the data from all groups to estimate a common standard deviation, you can introduce serious bias into
the denominator of the statistic that compares the means for those groups with larger standard deviations.
In fact, you will underestimate the true standard errors of these means, and could easily mistake a large
chance error for a real, systematic difference.
I recommend that you start by constructing side-by-side dot plots comparing the observations for
each group. Does the scatter seem similar for all groups? Then compute the sample standard deviations of
each group. Is there a wide range in the standard deviations? [I would be concerned if the ratio of the largest
to the smallest standard deviation is 5x or more.] Plot the standard deviation of each treatment group against
the mean of each treatment group. Does there appear to be a relationship between the standard deviation and
the mean?[8]
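The informal standard-deviation check can be sketched in a few lines; the trap counts below are hypothetical, invented purely for illustration:

```python
import statistics

# Hypothetical insect counts from traps under three baits
groups = {
    "control": [2, 0, 3, 1, 2, 4],
    "bait A": [12, 9, 15, 11, 14, 10],
    "bait B": [30, 22, 41, 35, 27, 38],
}

# Sample standard deviation of each treatment group
sds = {name: statistics.stdev(obs) for name, obs in groups.items()}
ratio = max(sds.values()) / min(sds.values())

for name, sd in sds.items():
    print(f"{name}: sd = {sd:.2f}")
print(f"largest/smallest sd ratio = {ratio:.1f}")

# Rule of thumb from the text: start worrying when the ratio reaches about 5:1
if ratio >= 5:
    print("ratio is extreme - consider a transformation (e.g. log or sqrt)")
```

Notice that, as described above for count data, the group with the larger mean also has the larger standard deviation, which is precisely the pattern a log or square-root transform tends to cure.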
If you have paired data, plot the differences vs. the average for each pair. Is the scatter roughly equal
across the entire graph?
Sometimes, transformations can be used to alleviate some of the problems. For example, if the response
variable is a count, a log or square-root transform often makes the variance approximately equal in all groups.
If all else fails, procedures are available that relax this assumption (e.g. the two-sample t-test using
the Satterthwaite approximation, or bootstrap methods).[9] Many researchers believe that non-parametric
statistics are also suitable in these cases. CAUTION: despite their name, non-parametric methods often
make similar assumptions about the variation in the populations - this is a common fallacy about non-parametric
methods.[10] Consequently, I actually find very few cases where I am forced to use non-parametric
methods and generally don't use them.
6.3.5 Are the errors normally distributed?
This assumption is the same as in the CRD.
ANOVA assumes that observations WITHIN each treatment group are normally distributed.[11] However,
because ANOVA estimates treatment effects using sample averages, the assumption of normality is less
important when sample sizes within each treatment group are reasonably large. Conversely, when sample
sizes are very small in each treatment group, any formal test for normality will have low power to detect
non-normality. Consequently, this assumption is most crucial in the cases where you can do the least to detect it!
8
Taylors Power Law is an empirical rule that relates the standard deviation and the mean. By tting Taylors Power Law to these
plots, the appropriate transform can often be determined. This is beyond the scope of this course.
Footnote 9: These will not be covered in this course.
Footnote 10: For example, rank-based methods, where the data are ranked and the ranks used in the analysis, still assume that the populations
have equal variances.
Footnote 11: Notice that the normality assumption applies within treatment groups, so simply pooling over treatment groups and testing
for normality is not appropriate.
© 2012 Carl James Schwarz. December 21, 2012
CHAPTER 6. SINGLE FACTOR - PAIRING AND BLOCKING
I recommend that you construct side-by-side dot-plots or boxplots of the individual observations for each
group. Does the distribution about the mean seem skewed? Find the residuals after the model is fit and
examine normal probability plots. Sometimes problems can be alleviated by transformations. If the sample
sizes are large, non-normality really isn't a problem.
Again if all else fails, a bootstrap procedure or a non-parametric method (but see the cautions above) can
be used.
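A bootstrap fallback of the sort mentioned above can be sketched as follows. This is a minimal percentile bootstrap on hypothetical paired differences; the data values are invented for illustration and are not from the notes:

```python
import random
import statistics

random.seed(1)

# Hypothetical paired differences (treatment minus control for each unit).
diffs = [6.0, -7.9, 0.5, 5.4, 0.8, -1.5, 4.7, 2.7, 1.8, 3.2]

# Percentile bootstrap: resample the differences with replacement and
# collect the mean of each resample.
boot_means = sorted(
    statistics.mean([random.choice(diffs) for _ in diffs])
    for _ in range(10000)
)
lo, hi = boot_means[249], boot_means[9749]  # approximate 95% interval

print(f"mean difference: {statistics.mean(diffs):.2f}")
print(f"bootstrap 95% CI: ({lo:.2f}, {hi:.2f})")
```

If the bootstrap interval excludes 0, that plays the same role as a small p-value in the t-test, without assuming normality of the differences.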
6.3.6 Are the errors independent?
This is the same assumption as in a CRD.
Another key assumption is that experimental units are independent of each other. For example, the
response of one experimental animal does not affect the response of another experimental animal.
This is often violated by not paying attention to the details of the experimental protocol. For example,
technicians get tired over time and give less reliable readings. Or the temperature in the lab increases
during the day because of sunlight pouring in from a nearby window and this affects the response of the
experimental units. Or multiple animals are housed in the same pen, and the dominant animals affect the
responses of the sub-dominant animals.
If the chance errors (residual variations) are not independent, then the reported standard errors of the
estimated treatment effects are incorrect. In particular, if different observations from the same group are
positively correlated (as would be the case if the replicates were all collected from a single location, and
you wanted to extend your inference to other locations), then you could seriously underestimate the standard
error of your estimates, and generate artificially significant p-values. This sin is an example of a type of
spatial pseudo-replication (Hurlbert, 1984).
I recommend that you plot the residuals in the order the experiment was performed. There should be
random scatter about 0. A non-random looking pattern should be investigated. If your experiment has large
non-independence among the experimental units, seek help.
6.4 Comparing two means in a paired design - the Paired t-test
This is the famous Two-sample paired t-test, first developed in the early 1900s. It is a special case of a
more general ANOVA approach to analyze this type of design.
In this experiment, the analyst realizes the variation among experimental units may be very large com-
pared to the difference in response caused by the treatment, and decides to conduct the experiment by doing
both treatments on each experimental unit. The data are paired because there is a pair of observations from
each experimental unit.
For example:
measuring every person's change in heart-rate using two different walking styles (in random order).
measuring every person before and after a drug is administered.
administering every person two drugs (in random order) and measuring the heart rate after each drug
is administered.
panel surveys where the same set of people is surveyed more than once.
This should be compared to the independent samples or completely randomized design where separate
experimental units are used for every treatment.
JMP has four different ways to analyze this simple experiment! All approaches give identical results but
the output may look a bit different for each approach. We will proceed by example.
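The payoff from pairing can be made concrete with the variance of a difference of two sample means: Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) − 2 Cov(Ȳ1, Ȳ2), so positive correlation within pairs shrinks the standard error of the estimated treatment effect. A small numeric sketch (the standard deviation, sample size, and correlations below are invented for illustration):

```python
import math

# Hypothetical per-unit standard deviation and number of pairs.
sd, n = 10.0, 20
var_mean = sd ** 2 / n  # variance of each sample mean

def se_of_difference(rho: float) -> float:
    """SE of (mean1 - mean2) when the two responses within a pair correlate by rho.

    rho = 0 corresponds to a completely randomized (unpaired) design.
    """
    return math.sqrt(2 * var_mean * (1 - rho))

for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}: se of difference = {se_of_difference(rho):.2f}")
```

With rho = 0.9 the standard error is only about a third of the unpaired value, which is exactly why blocking on stream, farm, plant, or person pays off when experimental units vary widely.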
6.5 Example - effect of stream slope upon fish abundance
6.5.1 Introduction and survey protocol
This example is based upon the paper
Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope?
Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.
A stream reach is a portion of a stream from 10 to several hundred metres in length that exhibits consistent
slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure
of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the
abundance of fish populations may be related to the slope of the stream.
Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout
populations, yet previous studies confound the effect of stream slope with other factors that influence trout
populations.
Past studies addressing this issue have used sampling designs wherein data were collected either using
repeated samples along a single stream or by measuring many streams distributed across space and time.
Reaches on the same stream will likely have correlated measurements, making the use of simple statistical
tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an
example of pseudo-replication and inference is limited to that particular stream.]
Inference from streams spread over time and space is made more difficult by the inter-stream differences
and temporal variation in trout populations if samples are collected over extended periods of time. This extra
variation reduces the power of any survey to detect effects.
For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a
large watershed. Within each stream, two reaches were identified: one in a low-slope environment and one in
a high-slope environment, where low and high slopes were defined as less than or greater than an approximate
2.5% gradient.
In each reach, fish abundance was determined using electro-fishing methods and the numbers converted
to a density per 100 m^2 of stream surface.
Here is the (fictitious, but based on the above paper) raw data:
Stream  Slope (%)  Slope Class  Density (per 100 m^2)
1 0.7 low 15.0
1 4.0 high 21.0
2 2.4 low 11.0
2 6.0 high 3.1
3 0.7 low 5.9
3 2.6 high 6.4
4 1.3 low 12.2
4 4.0 high 17.6
5 0.6 low 6.2
5 4.4 high 7.0
6 1.3 low 39.8
6 3.2 high 25.0
7 2.0 low 6.5
7 4.2 high 11.2
8 1.3 low 9.6
8 4.2 high 17.5
9 2.0 low 7.3
9 3.6 high 10.0
10 0.7 low 11.3
10 3.5 high 21.0
11 2.3 low 12.1
11 6.0 high 12.1
12 2.5 low 13.2
12 4.2 high 15.0
13 2.3 low 5.0
13 6.0 high 5.0
14 1.2 low 10.2
14 2.9 high 6.0
15 0.7 low 8.5
15 2.9 high 7.0
16 1.1 low 5.8
16 3.0 high 5.0
17 2.2 low 5.1
17 5.0 high 5.0
18 0.7 low 65.4
18 3.2 high 55.0
19 0.7 low 13.2
19 3.0 high 15.0
20 0.3 low 7.1
20 3.2 high 12.0
21 2.3 low 44.8
21 7.0 high 48.0
22 1.8 low 16.0
22 6.0 high 20.0
23 2.2 low 7.2
23 6.0 high 10.1
Notice that the density varies considerably among streams but appears to be fairly consistent within each
stream.
The raw data are available in the paired-stream.csv file in the Sample Program Library at http:
//www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into
SAS in the usual fashion:
data density;
   length slope_class $10.;
   infile 'paired-stream.csv' dlm=',' dsd missover firstobs=2; /* skip the headers in the data file */
   input stream slope slope_class density;
   logdensity = log(density);
   format logdensity 7.2;
run;
The first few lines of the raw data are shown below:
Obs slope_class stream slope density logdensity
1 high 1 4.0 21.0 3.04
2 low 1 0.7 15.0 2.71
3 high 2 6.0 3.1 1.13
4 low 2 2.4 11.0 2.40
5 high 3 2.6 6.4 1.86
6 low 3 0.7 5.9 1.77
7 high 4 4.0 17.6 2.87
8 low 4 1.3 12.2 2.50
9 high 5 4.4 7.0 1.95
10 low 5 0.6 6.2 1.82
This is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized
within stream; the randomization occurs by selecting streams at random from some larger population of
potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever
a randomization of experimental units to treatments cannot be performed.
Several possible analyses
With simple paired data, most packages give you a choice of ways to analyze the data. There are some
subtle differences in the models being fit (especially in the covariance matrix of the observations), but unless
your data are pathological, all of the analyses should give similar results.
The approaches differ in their generalizability to more complex situations. For example, the linear-models
approach will also work with blocked experiments with more than two levels of the factor, but the analysis-of-
differences approach cannot be used when there are more than two levels.
Lastly, the approaches also differ in how they treat missing values. For example, in the differences-of-
the-pair approach, a missing value for one member of a pair implies that the entire pair's data
is discarded. But the linear-models approach can extract some of the information from the single remaining
datapoint of the pair.
Contact me for more details about the subtle differences in the approaches.
SAS offers three basic approaches for the analysis of a simple paired experiment. These will be illustrated
below.
6.5.2 Using a Differences analysis
This is the simplest approach. In this approach, the difference in density is computed for each stream between
the low- and high-slope sections. The relevant hypothesis is that the mean difference is zero. A one-sample
t-test is used to test the hypothesis and to estimate the average difference.
Let:
μ_diff be the true population mean of the difference in densities;
n be the number of differences;
Ȳ_diff be the sample mean difference;
s_diff be the sample standard deviation of the differences.
1. Specify the hypothesis
H: μ_diff = 0
A: μ_diff ≠ 0
A difference of 0 in the population mean difference would correspond to no effect of stream slope.
This is a two-sided test, as we don't know a priori the effect of the different stream slopes upon the
density of fish.
2. Collect data
The data table will have to be converted to side-by-side columns (one for the low slope and one for
the high slope) and a new column computed corresponding to the difference in densities.
Proc Transpose can switch the orientation of the data:
proc sort data=density;
by stream slope_class;
run;
proc transpose data=density out=density2;
by stream;
id slope_class;
idlabel slope_class;
var density;
run;
Create a new variable in the dataset for the difference between the two densities measured on each
stream:
data density2;
   set density2;
   diff = high - low; /* explicitly compute the difference */
run;
This gives the first lines of the transposed dataset.
Obs stream _NAME_ high low diff
1 1 density 21.0 15.0 6.0
2 2 density 3.1 11.0 -7.9
3 3 density 6.4 5.9 0.5
4 4 density 17.6 12.2 5.4
5 5 density 7.0 6.2 0.8
6 6 density 25.0 39.8 -14.8
7 7 density 11.2 6.5 4.7
8 8 density 17.5 9.6 7.9
9 9 density 10.0 7.3 2.7
10 10 density 21.0 11.3 9.7
3. Compute a test statistic, p-value, and confidence intervals
Proc Ttest does the formal test of the hypothesis on the differences.
ods graphics on;
proc ttest data=density2;
   title2 'test if the mean difference is 0 - outlier included';
   var diff;
   ods output Statistics=Stat1;
   ods output TTests=Test1;
   ods output ConfLimits=CIMean1;
run;
ods graphics off;
which gives:
Variable    t Value    DF    Pr > |t|
diff           0.61    22      0.5500
and
Variable     N      Mean    Std Error    Lower Limit of Mean    Upper Limit of Mean
diff        23    0.7217       1.1889                -1.7439                 3.1874
A plot of the results is automatically produced using the ODS GRAPHICS option.
The estimated mean difference in density (high minus low) is 0.72 fish (se 1.19 fish) with a 95%
confidence interval ranging from -1.74 to 3.18 fish.
The test statistic is T = 0.61 with a two-sided p-value of 0.55.
If you computed the difference as low.density - high.density, then the statistics will have their signs
reversed, but you will come to the same conclusion.
4. Make a decision
Because the confidence interval includes the value 0, there is no evidence of a
difference in the mean density between high- and low-slope reaches.
The p-value of 0.55 also indicates that we would conclude that there is no evidence of a difference in
the mean density between the high- and low-slope reaches.
In the original paper, the authors made some very nice displays of the results from many paired tests
of the hypotheses in their study.
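The one-sample t computation on the differences can be cross-checked by hand. Here is a sketch in Python (not part of the notes) using the high − low differences from the data table earlier:

```python
import math
import statistics

# high - low density differences for the 23 streams, in table order.
diffs = [6.0, -7.9, 0.5, 5.4, 0.8, -14.8, 4.7, 7.9, 2.7, 9.7,
         0.0, 1.8, 0.0, -4.2, -1.5, -0.8, -0.1, -10.4, 1.8, 4.9,
         3.2, 4.0, 2.9]

n = len(diffs)
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n)  # se of the mean difference
t_stat = (mean_diff - 0) / se_diff  # test statistic for H: mu_diff = 0

print(f"n = {n}, mean = {mean_diff:.4f}, se = {se_diff:.4f}, t = {t_stat:.2f}")
# matches the SAS output: mean 0.7217, se 1.1889, t = 0.61
```

The two-sided p-value would then come from a t-distribution with n − 1 = 22 degrees of freedom, as in the SAS output.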
6.5.3 Using a Matched paired analysis
Yet another name for a paired design is a matched-pairs design. The advantage of this analysis is that some
of the assumptions can be explicitly examined, but it is functionally identical to the analysis of the differences
presented earlier.
Let:
μ_diff be the true population mean of the difference in densities;
n be the number of differences;
Ȳ_diff be the sample mean difference;
s_diff be the sample standard deviation of the differences.
1. Specify the hypothesis
H: μ_diff = 0
A: μ_diff ≠ 0
A difference of 0 in the population mean difference would correspond to no effect of stream slope.
This is a two-sided test, as we don't know a priori the effect of the different stream slopes upon the
density of fish.
2. Collect data
The density variable must again be split according to the value of the slope-class variable. After the
split, it is not necessary to create a new column for the difference.
3. Compute a test statistic, p-value, and confidence intervals
Proc Ttest can again be used, but now specify the two variables using the paired statement.
ods graphics on;
proc ttest data=density2;
   title2 'paired t-test on density data';
   paired high*low;
   ods output Statistics=Stat2;
   ods output TTests=Test2;
   ods output ConfLimits=CIMean2;
run;
ods graphics off;
which gives:
Variable 1    Variable 2    Difference    t Value    DF    Pr > |t|
high          low           high - low       0.61    22      0.5500
and
Variable 1    Variable 2    Difference     N      Mean    Std Error    Lower Limit of Mean    Upper Limit of Mean
high          low           high - low    23    0.7217       1.1889                -1.7439                 3.1874
A plot of the results is automatically produced using the ODS GRAPHICS option.
The output contains the estimated mean difference and the 95% confidence interval for the population
mean difference.
The estimate of the mean difference (except for a possible sign reversal), its se, 95% confidence interval, test
statistic, and p-value are identical to the previous analysis.
4. Make a decision
The same decision is made.
6.5.4 Using a General Modeling analysis
This is the most general approach to the analysis of experimental data. As such, great care must be taken
to correctly specify the model to obtain the correct analysis. If you specify the wrong model, you will get
incorrect results!
Variation in the response variable (the density) is caused by several sources, some of which we can
explicitly identify. As in the previous chapter, terms in the model will correspond to the treatment structure,
the experimental unit structure, and the randomization structure. As in the previous chapter, randomization
is complete (within each block) and hence this effect does not explicitly appear in the model.
The first source of variation comes from the treatment structure: here the different slope classes (high
or low gradient) in the stream may give rise to differences in the density of the fish. The second source of
variation is from the experimental unit structure. This actually has two terms, corresponding to blocks (which
are specified explicitly by specifying the stream measured) and to experimental units within blocks, which are
implicitly assumed. Lastly, as in the previous chapter, randomization is complete (within each block) and
hence this effect does not explicitly appear in the model.
The model is then written with terms corresponding to these sources of variation in a simplified syntax:
Density = SlopeClass Stream
The terms SlopeClass and Stream can be written in any order.
The crucial assumption of additivity between treatments and blocks is implicit and is specified by not
specifying an interaction term. [Refer to the chapter on two-factor designs for a discussion of interaction.]
1. Specify the hypothesis
The hypothesis can be specified in terms of the means for each group:
H: μ_low = μ_high
A: μ_low ≠ μ_high
If there were no effect of the slope class upon the mean density, we would expect the mean densities
to be equal for both groups. It is a two-sided test again because we don't know a priori in which
direction the inequality would fall.
2. Collect data
The dataset must have separate columns for the factor slope_class and the blocking variable stream,
along with the response variable density. This is how the data were originally read in, so no changes
are needed.
3. Compute a test statistic, p-value, and confidence intervals
Use Proc Mixed in SAS to specify the model above. The Class statement tells SAS that each variable
is a factor and not a numerical value:
ods graphics on;
proc mixed data=density plots=all;
   title2 'Modelling approach using MIXED';
   class slope_class stream;
   model density = slope_class;       /* specify treatment variable here */
   random stream;                     /* specify blocking variable here */
   lsmeans slope_class / diff pdiff;  /* ask for differences among means */
   ods output tests3=Test3;
   ods output lsmeans=lsmeans3;
   ods output diffs=diffs3;
run;
ods graphics off;
This gives:
Effect         Num DF    Den DF    F Value    Pr > F
slope_class         1        22       0.37    0.5500
The plots to check the assumptions of the model are automatically generated by the ODS GRAPHICS
statement.
Unfortunately, most statistical packages FAIL to differentiate between terms corresponding to treat-
ment structures and blocking structures and treat both symmetrically. This can lead to some output
that is really not appropriate for terms corresponding to blocking factors as noted below.
The statistical significance of each effect in the model is summarized by a series of F-statistics. When
there are only two levels in a factor, the F-statistic is equivalent to the square of the t-statistic; i.e., the F-ratio
for the slope-class effect is 0.368, which is equal to the square of the t-statistic: 0.368 = 0.6071^2. The
p-value of 0.55 is identical to the previous results.
The test for a blocking variable is automatically computed, but is rarely of interest, and is usually not
appropriate. [12]
Estimates of the population marginal means and the differences in the marginal means are obtained
by the lsmeans statement in the SAS code seen earlier.
This gives:
Effect         slope_class    Estimate    Standard Error    DF    t Value    Pr > |t|
slope_class    high            15.4348            2.9129    22       5.30      <.0001
slope_class    low             14.7130            2.9129    22       5.05      <.0001
Effect         slope_class    _slope_class    Estimate    Standard Error    DF    t Value    Pr > |t|
slope_class    high           low               0.7217            1.1889    22       0.61      0.5500
This gives results identical to previous analyses.
4. Make a decision
The same decision would be made.
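The F = t² identity quoted above for a two-level factor is easy to verify numerically from the reported statistics:

```python
# t from the differences / paired analyses, and F from Proc Mixed (rounded).
t_stat = 0.6071
f_reported = 0.37

# With 1 numerator df, the F statistic is exactly the square of the t statistic.
print(f"t squared = {t_stat ** 2:.4f}")  # 0.3686, which rounds to the reported 0.37
```

This is why the mixed-model F-test and the paired t-test necessarily give the same p-value here.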
6.5.5 Which analysis to choose?
As all approaches give the same results, which is best? This depends upon the comfort level of the analyst.
In any case, the most important decision is the recognition that pairing (or blocking) has taken place. If
this decision is missed, then all the subsequent output is incorrect.
6.5.6 Comments about the original paper
In the original paper, the authors had three stream slope classes (low, medium, and high), and not all classes
were observed in each stream. Their study was an example of an incomplete block design. The authors then
used subsets of the data to examine each relevant pairing of slopes. A more refined analysis is possible using
Footnote 12: The question of testing for block effects can provoke a very stimulating discussion at statistical meetings; fist fights have nearly
broken out! I think this shows that some statisticians need to get out more.
incomplete block methodology. As well, as the authors have the actual stream slopes available, an analysis
of covariance is yet another approach to this problem. [13]
There is also some debate as to the appropriate variable to analyze. The authors used both an areal
density (i.e., per m^2 of area) and a volumetric density (i.e., per m^3 of water). An analysis of the log(density)
is also appropriate; this model would assume that the response to slope is multiplicative rather than additive.
6.6 Example - Quality check on two laboratories
Many medical and other tests are done by independent laboratories. How can the regulatory bodies ensure
that the work done by these laboratories is of good quality?
A paired design using split samples is often used as a quality control check on two laboratories.
For example, six samples of water are selected, and divided into two parts. One half is randomly chosen
and sent to Lab 1; the other half is sent to Lab 2. Each lab is supposed to measure the impurities in the water.
Here are the raw data:
Sample Lab 1 Lab 2
------------------
1 31.4 28.1
2 37.0 37.1
3 44.0 40.6
4 28.8 27.3
5 59.9 58.4
6 37.6 38.9
------------------
The raw data are available in the paired-labs.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual fashion:
data labs;
   infile 'paired-labs.csv' dlm=',' dsd missover firstobs=2;
   input sample lab1 lab2;
   diff = lab1 - lab2;
run;
Footnote 13: Using ANCOVA to analyze this experiment appears in the appropriate chapter.
The first few lines of the raw data are shown below:
Obs sample lab1 lab2 diff
1 1 31.4 28.1 3.3
2 2 37.0 37.1 -0.1
3 3 44.0 40.6 3.4
4 4 28.8 27.3 1.5
5 5 59.9 58.4 1.5
6 6 37.6 38.9 -1.3
Notice that the difference between the laboratories is small relative to the variation among the samples. You
are really interested in examining whether the mean difference between the laboratories is zero.
Does this satisfy the conditions laid out earlier for a paired design? Yes: the measurements from each
laboratory are related by the sample. There are two readings from each sample, one from each laboratory.
It would be quite inappropriate to analyze this design using the two-independent samples t-test as the
readings for lab 1 are not independent of the readings for lab 2.
Let:
μ_1 and μ_2 represent the population mean readings from lab 1 and lab 2, respectively;
n represent the number of pairs. Here n = 6. Note that it is not correct to say that our sample size is 12;
there were only 6 samples analyzed;
Ȳ_1 and Ȳ_2 represent the sample mean readings of lab 1 and lab 2, respectively.
The hypothesis testing proceeds in much the same fashion as before:
1. Specify the hypotheses:
We are not really interested in the actual impurity readings for the various samples, but rather only
interested in the difference. This can be specified as
H: μ_1 = μ_2, or μ_1 - μ_2 = 0, or μ_diff = 0
A: μ_1 ≠ μ_2, or μ_1 - μ_2 ≠ 0, or μ_diff ≠ 0
where μ_diff is the mean difference in the readings. Note again that the hypotheses are in terms of
population parameters and we are interested in testing if the difference is 0. [A difference of 0 would
imply no difference in the readings between laboratories, on average.]
2. Collect data:
We plot Lab1 against Lab2 with a reference line of X = Y:
proc sgplot data=labs;
   title2 'Plot of Lab1 vs Lab2';
   scatter y=lab1 x=lab2;
   lineparm x=0 y=0 slope=1;
run;
If both labs are measuring the impurities identically, all the points should fall exactly on the line
corresponding to a 0 difference. But there is variation in the readings, due to technical problems, etc.
and so the points will not fall exactly on the line. If there is no systematic bias, the points should be
randomly scattered above and below the zero line. However, if there is a systematic bias in one lab,
the points will consistently fall below or above zero.
3. Compute a test statistic and a p-value.
Proc Ttest is used with the two variables being specified: [14]
Footnote 14: You could also use Proc Ttest on the differences or Proc Mixed on the stacked data. These are illustrated in the complete SAS
program.
ods graphics on;
proc ttest data=labs;
   title2 'do a paired t-test';
   paired lab1*lab2;
   ods output Statistics=Stat2;
   ods output TTests=Test2;
   ods output ConfLimits=CIMean2;
run;
ods graphics off;
This gives:
Variable 1    Variable 2    Difference     t Value    DF    Pr > |t|
lab1          lab2          lab1 - lab2       1.83     5      0.1270
and
Variable 1    Variable 2    Difference     N      Mean    Std Error    Lower Limit of Mean    Upper Limit of Mean
lab1          lab2          lab1 - lab2    6    1.3833       0.7565                -0.5613                 3.3280
It also provides a graph of the results.
The report indicates that the mean difference is about 1.383 units, but the standard error is rather
large. Indeed, the 95% confidence interval indicates that the mean difference might be zero, since 0 is
included in the interval.
The test statistic is computed as:

T = (estimated difference - hypothesized difference) / (estimated se of difference)
  = (1.38 - 0) / 0.756
  = 1.83

This is compared to a t-distribution with df = number of pairs - 1 = 6 - 1 = 5.
Once again, you must decide between the two-sided and one-sided p-values. The alternative hypothesis
is two-sided (we didn't know in advance if the differences would be positive or negative), so the
two-sided p-value of 0.1270 is used.
4. Make a decision.
Because the p-value is large, we conclude that there is insufficient evidence against the hypothesis that
both laboratories give the same mean readings on the samples.
As before, we haven't proven that both labs give the same readings. We only failed to detect a difference
in the mean readings.
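For the laboratory data, the paired t computation is small enough to verify by hand. Here is a sketch in Python (not part of the notes), working directly from the two lab columns:

```python
import math
import statistics

# Impurity readings for the six split water samples.
lab1 = [31.4, 37.0, 44.0, 28.8, 59.9, 37.6]
lab2 = [28.1, 37.1, 40.6, 27.3, 58.4, 38.9]

diffs = [a - b for a, b in zip(lab1, lab2)]  # lab1 - lab2 for each sample
n = len(diffs)
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n)
t_stat = mean_diff / se_diff

print(f"mean = {mean_diff:.4f}, se = {se_diff:.4f}, T = {t_stat:.2f}, df = {n - 1}")
# matches the SAS output: mean 1.3833, se 0.7565, T = 1.83, df = 5
```

Note that the computation is driven entirely by the six within-sample differences; the large sample-to-sample variation in the raw readings never enters the standard error, which is the whole point of the paired design.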
6.7 Example - Comparing two varieties of barley
Two varieties of barley are to be compared to see if the mean yield for the second variety is different from
the mean yield for the first variety.
Ten farms (from various parts of Manitoba) were selected, and both varieties were planted on each farm.
Here are the raw data:
Farm V1 V2
-----------------
1 312 346
2 233 372
3 356 392
4 316 351
5 310 330
6 352 364
7 389 375
8 313 315
9 316 327
10 346 378
-----------------
The raw data are available in the paired-barley.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual fashion:
data farms;
   infile 'paired-barley.csv' dlm=',' dsd firstobs=2;
   input farm V1 V2;
   diff = V1 - V2;
   /* NOTE: The value of 233 on farm 2 is WRONG - it should read 333 */
run;
The first few lines of the raw data are shown below:
Obs farm V1 V2 diff
1 1 312 346 -34
2 2 333 372 -39
3 3 356 392 -36
4 4 316 351 -35
Obs farm V1 V2 diff
5 5 310 330 -20
6 6 352 364 -12
7 7 389 375 14
8 8 313 315 -2
9 9 316 327 -11
10 10 346 378 -32
Does this experiment satisfy the requirements that the observations are paired? How would this experi-
ment be different, if the experiment were not paired?
1. Formulate the hypotheses
Here we are interested in differences in either direction. Our hypotheses are:
H: μ_1 = μ_2, or μ_1 - μ_2 = 0, or μ_diff = 0
A: μ_1 ≠ μ_2, or μ_1 - μ_2 ≠ 0, or μ_diff ≠ 0
Notice that this is a two-sided alternative hypothesis, so we are interested in differences between the
means in either direction; i.e., we are interested to know if the mean of variety 1 is less than or greater
than the mean for variety 2.
2. Collect data and look at summary statistics:
In many cases, simple plots should be used first to examine if there are outliers or anomalies in the
data.
In the case of paired data, a natural validation is to plot the response to one treatment against the
response to the other treatment. All of the points should fall on an approximate straight line.
We plot V1 against V2 with a reference line of X = Y:
proc sgplot data=farms;
   title2 'Plot of V1 vs V2';
   scatter y=v1 x=v2;
   lineparm x=0 y=0 slope=1;
run;
There is quite an outlier present. Going back to the raw data, we see that in the second observation
there is a typo: the yield for variety 1 on farm 2 should be 333 rather than 233. The datum is corrected:
data farms;
set farms;
if farm=2 then v1=333;
diff=v1-v2;
run;
3. Find the test statistic, confidence intervals, and p-value
Proc Ttest is used with the two variables being specified:
ods graphics on;
proc ttest data=farms;
   title2 'do a paired t-test - outlier repaired';
   paired V1*V2;
   ods output Statistics=stat2b;
   ods output TTests=Test2b;
   ods output ConfLimits=CIMean2b;
run;
ods graphics off;
This gives:
Variable 1    Variable 2    Difference    t Value    DF    Pr > |t|
V1            V2            V1 - V2         -3.71     9      0.0048
and
Variable 1    Variable 2    Difference     N        Mean    Std Error    Lower Limit of Mean    Upper Limit of Mean
V1            V2            V1 - V2       10    -20.7000       5.5798               -33.3224                -8.0776
It also provides a graph of the results.
The estimated mean difference is -20.7 (SE 5.6) bu/acre, and the 95% confidence interval does not
cover 0, which implies that variety 1 appears to give a lower yield than variety 2, on average.
The test statistic is computed as:

T = (estimated diff - hypothesized diff) / (estimated se of diff)
  = (-20.7 - 0) / 5.580
  = -3.71

This will be compared to a t-distribution with 9 df (number of pairs - 1 = 10 - 1 = 9).
The two-sided p-value is 0.0048. The one-sided p-value for the alternative that variety 1 has a smaller
mean yield than variety 2 is 0.0024.
4. Make a decision
Because the p-value is small, we conclude that there is evidence against the hypothesis that the mean
yields are the same for both varieties. In fact, it appears that variety 1 gives a lower mean yield than
variety 2.
We have not proved that variety 1 gives a lower mean yield than variety 2. All we have shown is
that the observed data would be very unusual if the null hypothesis were true.
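The p-values above can be reproduced without SAS by integrating the t density with 9 df numerically. The sketch below uses Simpson's rule in plain Python; the helper functions are my own, not from the notes:

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t: float, df: int, upper: float = 200.0, steps: int = 200_000) -> float:
    """P(T > t) via Simpson's rule on [t, upper]; the tail beyond is negligible."""
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += t_pdf(t + i * h, df) * (4 if i % 2 else 2)
    return total * h / 3

# Barley example: T = -3.71 with 9 df.
p_one = upper_tail(3.71, 9)  # P(T <= -3.71) = P(T >= 3.71) by symmetry
print(f"one-sided p = {p_one:.4f}")      # about 0.0024
print(f"two-sided p = {2 * p_one:.4f}")  # about 0.0048
```

In practice you would use a library routine for the t distribution, but the computation makes clear that the two-sided p-value is simply twice the tail area beyond the observed |T|.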
6.8 Example - Comparing preparations of mosaic virus
A single leaf is taken from each of 11 different tobacco plants. Each leaf is divided in half, and each half is
given one of two preparations of mosaic virus. We wish to examine if there is a difference in the mean number
of lesions from the two preparations.
Here are the raw data:
Plant Prep1 Prep2
-----------------
1 18 14
2 20 15
3 9 6
4 14 12
5 38 32
6 26 30
7 15 9
8 10 2
9 25 18
10 7 3
11 13 6
-----------------
The raw data are available in the paired-virus.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual fashion:
data plants;
   infile 'paired-virus.csv' dlm=',' dsd missover firstobs=2;
   input plant prep1 prep2;
   diff = prep1 - prep2;
run;
The first few lines of the raw data are shown below:
Obs plant prep1 prep2 diff
1 1 18 14 4
2 2 20 15 5
3 3 9 6 3
4 4 14 12 2
5 5 38 32 6
6 7 15 9 6
7 8 10 2 8
8 9 25 18 7
9 10 7 3 4
10 11 13 6 7
Does this experiment satisfy the paired experiment setup? How would it be conducted as an independent
samples experiment?
1. Formulate the hypotheses:

   Here we are interested in differences in either direction. Our hypotheses are:

      H: μ1 = μ2, or μ1 - μ2 = 0, or μdiff = 0
      A: μ1 ≠ μ2, or μ1 - μ2 ≠ 0, or μdiff ≠ 0
2. Collect data and look at summary statistics:

   We plot Prep1 against Prep2 with a reference line of X = Y:

   proc sgplot data=plants;
      title2 'Plot of Prep1 vs Prep2';
      scatter y=prep1 x=prep2;
      lineparm x=0 y=0 slope=1;
   run;
Most of the differences are positive, but a potential outlier exists in observation 6. Careful inspection of the data records shows that it was a valid data point, and it was decided to retain the point in the subsequent analysis.[15]
3. Find the test-statistic and the p-value:

   Proc Ttest is used with the two variables being specified:[16]

   ods graphics on;
   proc ttest data=plants dist=normal;  /* dist=normal gives the se */
      title2 'do a paired t-test - outlier included';
      paired prep1*prep2;
      ods output Statistics=Stat2;
      ods output TTests=Test2;
      ods output ConfLimits=CIMean2;
   run;
   ods graphics off;

   [15] If this outlier plant is removed, the conclusions are even stronger that there appears to be a difference between the mean lesion count of preparation 1 and preparation 2.
   [16] You could also use Proc Ttest on the differences or Proc Mixed on the stacked data. These are illustrated in the complete SAS program.
This gives:
Variable 1  Variable 2  Difference      t Value  DF  Pr > |t|
prep1       prep2       prep1 - prep2      4.35  10    0.0014
and

Variable 1  Variable 2  Difference      N    Mean  Std Error  Lower Limit of Mean  Upper Limit of Mean
prep1       prep2       prep1 - prep2  11  4.3636     1.0025               2.1300               6.5973
It also provides a graph of the results:
The estimated mean difference is 4.3 (se 1.0) lesions and the 95% confidence interval does not cover the value 0. We find strong evidence of a difference in the mean number of lesions between the two preparations.

The test statistic is T = 4.3529. It will be compared to a t-distribution with df = (number of pairs - 1) = (11 - 1) = 10.

The two-sided p-value is 0.0014.
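The arithmetic behind this output can be verified by hand from the differences. The following Python sketch (an illustrative alternative for readers without SAS; it is not part of the original analysis) reproduces the mean difference, its standard error, and the paired t-statistic from the 11 plants:

```python
import math

# Differences (prep1 - prep2) for the 11 tobacco plants
diffs = [4, 5, 3, 2, 6, -4, 6, 8, 7, 4, 7]

n = len(diffs)
mean_d = sum(diffs) / n                                   # estimated mean difference
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)   # sample variance of the differences
se_d = math.sqrt(var_d / n)                               # standard error of the mean difference
t_stat = mean_d / se_d                                    # paired t-statistic on n - 1 = 10 df

print(round(mean_d, 4), round(se_d, 4), round(t_stat, 2))  # 4.3636 1.0025 4.35
```

These match the Mean, Std Error, and t Value columns of the Proc Ttest output above.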
4. Make a decision:
We conclude that there is evidence that the two preparations do not give the same mean number of lesions. In fact, it appears that Prep-1 gives, on average, between 2.1 and 6.6 more lesions than Prep-2 (taken from the 95% confidence interval).
6.9 Example - Comparing turbidity at two sites
6.9.1 Introduction and survey protocol
French Creek is located on Vancouver Island, B.C. and runs from rural through urban areas out to the sea.
Measurements of water quality are taken at several points along the creek. Interest lies in a comparison of the mean water quality among the sites. For example, is there a difference in mean turbidity (NTU) among the sites?
These measurements are taken approximately monthly, but twice a year, five measurements are taken within 30 days. The readings are synoptic at the sites, i.e. taken at roughly the same time (usually within a day of each other). Consequently, events such as a heavy rain would be expected to impact the water quality at all the sites.
Here are the raw data on turbidity at five sites. The sample time is in YYYY.MMDD format. Notice that there is a missing value for the NewHwy site at 2010.1116.
Turbidity (NTU) at five sites
SampleTime BB Coombs Grafton NewHwy WinchRd
2010.0427 1.9 1.1 0.9 1.9 1.4
2010.0525 1 0.7 0.6 0.7 1
2010.0622 1.8 0.6 0.3 0.7 1
2010.0726 0.7 0.3 0.8 0.5 0.8
2010.0816 0.7 0.3 1.5 0.6 0.9
2010.0824 0.8 0.5 0.4 0.4 1.4
2010.0831 0.9 0.3 1.1 0.6 1.6
2010.0907 0.7 0.3 1.6 0.5 2.5
2010.0914 0.8 0.3 0.2 0.4 0.7
2010.1020 0.5 0.4 0.2 0.3 0.6
2010.1026 4.3 2 1.2 2.3 1.1
2010.1102 7.4 4 2.1 4.2 1.9
2010.1109 2.5 1.3 1.3 2 1.3
2010.1116 1.9 1.1 0.6 0.8
2010.1229 5.3 3.8 4.3 6.2 3.1
We will start with a comparison of mean turbidity at sites BB and Coombs. A paired approach will be taken for the comparison of turbidity between the two sites and, as will be seen later, an RCB (Randomized Complete Block) approach will be used when all sites are compared simultaneously.
The raw data are available in the TurbidityCompare.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual fashion:
data turbidity;
   length SampleTime $10.;
   infile 'TurbidityCompare.csv' dlm=',' dsd missover firstobs=2;
   input SampleTime BB Coombs;
   logBB = log(BB);
   logCoombs = log(Coombs);
   format logBB logCoombs 7.2;
run;
The first few lines of the raw data are shown below:
Obs SampleTime BB Coombs logBB logCoombs
1 2010.0427 1.9 1.1 0.64 0.10
2 2010.0525 1.0 0.7 0.00 -0.36
3 2010.0622 1.8 0.6 0.59 -0.51
4 2010.0726 0.7 0.3 -0.36 -1.20
5 2010.0816 0.7 0.3 -0.36 -1.20
6 2010.0824 0.8 0.5 -0.22 -0.69
7 2010.0831 0.9 0.3 -0.11 -1.20
8 2010.0907 0.7 0.3 -0.36 -1.20
9 2010.0914 0.8 0.3 -0.22 -1.20
10 2010.1020 0.5 0.4 -0.69 -0.92
11 2010.1026 4.3 2.0 1.46 0.69
12 2010.1102 7.4 4.0 2.00 1.39
13 2010.1109 2.5 1.3 0.92 0.26
14 2010.1116 1.9 1.1 0.64 0.10
15 2010.1229 5.3 3.8 1.67 1.34
Several possible analyses  As in the Stream Slope example seen earlier, there are several (equivalent) ways to analyze this data. Before analyzing the data, you need to decide whether to deal with the raw data, or with a transformation of the raw data. A key assumption of a paired (and RCB) approach is that the differences (among sites) should be roughly equal regardless of the underlying mean.[17] In many cases, differences tend to get larger as the mean increases, indicating that the treatment effects are multiplicative rather than additive. For example, the reading at Site 1 could always be 50% (or 1.5 times) greater than the readings at Site 2.

In cases of a multiplicative effect, the log()[18] transform is often used.

[17] As you will see later, this is the assumption of additivity between treatments and blocks.
6.9.2 Using a Differences analysis
In this approach, the log(ratio)[19] of turbidity at Site BB to Site Coombs is computed for each sampling date.

The relevant hypothesis is that the mean log(ratio) is zero. A one-sample t-test is used to test the hypothesis and to estimate the average log(ratio).
Let:

   μ_log(ratio) be the true population mean of the log(ratio) in turbidity.
   n be the number of sample times.
   Y̅_log(ratio) be the sample mean log(ratio).
   s_log(ratio) be the sample standard deviation of the log(ratio).
1. Specify the hypothesis

   H: μ_log(ratio) = 0
   A: μ_log(ratio) ≠ 0

   A population log(ratio) of 0 would correspond to a population ratio of 1, which would correspond to no difference (on average) in turbidity between the two sites. This is a two-sided test as we don't know a priori if BB tends to have a higher or lower turbidity than Coombs.
2. Collect data

   A new variable is computed corresponding to the log(ratio). Create a new variable in the dataframe for the log(ratio) of turbidities measured at each date:

   data turbidity2;
      set turbidity;
      logratio = logBB - logCoombs;  /* explicitly compute the log-ratio */
   run;
   [18] Recall that log() is the natural logarithm.
   [19] This is equivalent to taking the difference of the log() of each reading.

   The first few lines after the log(ratio) is computed are:
Obs SampleTime BB Coombs logBB logCoombs logratio
1 2010.0427 1.9 1.1 0.64 0.10 0.54654
2 2010.0525 1.0 0.7 0.00 -0.36 0.35667
3 2010.0622 1.8 0.6 0.59 -0.51 1.09861
4 2010.0726 0.7 0.3 -0.36 -1.20 0.84730
5 2010.0816 0.7 0.3 -0.36 -1.20 0.84730
6 2010.0824 0.8 0.5 -0.22 -0.69 0.47000
7 2010.0831 0.9 0.3 -0.11 -1.20 1.09861
8 2010.0907 0.7 0.3 -0.36 -1.20 0.84730
9 2010.0914 0.8 0.3 -0.22 -1.20 0.98083
10 2010.1020 0.5 0.4 -0.69 -0.92 0.22314
11 2010.1026 4.3 2.0 1.46 0.69 0.76547
12 2010.1102 7.4 4.0 2.00 1.39 0.61519
13 2010.1109 2.5 1.3 0.92 0.26 0.65393
14 2010.1116 1.9 1.1 0.64 0.10 0.54654
15 2010.1229 5.3 3.8 1.67 1.34 0.33271
3. Compute a test statistic, p-value, and confidence intervals

   Proc Ttest does the formal test of the hypothesis.

   ods graphics on;
   proc ttest data=turbidity2;
      title2 'Examine the log ratio (BB/Coombs)';
      var logratio;
      ods output TTests=TestLogRatio2;
      ods output ConfLimits=CIMeanLogRatio2;
   run;
   ods graphics off;
which gives:

Variable   t Value  DF  Pr > |t|
logratio      9.64  14    <.0001
and

Variable      Mean  95% CI for Mean   Std Dev  95% CI for Std Dev  UMPU 95% CI for Std Dev
logratio    0.6820  (0.5303, 0.8337)   0.2739  (0.2006, 0.4320)    (0.1963, 0.4203)
A plot of the results is automatically produced using the ODS GRAPHICS option.

The estimated mean log(ratio) of turbidity (BB to Coombs) is 0.68 (se 0.07) with a 95% confidence interval ranging from (0.53 to 0.83).

The test statistic is T = 9.64 with a two-sided p-value < .0001.
If you computed the log(ratio) as Coombs/BB, then the statistics will have the signs reversed but you
will come to the same conclusion.
4. Make a decision

   Because the confidence interval excludes the value 0, there is strong evidence that the readings at BB tend to be larger than the readings at Coombs.
The very small p-value also indicates that we would conclude that there is strong evidence that the readings at the two sites are not comparable.

Many people don't understand log(ratio)s. The inverse transformation gives ratio = exp(log(ratio)) = exp(0.682) = 1.98, which indicates that the readings at BB are about 2 times those at Coombs. The se of the ratio is found using the Delta Method[20] to be se(ratio) = 0.07 x 1.98 = 0.14.
6.9.3 Using a Matched paired analysis
Yet another name for a paired design is a Matched paired design. The advantage of this analysis is that some of the assumptions can be explicitly examined, but it is functionally identical to the analysis of the differences (or log(ratio)s) presented earlier.
Let:

   μ_log(ratio) be the true population mean of the log(ratio) of the turbidity.
   n be the number of sample times.
   Y̅_log(ratio) be the sample mean log(ratio).
   s_log(ratio) be the sample standard deviation of the log(ratio).
1. Specify the hypothesis

   H: μ_log(ratio) = 0
   A: μ_log(ratio) ≠ 0

   A population mean log(ratio) of 0 would correspond to no consistent difference in the readings between the two sites.

   This is a two-sided test as we don't know a priori which site will be higher or lower than the other site.
2. Collect data
We need to compute individual log(turbidity) for each site in the usual way.
3. Compute a test statistic, p-value, and confidence intervals

   Proc Ttest is used again, but this time you specify the two variables to be used:

   ods graphics on;
   proc ttest data=turbidity;
      title2 'Paired t-test on log(turbidity) data';
      paired logBB*logCoombs;
   run;
   ods graphics off;

   [20] A Taylor Series approximation.
which gives the same output as seen earlier.
The output contains the estimated mean difference in the log(turbidity) and the 95% confidence interval for the population mean difference in the log(turbidity).

The estimate (except for a sign reversal), its se, the 95% confidence interval, the test statistic, and the p-value are identical (as they must be) to the previous analysis.
4. Make a decision
The same decision is made.
6.9.4 Using a General Modeling analysis
This is the most general approach to the analysis of experimental data. As such, great care must be taken
to correctly specify the model to obtain the correct analysis. If you specify the wrong model, you will get
incorrect results!
Variation in the response variable (the log(turbidity)) is caused by several sources, some of which we
can explicitly identify. As in the previous chapter, terms in the model will correspond to the treatment
structure, the experimental unit structure, and the randomization structure.
The rst source of variation comes from the treatment structure - here the sites on the creek may give
rise to differences in the log(turbidity). The second source of variation is from the sampling dates where
events such as a major rainfall inuence the readings at both sites.
The model is then written with terms corresponding to these sources of variation in a simplified syntax:

log(turbidity) = Site SamplingDate
The terms Site and SamplingDate can be written in any order.
The crucial assumption of additivity between treatments and blocks is implicit and is specified by not including an interaction term. [Refer to the chapter on two-factor designs for a discussion of interaction.]
1. Specify the hypothesis

   The hypothesis can be specified in terms of the mean log(turbidity) for each site:

   H: μ_BB = μ_Coombs
   A: μ_BB ≠ μ_Coombs

   If there was no effect of the site upon the mean log(turbidity), we would expect that the mean log(turbidity) would be equal for both sites. It is a two-sided test again because we don't know a priori in which direction the inequality would fall.
2. Collect data

   We need to stack the data (remember that we will be analyzing the log(turbidity)) by creating the transposed dataset:

   data turbidity3;
      set turbidity;
      length SiteName $10;
      sitename = 'BB';     logturbidity = logBB;     output;
      sitename = 'Coombs'; logturbidity = logCoombs; output;
      keep SampleTime SiteName logturbidity;
   run;
The first few lines of the stacked (transposed) data are shown below:
Obs SampleTime SiteName logturbidity
1 2010.0427 BB 0.64185
2 2010.0427 Coombs 0.09531
3 2010.0525 BB 0.00000
4 2010.0525 Coombs -0.35667
5 2010.0622 BB 0.58779
6 2010.0622 Coombs -0.51083
7 2010.0726 BB -0.35667
8 2010.0726 Coombs -1.20397
9 2010.0816 BB -0.35667
10 2010.0816 Coombs -1.20397
3. Compute a test statistic, p-value, and confidence intervals

   Use Proc Mixed in SAS to specify the model above. The Class statement tells SAS that each variable is a factor and not a numerical value:
   ods graphics on;
   proc mixed data=turbidity3 plots=all;
      title2 'Modelling approach as an RCB using MIXED';
      class SampleTime SiteName;
      model logturbidity = SampleTime SiteName / ddfm=KR;
      lsmeans SiteName / diff pdiff;
      ods output tests3=Test3;
      ods output lsmeans=lsmeans3;
      ods output diffs=diffs3;
   run;
   ods graphics off;
This gives:

Effect      Num DF  Den DF  F Value  Pr > F
SampleTime      14      14    40.13  <.0001
SiteName         1      14    92.97  <.0001
The plots to check the assumptions of the model are automatically generated by the ODS GRAPHICS statement.

Unfortunately, most statistical packages FAIL to differentiate between terms corresponding to treatment structures and blocking structures and treat both symmetrically. This can lead to some output that is really not appropriate for terms corresponding to blocking factors, as noted below.
The statistical significance of each effect in the model is summarized by a series of F-statistics. When there are only two levels in a factor, this is equivalent to the (T statistic)^2, i.e., the F-ratio for the Site effect is 92.97 which is equal to the square of the T-statistic, i.e., 92.97 = 9.64^2. The p-value of < .0001 is identical to the previous results.

The test for a blocking variable is automatically computed, but is rarely of interest, and is usually not appropriate.[21]
Estimates of the population marginal means and the differences in the marginal means are obtained by the lsmeans statement in the SAS code seen earlier.

This gives:
Effect    SiteName  Estimate  Standard Error  DF  t Value  Pr > |t|
SiteName  BB          0.3734         0.05002  14     7.47    <.0001
SiteName  Coombs     -0.3086         0.05002  14    -6.17    <.0001

Effect    SiteName  _SiteName  Estimate  Standard Error  DF  t Value  Pr > |t|
SiteName  BB        Coombs       0.6820         0.07073  14     9.64    <.0001
These results are identical to previous analyses.
4. Make a decision
The same decision would be made.
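The partition of variation that Proc Mixed performs can be reproduced by hand for this balanced design. The Python sketch below (an illustration of the RCB sums-of-squares arithmetic, not part of the original analysis) recovers the same F-statistics for the log(turbidity) data:

```python
import math

# Turbidity (NTU) at BB and Coombs; the blocks are the 15 sampling dates
bb     = [1.9, 1.0, 1.8, 0.7, 0.7, 0.8, 0.9, 0.7, 0.8, 0.5, 4.3, 7.4, 2.5, 1.9, 5.3]
coombs = [1.1, 0.7, 0.6, 0.3, 0.3, 0.5, 0.3, 0.3, 0.3, 0.4, 2.0, 4.0, 1.3, 1.1, 3.8]
y = [[math.log(v) for v in bb], [math.log(v) for v in coombs]]  # y[site][date]

t, b = 2, 15                              # number of treatments (sites) and blocks (dates)
grand = sum(sum(row) for row in y) / (t * b)
site_means  = [sum(row) / b for row in y]
block_means = [(y[0][j] + y[1][j]) / t for j in range(b)]

# Partition the total variation into treatment, block, and residual pieces
ss_site  = b * sum((m - grand) ** 2 for m in site_means)
ss_block = t * sum((m - grand) ** 2 for m in block_means)
ss_total = sum((v - grand) ** 2 for row in y for v in row)
ss_error = ss_total - ss_site - ss_block

ms_site  = ss_site / (t - 1)                 # 1 df
ms_block = ss_block / (b - 1)                # 14 df
ms_error = ss_error / ((t - 1) * (b - 1))    # 14 df

f_site  = ms_site / ms_error                 # matches the SiteName F of 92.97
f_block = ms_block / ms_error                # matches the SampleTime F of 40.13
```

With only two treatments, f_site is exactly the square of the paired t-statistic, which is why all of the approaches in this section agree.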
6.9.5 Which analysis to choose?
As all approaches give the same results, which is best? This depends upon the comfort level of the analyst. In any case, the most important decision is the recognition that pairing (or blocking) has taken place. If this decision is missed, then all the subsequent output is incorrect.

[21] The question of testing for block effects can provoke a very stimulating discussion at statistical meetings - fist fights have nearly broken out! I think this shows that some statisticians need to get out more.
6.10 Power and sample size determination
This is relatively straight-forward and identical to the procedure used for a completely randomized design except that one has to be a little careful to get the proper value for σ.
As before, power can be determined either using computer packages or using tables.
There are two ways of obtaining power and sample size computations. First, you can determine power by considering the differences directly (the pairing is implicit), or you can determine power by thinking of the two variables but making an adjustment for the paired design. JMP and SAS can do both methods, but R only provides a function for the first method.

In the first method, Proc Power is used. Specify the difference to detect and the standard deviation of the differences. The latter may be obtained from a pilot study, a literature review, or expert opinion.
For example, consider the stream-slope study. The standard deviation of the differences is 5.701. Suppose that a difference of 2 in the mean biomass is biologically important. This gives the call:
proc power;
   title2 'Finding power based on difference of 2';
   onesamplemeans
      mean   = 2       /* list the group means that differ by biological effect */
      stddev = 5.701   /* what is the standard deviation */
      alpha  = .05     /* what is the alpha level */
      power  = .80     /* target power */
      ntotal = .       /* solve for the sample size */
      ;                /* end of the onesamplemeans statement - don't forget it */
   ods output Output=power1;
run;
with output:

Alpha  Mean  Std Dev  Sides  Null Mean  Nominal Power  Actual Power  N Total
0.05      2      5.7      2          0            0.8         0.802       66
which implies that about 66 PAIRS of streams would be needed, or a total of about 132 observations over both slope classes need to be collected. The procedure can also be used to create a plot of the power as a function of sample size or difference to detect, etc.
In the second approach, the individual means for each variable are specified, along with their respective standard deviations, AND the correlation between the two variables. The correlation is needed to determine the increase in efficiency obtained by blocking: if the two variables had a correlation of zero, then there would be no improvement from blocking.

Again, consider the stream-slope study. Choose any two means whose difference is 2 (say 6 and 8), and then specify the standard deviations (14.9 and 12.0 respectively) and the correlation (0.92, so blocking will be very efficient).
We again use Proc Power, but now with the pairedmeans statement:

proc power;
   title2 'Finding power based on difference of 2';
   pairedmeans
      pairedmeans   = (6, 8)        /* list two group means that differ by biological effect */
      pairedstddevs = (14.9, 12.)   /* what are the standard deviations */
      corr          = 0.92          /* what is correlation between measurements? */
      alpha         = .05           /* what is the alpha level */
      power         = .80           /* target power */
      npairs        = .             /* solve for the number of pairs */
      ;                             /* end of the pairedmeans statement - don't forget it */
   ods output Output=power2;
run;
with output:

Analysis     Alpha  Mean1  Mean2  StdDev1  StdDev2  Corr  Sides  Nominal Power  Actual Power  N Pairs
PairedMeans   0.05      6      8     14.9       12  0.92      2            0.8         0.802       75
Now about 75 pairs are needed. The difference from the 66 in the previous analysis is due to rounding of results used in the power analysis.

Again, recall that the purpose of the power/sample size analysis is not so much to see if the sample size is 65 or 66, but rather is the sample size 65, 650, or 6500. After all, you are making reasonable guesses for the standard deviation, for the biological difference, etc. and there is no guarantee that these values are exactly correct for a future experiment.
6.11 Single Factor - Randomized Complete Block (RCB) Design
6.11.1 Introduction
This is a generalization of the Paired t-test to two or more treatments. The formal name is a Single-factor,
Randomized Complete Block, Analysis of Variance.
Recall the purpose of randomization. Randomization does not remove the influence of other, unknown and uncontrollable, factors on the response, but rather ensures that, on average, their effects are roughly equal in all the treatment groups. Then comparisons between the treatment groups will be free, on average, of the influences of these other factors, and any difference can be attributed to the treatments.

In a paired design, it was recognized that sometimes one of these uncontrollable factors is, in fact, under the control of the experimenter. In a paired design, you try and account for the influence of subject-to-subject variation by giving each subject both treatments (in random order if possible). Then the difference is computed for each subject, and the subject influence on the results is removed when the differences are analyzed.
A randomized complete block design (abbreviated RCB) is an extension of the paired design to the case of two or more treatments. In an RCB, each larger experimental unit is broken into smaller units. The treatments are randomized to the smaller experimental units so that every treatment appears (usually exactly once) in every block. The randomization is done independently for every larger unit.
6.11.2 The potato-peeling experiment - revisited
Recall the potato-peeling experiment. The three treatments were (i) peeling the potato with a potato peeler,
(ii) peeling the potato with a paring knife, and (iii) peeling the potato with a peeler held in the non-dominant
hand. The experimental results were analyzed for evidence of treatment effects on the mean time to peel a
potato.
It seems clear that it ought to take substantially longer to peel a potato when a person holds the peeler
in the wrong hand. Yet the experimental results showed no compelling evidence of any treatment effects.
How could this be? There are two explanations. Either there are no substantial treatment effects, or else the
treatment effects are substantial but the experiment was not able to detect them.
As you saw earlier, an experiment that has little chance of detecting substantial treatment effects is said
to lack power. The power depends upon several attributes of the experiment, but in particular it depends
upon the sample size and the effect size.
What could the experimenter do here to gain more power? He could try to obtain more replicates by getting more people to peel potatoes. Alternatively, he could design a more sophisticated experiment to try and reduce unexplainable noise, denoted as σ. If σ can be decreased, power will also increase. The key point is that by changing the experimental design, you increase the power by reducing unexplained variation - not by changing the actual means!
The key question is then how can the unexplained variation be reduced?
The problem with this experiment is that it leaves a large source of variability uncontrolled. Some people are likely to be faster at peeling potatoes than others, regardless of which method they use. In the completely randomized design that was used, people were randomly allocated to groups. This means that differences between people contributed to the chance errors (the unexplained variation) in the model. This made it difficult to detect differences in the means.

Another way of expressing this flaw is that we are comparing three means, each of which is subject to substantial chance fluctuations generated by different people's dexterity. If we were to ask each person to peel three potatoes, one according to each treatment protocol, then we would be able to compare each individual's times. Differences between individuals no longer enter into these comparisons.
Experimental designs like this are called randomized block designs. The experimental observations
are collected in blocks. Here, each person provides a block of three observations. Within each block,
there should be complete randomization. For example, the assignment of potatoes, and the ordering of the
treatments ought to be randomized.
The term, randomized blocks, has particular appeal in agricultural experimentation. Experimental units
are often plots of ground. If there is considerable variability between plots, it makes sense to group them into
blocks, within each of which the conditions are more homogeneous. If the number of plots in each block is
the same as the number of treatments, then treatments can be randomly assigned to plots within each block,
with each treatment being applied to exactly one plot within each block.
The analysis of variance for a randomized block design is very similar to that for a completely random-
ized design. The main difference is that we can now estimate not only a set of treatment effects, but also a
set of block effects.
6.11.3 An agricultural example
It is well known that farms have different overall fertility levels in the soil. Suppose that you wish to compare the yields of three varieties of wheat (denoted a, b, and c) and you have 5 farms. Each farm would be divided into three fields, and the varieties of wheat would be randomized to the fields so that every variety appears on every farm. One possible randomization would be:
One possible randomization in a RCB
Farm 1 Farm 2 Farm 3 Farm 4 Farm 5
a b a c b
c a b a c
b c c b a
Notice that every variety is grown in every farm, and that the randomization is done separately for each farm. Now if you compared the average (over all five farms) of variety a and b, both groups would have the same set of five farms, and presumably the influence of the 5 farms will then be equal in both groups, and the difference in averages will be free of any farm influences. The blocks are the farms; the factor is variety; and the treatment levels are the varieties chosen in the experiment.
Compare the above randomization to that from a Completely Randomized Design. In a CRD, you randomize the 5 replicates of the three varieties over all 15 fields, ignoring any farm boundaries. One possible randomization would be:
One possible randomization in a CRD
Farm 1 Farm 2 Farm 3 Farm 4 Farm 5
a c a a b
b b a b c
c c c b a
Notice in a CRD, there is no guarantee that every farm receives every variety. The danger in a CRD is that you might be unlucky and happen to randomize one variety only to highly fertile farms, and so any comparison among varieties will include farm differences.
6.11.4 Basic idea of the analysis
The basic idea of an analysis for a RCB is to again analyze the differences, rather than the individual observations. Unfortunately, when you have more than 2 treatment levels, there are a large number of pairs to analyze.

In this case, we again try to build a statistical model to reflect reality. In a RCB design, we again build a model that contains terms corresponding to the treatment structure (the factor of interest), the experimental unit structure (the blocking variable and an implicit term for the smaller experimental units), and the randomization structure (again, because a complete randomization is done within each block, this term is generally left out).
The model for the single factor RCB using the simplified structure is:
Response = Treatment Blocks
where Response, Treatment, Blocks are replaced by the appropriate variable. The terms corresponding to
treatments and blocks can be in any order.
There is yet another complication in randomized block designs that, fortunately, makes no difference in the test for treatment effects or in the estimation of the differences between means, and can usually be ignored. This relates to questions about how blocks were formed. Recall that the blocks are simply groups of experimental units that are somehow more similar within the blocks compared to experimental units in other blocks. How were the blocks formed and what blocks would be used if this experiment were to be repeated? The answer to this last question has implications for the proper analysis of the data.
The simplest case is where the blocks would be reused if the experiment were to be repeated. In this case, the blocks are known as fixed effects and there is no modification to the above model. Most packages assume that blocks are fixed effects.
However, suppose that you would use a new set of blocks if the experiment were to be repeated. In this case, it is conceptualized that the blocks chosen in a particular experiment are a random sample from a larger, infinite set of possible blocks. Then the Blocks are treated as another source of random variation with mean 0 and variance σ²_blocks that is independent of any residual variation. This is usually specified in the model syntax as:

Response = Treatment Blocks(R)

where the (R) informs the analyst and the computer package that blocks are a random effect. It is in this situation that many computer packages (even today) have trouble. Fortunately, as noted earlier, the test statistics are still computed properly, estimates of the differences between means are correct, and multiple comparisons are done properly. Usually the only problem occurs when trying to estimate the marginal mean for each treatment level. This whole problem belongs to a class of problems called mixed models which even today provokes controversy among statisticians!
As noted earlier, computer packages make no distinction between treatment effects and blocking effects and treat them symmetrically. This implies that the total variation observed in the sample is partitioned into sources corresponding to differences among the treatment means, differences among blocks, and unexplained, residual, variation. These computations are quite complex and won't be explored in much detail except when necessary.
After the partitioning of the variation, an F-statistic is computed (much like in a single factor, completely
randomized design in the last section) as the ratio of the variation among the sample means to the unexplained
variation. [Notice that the unexplained variation will have the variation due to blocks removed before the
ratio is taken.] As before, if the F-statistic is large, this provides evidence against the hypothesis of no
difference in the means among the treatments. The strength of the evidence is measured by the p-value
which, as before, measures the probability of observing the data if the hypothesis were true, i.e., a measure
of the consistency of the data with the null hypothesis.
As in the single factor, completely randomized design, finding evidence against the null hypothesis does not indicate which means may differ from each other. Once again, multiple comparison procedures will be necessary.
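The partition just described can be written out directly. A small Python sketch with a made-up 3-treatment by 2-block layout shows the treatment, block, and residual sums of squares adding to the total, and the F-statistic built from them:

```python
# rows = treatments, columns = blocks (made-up numbers, for illustration only)
data = [[12.0, 10.0],
        [15.0, 13.0],
        [ 9.0,  8.0]]
t, b = len(data), len(data[0])
grand = sum(sum(row) for row in data) / (t * b)
trt_means = [sum(row) / b for row in data]
blk_means = [sum(data[i][j] for i in range(t)) / t for j in range(b)]

ss_trt = b * sum((m - grand) ** 2 for m in trt_means)
ss_blk = t * sum((m - grand) ** 2 for m in blk_means)
ss_resid = sum((data[i][j] - trt_means[i] - blk_means[j] + grand) ** 2
               for i in range(t) for j in range(b))
ss_total = sum((data[i][j] - grand) ** 2 for i in range(t) for j in range(b))

# the partition: ss_total = ss_trt + ss_blk + ss_resid (up to rounding)
F_trt = (ss_trt / (t - 1)) / (ss_resid / ((t - 1) * (b - 1)))
```

Note that the residual sum of squares is computed from its own residuals, not by subtraction, so the additivity of the partition is a genuine check rather than a tautology.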
Both methods will be illustrated for one of the examples.
c 2012 Carl James Schwarz 452 December 21, 2012
CHAPTER 6. SINGLE FACTOR - PAIRING AND BLOCKING
6.12 Example - Comparing effects of salinity in soil
We wish to investigate the effect of salinity in the soil on the growth of salt marsh plants. Experimental fields of land were located at an agricultural field station, and each field was divided into six smaller plots. Each of the smaller plots was treated with a different amount of salt (measured in ppm) and the biomass at the end of the experiment was recorded.
Here is the experimental layout:
One possible randomization of the experiment (RCB)
Value in each square is the salt concentration (ppm)
Field 1 Field 2 Field 3 Field 4
15 30 10 20
25 35 15 25
30 15 25 30
20 10 20 10
10 25 30 15
35 20 35 35
Notice that every block (the field) has every treatment (all six salt concentrations) and that the randomization in each block was done independently of every other block.
The factor salt has quite a different feel compared to other factors we have seen in the past. For example,
previous experiments had factors that had categorical levels (e.g. low, medium, or high), rather than actual
numerical values. Does this have an impact on the analysis of the data?
The key difference between treating the levels of the factor as qualitative (e.g. nominal or ordinal scale) vs. quantitative (e.g. continuous scale) is how you think the response variable (yield) varies in relation to the factor. If you believe that the response is linear (and only linear) then treating the salt levels as quantitative (continuous) is appropriate; this is called a regression analysis and is covered in later chapters. However, treating the salt levels as qualitative (e.g. nominal or ordinal scale) allows for ANY response function other than a straight line with 0 slope (which corresponds to zero effect). Hence you would be able to detect a threshold effect, or a quadratic effect, or any other pattern in addition to a linear effect by doing an ANOVA rather than a REGRESSION.
Here are the raw data. Notice that when the raw data is presented, it is often sorted within each field, but this does not necessarily correspond to the order within each field.
Yield (biomass - kg)
Salt Field 1 Field 2 Field 3 Field 4
10 11.8 15.1 22.6 7.1
15 21.3 22.3 19.8 9.9
20 8.8 8.1 6.1 1.0
25 10.4 8.5 8.2 2.8
30 2.2 3.3 6.1 0.7
35 8.4 7.3 5.2 2.2
Notice that the fourth block (field 4) appears, in general, to have a lower yield than the other three fields. This is equivalent to a person having a generally lower resting heart rate before exercise than other people when you investigated the effect of exercise.
The raw data are available in the salt.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data biomass;
   length block $10.;
   infile 'salt.csv' dlm=',' dsd missover firstobs=2;
   input salt biomass block;
run;
The first few lines of the raw data are shown below:
Obs block salt biomass
1 Field1 10 11.8
2 Field1 15 21.3
3 Field1 20 8.8
4 Field1 25 10.4
5 Field1 30 2.2
6 Field1 35 8.4
7 Field2 10 15.1
8 Field2 15 22.3
9 Field2 20 8.1
10 Field2 25 8.5
11 Field2 30 3.3
12 Field2 35 7.3
Obs block salt biomass
13 Field3 10 22.6
14 Field3 15 19.8
15 Field3 20 6.1
16 Field3 25 8.2
17 Field3 30 6.1
18 Field3 35 5.2
19 Field4 10 7.1
20 Field4 15 9.9
21 Field4 20 1.0
22 Field4 25 2.8
23 Field4 30 0.7
24 Field4 35 2.2
The data must be entered in stacked column format. Notice that you must have a column corresponding
to Blocks (although you can name it differently).
If using SAS, both of these variables MUST be specified as CLASS variables in the procedures that do the analysis.
Failing to declare that numeric variables are to be treated as factors is likely the most common error in analyzing blocked designs using statistical packages.
If you leave salt specified as a continuous variable, you will get a regression analysis rather than an ANOVA (see the discussion at the start of this example). But unless you look at the output carefully, you will not know that the analysis is WRONG!
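The reason the CLASS declaration matters can be seen in the model matrix each coding generates: a factor contributes one indicator column per level (six parameters in all, with room for any pattern of means), while a continuous variable contributes a single slope column (a straight line only). A sketch in Python, purely illustrative:

```python
levels = [10, 15, 20, 25, 30, 35]             # the six salt concentrations
salt = [s for s in levels for _ in range(4)]  # four fields at each level

# CLASS (factor) coding: intercept + one 0/1 indicator per non-reference level
X_factor = [[1] + [1 if s == lev else 0 for lev in levels[1:]] for s in salt]

# Continuous coding: intercept + the numeric value itself (fits a line only)
X_continuous = [[1, s] for s in salt]

n_factor_params = len(X_factor[0])          # 6 parameters: any pattern of means
n_regression_params = len(X_continuous[0])  # 2 parameters: a straight line
```

The factor coding spends 5 degrees of freedom on salt and can track any shape of response; the continuous coding spends 1 and can only fit a line, which is exactly the ANOVA vs. regression distinction discussed at the start of this example.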
6.12.1 Model building - fitting a linear model
The analysis of complex experimental designs is generally called linear modelling. The linear term refers to the way in which effects in the model are used to predict the response variable and does not imply that a straight line is being fit. A linear model can be used for the analysis of experimental designs, regression models, and combinations of experimental factors and continuous variables.
The two primary procedures for the analysis of designed experiments are Proc GLM and Proc Mixed. For reasons that are beyond this course, I prefer the latter.
Many textbooks (and this course) have a tendency to treat the analysis of experimental data in a cookbook fashion, i.e., identify the design and then use an analysis method specific for this design. Unfortunately, this approach hides a very general theory that subsumes all of these specific analyses. The linear model platforms in most software are general platforms that can be used for the analysis of ANY and EVERY design. As such they are very powerful, and produce much output, some of which is not applicable for some analyses.
The key step in the use of this platform is the writing of an appropriate statistical model.
1. Statistical model
The statistical model should identify the treatment structure, experimental unit structure, and randomization structure in the experiment. As noted earlier, this leads to the model:
Biomass = salt block
This shorthand notation indicates that the response variable (biomass) is affected by salt concentration and the blocks. Again, the effect of experimental units is implicitly assumed, and there are no terms corresponding to randomization effects as randomization was complete within each block.
2. Formulate the hypothesis:
We are interested in testing if the mean biomass is the same at the various salt concentrations:
H: μ10 = μ15 = μ20 = . . . = μ35, or all means are equal
A: not all the means are equal or at least one mean is different from the rest
Notice this is the same hypothesis as in the Single factor - CRD - ANOVA hypothesis. In general,
ANOVA almost always tests the equality of means and you have to look very carefully at the design
to distinguish between the various ANOVA procedures.
3. Collect some data and compute summary statistics
The data is entered in the stacked column format as before. Notice that you must have a column
corresponding to Blocks (although you can name it differently). The CLASS statement is used in
Proc Mixed to indicate that variables are to be treated as factors and not as continuous variables.
We first construct a profile plot to show how the biomass varies as a function of salt for each block using Proc Sgplot:
proc sgplot data=biomass;
   title2 'preliminary plot to check additivity';
   series y=biomass x=salt / group=block;
run;
There appears to be an effect of salt concentration, particularly when the concentration is 20 ppm or
higher. There is considerable variability within each salt concentration. However this variability has
two components - one due to different blocks used in the experiment, and the other due to simple
random variation.
It appears that there is not much of a difference between Blocks 1-3, but Block 4 appears to have a consistently lower response than the other blocks. This also appears to show that the treatment and block effects are additive as the lines are roughly parallel.
This is exactly the situation envisioned in a randomized block design.
We also get the raw means and standard deviations and check to see that the sample standard deviations are about equal in all factor levels:
proc tabulate data=biomass;
   title2 'Summary of means and std dev';
   class salt;
   var biomass;
   table salt, biomass*(n*f=5.0 mean*f=7.2 std*f=7.2);
run;
biomass
salt   N   Mean    Std
10     4   14.15   6.52
15     4   18.33   5.71
20     4    6.00   3.52
25     4    7.48   3.27
30     4    3.08   2.28
35     4    5.78   2.73
4. Find the test-statistic and compute a p-value:
We fit the model using Proc Mixed. You could also use Proc GLM.
ods graphics on;
proc mixed data=biomass plot=all;
   title2 'Analysis using Proc Mixed';
   class salt block;                  /* the two effects in the model */
   model biomass = salt block / ddfm=kr;
   lsmeans salt / diff adjust=tukey;  /* multiple comparisons - tukey adjustment */
   ods output tests3=Test3;
   ods output lsmeans=lsmeans3;
   ods output diffs=diffs3;
run;
ods graphics off;
This gives the following output:
Effect   Num DF   Den DF   F Value   Pr > F
salt        5        15      17.71    <.0001
block       3        15       9.42    0.0010
This gives the F-test for the effect of Salt concentration. The F-ratio is 17.7084 and the p-value is
< 0.0001. We ignore the test for blocks as these are NOT an experimental factor in the design.
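As a cross-check, the F-statistics in the table can be reproduced by hand from the raw data using the sum-of-squares partition described earlier. A Python sketch (not part of the SAS analysis):

```python
# biomass by salt level (rows) over the four fields (columns), from the data table
data = {
    10: [11.8, 15.1, 22.6, 7.1],
    15: [21.3, 22.3, 19.8, 9.9],
    20: [ 8.8,  8.1,  6.1, 1.0],
    25: [10.4,  8.5,  8.2, 2.8],
    30: [ 2.2,  3.3,  6.1, 0.7],
    35: [ 8.4,  7.3,  5.2, 2.2],
}
t, b = len(data), 4
ys = [y for row in data.values() for y in row]
grand = sum(ys) / (t * b)

ss_trt = b * sum((sum(row) / b - grand) ** 2 for row in data.values())
blk_means = [sum(row[j] for row in data.values()) / t for j in range(b)]
ss_blk = t * sum((m - grand) ** 2 for m in blk_means)
ss_total = sum((y - grand) ** 2 for y in ys)
ss_resid = ss_total - ss_trt - ss_blk

mse = ss_resid / ((t - 1) * (b - 1))   # residual mean square on 15 df
F_salt  = (ss_trt / (t - 1)) / mse     # about 17.71, as in the table
F_block = (ss_blk / (b - 1)) / mse     # about 9.42, as in the table
```

The hand computation reproduces both F-statistics, confirming that the block variation has been removed from the unexplained variation before the ratio is taken.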
We can check the basic assumptions of the analysis by looking at the residual summary panel:
There is some evidence that the residuals are not evenly distributed over the salt levels, but this isn't too bad.
5. Make a decision
Because the p-value is small, we conclude that there is evidence that the mean biomass differs among
the salt concentrations.
Once again, we have not proved that the means are not all the same. We have only collected good evidence against them being the same. We may have made a Type I error, but the chances of it are rather small (this is what the p-value measures).
At this point we still don't know which means are different.
6. If you find evidence against the null hypothesis, use a multiple comparison procedure
Examine first the section on least-square means. We obtain the least-square means and differences among the means from the LSmeans statement in Proc Mixed. Notice that we used a Tukey multiple comparison procedure to control the overall error rate.
Effect   salt   Estimate   Standard Error   DF   t Value   Pr > |t|
salt 10 14.1500 1.3865 15 10.21 <.0001
salt 15 18.3250 1.3865 15 13.22 <.0001
salt 20 6.0000 1.3865 15 4.33 0.0006
salt 25 7.4750 1.3865 15 5.39 <.0001
salt 30 3.0750 1.3865 15 2.22 0.0424
salt 35 5.7750 1.3865 15 4.17 0.0008
Effect   salt   _salt   Estimate   Standard Error   DF   t Value   Pr > |t|   Adjustment   Adj P
salt 10 15 -4.1750 1.9608 15 -2.13 0.0502 Tukey 0.3242
salt 10 20 8.1500 1.9608 15 4.16 0.0008 Tukey 0.0089
salt 10 25 6.6750 1.9608 15 3.40 0.0039 Tukey 0.0374
salt 10 30 11.0750 1.9608 15 5.65 <.0001 Tukey 0.0005
salt 10 35 8.3750 1.9608 15 4.27 0.0007 Tukey 0.0072
salt 15 20 12.3250 1.9608 15 6.29 <.0001 Tukey 0.0002
salt 15 25 10.8500 1.9608 15 5.53 <.0001 Tukey 0.0007
salt 15 30 15.2500 1.9608 15 7.78 <.0001 Tukey <.0001
salt 15 35 12.5500 1.9608 15 6.40 <.0001 Tukey 0.0001
salt 20 25 -1.4750 1.9608 15 -0.75 0.4635 Tukey 0.9715
salt 20 30 2.9250 1.9608 15 1.49 0.1565 Tukey 0.6742
salt 20 35 0.2250 1.9608 15 0.11 0.9102 Tukey 1.0000
salt 25 30 4.4000 1.9608 15 2.24 0.0403 Tukey 0.2749
salt 25 35 1.7000 1.9608 15 0.87 0.3996 Tukey 0.9488
salt 30 35 -2.7000 1.9608 15 -1.38 0.1887 Tukey 0.7394
There is a subtle difference between the Least Square Mean and the Mean column that is a concern
when the data is unbalanced, i.e., some blocks are missing some treatments. This is beyond the scope
of this course.
In the above example, the estimated standard errors are all around 1.4. It appears that the mean responses at salt concentrations of 10 ppm and 15 ppm are different from the mean responses at the other concentrations, but there is no clear distinction among the mean responses within each group.
NOTE that the se are NOT computed as simply s/√n. Every design leads to a different formula for computing standard errors. However, as mentioned before, there is a unified theory for doing this which is a major component of the education of a statistician.
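For a balanced RCB with b blocks, the relevant formulas are se(mean) = √(MSE/b) and se(difference) = √(2·MSE/b). The MSE is not printed in the listing above; the value 7.689 used below is recovered from the partition of the raw data (equivalently, b × se(mean)²). A quick check against the output:

```python
import math

mse, b = 7.689, 4                  # residual mean square and number of blocks

se_mean = math.sqrt(mse / b)       # about 1.3865, as in the lsmeans table
se_diff = math.sqrt(2 * mse / b)   # about 1.9608, as in the differences table

# e.g. the t statistic for comparing the 10 ppm and 15 ppm means:
t_10_vs_15 = (14.15 - 18.325) / se_diff   # about -2.13, as in the table
```

Both standard errors and the example t-statistic match the Proc Mixed output, showing where the "different formula for each design" actually comes from in this case.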
A formal multiple comparison procedure can also be done.
The joined-lines plot is obtained using the pdmix800 macro as seen in previous chapters using the
output from Proc Mixed:
%include 'pdmix800.sas';
%pdmix800(diffs3,lsmeans3,alpha=0.05,sort=yes);
Effect=salt Method=Tukey(P<0.05) Set=1
Obs salt Estimate Standard Error Letter Group
1 15 18.3250 1.3865 A
2 10 14.1500 1.3865 A
3 25 7.4750 1.3865 B
4 20 6.0000 1.3865 B
5 35 5.7750 1.3865 B
6 30 3.0750 1.3865 B
The output looks different, but contains enough information to make the same types of inference as before. In particular, the confidence interval for the difference will enable you to test if the difference in the means is 0. The joined lines plot is available at the bottom, but comparison circles are not available from this platform.
6.13 Example - Comparing different herbicides
An experiment was conducted to investigate the effect of different herbicides on the spike weight of gladiolus. The idea is that the herbicides will kill competing weeds and allow more growth to occur.
Four fields were randomly selected from a set of fields scattered around the province at various experimental farms, and each field was split into 4 smaller plots. Four different herbicides were randomly assigned to the smaller plots in an RCB design.
You should draw a possible experimental layout and think how a blocked design's layout is different from that of a completely randomized design.
Here are the raw data:
Biomass - kg
Herbicide Field 1 Field 2 Field 3 Field 4
Control 1.05 1.53 1.62 1.11
2-4D TCA 2.05 1.86 1.68 1.69
DN/CR 1.95 2.00 1.82 1.81
Sesin 1.75 1.93 1.70 1.59
The raw data are available in the herbicide.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data yield;
   length herb field $10.;
   infile 'herbicide.csv' dlm=',' dsd missover firstobs=2;
   input herb field yield;
run;
The raw data, as read in, are shown below:
Obs herb field yield
1 control Field 1 1.05
2 control Field 2 1.53
3 control Field 3 1.62
4 control Field 4 1.11
5 24D TCA Field 1 2.05
6 24D TCA Field 2 1.86
7 24D TCA Field 3 1.68
8 24D TCA Field 4 1.69
9 Dn/CR Field 1 1.95
10 Dn/CR Field 2 2.00
11 Dn/CR Field 3 1.82
12 Dn/CR Field 4 1.81
13 Sesin Field 1 1.75
14 Sesin Field 2 1.93
15 Sesin Field 3 1.70
16 Sesin Field 4 1.59
1. Statistical Model
The simplified model is:
Weight = Herbicide Field
where the response variable Weight is affected by both treatment effects Herbicide and blocking effects Field. The effects of individual experimental units are not specified (but implicitly assumed to exist), nor is there any term corresponding to randomization effects.
2. Formulate the hypotheses:
We are interested in testing if the mean biomass per plot is the same for all 4 herbicides
H: μcontrol = μ24D = μDN/CR = μsesin
A: not all the means are equal, i.e., at least one differs from the rest.
3. Collect data and preliminary sample statistics:
The data must be stacked into three columns corresponding to the weight response variable, the herbicide used, and the field (blocking) variable.
We first construct a profile plot to show how the weight varies as a function of herbicide for each field using Proc Sgplot:
proc sgplot data=yield;
   title2 'preliminary plot to check additivity';
   series y=yield x=field / group=herb;
run;
All herbicides appear to be equally effective and better than the control. There appear to be some field effects, but they are not completely consistent (of course, with a small sample size, it is hard to know if these are just random variation).
We also get the raw means and standard deviations and check to see that the sample standard deviations are about equal in all factor levels:
proc tabulate data=yield;
   title2 'Summary of means and std dev';
   class herb;
   var yield;
   table herb, yield*(n*f=5.0 mean*f=7.2 std*f=7.2);
run;
yield
herb      N   Mean   Std
24D TCA   4   1.82   0.17
Dn/CR     4   1.90   0.09
Sesin     4   1.74   0.14
control   4   1.33   0.29
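These summary statistics can be verified directly from the raw data table; a quick Python cross-check:

```python
import statistics

yields = {   # biomass (kg) by herbicide over the four fields, from the data table
    "24D TCA": [2.05, 1.86, 1.68, 1.69],
    "Dn/CR":   [1.95, 2.00, 1.82, 1.81],
    "Sesin":   [1.75, 1.93, 1.70, 1.59],
    "control": [1.05, 1.53, 1.62, 1.11],
}
m = {h: statistics.mean(v) for h, v in yields.items()}   # means, e.g. control ~ 1.33
s = {h: statistics.stdev(v) for h, v in yields.items()}  # sds,   e.g. control ~ 0.29
```

The standard deviations are roughly comparable across the herbicides, with the control somewhat more variable, which is consistent with the visual check suggested above.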
4. Compute a test statistic and p-value:
We fit the model using Proc Mixed. You could also use Proc GLM.
ods graphics on;
proc mixed data=yield plot=all;
   title2 'Analysis using Proc Mixed';
   class field herb;                  /* the two effects in the model */
   model yield = field herb / ddfm=kr;
   lsmeans herb / diff adjust=tukey;  /* multiple comparisons - tukey adjustment */
   ods output tests3=Test3;
   ods output lsmeans=lsmeans3;
   ods output diffs=diffs3;
run;
ods graphics off;
This gives the following output:
Effect   Num DF   Den DF   F Value   Pr > F
field       3        9       1.74    0.2283
herb        3        9       8.52    0.0054
We can check the basic assumptions of the analysis by looking at the residual summary panel:
There is some evidence that the residuals are not evenly distributed over the herbicides, but this isn't too bad.
Only the line in the Effect tests table corresponding to the test of the herbicide effect is of interest.
The F-statistic is 8.52 and the p-value is 0.0054.
5. Make a decision:
Because the p-value is small, we conclude that there is evidence of a difference among the population means, i.e., there is evidence of a difference among the mean spike weights from the different herbicides. However, at this point, we can't say much else.
6. Because we found evidence against the null hypothesis, do a multiple comparison procedure.
The table of estimated means and standard errors seems to indicate that the mean response from the control is lower than the mean weight when a herbicide is applied, but there doesn't appear to be much difference in the mean response among the three herbicides.
A formal multiple comparison confirms this result:
We obtain the least-square means and differences among the means from the LSmeans statement in
Proc Mixed. Notice that we used a Tukey multiple comparison procedure to control the overall error
rate.
Effect   herb   Estimate   Standard Error   DF   t Value   Pr > |t|
herb 24D TCA 1.8200 0.08685 9 20.95 <.0001
herb Dn/CR 1.8950 0.08685 9 21.82 <.0001
herb Sesin 1.7425 0.08685 9 20.06 <.0001
herb control 1.3275 0.08685 9 15.28 <.0001
Effect   herb   _herb   Estimate   Standard Error   DF   t Value   Pr > |t|   Adjustment   Adj P
herb 24D TCA Dn/CR -0.07500 0.1228 9 -0.61 0.5566 Tukey 0.9262
herb 24D TCA Sesin 0.07750 0.1228 9 0.63 0.5438 Tukey 0.9195
herb 24D TCA control 0.4925 0.1228 9 4.01 0.0031 Tukey 0.0134
herb Dn/CR Sesin 0.1525 0.1228 9 1.24 0.2458 Tukey 0.6183
herb Dn/CR control 0.5675 0.1228 9 4.62 0.0013 Tukey 0.0056
herb Sesin control 0.4150 0.1228 9 3.38 0.0081 Tukey 0.0340
There is a subtle difference between the Least Square Mean and the Mean column that is a concern
when the data is unbalanced, i.e., some blocks are missing some treatments. This is beyond the scope
of this course.
NOTE that the se are NOT computed as simply s/√n. Every design leads to a different formula for computing standard errors. However, as mentioned before, there is a unified theory for doing this which is a major component of the education of a statistician.
The joined-lines plot is obtained using the pdmix800 macro as seen in previous chapters using the
output from Proc Mixed:
%include 'pdmix800.sas';
%pdmix800(diffs3,lsmeans3,alpha=0.05,sort=yes);
Effect=herb Method=Tukey(P<0.05) Set=1
Obs herb Estimate Standard Error Letter Group
1 Dn/CR 1.8950 0.08685 A
2 24D TCA 1.8200 0.08685 A
3 Sesin 1.7425 0.08685 A
4 control 1.3275 0.08685 B
This shows that there is no evidence of a difference in the population mean yield among the three
herbicides, but all three herbicides appear to have different mean yield than the control application.
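The letter grouping can be reproduced by hand. With 4 means and 9 error df, Tukey's procedure declares two means different when they are further apart than HSD = q(0.05; 4, 9) × se(mean). The critical value q ≈ 4.41 used below is read from a standard studentized-range table, not from the SAS output:

```python
q_crit = 4.41        # q(0.05; k=4, df=9), taken from a studentized-range table
se_mean = 0.08685    # se of each least-square mean, from the table above
hsd = q_crit * se_mean   # two means further apart than this are declared different

means = {"Dn/CR": 1.8950, "24D TCA": 1.8200, "Sesin": 1.7425, "control": 1.3275}
different = {(a, b): abs(means[a] - means[b]) > hsd
             for a in means for b in means if a < b}
# control separates from every herbicide; the three herbicides do not separate
```

Every herbicide-vs-control gap exceeds the HSD of roughly 0.38 while no herbicide-vs-herbicide gap does, which is exactly the A/B letter pattern in the joined-lines display.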
6.14 Example - Comparing turbidity at several sites
This is a generalization of the paired analysis that we did in a previous section where we compared the log(turbidity) at two sites.
French Creek is located on Vancouver Island, B.C. and runs from rural through urban areas out to the sea. Measurements of water quality are taken at several points along the creek. Interest lies in a comparison of the mean water quality among the sites. For example, is there a difference in mean turbidity (NTU) among the sites?
These measurements are taken approximately monthly, but twice a year, five measurements are taken within 30 days. The readings are synoptic at the sites, i.e. taken at roughly the same time (usually within a day of each other). Consequently, events such as a heavy rain would be expected to impact the water quality at all the sites.
Here is the raw data on turbidity at five sites. The sample time is in the YYYY.MMDD format. Notice that there is a missing value for the NewHwy site on 2010.1116.
Turbidity (NTU) at five sites
SampleTime BB Coombs Grafton NewHwy WinchRd
2010.0427 1.9 1.1 0.9 1.9 1.4
2010.0525 1 0.7 0.6 0.7 1
2010.0622 1.8 0.6 0.3 0.7 1
2010.0726 0.7 0.3 0.8 0.5 0.8
2010.0816 0.7 0.3 1.5 0.6 0.9
2010.0824 0.8 0.5 0.4 0.4 1.4
2010.0831 0.9 0.3 1.1 0.6 1.6
2010.0907 0.7 0.3 1.6 0.5 2.5
2010.0914 0.8 0.3 0.2 0.4 0.7
2010.1020 0.5 0.4 0.2 0.3 0.6
2010.1026 4.3 2 1.2 2.3 1.1
2010.1102 7.4 4 2.1 4.2 1.9
2010.1109 2.5 1.3 1.3 2 1.3
2010.1116 1.9 1.1 0.6 . 0.8
2010.1229 5.3 3.8 4.3 6.2 3.1
We will now analyze all of the data simultaneously.
The raw data are available in the TurbidityCompare.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual fashion:
data turbidity;
   length SampleTime $10.;
   infile 'TurbidityCompare.csv' dlm=',' dsd missover firstobs=2;
   input SampleTime BB Coombs Grafton NewHwy WinchRd;
run;
The first few lines of the raw data are shown below:
Obs SampleTime BB Coombs Grafton NewHwy WinchRd
1 2010.0427 1.9 1.1 0.9 1.9 1.4
2 2010.0525 1.0 0.7 0.6 0.7 1.0
3 2010.0622 1.8 0.6 0.3 0.7 1.0
4 2010.0726 0.7 0.3 0.8 0.5 0.8
5 2010.0816 0.7 0.3 1.5 0.6 0.9
6 2010.0824 0.8 0.5 0.4 0.4 1.4
7 2010.0831 0.9 0.3 1.1 0.6 1.6
8 2010.0907 0.7 0.3 1.6 0.5 2.5
9 2010.0914 0.8 0.3 0.2 0.4 0.7
10 2010.1020 0.5 0.4 0.2 0.3 0.6
11 2010.1026 4.3 2.0 1.2 2.3 1.1
12 2010.1102 7.4 4.0 2.1 4.2 1.9
13 2010.1109 2.5 1.3 1.3 2.0 1.3
14 2010.1116 1.9 1.1 0.6 . 0.8
15 2010.1229 5.3 3.8 4.3 6.2 3.1
When there are 3 or more treatment levels, the General Linear Model approach must be used. This is the
most general approach to the analysis of experimental data. As such, great care must be taken to correctly
specify the model to obtain the correct analysis. If you specify the wrong model, you will get incorrect
results!
This approach can automatically adjust for a few missing values; if there are extensive missing values, or if the blocks were designed to be incomplete from the start, methods for incomplete block designs may be preferable.
Variation in the response variable (the log(turbidity)) is caused by several sources, some of which we
can explicitly identify. As in the previous chapter, terms in the model will correspond to the treatment
structure, the experimental unit structure, and the randomization structure.
The first source of variation comes from the treatment structure - here the sites on the creek may give rise to differences in the log(turbidity). The second source of variation is from the sampling dates, where events such as a major rainfall influence the readings at all the sites.
The model is then written with terms corresponding to these sources of variation in a simplified syntax:
log(turbidity) = Site SamplingDate
The terms Site and SamplingDate can be written in any order.
The crucial assumption of additivity between treatments and blocks is implicit and is specified by not specifying an interaction term. [Refer to the chapter on two factor designs for a discussion of interaction.]
1. Specify the hypothesis
The hypothesis can be specified in terms of the mean log(turbidity) for each site
H: μBB = μCoombs = . . . = μWinchRd
A: at least one mean differs
If there were no effect of the site upon the mean log(turbidity), we would expect the mean log(turbidity) to be equal for all sites. In the case of 3 or more treatment levels, the concept of one- or two-sided tests doesn't make sense; the alternative is simply that a difference exists somewhere among the population means.
2. Collect data
We need to stack the data and then compute the log(turbidity) in the stacked dataset.
data turbidity3;
   set turbidity;
   length SiteName $10;
   sitename = "BB";      turbidity = BB;      output;
   sitename = 'Coombs';  turbidity = Coombs;  output;
   sitename = 'Grafton'; turbidity = Grafton; output;
   sitename = 'NewHwy';  turbidity = NewHwy;  output;
   sitename = 'WinchRd'; turbidity = WinchRd; output;
   keep SampleTime SiteName turbidity;
run;
data turbidity3;   /* compute log(turbidity) */
   set turbidity3;
   logTurbidity = log(turbidity);
run;
The first few lines of the stacked data are shown below:
part of the transposed raw data
Obs SampleTime SiteName turbidity logTurbidity
1 2010.0427 BB 1.9 0.64185
2 2010.0427 Coombs 1.1 0.09531
3 2010.0427 Grafton 0.9 -0.10536
4 2010.0427 NewHwy 1.9 0.64185
5 2010.0427 WinchRd 1.4 0.33647
6 2010.0525 BB 1.0 0.00000
7 2010.0525 Coombs 0.7 -0.35667
8 2010.0525 Grafton 0.6 -0.51083
9 2010.0525 NewHwy 0.7 -0.35667
10 2010.0525 WinchRd 1.0 0.00000
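The wide-to-long stacking and the log transform mirror the two SAS data steps; a minimal Python version for a single sample time (illustrative only):

```python
import math

sites = ["BB", "Coombs", "Grafton", "NewHwy", "WinchRd"]
row = {"SampleTime": "2010.0427",
       "BB": 1.9, "Coombs": 1.1, "Grafton": 0.9, "NewHwy": 1.9, "WinchRd": 1.4}

# one output record per site, mirroring the OUTPUT statements in the data step
stacked = [{"SampleTime": row["SampleTime"],
            "SiteName": s,
            "turbidity": row[s],
            "logTurbidity": math.log(row[s])}
           for s in sites]
# stacked[0]["logTurbidity"] is about 0.64185, matching the listing above
```

Applying the same idea to each of the 15 sample times reproduces the stacked dataset shown in the listing.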
3. Compute a test statistic, p-value, and confidence intervals
Use Proc Mixed in SAS to specify the model above. The Class statement tells SAS that each variable is a factor and not a numerical value:
ods graphics on;
proc mixed data=turbidity3 plots=all;
   title2 'Modelling approach as an RCB using MIXED';
   class SampleTime SiteName;
   model logturbidity = SampleTime SiteName / /* ddfm=KR */;
   lsmeans SiteName / diff pdiff adjust=tukey;
   ods output tests3=Test3;
   ods output lsmeans=lsmeans3;
   ods output diffs=diffs3;
run;
ods graphics off;
This gives:
Effect        Num DF   Den DF   F Value   Pr > F
SampleTime      14       55      12.13    <.0001
SiteName         4       55       5.62    0.0007
Unfortunately, most statistical packages FAIL to differentiate between terms corresponding to treatment structures and blocking structures and treat both symmetrically. This can lead to some output that is really not appropriate for terms corresponding to blocking factors as noted below.
The statistical significance of each effect in the model is summarized by a series of F-statistics. In this case the p-value for the effect of Site is .0007 and so there is strong evidence that not all population mean log(turbidity) are equal. At this point, we don't know which means could be different from each other.
The test for a blocking variable is automatically computed, but is rarely of interest, and is usually not appropriate.[22]
Estimates of the population marginal means and all the pairwise differences in the marginal means are obtained by the lsmeans statements in the SAS code seen earlier. Don't forget to specify the adjust=tukey option to control the overall error rate.
This gives:
Effect   SiteName   Estimate   Standard Error   DF   t Value   Pr > |t|
SiteName BB 0.3734 0.1181 55 3.16 0.0026
SiteName Coombs -0.3086 0.1181 55 -2.61 0.0115
SiteName Grafton -0.2049 0.1181 55 -1.73 0.0884
SiteName NewHwy -0.05290 0.1233 55 -0.43 0.6694
SiteName WinchRd 0.1847 0.1181 55 1.56 0.1235
Effect   SiteName   _SiteName   Estimate   Standard Error   DF   t Value   Pr > |t|   Adjustment   Adj P
SiteName BB Coombs 0.6820 0.1670 55 4.08 0.0001 Tukey-Kramer 0.0013
SiteName BB Grafton 0.5783 0.1670 55 3.46 0.0010 Tukey-Kramer 0.0089
SiteName BB NewHwy 0.4263 0.1707 55 2.50 0.0155 Tukey-Kramer 0.1061
SiteName BB WinchRd 0.1886 0.1670 55 1.13 0.2636 Tukey-Kramer 0.7903
SiteName Coombs Grafton -0.1037 0.1670 55 -0.62 0.5371 Tukey-Kramer 0.9711
SiteName Coombs NewHwy -0.2557 0.1707 55 -1.50 0.1398 Tukey-Kramer 0.5681
SiteName Coombs WinchRd -0.4934 0.1670 55 -2.95 0.0046 Tukey-Kramer 0.0357
SiteName Grafton NewHwy -0.1520 0.1707 55 -0.89 0.3772 Tukey-Kramer 0.8993
SiteName Grafton WinchRd -0.3896 0.1670 55 -2.33 0.0233 Tukey-Kramer 0.1501
SiteName NewHwy WinchRd -0.2376 0.1707 55 -1.39 0.1695 Tukey-Kramer 0.6352
[22] The question of testing for block effects can provoke a very stimulating discussion at statistical meetings - fist fights have nearly broken out! I think this shows that some statisticians need to get out more.
SAS's Proc Mixed does not automatically create the joined-lines plots as Proc GLM does, nor as seen in other software such as JMP. The pdmix800.sas macro (available from the WWW) can do this using the results from Proc Mixed.
%include 'pdmix800.sas';
%pdmix800(diffs3,lsmeans3,alpha=0.05,sort=yes);
This gives:
Effect=SiteName Method=Tukey-Kramer(P<0.05) Set=1
Obs SiteName Estimate Standard Error Letter Group
1 BB 0.3734 0.1181 A
2 WinchRd 0.1847 0.1181 AB
3 NewHwy -0.05290 0.1233 ABC
4 Grafton -0.2049 0.1181 BC
5 Coombs -0.3086 0.1181 C
A plot to check the model assumptions is also available:
and does not show any evidence of problems.
These results are identical to previous analyses.
4. Make a decision
There is strong evidence that not all sites have the same mean log(turbidity). It is difficult to distinguish the means for sites BB, WinchRd, and NewHwy and for sites NewHwy, Grafton, and Coombs. Note that there is considerable overlap in the joined letter plots; refer back to the chapter on the completely randomized design for the intuitive analogy of paint chips to explain this phenomenon.
6.15 Power and Sample Size in RCBs
Power and sample size determinations proceed in a similar fashion as in a single-factor CRD.
The key difference is that σ should be the residual variation AFTER ADJUSTING FOR BLOCKS. This is usually obtained from the √MSE term (obtained from the ANOVA table) in a pilot study or a literature search. There is a slight complication in that blocking reduces the degrees of freedom for residual variance (the MSE term) which should be taken into account, but this correction is usually small and can often be ignored.
If there is no information on the residual variation, use a value not adjusted for blocks - this gives an upper bound as to the number of blocks needed.
Notice that the gains in power from using an RCB design come from the reduction in unexplained
variation. Consequently, you will want to choose blocks so that the experimental units within each
block are as similar as possible, to reduce the noise within the blocks.
The dangers associated with retrospective power analyses are still present.
As in the CRD design, the configuration of the means has an effect. If available, use estimates of the
k means. If not available, select values such that the difference between the largest and smallest mean is
biologically important and place the means of the other groups at the average of these two values. The actual
means are not important, only the difference between the largest and smallest value.
For example, consider the herbicide example in a previous section. A follow-up study is to be conducted
ONLY on the 3 herbicides (i.e. the control treatment is to be ignored). The estimated value of the standard
deviation from an RCB analysis is 0.174. [This is obtained from the Root Mean Square Error value in the
Summary of Fit section.] Suppose that a difference of 0.2 is biologically important.
We use Proc Power with the values from the experiment as noted above.
proc power;
  title2 'Using a base difference of 0.2 mm';
  onewayanova
    groupmeans = 0 | .1 | .2  /* group means that differ by the biological effect */
    stddev     = .174         /* the standard deviation                           */
    alpha      = .05          /* the alpha level                                  */
    power      = .80          /* target power                                     */
    ntotal     = .            /* solve for the total sample size                  */
  ;                           /* end of the onewayanova statement - don't forget it */
  ods output Output=power1;
  footnote  'This configuration has the worst power and so the largest possible sample size';
  footnote2 'This power computation is APPROXIMATE as no adjustment has been made';
  footnote3 'for the loss in df due to blocking';
run;
which gives
Mean1  Mean2  Mean3  Std Dev  Nominal Power  Actual Power  N Total
    0    0.1    0.2    0.174            0.8         0.811       48
About 48 observations are needed IN TOTAL, which corresponds to 16 observations for each
herbicide, which implies that about 16 blocks should be used. A more refined analysis that accounts for the
loss of degrees of freedom in the MSE from blocking would indicate that about 17 blocks are actually needed.
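The computation that Proc Power performs here can be reproduced from first principles with the noncentral F distribution. The following Python sketch (an illustration, not part of the original SAS workflow; it assumes scipy is available) searches for the smallest per-group sample size whose power reaches 0.80:

```python
# Sketch (not from the notes): one-way ANOVA power via the noncentral F distribution.
from scipy.stats import f as f_dist, ncf

def anova_power(means, sd, n_per_group, alpha=0.05):
    """Power of the overall F test for a one-way ANOVA with equal group sizes."""
    k = len(means)
    grand = sum(means) / k
    f2 = sum((m - grand) ** 2 for m in means) / k / sd**2  # Cohen's f squared
    n_total = k * n_per_group
    nc = n_total * f2                                      # noncentrality parameter
    f_crit = f_dist.ppf(1 - alpha, k - 1, n_total - k)
    return ncf.sf(f_crit, k - 1, n_total - k, nc)

# Herbicide example: means spanning the important difference of 0.2, sd = 0.174
n = 2
while anova_power([0, 0.1, 0.2], 0.174, n) < 0.80:
    n += 1
print(n * 3, round(anova_power([0, 0.1, 0.2], 0.174, n), 3))  # 48 total, power ~0.811
```

This reproduces the Proc Power answer of 48 total observations; like Proc Power, it makes no adjustment for the error degrees of freedom lost to blocking.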
You may also wish to refer to the MOF publication 50 on power and sample size in RCBs with
sub-sampling, available at http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/index.html
6.16 Example - BPK: Blood pressure at presyncope
6.16.1 Experimental protocol
The data for this experiment was provided by Clare Protheroe, a M.Sc. candidate in BPK at SFU.
Fifteen subjects took part in an experiment to measure their orthostatic tolerance: the blood pressure of
the patients just before presyncope (the symptoms experienced before a faint). During presyncope, patients
experience lightheadedness, muscular weakness, and feeling faint (as opposed to syncope, which is actually
fainting). In many patients, lightheadedness is a symptom of orthostatic hypotension, which occurs when
blood pressure drops significantly, such as when the patient stands from a supine or sitting position.
Each subject was measured on a tilt test three times: once with a compression stocking, once with a
placebo stocking, and once with a different placebo stocking. The subjects were randomized to stocking
conditions on three different days.
For each test, the subject underwent a 20 minute supine (lying on the back) period, followed by
a 20 minute tilt period, followed by a 10 minute period of 20 mmHg of lower body negative pressure
(LBNP), a 10 minute period of 40 mmHg LBNP, and a 10 minute period of 60 mmHg LBNP.
However, not all patients made it through the entire test before reaching the pre-syncope stage. As noted at
http://advan.physiology.org/content/31/1/76.full
During LBNP, participants lie in a supine position with their legs sealed in a LBNP chamber at
the level of the iliac crest. Air pressure inside the chamber is reduced by a vacuum pump, making
the pressure inside the chamber less than atmospheric pressure. This causes blood to shift
from an area of relatively high pressure (i.e., the upper body, which is outside the chamber) toward
an area of relatively low pressure (i.e., the legs inside the chamber). Without physiological
compensations, blood is shunted away from the thoracic cavity and ultimately pools in the lower
limbs and the lower abdomen. Normally, the body compensates by peripheral vasoconstriction
and an increase in heart rate, which serve to maintain normal circulation. Inadequate physiological
compensations in response to increasing negative pressure results in falling arterial blood
pressure and, ultimately, syncope.
Systolic blood pressure was measured every 2 minutes for 20 minutes during the supine phase; again
every 2 minutes during the tilt phase; and finally every 2 minutes during the LBNP phases until
the patient ended the trial at the onset of presyncope. The blood pressure at the onset of presyncope was
measured.
The raw data are available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.
Here is a (very small) snippet of the raw data for two patients.
Condition Time Placebo 1 Experimental Placebo 2
Subject 23395 29902 23395 29902 23395 29902
Supine 2 136 87 114 106 137 111
4 129 81 114 94 128 97
6 125 84 114 103 118 98
8 132 81 115 104 124 99
10 125 76 119 101 118 98
12 131 89 121 107 120 97
14 128 86 118 106 110 113
16 124 103 120 109 123 119
18 134 107 122 104 125 114
20 132 94 124 112 134 109
Tilt 22 129 105 113 106 117 103
24 130 111 113 116 125 100
26 130 89 112 103 115 93
28 145 89 126 99 131 98
30 135 89 142 103 127 103
32 128 102 133 100 117 107
34 124 101 143 89 122 103
36 119 93 129 88 124 102
38 79 92 129 114 113 107
40 102 89 131 106 128 108
-20mmHg 42 98 123 100 120 107
44 128 102 127 97 114 100
46 149 88 123 91 116 100
48 152 90 133 93 111 105
50 158 96 127 86 114 93
-40mmHg 52 153 104 128 95 92
54 149 97 126 89 94
56 147 101
58 143
60 145
-60mmHg 62 147
64
66
68
70
Presyncope BP 74.3 52.3 68.6 76.2 54.8 70.0
Patient 23395 started off with a systolic blood pressure (SBP) of 136 mmHg at the end of the second
minute while supine wearing the Placebo 1 stocking, and the blood pressure varied over the next 18 minutes,
with the final blood pressure (at minute 20) of 132 mmHg. Then the patient was tilted. At minute 22 (2
minutes into the tilt) the SBP was 129 and at minute 40 the SBP was 102. The LBNP was applied. The
blood pressure was missing for this patient at minute 42. This missing value was likely due to an equipment
malfunction or calibration problem. The blood pressure increased and ended at 158 mmHg at minute 50.
The LBNP was increased and at minute 60 the SBP was 145 mmHg. The LBNP was again increased. At
minute 62 the SBP was 147. At this point blood pressure readings were terminated. The patient experienced
presyncope at blood pressure 74.3.
This patient underwent similar testing under the Experimental and Placebo 2 conditions.
Patient 29902 underwent a similar protocol, but the SBP readings terminated at minute 54 under the
Placebo 1 condition. This patient experienced presyncope at a blood pressure of 52.3 mmHg.
Whew! The dataset is quite large, with over 1000 values.
The hypotheses of interest are:
- Is there a difference in the mean blood pressure just before presyncope between the different
  treatments?
- Is there a difference in the mean blood pressure between the different phases and treatments?
We will split the analysis into two parts: an analysis of the blood pressure prior to presyncope and an
analysis of the blood pressure in the different phases. These are presented in different chapters.
6.16.2 Analysis
This part of the experiment has a single factor (type of stocking) with 3 levels. Every subject received
each treatment in random order; this implies that each subject serves as its own control, i.e. each subject is
a block. All treatments occur for each subject, so this is a complete block design. Each patient has a single
blood pressure reading just prior to presyncope, so there is no subsampling or pseudo-replication involved.
The data are read into SAS in the standard structure with one variable for each of the treatment,
the subject number, and the blood pressure prior to presyncope.

proc import datafile="BPatPresyncope.csv" dbms=csv out=bp replace;
run;
Part of the raw data on the blood pressure at presyncope is shown below:
Obs treatment subject bp_at_presyncope
1 PLACEBO CONDITION 1 23395 74.3
2 PLACEBO CONDITION 1 29902 52.3
3 PLACEBO CONDITION 1 21636 68.0
4 PLACEBO CONDITION 1 42003 76.3
5 PLACEBO CONDITION 1 41487 86.2
A preliminary tabulation (not shown here) shows that every subject had every treatment and that every
treatment appears in every block exactly once. This is a standard randomized complete block design.
We also look at the means and standard deviations:
proc tabulate data=bp;
  title2 'A preliminary tabulation';
  class treatment;
  var bp_at_presyncope;
  table treatment, n*f=5.0 bp_at_presyncope*(mean*f=6.2 std*f=6.2);
run;
which gives:
treatment             N   Mean    Std
EXPERIMENTAL CONDIT  15  66.82  14.25
PLACEBO CONDITION 1  15  72.18   9.87
PLACEBO CONDITION 2  15  71.79   5.61
We notice that one of the treatments has a much larger standard deviation than the other treatments. This
is potentially worrisome and may indicate an outlier in the data.
A profile plot is used to check the assumption of additivity between treatments and blocks, i.e. that the
effect of the treatment is roughly the same in all subjects.
proc sgplot data=bp;
  title2 'Profile plot to check for additivity and outliers';
  series y=bp_at_presyncope x=subject / group=treatment;
run;
which gives the profile plot (figure not reproduced here).
There is an obvious outlier. We checked to make sure that this was not just a typing error. Because this
outlier only occurred for one subject and is quite extraordinary, we decided to remove it:
data bp_nooutlier;
  set bp;   /* the dataset created by proc import above */
  if bp_at_presyncope < 40 then delete;
run;
The revised profile plot (not shown) is much improved, and the large standard deviation seen earlier is
now reduced:
treatment             N   Mean    Std
EXPERIMENTAL CONDIT  14  69.23  11.17
PLACEBO CONDITION 1  15  72.18   9.87
PLACEBO CONDITION 2  15  71.79   5.61
We are now ready to do the analysis. The formal statistical model is

BP = Treatment Subject(R)

where BP is the response variable (the blood pressure at presyncope); Treatment represents the treatment
levels (the three types of stockings); and Subject represents the blocking variable for the individual subjects.
There is NO interaction term between treatments and blocks, as a key assumption of RCBs is the additivity
between treatments and blocks. There is no information in the study to test this assumption. Subject is
declared as a random term because we are interested not only in these subjects, but wish to extrapolate to all
people in general who might need treatment.
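The additive decomposition behind this model can be sketched by hand. The Python fragment below (with SIMULATED data of the same shape as this study, not the real measurements, and not the SAS workflow used in the notes) partitions the total sum of squares into treatment, block (subject), and residual pieces, and forms the treatment F test against the residual that remains AFTER removing block variation:

```python
# Sketch with simulated data: the two-way additive (RCB) ANOVA decomposition.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2012)
n_subj, n_trt = 15, 3
subj_eff = rng.normal(0, 5, n_subj)      # block (subject) effects
trt_eff = np.array([0.0, 2.0, 2.0])      # hypothetical treatment effects
y = 70 + subj_eff[:, None] + trt_eff + rng.normal(0, 9, (n_subj, n_trt))

grand = y.mean()
ss_trt = n_subj * ((y.mean(axis=0) - grand) ** 2).sum()
ss_blk = n_trt * ((y.mean(axis=1) - grand) ** 2).sum()
ss_err = ((y - grand) ** 2).sum() - ss_trt - ss_blk   # residual AFTER removing blocks
df_trt, df_err = n_trt - 1, (n_subj - 1) * (n_trt - 1)
F = (ss_trt / df_trt) / (ss_err / df_err)
p = f_dist.sf(F, df_trt, df_err)
print(round(F, 2), round(p, 4))
```

Note the error degrees of freedom, (15-1)(3-1) = 28 in the balanced case; the Proc Mixed output below shows 27 because one outlying observation was removed.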
Note that if the design is complete, i.e. if every subject did every treatment, it makes no
difference to the test of treatment effects whether subjects are declared as fixed or random effects. It still makes
a difference for the se of the marginal means (but not of the treatment contrasts).
In cases where blocks are not complete, you will see small differences in the results for testing
treatment effects depending on whether subjects (blocks) are treated as fixed or random. In the case of a single
missing value, the differences are expected to be small. However, if there are many missing values (e.g. incomplete
block designs), there is extra information that can be extracted from the analysis when blocks are random effects
(the inter-block information). See the chapter on incomplete blocks for more information.
We use Proc Mixed to fit the above model:
proc mixed data=bp_nooutlier plots=all;
  title2 'Analysis';
  class subject treatment;
  model bp_at_presyncope = treatment;
  random subject;
  lsmeans treatment / diff cl adjust=tukey;
  ods output tests3  =tests;
  ods output diffs   =LSMeansDiffs;
  ods output lsmeans =LSmeans;
  ods output covparms=Covparms;
run;
The results of the analysis of variance:
Effect     Num DF  Den DF  F Value  Pr > F
treatment       2      27     0.44  0.6484

We fail to detect an effect of the different treatments on the mean blood pressure at presyncope (p = 0.65).
Estimates of the mean blood pressure at presyncope for each treatment are:
treatment            Estimate  Standard Error
EXPERIMENTAL CONDIT   69.2297          2.4433
PLACEBO CONDITION 1   72.1774          2.3605
PLACEBO CONDITION 2   71.7936          2.3605
The standard errors for each estimated mean are quite large, which masks the smallish differences among the
means. A comparison of the means (after a Tukey adjustment) shows that the estimated differences in the mean
blood pressure prior to presyncope are small among the various treatments.
treatment            _treatment           Estimate  Std Error  Adjustment     Adj P   Adj Low  Adj Upp
EXPERIMENTAL CONDIT  PLACEBO CONDITION 1   -2.9477     3.3973  Tukey-Kramer  0.6649  -11.3707   5.4754
EXPERIMENTAL CONDIT  PLACEBO CONDITION 2   -2.5639     3.3973  Tukey-Kramer  0.7334  -10.9870   5.8591
PLACEBO CONDITION 1  PLACEBO CONDITION 2    0.3838     3.3382  Tukey-Kramer  0.9927   -7.8928   8.6603
Finally, examination of the residual plots and other diagnostics show no problems.
6.16.3 Power and sample size
We failed to detect a difference in the mean blood pressure prior to presyncope in the above experiment. How
many subjects would be needed to detect a difference?
As noted in earlier chapters, a power analysis requires the following information:
Biologically important difference to detect. This is the hardest part of any power/sample size
determination. What difference in the means is biologically important to detect? Suppose that a difference
of five mmHg in the mean blood pressure at presyncope was clinically important. Most power programs
will require you to choose means where the difference between the minimum and maximum mean
equals this clinical difference, with the rest of the means scattered in the interval. Based on the previous
experiment, one possible set of means for which a power analysis is requested may be (65, 65, and
70), where the values of 65 represent the means of the placebo stockings and the value of 70 represents
the mean under the treatment stocking.
Variation in data values. This represents the noise in the experiment. This can be obtained from past
data (such as above) or expert opinion. From the above analysis we find that the estimated residual
variance is about 84, so the residual standard deviation is about 9.
Cov Parm   Estimate
subject     0
Residual   83.5767
Note that the subject variation was estimated to be 0; this indicates that, given the noise in the data,
there is no evidence that subjects respond generally higher or lower across the three treatments.
Even if the subject variance were non-zero, this would not affect the power analysis. The key advantage
of blocked designs is that the blocking variation is removed from the comparison of the means of the
different treatments by conducting all treatments within each block. Each subject serves as its own
control!
It is good practice to try a number of values to see how sensitive the results are to this value.
The α level. Common choices are α = 0.05 and α = 0.10 for a target power of 0.80 and 0.90
respectively.
Target power. We will aim for a target power of 80%.
These values are then fed into the power/sample size program. Be careful, as different programs require
the data to be entered in different ways. For example, some programs require the variation to be entered as
the variance while others require the variation to be entered as the standard deviation.
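For this example, the conversion between the two conventions is a one-line calculation; the sketch below (an illustration, not part of the original SAS workflow) starts from the residual variance in the covparms table above:

```python
# Sketch: converting a residual VARIANCE into the residual STANDARD DEVIATION
# that many power programs expect instead.
import math

resid_var = 83.5767               # residual variance from the covparms table
resid_sd = math.sqrt(resid_var)   # enter THIS if the program asks for an SD
print(round(resid_sd, 2))         # about 9.14, i.e. "about 9" as used in the power run
```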
We use Proc Power with the above information:
proc power;
  title2 'Using a base difference of 5 mmHg';
  onewayanova
    groupmeans = 65 | 65 | 70  /* the group means                  */
    stddev     = 9             /* the standard deviation           */
    alpha      = .05           /* the alpha level                  */
    power      = .80           /* target power                     */
    ntotal     = .             /* solve for the total sample size  */
  ;                            /* end of the onewayanova statement - don't forget it */
  plot x=power min=0.05 max=0.90;
  ods output output=poweroutput;
run;
Notice that this is the same power program as for a standard single-factor CRD; the program doesn't care that
this design is blocked, because the adjustment for blocking has already been made: the residual variation has
the blocking variation removed.
The POWER Procedure: Overall F Test for One-Way ANOVA

Actual Power  N Total
       0.801      144
Note that SAS reports the TOTAL sample size, i.e. the total number of observations over all groups. Because
each subject does three tests, the number of subjects is the total sample size divided by 3.
About 144 observations in total, i.e. 48 subjects each doing 3 tests, would be needed to detect a 5
mmHg difference in the means!
A plot of the power vs. the number of subjects shows that with only 15 subjects (or 45 observations), the
power is extremely low.
6.17 Final notes
Match analysis with design. It is important that the proper analysis be used with a particular experiment.
A very common error (even made by experienced scientists) is to always use the single-factor
CRD ANOVA model even when the design is not a CRD, or to use a two-sample t-test when the
design happens to be paired. Any statistical package expects the user to select the appropriate design.
Modified RCB designs. There are many modifications to RCB designs. For example, you could have
multiple occurrences of treatments within each block. The analysis of these experiments is tricky and
beyond the scope of this course.
Missing values. If your experiment has missing values in some of the blocks, this complicates the
analysis greatly if the computations are done by hand. Most modern packages (i.e. NOT Excel) should
be able to deal with the missing data quite easily unless there is a large number of missing values. You
may wish to seek some help if you have substantial numbers of missing values.
Pseudo-replication. Beware of pseudo-replication.
Incomplete randomization. There is a subtle problem if the order of the treatments cannot be
randomized, e.g., many experiments are interested in how the response changes over time. The technical
name for the problem is a lack of compound symmetry in the covariance matrix: responses that are
close together in time are more highly related than responses that are far apart in time. If there are only
two time measurements (e.g. before vs. after), the problem of a lack of compound symmetry does not
occur, as there are no longer measurements close and far apart in time; all measurements are taken
the same time distance apart. If there are more than two measurements, then you can run into problems.
In these cases, the design is more properly analyzed as a repeated measures design.
In any case, a lack of randomization in time also means that if the treatment is also not randomized,
then the effects of the treatment and of time are completely confounded. For example, if you measured
the blood pressure of subjects on Friday, gave them the drug to take over the weekend, and then
measured their blood pressure on Monday, the effect of the drug is completely confounded with the
Friday vs. Monday effect.
Fixed vs. Random Blocks. In the next chapter, you will be introduced to fixed and random effects.
Essentially, fixed effects are used whenever inference is to be limited to the levels actually appearing
in the experiment: if the levels are to be used again in a future experiment, the effects would be
fixed effects.
The role of blocks is a bit ambiguous. If the experiment were to be repeated, would the same blocks
be used? In the analyses above this was an implicit assumption. However, in many cases, the blocks
used in an experiment are purely artefacts of time and space, and if the experiment were to be repeated,
different blocks would be used. In these cases blocks should be treated as random effects.
For the simple designs presented above, this only has implications for the estimation of marginal means:
the ses presented are too small. All other test statistics, multiple comparisons, etc., are unaffected.
Consequently, for most purposes, this subtle distinction between fixed and random blocks is unimportant.
However, in more complex designs, it can cause significant problems.
The paper:
Newman, J.A., Bergelson, J., and Grafen, A. (1997).
Blocking factors and hypothesis tests in ecology: is your statistics text wrong?
Ecology, 78, 1312-1328. http://dx.doi.org/10.2307/2266126
discusses this issue in detail. [This will not be covered in this course.]
6.18 Frequently Asked Questions (FAQ)
6.18.1 Difference between pairing and confounding
What is the difference between pairing and confounding?
The concepts are quite distinct. Pairing refers to the matching of experimental values by (for example)
repeated measurements on the same unit. Confounding refers to effects of factors that cannot be separated
from the effects of other factors. For example, surveys of students in Stat 403 show that stress levels tend
to increase in April. Is this caused by the writing of exams, or the increased sunlight as the days get longer,
or by the increased average temperature as the days get warmer? A survey of students could not separate
these effects as no experimental manipulation could be performed and the effects of all three factors are
confounded. A good experiment manipulates the levels of factors to try and remove confounding among
factors.
6.18.2 What is the difference between a paired design and an RCB design?
What is the difference between a paired design and an RCB design?
A paired design is a special case of a general RCB design in which the factor has exactly two levels and
there is a pair of observations from each observational unit. As shown in the notes, a paired design can
be analyzed using an RCB design analysis.
6.18.3 What is the difference between a paired t-test and a two-sample t-test?
A student wrote:
I am a little confused over the difference between a paired t-test and a two sample t-test! Here is
what I understand: You use the paired t procedure when the data are from a common source, e.g.
from the same subject before and after, whereas in a two sample experiment you are comparing
results from different subjects, e.g., you give a placebo to one of them and the actual drug to the
other. Is this correct?
Yes. That is correct. If there is a logical connection between the two data values, e.g. before and after,
then you use a paired test. If there is complete randomization of treatments to different experimental units
so that every unit received one and only one treatment, then you use a two-sample t-test.
She continues:
But I really do not see a difference in the analysis of the data! The same hypotheses are being
tested, and it seems only that the test-statistic is computed differently.
Yes, that is correct again. The hypothesis in both experiments is that the mean response is the same for
both levels of the treatment. In the paired case, we often express this in terms of the difference in means,
i.e., H: μ_diff = 0, rather than in terms of the equality of group means.
The test statistic and the se of the test statistic are computed differently in either method.
Once you have the p-value, you use it in the same fashion to make a decision, i.e., if the p-value is very
small, then there is evidence against the null hypothesis, etc.
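To see the "computed differently" point concretely, here is a Python sketch with made-up before/after data (the notes themselves use JMP/SAS; the data and seed here are purely illustrative). The paired test works on the standard deviation of the differences, while the two-sample test pools the much larger subject-to-subject variation:

```python
# Sketch: paired vs. two-sample t test on the same hypothetical measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(403)
before = rng.normal(120, 10, 12)                 # 12 hypothetical subjects
after = before - 4 + rng.normal(0, 3, 12)        # true mean drop of about 4 units

t_pair, p_pair = stats.ttest_rel(before, after)  # uses the sd of the DIFFERENCES
t_two, p_two = stats.ttest_ind(before, after)    # ignores the pairing

print(round(p_pair, 4), round(p_two, 4))
```

Because pairing removes the between-subject noise, the paired p-value is far smaller than the two-sample p-value for the same data.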
6.18.4 Power in RCB/matched pair design - what is root MSE?
The power for a matched pair design can be found in two ways: one specific to a matched pair design
and the other as for a general RCB design (as noted above, the matched pair design is a special case of an
RCB).
In the first instance, the DIFFERENCE in response between the two treatments is first computed. This
"reduces" the problem to a single-mean case, i.e., testing if the mean difference is zero. In JMP, use the
DOE->PowerSampleSize platform and select the single mean option. You must fill in the standard deviation
OF THE DIFFERENCE and the size of the difference in means that is of biological interest. The sample
size is the number of pairs required.
In the second instance, use the DOE->PowerSampleSize platform and select the k-means case. You will
need an estimate of the standard deviation of values within a treatment group AFTER ADJUSTING FOR
BLOCKS. One estimate of this is the Root MSE. The MSE is an estimate of the residual variation after adjusting
for the other effects (treatments, experimental units such as blocks, etc.) in the model. The square root of the MSE
is an estimate of the standard deviation after adjusting for blocks and other factors. Then select values for
the means such that the difference between the largest and smallest mean is of biological interest and place
the other means at the mid-point to be conservative. A further refinement is to specify a value in the "Extra
parameters" box [usually the number of blocks minus 1], but this is rarely necessary (see footnote 23).
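The first approach (work on the differences, then a one-sample t test) can be computed directly with the noncentral t distribution. The following Python sketch uses illustrative numbers, not values from the notes, and assumes scipy is available:

```python
# Sketch: power for a matched-pair design, computed on the within-pair DIFFERENCES.
import math
from scipy.stats import t as t_dist, nct

def paired_power(delta, sd_diff, n_pairs, alpha=0.05):
    """Power of a two-sided one-sample t test on the within-pair differences."""
    df = n_pairs - 1
    nc = delta / (sd_diff / math.sqrt(n_pairs))  # noncentrality parameter
    t_crit = t_dist.ppf(1 - alpha / 2, df)
    return nct.sf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)

# e.g. detect a mean difference of 5 when the sd OF THE DIFFERENCE is 9
n = 2
while paired_power(5, 9, n) < 0.80:
    n += 1
print(n)  # number of pairs required
```

Note that the standard deviation entered is that of the difference, exactly as the JMP single-mean option requires.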
6.18.5 Testing for block effects
One question that has often come to mind when doing an analysis with blocking is: we always
get a p-value for the "block", but does this tell us any information?
Tests for blocks are not examined. There is a subtle problem that invalidates the test for blocks: the
blocks were used in the experiment to reduce variation. Often these blocks are deliberately chosen, so where
is the randomization process in choosing blocks? What does it mean to say that blocks are representative of
some larger population of blocks; is there even such a thing as a larger population of blocks?
In short, testing for block effects is NOT something that is normally done. There have been some very
technical discussions of this in the statistical literature; please see me for more details.
Footnote 23: This is required to adjust for the loss in degrees of freedom of the error term due to estimating the block effects.
6.18.6 Presenting results for blocked experiment
I am a bit confused about blocking data and presenting data. In some examples,
we plot the original data to see if it requires blocking (i.e., much variation among plots). Because
there is, we decide to block. Then we estimate the marginal mean and se. When we do
this, do we present the means and se without the blocking, as this would be more representative
of the population? For the multiple comparison procedure, we would, of course, use the blocked
design as it is this data on which we are performing the analysis. The same applies when calculating
the power for this question. When entering prospective means, should I be entering the
means without blocking or after blocking? I would think it is without blocking, as this is the
mean that we are interested in.
The point of the graphs showing the profile over blocks is to see if the assumption of no block-treatment
interaction is tenable. It is NOT used to decide if blocking is appropriate. The design was blocked, and
must be analyzed as a blocked design. The first Analyze->Fit Y-by-X without specifying blocks is ONLY
for display purposes.
The marginal mean and se must be computed AFTER adjusting for blocks.
The sample means are the same regardless of whether a design is blocked or not blocked. All that happens is that
the residual variation is reduced by removing the effects of blocks.
6.18.7 What is a marginal mean?
Could you explain the origin of the term marginal means, what they are, and what marginal
refers to? Also, since the means reported in the Means ANOVA table and the Means and Std Dev
table are the same, I am not sure what makes them marginal means. Also, when you ask for
"marginal means and their standard errors", does it matter if this is from the Means and Std Dev
table or the Means ANOVA table?
The term Marginal Mean implies a mean taken over all other factors (whether explicit or implicit) in the
experiment. For example, in a single factor CRD, there is only a single explicit factor and the means for
each group are found by averaging over all other "factors" that were randomized in the experiment. In a
single factor RCB, there is a single factor, but there are likely blocking effects. A marginal mean will be
taken over all blocks in the experiment. This may or may not be a sensible thing to do, e.g. if blocks are
purposely chosen and don't reflect the natural distribution in the larger population. For example, recall the
density of fish vs. slope in streams. A researcher may deliberately choose 5 highly productive streams and 5
low-production streams even though only 10% of streams are naturally highly productive. In this example, a
simple marginal mean that gave equal weight to low- and high-production streams would not reflect what is
going on in the population, where the weights should be closer to 90% and 10% respectively.
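The stream example can be made concrete with a tiny calculation. The densities below are made up purely for illustration; only the 10%/90% weighting comes from the text:

```python
# Sketch: equal-weight marginal mean vs. a population-weighted mean when
# only 10% of streams are highly productive.
high = [30.0, 32.0, 28.0, 31.0, 29.0]  # hypothetical densities, productive streams
low = [10.0, 12.0, 11.0, 9.0, 13.0]    # hypothetical densities, low-production streams

mean_high = sum(high) / len(high)                      # 30.0
mean_low = sum(low) / len(low)                         # 11.0
simple_marginal = (mean_high + mean_low) / 2           # 50/50 weights -> 20.5
population_mean = 0.10 * mean_high + 0.90 * mean_low   # 10%/90% weights -> 12.9
print(simple_marginal, population_mean)
```

The equal-weight marginal mean (20.5) badly overstates the population-weighted value (12.9) because the sample deliberately over-represents productive streams.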
The raw sample means and estimated marginal means will often be equal. For example, in single factor
CRD designs, or when the design is balanced in more complex designs, they will always be equal. You will
see cases in two factor designs where this is not true. Similarly, the standard errors of "marginal means" are
often computed assuming a common standard deviation in all treatment groups; hence these could also differ
from the raw standard errors in each group, as the latter don't use the pooled standard deviation.
6.18.8 Multiple experimental units within a block?
I was confused about how one can have multiple experimental units for a treatment level within
a block. For example, how can two pigs in the same barn subjected to the same treatment be
independent of each other? I had the idea that a block was like a cluster and that all analyses
would use one averaged value from a block as in a cluster. In all the examples we dealt with
in class that were RCBs we only had one measurement of each treatment level in each block.
Can one really consider multiple e.u.s of the same treatment level within the same block as
independent samples?
It depends on how the experimental units are administered and how the animals react.
For example, suppose you have 4 pigs in a pen. If you inject hormone/control directly into two individual
pigs, it seems unlikely that the response of one pig affects the response of the other pig, and so you could have
independent responses. But if you place the hormone directly into feed that is shared by the pigs, then
the experimental unit is the pen and not the pig.
If food is a limited resource, then 2 pigs sharing a common pen may not be independent (even if injected
separately) because one pig may become dominant and limit the food to the other pig. But if food is relatively
abundant, then pigs sharing a pen may not compete and so could be treated as being independent (if injected
separately).
If a block is a week and you are shooting birds to sample for contaminants, then you could get multiple
birds of each species within each week. It seems unlikely that the response of one bird in a week affects the
response of the other birds also captured in the same week.
6.18.9 How does a block differ from a cluster?
How is a block different from a cluster sample?
A block differs from a cluster in a key way. A cluster has a group of units all subject to the same treatment
(e.g. fish within a tank and the chemical is added to the water). A block has different treatments applied to
different units (e.g. fish individually injected but held within the same tank). In both cases, you can do the
analysis on the average, i.e. average all fish within the tank that received the chemical, or average the fish
within the tank that received the same injection.
In the latter case, there is additional information if you use the individual fish within the same tank
receiving the same experimental treatment (the generalized RCB); the extra information allows you to assess
the assumption of block-treatment additivity, but adds nothing to the hypothesis test.
Chapter 7
Incomplete block designs
Contents
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
7.2 Example: Investigate differences in water quality . . . . . . . . . . . . . . . . . . . . 494
7.1 Introduction
Blocking (or stratification or pairing) is a fundamental aspect of statistics. The idea behind blocking is to
control for sources of variation that make it difficult to detect differences in the mean response between
treatments. For example, in the stream slope example, the density of fish will vary among streams due to
intrinsic differences among streams (e.g. different productivity) that are difficult to measure. Consequently,
a good experiment will examine all treatments (e.g. the different stream slopes) in all streams so that the
intrinsic differences in density among streams will cancel out when comparisons among treatments are
made.
In the examples in the earlier chapter, blocks were complete in that every block had every treatment
occurring at least once and every treatment occurred in every block at least once.
In some cases, blocks are incomplete, either by design (e.g. blocks are too small) or by accident (e.g.
missing values). Some care needs to be taken in the analysis of incomplete block designs, but a basic analysis
is straightforward with modern software.
Please refer to the earlier chapter on the analysis of complete block designs as the points made there
about examining assumptions are equally pertinent here.
7.2 Example: Investigate differences in water quality
Water quality monitoring studies often take the form of incomplete block designs. For example, the following
data represent TSS (total suspended solids) in water samples taken upstream of a development (the reference
sample), at the development (the mid-stream sample), or downstream of the development (the ds sample).
Samples are taken during storm events when water quality may be compromised by the development. Here
is a small set of data:
Location   Storm 1   Storm 2   Storm 3   Storm 4
Ref            -         -        25        20
Mid           51         -       100         -
DS           173       137       170       110

The - represents data that is missing. We will assume that the missing data are MCAR (Missing Completely
at Random), i.e. that missingness is unrelated to the value of TSS or any other measurable covariate in the
study. One way this could be violated is if missing values indicate that the TSS reading exceeded the tolerance
of the measurement device.
In many cases, some of the data may also be censored, i.e. < LDL or > UDL where LDL and UDL are
the lower and upper detection limits. If censored data are present, more advanced methods are available.
Water quality varies among the storm events in some unknown fashion, but it is thought that all locations
should be influenced in the same way. For example, events with large amounts of precipitation may increase
the TSS in all locations.
How should such data be analyzed? Looking at the raw data, it appears that water quality levels at the
DS site are about three times that at the Mid site; and in turn, the water quality at the mid site is about twice
that of the Ref site. However, a comparison of the simple average of the values in each location is an unfair
comparison because not all locations were measured on all storm events and the different averages would
compare different combinations of storm events.
An incomplete-block analysis takes into account the pattern of missing values. For example, a comparison
of the ref and mid locations should use the data from Storm 3; the comparison of the ref and ds locations
should use the storm 3 and 4 events; and the comparison of the mid and ds locations should look at the
storm 1 and 3 events.
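These pairwise comparisons can be read directly off the table of missing values. As an illustration only (a Python sketch, not part of the SAS analysis), the bookkeeping looks like this, using the small TSS table above:

```python
# TSS readings by location from the table above; missing storms are simply absent.
tss = {
    "ref": {3: 25, 4: 20},
    "mid": {1: 51, 3: 100},
    "ds":  {1: 173, 2: 137, 3: 170, 4: 110},
}

# A fair comparison of two locations uses only the storms observed at both.
for a, b in [("ref", "mid"), ("ref", "ds"), ("mid", "ds")]:
    shared = sorted(tss[a].keys() & tss[b].keys())
    print(a, "vs", b, "-> shared storms", shared)
```

This prints the shared storms for each pair: storm 3 for ref vs. mid, storms 3 and 4 for ref vs. ds, and storms 1 and 3 for mid vs. ds.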
The data is available in the water-quality.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. (Note that such a small set of data likely has very poor
power to detect anything but very large differences in water quality among the three locations; before
conducting such a study, please perform a power analysis to ensure that sufficient samples are taken.) The
data are imported into SAS in the usual way:

data water_quality;
   infile 'water-quality.csv' firstobs=2 dlm="," dsd missover;
   length location event $10.;
   input location $ event $ tss;
   logTSS = log(tss);
   format logTSS 7.2;
run;
The raw data are shown below:
Obs location event tss logTSS
1 ds storm 1 173 5.15
2 mid storm 1 51 3.93
3 ref storm 1 . .
4 ds storm 2 137 4.92
5 mid storm 2 . .
6 ref storm 2 . .
7 ds storm 3 170 5.14
8 mid storm 3 100 4.61
9 ref storm 3 25 3.22
10 ds storm 4 110 4.70
11 mid storm 4 . .
12 ref storm 4 20 3.00
In many cases where the data have a wide range and the ratio among values is of interest, a log-transformation
of the data is often preferred. A formula is used to create a column of the log-values (the
log function is under the transcendental option of the function groups). Note that the log function is the
natural logarithm (i.e. to base e) and not the common (i.e. to base 10) logarithm.
We computed the log(TSS) in the data set when we read in the data.
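Because software conventions for log() differ from calculator conventions, it is worth a quick numeric check. This short Python sketch (illustration only; Python's math module follows the same convention as SAS) shows the difference between the two bases:

```python
import math

tss = 100.0
natural = math.log(tss)     # base e, what log() computes in SAS and most packages
common = math.log10(tss)    # base 10, the "common" logarithm

print(round(natural, 3))    # 4.605
print(round(common, 3))     # 2.0
```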
The shorthand notation for the model is:
logTSS = Event Location
This model syntax is interpreted as saying that variation in readings of logTSS may be attributable to effects
due to different location and to different storm events.
A key assumption made in the analysis of data collected under a blocked design is that the relationship
among the treatments is the same among all blocks (i.e. no block-treatment interaction). In this case, this
assumption takes place at the log-level. If it is believed that the relationship among treatments differs among
blocks, then there is no simple way to analyze this experiment. Because of the importance of this assumption,
it is recommended that any data collection follow a generalized-block design where replicate
observations of some of the treatments take place in each block.²
We use Proc Mixed in SAS to fit the model:

ods graphics on;
proc mixed data=water_quality;
   title2 'Fixed block analysis (the intra-block analysis)';
   class location event;
   model logTSS = location event / ddfm=kr;
   lsmeans location / diff cl adjust=tukey;
   ods output tests3 =Mixed1tests;
   ods output lsmeans=Mixed1LSmeans;
   ods output diffs  =Mixed1Diffs;
run;
ods graphics off;
The hypothesis of interest is that:

H: μ(location ds) = μ(location mid) = μ(location ref)
A: not the above

where the terms refer to the mean logTSS at each of the three locations.
In this first analysis, blocks are treated as fixed effects. This is known as the intra-block analysis. The
programs automatically account for the missing values. Note that in some cases the design is known as
non-connected, and the analysis can fail; this typically happens with extreme numbers of missing values
(contact me for more details).
We start by looking at the F-test for location effects.
Effect      Num DF   Den DF   F Value   Pr > F
location         2        2     25.18   0.0382
event            3        2      0.96   0.5465
² Refer to Addelman, S. (1969). The Generalized Randomized Block Design. American Statistician, 23,
35-36. http://dx.doi.org/10.2307/2681737; and Gates, C. E. (1999). What really is experimental error in
block designs? American Statistician, 49, 362-363. http://dx.doi.org/10.2307/2684574
CHAPTER 7. INCOMPLETE BLOCK DESIGNS
The F-statistic is about 25 with a p-value of 0.0382. As this is somewhat smaller than α = 0.05, there is some
evidence of a difference in the mean logTSS among the three locations.
It is instructive to look at the estimated differences among the mean logTSS in the different locations:

location  _location  Estimate  Std Error  Adjustment     Adj P     Adj Low   Adj Upp
ds        mid          0.8247     0.2722  Tukey-Kramer   0.1659    -0.7790    2.4284
ds        ref          1.9100     0.2722  Tukey-Kramer   0.0358     0.3063    3.5137
mid       ref          1.0853     0.3254  Tukey-Kramer   0.1409    -0.8315    3.0021
As in previous chapters, a multiple-comparison procedure (the Tukey HSD procedure) should be used to
control the experimentwise error rate. Please consult earlier chapters for details.
The estimated difference in the mean logTSS between the ds and ref locations is 1.91 (SE 0.27). This
implies that the TSS at the ds location is estimated to be e^1.91 = 6.75 TIMES larger (on average) than at the
ref site. The se for the ratio is NOT found by simply taking the anti-log of the se on the log-scale. However,
by application of a technique called the delta-method, it is possible to show that the se of the anti-log of the
estimate is found as

SE_antilog = SE_log × e^estimate = 0.27 × 6.75 = 1.82

By taking the anti-logarithms of the confidence interval for the difference in mean logTSS, we find that we
are 95% confident that the ratio of TSS between the ds and ref sites is between e^0.31 = 1.36 and e^3.51 = 33.4 times
larger. Notice that while the confidence interval is symmetric on the log-scale, it is not symmetric on the
anti-log scale.
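The back-transformation can be verified numerically. The following Python sketch (an illustration of the delta-method calculation using the rounded values from the output above, not part of the SAS program) reproduces the ratio, its approximate se, and the anti-logged confidence limits:

```python
import math

est, se = 1.91, 0.27     # difference of mean logTSS (ds - ref) and its se
lcl, ucl = 0.31, 3.51    # Tukey-adjusted confidence limits on the log scale

ratio = math.exp(est)                  # estimated ratio of TSS: about 6.75
se_ratio = se * ratio                  # delta-method se: about 1.82
ci = (math.exp(lcl), math.exp(ucl))    # about (1.36, 33.4); note the asymmetry

print(round(ratio, 2), round(se_ratio, 2))
print(round(ci[0], 2), round(ci[1], 1))
```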
The confidence intervals are much wider than the usual ±2se because the total sample size is only 8, but
there are 6 parameters that are estimated, leaving only 2 df for the residual error. The confidence interval
multiplier with 2 df is considerably larger than the multiplier of 1.96 (or about 2) used when sample sizes
are large.

Note that the estimated differences above automatically adjust for the missing values and are NOT equal
to the differences in the raw means (see below).
The average logTSS across storm events can also be found:

Effect    location  Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
location  ds          4.9774     0.1409   2    35.33    0.0008   0.05  4.3712   5.5836
location  mid         4.1527     0.2329   2    17.83    0.0031   0.05  3.1504   5.1549
location  ref         3.0674     0.2329   2    13.17    0.0057   0.05  2.0651   4.0696
Notice that the estimated LSmeans or "population means" are different from the raw means. This is because
of the adjustment by the procedure for the pattern of missing values. The precision of each marginal mean
differs because of the differing number of samples collected at each location. The anti-logarithm of each
marginal mean would be interpreted as an estimate of the geometric mean TSS at each location.
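The connection between a mean on the log scale and the geometric mean can be seen directly. A small Python sketch (illustration only) using the four observed ds readings:

```python
import math

ds = [173.0, 137.0, 170.0, 110.0]   # the four TSS readings at the ds location

# The mean of the logs, then anti-logged, is exactly the geometric mean.
mean_log = sum(math.log(x) for x in ds) / len(ds)
geo_mean = math.exp(mean_log)

print(round(mean_log, 4))   # 4.9774
print(round(geo_mean, 1))   # 145.1
```

Because the ds location was measured in every storm, this simple mean of the logs matches the fixed-effects marginal mean of 4.9774 for ds in the table above; for ref and mid, the LSmeans adjust for the missing storms and do not equal the simple averages.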
Of course, the other assumptions made for any ANOVA need to be checked (i.e. equal variance among
treatment groups; independence of residuals; no outliers; normality of residuals; X measured without error,
etc.) as in previous chapters. Don't forget to look at the residual plots. Unfortunately, with such limited
data, there is likely to be very little power to detect anything but gross violations of the assumptions.
Final Comment: A more refined analysis would treat the storm events as random effects. The analysis
would proceed as above except that Events are declared as a random effect.
We again use Proc Mixed in SAS to fit the model. Notice that the location of the (random) block term
changes in SAS compared to when blocks are fixed.

ods graphics on;
proc mixed data=water_quality;
   title2 'Random block analysis - combined inter- and intra-block analysis';
   class location event;
   model logTSS = location / ddfm=kr;
   random event;
   lsmeans location / diff cl adjust=tukey;
   ods output tests3 =Mixed2tests;
   ods output lsmeans=Mixed2LSmeans;
   ods output diffs  =Mixed2Diffs;
run;
ods graphics off;
Ironically, in cases where blocks are incomplete, there are two sources of information about treatment
effects. The major part of the information comes from the intra-block analysis (done above). Some small
amount of additional information can be extracted (known as the inter-block information). By specifying
that blocks are a random effect (i.e. that you wish to extrapolate to events other than the observed
storms), it is possible to combine both analyses with modern software.

The hypothesis of interest is the same under the fixed and random block models.
The effect test uses more information and so indicates more evidence of an effect of location upon the
mean logTSS:
Effect      Num DF   Den DF   F Value   Pr > F
location         2     3.71     32.20   0.0045
Notice how the F-statistic and p-value change slightly for the test of no location effects compared to the
intra-block analysis (where blocks are fixed effects). This revised model extracts additional information (the
inter-block information) from the data that the first model ignored.
The estimated DIFFERENCES of the mean logTSS are slightly changed (no dramatic change) but the
standard errors are improved:³

location  _location  Estimate  Std Error  Adjustment     Adj P     Adj Low   Adj Upp
ds        mid          0.7479     0.2374  Tukey-Kramer   0.0791    -0.1295    1.6253
ds        ref          1.8870     0.2374  Tukey-Kramer   0.0040     1.0096    2.7645
mid       ref          1.1391     0.2891  Tukey-Kramer   0.0412     0.07068   2.2075
Similarly, there is little change in the estimated marginal means (over all storm events) but some
improvement in the estimated standard errors.

Effect    location  Estimate  Std Error  DF     t Value  Pr > |t|  Alpha  Lower    Upper
location  ds          4.9774     0.1376  4.9      36.16  <.0001    0.05   4.6214   5.3334
location  mid         4.2295     0.2135  4.98     19.81  <.0001    0.05   3.6801   4.7789
location  ref         3.0904     0.2135  4.98     14.48  <.0001    0.05   2.5410   3.6398
³ If there were many blocks, the standard error of the combined inter- and intra-block analysis could be
dramatically improved.
With modern software, the analysis of incomplete block designs is fairly straightforward. In some cases
you can run into problems if there are substantial missing data in a systematic pattern (the design is not
connected). Please consult a statistician for details on such models.
Chapter 8
Estimating an overall mean with subsampling
Contents
8.1 Average flagellum length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
8.1.1 Average-of-averages approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
8.1.2 Using the raw measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
8.1.3 Followup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
In some experiments, multiple measurements are taken from each experimental unit. These multiple
measurements are termed pseudo-replicates (Hurlbert, 1984), and should not be treated as independent
measurements when estimating the overall mean. If this pseudo-replication is ignored, then the standard error of
the mean is typically underreported (i.e. the reported se will underestimate the true uncertainty in the mean)
and the confidence intervals will be too narrow (i.e. the actual coverage will be (substantially) less than the
nominal level).
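The size of this understatement can be quantified. The Python sketch below (all sample sizes and variances are hypothetical) compares the naive se, which wrongly treats all n × m pseudo-replicates as independent, with the correct se based on the n experimental-unit means:

```python
n, m = 10, 5         # n experimental units, m pseudo-replicates each (balanced)
var_between = 9.0    # variance among experimental units (hypothetical)
var_within = 4.0     # variance among pseudo-replicates within a unit (hypothetical)

# Naive se: pretends all n*m measurements are independent.
se_naive = ((var_between + var_within) / (n * m)) ** 0.5

# Correct se: only n independent unit means; the between-unit variance
# is not reduced by taking more pseudo-replicates within a unit.
se_correct = (var_between / n + var_within / (n * m)) ** 0.5

print(round(se_naive, 3), round(se_correct, 3))   # the naive se is too small
```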
This chapter will demonstrate the proper way to estimate an overall mean in the presence of pseudo-replication.
There are two usual ways of proceeding:

- First find the averages of the pseudo-replicates for each experimental unit. This reduces the data to
  a single value for each experimental unit. Then apply the standard statistical methods (a one-sample
  t-test or t-interval) to the averaged data by taking the average and standard deviation of the averages
  and finding the standard error and confidence intervals.
- Fit a more complex model that recognizes (and adjusts for) the pseudo-replication.
The first option will give identical results to the second option in the case of balanced data (i.e. the same
number of pseudo-replicates for each experimental unit). In the case of unbalanced data, the first option
(the average of the averages) is only approximate, but may be close enough for practical purposes. The
key advantage of the second approach is that it is applicable in all cases (balanced or unbalanced) and also
provides more information (the relative variation among and within experimental units).
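The averaging step of the first option is easy to sketch outside SAS. The Python fragment below (with small made-up measurements, not the flagella data) handles unbalanced pseudo-replicates, since each experimental unit simply contributes its own average:

```python
# Each inner list holds the pseudo-replicates for one experimental unit;
# units need not have the same number of measurements (unbalanced data).
units = [[11.2, 11.6, 12.0], [16.5, 16.8], [18.3]]

# Step 1: reduce each unit to its own average.
unit_means = [sum(u) / len(u) for u in units]

# Step 2: apply standard one-sample methods to the unit averages.
n = len(unit_means)
grand_mean = sum(unit_means) / n
var = sum((m - grand_mean) ** 2 for m in unit_means) / (n - 1)
se = (var / n) ** 0.5

print(round(grand_mean, 3), round(se, 3))
```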
8.1 Average flagellum length

This example is based on work conducted in the laboratory of Prof. Lynn Quarmby at Simon Fraser
University (http://www.sfu.ca/mbb/People/Quarmby/). Her research focus is on the mechanism
by which cells shed their cilia (aka flagella) in response to stress, working with the unicellular alga
Chlamydomonas.

Microphotographs of the algae are taken, and the length of one or two flagella of each cell is measured.
This example is based on the data collected by a student working on this project during an Undergraduate
Student Research Award. The data have been jittered to protect confidentiality. The data tables are available
in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms.
The data is available in the file flagella1.csv. The data are read into SAS in the usual way:

data flagella;
   infile 'flagella1.csv' dlm=',' dsd missover firstobs=2;
   length cell $20.;
   input cell $ length1 length2;
run;

proc print data=flagella(obs=20);
   title2 'Part of the raw data';
run;
The first few lines of the raw data are shown below:
Estimate mean flagella length
Part of the raw data
Obs cell length1 length2
1 1 21.040000 .
2 2 16.550000 22.310000
3 3 9.000000 .
4 4 38.650000 27.210000
5 5 21.460000 .
6 6 24.300000 .
7 7 20.330000 13.530000
8 8 27.450000 .
9 9 18.060000 15.390000
10 10 23.680000 24.270000
11 11 19.170000 .
12 12 11.370000 16.040000
13 13 18.260000 .
14 14 25.380000 25.090000
15 15 3.650000 4.420000
16 16 19.950000 14.880000
17 17 15.080000 6.010000
18 18 26.340000 .
19 19 25.010000 .
20 20 20.090000 .
The experimental unit is the cell (each row of the data), and the pseudo-replicates are the multiple flagella
measured on each cell. Not every cell has two flagella measured because in some cases the flagellum was
hidden by other cells, broken, or not clearly visible in the microphotograph. A total of 61 cells had their
flagella measured.

The objective is to estimate the mean length of flagella for this variant of the cell line along with a
measure of precision (the se) and a confidence interval.

Notice that there is a large variation among individual cells, with some cells having much longer flagella
than other cells, and there is also variation within each cell. But if a cell tends to have longer flagella
than other cells, then both flagella also tend to be longer. This lack of independence is what makes a naive
analysis using all flagella lengths in a pooled sample inappropriate.
8.1.1 Average-of-averages approach
We start with the average-of-averages approach even though the data are not balanced (i.e. not every cell had
both flagella measured).
The average flagellum length for each cell is computed using the mean() function in a data step.¹
Notice that the mean() function automatically ignores any missing values.
data flagella;
set flagella;
avg_length = mean(length1,length2);
run;
proc print data=flagella (obs=10);
   title2 'Part of the raw data after avg taken';
run;
The first few lines of the raw data are shown below:
¹ For more complicated cases, Proc Transpose followed by Proc Means is often used. Send me an email for
more details.
Estimate mean flagella length
Part of the raw data after avg taken
Obs cell length1 length2 avg_length
1 1 21.040000 . 21.040000
2 2 16.550000 22.310000 19.430000
3 3 9.000000 . 9.000000
4 4 38.650000 27.210000 32.930000
5 5 21.460000 . 21.460000
6 6 24.300000 . 24.300000
7 7 20.330000 13.530000 16.930000
8 8 27.450000 . 27.450000
9 9 18.060000 15.390000 16.725000
10 10 23.680000 24.270000 23.975000
Now the data have been reduced to a single measurement for each experimental unit.
The standard one-sample t-test methods are now used to estimate the overall mean, the se of the mean,
and a 95% confidence interval.
Proc Univariate gives us the information needed.
proc univariate data=flagella cibasic;
   title2 'Estimate the mean using average of averages';
   var avg_length;
   histogram avg_length;
run;
Selected portions of the output from Proc Univariate are shown on the next page.
Estimate mean flagella length
Estimate the mean using average of averages
The UNIVARIATE Procedure
Variable: avg_length
Moments
N 61 Sum Weights 61
Mean 20.2304918 Sum Observations 1234.06
Std Deviation 6.09045118 Variance 37.0935956
Skewness -0.5001532 Kurtosis -0.0243775
Uncorrected SS 27191.2565 Corrected SS 2225.61574
Coeff Variation 30.1053046 Std Error Mean 0.77980237
The UNIVARIATE Procedure
Variable: avg_length
Basic Confidence Limits Assuming Normality
Parameter Estimate 95% Confidence Limits
Mean 20.23049 18.67065 21.79033
Std Deviation 6.09045 5.16903 7.41473
Variance 37.09360 26.71882 54.97825
The plots identify one potential outlier with small values for the mean flagella length in that cell, but
given the large sample size, its influence is minimal. [You could now delete this outlier and recompute the
mean to see how much it changes.]

The estimated mean flagellum length for the cell variant is 20.2 μm with an se of 0.78 μm. The 95%
confidence interval for the true mean in the population is between 18.7 and 21.8 μm.
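The interval reported by Proc Univariate is the standard t-interval computed from the 61 cell averages. As a cross-check, here is a Python sketch using the summary statistics from the output; the multiplier 2.0003 is the 97.5th percentile of a t distribution with 60 df (it could also be computed with scipy.stats.t.ppf):

```python
n = 61           # number of cells (experimental units)
mean = 20.2305   # mean of the cell averages, from the output above
se = 0.7798      # standard error of the mean, from the output above

t_mult = 2.0003  # t multiplier for 95% confidence with n - 1 = 60 df
lower = mean - t_mult * se
upper = mean + t_mult * se

print(round(lower, 2), round(upper, 2))   # 18.67 21.79, matching Proc Univariate
```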
8.1.2 Using the raw measurements

The analysis is repeated using a statistical model to account for the pseudo-replicates.

The development of such statistical models must specify elements that account for the population
parameter (the overall mean) and the two sources of variation present in the individual measurements. In this
case the two sources of variation are the cell-to-cell variation and the within-cell variation.

The statistical model (in standard shorthand notation) is

Length = Mean Cell(R)

The Length term is read as "individual measurements of length vary according to the following sources". The
Mean term (often not specified in statistical packages because it is implicit) is the parameter of interest -
the overall mean. The Cell(R) term represents the cell-to-cell variation and is known as a random effect
(i.e. you are not particularly interested in these 61 cells, and want to generalize to the entire population). The
within-cell variation is the lowest level in the experiment and is implicit (i.e. does not appear explicitly in
the model).
We need to stack the original data so that all length measurements are in the same column. The standard
data step is used to stack the data. In more complicated cases, Proc Transpose will be useful.
data stack_flagella;
set flagella;
length = length1; output;
length = length2; output;
drop length1 length2 avg_length;
run;
proc print data=stack_flagella(obs=20);
   title2 'Stacked data';
run;
The first few lines of the stacked data are shown on the next page.
Estimate mean flagella length
Stacked data
Obs cell length
1 1 21.040000
2 1 .
3 2 16.550000
4 2 22.310000
5 3 9.000000
6 3 .
7 4 38.650000
8 4 27.210000
9 5 21.460000
10 5 .
11 6 24.300000
12 6 .
13 7 20.330000
14 7 13.530000
15 8 27.450000
16 8 .
17 9 18.060000
18 9 15.390000
19 10 23.680000
20 10 24.270000
The model is then specified. Proc Mixed is used to specify the model.

proc mixed data=stack_flagella;
   title2 'Analyze the raw measurements';
   class cell;
   model length = ;   /* intercept is implicit */
   random cell;
   estimate 'Overall mean' intercept 1 / cl;
run;
Selected portions of the output are shown on the next page:
Estimate mean flagella length
Analyze the raw measurements
The Mixed Procedure
Covariance Parameter Estimates

Cov Parm    Estimate
cell         31.0678
Residual      9.1424
The Mixed Procedure
Estimates

Label         Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
Overall mean   20.1647     0.7867  60    25.63  <.0001    0.05   18.5911  21.7383
The estimated mean length is 20.2 μm (se 0.79 μm), which is very similar to the results of the average-of-averages
in the previous section. The 95% confidence interval for the true population mean is 18.6 to 21.7
μm, which is also similar to the previous results.

This analysis also provides information on the relative sizes of the two sources of variation. The cell-to-cell
variance is estimated as 31.1 while the within-cell variance is estimated to be 9.1. The total variation in
flagella lengths is 40.2. The cell-to-cell variation is roughly 3 times larger than the within-cell variation.
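This decomposition can be expressed as proportions. A Python sketch (using the two variance component estimates from the Proc Mixed output above):

```python
var_cell = 31.0678    # cell-to-cell variance component
var_within = 9.1424   # within-cell (residual) variance component

total = var_cell + var_within    # total variation: about 40.2
prop_cell = var_cell / total     # share of variation among cells
ratio = var_cell / var_within    # cell-to-cell relative to within-cell

print(round(total, 1), round(prop_cell, 2), round(ratio, 1))
```

About 77% of the total variation in individual flagellum lengths is attributable to differences among cells rather than between the two flagella within a cell.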
The residual plot again identifies a potential outlier in the data.
8.1.3 Followup
Both methods gave similar results in this example and will tend to give similar values when the sample sizes
are reasonably large and the variation in the number of pseudo-replicates is not that large. The results from the two
methods could be substantially different if, for example, some cells had 1000 flagella measured while other
cells had only a single flagellum measured.

Natural extensions to this project would be to compare the mean lengths among different variants of the
cells to see if the means differ. It would be of interest in future research to see if the ratio of among-cell to
within-cell variation is consistent among the different cell variants.
Chapter 9
Single Factor - Sub-sampling and pseudo-replication
Contents
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
9.2 Example - Fat levels in fish - balanced data in a CRD . . . . . . . . . . . . . . . . 514
9.2.1 Analysis based on sample means . . . . . . . . . . . . . . . . . . . . . . . . . . 516
9.2.2 Analysis using individual values . . . . . . . . . . . . . . . . . . . . . . . . . . 519
9.3 Example - fat levels in fish - unbalanced data in a CRD . . . . . . . . . . . . . . . 524
9.4 Example - Effect of UV radiation - balanced data in RCB . . . . . . . . . . . . . . . . 525
9.4.1 Analysis on sample means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
9.4.2 Analysis using individual values . . . . . . . . . . . . . . . . . . . . . . . . . . 531
9.5 Example - Monitoring Fry Levels - unbalanced data with sampling over time . . . . . 535
9.5.1 Some preliminary plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
9.5.2 Approximate analysis of means . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
9.5.3 Analysis of raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
9.5.4 Planning for future experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
9.6 Example - comparing mean flagella lengths . . . . . . . . . . . . . . . . . . . . . . 547
9.6.1 Average-of-averages approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
9.6.2 Analysis on individual measurements . . . . . . . . . . . . . . . . . . . . . . . . 562
9.6.3 Followup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
9.7 Final Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
9.1 Introduction
A common feature of experimental designs that is often not recognized is that of sub-sampling or
pseudo-replication. In many cases, the sub-samples or pseudo-replicates are treated as independent experimental
units when they are not.

This typically leads to inflated test statistics, deflated p-values, and incorrect conclusions about the
strength of evidence against the null hypothesis. This typically results in a false positive or Type I error.

A simple example of a sub-sampling design is a tank experiment. Multiple fish are placed within a tank.
A chemical is added to the tank and the subsequent growth of the fish is measured. The experimental unit is the
tank; the observational unit is the fish. The fish within a tank are not independent of each other.
It turns out that the basic idea of the analysis is to average over sub-samples and analyze the average
only. For balanced data, i.e., the same number of sub-samples in each experimental unit, this approach is
identical to the more formal approach which models the sub-samples. For unbalanced data, the averaging
approach is only an approximate analysis.
In many cases, it is possible to arrange the experiment in many different ways, i.e., fewer experimental
units with more sub-samples, or more experimental units with fewer sub-samples. The optimal design (best
precision for a fixed cost) can be determined based on the ratio of the variation among experimental units to
the variation among sub-samples.
This chapter will illustrate the analysis of a CRD and an RCB with sub-sampling. More complex models
involving sub-sampling are possible. If the sub-sampling takes place at the lowest level of the experiment,
the method of averaging will always work. If the sub-sampling takes place at a different level in the
experiment, the model can become quite complex and the analysis quite messy!
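For planning purposes, the classical two-stage sampling result is that the optimal number of sub-samples per experimental unit is m = sqrt((c1/c2) × (v_sub/v_eu)), where c1 and c2 are the costs of an experimental unit and of a sub-sample, and v_eu and v_sub are the two variance components. The Python sketch below uses hypothetical costs and variances to illustrate the calculation:

```python
import math

cost_eu = 100.0   # cost of setting up one experimental unit (hypothetical)
cost_sub = 4.0    # cost of one sub-sample (hypothetical)
var_eu = 9.0      # variance among experimental units (hypothetical)
var_sub = 36.0    # variance among sub-samples within a unit (hypothetical)

# Classical optimal number of sub-samples per experimental unit:
# minimizes the se of the overall mean for a fixed total cost.
m_opt = math.sqrt((cost_eu / cost_sub) * (var_sub / var_eu))

print(round(m_opt, 1))   # 10.0 sub-samples per unit for these inputs
```

Expensive experimental units and large sub-sampling variation push toward more sub-samples per unit; cheap units or large among-unit variation push toward more experimental units instead.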
9.2 Example - Fat levels in fish - balanced data in a CRD

An investigator wishes to determine if the mean fat levels differ among four species of fish. She selects three
fish of each species, and takes three samples of flesh from each fish to determine the fat content.
Here are the raw data:
Species Fish Sample Fat
a 1 1 11.2
a 1 2 11.6
a 1 3 12.0
a 2 1 16.5
a 2 2 16.8
a 2 3 16.1
a 3 1 18.3
a 3 2 18.7
a 3 3 19.0
b 1 1 14.1
b 1 2 13.8
b 1 3 14.2
b 2 1 19.0
b 2 2 18.5
b 2 3 18.2
b 3 1 11.9
b 3 2 12.4
b 3 3 12.0
c 1 1 15.3
c 1 2 15.9
c 1 3 16.0
c 2 1 19.5
c 2 2 20.1
c 2 3 19.3
c 3 1 16.5
c 3 2 17.2
c 3 3 16.9
d 1 1 7.3
d 1 2 7.8
d 1 3 7.0
d 2 1 8.9
d 2 2 9.4
d 2 3 9.3
d 3 1 11.2
d 3 2 10.9
d 3 3 10.5
The data is available in the fat.csv file in the Sample Program Library at http://www.stat.sfu.
ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data fish;
infile 'fat.csv' dlm=',' dsd missover firstobs=2;
input species $ fish sample fat;
run;
The first few records are:
Obs species fish sample fat
1 a 1 1 11.2
2 a 1 2 11.6
3 a 1 3 12.0
4 a 2 1 16.5
5 a 2 2 16.8
6 a 2 3 16.1
7 a 3 1 18.3
8 a 3 2 18.7
9 a 3 3 19.0
10 b 1 1 14.1
The hypotheses of interest are:
H: μ_a = μ_b = μ_c = μ_d
A: not all mean fat levels are equal, that is, at least one group's mean fat level differs from the rest
where μ_a, . . . , μ_d are the mean fat levels for species a, . . . , d.
9.2.1 Analysis based on sample means
The usual cure for sub-sampling is to analyze the averages over the sub-samples.
We compute the averages by sorting the data and then using Proc Means to compute the mean over the
sub-samples:
proc sort data=fish;
by species fish;
run;

proc means data=fish noprint; /* compute the average over the sub-samples for each fish */
by species fish;
var fat;
output out=mean_fat mean=mean_fat;
run;
The first few averages are:
Obs species fish mean_fat
1 a 1 11.6000
2 a 2 16.4667
3 a 3 18.6667
4 b 1 14.0333
5 b 2 18.5667
6 b 3 12.1000
7 c 1 15.7333
8 c 2 19.6333
9 c 3 16.8667
10 d 1 7.3667
Once the averages are computed, the analysis can be done using Proc Glm or Proc Mixed using a single-
factor CRD ANOVA model, i.e.
Mean_fat = Species
I again prefer to use Proc Mixed because of the nicer output available from Proc Mixed; you will have to use
the pdmixed800.sas macro to get the joined line plots:
ods graphics on;
proc mixed data=mean_fat plots=all;
title2 'Analysis using averages - balanced';
class species;
model mean_fat=species / ddfm=kr;
lsmeans species / diff adjust=tukey cl;
ods output tests3 =Mixed1Tests;
ods output lsmeans=Mixed1LSmeans;
ods output diffs =Mixed1Diffs;
run;
ods graphics off;
c 2012 Carl James Schwarz 517 December 21, 2012
CHAPTER 9. SINGLE FACTOR - SUB-SAMPLING AND
PSEUDO-REPLICATION
The following results are produced:
Effect    Num DF  Den DF  F Value  Pr > F
species   3       8       4.91     0.0321

Effect   species  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
species  a        15.5778   1.6120          8    9.66    <.0001    0.05   11.8604  19.2952
species  b        14.9000   1.6120          8    9.24    <.0001    0.05   11.1826  18.6174
species  c        17.4111   1.6120          8   10.80    <.0001    0.05   13.6937  21.1285
species  d         9.1444   1.6120          8    5.67    0.0005    0.05    5.4271  12.8618

Effect   species  _species  Estimate  Standard Error  Adjustment  Adj P   Alpha  Adj Low  Adj Upp
species  a        b          0.6778   2.2798          Tukey       0.9902  0.05   -6.6228   7.9784
species  a        c         -1.8333   2.2798          Tukey       0.8508  0.05   -9.1339   5.4673
species  a        d          6.4333   2.2798          Tukey       0.0855  0.05   -0.8673  13.7339
species  b        c         -2.5111   2.2798          Tukey       0.6986  0.05   -9.8117   4.7895
species  b        d          5.7556   2.2798          Tukey       0.1299  0.05   -1.5450  13.0562
species  c        d          8.2667   2.2798          Tukey       0.0277  0.05    0.9661  15.5673

Effect=species Method=Tukey(P<0.05) Set=1
Obs  species  Estimate  Standard Error  Alpha  Lower    Upper    Letter Group
1    c        17.4111   1.6120          0.05   13.6937  21.1285  A
2    a        15.5778   1.6120          0.05   11.8604  19.2952  AB
3    b        14.9000   1.6120          0.05   11.1826  18.6174  AB
4    d         9.1444   1.6120          0.05    5.4271  12.8618   B
The F-statistic for the hypothesis of no difference between the mean levels of fat among species is 4.91
with a p-value of 0.0321.
The estimated mean fat level for species a is 15.58 with an estimated standard error of 1.61.
The estimated difference in the mean fat levels between species a and species b is 0.678 with an estimated
standard error of 2.28.
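As a cross-check on this output, the same F-statistic and standard error can be recomputed by hand from the raw data. The sketch below (in Python, purely as an illustration; the course's analysis is done in SAS Proc Mixed) applies the "cure" directly: average over the sub-samples, then run an ordinary single-factor CRD ANOVA on the fish averages.

```python
# Hand re-computation of the averages analysis for the fat data.
# Illustrative cross-check only; the text's analysis uses SAS Proc Mixed.
fat = {  # species -> three fish, each with three sub-samples
    "a": [[11.2, 11.6, 12.0], [16.5, 16.8, 16.1], [18.3, 18.7, 19.0]],
    "b": [[14.1, 13.8, 14.2], [19.0, 18.5, 18.2], [11.9, 12.4, 12.0]],
    "c": [[15.3, 15.9, 16.0], [19.5, 20.1, 19.3], [16.5, 17.2, 16.9]],
    "d": [[ 7.3,  7.8,  7.0], [ 8.9,  9.4,  9.3], [11.2, 10.9, 10.5]],
}

# Step 1: the "cure" for sub-sampling - average over the sub-samples.
fish_means = {sp: [sum(f) / len(f) for f in fish] for sp, fish in fat.items()}

# Step 2: ordinary single-factor CRD ANOVA on the fish averages.
k = len(fish_means)                # 4 species
n = len(fish_means["a"])           # 3 fish (experimental units) per species
grand = sum(sum(v) for v in fish_means.values()) / (k * n)

ss_between = n * sum((sum(v) / n - grand) ** 2 for v in fish_means.values())
ss_within = sum((y - sum(v) / n) ** 2
                for v in fish_means.values() for y in v)

F = (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))
se_lsmean = (ss_within / (k * (n - 1)) / n) ** 0.5

print(round(F, 2))          # 4.91, matching the Proc Mixed output
print(round(se_lsmean, 2))  # 1.61, the SE of each species mean
```

The agreement with the SAS output (F = 4.91 on 3 and 8 df, SE of each LSmean = 1.61) confirms that, for balanced data, the analysis of averages carries all the information needed for the species comparison.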
The model diagnostic plots do not indicate any problems with the fit.
9.2.2 Analysis using individual values
To analyze the individual data values, a statistical model must be developed. As before, the statistical model
must include terms corresponding to the treatment structure (Species); the experimental unit structure (it is
now necessary to differentiate among the fish measured for each species and the replicate samples from each
fish); and the randomization structure (again, because of complete randomization at every possible turn,
these terms do not exist).
If the raw data are examined, you will notice that the fish are NOT labelled uniquely, i.e., there is one fish
numbered 1 for each species even though these are NOT the same fish. In general this is bad form - try to
use unique codes for separate experimental units. However, this type of numbering is quite common and a
specialized syntax for models has been developed for these cases.
In the simplied syntax the model is written as:
Fat = Species Fish(Species)(R)
The first term represents the treatment structure (Species). The second term represents the experimental
unit (Fish) to differentiate it from the observational unit (sub-samples). The term Fish(Species) is called a
nested effect. A nested effect is one whose levels change as the levels of other factors change. In this
case, fish 1 of Species a is clearly a different fish than fish 1 of Species b, even though they are both numbered
1. Because it is an experimental unit, it is a random effect, the (R) portion of the term. The reason that
experimental units are random effects is that if the experiment were to be repeated, different fish would be
used and we don't wish to limit inferences to these particular fish sampled in this particular experiment.
If the fish are individually labelled (which is always a good idea), the model syntax has a simpler form:
Fat = Species Fish(R)
and we leave it up to the computer package to figure out the nesting structure as needed.
There is no explicit term for the observational unit (the sub-samples) - this is subsumed into the residual
variation.
We again use Proc Mixed and specify the nested term representing the individual fish on the Random
statement. If each fish has been assigned a unique label (which is good practice), you could simply
specify Random Fish.id; and SAS would figure out the nesting. Don't forget that you will have to use the
pdmixed800.sas macro to get the joined line plots:
ods graphics on;
proc mixed data=fish plots=all;
title2 'Analysis using raw data using MIXED - balanced';
/* Notice in MIXED random effects do NOT appear in model statement */
/* If you create a unique label for each fish, you could also specify
   the random effect as simply fish.id */
class species fish sample;
model fat=species / ddfm=kr;
random fish(species);
lsmeans species / diff cl adjust=tukey;
ods output tests3 =Mixed2Tests;
ods output lsmeans=Mixed2LSmeans;
ods output diffs =Mixed2Diffs;
ods output covparms=Mixed2CovParms;
run;
ods graphics off;
The following output is obtained for the tests on the FIXED effect of species:
Effect    Num DF  Den DF  F Value  Pr > F
species   3       8       4.91     0.0321

Effect   species  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
species  a        15.5778   1.6120          8    9.66    <.0001    0.05   11.8604  19.2952
species  b        14.9000   1.6120          8    9.24    <.0001    0.05   11.1826  18.6174
species  c        17.4111   1.6120          8   10.80    <.0001    0.05   13.6937  21.1285
species  d         9.1444   1.6120          8    5.67    0.0005    0.05    5.4271  12.8618

Effect   species  _species  Estimate  Standard Error  Adjustment  Adj P   Alpha  Adj Low  Adj Upp
species  a        b          0.6778   2.2798          Tukey       0.9902  0.05   -6.6228   7.9784
species  a        c         -1.8333   2.2798          Tukey       0.8508  0.05   -9.1339   5.4673
species  a        d          6.4333   2.2798          Tukey       0.0855  0.05   -0.8673  13.7339
species  b        c         -2.5111   2.2798          Tukey       0.6986  0.05   -9.8117   4.7895
species  b        d          5.7556   2.2798          Tukey       0.1299  0.05   -1.5450  13.0562
species  c        d          8.2667   2.2798          Tukey       0.0277  0.05    0.9661  15.5673

Effect=species Method=Tukey(P<0.05) Set=1
Obs  species  Estimate  Standard Error  Alpha  Lower    Upper    Letter Group
1    c        17.4111   1.6120          0.05   13.6937  21.1285  A
2    a        15.5778   1.6120          0.05   11.8604  19.2952  AB
3    b        14.9000   1.6120          0.05   11.1826  18.6174  AB
4    d         9.1444   1.6120          0.05    5.4271  12.8618   B
Notice that there usually is NO test for the effect of fish(species). As this is an experimental unit, the
hypothesis of no effect¹ really does not make much sense: experimental units ALWAYS differ in their re-
sponse to the same treatment. Many packages do not differentiate between treatment effects and experimental
unit effects, and some packages will give some output which is not appropriate (e.g. the F-statistic for the
fish(species) term).
The F-statistic and p-value for the test of no difference in the mean fat levels among species are identical
to those obtained previously. You will also find that the population marginal means (the LSmeans) and
estimated differences among the species means are identical to the previous results.
There is some useful information in the fish(species) term. This can be used to see the relative sizes of
variation among fish and among samples within fish. These variance components can be determined from
the following output:
Cov Parm       Estimate
fish(species)  7.7550
Residual       0.1233
This indicates the variance among fish (σ²_fish) is 7.755; the variance among samples within a fish
(σ²_sample) is 0.123. The total variation is 7.878. Over 98% of the variation in the data is among fish, rather
than among samples within a fish. This makes sense - we would expect that three samples taken from one
fish would be much more similar than samples taken from two different fish.
How can this information be used to plan for another experiment?
It turns out that the precision of the experiment is a function of

σ²_fish / n_fish + σ²_sample / (n_fish × n_sample/fish)

The smaller the value of this function, the better the experiment. You can set up a spreadsheet to
examine the tradeoff between sampling more fish (increase the value of n_fish and decrease the value of
n_sample/fish) or taking more samples per fish (increase the value of n_sample/fish and decrease the value
of n_fish) while keeping the total number of samples (n_fish × n_sample/fish) constant. For example, the
following results are obtained:
7.755  sigma^2(fish)
0.123  sigma^2(residual)

n_fish  n_sample/fish  Total samples per species  "Precision"
12      1              12                         0.657
6       2              12                         1.303
4       3              12                         1.949
3       4              12                         2.595
2       6              12                         3.888
1       12             12                         7.765

¹ More specifically, that the variance of the measurements among fish is zero.
In this case, σ²_fish is much larger than σ²_sample. The best option, ignoring costs, is to sample as many
fish as possible and only take one sample from each fish.
However, this ignores the costs of obtaining a new fish vs. the cost of taking an additional sample from
the same fish. Usually, the cost of obtaining a new fish is much larger than the cost of taking another sample
from the same fish. How can this cost information be used?
The easiest way is to set up an Excel spreadsheet where the costs of the various options are specified, a
total cost function is determined, and then the variables n_fish and n_sample/fish are manipulated using the
SOLVER tool to minimize the precision function while keeping the cost fixed at your budget level.
An analytical solution is also available. It turns out that the optimal number of sub-samples is found
as:

n_sample = sqrt( (c_fish / c_sample) × (σ²_sample / σ²_fish) )

This formula shows that it is the ratios of costs and of variances that are important; that more sub-samples
are taken if the cost of a new fish is large relative to the cost of sampling; and that more sub-samples are taken
if the variance of the sub-samples is large relative to the variance among fish. Ignoring costs (i.e., if the cost
ratio is 1), then unless the variation of the sub-samples is at least as large as the variation among experimental
units, it never pays to sub-sample.
Please refer to the Excel spreadsheet called fat-optimize.xls in the Sample Program Library available
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms for an example of
this.
The model diagnostic plots do not indicate any problems with the fit.
9.3 Example - fat levels in fish - unbalanced data in a CRD
In some cases, the number of sub-samples available differs among the experimental units.
In this case, the method of analyzing the averages will not be exact, but will only give an approximate
answer.
The second method of fitting the model including the nested terms proceeds exactly in the same fashion.
It turns out that in some cases, no exact test can be constructed, but an approximate F-statistic (called a
pseudo-F-statistic) can be constructed. These approximate tests may have fractional degrees of freedom -
an oddity, but mathematically valid.
Additionally, in the case of unbalanced data, not all the LSmeans have the same precision, nor do esti-
mates of the differences between the species have the same precision.
You should repeat the above analyses after randomly deleting some of the data.
9.4 Example - Effect of UV radiation - balanced data in RCB
An experiment was conducted to investigate the effects of different amounts and types of UV radiation upon
the subsequent growth of fish.
This experiment had a single factor with three levels denoted:
Control - which received normal amounts of sunlight
UVA - which received additional UVA radiation over the control
UVAB - which received additional UVA and UVB radiation over the control
A total of 9 flumes were placed across a stream into blocks of 3 flumes. Within each block of 3 flumes,
the UV levels were randomly assigned to the three flumes.
Five individually marked fish were placed within each flume.
A schematic of the experimental design is:
After about 150 days, several variables were measured on each fish; this example will look at the weight
gain of each fish.
Here are the raw data:
Block Treatment Flume Fish Weight Gain (g)
1 Control 1 L< 1.1
1 Control 1 LF 1.3
1 Control 1 LR 1.9
1 Control 1 RR 0.9
1 Control 1 RV 2.1
1 UVAB 2 L< 1.8
1 UVAB 2 LF 1.7
1 UVAB 2 LR 1.8
1 UVAB 2 RF 2
1 UVAB 2 RR 1.8
1 UVA 3 LF 1.2
1 UVA 3 LR 2.7
1 UVA 3 RF 3.2
1 UVA 3 RR 1.7
1 UVA 3 RV 2.2
2 Control 4 LF 3.5
2 Control 4 LR 3
2 Control 4 RF 3.8
2 Control 4 RR 1.7
2 Control 4 RV 2.8
2 UVAB 5 LF 0.6
2 UVAB 5 LR 0.4
2 UVAB 5 RF 0.5
2 UVAB 5 RR 1.2
2 UVAB 5 RV 0.4
2 UVA 6 LF 1.4
2 UVA 6 LR 2.1
2 UVA 6 RF 0.9
2 UVA 6 RR 2.4
2 UVA 6 RV 2.1
3 Control 7 LF 4
3 Control 7 LR 4.6
3 Control 7 RF 2.6
3 Control 7 RR 2.7
3 Control 7 RV 4.7
3 UVAB 8 L< 1.2
3 UVAB 8 LF 1.8
3 UVAB 8 LR 1.1
3 UVAB 8 RF 1
3 UVAB 8 RR 1.5
3 UVA 9 L< 2.9
3 UVA 9 LF 3
3 UVA 9 LR 3.1
3 UVA 9 RF 3.2
3 UVA 9 RR 3.6
The data is available in the uvexp.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data gain;
infile 'uvexp.csv' dlm=',' dsd missover firstobs=2;
input block $ trt $ flume $ fishid $ gain;
run;
The first few records are:
Obs block trt flume fishid gain
1 1 Control 1 L< 1.1
2 1 Control 1 LF 1.3
3 1 Control 1 LR 1.9
4 1 Control 1 RR 0.9
5 1 Control 1 RV 2.1
6 1 UVAB 2 L< 1.8
7 1 UVAB 2 LF 1.7
8 1 UVAB 2 LR 1.8
9 1 UVAB 2 RF 2.0
10 1 UVAB 2 RR 1.8
The treatment structure is a simple single factor design with three levels (Control, UVA, UVAB).
The experimental unit structure has two components. First, the experimental units are the flumes, as the
treatment is applied to the entire flume and not to individual fish. These are arranged into blocks of three
flumes. Each block is complete, i.e., each block contains all treatments. Note that the flumes were numbered
so that the flume in each block assigned to the Control treatment had the lowest number, the flume assigned
to the UVAB treatment had the next lowest number, etc.; this makes the data appear to be non-random, but
the experiment was randomized to flumes within each block.
The observational units are the fish within each flume.
The randomization occurred at two levels. First, the treatments were completely randomized to flumes
within each block. Second, the fish were completely randomized to each flume.
This experiment is an example of sub-sampling as the observational unit is different from the experimen-
tal unit.
The hypotheses of interest are:
H: Mean weight gain is equal in all treatment groups
A: At least one treatment group has a different mean weight gain.
or (in symbols):
H: μ_Control = μ_UVA = μ_UVAB
A: at least one mean differs.
Because the design is balanced, the analysis can be performed either on the averages of the five fish
within each flume or on the individual fish.
9.4.1 Analysis on sample means
We compute the averages by sorting the data and then using Proc Means to compute the mean over the
sub-samples:
proc sort data=gain; by flume; run;

proc means data=gain noprint;
by flume;
var gain;
output out=mean_gain mean=mgain;
id block trt; /* keep these variables with the dataset */
run;
The first few averages are:
Obs flume block trt mgain
1 1 1 Control 1.46
2 2 1 UVAB 1.82
3 3 1 UVA 2.20
4 4 2 Control 2.96
5 5 2 UVAB 0.62
6 6 2 UVA 1.78
7 7 3 Control 3.72
8 8 3 UVAB 1.32
9 9 3 UVA 3.16
After summarizing to the flume level, this experiment looks like a randomized complete block design.
Once the averages are computed, the analysis can be done using Proc Glm or Proc Mixed using a single-
factor RCB ANOVA model, i.e.
Mean_gain = Block Trt
I again prefer to use Proc Mixed because of the nicer output available from Proc Mixed; you will have to use
the pdmixed800.sas macro to get the joined line plots:
ods graphics on;
proc mixed data=mean_gain plots=all;
title3 ;
class block trt;
model mgain = block trt / ddfm = satterth;
lsmeans trt / pdiff cl adjust=tukey;
ods output tests3 =Mixed1Tests;
ods output lsmeans=Mixed1LSmeans;
ods output diffs =Mixed1Diffs;
run;
ods graphics off;
This gives:
Effect  Num DF  Den DF  F Value  Pr > F
block   2       4       1.30     0.3681
trt     2       4       2.65     0.1852

Effect  trt      Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
trt     Control  2.7133    0.4702          4   5.77     0.0045    0.05    1.4079   4.0187
trt     UVA      2.3800    0.4702          4   5.06     0.0072    0.05    1.0746   3.6854
trt     UVAB     1.2533    0.4702          4   2.67     0.0561    0.05   -0.05207  2.5587

Effect  trt      _trt  Estimate  Standard Error  Adjustment  Adj P   Alpha  Adj Low  Adj Upp
trt     Control  UVA   0.3333    0.6649          Tukey       0.8747  0.05   -2.0364  2.7031
trt     Control  UVAB  1.4600    0.6649          Tukey       0.1857  0.05   -0.9097  3.8297
trt     UVA      UVAB  1.1267    0.6649          Tukey       0.3129  0.05   -1.2431  3.4964

Effect=trt Method=Tukey(P<0.05) Set=1
Obs  trt      Estimate  Standard Error  Alpha  Lower     Upper   Letter Group
1    Control  2.7133    0.4702          0.05    1.4079   4.0187  A
2    UVA      2.3800    0.4702          0.05    1.0746   3.6854  A
3    UVAB     1.2533    0.4702          0.05   -0.05207  2.5587  A
There is no evidence, based on this experiment, that the population mean weight gain differs among the
treatments. NOTE: the conclusion is still about the population means of the treatment groups despite
the fact that the analysis was done on sample means rather than individual data points.
The model diagnostic plots do not indicate any problems with the fit.
9.4.2 Analysis using individual values
A model must be built if individual data values are to be used in an analysis. This model must contain
terms corresponding to the treatment structure (the TRT variable); terms corresponding to the experimental
unit structure (both blocks and the experimental unit need to be specified); and the randomization structure
(again, because of complete randomization whenever possible, these terms do not exist).
In the simple model syntax this can be expressed as:
Gain = Trt Block Block*Trt(R)
The term Trt is straightforward - it represents the treatment structure.
The Block term represents the blocking factor used in this experiment. As mentioned in earlier chapters,
one consideration when building a model is to ask if blocking terms are fixed or random effects. A fixed
effect would be used if the blocks would be reused. A random effect would be used if a new set of blocks
would be created for a new experiment. Here the blocks are the sets of flumes across a river. It could be
argued that if the experiment were to be repeated, the same sets of flumes would be used. In this case the
block effect is fixed, and this approach was adopted for this analysis. Fortunately, in this case it turns out
that treating blocks as fixed or random has only minor effects on the subsequent analysis: the F-statistics
and estimates of differences of means are unaffected.
The third term looks a bit strange. It is supposed to represent the experimental units, the flumes. Based
on the previous example, one would expect that the experimental unit effects would be represented by a nested
term FLUMES(TRT). Sigh... if only life were so simple. For historical reasons, this simple syntax is not
used. For well-built computer packages this syntax will work, but for many packages it will fail. Because
the computer treats all terms in the model equally, i.e., it makes no distinction between terms representing
experimental units and terms representing treatment structures, the standard historical convention is to repre-
sent experimental unit effects in a blocked design as an interaction between BLOCKS and TREATMENT.
This just means that because every treatment occurs once in every block, the combination of block and
treatment levels is sufficient to identify the flume used. In addition, a key assumption of a blocked design
is the LACK of interaction between blocks and treatments - so this is, unfortunately, extremely misleading.
Historical precedent is more powerful than simple logic, so this syntax will be around for quite some time!
[This is an illustration of why, in complex designs, great care must be used to correctly specify terms in the
model for computer packages.]
If the flumes had unique labels, the following model could also be fit:
Gain = Trt Block Flume(R)
In this case, the computer package should be able to figure out that flumes are the experimental units and
fish are the sub-sampling units.
Once the model is built, the analysis of the individual values can be done using Proc Glm or Proc Mixed
by fitting the model that includes the flume random effect, i.e.
Gain = Trt Block Block*Trt(R)
I again prefer to use Proc Mixed because of the nicer output available from Proc Mixed; you will have to use
the pdmixed800.sas macro to get the joined line plots:
ods graphics on;
proc mixed data=gain plots=all;
title2 'Analysis of all values';
class block trt fishid;
model gain = block trt / ddfm=satterth;
random block*trt; /* random component for flumes */
lsmeans trt / diff cl adjust=tukey;
ods output tests3 =Mixed2Tests;
ods output lsmeans=Mixed2LSmeans;
ods output diffs =Mixed2Diffs;
ods output covparms=Mixed2CovParms;
run;
ods graphics off;
There is a large amount of output. The fixed effect tests are presented below:
Effect  Num DF  Den DF  F Value  Pr > F
block   2       4       1.30     0.3681
trt     2       4       2.65     0.1852
The F-statistics are identical to the simple analysis presented before. Again, there really isn't much point in
testing for a Block effect as inference about blocks is not of interest.
The LSmeans and the confidence intervals for the differences, obtained in the usual fashion, are shown
below:
Effect  trt      Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
trt     Control  2.7133    0.4702          4   5.77     0.0045    0.05    1.4079   4.0187
trt     UVA      2.3800    0.4702          4   5.06     0.0072    0.05    1.0746   3.6854
trt     UVAB     1.2533    0.4702          4   2.67     0.0561    0.05   -0.05207  2.5587

Effect  trt      _trt  Estimate  Standard Error  Adjustment  Adj P   Alpha  Adj Low  Adj Upp
trt     Control  UVA   0.3333    0.6649          Tukey       0.8747  0.05   -2.0364  2.7031
trt     Control  UVAB  1.4600    0.6649          Tukey       0.1857  0.05   -0.9097  3.8297
trt     UVA      UVAB  1.1267    0.6649          Tukey       0.3129  0.05   -1.2431  3.4964

Effect=trt Method=Tukey(P<0.05) Set=1
Obs  trt      Estimate  Standard Error  Alpha  Lower     Upper   Letter Group
1    Control  2.7133    0.4702          0.05    1.4079   4.0187  A
2    UVA      2.3800    0.4702          0.05    1.0746   3.6854  A
3    UVAB     1.2533    0.4702          0.05   -0.05207  2.5587  A
The results are again identical to those earlier.
Finally, this analysis also gives estimates of the variance components:
Cov Parm   Estimate
block*trt  0.5909
Residual   0.3616
This shows that the largest amount of variation is among the flumes, but the fish within a flume also show
substantial variation. The variance components could be used in a more detailed cost-benefit analysis for
planning an experiment, as shown earlier.
The model diagnostic plots do not indicate any problems with the fit.
9.5 Example - Monitoring Fry Levels - unbalanced data with sampling over time
This example illustrates the analysis of an experiment where repeated measurements are taken over time at
various locations to monitor the population levels of fish fry.
CAUTION Whenever Time is a factor, great care must be taken to avoid problems caused by the obvious
inability to randomize time, i.e., readings in 2004 must be taken after readings in 2003. One of the
assumptions of CRD and RCB designs is that randomization of treatments to experimental units takes place.
This randomization implies that the correlation of responses between any pair of units is equal. It often turns
out in experiments with repeated measurements taken over time that measurements closer together in
time are more highly correlated than measurements far apart in time. This creates a condition called
autocorrelation, which can play havoc with the computation of proper p-values and standard errors of effect
sizes.
A more refined analysis of experiments with repeated measurements over time is called a repeated-mea-
sures analysis and is beyond the scope of this course. Unfortunately, such an analysis often requires more
experimental units than are typically available in many studies, and so is not even feasible. In recent years,
modern statistical software has developed the ability to model the autocorrelation structure in time (e.g.
Proc Mixed in SAS). Again, this is beyond the scope of this course.
Fortunately (or unfortunately, depending on your point of view), many experiments have such large levels
of variation and so few subjects that both of these refined analyses are infeasible, and the methods presented in
this section are a good first approximation.
A common way of monitoring the health of streams is to measure the density of fry (small fish). If this
density declines over time, it may be an indication that the health of the stream is declining and/or that the
stock of fish that inhabit the stream is under stress.
This example is based on a real survey from British Columbia, Canada of salmon-bearing streams.
These surveys were taken at 5 to 20 locations per stream (depending on the length and size of the stream)
for a few years in attempts to track stock status over time (and detect changes in densities). Typically these
juvenile surveys are done only on those streams that do not lend themselves well to adult snorkel surveys (where
you get instant indices of escapement - the number of salmon returning to spawn) rather than having to back-
calculate from fry densities.
This survey looked at one particular stream. The analyst examined a few productive (and easy to sample)
tributary streams (locations) and tracked them over time.
The key objectives the study attempts to address are:
1. what are the fry densities in a given year?
2. how have densities changed from one year to the next?
3. use the data to determine how many locations and sites per stream are required to detect a change in
fry densities in the neighborhood of, say, 25% from one year to the next.
Here are the raw data from the survey (the data are also available on the web site):
Fry Density in each year
Location Site 2000 2001 2002 2003 2004
A 1 4 55 28 12 9
A 2 25 11 45 84 27
A 3 . . 27 . .
B 1 139 234 496 349 209
B 2 272 262 102 90 35
B 3 . 127 . . .
C 1 34 249 91 79 124
C 2 122 . . . .
D 1 128 213 . 97 .
E 1 184 47 131 107 103
E 2 413 508 204 323 115
E 3 70 . . . .
F 1 140 307 189 243 110
F 2 181 326 361 468 186
The data is available in the fry.csv file in the Sample Program Library at http://www.stat.sfu.
ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data fry;
infile 'fry.csv' dlm=',' dsd missover firstobs=2 expandtabs;
input location $ site $ y2000 y2001 y2002 y2003 y2004; /* read in the data */
year = 2000; fry=y2000; output; /* convert the data to a list */
year = 2001; fry=y2001; output;
year = 2002; fry=y2002; output;
year = 2003; fry=y2003; output;
year = 2004; fry=y2004; output;
keep location site year fry;
run;

data fry; /* compute the log(fry) */
set fry;
lfry = log(fry);
attrib lfry label='log(fry)' format=7.2;
location_site = compress(location || '.' || site);
run;
The first few records are:
Obs location site year fry lfry location_site
1 A 1 2000 4 1.39 A.1
2 A 2 2000 25 3.22 A.2
3 A 3 2000 . . A.3
4 B 1 2000 139 4.93 B.1
5 B 2 2000 272 5.61 B.2
6 B 3 2000 . . B.3
7 C 1 2000 34 3.53 C.1
8 C 2 2000 122 4.80 C.2
9 D 1 2000 128 4.85 D.1
10 E 1 2000 184 5.21 E.1
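The same wide-to-long reshaping and log transform can be sketched outside SAS. This Python version (an illustration only, with the first three rows of the data table hard-coded and None standing in for the missing "." values) mirrors what the data step above does:

```python
import math

# First three rows of the fry table in wide form; None marks a missing '.'
wide = [
    ("A", "1", [4, 55, 28, 12, 9]),
    ("A", "2", [25, 11, 45, 84, 27]),
    ("A", "3", [None, None, 27, None, None]),
]

# Reshape to one record per location-site-year, as the SAS data step does,
# adding log(fry) and the combined location.site label.
long = []
for location, site, counts in wide:
    for year, fry in zip(range(2000, 2005), counts):
        lfry = round(math.log(fry), 2) if fry is not None else None
        long.append((location, site, year, fry, lfry, f"{location}.{site}"))

print(long[0])   # ('A', '1', 2000, 4, 1.39, 'A.1')
```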
The Locations will serve as blocks in the experiment to group readings of fry density that should be
similar over time. For example, Location A looks as if it is relatively poor habitat for fry compared to
Location B. Similarly, Sites within each Location are also sub-blocks within each of the locations; this is
where the sub-sampling occurs.
Note that in this experiment, the same sites were repeatedly measured over time. The experimental
layout would be quite different if a different set of sites was measured in each year within each location.
The factor is Year - we are interested in how fry densities change over time in response to year effects.
The response variable is the fry density measured at each site within a location at each year. There appear
to be a number of missing values - some sites were not measured in every year of the study.
There are several layers of experimental units. In one sense, locations are one unit - six different locations
were measured at this stream. At the next level, between 1 and 3 sites were measured at each location.
Finally, each site-location pair was measured between 1 and 5 times over the course of the experiment.
The randomization structure also has three components. It wasn't very clear from the study how locations
were chosen. Are these a random sample of many locations in the stream (in which case locations would be
a random effect) or were the locations chosen for ease of accessibility (in which case the locations would be
fixed effects)? It is beyond the scope of this course, but as long as we are interested in the year-to-year changes,
it really doesn't matter how locations were chosen.
Sites within locations were chosen at random.
Years, as mentioned earlier, are problematic as time cannot be randomized.
9.5.1 Some preliminary plots
Before doing any analysis, it is always wise to do some preliminary plots and to think carefully about how
the response variable will change as the treatment (the different years) changes.
A plot was constructed of each individual data point joined in time sequence for each site-location
combination:
proc sgplot data=fry;
   title2 'Preliminary plot';
   series y=fry x=year / group=location_site;
   xaxis offsetmin=.05 offsetmax=0.05;
run;
The resulting plot shows tremendous scatter in the data and an apparent overall decline over time.
Is the data measured on the right scale? It seems reasonable that yearly effects should be multiplicative
rather than additive. For example, in a very bad year, all the readings might be half of the densities in
good years rather than a simple arithmetic decline. For this reason, a logarithmic transform of the response
variable is often used.
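The multiplicative-to-additive idea can be checked with a tiny numeric sketch (Python here, purely illustrative; the densities are invented):

```python
import math

# Hypothetical fry densities in a "good" year at three sites.
good = [40.0, 250.0, 130.0]
# In a "bad" year every site is halved -- a multiplicative year effect.
bad = [d / 2 for d in good]

# On the raw scale the decline differs wildly from site to site...
raw_drops = [g - b for g, b in zip(good, bad)]    # 20.0, 125.0, 65.0

# ...but on the log scale every site drops by the same amount, log(2).
log_drops = [math.log(g) - math.log(b) for g, b in zip(good, bad)]
assert all(abs(d - math.log(2)) < 1e-9 for d in log_drops)
```

Equal drops on the log scale are exactly the parallel profiles that the blocked analysis below assumes.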
A plot of log(density) over time is much easier to interpret. (From the previous data table, you will have
to create a new variable which is the logarithm of the fry density.)
Most of the lines appear to be parallel with some evidence of a decline and the variation appears to be much
smaller than in the previous graph. The parallelism of response is important because one assumption of
blocked designs is that responses should have a similar prole (i.e. parallel effects) and that blocks do not
interact with treatments.
For these reasons, further analysis will be done on the transformed response variable log(density).
9.5.2 Approximate analysis of means
Because the data are unbalanced, i.e. not every site in a location was measured every year, an analysis on
the averages will only be approximate. A more exact analysis is presented in the next section.
The resulting data table will look similar to:
We compute the averages by sorting the data and then using Proc Means to compute the mean multiple
sites measured in each Location
proc sort data=fry;
   by year location;
run;
proc means data=fry noprint;
   by year location;
   var lfry;
   output out=meanlfry mean=meanlfry;
run;
The first few averages are:
Obs year location meanlfry
1 2000 A 2.30
2 2000 B 5.27
3 2000 C 4.17
4 2000 D 4.85
5 2000 E 5.16
6 2000 F 5.07
7 2001 A 3.20
8 2001 B 5.29
9 2001 C 5.52
10 2001 D 5.36
There should be ONE average value for each combination of year and location.
The model to be fit to the data is:
log(density) = Location + Year
where the Location term serves as the blocking factor, the Year term serves as the treatment factor, and
additive random variation is assumed.
Once the averages are computed, the analysis can be done using Proc Glm or Proc Mixed with a single-
factor block ANOVA model. I again prefer Proc Mixed because of its nicer output; you will have to use the
pdmixed800.sas macro to get the joined-line plots:
ods graphics on;
proc mixed data=meanlfry plots=all;
   title2 'approximate analysis of mean(log fry) using a RCB analysis';
   class location year;
   model meanlfry = location year / ddfm=kr;
   lsmeans year / cl adjust=tukey;
   estimate '2004 vs prev years' year -.25 -.25 -.25 -.25 1 / cl;  /* compare avg of prev 4 years to 2004 */
   ods output tests3   =Mixed1Tests;
   ods output lsmeans  =Mixed1LSmeans;
   ods output diffs    =Mixed1Diffs;
   ods output estimates=Mixed1Estimates;
run;
ods graphics off;
The effect test output
Effect     Num DF   Den DF   F Value   Pr > F
location        5       18     29.52   <.0001
year            4       18      3.35   0.0323
shows some evidence of a year effect (p-value just over 3%). Furthermore, investigation of the actual year
effects shows some differences, notably that 2001 was a good year for fry.
Effect  year   Estimate  Standard Error   DF  t Value  Pr > |t|  Alpha   Lower   Upper
year    2000     4.4704          0.1428   18    31.30    <.0001   0.05  4.1703  4.7705
year    2001     5.0280          0.1428   18    35.20    <.0001   0.05  4.7279  5.3281
year    2002     4.8480          0.1608   18    30.16    <.0001   0.05  4.5102  5.1857
year    2003     4.7709          0.1428   18    33.40    <.0001   0.05  4.4708  5.0710
year    2004     4.3683          0.1608   18    27.17    <.0001   0.05  4.0306  4.7060
Effect  year  _year  Estimate  Standard Error  Adjustment      Adj P  Alpha   Adj Low   Adj Upp
year    2000   2001   -0.5576          0.2020  Tukey-Kramer   0.0834   0.05   -1.1684   0.05321
year    2000   2002   -0.3776          0.2150  Tukey-Kramer   0.4274   0.05   -1.0278   0.2727
year    2000   2003   -0.3005          0.2020  Tukey-Kramer   0.5827   0.05   -0.9113   0.3103
year    2000   2004    0.1021          0.2150  Tukey-Kramer   0.9887   0.05   -0.5482   0.7523
year    2001   2002    0.1800          0.2150  Tukey-Kramer   0.9154   0.05   -0.4702   0.8303
year    2001   2003    0.2571          0.2020  Tukey-Kramer   0.7104   0.05   -0.3537   0.8679
year    2001   2004    0.6597          0.2150  Tukey-Kramer   0.0458   0.05  0.009422   1.3099
year    2002   2003   0.07704          0.2150  Tukey-Kramer   0.9961   0.05   -0.5732   0.7273
year    2002   2004    0.4796          0.2213  Tukey-Kramer   0.2360   0.05   -0.1894   1.1487
year    2003   2004    0.4026          0.2150  Tukey-Kramer   0.3662   0.05   -0.2476   1.0529
Effect=year Method=Tukey-Kramer(P<0.05) Set=1
Obs  year  Estimate  Standard Error  Alpha   Lower   Upper  Letter Group
  1  2001    5.0280          0.1428   0.05  4.7279  5.3281  A
  2  2002    4.8480          0.1608   0.05  4.5102  5.1857  AB
  3  2003    4.7709          0.1428   0.05  4.4708  5.0710  AB
  4  2000    4.4704          0.1428   0.05  4.1703  4.7705  AB
  5  2004    4.3683          0.1608   0.05  4.0306  4.7060  B
The residual and diagnostic plots don't show anything unusual about the fit.
9.5.3 Analysis of raw data
The raw data can be used directly in an analysis. Now the sub-sampling issue must be accounted for.
The appropriate model is:
log(density) = Location + Site(Location)-R + Year
where the Location and Year terms have the same meaning as in the previous section, but now the Site(Location)
term represents the sub-sampling that takes place each year. This latter effect is a random effect as the sites
were randomly chosen in each location.
We use Proc Mixed because of its nicer output; you will have to use the pdmixed800.sas macro to get the
joined-line plots:
ods graphics on;
proc mixed data=fry plots=all;
   title2 'analysis on log(fry) counts using individual values';
   class location year site;
   model lfry = location year / ddfm=satterth;
   random site(location);
   lsmeans year / cl adjust=tukey;                                 /* estimate year changes on log scale */
   estimate '2004 vs prev years' year -.25 -.25 -.25 -.25 1 / cl;  /* compare avg of prev 4 years to 2004 */
   ods output tests3   =Mixed2Tests;
   ods output lsmeans  =Mixed2LSmeans;
   ods output diffs    =Mixed2Diffs;
   ods output covparms =Mixed2CovParms;
   ods output estimates=Mixed2Estimates;
run;
ods graphics off;
The effect tests from the output are:
Effect     Num DF   Den DF   F Value   Pr > F
location        5     7.91      7.28   0.0078
year            4     37.7      2.12   0.0971
This more refined analysis gives a p-value for the year effect of around 10%.
Don't be alarmed that the approximate and more refined analyses seemingly gave different answers. If
you examine the output from the previous analysis, you will see that it is very sensitive to one or two of the
missing values. For example, if you remove the locations with a limited number of years, the results are in
much closer agreement.
9.5.4 Planning for future experiments
In order to plan for future studies, we need estimates of variability. This is provided by the variance
component decomposition from the second analysis above:
Cov Parm         Estimate
site(location)     0.1447
Residual           0.3775
This shows that about 70% of the variation is yet unexplained and only about 30% of the variation is
found among sites within locations. This is somewhat worrisome as it implies that densities are not very
consistent at the same site over time. Hence, there is little advantage to always returning to the same site
within each location over time.
The residual variation of a data point is approximately 0.38 on the log-scale. This corresponds to a
standard deviation of approximately √0.38 = 0.62.
A 25% change in density between years corresponds to a change of log(1.25) = .22 on the log-scale.
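These back-of-the-envelope numbers are easy to verify; a quick Python check using the variance components reported above (purely a cross-check, not part of the SAS analysis):

```python
import math

site_var  = 0.1447   # site(location) variance component from the SAS output
resid_var = 0.3775   # residual variance component from the SAS output

total = site_var + resid_var
assert round(resid_var / total, 2) == 0.72   # ~70% of the variation is residual
assert round(site_var  / total, 2) == 0.28   # ~30% is among sites within locations

# Residual sd on the log scale (rounding 0.3775 to 0.38 first, as in the text):
assert round(math.sqrt(0.38), 2) == 0.62

# A 25% change in density corresponds to log(1.25) on the log scale:
assert round(math.log(1.25), 2) == 0.22
```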
The standard power/sample-size computations:
proc power;
   title 'Power analysis for fry example';
   /* We vary the size of the difference to see what sample size is needed */
   twosamplemeans
      test=diff      /* indicates that you wish to test for differences in the mean */
      meandiff=0.22  /* size of difference to be detected */
      stddev=0.62    /* the standard deviation within each group */
      power=.80      /* target power of 80% */
      alpha=.05      /* alpha level for the test */
      sides=2        /* a two-sided test for difference in the mean should be done */
      ntotal=.       /* solve for the total sample size assuming equal sample sizes in both groups */
      ;              /* end of the twosamplemeans statement - DON'T FORGET THIS */
   ods output output=power20;
run;
giving:
Power analysis for fry example
Obs Alpha MeanDiff StdDev Sides NullDiff NominalPower Power NTotal
1 0.05 0.22 0.62 2 0 0.8 0.801 252
shows that over 250 sites would need to be sampled over two years (i.e. over 100 sites in each year) to detect
this relatively small change with high power! How depressing.
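As a rough cross-check of the Proc Power result, the standard two-sample normal approximation can be computed by hand (a Python sketch; the z quantiles are standard constants, and the answer is slightly below 252 because Proc Power iterates with the t distribution):

```python
import math

delta, sigma = 0.22, 0.62   # difference to detect and sd, both on the log scale
z_alpha = 1.959964          # normal quantile for two-sided alpha = 0.05
z_beta  = 0.841621          # normal quantile for power = 0.80

# n per group ~= 2 * (sigma/delta)^2 * (z_alpha + z_beta)^2
n_per_group = 2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
total = 2 * math.ceil(n_per_group)   # about 250 sites in total
```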
The model diagnostic plots do not indicate any problems with the fit.
9.6 Example - comparing mean flagella lengths
The analysis of the flagella measurements for a single mean can be extended to a comparison of mean lengths
across a number of variants.
In some experiments, multiple measurements are taken from each experimental unit. These multiple
measurements are termed pseudo-replicates (Hurlbert, 1984), and should not be treated as independent mea-
surements when estimating the overall mean. If this pseudo-replication is ignored, then the standard error of
the mean is typically under-reported (i.e. the reported se will underestimate the true uncertainty in the mean)
and the confidence intervals will be too narrow (i.e. the actual coverage will be (substantially) less than the
nominal level).
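The direction of this bias is easy to see with some simple variance arithmetic; the numbers below are invented for illustration only:

```python
import math

# Hypothetical variance components: cell-to-cell and within-cell.
var_cell, var_within = 9.5, 9.5
n_cells, m = 50, 2                 # 50 cells, 2 flagella measured per cell

# Correct se of the overall mean: averaging the m flagella on a cell reduces
# only the within-cell variance; the cell-to-cell variance averages over n_cells.
se_correct = math.sqrt((var_cell + var_within / m) / n_cells)

# The naive se treats all n_cells*m flagella as independent observations.
se_naive = math.sqrt((var_cell + var_within) / (n_cells * m))

assert se_naive < se_correct       # the naive se understates the uncertainty
```

Averaging over pseudo-replicates shrinks only the within-unit variance; the naive formula wrongly shrinks the between-unit variance as well.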
Similarly, when comparing the means across several populations, the analysis needs to account for the
pseudo-replicates. As before, there are two usual ways of proceeding:
The first approach is to find the averages of the pseudo-replicates for each experimental unit. This reduces
the data to a single value for each experimental unit; standard single-factor CRD ANOVA models are then
applied. The second approach is to fit a more complex model that recognizes (and adjusts for) the
pseudo-replication.
The first option will give identical results to the second option in the case of balanced data (i.e. the same
number of pseudo-replicates for each experimental unit). In the case of unbalanced data, the first option
(the average of the averages) is only approximate, but may be close enough for practical purposes. The
key advantage of the second approach is that it is applicable in all cases (balanced or unbalanced) and also
provides more information (the relative variation among and within experimental units).
This example is based on work conducted in the laboratory of Prof. Lynn Quarmby at Simon Fraser
University (http://www.sfu.ca/mbb/People/Quarmby/). Her research focus is on the mech-
anism by which cells shed their cilia (aka flagella) in response to stress, working with the unicellular alga
Chlamydomonas.
Microphotographs of the algae are taken, and the length of one or two flagella of each cell is measured.
The lengths of several variants were measured.
This example is based on the data collected by a student working on this project during an Undergraduate
Student Research Award. The data have been jittered (i.e. random noise was added to the measurements)
to protect confidentiality. The data tables are available in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data is available in the file flagella2.csv. The data are read into SAS in the usual way:
data flagella;
   infile 'flagella2.csv' dlm=',' dsd missover firstobs=2;
   length variant cell $20.;
   input variant $ cell $ length1 length2;
run;
proc print data=flagella(obs=20);
   title2 'Part of the raw data';
run;
The first few lines of the raw data are shown below:
Compare mean flagella length among variants
Part of the raw data
Obs variant cell length1 length2
1 A 1 26.900000 .
2 A 2 29.800000 .
3 A 3 30.900000 .
4 A 4 29.000000 .
5 A 5 24.900000 27.000000
6 A 6 32.300000 .
7 A 7 28.300000 .
8 A 8 31.400000 .
9 A 9 28.500000 35.000000
10 A 10 25.800000 32.400000
11 A 11 30.300000 29.000000
12 A 12 25.600000 .
13 A 13 34.100000 35.000000
14 A 14 31.400000 30.800000
15 A 15 38.200000 20.700000
16 A 16 28.900000 32.700000
17 A 17 26.400000 23.000000
18 A 18 30.400000 30.800000
19 A 19 27.900000 35.200000
20 A 20 37.400000 28.900000
The experimental unit is the cell (each row of the data), and pseudo-replicates are the multiple flagella
measured on each cell. Not every cell has two flagella measured because in some cases the flagellum was
hidden by other cells, broken, or not clearly visible in the microphotograph.
The objective is to compare the mean length of flagella among the variants to see if there is evidence of
a difference in the means.
Notice that there is large variation among individual cells, with some cells having much longer flagella
than other cells, and there is also variation within each cell. But if a cell tends to have longer flagella
than other cells, then both flagella tend to be longer. This lack of independence is what makes a naive
analysis that pools all flagella lengths into a single sample inappropriate.
9.6.1 Average-of-averages approach
We start using the average-of-averages approach even though the data is not balanced (i.e. not every cell had
both flagella measured).
The average flagellum length for each cell is computed and the data is reduced to a single value for each
cell.
The mean() function is used in a data step. Notice that the mean() function automatically ignores any
missing values.
data flagella;
   set flagella;
   avg_length = mean(length1,length2);
run;
proc print data=flagella (obs=10);
   title2 'Part of the raw data after avg taken';
run;
(For more complicated cases, Proc Transpose followed by a Proc Means is often used; send me an email
for more details.)
The first few lines of the raw data are shown below:
Compare mean flagella length among variants
Part of the raw data after avg taken
Obs variant cell length1 length2 avg_length
1 A 1 26.900000 . 26.900000
2 A 2 29.800000 . 29.800000
3 A 3 30.900000 . 30.900000
4 A 4 29.000000 . 29.000000
5 A 5 24.900000 27.000000 25.950000
6 A 6 32.300000 . 32.300000
7 A 7 28.300000 . 28.300000
8 A 8 31.400000 . 31.400000
9 A 9 28.500000 35.000000 31.750000
10 A 10 25.800000 32.400000 29.100000
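SAS's mean() quietly skips missing arguments; for readers following along in another language, the same behaviour can be sketched in Python (a hypothetical helper, not part of the SAS analysis):

```python
def avg_length(length1, length2):
    """Average whichever flagella lengths were measured (None = missing)."""
    measured = [x for x in (length1, length2) if x is not None]
    return sum(measured) / len(measured) if measured else None

assert avg_length(26.9, None) == 26.9            # only one flagellum measured
assert round(avg_length(24.9, 27.0), 2) == 25.95 # both measured (cf. cell 5 above)
assert avg_length(None, None) is None            # nothing measured at all
```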
Before doing the formal analysis, we should check for outliers, unusual points, etc.
Begin with a simple side-by-side dot plot:
proc gplot data=flagella;
   title2 'side-by-side dot plots - all of data';
   axis1 label=(a=90 r=0 'Average length');
   axis2 label=('Variant') offset=(1 cm, 1 cm);
   plot avg_length*variant / vaxis=axis1 haxis=axis2;
run;
[Side-by-side dot plot of average flagella length (0 to 40) by variant (A, B, C, D).]
There is an obvious outlier in the results for Variant A. Exclude that row and redo the graph.
Next find the sample size, mean, and standard deviation for each variant to check the assumption of equal
standard deviations among groups.
proc tabulate data=flagella;
   title2 'summary statistics after removing outliers';
   class variant;
   var avg_length;
   table variant, avg_length*(n*f=4.0 mean*f=6.2 std*f=6.2);
run;
Compare mean flagella length among variants
summary statistics after removing outliers
                avg_length
variant     N    Mean    Std
A          25   29.42   3.08
B          61   20.38   6.04
C          56   11.47   1.99
D          41    9.45   1.94
The sample standard deviations vary among the groups by about a factor of 2 to 3 with variant B apparently
having a much larger standard deviation. This amount of variation in the standard deviations is acceptable.
The assumption of independence must be assessed based on expert knowledge. Each cell was selected
independently from each variant, and we hope that the length of flagella for one cell does not influence the
length of flagella on other cells.
We now apply a single-factor CRD ANOVA as outlined in previous chapters.
ods graphics on;
proc mixed data=flagella;
   title2 'Analyze the average length';
   class variant cell;
   model avg_length = variant / residual;
   lsmeans variant / cl diff adjust=tukey;
run;
ods graphics off;
Compare mean flagella length among variants
Analyze the average length
The Mixed Procedure
Type 3 Tests of Fixed Effects
Effect    Num DF   Den DF   F Value   Pr > F
variant        3      179    183.10   <.0001
The null hypothesis is that the mean flagella length is the same in all variants. The p-value for this
hypothesis is less than 0.0001, indicating strong evidence against the hypothesis, i.e. there is strong evidence
that not all the means are the same.
We conduct a Tukey multiple-comparison test to compare the means between all pairs of Variants.
Compare mean flagella length among variants
Analyze the average length
The Mixed Procedure
Least Squares Means
Effect   variant  Estimate  Standard Error   DF  t Value  Pr > |t|  Alpha    Lower    Upper
variant  A         29.4180          0.7889  179    37.29    <.0001   0.05  27.8612  30.9748
variant  B         20.3820          0.5051  179    40.36    <.0001   0.05  19.3853  21.3786
variant  C         11.4670          0.5271  179    21.75    <.0001   0.05  10.4268  12.5072
variant  D          9.4488          0.6161  179    15.34    <.0001   0.05   8.2331  10.6644
The Mixed Procedure
Differences of Least Squares Means
Effect variant _variant Estimate Standard Error DF t Value Pr > |t| Adjustment Adj P Alpha Lower Upper Adj Lower Adj Upper
variant A B 9.0360 0.9368 179 9.65 <.0001 Tukey-Kramer <.0001 0.05 7.1875 10.8845 6.6067 11.4653
variant A C 17.9510 0.9488 179 18.92 <.0001 Tukey-Kramer <.0001 0.05 16.0787 19.8234 15.4904 20.4117
variant A D 19.9692 1.0010 179 19.95 <.0001 Tukey-Kramer <.0001 0.05 17.9940 21.9444 17.3734 22.5651
variant B C 8.9150 0.7300 179 12.21 <.0001 Tukey-Kramer <.0001 0.05 7.4744 10.3556 7.0218 10.8082
variant B D 10.9332 0.7966 179 13.72 <.0001 Tukey-Kramer <.0001 0.05 9.3612 12.5052 8.8673 12.9991
variant C D 2.0182 0.8108 179 2.49 0.0137 Tukey-Kramer 0.0650 0.05 0.4182 3.6181 -0.08447 4.1208
The multiple comparison test shows that the means for variants A and B could be distinguished from the
means of C and D, but that the mean flagella length for variants C and D could not be distinguished.
The estimated difference in the means is also presented. We see (as expected) that the confidence
interval for the difference in means between variants C and D includes 0, indicating that they could not be
distinguished.
The residual diagnostics do not show any problems.
9.6.2 Analysis on individual measurements
The analysis is repeated using a statistical model to account for the pseudo-replicates.
The development of such statistical models must specify elements that account for the treatment groups
and the two sources of variation present in the individual measurements. In this case the two sources of
variation are the cell-to-cell variation and the within-cell variation.
The statistical model (in standard shorthand notation) is
Length = Variant + Cell(Variant)-R
The Length term is read as "individual measurements of length vary according to the following sources". The
Variant term represents the differences in means among the four variants. The Cell(Variant)-R term
represents the cell-to-cell variation and is known as a random effect (i.e. you are not particularly interested
in this set of cells, and want to generalize to the entire population). Because the cell numbers are repeated
(labelled starting with the value 1 in each variant), you must specify that cell 1 in variant 1 is different from cell 1 in variant
2, etc. This is done using the nesting notation cell(variant), which indicates that each cell number differs
within each variant. The within-cell variation is the lowest level in the experiment and is implicit (i.e. does
not appear explicitly in the model).
I assume that the outlier cell identified in the earlier analysis has already been removed from the dataset.
The data needs to be stacked with all of the lengths in a single column.
data stack_flagella;
   set flagella;
   length = length1; output;
   length = length2; output;
   drop length1 length2 avg_length;
run;
proc print data=stack_flagella(obs=20);
   title2 'Stacked data';
run;
The first few lines of the stacked data are shown below:
Compare mean flagella length among variants
Stacked data
Obs variant cell length
1 A 1 26.900000
2 A 1 .
3 A 2 29.800000
4 A 2 .
5 A 3 30.900000
6 A 3 .
7 A 4 29.000000
8 A 4 .
9 A 5 24.900000
10 A 5 27.000000
11 A 6 32.300000
12 A 6 .
13 A 7 28.300000
14 A 7 .
15 A 8 31.400000
16 A 8 .
17 A 9 28.500000
18 A 9 35.000000
19 A 10 25.800000
20 A 10 32.400000
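The stacking performed by the data step (each input row emits one output row per length column, missing values included) can be mimicked outside SAS; a hypothetical Python sketch using two of the rows shown above:

```python
# Two hypothetical wide rows (values taken from the printed data above).
rows = [
    {"variant": "A", "cell": "1", "length1": 26.9, "length2": None},
    {"variant": "A", "cell": "5", "length1": 24.9, "length2": 27.0},
]

# Emit one long row per length column, exactly as the data step's two
# 'output;' statements do -- missing values are carried along.
stacked = [
    {"variant": r["variant"], "cell": r["cell"], "length": r[col]}
    for r in rows
    for col in ("length1", "length2")
]

assert len(stacked) == 2 * len(rows)
assert stacked[1]["length"] is None     # cell 1's missing second flagellum
assert stacked[3]["length"] == 27.0     # cell 5's second flagellum
```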
The statistical model is then fit. Proc Mixed is used to fit the model using the standard syntax:
ods graphics on;
proc mixed data=stack_flagella;
   title2 'Analyze the raw measurements';
   class variant cell;
   model length = variant / residual;
   random cell(variant);
   lsmeans variant / cl diff adjust=tukey;
run;
ods graphics off;
Compare mean flagella length among variants
Analyze the raw measurements
The Mixed Procedure
Covariance Parameter Estimates
Cov Parm        Estimate
cell(variant)     9.5245
Residual          9.4503
The Mixed Procedure
Type 3 Tests of Fixed Effects
Effect    Num DF   Den DF   F Value   Pr > F
variant        3      179    178.67   <.0001
The null hypothesis is that the mean flagella length is the same for all variants. The p-value for this
hypothesis is less than 0.0001, indicating strong evidence that not all the mean lengths are the same.
The variance component analysis shows that the cell-to-cell variation (9.52) is roughly the same or-
der of magnitude as the flagellum-to-flagellum variation (9.45) when all four variants are considered together.
Hence, when all variants are considered together, the flagella within a cell are just as variable as cell-to-cell
differences within a variant.
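That comparison is easy to check from the covariance parameter estimates above:

```python
cell_var, resid_var = 9.5245, 9.4503    # covariance parameter estimates above
share_cell = cell_var / (cell_var + resid_var)
assert round(share_cell, 2) == 0.50     # cell-to-cell is about half the total variation
```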
The p-value for testing the equality of means (if available) was very small, but does not identify which
pairs of Variant means could be different. A Tukey multiple-comparison procedure is used:
Compare mean flagella length among variants
Analyze the raw measurements
The Mixed Procedure
Least Squares Means
Effect   variant  Estimate  Standard Error   DF  t Value  Pr > |t|  Alpha    Lower    Upper
variant  A         29.4285          0.7913  179    37.19    <.0001   0.05  27.8671  30.9900
variant  B         20.2045          0.5159  179    39.16    <.0001   0.05  19.1864  21.2227
variant  C         11.4662          0.5285  179    21.70    <.0001   0.05  10.4233  12.5090
variant  D          9.3645          0.6334  179    14.79    <.0001   0.05   8.1146  10.6143
The Mixed Procedure
Differences of Least Squares Means
Effect variant _variant Estimate Standard Error DF t Value Pr > |t| Adjustment Adj P Alpha Lower Upper Adj Lower Adj Upper
variant A B 9.2240 0.9446 179 9.76 <.0001 Tukey-Kramer <.0001 0.05 7.3600 11.0880 6.7743 11.6737
variant A C 17.9623 0.9515 179 18.88 <.0001 Tukey-Kramer <.0001 0.05 16.0847 19.8400 15.4947 20.4300
variant A D 20.0641 1.0135 179 19.80 <.0001 Tukey-Kramer <.0001 0.05 18.0640 22.0641 17.4356 22.6925
variant B C 8.7384 0.7386 179 11.83 <.0001 Tukey-Kramer <.0001 0.05 7.2809 10.1958 6.8230 10.6537
variant B D 10.8401 0.8169 179 13.27 <.0001 Tukey-Kramer <.0001 0.05 9.2280 12.4521 8.7215 12.9586
variant C D 2.1017 0.8249 179 2.55 0.0117 Tukey-Kramer 0.0562 0.05 0.4739 3.7295 -0.03751 4.2409
The multiple comparison results show that the means for variants A, B, and (C, D) appear to be different,
but we cannot distinguish between the means for variants C and D. The output also estimates the
difference in means among the variant pairs. As expected, the confidence interval for the C vs. D comparison
includes the value of 0, as this pair of means could not be distinguished from each other.
9.6.3 Followup
A more sophisticated analysis could be conducted to test if the cell-to-cell and flagellum-to-flagellum vari-
ance components differed among the variants. This might demonstrate that certain variants are less- or
more-variable than other variants.
9.7 Final Notes
Hypotheses still about population means. The hypotheses of interest and conclusions about the
hypotheses are ALWAYS in terms of the population means. This is UNAFFECTED by the choice of
analysis, i.e., analyzing sample means or individual data points does NOT affect the hypotheses or
conclusions.
Losing information from averaging. It is rather counter-intuitive that averaging doesn't lose
any information in sub-sampling designs. What happens is that the average values (e.g. the average of
the three samples for each fish) give a more precise estimate of the average level for the experimental
unit (the fish) than a single sample. This reduced variation makes it easier to detect differences.
The amount of sub-sampling necessary depends upon the variation within the experimental unit.
Clearly if all sub-samples are identical, only a single sub-sample need be taken. On the other hand, if
there is tremendous variation among sub-samples, many sub-samples are needed to get precise results.
Ignoring sub-sampling. If sub-sampling is ignored and the data are analyzed as if they came from a
completely randomized design, then typically what happens is that the reported standard errors are too
small, and the p-value is also too small. This leads to an increased chance of a Type I error, i.e., a false
positive.
Planning experiments. When determining initial sample size, the variation among experimental
units, and not among sub-samples, is usually the determining factor. The standard deviation among
experimental units should be used in the sample-size programs to determine the initial number of
experimental units needed. Then a more detailed analysis of how to split the sampling among
units and within units would be done to see what degree of sub-sampling is needed.
Chapter 10
Two Factor Designs - Single-sized
Experimental units - CR and RCB
designs
Contents
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
10.1.1 Treatment structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
10.1.2 Experimental unit structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
10.1.3 Randomization structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
10.1.4 Putting the three structures together . . . . . . . . . . . . . . . . . . . . . . . . . 582
10.1.5 Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
10.1.6 Fixed or random effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
10.1.7 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
10.1.8 General comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
10.2 Example - Effect of photo-period and temperature on gonadosomatic index - CRD . . 586
10.2.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
10.2.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
10.2.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
10.2.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
10.2.5 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 595
10.3 Example - Effect of sex and species upon chemical uptake - CRD . . . . . . . . . . . . 603
10.3.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
10.3.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
10.3.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
10.3.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
10.4 Power and sample size for two-factor CRD . . . . . . . . . . . . . . . . . . . . . . . . 619
10.5 Unbalanced data - Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
10.6 Example - Stream residence time - Unbalanced data in a CRD . . . . . . . . . . . . . 626
10.6.1 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
10.6.2 The Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
10.6.3 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 631
10.6.4 Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
10.7 Example - Energy consumption in pocket mice - Unbalanced data in a CRD . . . . . 641
10.7.1 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
10.7.2 Preliminary summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
10.7.3 The statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
10.7.4 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
10.7.5 Hypothesis testing and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 648
10.7.6 Adjusting for unequal variances? . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8 Example: Use-Dependent Inactivation in Sodium Channel Beta Subunit Mutation -
BPK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8.2 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.8.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
10.9 Blocking in two-factor CRD designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
10.10 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
10.10.1 How to determine sample size in two-factor designs . . . . . . . . . . . . . . . . 669
10.10.2 What is the difference between a block and a factor? . . . . . . . . . . . . . . 669
10.10.3 If there is evidence of an interaction, does the analysis stop there? . . . . . . . . . 670
10.10.4 When should you use raw means or LSmeans? . . . . . . . . . . . . . . . . . . . 671
10.1 Introduction
So far we've looked at two different experimental designs: the single-factor completely randomized design
(1-factor CRD) and the single-factor randomized complete block design (1-factor RCB).
Both designs investigated whether differences in the mean response could be attributed to different levels of a
single factor. However, in many experiments, interest lies not only in the effect of a single factor, but in the
joint effects of two or more factors.
For example:
Yield of wheat. The yield of wheat depends upon many factors - two of which may be the variety
and the amount of fertilizer applied. This has two factors - (1) variety which may have three levels
representing three popular types of seeds, and (2) the amount of fertilizer which may be set at two
levels.
© 2012 Carl James Schwarz December 21, 2012
Pesticide levels. The pesticide levels may be measured in birds, which may depend upon sex (two
levels) and distance of the wintering grounds from agricultural fields (three levels).
Performance of a product. The strength of paper may depend upon the amount of water added (two
levels) and the type of wood fiber used in the mix (three levels).
There are many ways to design experiments with multiple factors - we will examine three of the most
common designs used in ecological research - the completely randomized design (this chapter), the randomized
block design (this chapter), and the split-plot design (next chapter).
As noted many times in this course, it is important to match the analysis of the data with the way the
data was collected. Before attempting to analyze any experiment, the features of the experiment should be
examined carefully. In particular, care must be taken to examine
the treatment structure;
the experimental unit structure;
the randomization structures;
the presence or absence of balance;
if the levels of factors are fixed or random effects; and
the assumptions implicitly made for the design.
If these features are not identified properly, an incorrect design and analysis of the experiment will result.
10.1.1 Treatment structure
The treatment structure refers to how the various levels of the factors are combined in the experiment.
The first step in any design or analysis is to start by identifying the factors in the experiment, their associated levels, and the treatments in the experiment. Treatments are the combinations of factor levels that are applied [1] to experimental units.
The two-factor design has, as the name implies, two factors. We generically call these Factor A and
Factor B with a and b levels respectively. We will examine only factorial treatment structures, i.e. every
treatment combination appears somewhere in the experiment. For example, if Factor A has 2 levels, and
Factor B has 3 levels, then all 6 treatment combinations appear in the experiment.
[1] Recall that in analytical surveys, the factor levels cannot be assigned to units (e.g. you can't assign sex to an animal) and so the key point is that units are randomly selected from the relevant population.
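The factorial treatment structure is easy to enumerate programmatically. Here is a minimal sketch (the level names are invented for illustration) that lists all a × b combinations for a two-factor design such as the wheat example:

```python
# Enumerate a full factorial treatment structure: every combination of
# factor levels is a treatment. Level names here are hypothetical.
from itertools import product

varieties = ["V1", "V2", "V3"]     # Factor A: variety, a = 3 levels
fertilizer = ["low", "high"]       # Factor B: fertilizer amount, b = 2 levels

# a * b = 6 treatment combinations
treatments = [(v, f) for v, f in product(varieties, fertilizer)]
```

The same idiom extends to three or more factors by adding lists to `product`.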
Why factorial designs?
Why do we insist on factorial treatment structures? There is a temptation to investigate multi-factor effects
using a change-one-at-a-time structure. For example, suppose you are investigating the effects of process
temperature (at two levels, H & L), fiber type (at two levels - deciduous and coniferous) and initial pulping
method (at two levels - mechanical or chemical) upon the strength of paper. In the change-one-at-a-time
treatment structure, the following treatment combinations would be tested:
1. L deciduous mechanical
2. H deciduous mechanical
3. L coniferous mechanical
4. L deciduous chemical
The researcher then argues that the effect of fiber type could be found by examining the difference in strength
between treatments (1) and (3); the effect of pulping method could be found by examining the difference in
strength between treatments (4) and (1); and the effect of process temperature could be found by examining
the difference in strength between treatments (1) and (2).
This is valid provided that the researcher is willing to assume the treatment effects are additive,
i.e., that the effect of process temperature is the same at all levels of the other factors; that the effect of fiber
type is the same at all levels of the other factors; and that the effect of initial pulping method is the same at
all levels of the other factors. Unfortunately, there is no method available to test this assumption with the set
of treatments listed above.
It is usually not a good idea to make this very strong assumption - what happens if the assumption is not
true? In the previous example, it means that your effects are only valid for the particular levels of the other
factors that happened to be present in the comparison. For example, the process temperature effect would
only be valid for deciduous fiber sources that are mechanically pulped.
A superior treatment structure is the factorial treatment structure. In the factorial treatment structure, every combination of levels appears in the experiment. For example, referring back to the previous experiment,
all of the following treatments would appear in the experiment:
1. L deciduous mechanical
2. H deciduous mechanical
3. L coniferous mechanical
4. H coniferous mechanical
5. L deciduous chemical
6. H deciduous chemical
7. L coniferous chemical
8. H coniferous chemical
Now, the main effects of each factor are found as:
main effect of temperature - treatments 1, 3, 5, 7 vs. 2, 4, 6, 8
main effect of source - treatments 1, 2, 5, 6 vs. 3, 4, 7, 8
main effect of method - treatments 1, 2, 3, 4 vs. 5, 6, 7, 8
Each main effect would be interpreted as the average change over the levels of the other factors.
In addition, it is possible to investigate if interactions exist between the various factors. For example,
is the effect of process temperature the same for mechanical and chemical pulping methods? This would
be examined by comparing the change in (1)+(3) vs. (2)+(4) [representing the effect of temperature for
mechanically pulped wood] and the change in (5)+(7) vs. (6)+(8) [representing the effect of temperature for
chemically pulped wood]. Can you specify how you would investigate the interaction between temperature
and source? What about between source and method of pulping? All of these are known as two factor
interactions.
The concept of a two-factor interaction can also be generalized to three-factor and higher interaction
terms in much the same way.
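The main-effect contrasts listed above can be verified with simple arithmetic on the cell means. In this sketch the eight cell means are invented for illustration; only the pattern of the contrasts follows the treatment list in the text:

```python
# Main effects as contrasts of cell means in the 2x2x2 paper-strength
# factorial. Keys are (temperature, fiber source, pulping method);
# the numeric means are hypothetical.
means = {
    ("L", "dec", "mech"): 10.0,  # treatment 1
    ("H", "dec", "mech"): 12.0,  # treatment 2
    ("L", "con", "mech"): 11.0,  # treatment 3
    ("H", "con", "mech"): 13.0,  # treatment 4
    ("L", "dec", "chem"): 14.0,  # treatment 5
    ("H", "dec", "chem"): 16.0,  # treatment 6
    ("L", "con", "chem"): 15.0,  # treatment 7
    ("H", "con", "chem"): 17.0,  # treatment 8
}

def avg(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

# Main effect of temperature: average of H cells minus average of L cells
temp_effect = (avg(m for k, m in means.items() if k[0] == "H")
               - avg(m for k, m in means.items() if k[0] == "L"))
# Main effect of fiber source: coniferous vs deciduous, averaged over the rest
source_effect = (avg(m for k, m in means.items() if k[1] == "con")
                 - avg(m for k, m in means.items() if k[1] == "dec"))
# Main effect of pulping method: chemical vs mechanical
method_effect = (avg(m for k, m in means.items() if k[2] == "chem")
                 - avg(m for k, m in means.items() if k[2] == "mech"))
```

Each main effect is the average change over the levels of the other two factors, exactly as described above.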
Why not factorial designs?
While a factorial treatment structure provides the maximal amount of information about the effects of factors
and their interactions, there are some disadvantages. In general, the number of treatments that will appear
in the experiment is equal to the product of the levels from all of the factors. In an experiment with many
factors, this can be enormous. For example, in a 10 factor design, with each factor at 2 levels, there are 1024
treatment combinations. It turns out that in such a large experiment, there are better ways to proceed that are
beyond the scope of this course - an example of which is a fractional factorial design which selects a subset
of the possible treatments to run with the understanding that the subset chosen loses information on some of
the higher order interactions. If you are contemplating such an experiment, please seek competent help.
As well, in some cases, interest lies in estimating a response surface, e.g. the factors are continuous variables
(such as temperature) and the experimenter is interested in finding the optimal conditions. This gives rise to
a class of designs called response surface designs which are beyond the scope of this course. Again, seek
competent help.
Displaying and interpreting treatment effects - profile plots
An important part of the design and analysis of an experiment lies in predicting the type of response expected -
in particular, what do you expect for the size of the main effects and do you expect to see an interaction.
During the design phase, these are useful for determining the power and sample sizes needed for an
experiment. During the analysis phase, these values and plots help in interpreting the results of the statistical
analysis.
With two factors (A and B) each at two levels, you can construct a profile plot. These profile plots show
the approximate effect of both factors simultaneously.
The key thing to look for is the parallelism of the two lines.
Profile plots with no interaction between factors
For example, consider the theoretical [it is theoretical because it shows the true population means, which are
never known exactly] profile plot of the mean responses below:
In this plot, the vertical distance between the two parallel line segments is the effect of Factor B, i.e.,
what happens to the mean response when you change the level of Factor B, but keep the level of Factor
A constant. The main effect of Factor B is the AVERAGE vertical distance between the two lines when
averaged over all levels of Factor A. Notice that if the lines are parallel, the vertical distance between the
two lines is constant - this implies that the effect of Factor B (the vertical distance between the two lines)
is the same regardless of the level of Factor A and the effect of Factor B and the main effect of Factor B
are synonymous. In this case, we say that there is NO INTERACTION between Factor A and Factor B.
Similarly, the effect of Factor A is the change in the line between the two levels of Factor A at a particular
value of Factor B, i.e., the vertical change in each line segment. The main effect of Factor A is the
AVERAGE change when averaged over all levels of Factor B. Notice that if the lines are parallel, the vertical
change is the same for both lines - this implies that the effect of Factor A is the same regardless of the level
of Factor B and that the effect of Factor A is synonymous with the main effect of Factor A. Once again, there
is no interaction between A and B.
Profile plots with interaction between factors
Now consider the following theoretical profile plot:
In this plot, the vertical distance between the line segments CHANGES depending on where you are
in Factor A. This implies that the effect of Factor B changes depending upon the level of A, i.e., there is
INTERACTION between Factor A and B. The main effect of Factor A is the average effect when averaged
over levels of B. In this case the main effect is not very interpretable (as will be seen in the plots below).
Similarly, the vertical change for each line segment is different for each segment - again the effect of Factor
A changes depending upon the level of Factor B - once again there is interaction between A and B.
The plots from an actual experiment must be interpreted with a grain of salt because even if there was no
interaction, the lines may not be exactly parallel because of sampling variations in the sample means. The
key thing to look for is the degree of parallelism. And it doesn't matter which factor is plotted along the
bottom - the plots may look different, but you will come to the same conclusions.
If there is interaction, the line segments may even cross rather than remaining separate.
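Numerically, parallelism is just a "difference of differences" of the four cell means: if the vertical gap between the two profile lines is the same at every level of Factor A, there is no interaction. A minimal sketch with invented cell means:

```python
# Interaction in a 2x2 design as the difference of differences of the
# four (true) cell means. The numbers are hypothetical.
mu = {("A1", "B1"): 5.0, ("A1", "B2"): 8.0,
      ("A2", "B1"): 7.0, ("A2", "B2"): 10.0}

# Effect of Factor B at each level of A: the vertical gap between the lines
gap_at_A1 = mu[("A1", "B2")] - mu[("A1", "B1")]
gap_at_A2 = mu[("A2", "B2")] - mu[("A2", "B1")]

# Equal gaps => parallel lines => no interaction between A and B
interaction = gap_at_A2 - gap_at_A1
```

With sample means the gaps will rarely be exactly equal; as the text notes, it is the *degree* of parallelism that matters.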
Illustrations of various theoretical profile plots
No main effect of Factor A (average of lines is flat); small main effect of Factor B (if there was no
main effect of Factor B the lines would coincide); and no interaction of Factors A and B.
Large main effect of Factor A; small main effect of Factor B (average difference between lines is
small); and no interaction between Factors A and B.
No main effect of Factor A; large main effect of Factor B; and no interaction between Factors A and
B.
Large main effect of Factor A; large main effect of Factor B; and no interaction between Factors A
and B.
No main effect of Factor A; no main effect of Factor B; but large interaction between Factors A and
B. This illustrates the dangers of investigating main effects in the presence of interaction (why? - a
good exam question!).
Large main effect of Factor A; no main effect of Factor B; slight interaction. Again, this diagram
illustrates the folly of discussing main effects in the presence of an interaction (why?).
No main effect of Factor A; large main effect of Factor B; large interaction between Factor A and B.
As before, there may be problems in interpreting main effects in the presence of an interaction (why?).
Small main effect of Factor A; large main effect of Factor B; large interaction between Factors A and
B. See previous notes about interpreting main effects in the presence of an interaction.
Further examples of profile plots
Discuss the example of an environmental impact study. Here interaction indicates that there was an
impact.
Discuss prole plots for three factors.
Here is the profile plot for an experiment to investigate the effect of wing depth and wing width upon
the flight of paper airplanes. Based upon the profile plot below, what do you conclude?
The MOF publication "Displaying factor relationships", Pamphlet 57, also has a discussion on profile plots
and is available at the MOF library at http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/
index.html.
10.1.2 Experimental unit structure
The experimental unit structure is the key element to the design of the experiment and the recognition of an
existing experimental design.
In single-factor experiments, there is usually only a single size of experimental unit and the only design
choice is blocking or not. However, in multifactor designs, the choices are much greater and the potential
problems in design and analysis multiply!
For example, consider the following experimental designs to investigate the effect of light level (at two
levels - High and Low), and the amount of water (also at two levels - Dry or Wet) upon the growth of pine
tree seedlings in a greenhouse. There are a total of 8 seedlings.
Design 1. Each seedling is put into its own pot. One pot is placed in each of 8 separate growth
chambers. Two growth chambers are assigned to each combination of light level and water amount.
Design 2. Two seedlings are placed in each pot. One pot is placed in each of 4 separate growth
chambers. Each growth chamber is given one combination of light level and water amount.
Design 3. Each seedling is put into its own pot. Two pots are placed into each of 4 separate growth
chambers. Each growth chamber is assigned one combination of light level and water amount.
Design 4. Each seedling is put into its own pot. Two pots are placed in each of 4 separate growth
chambers. Two of the growth chambers are assigned the high light level; two are assigned the low
light level. Within each chamber, one pot receives the wet water level; one pot the dry water level.
Design 5. Two seedlings are placed in each pot. Two pots are placed in each of 4 separate growth
chambers. Two of the growth chambers are assigned the high light level; two are assigned the low
light level. Within each chamber, one pot receives the wet water level; one pot the dry water level.
The growth of each seedling is measured.
Without much difficulty, more ways could be found to run this experiment! Each different design
requires a different analysis!
Which experimental unit structure is better? Design 1 requires the most growth chambers, but is easiest
to run. Design 2 requires fewer growth chambers, but suppose that the particular pot had an effect on growth
(e.g. the previous researcher used a herbicide in a previous experiment and didn't clean the pot properly).
Design 3 requires that two pots be placed in each chamber - is the chamber big enough? There is no best
design that fits all problems!
In all cases, the data will consist of 8 measures of growth along with the light level and water level
received. The data will not tell you about the experimental unit structure! Consequently, it is imperative
that you think very carefully about the experimental unit structure and give explicit instructions on how to
perform the experiment so that there is no ambiguity. You will see later in the course that every one of the
previous designs would be analyzed in a different way!
In this course, we will look at two popular experimental design choices:
1. the Single-size of experimental unit (with and without blocking)
2. the Split-plot design (with and without blocking). The split-plot design (which takes its name from its
agricultural heritage) is the most common complicated design and, unfortunately, the design that is
most often analyzed incorrectly. It is discussed in a different chapter.
The simplest designs have a single size of experimental unit and the observation unit is the same as the
experimental unit, i.e. only one measurement is taken on each experimental unit. The greatest advantage of
using a single sized unit is that loss of that unit only entails the loss of one data point. If you are conducting
multiple measurements on the same unit (e.g. following a unit over time), then the loss of that unit entails
the potential loss of much more information. The greatest disadvantage of using a single-sized unit is that
variation in responses may give poor power. However, the simple strategy of blocking is often sufficient to
improve power without making the design too complicated.
A common problem is pseudo-replication, where the observational unit is not the same as the experimental unit. Hurlbert (1984) should be reread at this point.
In some designs, multiple measurements are taken on the SAME experimental units - typically repeated measurements over time. [2] The most common reason for multiple measurements on the same unit is to have each unit serve as its own control and thereby have greater power to detect changes over the repeated measurements.
10.1.3 Randomization structure
We have already seen two randomization methods:
Complete randomization where treatments are assigned completely at random to experimental units
Complete-Block randomization where experimental units are rst grouped into blocks, every block
has every treatment, and treatments are randomized to units within blocks.
The actual randomization in practice is a straightforward generalization of that done before for a single-
factor design and I won't spend too much time on it, but it will be discussed in class.
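For concreteness, here is one way the two randomization structures could be carried out for a hypothetical two-factor experiment with four treatment combinations. The treatment and block labels are invented, and the seed value is arbitrary:

```python
# Sketch of the two randomization structures described in the text.
import random

random.seed(42)  # for reproducibility; the seed value is arbitrary

treatments = ["High/Wet", "High/Dry", "Low/Wet", "Low/Dry"]

# Complete randomization (CRD): shuffle two replicates of each treatment
# over the 8 experimental units; unit i receives crd[i].
crd = treatments * 2
random.shuffle(crd)

# Complete-block randomization (RCB): every block gets every treatment,
# with the order randomized independently within each block.
blocks = {b: random.sample(treatments, k=len(treatments))
          for b in ["block1", "block2"]}
```

Either way, each treatment appears the required number of times; only the structure over which the randomization occurs differs.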
These will be the only two randomization structures considered in this course. In both cases, there is
complete randomization over all units or over all units within a block. This makes TIME a particularly
difficult factor - there is often no randomization to new units at different time points. The problem is that
non-randomization often introduces more complex covariance structures among the responses. For example,
in repeated measurements over time, measurements that are close together in time would be expected to be
more highly correlated than measurements that are far apart in time. In a complete-randomization scheme,
the correlation would not be expected to change as a function of time separation.
Here are some examples of experiments that are NOT completely randomized designs.
Measuring plankton levels at various locations and distances from the shore. At each location, samples
are taken at 1, 5, and 10 m from the shore. Here the distances from the shore are not randomized to
different locations - each location has all three distances from shore.
The concentration of a chemical in the blood stream is measured on each rat at 1, 5, and 10 minutes
after injection. The time of measurement is not randomized to individual rats - each rat is measured
three times.
[2] Many experiments have TIME as one of their factors. If the same units are measured repeatedly over time, this is definitely NOT a completely randomized design. A more appropriate analysis would be a repeated measures design or a split-plot-in-time design. The former is beyond the scope of this course; the latter will be covered in a later section.
How could these experiments be redesigned to be CRDs?
10.1.4 Putting the three structures together
We will examine the four most popular experimental designs based upon the above three structures. You
may wish to draw a picture of the experimental layouts.
For example, consider a plant-growth experiment. There are two factors - light level (High and Low),
and water level (Wet or Dry). Some possible designs that we will demonstrate how to analyze in this course
include:
Completely randomized design. Each seedling is randomly placed in its own pot. Each pot is randomly placed in its own greenhouse. Each greenhouse is assigned one of the four treatments
at random. The experiments are all run at the same time or run in random order.
Randomized complete block design. Each seedling is randomly placed in its own pot. Four pots are
placed in each greenhouse. Within each greenhouse, each of the four pots is randomly assigned to one
of the four treatment combinations. [This may require some modification to the greenhouse so that
the two light levels can be applied within each greenhouse.]
Split-plot - variant A. It may be too difficult to modify the greenhouses to have both light levels in
each greenhouse. Therefore, two greenhouses are randomly assigned to each light level. Within each
greenhouse, two pots are used. These are randomly assigned to the two watering levels.
Split-plot - variant B. Four greenhouses are not available at one site, but we have two sites available,
each with two greenhouses. Therefore, one greenhouse at each site is randomly assigned to each light
level. Within each greenhouse, two pots are used. These are randomly assigned to the two watering
levels.
Further reading: Refer to the MOF publication "What is the design?", Pamphlet 17, from the MOF library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/index.html for more details.
10.1.5 Balance
Balance is a statistical property of a design. Balance used to be much more important in the days of hand
computation, when the computations for balanced designs were particularly easy to do. This is less important
today in the age of computers, but the unwary traveler may hit a few potholes, as will be seen in later
sections.
The very simplest balanced design has an equal number of replicates assigned to each treatment combination. Balance in the experiment will give the greatest power to detect differences among the various
treatments. As well, it makes the analysis particularly simple and most computer packages will do a good
job of the analysis.
However, because of deliberate decision or demonic interference, unbalanced designs (unequal numbers
of replicates for each treatment) can occur. Fortunately, the analysis of such designs in the simple case of
a two-factor completely-randomized design is straightforward, but there are a few subtle problems that will
be pointed out by example. Some computer packages will give incorrect answers in the case of unbalanced
data.
As you will see in a future section, the greatest danger in unbalanced designs is the lack of a complete
factorial treatment structure. These types of experiments are extremely difficult to analyze properly.
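Checking the simplest form of balance amounts to counting replicates per treatment combination. A small sketch (the unit records are invented; one Low/Dry unit is assumed to have been lost):

```python
# Check balance by counting replicates for each treatment combination.
from collections import Counter

# Hypothetical record of (light, water) assigned to each experimental unit;
# the single Low/Dry replicate represents a lost unit.
units = [("High", "Wet"), ("High", "Dry"), ("Low", "Wet"), ("Low", "Dry"),
         ("High", "Wet"), ("High", "Dry"), ("Low", "Wet")]

reps = Counter(units)

# Balanced (in the simplest sense): all 4 combinations present, equal counts
balanced = len(reps) == 4 and len(set(reps.values())) == 1
```

Here `balanced` is `False` because the Low/Dry cell has only one replicate; note that a *missing* cell would be the more serious problem flagged in the text.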
10.1.6 Fixed or random effects
In some cases, the choice of levels for a factor is also of concern. If the experiment were to be repeated,
would the same levels be chosen (fixed effects) or would a new set of levels be chosen (random effects)?
Or, is interest limited to the effects of the levels that actually occurred in the experiment (fixed effects), or
do you wish to generalize to a large population of levels from which you happened to choose a few for this
experiment (random effects)?
As an illustration, consider an experiment on the effects of soil compaction on subsequent tree growth.
Suppose that the experimenter obtained seedlings from several different seed sources. This experiment could
be viewed as having two factors - the level of soil compaction, and the seeding source.
Presumably, if the experiment were to be repeated, the same levels of compaction would be of interest.
As well, these levels of compaction are of interest in their own right. Hence, compaction would be treated
as a fixed effect.
However, what about the factor seed source? If the experiment were to be repeated, would the same
sources of seeds be used? Are these the only sources of seeds available, or are there many other sources, of
which only a few were chosen to be in this experiment? Do you want to extend your inference to other seed
sources, or are these the only ones that you are really interested in?
Usually, you will want to argue that your conclusions should extend to other seed sources. If this is the
case, then you must be able to argue that the sources you used are in some sense typical of the ones to which
you want to extend your inference. The simplest way to do this is to argue that the sources you selected
were essentially a random sample of all possible seed sources to which you wish to extend your inference.
This factor is then a random effect.
The crucial first step in this model building is deciding which factors are fixed and which factors are
random effects.
A factor is a fixed effect if:
the same levels would be used if the experiment were to be repeated;
inference will be limited ONLY to the levels used in the experiment;
A factor is a random effect if:
the levels were chosen at random from a larger set of levels
new levels would be chosen if the experiment were to be repeated
inference is about the entire set of potential levels - not just the levels chosen in the experiment.
Typical fixed effects are factors such as sex, species, dose, chemical. Typical random effects are subject,
locations, sites, animals.
Examples of random effects would be subjects, batches of material, or experimental animals. Rarely are
experiments run with an interest in the effects of those particular subjects, batches, or experimental animals
- it is quite common to try and extrapolate results to future, different, subjects, batches, or animals.
We will start with a demonstration of the analysis of experiments where ALL EFFECTS are fixed effects.
You can still proceed to analyze experiments with random effects the same way as before, up to a point.
It turns out that this seemingly innocuous change to a factor has dramatic implications for the analysis of
the experiment! As well MANY poorly written packages will give WRONG RESULTS! In addition,
contrary to the impression that statistics is a static science, the whole area of the analysis of models with
fixed and random effects is undergoing a revolution in the statistical world. Many of the newer techniques
are not discussed in textbooks and certainly not in the published literature. Even experienced statisticians
have difficulty in keeping up with advances in this area.
For all but the simplest cases, seek help with models containing combinations of fixed and random effects
(often called mixed models). Please contact me for additional help.
10.1.7 Assumptions
Each and every statistical procedure makes a number of assumptions about the data that should be verified as
the analysis proceeds. Some of these assumptions can be examined using the data at hand; others, often the
most important, can only be assessed using the meta-data about the experiment. Fortunately, many of the
assumptions are identical to those seen in previous chapters. Please consult previous chapters for details on
how to verify the assumptions.
The most important assumptions to examine are:
The analysis matches the design! Enough said in past chapters.
Equal variation within treatment groups. All the populations corresponding to treatments have
equal variances. This can be checked by looking at the sample standard deviations for each group
(where each group is formed by one of the treatment combinations). Unless the ratio between the
standard deviations is larger than about 5:1, this is not likely a problem. Procedures are available for
cases where the variances are not equal in all groups. Fortunately, ANOVA is fairly robust to unequal
variances if the design is balanced.
Often you can anticipate an increase in the amount of chance variation with an increase in the mean.
For example, traps with an ineffective bait will typically catch very few insects. The numbers caught
may typically range from 0 to under 10. By contrast, a highly effective bait will tend to pull in more
insects, but also with a greater range. Both the mean and the standard deviation will tend to be larger.
A transformation may be called for (e.g. take logarithms of the response).
No outliers. There are no outliers or unusual points. Look at the side-by-side dot plots formed by the
treatment groups. Examine the residual plots after the model is fit.
Normality within each treatment group. If the sample sizes are small in each group, then you must
further assume that each population has a normal distribution. If the sample sizes are large in all
groups, you are saved by the central limit theorem. Normal probability plots within each treatment
group, or of the residuals found after the model fitting procedure, can be examined. However, these likely
have poor power when the sample sizes are small and will detect minute differences when sample
sizes are large. Hence, they are often not very informative.
Are the errors independent? Another key assumption is that experimental units are independent
of each other. For example, the response of one experimental animal does not affect the response of
another experimental animal.
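As a quick numerical screen for the equal-variation assumption, you can compute the sample standard deviation in each treatment group and compare the largest to the smallest. Below is a minimal Python sketch (illustrative only - these notes use SAS/JMP, and the data and group labels here are hypothetical):

```python
from statistics import stdev

# Hypothetical responses for three treatment groups (replace with your own data)
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 14.0, 12.0],
    "C": [5.5, 6.0, 6.5],
}

# Sample standard deviation within each treatment group
sds = {g: stdev(v) for g, v in groups.items()}

# Rough screen: a largest-to-smallest SD ratio under about 5:1
# is unlikely to be a problem for ANOVA
ratio = max(sds.values()) / min(sds.values())
print(sds, "max/min ratio:", ratio)
```

Here the ratio is 4:1, which under the rule of thumb above would not be alarming, though a transformation could still be considered if the variation increases with the mean.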
10.1.8 General comments
The key to a proper analysis of any experiment is recognizing the design that was used and then specifying
a statistical model that incorporates the sources of variation in the design.
As you will see, this statistical model will have terms representing the main effects and interactions of
the factors and terms for every size of experimental unit in the experiment. [The latter will become important
when we analyze a split-plot design.]
Once the model is specified, then the analysis of variance method (ANOVA) partitions the total variation
in the observed responses into sources - one for each component representing a main effect or an interaction,
and one for every size of experimental unit. [Again, the latter will become more important in split-plot
designs.] For example, in the single factor CRD design, the ANOVA table consisted of a line for total
variation which is then split into sources representing the contributions from the single factor (the treatment
sum of squares), and a contribution for experimental unit effects (the error sum of squares). In the single
factor RCB design, the ANOVA table introduced yet another entry for the contribution from blocks (the
block sum of squares).
As you will see later in this chapter, a two-factor design will have lines in the ANOVA table corresponding
to the interaction of the two factors and their respective main effects.
Then, starting with the interaction, you successively test the hypothesis of no effect from that source, i.e.,
you first test the hypothesis of no interaction effects, and then, depending upon the results of the test, you
may or may not wish to test the main effects.
Rarely, if ever, are tests performed on experimental unit effects - it would be quite rare to expect that the
experimental units are exactly identical!
Again, the hypothesis tests only tell you that some effect exists - they don't tell you where the effect lies.
You may need to explore the responses using multiple comparison procedures and/or confidence intervals
for the marginal means or contrasts among means.
As before, you should always assess that your model adequately fits the data, and, before performing
the experiment, determine if the sample size is adequate to detect biologically important effects.
When reporting results in a paper or thesis, try not to overburden the reader - no one is interested in the
minute details - they want a broad picture - everything else can likely go into an appendix.
Be sure you carefully describe the experimental design so that someone can verify how the experiment
was done.
I would recommend that you ALWAYS show a profile plot of the means of the various treatment
combinations along with approximate 95% confidence intervals - this often tells the entire story.
In terms of the actual statistical computations, usually the F-statistics are reported along with the
p-values, but rarely are all the ANOVA tables shown except in appendices or simple tables. In this day
of the WWW, the raw data are often made available on a Web site in case someone else wishes to verify
your analysis.
10.2 Example - Effect of photo-period and temperature on gonado-
somatic index - CRD
This is the simplest of the two-factor designs and serves as a template for the analysis of more complex
designs. As noted many times in this course, it is important to match the analysis of the data with the way
the data were collected. Before attempting to analyze any experiment, the features of the experiment should
be examined carefully. In particular, consider the treatment, experimental unit, and randomization structures;
the presence or absence of balance; whether the levels of the factors are fixed or random effects; and the
assumptions implicitly made for the design.
The Mirogrex terrau-sanctae is a commercial sardine-like fish found in the Sea of Galilee. A study was
conducted to determine the effect of light and temperature on the gonadosomatic index (GSI), which is a
measure of the growth of the ovary. [It is the ratio of the gonad weight to the non-gonad weight.] Two photo-
periods (14 hours of light, 10 hours of dark; and 9 hours of light, 15 hours of dark) and two temperature
levels (16°C and 27°C) are used. In this way, the experimenter can simulate both winter and summer
conditions in the region.
Twenty females were collected in June. This group was randomly divided into four subgroups, each
of size 5. Each fish was placed in an individual tank, and received one of the four possible treatment
combinations. At the end of 3 months, the GSI was measured.
Here are the raw data:
                 Photo-period
Temperature   9 hours   14 hours
27°C             0.90       0.83
                 1.06       0.67
                 0.98       0.57
                 1.29       0.47
                 1.12       0.66
16°C             2.31       1.01
                 2.88       1.52
                 2.42       1.02
                 2.66       1.32
                 2.94       1.63
10.2.1 Design issues
There are two factors in this experiment - photo-period with 2 levels; and temperature also with 2 levels.
What is the treatment structure?
All of the 4 possible treatment combinations (which are?) appear in this study - hence it has a factorial
treatment structure.
Now the purpose of this experiment was to simulate summer and winter conditions - however, two of the
treatment combinations seem unnatural. Why were these treatment combinations used? How could you run
this experiment if you really were interested only in the summer and winter conditions? Is any confounding
taking place?
What is the experimental unit structure?
The experimental units were individual tanks and the observational units were the individual fish within
a tank. There is only one observational unit per experimental unit.
There are a total of 20 fish, each of which was placed in an individual tank. This seems kind of wasteful
- 20 tanks are needed, as five of the tanks are needed for each photo-period and
temperature treatment combination. What is the problem if you used only 4 tanks with 5 fish in each tank?
[Hint - what are the experimental and observational units - and is this pseudo-replication?]
What is the randomization structure?
The article was not very clear, but the treatments appear to be completely randomly assigned to the tanks,
etc.
Balance
The design is balanced as an equal number of replicates was performed for each treatment combination.
Fixed or random factors?
Are the factors to be considered fixed effects? In this case, you would use exactly the same levels of both
factors if the study were repeated - therefore both of the factors are fixed effects.
10.2.2 Preliminary summary statistics
Before doing any formal analyses, it is always advisable to do some preliminary plots and compute some
simple summary statistics - even if these don't fully tell the whole story. Here are some simple plots and
summary statistics. [Note that the above data must be converted to a data file in standard format with the
appropriate scale of measurements.]
First, the data have to be converted to columnar format with one column being the response (the GSI) and
two columns representing the two factors. It is advantageous to use alphanumeric codes for factor levels as
then there is no ambiguity about whether the values represent categories or whether the levels represent a regression effect.
Here is the data in proper columnar format.
Temperature Photo-period GSI
27C 09h 0.90
27C 09h 1.06
27C 09h 0.98
27C 09h 1.29
27C 09h 1.12
16C 09h 2.31
16C 09h 2.88
16C 09h 2.42
16C 09h 2.66
16C 09h 2.94
27C 14h 0.83
27C 14h 0.67
27C 14h 0.57
27C 14h 0.47
27C 14h 0.66
16C 14h 1.01
16C 14h 1.52
16C 14h 1.02
16C 14h 1.32
16C 14h 1.63
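The conversion from the two-way table above to this columnar (long) format can also be scripted. Here is an illustrative Python sketch of the reshaping step (not part of the SAS workflow used in these notes):

```python
# Reshape the wide GSI table (temperature x photo-period cells)
# into long format with one row per fish
wide = {
    ("27C", "09h"): [0.90, 1.06, 0.98, 1.29, 1.12],
    ("27C", "14h"): [0.83, 0.67, 0.57, 0.47, 0.66],
    ("16C", "09h"): [2.31, 2.88, 2.42, 2.66, 2.94],
    ("16C", "14h"): [1.01, 1.52, 1.02, 1.32, 1.63],
}

long_rows = [(temp, photo, gsi)
             for (temp, photo), values in wide.items()
             for gsi in values]

print(len(long_rows))  # 20 fish, one row each
```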
The raw data are available in a data file called gsi.csv available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into
SAS in the usual way:
data gsi;  /* read in the raw data */
   infile 'gsi.csv' dlm=',' dsd missover firstobs=2;
   length temp $3. photo $3.;
   input temp $ photo $ gsi;
   length trt $15.;
   trt = temp || '-' || photo;  /* create a trt variable */
run;
The raw data are shown below:
Obs temp photo gsi trt
1 16C 09h 2.31 16C-09h
2 16C 09h 2.88 16C-09h
3 16C 09h 2.42 16C-09h
4 16C 09h 2.66 16C-09h
5 16C 09h 2.94 16C-09h
6 16C 14h 1.01 16C-14h
7 16C 14h 1.52 16C-14h
8 16C 14h 1.02 16C-14h
9 16C 14h 1.32 16C-14h
10 16C 14h 1.63 16C-14h
Sometimes it is easier to create a pseudo-factor consisting of the actual treatment levels to make simple
plots and to find simple summary statistics. A new variable was created by concatenating the values of the
two factor levels together.
Because this is a completely randomized design, there is no conceptual difference between a two factor
design (each with 2 levels) and a single factor design with 4 levels. In more complex designs, this is not true.
We begin by creating side-by-side dot plots:
proc sgplot data=gsi;
   title2 'Side-by-side dot plots';
   yaxis label='GSI' offsetmin=.05 offsetmax=.05;
   xaxis label='Treatment' offsetmin=.05 offsetmax=.05;
   scatter x=trt y=GSI / markerattrs=(symbol=circlefilled);
run;
which gives
We also compute the means and standard deviations for each treatment group:
proc tabulate data=gsi;  /* proc tabulate is not for the faint of heart */
   title2 'Summary table of means, std devs';
   class temp photo;
   var gsi;
   table temp*photo, gsi*(n*f=5.0 mean*f=5.2 std*f=5.2 stderr*f=5.2) /rts=15;
run;
giving
                       gsi
                N   Mean    Std  StdErr
temp   photo
16C    09h      5   2.64   0.28    0.12
       14h      5   1.30   0.28    0.13
27C    09h      5   1.07   0.15    0.07
       14h      5   0.64   0.13    0.06
Because the overall design is a CRD, the standard errors reported are sensible. If a blocking factor was
available, the various packages would also have computed proper standard errors after block centering. In
all other cases, the reported standard errors would not be sensible as the assumed design in most packages
is a CRD which won't match the actual design.
Hmmm... the standard deviations seem to show that the variability at 27°C is about half of that at 16°C.
This is an interesting effect in its own right - however, the change in standard deviation is small enough
that it shouldn't be too much of a concern for this problem. [As a rough rule of thumb, unless the ratio
of standard deviations from small samples is on the order of at least 3 to 5 times different, there is likely
nothing to worry about.]
The design is balanced - every treatment has the same number of replications - and this makes the analysis
easier. Also, every treatment combination has some data - missing cells, where some cells have no data, are
a REALLY MESSY PROBLEM. Most statisticians even have difficulty in analyzing such experiments -
beware!
You should also draw a preliminary profile plot to get a sense of the level of interaction, if any. This can
be done by hand or using Excel. We can find the sample means and confidence limits for each population
mean quite simply because this is a CRD.
proc sort data=gsi;
   by temp photo;
run;

proc means data=gsi noprint;  /* find simple summary statistics */
   by temp photo;
   var gsi;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series y=mean x=temp / group=photo;
   highlow x=temp high=uclm low=lclm / group=photo;
   yaxis label='GSI' offsetmin=.05 offsetmax=.05;
   xaxis label='Temperature' offsetmin=.05 offsetmax=.05;
run;
and then get the profile plot:
It appears that there may be a bit of interaction between the two factors - the lines are not parallel. It
would be easier to assess interaction if the approximate 95% c.i. were drawn for each mean - why most
packages don't do this is beyond me.
Looking at the profile plot above, what is the effect of photo-period at 16°C? At 27°C? What is the effect
of temperature at 9 h? At 14 h?
10.2.3 The statistical model
The statistical model for any design has terms corresponding to the treatment, experimental unit, and
randomization structure. Fortunately, in simple designs, the latter two are often implicit and do NOT have to be
specified by the analyst.
Any factorial treatment structure will have terms corresponding to interactions and main effects.
In cases where there is only one size of experimental unit, and no subsampling or pseudo-replication,
and no blocking, then it is not necessary to specify any term for experimental unit effects (this corresponds
to the MSE line in the ANOVA table).
In cases of complete randomization, there is no need to specify anything further in the model. In cases
of non-randomization (e.g. repeated measurements over time), you might specify the covariance structure of
the observations.
This gives a model often written as:

GSI = temp photo temp*photo
What the statistical model says is that we recognize that the observed GSI response values (left of equals
sign) are not all the same. What are the various sources of variation in the observed responses? These appear
to the right of the equal sign. Well, we expect some differences due to the main effects of temperature, some
differences due to the main effects of photo-period, some differences possibly caused by an interaction
between photo-period and temperature. Note that the * does NOT imply multiplication, but rather an
interaction between two factors. The terms can be written in any order.
There are NO terms representing experimental units (this implies there is a single size of experimental
unit), nor any terms representing randomization effects (complete randomization is assumed).
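The sources of variation named in this model can be illustrated numerically from the cell means of this example. The Python snippet below (an illustration only, not part of the SAS analysis) splits each cell mean into a grand mean, two main effects, and an interaction term:

```python
import numpy as np

# Cell means for the GSI example: rows = temp (16C, 27C), cols = photo (09h, 14h)
cell = np.array([[2.642, 1.300],
                 [1.070, 0.640]])

grand = cell.mean()                           # grand mean
temp_eff = cell.mean(axis=1) - grand          # main effects of temperature
photo_eff = cell.mean(axis=0) - grand         # main effects of photo-period

# What is left after the grand mean and main effects is the interaction
inter = cell - grand - temp_eff[:, None] - photo_eff[None, :]

print(grand, temp_eff, photo_eff)
print(inter)  # non-zero entries signal interaction
```

The interaction terms work out to about ±0.23 GSI units - the non-parallelism that is visible in the profile plot shown earlier.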
10.2.4 Fitting the model
The model is fit using least squares in the usual fashion.
There are two main procedures in SAS for fitting ANOVA and regression models (collectively called
linear models). First is Proc GLM, which performs the traditional sums-of-squares decomposition. Second
is Proc Mixed, which uses restricted maximum likelihood (REML) to fit models. In models with only fixed
effects (e.g. those in this chapter), this gives the same results as sums-of-squares decompositions. In unbalanced
data with additional random effects, the results from Proc Mixed may differ from those of Proc GLM -
both are correct, and are just different ways to deal with the approximations necessary for unbalanced data.
Proc Mixed also has the advantage of being able to fit more complex models with more than one size of
experimental unit. There is no clear advantage to using one procedure or the other - it comes down to personal
preference. Both procedures will be demonstrated in this chapter - I personally prefer to use Proc Mixed.
Here is the code for Proc GLM:
ods graphics on;
proc glm data=gsi plots=all;
   title2 'Anova';
   class photo temp;  /* class statement identifies factors */
   model gsi = photo temp photo*temp;
   /* because interaction is significant, only get trt means and do multiple comp */
   lsmeans photo*temp / cl stderr pdiff adjust=tukey lines;
run;
ods graphics off;
Here is the code for Proc Mixed:
ods graphics on;
proc mixed data=gsi plots=all;
   title2 'ANOVA using Mixed';
   class photo temp;
   model gsi=photo temp photo*temp / ddfm=kr;
   lsmeans photo*temp / adjust=tukey diff cl;
   ods output tests3 =MixedTest;  /* needed for the pdmix800 */
   ods output lsmeans=MixedLsmeans;
   ods output diffs  =MixedDiffs;
run;
ods graphics off;

/* Get a joined lines plot */
%include '../pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
In both procedures, the Class statement specifies the categorical factors for the model. Notice how you
specify the statistical model in the Model statement - it is very similar to the statistical model seen earlier.
The extra code at the end of Proc Mixed is to generate the joined-line plots that we've seen earlier based on
the output from the LSmeans statements (see below).
10.2.5 Hypothesis testing and estimation
The output from the fitting procedure is often divided into sections corresponding to the whole model and
then sections corresponding to each individual term in the model.
The first output below is not very useful - it is a whole-model test which simply examines if there is
evidence of an effect for any term in the model. It is rarely useful.
Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 11.19194000 3.73064667 76.07 <.0001
Error 16 0.78468000 0.04904250
Corrected Total 19 11.97662000
No such table is produced by Proc Mixed.
The Effects Table breaks down the Model line in the whole-model test into the components for every
term in the model. Some packages give you a choice of effect tests. For example, SAS will print out Type
I, II, and III tests - in balanced data these will always be the same and any can be used. In unbalanced data,
these tests will have different results - which test is the correct test is still an item of controversy among
statisticians and can (and does) result in fist-fights among the various camps (and you thought statistics was
dull!)
Here is the table for the Effect Tests computed by Proc GLM:
Source DF Type III SS Mean Square F Value Pr > F
photo 1 3.92498000 3.92498000 80.03 <.0001
temp 1 6.22728000 6.22728000 126.98 <.0001
photo*temp 1 1.03968000 1.03968000 21.20 0.0003
Here is the table for the Effect Tests computed by Proc Mixed. Because the model only consists of fixed
effects, the results are identical to those from Proc GLM:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
photo 1 16 80.03 <.0001
temp 1 16 126.98 <.0001
photo*temp 1 16 21.20 0.0003
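For balanced data, the sums of squares in these tables can be reproduced directly from the raw responses. The following Python sketch (illustrative only; the notes use SAS for the actual analysis) reconstructs the decomposition for the GSI example:

```python
import numpy as np

# Raw GSI data keyed by (temp, photo); 5 fish per cell
data = {
    ("16C", "09h"): [2.31, 2.88, 2.42, 2.66, 2.94],
    ("16C", "14h"): [1.01, 1.52, 1.02, 1.32, 1.63],
    ("27C", "09h"): [0.90, 1.06, 0.98, 1.29, 1.12],
    ("27C", "14h"): [0.83, 0.67, 0.57, 0.47, 0.66],
}
y = np.array(list(data.values()))        # shape (4, 5): cells x replicates
cell = y.mean(axis=1).reshape(2, 2)      # rows = temp, cols = photo
grand = y.mean()
reps = 5

# Balanced two-factor sums of squares
ss_temp  = 2 * reps * np.sum((cell.mean(axis=1) - grand) ** 2)
ss_photo = 2 * reps * np.sum((cell.mean(axis=0) - grand) ** 2)
ss_cells = reps * np.sum((cell - grand) ** 2)
ss_int   = ss_cells - ss_temp - ss_photo
ss_err   = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2)

mse = ss_err / 16                        # error df = 20 obs - 4 cells
f_int = (ss_int / 1) / mse               # interaction has 1 df
print(ss_temp, ss_photo, ss_int, ss_err, f_int)
```

These reproduce the values above: SS(photo) = 3.92498, SS(temp) = 6.22728, SS(photo*temp) = 1.03968, SS(error) = 0.78468, and F = 21.20 for the interaction.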
Start the hypothesis testing with the most complicated effects (usually interactions) and work towards
simpler terms (main effects).
In this case, we start with the test for no interaction effects.
Our null hypothesis is
H: no interaction between photo-period and temperature in their effects on the mean GSI level
A: some interaction between photo-period and temperature in their effects on the mean GSI level.
Our test statistic is F=21.2; the p-value (.0003) is very small. There is very strong evidence of an
interaction between the two factors in their effects on the mean GSI level. This is not surprising; the profile
plots showed that the lines didn't appear to be parallel.
What does a statistically significant interaction mean? It implies that the effect of temperature upon the
mean GSI is different at the various photo-period levels. Similarly, the effect of photo-period upon the
mean GSI is different at the two temperature levels.
If you detect an interaction, it usually doesn't make much sense to continue along to test main effects
because, by definition, these are not consistent - e.g. the effect of temperature is different at the two photo-
periods.
What to do if an interaction is present?
There is no single way to proceed after this point. Some authors suggest that you now break up the data
into two mini-experiments and analyze each separately. For example, analyze each photo-period separately
and analyze each temperature level separately to estimate the effects at each of the various levels. As these
mini-experiments are now simply single-factor CRDs (in this case two-sample t-tests), all the machinery that
we had before can be brought to bear. The disadvantage of this approach is that you forgo pooling of the
error variances from all four groups.
Because there was strong evidence of an interaction (non-parallelism), you should not examine the main
effects unless the non-parallelism is small. Rather, you should now find estimates of the population means
for each combination of the factor levels (along with their standard errors), and then do a multiple comparison
procedure (e.g. Tukey's) to examine which pairs of means could differ. SAS also provides a mechanism to
examine slices of the data, e.g. for each factor level separately - please contact me for more details on this.
Here are estimates of the marginal means (and standard errors), and then all of the pairwise differences
are compared using Tukey's procedure. These are requested using the LSmeans statement in the GLM
procedure. The LSMeans estimates are equal to the raw sample means - this will be true ONLY in
balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate
marginal means.
photo temp gsi LSMEAN Standard Error Pr > |t| LSMEAN Number
09h 16C 2.64200000 0.09903787 <.0001 1
09h 27C 1.07000000 0.09903787 <.0001 2
14h 16C 1.30000000 0.09903787 <.0001 3
14h 27C 0.64000000 0.09903787 <.0001 4
photo temp gsi LSMEAN 95% Confidence Limits
09h 16C 2.642000 2.432049 2.851951
09h 27C 1.070000 0.860049 1.279951
14h 16C 1.300000 1.090049 1.509951
14h 27C 0.640000 0.430049 0.849951
Least Squares Means for effect photo*temp
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: gsi
i/j 1 2 3 4
1 <.0001 <.0001 <.0001
2 <.0001 0.3845 0.0333
3 <.0001 0.3845 0.0012
4 <.0001 0.0333 0.0012
Least Squares Means for Effect photo*temp
i j Difference Between Means Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1 2 1.572000 1.171287 1.972713
1 3 1.342000 0.941287 1.742713
1 4 2.002000 1.601287 2.402713
2 3 -0.230000 -0.630713 0.170713
2 4 0.430000 0.029287 0.830713
3 4 0.660000 0.259287 1.060713
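The standard errors and interval half-widths in this table follow directly from the error mean square. A short Python sketch (illustrative only; the studentized-range critical value q(0.05; 4 means, 16 df) ≈ 4.046 is an assumed table value here, not computed):

```python
import math

mse, reps = 0.0490425, 5  # error mean square and replicates per cell

# Standard error of the difference between two cell means
se_diff = math.sqrt(2 * mse / reps)

# Tukey HSD half-width: q / sqrt(2) * se_diff, with q = q(0.05; 4, 16)
q = 4.046  # assumed value from studentized-range tables
half_width = q / math.sqrt(2) * se_diff

print(se_diff, half_width)
```

This reproduces the reported standard error of 0.1401 and, for example, the first interval 1.572 ± 0.401 = (1.171, 1.973).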
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of treatment combinations to where they intersect on either a solid
blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval
for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence
that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the
difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence
of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey Comparison Lines for Least Squares Means of photo*temp
LS-means with the same letter are not significantly different.
gsi LSMEAN photo temp LSMEAN Number
A 2.642 09h 16C 1
B 1.300 14h 16C 3
B
B 1.070 09h 27C 2
C 0.640 14h 27C 4
The output from Proc Mixed is not quite as extensive but, in my opinion, is organized in a much more
logical fashion than in Proc GLM. First, here are the estimates of the marginal means:
Least Squares Means
Effect photo temp Estimate Standard Error DF t Value Pr > |t| Alpha Lower Upper
photo*temp 09h 16C 2.6420 0.09904 16 26.68 <.0001 0.05 2.4320 2.8520
photo*temp 09h 27C 1.0700 0.09904 16 10.80 <.0001 0.05 0.8600 1.2800
photo*temp 14h 16C 1.3000 0.09904 16 13.13 <.0001 0.05 1.0900 1.5100
photo*temp 14h 27C 0.6400 0.09904 16 6.46 <.0001 0.05 0.4300 0.8500
and then the pairwise differences:

photo temp  _photo _temp  Estimate  StdErr  Adjustment  Adj P    Adj Low   Adj Upp
09h   16C   09h    27C     1.5720   0.1401  Tukey       <.0001    1.1713    1.9727
09h   16C   14h    16C     1.3420   0.1401  Tukey       <.0001    0.9413    1.7427
09h   16C   14h    27C     2.0020   0.1401  Tukey       <.0001    1.6013    2.4027
09h   27C   14h    16C    -0.2300   0.1401  Tukey       0.3845   -0.6307    0.1707
09h   27C   14h    27C     0.4300   0.1401  Tukey       0.0333    0.02929   0.8307
14h   16C   14h    27C     0.6600   0.1401  Tukey       0.0012    0.2593    1.0607
The pdmix800 macro is used to produce the joined-line plots:
Effect=photo*temp Method=Tukey(P<0.05) Set=1
Obs photo temp Estimate Standard Error Alpha Lower Upper Letter Group
1 09h 16C 2.6420 0.09904 0.05 2.4320 2.8520 A
2 14h 16C 1.3000 0.09904 0.05 1.0900 1.5100 B
3 09h 27C 1.0700 0.09904 0.05 0.8600 1.2800 B
4 14h 27C 0.6400 0.09904 0.05 0.4300 0.8500 C
The above output can be used to see which means appear to differ from each other.
This is also where the profile plot shown earlier is produced - it is a pity that the plot doesn't show the
confidence intervals.
Alternatively, you can return to the pseudo-factor that you defined earlier. This again converts the
experiment from a two-factor CRD to a single-factor CRD with 4 levels. This approach is only valid because the
original design is a CRD.
Below is the result of such an analysis (using JMP only, but the other packages are similar).
The ANOVA table is identical to the Overall Model ANOVA table earlier. Here are the estimated means for
each treatment and the estimated standard errors - identical to the above analysis.
We perform a Tukey multiple comparison procedure which gives the same results as found previously.
What can you conclude from these analyses? In particular, can you see why interaction was detected?
Which means appear to be different from the others?
Don't forget to examine the residual and other diagnostic plots (often produced automatically by
packages). Here is the diagnostic plot from Proc GLM:
Here is the diagnostic plot from Proc Mixed:
There is no evidence of any gross problems in the analysis. There is some evidence that the residuals are
not equally variable, but the difference in variance is not large.
10.3 Example - Effect of sex and species upon chemical uptake - CRD
Several persistent chemicals accumulate up the food chain. Different species may differ in the amount of
chemicals accumulated because of different prey availability or other factors. Because of different behavior,
the accumulated amount may also vary by sex.
A survey was conducted to investigate how the amount of PCBs varied among three different species of
fish in Nunavut (the new Canadian territory just to the east of the Restot and just north of Ulot). Samples
were taken from four fish of each sex and species and liver PCB levels (ppm) were measured.
Here are the raw data:
PCB sex Species
21.5 m sp1
19.6 m sp1
20.9 m sp1
22.8 m sp1
14.5 m sp2
17.4 m sp2
15.0 m sp2
17.8 m sp2
16.0 m sp3
20.3 m sp3
18.5 m sp3
19.3 m sp3
14.8 f sp1
15.6 f sp1
13.5 f sp1
16.4 f sp1
12.1 f sp2
11.4 f sp2
12.7 f sp2
14.5 f sp2
14.4 f sp3
14.7 f sp3
13.8 f sp3
12.0 f sp3
The raw data are available in a data file called pcb.csv available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into
SAS in the usual way:
data pcb;  /* read in the raw data */
   infile 'pcb.csv' dlm=',' dsd missover firstobs=2;
   length sex $10. species $10. trt $15.;
   input pcb sex $ species $;
   trt = compbl(sex || "-" || species);  /* create a pseudo factor */
run;
The raw data are shown below:
Obs sex species trt pcb
1 f sp1 f -sp1 14.8
2 f sp1 f -sp1 15.6
3 f sp1 f -sp1 13.5
4 f sp1 f -sp1 16.4
5 f sp2 f -sp2 12.1
6 f sp2 f -sp2 11.4
7 f sp2 f -sp2 12.7
8 f sp2 f -sp2 14.5
9 f sp3 f -sp3 13.8
10 f sp3 f -sp3 12.0
10.3.1 Design issues
There are two factors in this experiment - sex with 2 levels and species with 3 levels.
What is the treatment structure? All of the 6 possible treatment combinations (which are?) appear in this
study - hence it has a factorial treatment structure.
What is the experimental unit structure? Hmmm . . . an interesting question. In observational studies it
is often not clear what are the experimental and observational units. For example, is this like a fish tank
study where all the fish in a particular location are subjected to the same treatments (i.e., deposited PCBs)?
Or is each fish subjected to its own experience?
This is a very common problem in observational studies and you should be very careful about the dangers
of pseudo-replication that we explored earlier.
For now, let's treat the experimental units as the individual fish and the observational units as the individual
fish. There is only one observational unit per experimental unit.
Finally, what is the randomization structure? Again, this is often not clear in observational studies. First,
it is quite impossible to randomly assign sex or species to fish. You must view the randomization as arising
from the selection process. Are these fish randomly selected from the entire population of fish of each
species and sex? Or is the sample a convenience sample - i.e., the fish closest to the research station that are
easiest to catch?
In any observational study, you must be careful that the units measured are a proper random sample from
the relevant populations.
Are the factors to be considered fixed effects? Does it seem reasonable that if you were to repeat the
survey, you would select the same sexes and species? In this case, you would use exactly the same levels of
both factors - they are fixed effects.
Hence, this experiment appears to satisfy the requirements for a two-factor fixed-effects CRD. In
particular, the randomization was to individual experimental units and the observational unit is the same as the
experimental unit.
10.3.2 Preliminary summary statistics
Again, create some simple summary statistics. We will create a pseudo-factor consisting of the
combination of the factor levels from the two factors so that summary statistics can be computed on each group. A
new variable was created by concatenating the values of the two factor levels together.
We begin by creating side-by-side dot plots:
proc sgplot data=pcb;
   title2 'Side-by-side dot plots';
   yaxis label='PCB' offsetmin=.05 offsetmax=.05;
   xaxis label='Treatment' offsetmin=.05 offsetmax=.05;
   scatter x=trt y=PCB / markerattrs=(symbol=circlefilled);
run;
which gives
We also compute the means and standard deviations for each treatment group:
proc tabulate data=pcb;   /* proc tabulate is not for the faint of heart */
   title2 'Summary table of means, std devs';
   class sex species;
   var pcb;
   table sex*species, pcb*(n*f=5.0 mean*f=5.2 std*f=5.2 stderr*f=5.2) / rts=15;
run;
giving
                           pcb
sex  species     N   Mean    Std  StdErr
f    sp1         4  15.08   1.24    0.62
     sp2         4  12.68   1.33    0.66
     sp3         4  13.73   1.21    0.60
m    sp1         4  21.20   1.33    0.66
     sp2         4  16.18   1.67    0.83
     sp3         4  18.53   1.84    0.92
The standard deviations are approximately equal in all the groups. There don't appear to be any outliers
or unusual points. In general the males seem to have higher levels of PCBs than the females, but there
doesn't seem to be much of a difference among the mean PCB levels of the species.
The design is balanced - every treatment has the same number of replications - this makes life easier.
Every treatment combination has some data - again it makes our analysis task easier.
We draw the profile plots. This can be done by hand or using Excel or your package of choice. It is
good practice to add approximate 95% confidence intervals to the values so that the parallelism can be more
readily judged. We can find the sample means and confidence limits for each population mean quite simply
because this is a CRD.
proc sort data=pcb;
   by sex species;
run;

proc means data=pcb noprint;   /* find simple summary statistics */
   by sex species;
   var pcb;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series y=mean x=species / group=sex;
   highlow x=species high=uclm low=lclm / group=sex;
   yaxis label='pcb' offsetmin=.05 offsetmax=.05;
   xaxis label='Species' offsetmin=.05 offsetmax=.05;
run;
and then get the profile plot:
The lines appear to be roughly parallel - so we expect that there may not be an interaction between the
two factors.
Looking at the profile plot - what is the effect of sex? What is the effect of species?
10.3.3 The statistical model
The statistical model is written as:

PCB = Sex Species Sex*Species

What does this statistical model tell us about the sources of variation in the observed data?
10.3.4 Fitting the model
There are two main procedures in SAS for tting ANOVA and regression models (collectively called lin-
ear models). FIrst is Proc GLM which performs the traditional sums-of-squares decomposition. Second is
Proc Mixed which uses restricted maximum likelihood (REML) to t models. In models with only xed
effects (e.g. those in this chapter), this gives the same results as sums-of-squares decompositions. In unbal-
anced data with additional random effects, the results from Proc Mixed may differ from that of Proc GLM
both are correct, and are just different ways to deal with the approximations necessary for unbalanced data.
Proc Mixed also has the advantage of being able to fit more complex models with more than one size of experimental
unit. There is no clear advantage to using one procedure or the other; it comes down to personal
preference. Both procedures will be demonstrated in this chapter; I personally prefer to use Proc Mixed.
Here is the code for Proc GLM:
ods graphics on;
proc glm data=pcb plots=all;
   title2 'Anova - balanced';
   class sex species;   /* class statement identifies factors */
   model pcb = sex species sex*species;
   lsmeans sex*species / cl stderr pdiff adjust=tukey lines;
   lsmeans sex         / cl stderr pdiff adjust=tukey lines;
   lsmeans species     / cl stderr pdiff adjust=tukey lines;
run;
ods graphics off;
Here is the code for Proc Mixed:
ods graphics on;
proc mixed data=pcb plots=all;
   title2 'Mixed - balanced';
   class sex species;   /* class statement identifies factors */
   model pcb = sex species sex*species / ddfm=kr;
   lsmeans sex*species / cl adjust=tukey;
   lsmeans sex         / cl adjust=tukey;
   lsmeans species     / cl adjust=tukey;
   ods output tests3  = MixedTest;     /* needed for the pdmix800 */
   ods output lsmeans = MixedLsmeans;
   ods output diffs   = MixedDiffs;
run;
ods graphics off;

/* Get a joined lines plot */
%include '../pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
In both procedures, the Class statement specifies the categorical factors for the model. Notice how you
specify the statistical model in the Model statement; it is very similar to the statistical model seen earlier.
The extra code at the end of Proc Mixed is to generate the joined-line plots that we've seen earlier based on
the output from the LSmeans statements (see below).
First is a Whole Model test which simply examines if there is evidence of an effect for any term in the
model. It is rarely useful. Here is the table for the overall ANOVA computed by GLM:
Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 200.8720833 40.1744167 19.02 <.0001
Error 18 38.0175000 2.1120833
Corrected Total 23 238.8895833
No such table is produced by Proc Mixed.
Of more interest are the individual Effect Tests:
Here is the table for the Effect Tests computed by Proc GLM:
Source DF Type III SS Mean Square F Value Pr > F
sex 1 138.7204167 138.7204167 65.68 <.0001
species 2 55.2608333 27.6304167 13.08 0.0003
sex*species 2 6.8908333 3.4454167 1.63 0.2233
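The F statistics and p-values in this table can be reproduced directly from the mean squares. Here is a quick check in Python (scipy is not part of these notes; the numbers below are simply copied from the tables above):

```python
from scipy.stats import f

mse = 2.1120833          # error mean square from the whole-model ANOVA table
df_error = 18

# Type III mean squares and numerator df, copied from the table above
effects = {"sex": (138.7204167, 1),
           "species": (27.6304167, 2),
           "sex*species": (3.4454167, 2)}

results = {}
for name, (mean_sq, df_num) in effects.items():
    F = mean_sq / mse                  # F = MS(effect) / MSE
    p = f.sf(F, df_num, df_error)      # upper-tail p-value
    results[name] = (F, p)
    print(f"{name:12s} F = {F:6.2f}  p = {p:.4f}")
```

The printed F and p-values agree with the SAS output (65.68, 13.08, and 1.63 with p-values <.0001, 0.0003, and 0.2233).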
Here is the table for the Effect Tests computed by Proc Mixed. Because the model only consists of fixed
effects, the results are identical to those from Proc GLM:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
sex 1 18 65.68 <.0001
species 2 18 13.08 0.0003
sex*species 2 18 1.63 0.2233
The above table breaks down the Model line of the whole-model test into components for every term in
the model. Some packages give you a choice of effect tests. For example, SAS will print out Type I, II, and III
tests - in balanced data these will always be the same and any can be used. In unbalanced data, these three
types of tests will give different results; ALWAYS use the Type III (also known as the marginal) tests.
As noted earlier, if you are using R, the default tests are what are known as Type I (incremental) tests and
can be misleading (i.e. testing the wrong hypothesis) in cases of unbalanced data.
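The Type I vs. Type III distinction can be made concrete with a small numeric sketch. The data below are hypothetical (not the PCB data); the point is only that in an unbalanced layout the sequential (Type I) sum of squares for a factor depends on where it enters the model, while the marginal (Type III) sum of squares adjusts for all other terms:

```python
import numpy as np

# Hypothetical, deliberately unbalanced 2x2 layout; sum-to-zero (effect) coding
y = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])
a = np.array([-1., -1., -1., -1., 1., 1., 1., 1., 1.])
b = np.array([-1., -1., -1., 1., -1., -1., 1., 1., 1.])

def rss(*columns):
    """Residual sum of squares after a least-squares fit on the given columns."""
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

one, ab = np.ones_like(y), a * b

# Type I (sequential) SS for A when A enters the model first
ssA_type1 = rss(one) - rss(one, a)

# Type III (marginal) SS for A: A adjusted for B and the interaction
ssA_type3 = rss(one, b, ab) - rss(one, b, ab, a)

print(ssA_type1, ssA_type3)   # the two differ because the layout is unbalanced
```

With balanced data the two quantities would be identical; here they differ, which is exactly why the choice of test type matters for unbalanced designs.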
Start the hypothesis testing with the most complicated terms and work towards simpler terms.
The first null hypothesis is
H: no interaction between sex and species in their effect on the mean PCB levels
A: some interaction between sex and species in their effect on the mean PCB levels.
The test statistic is F = 1.6313; the p-value (0.2233) is not very small. Hence there is no evidence of an
interaction in the effects of sex and species upon the mean PCB levels. This is not too surprising as the lines
are fairly parallel. What does this mean in terms of the original responses, i.e., what does no interaction say
about the differences in the mean PCB levels between the sexes or among the species?
If the interaction is not statistically significant, then the analysis continues along to examine the main
effects. These can be examined in any order.
Examining main effects - sex.
What are the null and alternate hypotheses? The ANOVA table gives F = 65.68 and the p-value < 0.0001
- very small. There is very strong evidence of a difference in the mean PCB levels between the two sexes.
Because there are only two levels of sex, no multiple comparison procedure is needed. We would like
estimates of the marginal means, i.e., estimates of the mean PCB levels for each sex averaged over species,
and, if possible, estimates of the mean difference in PCB levels between the two sexes averaged over species.
We obtain the marginal means, se, and 95% confidence intervals of these values.
Here are estimates of the marginal means (and standard errors) and the single pairwise difference (it is
not necessary to use Tukey's procedure - why?). These are requested using the LSmeans statement in the
GLM procedure. The LSMeans estimates are equal to the raw sample means; this will be true ONLY
in balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to
estimate marginal means.
sex  pcb LSMEAN   Standard Error  H0:LSMEAN=0 Pr > |t|  H0:LSMean1=LSMean2 Pr > |t|
f    13.8250000   0.4195318       <.0001                <.0001
m    18.6333333   0.4195318       <.0001

sex  pcb LSMEAN  95% Confidence Limits
f    13.825000   12.943596  14.706404
m    18.633333   17.751930  19.514737

Least Squares Means for Effect sex

i  j  Difference Between Means  Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1  2  -4.808333                 -6.054783  -3.561884
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of levels to where they intersect on either a solid
blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval
for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence
that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the
difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence
of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey Comparison Lines for Least Squares Means of sex
LS-means with the same letter are not significantly different.

     pcb LSMEAN  sex  LSMEAN Number
A    18.63333    m    2
B    13.82500    f    1
The output from Proc Mixed is not quite as extensive, but, in my opinion, is organized in a much more
logical fashion than in Proc GLM. First, here are the estimates of the marginal means:
Effect  sex  species  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
sex     f             13.5500   0.4944          14  27.41    <.0001    0.05   12.4897  14.6103
sex     m             18.9500   0.4944          14  38.33    <.0001    0.05   17.8897  20.0103
and then the pairwise differences:
sex  species  _sex  _species  Estimate  Standard Error  Adjustment  Adj P   Adj Low  Adj Upp
f             m               -5.4000   0.6991          Tukey       <.0001  -6.8995  -3.9005
The pdmix800 macro is used to produce the joined-line plots:
Effect=sex Method=Tukey(P<0.05) Set=2

Obs  sex  Estimate  Standard Error  Alpha  Lower    Upper    Letter Group
7    m    18.6333   0.4195          0.05   17.7519  19.5147  A
8    f    13.8250   0.4195          0.05   12.9436  14.7064  B
In multifactor designs, it may not be very useful to estimate the marginal means. Does it make sense
to take an average over the three species with equal weight given to each species? If one species is more
abundant than another species, perhaps it should be given a greater weight?
The estimated difference is -4.8 ppm (i.e., females have lower mean PCB levels on average than males)
with a se of 0.5933. A 95% confidence interval for the difference is also given.
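This confidence interval can be reproduced from the estimate and its standard error with the t distribution on the error degrees of freedom. A sketch in Python (using scipy; the numbers come from the output above):

```python
from scipy.stats import t

diff, se, df_error = -4.808333, 0.5933, 18     # from the LSmeans output above
tcrit = t.ppf(0.975, df_error)                 # two-sided 95% critical value
lo, hi = diff - tcrit * se, diff + tcrit * se
print(f"95% ci: ({lo:.3f}, {hi:.3f})")         # matches (-6.055, -3.562) above
```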
Main effects should ONLY be examined if the interaction effects are not statistically significant or if the
non-parallelism is not very large.
Examining main effects - species
What are the null and alternate hypotheses?
The ANOVA table gives F = 13.1 and a p-value of 0.0003 - very small. There is very strong evidence
of a difference in the mean PCB levels among the three species.
Once again, the test just tells us that there is evidence of a difference in the means, but doesn't tell us
which means appear to be different. First examine the estimates of the marginal means along with approximate
95% confidence intervals and standard errors for each marginal mean and do a multiple-comparison
procedure as before.
Here are estimates of the marginal means (and standard errors) and the pairwise differences (here Tukey's
procedure is needed - why?). These are requested using the LSmeans statement in the
GLM procedure. The LSMeans estimates are equal to the raw sample means; this will be true ONLY
in balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to
estimate marginal means.
species  pcb LSMEAN   Standard Error  Pr > |t|  LSMEAN Number
sp1      18.1375000   0.5138194       <.0001    1
sp2      14.4250000   0.5138194       <.0001    2
sp3      16.1250000   0.5138194       <.0001    3

species  pcb LSMEAN  95% Confidence Limits
sp1      18.137500   17.058005  19.216995
sp2      14.425000   13.345505  15.504495
sp3      16.125000   15.045505  17.204495
Least Squares Means for effect species
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: pcb
i/j 1 2 3
1 0.0002 0.0323
2 0.0002 0.0756
3 0.0323 0.0756
Least Squares Means for Effect species

i  j  Difference Between Means  Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1  2  3.712500                  1.857969   5.567031
1  3  2.012500                  0.157969   3.867031
2  3  -1.700000                 -3.554531  0.154531
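These simultaneous intervals can be reproduced from the MSE and the studentized range distribution. A sketch in Python (using scipy's studentized_range, available in scipy 1.7+; n = 8 fish per species since each species has 4 fish of each sex):

```python
from math import sqrt
from scipy.stats import studentized_range

mse, df_error = 2.1120833, 18      # error mean square and df from the ANOVA table
n_per_species, k = 8, 3            # 8 fish per species, 3 species compared

se_diff = sqrt(2 * mse / n_per_species)        # se of a difference of two means
q = studentized_range.ppf(0.95, k, df_error)   # Tukey critical value for 3 means
half_width = q / sqrt(2) * se_diff

diff_sp1_sp2 = 18.1375 - 14.4250               # sp1 vs sp2 from the LSmeans
lo, hi = diff_sp1_sp2 - half_width, diff_sp1_sp2 + half_width
print(round(lo, 4), round(hi, 4))              # close to (1.857969, 5.567031)
```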
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of levels to where they intersect on either a solid
blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval
for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence
that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the
difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence
of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey Comparison Lines for Least Squares Means of species
LS-means with the same letter are not significantly different.

     pcb LSMEAN  species  LSMEAN Number
A    18.1375     sp1      1
B    16.1250     sp3      3
B
B    14.4250     sp2      2
The output from Proc Mixed is not quite as extensive, but, in my opinion, is organized in a much more
logical fashion than in Proc GLM. First, here are the estimates of the marginal means:
Effect   sex  species  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
species       sp1      18.6125   0.6422          14  28.98    <.0001    0.05   17.2351  19.9899
species       sp2      14.4250   0.5244          14  27.51    <.0001    0.05   13.3004  15.5496
species       sp3      15.7125   0.6422          14  24.47    <.0001    0.05   14.3351  17.0899
and then the pairwise differences:
sex  species  _sex  _species  Estimate  Standard Error  Adjustment    Adj P   Adj Low  Adj Upp
     sp1            sp2       4.1875    0.8291          Tukey-Kramer  0.0005  2.0176   6.3574
     sp1            sp3       2.9000    0.9082          Tukey-Kramer  0.0168  0.5230   5.2770
     sp2            sp3       -1.2875   0.8291          Tukey-Kramer  0.2976  -3.4574  0.8824
The pdmix800 macro is used to produce the joined-line plots:
Effect=species Method=Tukey(P<0.05) Set=3

Obs  sex  species  Estimate  Standard Error  Alpha  Lower    Upper    Letter Group
9         sp1      18.1375   0.5138          0.05   17.0580  19.2170  A
10        sp3      16.1250   0.5138          0.05   15.0455  17.2045  B
11        sp2      14.4250   0.5138          0.05   13.3455  15.5045  B
What does this table appear to show us? Again, is it sensible to average over the two sexes? As most
species have an equal sex ratio, this is likely a sensible thing to do.
How about the estimate of the differences in the mean PCB levels among the species?
Don't forget to examine the residual and other diagnostic plots (often produced automatically by packages).
Here is the diagnostic plot from Proc GLM:
Here is the diagnostic plot from Proc Mixed:
There is no evidence of any gross problems in the analysis. There is some evidence that the residuals are
not equally variable, but the difference in variance is not large.
10.4 Power and sample size for two-factor CRD
As before, the first thing for a power analysis is to estimate the biologically important difference to detect.
THIS IS HARD as it is sometimes not clear what size of difference will make an impact. You will need an
effect size for each factor and possibly for the interaction (if that is of interest).
I find it easiest to start by making a reasonable guess as to the MEAN for each treatment, i.e. what is the
estimated mean for each combination of factor levels. These means determine the effect sizes for the various
main effects and interactions in the model. Then look at the estimates of the main effects and adjust the
individual means until the effect sizes match those of biological importance above. This is readily done in a
spreadsheet program.
Next you will need an estimate of the residual STANDARD DEVIATION (not the se) which represents
the (common) variation among observations within each treatment group.
These two pieces of information allow various packages to estimate the sample sizes to detect the various
effect sizes.
For example, consider a two-factor design with factor 1 (sex) at two levels (male and female) and factor
2 (species) at three levels, where the PCB concentration is measured in fish selected from each combination of
sex and species.
From past studies we believe that a difference of about 4 ppm between the sexes and about 6 ppm among
species is biologically important. There may be a slight interaction between the two factors on the effects.
Here is my first guess as to the appropriate means:
Species
Sex sp1 sp2 sp3
M 14 16 20
F 10 12 16
The actual values are not that important; what is important are the effect sizes. Here the main effect of
sex is found as

[(14 - 10) + (16 - 12) + (20 - 16)] / 3 = 4

i.e. the difference between the means of the two sexes averaged over the three species.
The main effect of the third vs. the first species is:

[(20 - 14) + (16 - 10)] / 2 = 6

i.e. the difference in the means averaged over the two sexes.
This combination of means gives the targeted main effects for both factors.
There is currently no interaction in the effects of the factors because the effect of sex (a difference of 4
units) is the same for all species. We can add a little interaction by jittering the means slightly, but keeping
the main effects the same:
Species
Sex sp1 sp2 sp3
M 15 15 20
F 9 13 16
Now the main effect of sex is found as

[(15 - 9) + (15 - 13) + (20 - 16)] / 3 = 4

i.e. no change in the main effect of sex.
Similarly, the revised main effect of the third vs. the first species (i.e. the largest difference in the means
among species) is:

[(20 - 15) + (16 - 9)] / 2 = 6

i.e. no change in the main effect of species.
However, now there is a slight interaction as the effect of sex is not consistent across the species.
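The arithmetic above is easy to check with a few lines of code. This sketch recomputes the main effects and the sex differences from the jittered table of hypothetical planning means:

```python
# Jittered cell means from the table above (hypothetical planning values)
means = {("M", "sp1"): 15, ("M", "sp2"): 15, ("M", "sp3"): 20,
         ("F", "sp1"):  9, ("F", "sp2"): 13, ("F", "sp3"): 16}
species = ["sp1", "sp2", "sp3"]

# main effect of sex: male-female difference averaged over the three species
sex_effect = sum(means[("M", s)] - means[("F", s)] for s in species) / 3

# main effect of sp3 vs sp1: difference averaged over the two sexes
sp_effect = sum(means[(g, "sp3")] - means[(g, "sp1")] for g in "MF") / 2

# interaction shows up as sex differences that vary across species
sex_diffs = [means[("M", s)] - means[("F", s)] for s in species]

print(sex_effect, sp_effect, sex_diffs)   # 4.0 6.0 [6, 2, 4]
```

The main effects are unchanged (4 and 6), while the varying sex differences (6, 2, 4) are the injected interaction.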
We also believe that the standard deviation is 5.
Power computations for a two-factor CRD are done using Proc GLMpower.³ This procedure requires you
to read in the means for the factor level combinations. This is done in the usual fashion.
data means;
input sex $ species $ pcb;
datalines;
m s1 15
m s2 15
m s3 20
f s1 9
f s2 13
f s3 16
;;;;
Proc GLMpower is then called; most of the statements are obvious:

ods graphics on;
proc glmpower data=means;
   title2 'GLMPower';
   class sex species;
   model pcb = sex species sex*species;
   power
      stddev = 5
      alpha  = .05
      ntotal = .
      power  = .80;
   plot y=power yopts=(ref=.80 crossref=yes) min=.05 max=.95;
   footnote 'NOTE: You require different sample sizes for each effect';
run;
ods graphics off;
This gives the following output:

³ Proc Power cannot be used for two-factor or more complex designs. There is a third, more flexible way, using the methods
developed by Stroup that are illustrated in the SAS code at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Fixed Scenario Elements
Dependent Variable pcb
Alpha 0.05
Error Standard Deviation 5
Nominal Power 0.8
NOTE: You require different sample sizes for each effect
Computed N Total
Index Source Test DF Error DF Actual Power N Total
1 sex 1 48 0.821 54
2 species 2 42 0.856 48
3 sex*species 2 360 0.802 366
NOTE: You require different sample sizes for each effect
and a plot of the tradeoff between power and sample size:
You would need a TOTAL sample size of about 54 (i.e. 9 per treatment group) to detect a 4 ppm difference
in the mean concentration between the two sexes with an 80% power. A similar TOTAL sample
size of about 48 (i.e. 8 per treatment group) would be needed to detect a 6 ppm difference in the mean
concentration among the three species with an 80% power. We are fortunate that the sample sizes for
the two effects agree so well; this is often not the case and you would be forced to use the larger sample
size. For example, the total sample size to detect the effects of Factor A may be 60, while the total sample
size to detect the effects of Factor B may be 90. If possible, use the larger sample size to ensure adequate
power for all factors.
The sample size requirements to detect the interaction are very large (over 350 in total). It may not be
feasible to conduct an experiment large enough to detect this small interaction.
It is not necessary to have the sample sizes equal in all treatment groups, but it can be shown that the
power of the test is maximized when the sample sizes are equal for all treatment combinations.
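For readers without access to Proc GLMpower, the power for a main effect can be approximated from first principles with the noncentral F distribution. This Python sketch (using scipy) recomputes the power for the sex effect at the total sample size of 54; the noncentrality formula is the standard one for fixed effects, so it should track the GLMpower output closely:

```python
import numpy as np
from scipy.stats import f, ncf

# planning cell means from the jittered table; rows = sex (M, F), cols = species
mu = np.array([[15., 15., 20.],
               [ 9., 13., 16.]])
sigma, alpha = 5.0, 0.05
N = 54                                 # total sample size (9 fish per cell)
n_cell = N / mu.size

# noncentrality for the sex main effect:
#   lambda = (obs per sex) * sum of squared row-mean deviations / sigma^2
grand = mu.mean()
lam = n_cell * mu.shape[1] * np.sum((mu.mean(axis=1) - grand) ** 2) / sigma**2

df1, df2 = 1, N - mu.size              # error df = N - number of cells
fcrit = f.ppf(1 - alpha, df1, df2)     # 5% critical value under the null
power = 1 - ncf.cdf(fcrit, df1, df2, lam)
print(round(float(power), 3))          # close to the 0.821 reported above
```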
10.5 Unbalanced data - Introduction
Unbalanced data can take many forms. Some of the forms are easy to analyze; some are difficult.
Here are some illustrations of the common replication patterns that you will run into. In all cases there
are 2 levels of Factor A and three levels of Factor B and an x represents a replicate.
Equal replications per cell
Factor B
b1 b2 b3
+-----+-----+-----+
Factor A a1 | xx | xx | xx |
+-----+-----+-----+
a2 | xx | xx | xx |
+-----+-----+-----+
This is the easiest to deal with and two examples were given earlier in the notes.
Unequal replications per cell, but replicates in every cell
Factor B
b1 b2 b3
+-----+-----+-----+
Factor A a1 | xxx | xx | xxx |
+-----+-----+-----+
a2 | xx | xx | xxxx|
+-----+-----+-----+
In this case, all cells have some data, but the number of replicates differs among cells and every cell has
2 or more replicates. The multiple replicates within a cell are needed to estimate the MSE row in the
ANOVA table. An example of an analysis of this type of data will be given below. Because each cell
has replicates, it is possible to check that the variation is roughly equal in all treatment groups. This
type of unbalance can be analyzed easily if the computer package has been programmed correctly.
BEWARE: some packages (e.g. Excel) will give WRONG answers! If you use R, you will find
that R fits what are known as incremental sums-of-squares which may not test hypotheses that
are of interest - see the examples below for more details. A key assumption is that missing values occur
completely at random (MCAR). This should be assessed before analyzing any unbalanced design
where the original design was balanced, but some experimental units were lost.
Unequal replications per cell, with some cells having only a single observation
Factor B
b1 b2 b3
+-----+-----+-----+
Factor A a1 | x | xx | xxx |
+-----+-----+-----+
a2 | xx | x | xx |
+-----+-----+-----+
In this case, all cells have some data, but the number of replicates differs among cells and some cells
have only a single observation. Theoretically, there is no difference in the analysis of this experiment
from the previous example. However, in this experiment, you must assume that the variability in the
cells with replicates is an accurate representation of that in cells with only a single observation.
One observation per cell
Factor B
b1 b2 b3
+-----+-----+-----+
Factor A a1 | x | x | x |
+-----+-----+-----+
a2 | x | x | x |
+-----+-----+-----+
If you only have a single observation per cell, it is impossible to test for interaction effects. [Technically,
the design has insufficient degrees of freedom for error.] We won't discuss how to analyze
this type of data in this course, but basically you MUST ASSUME that no interaction exists, and fit a
model without any interaction terms. Only main effects can be tested.
One or more cells completely empty
Factor B
b1 b2 b3
+-----+-----+-----+
Factor A a1 | xx | xx | |
+-----+-----+-----+
a2 | xx | xx | xx |
+-----+-----+-----+
SEEK HELP! Most computer packages will give you completely WRONG results! This is a tough
problem. Now having said this, one simple solution to this problem (if it only occurs in one cell of
the design) is to drop the column and analyze the remaining data as a two-factor design with each
factor at two levels. In the above example, level b3 would be dropped from the experiment. However,
if there are many missing cells, you may find that you are dropping most of your data!
The analysis of an unbalanced design with all cells having at least one observation and some cells having
at least two replicates is discussed in:

Shaw, R.G. and Mitchell-Olds, T. (1993). ANOVA for unbalanced data: an overview.
Ecology, 74, 1638-1645. http://dx.doi.org/10.2307/1939922
10.6 Example - Stream residence time - Unbalanced data in a CRD
This example is taken from:
Trouton, Nicole (2004). An investigation into the factors influencing escapement estimation for
chinook salmon (Oncorhynchus tshawytscha) on the Lower Shuswap River, British Columbia.
M.Sc. Thesis, Simon Fraser University.
Spawning residence time is an important value in managing Pacific salmon. A standard procedure to
estimate the total number of fish that have returned to spawn (the escapement) is to estimate the total number
of fish-days spent by salmon in a stream from aerial flights and then divide this number by the spawning
residence time. This procedure is called the Area-under-the-curve (AUC) method of estimating escapement.
Trouton captured and inserted radio transmitters into a sample of chinook salmon on the Lower Shuswap
River in British Columbia, Canada. Then she rowed along the stream on a daily basis with a radio receiver
to see how long each fish spent on the spawning grounds.
Approximately 60, 70, and 150 fish were radio tagged in 2000, 2001, and 2002 respectively, approximately
equally split between males and females in each year. Not all fish survived the radio insertion and
not all fish spawned in the reaches of the river surveyed by the radio receiver, so the number of data points
actually measured each year is less than this number.
The raw data are available in a data file called residence.csv available at the Sample Program Library
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported
into SAS in the usual way:

data residence;
   infile 'residence.csv' dlm=',' dsd missover firstobs=2;
   length trt $8.;
   input time sex $ year $;
   substr(trt,1,1) = sex;
   substr(trt,2,1) = '-';
   substr(trt,3,4) = year;
run;
Part of the raw data are shown below:
Obs trt time sex year
1 f-2000 4.5 f 2000
2 f-2000 5.0 f 2000
3 f-2001 4.0 f 2001
4 f-2001 6.0 f 2001
5 f-2001 1.0 f 2001
6 f-2001 4.0 f 2001
7 f-2001 5.0 f 2001
8 f-2001 6.0 f 2001
9 f-2001 4.0 f 2001
10 f-2001 3.0 f 2001
Design issues
What are the factors in this experiment? What are their levels? What is the response variable?
Is this an experiment or an analytical survey? What is the role of randomization in this study?
What is the treatment structure? Is it a factorial treatment structure?
What are the experimental units? What are the observational units? Are they the same? Why?
The design is unbalanced, in that there are unequal numbers of males and females in each year, and
unequal numbers of fish radio tagged in each year. How did the unbalance occur? Was this a planned feature
of the survey?
Not all fish tagged were measured. Are the missing values missing completely at random (MCAR),
missing at random (MAR), or informative? What important assumption needs to be made about the missing
values in order that these results may be extrapolated to the relevant populations?
10.6.1 Preliminary summary statistics
Examine the table of simple summary statistics. We compute the means and standard deviations for each
treatment group using Proc Tabulate in the usual way:
proc tabulate data=residence;   /* proc tabulate is not for the faint of heart */
   title2 'Summary table of means, std devs';
   class sex year;
   var time;
   table sex*year, time*(n*f=5.0 mean*f=5.2 std*f=5.2 stderr*f=5.2) /rts=15;
run;
giving
                       time
sex  year      N   Mean    Std  StdErr
f    2000      2   4.75   0.35    0.25
     2001      9   4.11   1.54    0.51
     2002      6   3.00   0.00    0.00
m    2000     10   6.15   1.90    0.60
     2001     15   8.10   2.58    0.67
     2002     26   6.29   1.33    0.26
The unbalance within years and across years is quite clear. It is interesting that all six observations on
females in 2002 had exactly the same value, leading to a standard deviation of 0 for that particular year.
The standard deviations are all approximately equal - it is quite difficult to tell very much for females in
2000 with only two observations.
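The weighting issue that arises for marginal means in unbalanced data can already be seen in these female cell means. The LSmean for females weights the three years equally, while the raw female mean weights each fish equally (a small Python sketch; the cell means above are rounded, so the raw mean is approximate):

```python
# female cell means and counts, read from the summary table above (means rounded)
f_means  = {"2000": 4.75, "2001": 4.11, "2002": 3.00}
f_counts = {"2000": 2,    "2001": 9,    "2002": 6}

# LSmean: unweighted average of the three yearly means
lsmean_f = sum(f_means.values()) / len(f_means)

# raw mean: each fish weighted equally, so 2001 and 2002 dominate
raw_f = (sum(f_means[y] * f_counts[y] for y in f_means)
         / sum(f_counts.values()))

print(round(lsmean_f, 3), round(raw_f, 3))   # 3.953 vs 3.794
```

The two estimates differ because the data are unbalanced; in the earlier balanced PCB example they would coincide.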
Next look at the side-by-side dot plots, which we create as before:

proc sgplot data=residence;
   title2 'Side-by-side dot plots';
   yaxis label='Residence Time' offsetmin=.05 offsetmax=.05;
   xaxis label='Sex and Year' offsetmin=.05 offsetmax=.05;
   scatter x=trt y=time / markerattrs=(symbol=circlefilled);
run;
which gives
There is no evidence of any outliers or unusual points.
The capture of fish took place over an extended period of time and there was little evidence that schooling
or any other non-independent behavior occurred among the fish.
10.6.2 The Statistical Model
This appears to be a simple two-factor completely randomized design with year and sex being the two factors.
The model for this experiment will then be:
ResidenceTime = Year + Sex + Year*Sex
It is clear that Sex is a fixed effect. It is less clear that Year is a fixed effect. Do you want to restrict inference only to these three particular years in the survey, or do you wish to extrapolate to all possible years? If the latter, then you run into the problem that the years selected in the study were not randomly selected from all possible years of interest. In cases such as these, many scientists elect to treat Year as a fixed effect, but the consequences of this need to be understood.4 Notice how the Year variable was previously defined as a categorical factor when the data was imported into R.

4 The more serious consequence is that standard errors for marginal means (e.g. the mean residence time for males or females) will be underestimated because the extra variation induced by the potential selection of a new set of years (if Year were a random effect) has not been accounted for.

Next, construct profile plots:
We can find the sample means and confidence limits for each population mean quite simply because this is a CRD.

proc sort data=residence;
   by sex year;
run;

proc means data=residence noprint;   /* find simple summary statistics */
   by sex year;
   var time;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series  y=mean x=year / group=sex;
   highlow x=year high=uclm low=lclm / group=sex;
   yaxis label='Residence time' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;

and then get the profile plot:
Note that all observations in 2002 for females had the same value, giving a standard deviation of 0 and a confidence interval of width 0; obviously this is silly.
There is no evidence of any interaction between the two factors upon the mean response because the lines are parallel.
10.6.3 Hypothesis testing and estimation
There are two main procedures in SAS for fitting ANOVA and regression models (collectively called linear models). First is Proc GLM, which performs the traditional sums-of-squares decomposition. Second is Proc Mixed, which uses restricted maximum likelihood (REML) to fit models. In models with only fixed effects (e.g. those in this chapter), this gives the same results as sums-of-squares decompositions. In unbalanced data with additional random effects, the results from Proc Mixed may differ from those of Proc GLM; both are correct, and are just different ways to deal with the approximations necessary for unbalanced data. Proc Mixed also has the advantage of being able to fit more complex models with more than one size of experimental unit. There is no clear advantage to using one procedure or the other; it comes down to personal preference. Both procedures will be demonstrated in this chapter; I personally prefer to use Proc Mixed.
Here is the code for Proc GLM:
ods graphics on;
proc glm data=residence plots=all;
   title2 'analysis';
   class sex year;
   model time = sex year sex*year;
   lsmeans sex*year / cl pdiff adjust=tukey lines;
   lsmeans sex      / cl pdiff adjust=tukey lines;
   lsmeans year     / cl pdiff adjust=tukey lines;
run;
ods graphics off;
Here is the code for Proc Mixed:
ods graphics on;
proc mixed data=residence plots=all;
   title2 'Mixed model balanced';
   class sex year;                   /* class statement identifies factors */
   model time = sex year sex*year / ddfm=kr;
   lsmeans sex*year / cl adjust=tukey;
   lsmeans sex      / cl adjust=tukey;
   lsmeans year     / cl adjust=tukey;
   ods output tests3 =MixedTest;     /* needed for the pdmix800 */
   ods output lsmeans=MixedLsmeans;
   ods output diffs  =MixedDiffs;
run;
ods graphics off;

/* Get a joined lines plot */
%include '../pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
In both procedures, the Class statement specifies the categorical factors for the model. Notice how you specify the statistical model in the Model statement; it is very similar to the statistical model seen earlier. The extra code at the end of Proc Mixed is to generate the joined-line plots that we've seen earlier based on the output from the LSmeans statements (see below).
The Whole Model test is not very interesting, as it tests if there is an effect of Sex, or Year, or an interaction upon the mean response. It is rarely useful. Here is the table for the overall ANOVA computed by GLM:
Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 157.6422197 31.5284439 10.36 <.0001
Error 62 188.7254274 3.0439585
Corrected Total 67 346.3676471
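The arithmetic behind this table is worth checking once by hand: each mean square is its sum of squares divided by its degrees of freedom, the Model and Error sums of squares add to the Corrected Total, and the whole-model F is the ratio of the two mean squares. A quick check (Python is used only as a calculator here; the chapter's own code is SAS):

```python
# Values copied from the Proc GLM whole-model ANOVA table above
ss_model, df_model = 157.6422197, 5
ss_error, df_error = 188.7254274, 62

ms_model = ss_model / df_model   # mean square = SS / df
ms_error = ss_error / df_error   # the MSE used throughout this example

print(round(ss_model + ss_error, 7))   # corrected total SS, 346.3676471
print(round(ms_error, 7))              # 3.0439585
print(round(ms_model / ms_error, 2))   # whole-model F, 10.36
```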
No such table is produced by Proc Mixed.
The Effect Test section of the output is much more informative. Here is the table for the Effect Tests
computed by Proc GLM:
Source DF Type III SS Mean Square F Value Pr > F
sex 1 76.60591323 76.60591323 25.17 <.0001
year 2 22.31105774 11.15552887 3.66 0.0313
sex*year 2 8.65181116 4.32590558 1.42 0.2492
Here is the table for the Effect Tests computed by Proc Mixed. Because the model consists only of fixed effects, the results are identical to those from Proc GLM:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
sex 1 62 25.17 <.0001
year 2 62 3.66 0.0313
sex*year 2 62 1.42 0.2492
The Effects Test table breaks down the Model line of the whole model test into the components for every term in the model. Some packages give you a choice of effect tests. For example, SAS will print out Type I, II, and III tests; in balanced data these will always be the same and any can be used. In unbalanced data, these three types of tests will have different results; ALWAYS use the Type III (also known as the marginal) tests. As noted earlier, if you are using R, the default method is what is known as Type I (incremental) tests and can be misleading (i.e. testing the wrong hypothesis) in cases of unbalanced data.
As in the balanced case, start with tests for interaction between the two factors upon the mean response.
The null hypothesis is:
H: no interaction between the two factors upon the mean response.
A: some interaction between the two factors upon the mean response.
The test statistic is F = 1.42 with a p-value of 0.24. There is no evidence of an interaction between the two factors upon the mean response. This is not surprising given the apparent parallelism of the lines in the factor profile plots.
Because there was no evidence of an interaction, it is sensible to examine main effects of each factor.
In both cases, the null hypothesis is no effect of the factor upon the mean response when averaged over the
levels of the other factor. Note that in multi-factor designs, the hypotheses of main effects are always in
terms of averages over the other factors in the model. This is one reason why it is important to test for a
possible interaction between factors before testing for main effects.
There is strong evidence of a Sex effect upon the mean residence time (F = 25.2, p < .0001), and there
is evidence of a Year effect upon the mean residence time (F = 3.7, p = .031).
It is of interest to estimate the various effects as well. First start with an examination of the sex effect.
Here are estimates of the marginal means (and standard errors) and the single pairwise difference (it is not necessary to use Tukey's procedure - why?). These are requested using the LSmeans statement in the GLM procedure. The LSMeans estimates are equal to the raw sample means; this will be true ONLY in balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate marginal means.
sex   time LSMEAN    H0:LSMean1=LSMean2 Pr > |t|
f     3.95370370     <.0001
m     6.84615385

sex   time LSMEAN    95% Confidence Limits
f     3.953704       2.928447    4.978960
m     6.846154       6.319631    7.372677

Least Squares Means for Effect sex
i  j  Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1  2  -2.892450                  -4.045003   -1.739898
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of means to where they intersect on either a solid blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey-Kramer Comparison Lines for Least Squares Means of sex
LS-means with the same letter are not significantly different.
time LSMEAN sex LSMEAN Number
A 6.8461538 m 2
B 3.9537037 f 1
The output from Proc Mixed is not quite as extensive but, in my opinion, is organized in a much more logical fashion than that of Proc GLM. First, here are the estimates of the marginal means:
Effect  sex  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
sex     f    3.9537    0.5129          62   7.71    <.0001    0.05   2.9284   4.9790
sex     m    6.8462    0.2634          62  25.99    <.0001    0.05   6.3196   7.3727
and then the pairwise differences:
sex  _sex  Estimate  Standard Error  Adjustment     Adj P    Adj Low   Adj Upp
f    m     -2.8925   0.5766          Tukey-Kramer   <.0001   -4.0450   -1.7399
The pdmix800 macro is used to produce the joined-line plots:
Effect=sex Method=Tukey-Kramer(P<0.05) Set=2
Obs sex year Estimate Standard Error Alpha Lower Upper Letter Group
7 m 6.8462 0.2634 0.05 6.3196 7.3727 A
8 f 3.9537 0.5129 0.05 2.9284 4.9790 B
The Least Squares Mean for females is 3.95 days, which differs from the raw mean of 3.79 days. This is an artifact of the unbalance in the experiment.
If you look at the raw means for females presented earlier, you find that they were:

Year     n    Raw data                                      Mean
2000     2    4.5, 5.0                                      4.75
2001     9    4.0, 6.0, 1.0, 4.0, 5.0, 6.0, 4.0, 3.0, 4.0   4.11
2002     6    3.0, 3.0, 3.0, 3.0, 3.0, 3.0                  3.00
Overall  17                                                 3.79
The overall raw mean of 3.79 days is found by adding up all of the 2 + 9 + 6 = 17 observations and taking a simple average. The LSMEAN of 3.95 days is found as the average of the averages, i.e. 3.95 = (4.75 + 4.11 + 3.00)/3.
Which is better? There is no simple way to determine which, if any, of these two estimates is better. The simple raw mean suffers from the fact that a year with more observations (e.g. 2001) is given more weight in determining the average than a year with fewer observations (e.g. 2000). The LSMEANS give each year's data equal weight.
If the different sample sizes are simple artifacts of the experiment and have no intrinsic meaning, then the LSMEANS may be preferable. If, however, the different sample sizes are related to something that is of biological importance (for example, suppose that the spawning population in 2001 was almost five times as large as that in 2000), then the raw means may be more suitable. Many scientists unthinkingly use the LSMEANS simply because that is what the computer package automatically spits out.
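The two kinds of averages are easy to reproduce from the raw female data above. This sketch (Python is used only as a calculator; the chapter's own code is SAS) shows that the raw mean weights every fish equally while the LSMean weights every year equally:

```python
from statistics import mean

# Residence times (days) for females, copied from the table above
females = {
    2000: [4.5, 5.0],
    2001: [4.0, 6.0, 1.0, 4.0, 5.0, 6.0, 4.0, 3.0, 4.0],
    2002: [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
}

all_obs = [t for times in females.values() for t in times]
raw_mean = mean(all_obs)                                   # each fish weighted equally
lsmean = mean(mean(times) for times in females.values())   # each year weighted equally

print(round(raw_mean, 2))   # 3.79
print(round(lsmean, 2))     # 3.95
```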
The estimated difference in mean residence time (averaged across all three years) is about 2.89 days with
a standard error of .58 days.
Similarly, one can look at the Year effects in more detail.
Here are estimates of the marginal means (and standard errors) and the pairwise differences. These are requested using the LSmeans statement in the GLM procedure. The LSMeans estimates are equal to the raw sample means; this will be true ONLY in balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate marginal means.
year  time LSMEAN   LSMEAN Number
2000  5.45000000    1
2001  6.10555556    2
2002  4.64423077    3

year  time LSMEAN   95% Confidence Limits
2000  5.450000      4.099261   6.800739
2001  6.105556      5.370306   6.840805
2002  4.644231      3.854446   5.434015
Least Squares Means for Effect year
i j Difference Between Means Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1 2 -0.655556 -2.502883 1.191772
1 3 0.805769 -1.073758 2.685297
2 3 1.461325 0.165154 2.757496
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of means to where they intersect on either a solid blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey-Kramer Comparison Lines for Least Squares Means of year
LS-means with the same letter are not significantly different.

      time LSMEAN   year   LSMEAN Number
A     6.1055556     2001   2
A
B A   5.4500000     2000   1
B
B     4.6442308     2002   3
The output from Proc Mixed is not quite as extensive but, in my opinion, is organized in a much more logical fashion than that of Proc GLM. First, here are the estimates of the marginal means:
Effect  year  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
year    2000  5.4500    0.6757          62   8.07    <.0001    0.05   4.0993   6.8007
year    2001  6.1056    0.3678          62  16.60    <.0001    0.05   5.3703   6.8408
year    2002  4.6442    0.3951          62  11.75    <.0001    0.05   3.8544   5.4340
and then the pairwise differences:
year  _year  Estimate  Standard Error  Adjustment     Adj P    Adj Low   Adj Upp
2000  2001   -0.6556   0.7693          Tukey-Kramer   0.6722   -2.5029   1.1918
2000  2002   0.8058    0.7827          Tukey-Kramer   0.5613   -1.0738   2.6853
2001  2002   1.4613    0.5398          Tukey-Kramer   0.0235   0.1652    2.7575
The pdmix800 macro is used to produce the joined-line plots:
Effect=year Method=Tukey-Kramer(P<0.05) Set=3
Obs sex year Estimate Standard Error Alpha Lower Upper Letter Group
9 2001 6.1056 0.3678 0.05 5.3703 6.8408 A
10 2000 5.4500 0.6757 0.05 4.0993 6.8007 AB
11 2002 4.6442 0.3951 0.05 3.8544 5.4340 B
Again notice that the LSMEANS differ from the raw means because of the imbalance in the data. Here
it makes sense to weight the sexes equally because of the (presumed) 50:50 sex ratio within each year.
Estimates of differences in the mean residence time between pairs of years can be readily determined.
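The standard errors attached to the year LSMeans can also be reproduced from the MSE and the cell sample sizes. Each year LSMean is the unweighted average of that year's female and male cell means, so its variance is MSE/4 × (1/n_f + 1/n_m). A quick check (Python is used only as a calculator here; the MSE and sample sizes come from the tables above):

```python
from math import sqrt

mse = 3.0439585  # error mean square from the ANOVA table

# cell sample sizes (females, males) for each year, from the summary table
n = {2000: (2, 10), 2001: (9, 15), 2002: (6, 26)}

# Var(year LSMean) = Var((ybar_f + ybar_m)/2) = MSE/4 * (1/n_f + 1/n_m)
se = {year: sqrt(mse / 4 * (1 / nf + 1 / nm)) for year, (nf, nm) in n.items()}

for year in sorted(se):
    print(year, round(se[year], 4))   # 0.6757, 0.3678, 0.3951 as in the Proc Mixed output
```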
Don't forget to examine the residual and other diagnostic plots (often produced automatically by packages). Here is the diagnostic plot from Proc GLM:
Here is the diagnostic plot from Proc Mixed:
There is no evidence of any gross problems in the analysis. There is some evidence that the residuals are
not equally variable, but the difference in variance is not large.
10.6.4 Power and sample size
As before, an estimate of the standard deviation is needed. This can be chosen as somewhere in the range of the standard deviations presented in the summary table, or the √MSE = RMSE value from the ANOVA table can be used. In this case, the estimated standard deviation is √3.044 = 1.74.
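As a quick check of the arithmetic (Python used only as a calculator), together with a hypothetical illustration of how such a planning value is used: the standard error of a difference between two group means with n observations per group is sd·√(2/n). The n = 10 below is an invented value for illustration only, not part of the example:

```python
from math import sqrt

mse = 3.0439585          # error mean square from the ANOVA table
sd = sqrt(mse)           # root MSE, the planning value for the standard deviation
print(round(sd, 2))      # 1.74

# Hypothetical planning step: with n = 10 fish per group (an invented value),
# the standard error of a difference between two group means would be
se_diff = sd * sqrt(2 / 10)
print(round(se_diff, 2))   # 0.78
```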
10.7 Example - Energy consumption in pocket mice - Unbalanced
data in a CRD
Here is an example showing some of the problems that you may run into when analyzing unbalanced data.
This data was taken from:
French, A.R. 1976.
Selection of high temperature for hibernation by the pocket mouse: Ecological advantages and
energetic consequences.
Ecology, 57, 185-191
http://dx.doi.org/10.2307/1936410
He collected the following data on the energy utilization of the pocket mouse (Perognathus longimembris)
during hibernation at different temperatures:
        Restricted Food        Ad libitum food
        8°C       18°C         8°C       18°C
        62.69     72.60        95.73     101.19
        54.07     70.97        63.95     76.88
        65.73     74.32        144.30    74.08
        62.98     53.02        144.30    81.40
                  46.22                  66.58
                  59.10                  84.38
                  61.79                  118.95
                  61.89                  118.95
                  62.50

All readings are in kcal/g.
The raw data is available in a datafile called mouse.csv available at the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data energy;
   infile 'mouse.csv' dlm=',' dsd missover firstobs=2;
   length diet $10 temp $3 trt $15;
   input temp diet energy;
   substr(trt,1,3) = temp;
   substr(trt,4,1) = '-';
   substr(trt,5  ) = diet;
run;
Part of the raw data are shown below:
Obs diet temp trt energy
1 ad 18C 18C-ad 101.19
2 ad 18C 18C-ad 76.88
3 ad 18C 18C-ad 74.08
4 ad 18C 18C-ad 81.40
5 ad 18C 18C-ad 66.58
6 ad 18C 18C-ad 84.38
Obs diet temp trt energy
7 ad 18C 18C-ad 118.95
8 ad 18C 18C-ad 118.95
9 res 18C 18C-res 72.60
10 res 18C 18C-res 70.97
10.7.1 Design issues
What are the factors in this experiment? Their levels? What is the response variable? Is this an experiment or an observational study? If the latter, what is the role of randomization in this study? What is the treatment structure in this experiment? What is the experimental unit structure? What are the experimental and observation units? Why is the design unbalanced? What unbalanced the design, or did it occur by chance? What is the randomization structure?
Are the factors to be considered fixed effects? Hence, does this experiment appear to satisfy the requirements for a two-factor fixed-effects CRD?
10.7.2 Preliminary summary statistics
Create some simple summary statistics. Create a pseudo-factor representing the different combinations of the levels of the two factors, construct side-by-side dot plots, and find the means and standard deviations of each treatment group in the usual way.
We compute the means and standard deviations for each treatment group using Proc Tabulate in the usual way:
proc tabulate data=energy;   /* proc tabulate is not for the faint of heart */
   title2 'Summary table of means, std devs';
   class temp diet;
   var energy;
   table temp*diet, energy*(n*f=5.0 mean*f=5.2 std*f=5.2 stderr*f=5.2) /rts=15;
run;
giving
                      energy
temp   diet     N    Mean    Std    StdErr
18C    ad       8    90.30   20.28  7.17
       res      9    62.49   9.23   3.08
8C     ad       4    112.1   39.41  19.71
       res      4    61.37   5.05   2.53
The side-by-side dot plots are created:
proc sgplot data=energy;
   title2 'Side-by-side dot plots';
   yaxis label='Energy' offsetmin=.05 offsetmax=.05;
   xaxis label='Treatment' offsetmin=.05 offsetmax=.05;
   scatter x=trt y=energy / markerattrs=(symbol=circlefilled);
run;
which gives
There doesn't appear to be any difference in the mean energy usage between the two temperature levels under the restricted food diet, but both appear to be less than the mean energy usage from the ad libitum groups.
The standard deviations appear to be quite different! This is very worrisome - the ANOVA method is
fairly robust to unequal variances provided the sample sizes are equal in all groups. I would proceed with
caution in the subsequent analysis!
The design is unbalanced (the sample sizes are not equal in all groups).
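The summary table can be reproduced directly from the raw data table above; a quick recomputation (Python is used only as a calculator here; the chapter's own code is SAS) also makes the unequal spreads plain:

```python
from statistics import mean, stdev

# Energy use (kcal/g) from the pocket-mouse data table above
groups = {
    "8C-res":  [62.69, 54.07, 65.73, 62.98],
    "18C-res": [72.60, 70.97, 74.32, 53.02, 46.22, 59.10, 61.79, 61.89, 62.50],
    "8C-ad":   [95.73, 63.95, 144.30, 144.30],
    "18C-ad":  [101.19, 76.88, 74.08, 81.40, 66.58, 84.38, 118.95, 118.95],
}

# Note how the 8C-ad standard deviation (about 39) dwarfs the 8C-res one (about 5)
summary = {trt: (len(y), mean(y), stdev(y)) for trt, y in groups.items()}
for trt, (n, m, s) in summary.items():
    print(f"{trt:8s} n={n:2d} mean={m:6.2f} sd={s:5.2f}")
```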
Next, construct the profile plot:
We can find the sample means and confidence limits for each population mean quite simply because this is a CRD.
proc sort data=energy;
   by temp diet;
run;

proc means data=energy noprint;   /* find simple summary statistics */
   by temp diet;
   var energy;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series  y=mean x=diet / group=temp;
   highlow x=diet high=uclm low=lclm / group=temp;
   yaxis label='Energy' offsetmin=.05 offsetmax=.05;
   xaxis label='Diet' offsetmin=.05 offsetmax=.05;
run;
and then get the profile plot:
c 2012 Carl James Schwarz 645 December 21, 2012
CHAPTER 10. TWO FACTOR DESIGNS - SINGLE-SIZED EXPERIMENTAL
UNITS - CR AND RCB DESIGNS
The profile plot shows that some interaction may be present, but it appears to be swamped by sampling uncertainty (i.e. the wide confidence intervals imply that the lines could be parallel). Indeed, this is not unexpected given the earlier plot that showed no difference in the means between the two temperatures under the restricted diet but some apparent differences under the ad libitum diet. Because of the strong evidence of differential standard deviations among the groups, we may not detect this effect.
10.7.3 The statistical model
The model for this study is:
Energy = Food + Temp + Food*Temp
What does this statistical model tell us about the sources of variation in the observed data?
10.7.4 Fitting the model
There are two main procedures in SAS for fitting ANOVA and regression models (collectively called linear models). First is Proc GLM, which performs the traditional sums-of-squares decomposition. Second is Proc Mixed, which uses restricted maximum likelihood (REML) to fit models. In models with only fixed effects (e.g. those in this chapter), this gives the same results as sums-of-squares decompositions. In unbalanced data with additional random effects, the results from Proc Mixed may differ from those of Proc GLM; both are correct, and are just different ways to deal with the approximations necessary for unbalanced data. Proc Mixed also has the advantage of being able to fit more complex models with more than one size of experimental unit. There is no clear advantage to using one procedure or the other; it comes down to personal preference. Both procedures will be demonstrated in this chapter; I personally prefer to use Proc Mixed.
Here is the code for Proc GLM:
ods graphics on;
proc glm data=energy plots=all;
   title2 'Anova using GLM';
   class temp diet;                  /* class statement identifies factors */
   model energy = temp diet temp*diet;
   /* Note that the type III ss do not add to model ss because
      of imbalance in the design */
   lsmeans temp*diet / cl stderr pdiff adjust=tukey lines;
   lsmeans temp      / cl stderr pdiff adjust=tukey lines;
   lsmeans diet      / cl stderr pdiff adjust=tukey lines;
run;
ods graphics off;
Here is the code for Proc Mixed:
ods graphics on;
proc mixed data=energy plots=all;
   title2 'Mixed analysis';
   class temp diet;                  /* class statement identifies factors */
   model energy = temp diet temp*diet / ddfm=kr;
   lsmeans temp*diet / cl adjust=tukey;
   lsmeans temp      / cl adjust=tukey;
   lsmeans diet      / cl adjust=tukey;
   ods output tests3 =MixedTest;     /* needed for the pdmix800 */
   ods output lsmeans=MixedLsmeans;
   ods output diffs  =MixedDiffs;
run;
ods graphics off;

/* Get a joined lines plot */
%include '../pdmix800.sas';
%pdmix800(MixedDiffs,MixedLsmeans,alpha=0.05,sort=yes);
In both procedures, the Class statement specifies the categorical factors for the model. Notice how you specify the statistical model in the Model statement; it is very similar to the statistical model seen earlier. The extra code at the end of Proc Mixed is to generate the joined-line plots that we've seen earlier based on the output from the LSmeans statements (see below).
10.7.5 Hypothesis testing and estimation
There can be several ANOVA tables produced from the model fit.
The Whole Model test simply examines if there is evidence of an effect for any term in the model. It is rarely useful, but does tell us that we should proceed further and investigate the various effects. Here is the table for the overall ANOVA computed by GLM:
Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 9092.57694 3030.85898 7.67 0.0012
Error 21 8297.83396 395.13495
Corrected Total 24 17390.41090
No such table is produced by Proc Mixed.
The effect tests table breaks down the whole model test according to the various terms in the model.
Here is the table for the Effect Tests computed by Proc GLM:
Source DF Type III SS Mean Square F Value Pr > F
temp 1 579.080566 579.080566 1.47 0.2395
diet 1 8374.291389 8374.291389 21.19 0.0002
temp*diet 1 711.861727 711.861727 1.80 0.1939
Here is the table for the Effect Tests computed by Proc Mixed. Because the model consists only of fixed effects, the results are identical to those from Proc GLM:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
temp 1 21 1.47 0.2395
diet 1 21 21.19 0.0002
temp*diet 1 21 1.80 0.1939
The Effects Test table breaks down the Model line of the whole model test into the components for every term in the model. Some packages give you a choice of effect tests. For example, SAS will print out Type I, II, and III tests; in balanced data these will always be the same and any can be used. In unbalanced data, these three types of tests will have different results; ALWAYS use the Type III (also known as the marginal) tests. As noted earlier, if you are using R, the default method is what is known as Type I (incremental) tests and can be misleading (i.e. testing the wrong hypothesis) in cases of unbalanced data.
As in the balanced case, start the hypothesis testing with the most complicated terms and work towards
simpler terms. In this case, start with the interaction effects.
Our null hypothesis is
H: no interaction among the effects of food and temperature on the mean response.
A: some interaction among the effects of food and temperature on the mean response.
Our test statistic is F = 1.80; the p-value (0.1939) is not very small. Hence there is no evidence of an interaction among the effects of food and temperature on the mean response. This is somewhat surprising given the profile plots constructed earlier; it may be an artifact of the unequal standard deviations among the groups or because our sample sizes are not very large.
Examining main effects - temperature.
The F-statistic is 1.47, the p-value is 0.2395. There is no evidence of a difference among the mean
energy requirements at the two temperature levels.
This would be confirmed by looking at the estimated marginal means and the estimated difference in the means:
Here are estimates of the marginal means (and standard errors) and the single pairwise difference (it is not necessary to use Tukey's procedure - why?). These are requested using the LSmeans statement in the GLM procedure. The LSMeans estimates are equal to the raw sample means; this will be true ONLY in balanced data. In the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate marginal means.
temp  energy LSMEAN  Standard Error  H0:LSMEAN=0 Pr > |t|  H0:LSMean1=LSMean2 Pr > |t|
18C   76.3956250     4.8294863       <.0001                0.2395
8C    86.7187500     7.0279349       <.0001

temp  energy LSMEAN  95% Confidence Limits
18C   76.395625      66.352158   86.439092
8C    86.718750      72.103359   101.334141

Least Squares Means for Effect temp
i  j  Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1  2  -10.323125                 -28.056590   7.410340
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of means to where they intersect on either a solid blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey-Kramer Comparison Lines for Least Squares Means of temp
LS-means with the same letter are not significantly different.
energy LSMEAN temp LSMEAN Number
A 86.71875 8C 2
A
A 76.39563 18C 1
The output from Proc Mixed is not quite as extensive but, in my opinion, is organized in a much more logical fashion than that of Proc GLM. First, here are the estimates of the marginal means:
Effect  temp  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
temp    18C   76.3956   4.8295          21  15.82    <.0001    0.05   66.3522   86.4391
temp    8C    86.7188   7.0279          21  12.34    <.0001    0.05   72.1034   101.33
and then the pairwise differences:
temp  _temp  Estimate  Standard Error  Adjustment     Adj P    Adj Low    Adj Upp
18C   8C     -10.3231  8.5274          Tukey-Kramer   0.2395   -28.0566   7.4103
The pdmix800 macro is used to produce the joined-line plots:
Effect=temp Method=Tukey-Kramer(P<0.05) Set=2
Obs temp diet Estimate Standard Error Alpha Lower Upper Letter Group
5 8C 86.7188 7.0279 0.05 72.1034 101.33 A
6 18C 76.3956 4.8295 0.05 66.3522 86.4391 A
Here is where the imbalance again causes some difficulty in the analysis. Notice that the LSMeans no
longer equal the raw means. In unbalanced data, simple means across the factors may be affected by unequal
sample sizes. The simple mean for the 8°C level would include 4 mice on the restricted diet and 4 mice on
the ad libitum diet - an equal split. However, the simple mean for the 18°C level would include 9 mice on the
restricted diet and 10 mice on the ad libitum diet - no longer an equal weighting of the two diets. The
LSMeans are computed by giving equal weights to each of the two means from the two diets. There is no
universal agreement on which is preferable (and you thought that Statistics was so cut and dried) but, in most situations,
the least square means seem preferable.
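The weighting difference can be sketched numerically. The cell means below are hypothetical (the raw mouse data are not reproduced here); only the sample sizes (4 and 4 mice at 8°C, 9 and 10 mice at 18°C) are taken from the text.

```python
# Sketch of LSmean vs raw-mean weighting in an unbalanced two-factor design.
# Cell means are HYPOTHETICAL; the sample sizes match the text above.

# (cell mean, n) for each diet within a temperature level
cells_8C  = [(90.0, 4), (83.0, 4)]   # restricted, ad libitum at 8C
cells_18C = [(70.0, 9), (82.0, 10)]  # restricted, ad libitum at 18C

def raw_mean(cells):
    """n-weighted mean: what a simple average of the raw data gives."""
    return sum(m * n for m, n in cells) / sum(n for _, n in cells)

def ls_mean(cells):
    """Unweighted average of the cell means: equal weight to each diet."""
    return sum(m for m, _ in cells) / len(cells)

print(raw_mean(cells_8C), ls_mean(cells_8C))    # balanced: both 86.5
print(raw_mean(cells_18C), ls_mean(cells_18C))  # unbalanced: they differ
```

With balanced cells the two definitions coincide; with 9 versus 10 mice they diverge, which is exactly why the LSMeans no longer equal the raw means.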
Examining main effects - food level.
The ANOVA table gives F = 21.2 and the p-value is 0.0002 - very small. There is very strong evidence
of a difference in the mean energy requirements between the two food levels. Because we only have two
levels, it is obvious where the difference lies. Still, we find the Least Square Means and estimated differences
for the food levels:
Here are estimates of the marginal means (and standard errors) and the single pairwise difference (it is
not necessary to use Tukey's procedure - why?). These are requested using the LSmeans statement in the
GLM procedure. The LSMeans estimates are equal to the raw sample means ONLY in balanced data. In
the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate marginal means.
diet  energy LSMEAN  Standard Error  H0:LSMEAN=0 Pr > |t|  H0:LSMean1=LSMean2 Pr > |t|
ad    101.185625     6.086370        <.0001                0.0002
res   61.928750      5.972596        <.0001

diet  energy LSMEAN  95% Confidence Limits
ad    101.185625     88.528325   113.842925
res   61.928750      49.508056   74.349444

Least Squares Means for Effect diet

i  j  Difference Between Means  Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1  2  39.256875                 21.523410  56.990340
Proc GLM also provides a difference plot.
In this plot, follow the light grey lines beside a pair of levels to where they intersect on either a solid
blue line or a dashed red line. The confidence interval runs out from this point. If the confidence interval
for the difference in the means does NOT intersect the dashed grey line (X = Y), then there is evidence
that this difference in the means is not equal to zero (the solid blue lines). If the confidence interval for the
difference in the means does intersect the dashed grey line (the dashed red lines), then there is no evidence
of a difference in the means.
Finally, Proc GLM can also produce the joined-line plot to make the interpretation easier as seen in the
previous chapters.
Tukey-Kramer Comparison Lines for Least Squares Means of diet
LS-means with the same letter are not significantly different.

      energy LSMEAN   diet   LSMEAN Number
A     101.1856        ad     1
B     61.9288         res    2
The output from Proc Mixed is not quite as extensive and, in my opinion, is organized in a much more
logical fashion than in Proc GLM. First, here are the estimates of the marginal means:
Effect  temp  diet  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
diet          ad    101.19    6.0864          21  16.62    <.0001    0.05   88.5283  113.84
diet          res   61.9287   5.9726          21  10.37    <.0001    0.05   49.5081  74.3494
and then the pairwise differences:
temp  diet  _temp  _diet  Estimate  Standard Error  Adjustment    Adj P   Adj Low  Adj Upp
      ad           res    39.2569   8.5274          Tukey-Kramer  0.0002  21.5234  56.9903
The pdmix800 macro is used to produce the joined-line plots:
Effect=diet Method=Tukey-Kramer(P<0.05) Set=3
Obs temp diet Estimate Standard Error Alpha Lower Upper Letter Group
7 ad 101.19 6.0864 0.05 88.5283 113.84 A
8 res 61.9287 5.9726 0.05 49.5081 74.3494 B
Again, notice that the least square means are different from the raw means. This will usually happen
when the design is unbalanced. The least square means give equal weight to the two temperature levels
while the raw means weight the two temperature levels according to the observed sample sizes.
Don't forget to examine the residual and other diagnostic plots (often produced automatically by
packages). Here is the diagnostic plot from Proc GLM:
Here is the diagnostic plot from Proc Mixed:
There is no evidence of any gross problems in the analysis. There is some evidence that the residuals are
not equally variable, but the difference in variance is not large.
10.7.6 Adjusting for unequal variances?
This is beyond the scope of this course, but a formal test for unequal variances showed clear evidence of a
problem. A more exact test, fortunately, came to similar conclusions, but the estimated standard error of the
difference is slightly different.
10.8 Example: Use-Dependent Inactivation in Sodium Channel Beta
Subunit Mutation - BPK
This dataset was provided by Csilla Egri as part of a 2011 M.Sc. Thesis from the Department of Biomedical
Physiology & Kinesiology at Simon Fraser University, Burnaby, BC: "C121W: A thermosensitive
sodium channel mutation." Additional details are available at
http://dx.doi.org/10.1016/j.bpj.2010.12.2506.
10.8.1 Introduction
Voltage gated sodium (NaV) channels are macromolecular complexes which pass sodium-specific inward
current and are the main determinants of action potential initiation and propagation. NaV channels normally
associate with one or more auxiliary beta subunits which modify voltage dependent properties. Mutations to
these proteins can cause epilepsy, cardiac arrhythmias, and skeletal muscle disorders.
The research question is whether temperature exacerbates the functional effect of an epilepsy-causing sodium
channel beta subunit mutation (C121W).
10.8.2 Experimental protocol
The experiment was conducted using whole-cell voltage clamp experiments on CHO cells expressing sodium
channel proteins with or without auxiliary beta subunits. On each day, a different batch of cells was used,
and the voltage dependent properties were determined at one of two temperatures (22°C or 34°C).
Three sodium channels were examined:
NaV1.2 Sodium channel without associated subunit which served as a negative control.
NaV1.2 + B(WT) is a control
NaV1.2 + B(CW) is the experimental group where the mutation in the beta subunit causes genetic
epilepsy with febrile seizures plus, in which patients experience seizures in response to fever.
The Use-dependent inactivation (UDI) was measured for each batch of cells.
10.8.3 Analysis
The raw data is available in the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The csv file named UDI.csv is read into SAS in the usual way:

data UDI;
   infile 'UDI.csv' dlm=',' dsd missover firstobs=2;
   length subunit $20;
   input Subunit $ Temp UDI_Yo;
run;

proc print data=UDI (obs=15);
   title2 'Part of the raw data';
run;

The first few lines of the raw data are shown below:
Obs subunit Temp UDI_Yo trt
1 Nav1.2 22 0.31 22.Nav1.2
2 Nav1.2 22 0.10 22.Nav1.2
3 Nav1.2 22 0.17 22.Nav1.2
4 Nav1.2 22 0.27 22.Nav1.2
5 Nav1.2 22 0.43 22.Nav1.2
6 Nav1.2 22 0.39 22.Nav1.2
7 Nav1.2 + Beta 22 0.68 22.Nav1.2 + Beta
8 Nav1.2 + Beta 22 0.51 22.Nav1.2 + Beta
9 Nav1.2 + Beta 22 0.27 22.Nav1.2 + Beta
10 Nav1.2 + Beta 22 0.19 22.Nav1.2 + Beta
11 Nav1.2 + Beta 22 0.12 22.Nav1.2 + Beta
12 Nav1.2 + Beta 22 0.41 22.Nav1.2 + Beta
13 Nav1.2 + Beta 22 0.21 22.Nav1.2 + Beta
14 Nav1.2 + Beta 22 0.44 22.Nav1.2 + Beta
15 Nav1.2 + Beta(CW) 22 0.19 22.Nav1.2 + Beta(CW)
We begin by examining the experimental protocol to determine the appropriate analysis for this
experiment.
Treatment structure. There are two factors in this experiment (subunit type with 3 levels and temperature
with 2 levels). All subunits were tested at both temperatures. This is a factorial experiment. We are
interested only in the levels tested of each factor, so both factors are fixed effects.
Experimental structure. A separate batch of cells was tested on each day and each batch of cells was
only tested once. The batch of cells is the experimental unit.
Randomization structure. Each batch of cells of a given subunit type was randomized (we hope) to a separate
temperature and day to be tested. There was no blocking by week.
Consequently, this experiment appears to be a two-factor completely randomized design (two-factor CRD).
Before performing any analysis, it is always wise to examine the data to see if the assumptions required
for the analysis are approximately satisfied, to search for outliers or unusual points, and to check for any
data anomalies.
We begin by examining the sample sizes, means, and standard deviations of each treatment (combination
of subunit and temperature) group.
The sample size, mean, and standard deviation for each treatment group are found using Proc Tabulate:
proc tabulate data=UDI;
   class Temp Subunit;
   var UDI_Yo;
   table temp*subunit, UDI_Yo*(n*f=4.0 mean*f=7.2 std*f=7.2);
run;
                          UDI_Yo
                          N    Mean   Std
Temp  subunit
22    Nav1.2              6    0.28   0.13
      Nav1.2 + Beta       8    0.35   0.19
      Nav1.2 + Beta(CW)   6    0.28   0.08
34    Nav1.2              5    0.54   0.16
      Nav1.2 + Beta       5    0.47   0.09
      Nav1.2 + Beta(CW)   7    0.62   0.09
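As a check on the summary table, the two 22°C groups whose raw values appear in full in the data listing above can be summarized with a few lines of Python (a sketch; the document itself uses Proc Tabulate):

```python
# Reproduce the n / mean / SD summary for the two complete 22C groups
# shown in the raw data listing, using only the Python standard library.
from statistics import mean, stdev

groups = {
    "Nav1.2 @ 22": [0.31, 0.10, 0.17, 0.27, 0.43, 0.39],
    "Nav1.2 + Beta @ 22": [0.68, 0.51, 0.27, 0.19, 0.12, 0.41, 0.21, 0.44],
}

for name, y in groups.items():
    # matches the tabulated values: n=6, 0.28, 0.13 and n=8, 0.35, 0.19
    print(f"{name}: n={len(y)} mean={mean(y):.2f} sd={stdev(y):.2f}")
```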
The design is unbalanced (the number of batches of cells under each treatment combination is unequal).
CAUTION: The analysis of unbalanced data can be tricky as different packages make different assumptions
on how to test the main effects and interaction.
The standard deviations are all approximately equal which appears to satisfy the assumption of equal
variances for each treatment group. A plot of the standard deviations of each treatment versus the mean of
each treatment group
also doesn't show any strong relationship between the standard deviation and the mean.
We also construct side-by-side dot plots using a pseudo-factor formed from the combination of the two
factor levels. We create the pseudo-factor by concatenating the values of the Temperature and Subunit variables
and then make a dot plot:
/* Side-by-side dot plots to check for outliers */
/* Create the pseudo-factor */
data UDI;
   set UDI;
   length trt $20.;
   trt = put(temp,2.0) || '.' || subunit;
run;

proc sgplot data=UDI;
   title2 'Side-by-side dot plots';
   yaxis label='UDI' offsetmin=.05 offsetmax=.05;
   xaxis label='Treatment' offsetmin=.05 offsetmax=.05;
   scatter x=trt y=UDI_yo / markerattrs=(symbol=circlefilled);
run;
5 Often the standard deviation increases with the mean which would imply that some transformation (usually a log() transformation) should be used.
The dot plot doesn't show any obvious outliers.
The assumption of normality will be checked after we fit the model. It is not correct to simply check
whether the pooled data (over all treatment groups) comes from a single normal distribution as each of the
treatment groups has a different mean.
The assumption of independence needs to be assessed based on the experimental protocol. Separate
batches of cells were used, so there should be no batch effects that may cause a similar response over the
experimental conditions.
It appears that all the assumptions required for this design are satisfied. We now proceed to the
formal analysis. We start by developing the formal statistical model for this experiment. This model has
terms corresponding to the factors (and their interaction if the experiment is factorial), the experimental
units (the smallest size of experimental unit is always implicit), and the randomization structure (we always
assume complete randomization at this time).
Using a standard shorthand syntax, the model is

UDI = Temp Subunit Temp*Subunit
The UDI represents the variation in the response variable (the UDI value). Some of the variation is
explained by the factor (and interaction) effects (the Temp, Subunit, and Temp*Subunit terms).
Because there is only one size of experimental unit (the batch of cells), the random batch-to-batch variation
is implicit (and does not formally enter the model). Finally, complete randomization occurred during the
experiment, so no extra terms are needed.
The above shorthand notation is commonly used in many computer packages as noted below.
Either Proc GLM or Proc Mixed can be used to analyze this data. The code sequences are shown below.
ods graphics on;
proc glm data=UDI plots=diagnostics;
   title2 'GLM analysis';
   class temp subunit;
   model UDI_Yo = temp subunit temp*subunit;
   lsmeans temp         / stderr pdiff cl adjust=tukey lines;
   lsmeans subunit      / stderr pdiff cl adjust=tukey lines;
   lsmeans temp*subunit / stderr pdiff cl adjust=tukey lines;
   estimate 'CW vs reg at 34'
            subunit 0 1 -1
            subunit*temp 0 0 0 0 1 -1 / e;
   ods output ModelAnova=GLMModelAnova;
run;
ods graphics off;
ods graphics on;
proc mixed data=UDI plot=residualpanel;
   title2 'Mixed analysis';
   class temp subunit;
   model UDI_Yo = temp subunit temp*subunit / ddfm=kr;
   lsmeans temp         / diff adjust=tukey;
   lsmeans subunit      / diff adjust=tukey;
   lsmeans temp*subunit / diff adjust=tukey;
   estimate 'CW vs reg at 34'
            subunit 0 1 -1
            subunit*temp 0 0 0 0 1 -1 / e;
   ods output tests3=MixedTests;
   ods output lsmeans=MixedLSMeans;
   ods output diffs=MixedDiffs;
run;
ods graphics off;
Selected portions of the output are shown below and for the remainder of the document. You should run
the SAS programs available from the Sample Program Library to get the complete listing.
Effect        Num DF  Den DF  F Value  Pr > F
Temp          1       31      29.41    <.0001
subunit       2       31      0.41     0.6676
Temp*subunit  2       31      2.26     0.1212
Dependent  Hypothesis Type  Source        DF  Type III SS  Mean Square  F Value  Pr > F
UDI_Yo     3                Temp          1   0.52095609   0.52095609   29.41    <.0001
UDI_Yo     3                subunit       2   0.01450323   0.00725162   0.41     0.6676
UDI_Yo     3                Temp*subunit  2   0.08008009   0.04004004   2.26     0.1212
Start by examining the effect tests for the interaction terms. The p-value for the interaction effect is 0.12
indicating no evidence of an interaction effect between the two factors of the experiment. This implies that
the responses to the two temperatures should be parallel over the different sub-unit types, i.e. the difference
in mean UDI between the two temperature levels is the same for all sub-units.
Because there was no evidence of an interaction, it is sensible to look at the main effects. Had there been
strong evidence of an interaction, it may not have been sensible to look at the main effects.
The p-value for the effect of sub-unit is 0.67 indicating no evidence of a difference in mean UDI among
the three sub-units when averaged across both temperatures. The p-value for the effect of temperature is
<.0001 indicating strong evidence of a temperature effect.
Checks of the assumption of normality and of the influence of individual observations are provided in the
output generated by ODS GRAPHICS. Only the output from GLM is shown:
There is no evidence of a problem with the model fit.
The profile plot
shows a separation in the profiles between the two temperatures as expected from the results of the effect
tests, but the variation in response is sufficient to hide any non-parallelism (the interaction) that may be
present. It is a pity that the default output from most packages doesn't add standard error (or confidence
interval) bars to the above plot.
Examine carefully the estimated population marginal means for each level of temperature, subunit, or
the treatment combination (not all of which may be shown below):
Effect        Temp  subunit            Estimate  Standard Error  DF  t Value  Pr > |t|
Temp          22                       0.3040    0.03003         31  10.12    <.0001
Temp          34                       0.5448    0.03269         31  16.67    <.0001
subunit       _     Nav1.2             0.4072    0.04029         31  10.10    <.0001
subunit       _     Nav1.2 + Beta      0.4139    0.03794         31  10.91    <.0001
subunit       _     Nav1.2 + Beta(CW)  0.4521    0.03702         31  12.21    <.0001
Temp*subunit  22    Nav1.2             0.2783    0.05433         31  5.12     <.0001
Temp*subunit  22    Nav1.2 + Beta      0.3538    0.04705         31  7.52     <.0001
Temp*subunit  22    Nav1.2 + Beta(CW)  0.2800    0.05433         31  5.15     <.0001
Temp*subunit  34    Nav1.2             0.5360    0.05952         31  9.01     <.0001
Temp*subunit  34    Nav1.2 + Beta      0.4740    0.05952         31  7.96     <.0001
Temp*subunit  34    Nav1.2 + Beta(CW)  0.6243    0.05030         31  12.41    <.0001
Because the design is NOT balanced, the marginal means (known as the LSMeans in SAS lingo) are not
equal to the corresponding raw means computed from the data. These marginal estimates are found by first
computing the average of each combination of subunit and temperature, and then averaging these averages
across the three subunits for each temperature level.
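This "average of averages" construction can be verified directly from the Temp*subunit cell means reported above; a short Python sketch:

```python
# The marginal LSmean for each temperature equals the UNWEIGHTED mean of
# its three subunit cell means (numbers from the Temp*subunit table above).
cell_means = {
    22: [0.2783, 0.3538, 0.2800],  # Nav1.2, Nav1.2+Beta, Nav1.2+Beta(CW)
    34: [0.5360, 0.4740, 0.6243],
}

marginal = {t: sum(m) / len(m) for t, m in cell_means.items()}
print(marginal)  # about 0.3040 at 22 and 0.5448 at 34, matching the table
```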
Because the interaction term was not statistically significant, the effect of temperature could be constant
for all subunits resulting in two parallel lines. The estimated temperature effect (or the average difference
between the means at the two temperature levels) is found as follows. The LSMeans statements also
estimate the differences in the marginal means:
Effect        Temp  subunit            _Temp  _subunit           Est       SE       Adj P
Temp          22                       34                        -0.2407   0.04439  <.0001
subunit       _     Nav1.2             _      Nav1.2 + Beta      -0.00671  0.05534  0.9919
subunit       _     Nav1.2             _      Nav1.2 + Beta(CW)  -0.04498  0.05472  0.6924
subunit       _     Nav1.2 + Beta      _      Nav1.2 + Beta(CW)  -0.03827  0.05301  0.7525
Temp*subunit  22    Nav1.2             22     Nav1.2 + Beta      -0.07542  0.07188  0.8972
Temp*subunit  22    Nav1.2             22     Nav1.2 + Beta(CW)  -0.00167  0.07684  1.0000
Temp*subunit  22    Nav1.2             34     Nav1.2             -0.2577   0.08059  0.0343
Temp*subunit  22    Nav1.2             34     Nav1.2 + Beta      -0.1957   0.08059  0.1779
Temp*subunit  22    Nav1.2             34     Nav1.2 + Beta(CW)  -0.3460   0.07404  0.0007
Temp*subunit  22    Nav1.2 + Beta      22     Nav1.2 + Beta(CW)  0.07375   0.07188  0.9055
Temp*subunit  22    Nav1.2 + Beta      34     Nav1.2             -0.1823   0.07587  0.1867
Temp*subunit  22    Nav1.2 + Beta      34     Nav1.2 + Beta      -0.1202   0.07587  0.6141
Temp*subunit  22    Nav1.2 + Beta      34     Nav1.2 + Beta(CW)  -0.2705   0.06888  0.0054
Temp*subunit  22    Nav1.2 + Beta(CW)  34     Nav1.2             -0.2560   0.08059  0.0360
Temp*subunit  22    Nav1.2 + Beta(CW)  34     Nav1.2 + Beta      -0.1940   0.08059  0.1849
Temp*subunit  22    Nav1.2 + Beta(CW)  34     Nav1.2 + Beta(CW)  -0.3443   0.07404  0.0008
Temp*subunit  34    Nav1.2             34     Nav1.2 + Beta      0.06200   0.08417  0.9757
Temp*subunit  34    Nav1.2             34     Nav1.2 + Beta(CW)  -0.08829  0.07793  0.8639
Temp*subunit  34    Nav1.2 + Beta      34     Nav1.2 + Beta(CW)  -0.1503   0.07793  0.4048
and the estimated difference in the mean UDI between the two temperature levels (averaged across all three
sub-unit types) is -0.24 (SE 0.044).
A specialized comparison of the NaV1.2+B(CW) and NaV1.2+B(WT) sub-unit types at the elevated
temperature can be formed using a contrast.
The ESTIMATE statement in the previous program estimates this contrast. Note that you need to know
the order in which SAS has stored the factor levels and the parameterization used to ensure that the correct
coefficients are being used. Please see me for more details.
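Because the contrast involves only cell means, its estimate can be reproduced by hand from the Temp*subunit LSMeans reported earlier. A Python sketch (the coefficient vector mirrors the one printed by the ESTIMATE statement):

```python
# The 'CW vs reg at 34' contrast as a weighted sum of the six cell LSmeans.
cells = {  # (temp, subunit): LSmean, from the Temp*subunit table above
    (22, "Nav1.2"): 0.2783,
    (22, "Nav1.2+Beta"): 0.3538,
    (22, "Nav1.2+Beta(CW)"): 0.2800,
    (34, "Nav1.2"): 0.5360,
    (34, "Nav1.2+Beta"): 0.4740,
    (34, "Nav1.2+Beta(CW)"): 0.6243,
}

coef = {k: 0.0 for k in cells}
coef[(34, "Nav1.2+Beta")] = 1.0       # +1 on Beta at 34
coef[(34, "Nav1.2+Beta(CW)")] = -1.0  # -1 on Beta(CW) at 34

estimate = sum(coef[k] * cells[k] for k in cells)
print(round(estimate, 4))  # -0.1503, matching the Estimates table
```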
The Mixed Procedure
Coefficients for 'CW vs reg at 34'

Effect        subunit            Temp  Row1
Intercept
Temp                             22
Temp                             34
subunit       Nav1.2
subunit       Nav1.2 + Beta            1
subunit       Nav1.2 + Beta(CW)        -1
Temp*subunit  Nav1.2             22
Temp*subunit  Nav1.2 + Beta      22
Temp*subunit  Nav1.2 + Beta(CW)  22
Temp*subunit  Nav1.2             34
Temp*subunit  Nav1.2 + Beta      34    1
Temp*subunit  Nav1.2 + Beta(CW)  34    -1
Estimates

Label            Estimate  Standard Error  DF  t Value  Pr > |t|
CW vs reg at 34  -0.1503   0.07793         31  -1.93    0.0630
This indicates that there is no evidence that the mean UDI at 34°C for the Nav1.2+Beta(CW)
sub-unit type is different from the mean UDI at 34°C for the Nav1.2+Beta sub-unit type: the estimated
difference is -0.15 (SE 0.08, p = 0.063).
Other contrasts can be explored in a similar fashion.
The original research question was whether temperature exacerbates the functional effect of an epilepsy-causing
sodium channel beta subunit mutation (C121W). If there were such an effect of temperature, then the lines
would not be parallel. As there was no evidence of non-parallelism in the ANOVA table, there is no evidence
against the hypothesis of parallel profiles. Similarly, there was no evidence that the mean response at 34°C differed between
the two non-control sub-units (the final contrast).
10.9 Blocking in two-factor CRD designs
This twist on the design poses no real problems. Things to look out for:
- the randomization within each block is done independently of every other block.
- the blocks must be complete, i.e., every treatment combination must appear in every block.
- when analyzing the data, specify factors as fixed or random, as well as blocks as fixed or random.
- as before, random blocks or random factors imply that some packages may give you incorrect results.
- random blocks or random factors imply that some packages will have difficulty estimating the
SE of the marginal means and of the contrasts among the means.
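The first two points can be sketched as follows. The factor levels and block labels are made up for illustration; each complete block receives every treatment combination in its own independently shuffled order:

```python
# Sketch of randomization for a blocked two-factor design: every complete
# block contains every treatment combination, and the run order within each
# block is shuffled independently of every other block.
import random

factor1 = ["low", "high"]   # hypothetical Factor1 levels
factor2 = ["A", "B", "C"]   # hypothetical Factor2 levels
blocks = ["block1", "block2", "block3"]

treatments = [(f1, f2) for f1 in factor1 for f2 in factor2]

layout = {}
for b in blocks:
    order = treatments[:]    # complete block: all 6 combinations present
    random.shuffle(order)    # independent randomization within this block
    layout[b] = order

for b in blocks:
    print(b, layout[b])
```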
The statistical model for this experiment will be

Y = Block Factor1 Factor2 Factor1*Factor2

The blocking variable can be declared as a fixed or random effect. If blocks are complete, i.e. every block
has every treatment combination, then it makes no difference to the analysis if blocks are treated as fixed or
random effects. If blocks are incomplete, e.g. due to missing values or deliberate action, then a model with
random blocks can recover additional information; see the chapter on Incomplete Block Designs.
An example of such a design will be done in an assignment.
10.10 FAQ
10.10.1 How to determine sample size in two-factor designs
How are sample sizes determined in two-factor experiments?
In a later chapter, you will analyze in detail an experiment to investigate ways of warming people
suffering from hypothermia.
Suppose that literature reviews have shown that in past experiments, the standard deviation in the time
needed to rewarm bodies was around 10 minutes. We are interested in detecting differences of about 10
minutes in the mean time needed to rewarm bodies among the three methods and between the two sexes.
What sample sizes would be needed for a CRD?
First, compute the sample size required for each of the two factors separately.
When examining sex effects, we find that the standard deviation is about 10, the difference to detect is
about 10, and a total sample size of about 34 is required for 80% power at α = 0.05.
When examining method effects, we find that again the standard deviation is about 10, the difference
between the smallest and largest mean is set to 10 while the third mean is placed in the middle, and a total
sample size of about 60 is needed for 80% power at α = 0.05.
The two results must be reconciled. If you must obtain the desired power for both factors, then you must
use the larger sample size, i.e., about 10 subjects per treatment combination.
If this is too costly, you will have to make compromises. For example, you could choose a sample size
of 50 which would give you the desired power for detecting sex effects, but not method effects.
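The "about 34" figure for the sex effect can be approximated with the usual normal-theory sample size formula (a sketch; the quoted value comes from a t-based calculation, so it runs slightly larger than the normal approximation):

```python
# Normal-approximation sample size for detecting a difference delta between
# two group means with common standard deviation sigma.
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    # n per group = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

n = n_per_group(sigma=10, delta=10)
print(n, 2 * n)  # 16 per group, 32 total; the t-based answer is about 34
```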
10.10.2 What is the difference between a block and a factor?
Blocks are typically not manipulated but rather are a collection of experimental units that are similar. You
usually assume that blocking effects don't interact with factor effects. Usually, blocks are not assigned to
experimental units - the experimental units are conveniently grouped into blocks. Blocks are "passive".
A factor has levels that are manipulated and randomized over the experimental units. Factors are usually
assigned to experimental units. [Of course in analytical surveys, units are randomly selected from each level
rather than being assigned to levels.] There is usually no natural grouping of experimental units to factor
levels. Factors are "active".
In the case of the rancid fat example, laboratories were assigned at random to irradiate the experimental units,
the batches of fat. There is no natural grouping of batches of fat with the laboratories. Consequently,
laboratories were treated as a factor rather than a block.
In the case of the mosquito repellent, people were randomly assigned to locations. There were no natural
groups of people at each location. Location was again treated as a factor rather than a block.
In the case of seedling growth, the blocks were locations around the province. At each location, there
were several 1 ha plots. The plots were grouped naturally by location. Hence, location should be treated as
a block rather than as a factor.
In some cases, the distinction is not as clear cut. The computer packages will give the same results whether a
term is a factor or a block, so the final results will be the same. The only real reason to distinguish carefully
between blocks and factors is for interpreting the results. Usually, blocks are not randomized so tests for "block"
effects don't make much sense.
10.10.3 If there is evidence of an interaction, does the analysis stop there?
If an interaction is detected, first examine whether a transformation would remove the interaction.
For example, if the factor operates multiplicatively (it reduces yield by 1/2) rather than additively (it reduces
yield by 50 kg/ha), a log-transform would remove the interaction effects.
For example, consider an experiment with two factors, one with three levels and the second at two levels.
Table 10.83 shows true population means under the assumption of additivity.
Table 10.83: Population means under assumption of additivity between factors

           Factor A
Factor B   a1   a2   a3
b1         10   20   15
b2         35   45   40
Notice that the effect of going from a1 to a2 is the same for both levels of B. Similarly, the effect of going
from b1 to b2 is the same for all levels of A. Note that the above table refers to POPULATION means - the
sample means may not enjoy this strict additivity as an artifact of the sampling process.
There are two ways in which additivity can fail. First, the units may be measured on the wrong scale.
Consider Table 10.84 of means where additivity does not hold:
Table 10.84: Population means when the assumption of additivity between factors is false but correctable

           Factor A
Factor B   a1    a2    a3
b1         10    20    15
b2         100   200   150
In Table 10.84, the treatment effects of each factor are not the same across all levels of the other factor, but
notice that the mean for level a2 is always twice the mean for level a1 regardless of the level of B, and that
the mean for level b1 is always one-tenth of the mean for level b2 regardless of the level of A. This suggests
that the effects of each factor are multiplicative rather than additive, and that the analysis should proceed on
the log-scale. Indeed, consider the same values in Table 10.85 after a log-transformation:
Table 10.85: Population means when the assumption of additivity among factors is false but correctable. A
log-transform is applied.

           Factor A
Factor B   a1     a2     a3
b1         2.30   3.00   2.71
b2         4.61   5.30   5.01
The treatment effects in Table 10.85 are now additive on the log-scale (how can you tell?).
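One way to tell: on the log scale the row difference (the B effect) is identical at every level of A. A quick check using the Table 10.84 means:

```python
# Check additivity on the log scale: the difference log(b2) - log(b1) should
# be the same at every level of Factor A when the effects are multiplicative.
from math import log

b1 = [10, 20, 15]     # Table 10.84, row b1
b2 = [100, 200, 150]  # Table 10.84, row b2

row_effects = [log(y2) - log(y1) for y1, y2 in zip(b1, b2)]
print([round(d, 4) for d in row_effects])  # constant log(10) -> additive
```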
In some cases, no transformation will correct non-additivity among factors, as illustrated in Table 10.86.
Notice that in Table 10.86 the response in going from a1 to a2 is different depending upon the level of
B, and there is no simple pattern to the means.
If evidence of interaction is still present, then it really doesn't make much sense to test main effects. An
interaction between two factors indicates that the effect of a factor changes depending on the level of the
other factor. The test for the main effect in the presence of an interaction would examine whether the average effect
exists - this may not bear any relevance to the individual effects.
At this point, you should examine the individual treatment combinations to see which treatments appear
to differ from other treatments.
10.10.4 When should you use raw means or LSmeans?
When should raw means be used and when should LSmeans be used?
6 Either the ln or log transformation can be used. The results will differ by a constant.
Table 10.86: Population means when the assumption of additivity among factors is false and not correctable.

           Factor A
Factor B   a1   a2   a3
b1         10   20   15
b2         35   30   25
The difference between LSMeans and raw means (ordinary means) is important whenever you "average"
across units with unequal sample sizes. For example, suppose you have 2 fish, one with 2 sub-samples and
the other with 5 sub-samples, with the following made-up data for the fat concentration in tissues:

Fish 1: 2 4
Fish 2: 7 8 9 10 11
The raw mean is computed as (2+4+7+8+9+10+11)/7 = 7.3. The LSmean is computed as the
average of the averages: ((2+4)/2 + (7+8+9+10+11)/5)/2 = (3+9)/2 = 6.
Which is a better representation of the average fat concentration across fish? In this case it seems
reasonable that each fish should be given equal "weight" in the averaging process, so the LSmean is likely a
better estimate.
Now suppose rather than fish 1 and fish 2, you had one reading from each of 2 male fish and one reading
from each of 5 female fish:

Male:   2 4
Female: 7 8 9 10 11
What is the best estimate of the average fat concentration over males and females? Here it is not so clear.
If the sex ratio in the population is 50:50, then the LSmeans are appropriate as the sample sizes just reflect
that females were easier to catch.
But if the two sexes were equally catchable, then the different sample sizes for the two sexes are a reflection
of an unequal sex ratio, so the population average has to reflect the different sex ratio and the raw means
may be more appropriate.
But what if the sex ratio was not 50:50 and males/females were not equally catchable? Then again a
different weighting would be needed.
So... in many cases the LSmeans are preferable, but this is not always true. You need to examine the
circumstances of each survey or experiment to decide which mean is most appropriate.
Chapter 11
SAS CODE NOT DONE
Chapter 12
Two-factor split-plot designs
Contents
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
12.2 Example - Holding your breath at different water temperatures - BPK . . . . . . . . 675
12.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
12.2.2 Standard split-plot analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
12.2.3 Adjusting for body size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
12.2.4 Fitting a regression to temperature . . . . . . . . . . . . . . . . . . . . . . . . . 687
12.2.5 Planning for future studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
12.3 Example - Systolic blood pressure before presyncope - BPK . . . . . . . . . . . . . . 698
12.3.1 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
12.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
12.3.3 Power and sample size determination . . . . . . . . . . . . . . . . . . . . . . . . 707
12.1 Introduction
As noted elsewhere in these notes, there are several different ways in which two-factor designs can physically
be performed. It is important then to understand exactly how an experiment was run in order to perform the
appropriate analysis.
In split-plot designs, there are two sizes of experimental units, traditionally called main-plots and sub-
plots. In experiments done in BPK, a common split-plot design occurs when various tests are done on
subjects of different genders. The factor gender operates on the main-plot level, while the tests operate
within subjects.
This is also known as a repeated-measures design, but as noted in

McCulloch, C.E. (2005). Repeated Measures ANOVA, R.I.P.? Chance, 18, 29-33.

any repeated-measures design can be analyzed using random-effect mixed models, and this latter analysis
provides substantially more flexibility. For example, missing observations and different covariance structures
can be easily handled; see me for more details.
12.2 Example - Holding your breath at different water temperatures - BPK

12.2.1 Introduction
How does the time that a subject can hold their breath vary with the temperature of the water in which the
subject's face was immersed? Does it vary between males and females?
Several subjects of each sex were asked to hold their breath when their faces were immersed in water of
various temperatures. The time (seconds) the subject was able to hold their breath was recorded. The height
of the subject (m) was also recorded as a measure of size. Finally, the time to hold the breath when their face
was not immersed in water, i.e. at ambient air conditions, was also recorded.
These data were provided by Matthew D. White in BPK at SFU.
The goal of the study was to see if lower water temperatures decreased breath hold times. Biologists
insist that lower water temperature prolongs breath hold time as seen in many diving mammals. But is this
true? Consider working off the coast of Newfoundland, where helicopter pilots and passengers wear whole
body survival suits when flying over the cold North Atlantic Ocean. In the tragic event of a crash, these suits
leave the face exposed during underwater breath-hold swims from inverted and submerged helicopters. So
the water temperature may affect the length of time that a person can hold their breath and their ability to
escape safely from a submerged helicopter.
The hypotheses of interest are:

- Is there a difference in the mean time subjects can hold their breath at different water temperatures?
- Is there a difference in the mean time males and females can hold their breath?
- Is the difference between males and females consistent over the different water temperatures?
- Is the height of the subject (a measure of size) relevant in explaining some of the differences?
Here is part of the raw data in *.csv format.
Subject,height,gender,bht0,bht5,bht10,bht15,bht20,bhtair
1,1.78,Male,25,45,50,58,67,100
2,1.78,Male,24,33,34,34,44,45
3,1.73,Male,24,36,103,142,137,117
4,1.78,Male,38,35,47,62,49,53
5,1.7,Male,58.3,32,32,48.33,36,64.3
6,1.8,Male,110,108,108,122,123,137
7,1.64,Female,24,25,29,33,33,62
8,1.45,Female,16,23,24,37,34,47
9,1.7,Female,34,62,42,65,44,77
The data are read into SAS and converted to the standard structure with one variable representing the
gender, the subject number, the height of the subject, the temperature of the water, and the time the subject
was able to hold their breath:
data breath;
   infile 'breath.csv' dlm=',' dsd missover firstobs=2;
   length gender $10.;
   input subject height gender temp0 temp5 temp10 temp15 temp20 air;
   /* convert to standard format */
   temp = 0; time=temp0;  output;
   temp = 5; time=temp5;  output;
   temp =10; time=temp10; output;
   temp =15; time=temp15; output;
   temp =20; time=temp20; output;
   temp =25; time=air;    output;  /* temperature in air */
   keep gender subject height temp time;
run;
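The same wide-to-long restructuring can be sketched outside SAS. The Python fragment below is purely illustrative (the course workflow is the SAS DATA step above); it converts one wide record into the six long records the analysis needs, with air coded as temperature 25:

```python
# Wide-to-long restructuring, mirroring the SAS DATA step above.
# One wide row per subject becomes 6 long rows (subject x temperature),
# with the ambient-air reading coded as temp = 25.
wide = {"subject": 1, "height": 1.78, "gender": "Male",
        "bht0": 25, "bht5": 45, "bht10": 50, "bht15": 58, "bht20": 67,
        "bhtair": 100}

temp_cols = [("bht0", 0), ("bht5", 5), ("bht10", 10),
             ("bht15", 15), ("bht20", 20), ("bhtair", 25)]

long_rows = [{"gender": wide["gender"], "subject": wide["subject"],
              "height": wide["height"], "temp": t, "time": wide[col]}
             for col, t in temp_cols]

for row in long_rows:
    print(row)
```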
Part of the raw data after restructuring is shown below:
Obs gender subject height temp time
1 Male 1 1.78 0 25
2 Male 1 1.78 5 45
3 Male 1 1.78 10 50
4 Male 1 1.78 15 58
5 Male 1 1.78 20 67
6 Male 1 1.78 25 100
Obs gender subject height temp time
7 Male 2 1.78 0 24
8 Male 2 1.78 5 33
9 Male 2 1.78 10 34
10 Male 2 1.78 15 34
12.2.2 Standard split-plot analysis
It is always a good idea to plot the data to check for outliers and unusual points and to see if a transformation
will be required. Here a plot shows the change in time to hold a breath as a function of water temperature
(with the ambient air temperature arbitrarily assigned a value of 25°C).
We use Proc SGPLOT to create a profile plot for each subject:

proc sgplot data=breath;
   title2 'profile plot';
   series x=temp y=time / group=subject lineattrs=(color=black) datalabel=subject;
run;
In general, the time to hold one's breath increases with increasing water temperature, and it is not evident
that males and females are all that different. But the plot shows two subjects with unusual results. Subject
6 seems to have an extraordinary ability to hold his/her breath: this subject's change over temperature is not
that different from the other subjects, but this subject appears to be an outlier within their gender's ability.
Subject 3 is most unusual in that the profile as temperature changes is clearly different from that of all the
other subjects.

In this case, it may be sensible to remove both subjects. Clearly subject 3 is different from all other
subjects, and including this subject will lead to increased variance in the estimates, making it more difficult
to detect effects. Subject 6 is an outlier only in the gender dimension; the profile over the different water
temperatures is similar in pattern to the other subjects, but inclusion of this subject will affect the ability to
detect gender effects.

As is the case with all outliers, it is suggested that you rerun the analysis with the outliers included
and with the outliers excluded to see if the results differ in a substantive fashion.
This code snippet removes the two outlier subjects.
data breath_nooutlier;
set breath;
if subject in (3,6) then delete;
run;
The revised prole plot is:
There don't seem to be any odd points other than random scatter.
There are two factors in this experiment: gender and temperature. Gender has two levels, male and
female. Temperature has 6 levels. Because all combinations of gender and temperature appear in the
experiment, the treatment structure is a factorial. This implies that all main effects and interactions can be
fit in the model.
There are two sizes of experimental units. Gender operates on subjects, and we assume that the subjects
used in this experiment are a random sample of people of each gender. We labelled the subjects from 1 to
12 rather than from 1 to 6 in each gender to avoid making the mistake of thinking that subject 1 of the
males is the same as subject 1 of the females. This use of individual labels for each experimental unit is
recommended, as outlined in earlier chapters. Temperature operates on parts of a subject's lifetime.
There are two types of randomizations that occur in this experiment. First, we assume that subjects
are a random sample from each gender. Second, we randomize the order of the temperature levels within
each subject, making sure that every subject has every temperature level and every temperature level occurs
in every subject. In this study, the order of temperatures was assigned in a randomized manner that was
balanced across the two genders. Each subject's order of temperatures was drawn from a hat, with one
hat being used for males and one hat being used for females.
Each subject was familiarized to breath holding at the control air temperature of 21°C. Also, to account
for any training effects, each subject had 3 successive breath hold trials at each face bath temperature. The
mean of these 3 trials for each volunteer at each water temperature is given in this data set. [1]
Between each trial, at each water temperature, enough time was given to allow the subject's face
temperature to return to a resting value.
If the order of the temperature levels was not randomized within each subject, this can lead to a more
complex analysis. First, there is a conceptual difficulty: if the same temperature order was used for all
subjects, then you cannot statistically distinguish the temperature effect from the time order effect. Perhaps
the time that subjects can hold their breath simply increases over time because of practice rather than because
of increasing temperature? There is also a subtle problem in the correlational structure when measurements
are taken in the same time order. Now it is possible that the residual errors follow what is known as an AR(1)
structure, where observations that are closer in time are more highly correlated than observations that
are further apart in time. Please see me for details on how to deal with this AR(1) structure.
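The AR(1) structure just described is easy to write down: the correlation between residuals k time steps apart is ρ^k, so it decays with lag. A minimal Python sketch (ρ = 0.6 is an arbitrary display value, not estimated from these data):

```python
# Sketch of an AR(1) correlation matrix: corr(e_i, e_j) = rho**|i - j|,
# so observations closer together in time are more highly correlated.
def ar1_corr_matrix(n_times, rho):
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]

# e.g. 5 temperature tests run in a fixed time order
R = ar1_corr_matrix(5, 0.6)
for row in R:
    print([round(x, 3) for x in row])
```

The diagonal is 1, the first off-diagonal is ρ, the next is ρ², and so on, which is the "closer in time means more correlated" pattern described above.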
This design is an example of a split-plot repeated-measures design and should not be analyzed as a
completely randomized design. The appropriate model (using the short hand syntax) is
Time = Gender Subject(Gender)(R) Temperature Gender*Temperature
where the term Time represents the response variable; the term Gender represents the effect of the two
genders; the term Subject(Gender)(R) represents the random effect of the main-plot experimental units to
which gender is applied; the term Temperature represents the effect of the different water (and air)
temperatures; and the term Gender*Temperature represents the interaction of the effects of temperature and
gender.
We fit this model using Proc Mixed:

proc mixed data=breath_nooutlier plots=all;
   title2 'Analysis after outliers removed';
   class gender temp subject;
   model time = gender | temp / ddfm=kr;
   random subject(gender);
   lsmeans gender*temp / diff cl;
   lsmeans gender / diff cl;
   lsmeans temp / diff cl adjust=tukey;
   ods output Tests3 = MixedTests;
   ods output LSmeans= MixedLSmeans;
   ods output diffs = MixedLSmeansDiffs;
   ods output covparms=MixedCovParms;
run;

[1] These replicated readings for each trial are an example of pseudo-replication (sub-sampling) as outlined in earlier chapters. As shown in those chapters, using the average time over the three pseudo-replicates is the appropriate way to deal with this problem.
The effect tests:

Effect        Num DF   Den DF   F Value   Pr > F
gender             1        8      2.11   0.1844
temp               5       40     14.38   <.0001
gender*temp        5       40      0.33   0.8935
show no evidence of an interaction between the effects of gender and temperature (p = 0.89). This implies
that there is no evidence that the profiles of the mean response for each gender are not parallel to each other,
i.e. no evidence that the effect of gender differs at the different temperatures and vice versa.

There is strong evidence of an effect of temperature (p < 0.001). This implies that the mean time
that subjects can hold their breath differs across the different water and air temperatures (when
averaged over the two genders), but at this point, we don't know where the mean time differs.

There is no evidence of a difference in the mean time subjects can hold their breath between the two
genders (p = 0.18) (when averaged over all temperatures).
Let us examine the profile plot of the mean response by gender and temperature. This is generated in SAS
by extracting the population marginal means (the lsmeans) for each combination of gender and temperature
and then plotting them:

data plotdata;
   set mixedlsmeans;
   if lowcase(gender)='male' then temp = temp - .3;  /* create a bit of jittering */
run;

proc sgplot data=Plotdata;
   title2 'Profile plot of estimated mean time to hold breath';
   where index(lowcase(effect),'gender')>0 and index(lowcase(effect),'temp')>0;
   series  x=temp y=estimate / group=gender lineattrs=(color=black) datalabel=gender;
   highlow x=temp low=lower high=upper / group=gender;
   xaxis offsetmin=.05 offsetmax=.05 label='Water Temperature (C)';
   yaxis label='Estimated mean time to hold breath (s)';
   footnote  'Bars are 95% confidence intervals';
   footnote2 'Plotting positions jittered slightly to avoid overplotting';
   footnote3 'Temperature = 25 = Ambient air';
run;
The 95% confidence intervals for the mean response for the males and females overlap considerably at
each temperature level, even though there appears to be a consistent gap between the males' and females'
mean responses; the overlap in the confidence intervals explains why no gender effect was detected.

There is a gradual increase over the temperature levels, so it is likely difficult to detect differences in the
mean response between neighboring temperature levels.

The curves are roughly parallel, which is an indication of no interaction between the effects of gender
and temperature.
Despite not detecting a gender effect, it is still useful to estimate the size of the gender effect.
This is obtained via the LSMEANS gender statement in the Proc Mixed commands seen earlier.

Effect    Estimate   Standard Error   Lower      Upper
gender    -9.4415    6.4997           -24.4300   5.5469

The estimated difference (averaged over all temperature levels) is about 9.4 (se 6.5) seconds.
We should also explore the effects of temperature more closely to see where differences are detected. A
Tukey multiple-comparison of the effects of temperature (averaged over both genders) can be found.

In SAS this is obtained via the LSMEANS temp statement in the Proc Mixed commands seen earlier. This
generates a large table showing all of the possible pair-wise comparisons between the temperature-level
means.
Effect   temp   _temp   Estimate   Standard Error   Adj P    Adj Low    Adj Upp
temp 0 5 -4.2958 4.6594 0.9385 -18.2378 9.6461
temp 0 10 -5.8792 4.6594 0.8035 -19.8211 8.0628
temp 0 15 -17.7954 4.6594 0.0057 -31.7373 -3.8535
temp 0 20 -15.9208 4.6594 0.0171 -29.8628 -1.9789
temp 0 25 -34.2917 4.6594 <.0001 -48.2336 -20.3497
temp 5 10 -1.5833 4.6594 0.9994 -15.5253 12.3586
temp 5 15 -13.4996 4.6594 0.0626 -27.4415 0.4423
temp 5 20 -11.6250 4.6594 0.1499 -25.5669 2.3169
temp 5 25 -29.9958 4.6594 <.0001 -43.9378 -16.0539
temp 10 15 -11.9163 4.6594 0.1320 -25.8582 2.0257
temp 10 20 -10.0417 4.6594 0.2811 -23.9836 3.9003
temp 10 25 -28.4125 4.6594 <.0001 -42.3544 -14.4706
temp 15 20 1.8746 4.6594 0.9985 -12.0673 15.8165
temp 15 25 -16.4963 4.6594 0.0123 -30.4382 -2.5543
temp 20 25 -18.3708 4.6594 0.0040 -32.3128 -4.4289
This largish table is difficult to interpret, but you can get all the pairwise comparisons with the estimated
difference in the means, the standard error of the estimated difference, and 95% confidence intervals for
the difference in means.
A better display is the joined-line plot that we saw in previous sections.

It is a bit cumbersome to get these in SAS, but there is a special macro available at the SAS website that
helps. We start by selecting out the population marginal means and differences for only the temperature
effect and then call the macro.

%include 'pdmix800.sas'; run;

data tempdiffs;
   set mixedlsmeansdiffs;
   if index(lowcase(effect),'gender')=0 and index(lowcase(effect),'temp')>0;
run;

data tempmeans;
   set mixedlsmeans;
   if index(lowcase(effect),'gender')=0 and index(lowcase(effect),'temp')>0;
run;

%pdmix800(tempdiffs, tempMeans, alpha=0.05, sort=yes);
This gives:
Effect=temp Method=Tukey-Kramer(P<0.05) Set=1
Obs   temp   Estimate   Standard Error   Alpha   Lower     Upper     Letter Group
1     25     62.9542    4.4280           0.05    53.8160   72.0923   A
2     15     46.4579    4.4280           0.05    37.3198   55.5960   B
3     20     44.5833    4.4280           0.05    35.4452   53.7215   B
4     10     34.5417    4.4280           0.05    25.4035   43.6798   BC
5      5     32.9583    4.4280           0.05    23.8202   42.0965   BC
6      0     28.6625    4.4280           0.05    19.5244   37.8006   C
The letter grouping A indicates that the mean response in the air (denoted by temp 25°C) appears to be
different from all other treatments, because no other temperature group also has the letter A. However, it is
very difficult to distinguish between the means for the actual water treatments. For example, the letter B
indicates that it is difficult to distinguish all but the mean for water temperature 0°C. Notice that because of
the overlapping of the letters B and C, the interpretation of the temperature effects is difficult; refer to the
main notes in single factor designs (the Cuckoo example) for more details.
So the final conclusions are:

- No evidence of a gender effect. The effects are suggestive, as the profile line for males is always above
  and parallel to that of females, but the noise in the data is large enough to hide any effect.
- Difficult to detect effects of the different water temperatures. Clear evidence that the mean time to hold
  one's breath while in water is different than in ambient air.
- No evidence of an interaction, i.e. the two profiles for the male and female mean responses could be
  parallel.
The diagnostic plots (residual plots) don't show any major problems. There is some evidence of
non-normality in the residuals, but it is not serious.
12.2.3 Adjusting for body size
Some of the noise in the data may be due to body size: males are generally larger than females, and there
is a large variation in body sizes in each gender. Perhaps larger people have more lung volume and this
influences how long they can hold their breath? Physiologists generally believe that pulmonary function is
more a function of height than weight.
We use the height variable as a covariate. The appropriate model (using the short hand syntax) is

Time = Gender Height Subject(Gender)(R) Temperature Gender*Temperature

where the additional term Height represents the adjustment for height.
We fit this model using Proc Mixed:

proc mixed data=breath_nooutlier plots=all;
   title2 'Analysis after outliers removed - Height covariate included';
   class gender temp subject;
   model time = height gender | temp / ddfm=kr;
   random subject(gender);
   lsmeans gender*temp / diff cl;
   lsmeans gender / diff cl;
   lsmeans temp / diff cl adjust=tukey;
   ods output Tests3 = MixedTestsHeight;
   ods output LSmeans= MixedLSmeansHeight;
   ods output diffs = MixedLSmeansDiffsHeight;
run;
The effect tests:

Effect        Num DF   Den DF   F Value   Pr > F
height             1        7      0.80   0.4004
gender             1        7      0.08   0.7803
temp               5       40     14.38   <.0001
gender*temp        5       40      0.33   0.8935
again show no evidence of an interaction between gender and temperature, and the gender effect is much
less significant than seen earlier.

Be careful not to fall into the trap of looking at the (non-significant) p-value associated with height and
thinking that height had no influence. Each of the effect tests is marginal to the other variables in the model.
Consequently, the test for a height effect is performed after adjusting for the gender (and temperature and
their interaction) effects, and the test for gender is performed after adjusting for the height (and temperature
and their interaction) effects. As height and gender are partially confounded (i.e. females tend to have
smaller heights than males), it is not surprising that both terms appear to be statistically not significant.
We can produce a profile plot adjusted for height differences; here the estimated mean time to hold your
breath is evaluated at the mean height seen in the data.

We see that adjusting for height removes much of the gender difference seen before. Of course, as height
is a good predictor of gender, it is not surprising that adjusting for height removes much of the gender
effect!
Additional tables to compare the different levels of temperature can be found in the same way as before.
12.2.4 Fitting a regression to temperature
If we ignore the ambient air test, there looks to be a steady progression in the mean time to hold one's
breath as temperature increases. Can we fit a line through these points, and is the relationship the same
for males and females?
It is tempting to fit a simple regression line to the data ignoring the split-plot structure of the data. This
will lead to incorrect inference, as the data values within each subject are not independent of each other.
We need to extend the simple regression model to account for the correlated observations within each
subject. This is known as the random intercept model. In this new model, we add a random subject effect
to the regression model. This random subject effect shifts the observations from the same subject up or
down across all temperatures and creates the correlation. For example, if a person is much better than
average at holding his/her breath, their response to different temperatures may be similar to another
person's, but the first person's responses are shifted upward or downward; this is similar to the response of
subject 6, which we discarded earlier.
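A small simulation makes this concrete. The Python sketch below (illustrative only; the actual fit is done in Proc Mixed) generates pairs of readings that share a subject shift, using variance components in the same ballpark as those seen in this chapter (subject about 81, residual about 100), and checks that the induced within-subject correlation is roughly 81/(81 + 100) ≈ 0.45:

```python
import random

# Random-intercept sketch: a shared per-subject shift b makes two readings
# on the same subject correlated, with correlation
#   sigma_subj^2 / (sigma_subj^2 + sigma_res^2) = 81 / 181, about 0.45.
random.seed(1)
sigma_subj, sigma_res = 9.0, 10.0  # SDs: sqrt(81) and sqrt(100)

pairs = []
for _ in range(50000):
    b = random.gauss(0, sigma_subj)        # subject's random intercept
    y1 = b + random.gauss(0, sigma_res)    # two readings on the same subject
    y2 = b + random.gauss(0, sigma_res)
    pairs.append((y1, y2))

n = len(pairs)
m1 = sum(p[0] for p in pairs) / n
m2 = sum(p[1] for p in pairs) / n
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n
v1 = sum((p[0] - m1) ** 2 for p in pairs) / n
v2 = sum((p[1] - m2) ** 2 for p in pairs) / n
icc = cov / (v1 * v2) ** 0.5
print(round(icc, 2))  # close to 0.45
```

The simulated correlation sits near the theoretical intraclass correlation, which is exactly the correlation that a naive simple regression would ignore.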
The appropriate model (using the short hand syntax) is
Time = Gender Subject(Gender)(R) Temperature(C) Gender*Temperature(C)
which is similar to the previous model except now that temperature is modelled as a continuous variable
rather than a categorical variable as in the previous analyses.
We continue to use the dataset with subjects 3 and 6 removed, and also exclude the ambient air conditions
as it is not comparable to the water treatments.
We fit this model using Proc Mixed:

proc mixed data=breath_nooutlier_noair plots=all;
   title2 'Analysis after outliers removed - Temp as continuous variable';
   class gender subject;
   model time = gender | temp / ddfm=kr;
   random subject(gender);
   estimate 'female  0' intercept 1 gender 1 0 temp  0 / cl;
   estimate 'female  5' intercept 1 gender 1 0 temp  5 / cl;
   estimate 'female 10' intercept 1 gender 1 0 temp 10 / cl;
   estimate 'female 15' intercept 1 gender 1 0 temp 15 / cl;
   estimate 'female 20' intercept 1 gender 1 0 temp 20 / cl;
   estimate 'male    0' intercept 1 gender 0 1 temp  0 / cl;
   estimate 'male    5' intercept 1 gender 0 1 temp  5 / cl;
   estimate 'male   10' intercept 1 gender 0 1 temp 10 / cl;
   estimate 'male   15' intercept 1 gender 0 1 temp 15 / cl;
   estimate 'male   20' intercept 1 gender 0 1 temp 20 / cl;
   ods output Tests3 = MixedTestsReg;
   ods output Estimates= Mixedestimatesreg;
run;
The Estimate statements extract the estimated means at each temperature level.

The effect tests:
Effect        Num DF   Den DF   F Value   Pr > F
gender             1     15.8      3.37   0.0854
temp               1       38     23.31   <.0001
temp*gender        1       38      0.36   0.5504
again show no evidence of an interaction, i.e. there is no evidence that the regression lines are not parallel.

We can produce a profile plot of the fitted line and the means from the previous analysis.

The line seems to be a reasonable fit and there is no evident non-parallelism. The weak effect of gender is
also seen because the 95% confidence intervals at each predicted mean on the lines overlap considerably.

We could also adjust for height as in the previous section, but this is not done here.
12.2.5 Planning for future studies
From this analysis you can also get information on the variability of subjects within each gender and the
random noise in the time to hold one's breath for each subject, which is needed to plan a study to determine
the appropriate sample size to detect effects.
In the case of balanced designs, i.e. the same number of subjects in each gender, and every subject tested
the same number of times, some simple on-line programs are available to help plan studies. For example,

Lenth, R. V. (2006-12). Java Applets for Power and Sample Size [Computer software]. Retrieved 2011-12-12 from http://www.stat.uiowa.edu/~rlenth/Power.

In the case of unbalanced designs, more complex methods are required; contact me for more information.
In previous chapters, we showed that there are four important pieces of information needed for a
power/sample size determination.

- Estimates of the effect sizes that are biologically important. This is often the hardest part of the
  planning exercise.
- Estimates of the variance components.
- The α level to use (typically α = 0.05).
- The required power (typically 80%) at α = 0.05.
Estimates of effect size. You need to determine what is biologically important to detect for both factors.
For example, how big a difference in the mean time of holding their breath between males and females is
biologically important? How big a difference in the mean time between the different water temperatures
is important to detect? This is often the hardest part of the study, as the question of what is biologically
important is often difficult to answer.
Suppose that after a long deliberation, you decide that it is important to detect a difference of about 10
seconds in the mean time to hold one's breath between the two genders, and that a difference of 20 seconds
between the time to hold one's breath at 20°C and at 0°C is also important.
Start by constructing a table of the approximate MEAN times to hold the breath that you would observe
in this experiment for each combination of the gender and temperature. For example, given the biologically
important differences noted above one such table is:
        Gender
Temp   Male   Female
 0      20      10
 5      25      15
10      30      20
15      35      25
20      40      30
Notice that the difference in each row between the mean for males and females is 10 seconds, and that
the difference in each column between the mean at 0°C and 20°C is 20 seconds. Because the differences
in each row and each column are consistent, there is NO interaction effect between the two factors. If the
differences were not consistent (i.e. interaction between the effects of gender and temperature existed), the
computations for the size of the effect of gender and temperature (below) would not change, but now the
effect size of the interaction would also need to be computed. Please see me for details.
Next, compute the row and column averages and find the STANDARD DEVIATION of the row and
column averages. In the above table, the column averages are 30 = (20 + 25 + 30 + 35 + 40)/5 and
20 = (10 + 15 + 20 + 25 + 30)/5. The STANDARD DEVIATION of the column averages is 7.1. This is
the gender effect.

Similarly, the row averages are 15, 20, 25, 30, 35, and the STANDARD DEVIATION of the row averages
is 7.9. This is the temperature effect.

These two standard deviations are the EFFECT SIZES needed in a power analysis. Note that different
computer packages will compute an EFFECT SIZE in different ways, but they will all give you the same
estimates of power/sample size. In most cases, start with the table of means as shown above.
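The arithmetic above can be verified with a few lines of Python (an illustrative check; the planning itself is done with SAS or Lenth's applets):

```python
from statistics import stdev

# Table of hypothesized means from the text: one list per gender, in the
# temperature order 0, 5, 10, 15, 20.
means = {
    "Male":   [20, 25, 30, 35, 40],
    "Female": [10, 15, 20, 25, 30],
}

# Gender effect: SD of the two column (gender) averages, [30, 20].
col_avgs = [sum(v) / len(v) for v in means.values()]
gender_effect = stdev(col_avgs)          # about 7.1

# Temperature effect: SD of the five row (temperature) averages,
# [15, 20, 25, 30, 35].
row_avgs = [(m + f) / 2 for m, f in zip(means["Male"], means["Female"])]
temp_effect = stdev(row_avgs)            # about 7.9

print(round(gender_effect, 1), round(temp_effect, 1))
```

These are exactly the 7.1 and 7.9 entered into Lenth's power dialog below.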
Estimates of the variance components. Estimates of the variance components can be found by looking
at past studies (for example, the dataset used above) or educated guesses. From the first analysis (after
removing the outliers), the subject variance component estimate is around 81 and the residual variance is
about 100 (both in seconds squared), as shown in the table below:

Cov Parm          Estimate
subject(gender)   84.0240
Residual          104.21
Choice of α level. This is usually set at α = 0.05 or α = 0.10.

Level of power desired. There are no hard and fast rules, but the usual rules of thumb are to aim for an
80% power at α = 0.05 and a 90% power at α = 0.10.
Using Lenth's routines. We are set to try the power program from Lenth. Visit
http://www.stat.uiowa.edu/~rlenth/Power and select the Balanced ANOVA option.

This allows you to select from some pre-determined experimental designs (including a split-plot with main
plots in blocks) or your own model. We will use the model for the first experiment. Enter the model (using
the simplified syntax) but with + signs between the terms. The terms can be in any order.

Start by specifying a starting point for the size of the design. I started with the number of levels of gender
at 2; the number of levels of temperature at 5; and the number of subjects IN EACH GENDER at 12. These
values can be changed later. Click on the F tests button.
This brings up the power computation box with default values for the effect sizes and variance
components that we will need to modify.

Start by specifying the effect sizes for the (fixed) effects of gender, temperature, and gender*temperature.
Change these values to 7.1 (for the gender effect), 7.9 (for the temperature effect), and 0 (for the
gender*temp effect). You can either use the slider or click on the small box and enter the values directly.
Next, specify the STANDARD DEVIATIONS for the variance components for subject(gender) and the
residual. These are √81 = 9 and √100 = 10 respectively.
The power to detect each effect is listed on the right side. With 12 subjects/gender (for a total of 24
subjects), the power to detect a difference in the means of 10 seconds between the two genders is about
65%; the power to detect a difference in the means of 20 seconds among the temperature levels is over
99%; and the power to detect an interaction is 5%. [Of course, we set up the planning assuming that there
was no interaction, so this 5% is just the false positive rate.] You would not examine the power for the
subject(gender) term.
You can move the slider on the number of subjects/gender until the power to detect the 10 second
difference in the means between the two genders is at least 80%. This occurs at 17 subjects/gender.

The power analysis made a number of guesses as to the likely size of the effects that are biologically
important and the size of the variance components. You should try various settings to see how sensitive the
results are to your choices.
In this case, the limiting feature of the design was the ability to detect a gender effect. It is often the case
that the factor at the upper level of a split-plot design has the lowest power. Intuitively, you only have 17 × 2 = 34
subjects that provide information on the gender differences. The multiple readings within each subject don't
provide any information on the gender differences. However, you have 17 × 2 × 5 = 170 measurements for
detecting temperature effects and so have a greater power to detect these effects.
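The intuition above can be turned into a rough power computation. The sketch below is my own stdlib-only Python approximation (not part of the notes, and not the SAS/Stroup computation): each subject's mean over the 5 temperature levels has variance σ²subject + σ²residual/5 = 81 + 100/5 = 101, and the two-group comparison is approximated with a z-test in place of the exact t-test, which makes the answer slightly optimistic for small n.

```python
from statistics import NormalDist

def wholeplot_power(n_per_group, diff=10, var_subject=81, var_resid=100,
                    n_subplots=5, alpha=0.05):
    """Approximate power for the whole-plot (gender) comparison.

    Each subject's mean over the n_subplots temperature readings has
    variance var_subject + var_resid/n_subplots; the difference of the
    two group means is then tested with a z-approximation."""
    var_subj_mean = var_subject + var_resid / n_subplots  # 81 + 100/5 = 101
    se_diff = (2 * var_subj_mean / n_per_group) ** 0.5    # SE of the mean difference
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # ignore the negligible lower-tail rejection probability
    return z.cdf(diff / se_diff - z_crit)

for n in (12, 17):
    print(n, round(wholeplot_power(n), 3))
```

For 12 subjects/gender this gives roughly 0.68, a little above the exact SAS value of 0.644 because the z-approximation ignores the t-distribution's degrees of freedom; for 17 subjects/gender it gives roughly 0.83, consistent with the "at least 80%" target.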
A summary of the power computations (computed by SAS) is
subjects_per_gender   gender   gender*temp    temp
                       Power         Power   Power
         4             0.221         0.050   0.919
         6             0.345         0.050   0.993
         8             0.458         0.050   1.000
        10             0.558         0.050   1.000
        12             0.644         0.050   1.000
        14             0.717         0.050   1.000
        16             0.777         0.050   1.000
        18             0.826         0.050   1.000
        20             0.866         0.050   1.000
        22             0.897         0.050   1.000
        24             0.921         0.050   1.000
        26             0.940         0.050   1.000
        28             0.955         0.050   1.000
        30             0.966         0.050   1.000
and a plot of the power as a function of the sample size/gender is:
Lenth's routines can only be used for balanced, simple designs. The SAS program for this example
shows how Stroup's method can be used for split-plot designs (indeed, for any design).
Are there alternate designs that might be of interest? In this case, a large number of subjects is needed
to detect the gender effect, but the required number of subjects is overkill to detect the temperature effect.
Perhaps the cost of the experiment can be reduced by not requiring all subjects to do all temperature tests.
For example, some subjects could do the 1st, 3rd, and 5th temperature levels; other subjects do the 2nd, 4th,
and 5th temperature levels, etc. This would be known as a split-plot with an incomplete block design at the
subplot level. Planning such a study and its analysis is beyond the scope of this course; see me for details.
12.3 Example - Systolic blood pressure before presyncope - BPK
12.3.1 Experimental protocol
The data for this experiment were provided by Claire Protheroe, an M.Sc. candidate in BPK at SFU.
Fifteen subjects took part in an experiment to measure their orthostatic tolerance, the time the individual
can stand still and regulate their blood pressure until presyncope (the symptoms experienced before a faint).
During presyncope patients experience light-headedness, muscular weakness, and feeling faint (as opposed
to a syncope, which is actually fainting). In many patients, lightheadedness is a symptom of orthostatic
hypotension, which occurs when blood pressure drops significantly, such as when the patient stands from a
supine or sitting position.
Each subject was measured on a tilt test three times: once with a compression stocking, once with a
placebo stocking, and once with a different placebo stocking. The subjects were randomized to stocking
conditions on three different days.
For each test, the subject underwent a 20 minute supine (lying on the back) period, followed by
a 20 minute tilt period, followed by a 10 minute period of −20 mmHg of lower body negative pressure
(LBNP), a 10 minute period of −40 mmHg LBNP, followed by a 10 minute period of −60 mmHg LBNP.
However, not all patients made it through the entire test before reaching the pre-syncope stage. As noted at
http://advan.physiology.org/content/31/1/76.full
During LBNP, participants lie in a supine position with their legs sealed in a LBNP chamber at
the level of the iliac crest. Air pressure inside the chamber is reduced by a vacuum pump, making
the pressure inside the chamber less than atmospheric pressure. This causes blood to shift
from an area of relatively high pressure (i.e., the upper body, which is outside the chamber)
toward an area of relatively low pressure (i.e., the legs inside the chamber). Without physiological
compensations, blood is shunted away from the thoracic cavity and ultimately pools in the lower
limbs and the lower abdomen. Normally, the body compensates by peripheral vasoconstriction
and an increase in heart rate, which serve to maintain normal circulation. Inadequate physiological
compensations in response to increasing negative pressure results in falling arterial blood
pressure and, ultimately, syncope.
Systolic blood pressure was measured every 2 minutes for 20 minutes during the supine phase; again
measured every 2 minutes during the tilt phase; and finally every 2 minutes during the LBNP phases until
the patient ended the trial.
The raw data are available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.
Here is a (very small) snippet of the raw data for two patients.
Condition Time Placebo 1 Experimental Placebo 2
Subject 23395 29902 23395 29902 23395 29902
Supine 2 136 87 114 106 137 111
4 129 81 114 94 128 97
6 125 84 114 103 118 98
8 132 81 115 104 124 99
10 125 76 119 101 118 98
12 131 89 121 107 120 97
14 128 86 118 106 110 113
16 124 103 120 109 123 119
18 134 107 122 104 125 114
20 132 94 124 112 134 109
Tilt 22 129 105 113 106 117 103
24 130 111 113 116 125 100
26 130 89 112 103 115 93
28 145 89 126 99 131 98
30 135 89 142 103 127 103
32 128 102 133 100 117 107
34 124 101 143 89 122 103
36 119 93 129 88 124 102
38 79 92 129 114 113 107
40 102 89 131 106 128 108
-20mmHg 42 98 123 100 120 107
44 128 102 127 97 114 100
46 149 88 123 91 116 100
48 152 90 133 93 111 105
50 158 96 127 86 114 93
-40mmHg 52 153 104 128 95 92
54 149 97 126 89 94
56 147 101
58 143
60 145
-60mmHg 62 147
64
66
68
70
Presyncope Time 74.3 52.3 68.6 76.2 54.8 70.0
Patient 23395 started off with a systolic blood pressure (SBP) of 136 mmHg at the end of the second
minute while supine wearing the Placebo 1 stocking, and the blood pressure varied over the next 18 minutes,
with the final blood pressure (at minute 20) of 132 mmHg. Then the patient was tilted. At minute 22 (2
minutes into the tilt) the SBP was 129 and at minute 40 the SBP was 102. The LBNP was applied. For
some reason, the blood pressure was missing for this patient at minute 42, but the blood pressure increased
and ended at 158 mmHg at minute 50. The LBNP was increased and at minute 60 the SBP was 145 mmHg.
The LBNP was again increased. At minute 62 the SBP was 147. At this point blood pressure readings were
terminated. At minute 74.3, the patient experienced presyncope.
This patient underwent similar testing under the Experimental and Placebo 2 conditions.
Patient 29902 underwent a similar protocol, but the SBP measurements terminated at minute 54 under the
Placebo 1 condition. This patient experienced presyncope at 52.2 minutes.
Whew, the data set is quite large, with over 1000 values.
The hypotheses of interest are:
- Is there a difference in the mean time to presyncope between the different treatments?
- Is there a difference in the mean blood pressure between the different phases?
The analysis of the time to presyncope was done in a previous chapter; now we will look at the changes
in systolic blood pressure under the different phases and treatments.
12.3.2 Analysis
In this part of the experiment, changes in the mean systolic blood pressure in response to the different
stockings and the different phases (supine, tilt, −20 mmHg, −40 mmHg, and −60 mmHg) are of interest.
This is now a two-factor experiment with the two factors being treatment (the different stockings, 3
levels) and phase (5 levels). However, the experiment is not a simple completely randomized design; it is a
variant of a split-plot design.
In split-plot designs, there are two different sizes of experimental units. Here the concept of an experimental
unit is not clear; it is easiest to think of the experimental units as days or minutes. The treatments
(stockings) are applied at the day level within each subject's visits. Subjects serve as blocks for this factor.
Then within a particular day, the different phases are applied on a minute-by-minute basis. Notice that the
phases are always applied in the same order. This lack of randomization can lead to some subtle problems
in the analysis which we shall ignore for now.²
Also notice that the blood pressure within each phase is measured multiple times (at two minute intervals).
These are known as pseudo-replicates (or sub-samples) and cannot be treated as being independent.
² The problem is that the residuals from the phase effects may be correlated over time and a more specialized covariance structure,
the AR(1) covariance model, may be appropriate.
As shown in previous chapters, there are two ways to deal with the pseudo-replication: analyze the averages
over the pseudo-replicates, or deal with the individual observations using a more complex model. Not every
subject had the same number of sub-samples taken, and so the analysis on the averages will not be exactly
the same as the analysis on the individual measurements, but should be close enough.
Not all subjects finished all phases. We will be making the implicit assumption that this missing data is
MCAR (Missing Completely At Random), i.e. the probability of missingness is unrelated to the response or
any other covariate. Under MCAR, the missingness doesn't cause any great problem in this analysis; all
that happens is that the standard error of some comparisons is increased because the effective sample size
for comparisons over missing data has been reduced. We would be very worried if the missingness is not
MCAR; for example, suppose that people with low blood pressure to begin with are less likely to complete
all phases. In this case, only those people who tend to have higher blood pressure will make it to the final
phases, and so the estimates involving these later phases will be biased by having a non-random sample of
subjects participating in the comparison. There is no statistical way to check for MCAR, and this must be
assessed based on an intimate biological knowledge of the experiment.
The raw data are available in the SystolicBloodPressure.csv file in the Sample Program Library available
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual way:
proc import out=bp datafile='SystolicBloodPressure.csv' dbms=csv replace;
run;
and part of the raw data is shown below:
Obs treatment time phase subject sys_bp
1 PLACEBO CONDITION 1 2 supine 23395 135.7884857
2 PLACEBO CONDITION 1 2 supine 29902 86.87814118
3 PLACEBO CONDITION 1 2 supine 21636 106.6719455
4 PLACEBO CONDITION 1 2 supine 42003 82.29945128
5 PLACEBO CONDITION 1 2 supine 41487 111.3281129
6 PLACEBO CONDITION 1 2 supine 58684 102.27844
7 PLACEBO CONDITION 1 2 supine 55680 115.1779484
8 PLACEBO CONDITION 1 2 supine 16614 129.6388889
9 PLACEBO CONDITION 1 2 supine 86984 117.40625
10 PLACEBO CONDITION 1 2 supine 16749 118.8835923
A tabulation of the number of measurements in each combination of stocking and phase (not shown)
reveals that there was only 1 measurement taken at the −60 mmHg phase over all subjects and treatments,
and so this phase will be ignored in subsequent analyses. Virtually all subjects were measured 10 times in
the supine and tilt positions, but the number of measurements at the −20 and −40 mmHg phases varies by
patient and stocking.
We begin by taking averages over the pseudo-replicates:
We use Proc Means in SAS to average over the pseudo-replicates.
data bp2;
set bp;
if index(phase,'-60')>0 then delete; /* discard this phase */
run;
proc sort data=bp2;
by subject treatment phase;
run;
proc means data=bp2 noprint;
by subject treatment phase;
var sys_bp;
output out=mean_bp mean=mean_sys_bp;
run;
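As a cross-check on what the delete-and-average step does, here is a hypothetical Python sketch of the same operation (the column names follow the SAS listing above; the blood pressure values are made up for illustration):

```python
from collections import defaultdict

# records: (subject, treatment, phase, sys_bp) tuples, as if read from
# SystolicBloodPressure.csv; a few made-up values for illustration
records = [
    (23395, "PLACEBO 1", "supine", 136), (23395, "PLACEBO 1", "supine", 129),
    (23395, "PLACEBO 1", "tilt",   129), (23395, "PLACEBO 1", "tilt",   130),
    (29902, "PLACEBO 1", "supine",  87), (29902, "PLACEBO 1", "supine",  81),
]

# discard the -60 mmHg phase, then collect the pseudo-replicates
groups = defaultdict(list)
for subject, treatment, phase, bp in records:
    if "-60" in phase:
        continue
    groups[(subject, treatment, phase)].append(bp)

# one average per subject-treatment-phase combination
mean_bp = {key: sum(v) / len(v) for key, v in groups.items()}
print(mean_bp[(23395, "PLACEBO 1", "supine")])   # 132.5
```

The resulting table of averages plays the role of the mean_bp data set produced by Proc Means: one row per subject, treatment, and phase.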
A dot and profile plot of the mean responses over the 3 × 4 combinations of treatment (stockings) and
phase does not show any obvious outliers, and the assumption of parallelism within subjects appears to be
satisfied.
proc sgplot data=mean_bp;
title2 'Profile and dot plot of mean responses';
scatter x=treatment y=mean_sys_bp / markerattrs=(symbol=circlefilled);
series x=treatment y=mean_sys_bp / group=subject lineattrs=(color=black pattern=1);
run;
which gives:
The statistical model for this experiment is a split-plot design with main plots (the days when a stocking
is worn) in blocks:
Y = Subject(R) Treatment Subject*Treatment(R) Phase Phase*Treatment
where the Subject(R) term represents the blocking by subjects; the Treatment term represents the effect of the
different stockings; the Subject*Treatment(R) term represents the experimental unit for the treatment factor
levels (the stockings) and NOT an interaction between subject and treatment³; the Phase term represents the
effect of the different phases within a day; and the Phase*Treatment term represents the interaction effects
between the two factors.
³ A key assumption of blocking is no interaction between the block and treatment effects. The Subject*Treatment notation is only a
way to reference the date when a stocking is worn by a subject.
This model can then be fit in the usual way.
In SAS we use Proc Mixed:
ods graphics on;
proc mixed data=mean_bp plots=all;
title2 'Split-plot with main plots in blocks';
class subject treatment phase;
model mean_sys_bp = treatment | phase / ddfm=kr;
random subject subject*treatment;
lsmeans phase / diff cl adjust=tukey;
ods output lsmeans=lsmeans;
ods output tests3 =tests;
ods output diffs =lsmeansdiffs;
run;
ods graphics off;
which gives the following test statistics:
Effect            Num DF   Den DF   F Value   Pr > F
treatment            2      30.7      0.87    0.4288
phase                3      90.6     15.63    <.0001
treatment*phase      6      91.5      1.72    0.1249
There is no evidence of a treatment (stocking) effect (p = 0.43), nor of a treatment-phase interaction
(p = 0.13). There is strong evidence of a difference in the means among the different phases (p < 0.0001).
To investigate where the differences in the mean systolic blood pressure exist among the different phases,
we use a Tukey multiple-comparison procedure:
Effect   phase    _phase    Estimate   Standard Error    Adj P       Adj Low    Adj Upp
phase    -20mm    -40mm       5.5693           2.2408    0.0690      -0.2956    11.4342
phase    -20mm    supine     -8.1784           1.7256    <.0001     -12.6951    -3.6618
phase    -20mm    tilt       -4.4170           1.7256    0.0577      -8.9336    0.09970
phase    -40mm    supine    -13.7477           2.1920    <.0001     -19.4849    -8.0106
phase    -40mm    tilt       -9.9863           2.1920    <.0001     -15.7235    -4.2491
phase    supine   tilt        3.7614           1.5486    0.0788      -0.2918     7.8147
It might be easier to interpret the results from the Joined-Line plots that we've seen earlier:
We use the pdmix800 macro with the output from the Proc Mixed procedure.
%include 'pdmix800.sas'; run;
data tempdiffs;
set lsmeansdiffs;
if index(lowcase(effect),'phase')>0 and index(lowcase(effect),'treatment')=0;
run;
data tempmeans;
set lsmeans;
if index(lowcase(effect),'phase')>0 and index(lowcase(effect),'treatment')=0;
run;
%pdmix800(tempdiffs, tempMeans, alpha=0.05, sort=yes);
which gives:
Effect=phase Method=Tukey-Kramer(P<0.05) Set=1
Obs phase Estimate Standard Error Alpha Lower Upper Letter Group
1 supine 116.30 2.7745 0.05 110.45 122.15 A
2 tilt 112.54 2.7745 0.05 106.69 118.39 AB
3 -20mm 108.12 2.8770 0.05 102.12 114.13 BC
4 -40mm 102.55 3.1787 0.05 96.0472 109.06 C
The estimated marginal means (the average at each phase when averaged over all subjects and treatments
(stockings)) move in the expected direction (i.e. the mean systolic blood pressure decreases as the experiment
progresses), but the change is too small to be detectable except at the extreme ends of the experiment.
Diagnostic plots don't show any obvious problems.
The analysis on the individual measurements within each phase, using a more complex model, comes to
the same conclusions.
12.3.3 Power and sample size determination
The determination of an appropriate sample size is more complex because of the need to account for the
pseudo-replication. For example, the design choices involve both the number of subjects for each treatment
(the stocking) and the number of minutes each phase should be measured.
If you are willing to stick with about 10 repeated measurements in each phase, then the information from
the analysis on the averages can be used to construct a power analysis to determine the number of subjects
needed, in the same fashion as the "How long can you hold your breath" example. Consult those notes for
details.
In order to take into account both aspects of planning, you will need information on the variance
components:
- The Subject*Treatment variance term, which is the experimental-unit (day) error variation.
- The minute-to-minute variation term, which represents how the individual minutes vary among the repeated
measurements within a phase.
Both are available from the more complex model (i.e. using all of the data).
The actual power analysis is beyond the scope of this course; please contact me for details.
This is the end of the chapter
Chapter 13
Analysis of BACI experiments
Contents
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
13.2 Before-After Experiments - prelude to BACI designs . . . . . . . . . . . . . . . . . . 714
13.2.1 Analysis of stream 1 - yearly averages . . . . . . . . . . . . . . . . . . . . . . . 717
13.2.2 Analysis of Stream 1 - individual values . . . . . . . . . . . . . . . . . . . . . . 719
13.2.3 Analysis of all streams - yearly averages . . . . . . . . . . . . . . . . . . . . . . 721
13.2.4 Analysis of all streams - individual values . . . . . . . . . . . . . . . . . . . . . 724
13.3 Simple BACI - One year before/after; one site impact; one site control . . . . . . . . . 726
13.4 Example: Change in density in crabs near a power plant - one year before/after; one
site impact; one site control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
13.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
13.5 Simple BACI design - limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
13.6 BACI with Multiple sites; One year before/after . . . . . . . . . . . . . . . . . . . . . 737
13.7 Example: Density of crabs - BACI with Multiple sites; One year before/after . . . . . 739
13.7.1 Converting to an analysis of differences . . . . . . . . . . . . . . . . . . . . . . . 741
13.7.2 Using ANOVA on the averages . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
13.7.3 Using ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
13.7.4 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
13.8 BACI with Multiple sites; Multiple years before/after . . . . . . . . . . . . . . . . . . 752
13.9 Example: Counting sh - Multiple years before/after; One site impact; one site control 754
13.9.1 Analysis of the differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
13.9.2 ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
13.9.3 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
13.10 Example: Counting chironomids - Paired BACI - Multiple-years B/A; One Site I/C . 764
13.10.1 Analysis of the differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
13.10.2 ANOVA on the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
13.10.3 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
13.11 Example: Fry monitoring - BACI with Multiple sites; Multiple years before/after . . 771
13.11.1 A brief digression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
13.11.2 Some preliminary plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775
13.11.3 Analysis of the averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
13.11.4 Analysis of the raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
13.11.5 Power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
13.12 Closing remarks about the analysis of BACI designs . . . . . . . . . . . . . . . . . . 787
13.13 BACI designs power analysis and sample size determination . . . . . . . . . . . . . . 788
13.13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
13.13.2 Power: Before-After design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
13.13.3 Power: Simple BACI design - one site control/impact; one year before/after; inde-
pendent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
13.13.4 Power: Multiple sites in control/impact; one year before/after; independent samples 803
13.13.5 Power: One sites in control/impact; multiple years before/after; no subsampling . . 808
13.13.6 Power: General BACI: Multiple sites in control/impact; multiple years before/after;
subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
13.1 Introduction
Environmental-impact assessment is often required as part of large projects. A very common design is
the Before-After-Control-Impact (BACI) design. In this design, measurements are taken at the treatment
(impacted) site and at a control site both before and after the impact occurs.
This design is preferred over a simple Before-After comparison as a change in the response may occur
independently of any impact because of temporal effects. For example, precipitation levels may change
between the before and after periods and the response may be related to precipitation rather than the impact.
Or measurement devices may improve over time and the observed difference is simply an artefact of the
measurement process.
By establishing a control site (where presumably no effect of the impact will be felt), the temporal change
that occurs in the absence of the impact can be measured. Then the differential change over time, i.e. the
change at the impact site minus the change at the control site, is evidence of an environmental impact.
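This differential change is the BACI contrast. As a tiny numerical sketch (the mean values below are made up for illustration, not data from this chapter):

```python
# Illustrative mean responses at each site in each period
mean_before_impact, mean_after_impact = 50.0, 38.0
mean_before_control, mean_after_control = 48.0, 44.0

# Change at each site between the before and after periods
change_impact = mean_after_impact - mean_before_impact     # -12.0
change_control = mean_after_control - mean_before_control  #  -4.0

# BACI contrast: the change at the impact site beyond the temporal
# change seen at the control site
baci_contrast = change_impact - change_control
print(baci_contrast)   # -8.0
```

Both sites declined, but the impact site declined by 8 units more than the control site; it is this difference of differences (with its standard error) that the BACI analyses in this chapter estimate.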
There are many variants of the BACI design; this chapter will review some of the most common. This
chapter looks at detecting changes in the mean, but similar designs can be used to detect changes in slopes or
proportions or any other response measure. The analysis of these more complex experiments can be difficult;
please contact me for details.
A key assumption of BACI designs is that the system is in equilibrium before and after the impact.
We assume that the overall mean before and after the impact is stable and that any change in the mean is
quick and immediate. Schematically, we are assuming a system like:
where there is a step change in the mean immediately after the impact.
Consequently, BACI designs are good for:
- Detecting large changes after impact.
- Detecting permanent changes after impact.
- Monitoring to protect against disasters.
- Monitoring for changes in the MEAN.
BACI designs are poor for:
- Detecting small potential changes after impact.
- Detecting gradual changes after impact (i.e. not a step change).
- Long-term monitoring.
- Monitoring for changes in VARIABILITY.
It is important to recognize in advance that there will be some change over time and because of the
impact. It is simply not scientifically defensible to believe that the mean response will be identical to 20
decimal places over time or in the absence of an impact. Rather, the key question is more than simply
hypothesis testing for evidence of any change: regardless of the final p-value from the statistical hypothesis
test, an estimate of the effect size, i.e. an estimate of what is known as the BACI contrast, is required along
with a measure of precision. Then this can be examined to see if the effect is detected and if it is biologically
important. A conceptual diagram of the possible outcomes from the analysis of an experiment is:
In the left three experiments, the effects are all statistically significant (i.e. the p-value will be less than
the α level, usually 0.05). However, the first outcome implies that the effect is also biologically important.
In the second outcome, the biological importance is in doubt. In this case, more data need to be collected
or precautionary management actions should be implemented. In the third outcome, despite the effect being
statistically significant, the actual effect is small and not of biological importance.
Conversely, in the right two experiments, the effect is not statistically significant. However, the fourth
case really is no different than the third case. In the fourth case, the effect was not statistically significant,
and the 95% confidence intervals indicate that even if the effect was different from zero, it is small. In both
the third and fourth cases, no management action is needed. In the last experiment, the estimated effect size
has such poor precision (i.e. a large standard error and wide confidence limits) that nothing can be said about
the outcome. The failure of this experiment can be due to inadequate sample size, excessive variation in the
data that wasn't expected, or other reasons, but it really provides no information on how to proceed.
The most difficult part of the above conceptual diagram is the determination of the biologically important
effect size! This is NOT a trivial exercise. For example, if you are looking at the impact of warm water from
a power plant upon local crab populations, how much of a change in density is biologically important?
An important part of the study design is the question of power and sample size determination, i.e. how
many samples or how many years of monitoring are required to detect the biologically important difference
as noted above? Far too many studies have inadequate sample sizes and end up with results as in the far-right
entry in the previous diagram. The proper time for a determination of power and sample size is at the start of
the study; the practice of retrospective power analysis provides no new information (see the chapter on the
introduction to ANOVA for details). A key (and most difficult) part of the power/sample size determination
is the question of biologically important effect sizes.
Nowadays, computer packages are used to analyze these complex designs. As noted in other chapters,
simple spreadsheet programs (such as Excel) are NOT recommended. With the use of sophisticated packages
such as JMP or SAS, care needs to be taken to ensure that the analysis requested matches the actual design of
the experiment. Always draw a diagram of the experimental protocol. Make sure that your brain is engaged
before putting the package in gear!
A review article on the design and analysis of BACI designs is available at:
Smith, E. P. (2002).
BACI design.
Encyclopedia of Environmetrics, 1, 141-148.
http://www.web-e.stat.vt.edu/vining/smith/B001-_o.pdf
The standard reference for the analysis of complex BACI designs is:
Underwood, A. J. (1993).
The mechanics of spatially replicated sampling programs to detect environmental impacts in a
variable world.
Australian Journal of Ecology 18, 99-116.
http://dx.doi.org/10.1111/j.1442-9993.1993.tb00437.x
In this chapter, we will look at the design and analysis of four standard BACI designs:
1. Single impact site; single control site; one year before; one year after.
2. Single/multiple impact site; multiple control sites; one year before; one year after.
3. Single impact site; single control site; multiple years before; multiple years after.
4. Single impact site; multiple control sites; multiple years before; multiple years after.
13.2 Before-After Experiments - prelude to BACI designs
As a prelude to the analysis of BACI designs, let us consider a simple before-after experiment, where
measurements are taken over time both before and then after an environmental impact.
For example, consider an experiment where two streams were measured at a small hydroelectric power
project. At each stream, multiple quadrats were taken on the stream, with a new set of quadrats chosen each
year, and the number of invertebrates was counted in each quadrat. After 5 years, the hydroelectric plant
started up, and an additional 5 years of data were collected. Is there evidence of a change in mean abundance
after the plant starts?
Note that this is NOT a BACI design, and is a VERY poor substitute for such. The key problem is
that changes may have occurred over time unrelated to the impact, so a change may simply be natural
(e.g. unrelated to the hydro project). A nice discussion of the perils of using a simple before-after design in
non-ecological contexts is found at http://ssmon.chb.kth.se/safebk/Chp_3.pdf.
Also, these designs typically involve a long time-series. As such, problems related to autocorrelation can
creep in. Refer to the chapter on the analysis of trend data for more details. In this section, we will ignore
autocorrelation, which can be a serious problem in long-term studies.
This data will be analyzed several ways. First, we will look at only stream 1, and analyze the yearly
averages or the individual data. In these cases, inference is limited to that particular stream, and you cannot
generalize to other streams in the study. If the same number of quadrats was measured in all years, the two
analyses will give identical results.
Second, we will analyze both streams together and analyze the yearly averages or the individual values.
Now, because of the multiple streams, the results can be generalized to all streams in the project (assuming
that the two streams were selected at random from all potentially affected streams). Again, if the number of
quadrats was the same for all streams in all years, then the analysis of the yearly averages and the individual
values will be identical.
Finally, we will demonstrate how to compute a power analysis of these designs to see how many years
of monitoring would be required. The limiting feature for these designs is the year-to-year variation and no
amount of subsampling (i.e. sampling more quadrats each year) will reduce this variation.
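The reason subsampling cannot substitute for more years can be sketched numerically. A yearly average based on m quadrats has variance σ²year + σ²quadrat/m, so more quadrats shrink only the second term (the variance components below are illustrative values, not estimates from this study):

```python
def var_yearly_mean(var_year, var_quadrat, m):
    """Variance of a yearly average based on m quadrats per year:
    the year-to-year component plus the quadrat component divided by m."""
    return var_year + var_quadrat / m

# year-to-year and quadrat-to-quadrat variance components (illustrative)
var_year, var_quadrat = 25.0, 50.0

for m in (5, 20, 1000):
    print(m, var_yearly_mean(var_year, var_quadrat, m))
```

Going from 5 to 20 quadrats per year drops the variance from 35.0 to 27.5, but even 1000 quadrats leave it at 25.05: the variance can never fall below the year-to-year component, so only more years of monitoring can reduce it further.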
The raw data are available in the invert.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual way:
data invert;
infile 'invert.csv' dlm=',' dsd missover firstobs=2;
length period $10;
input year period stream quadrat count;
run;
Part of the raw data are shown below:
Obs period year stream quadrat count
1 Before 1 1 1 58
2 Before 1 1 2 62
3 Before 1 1 3 48
4 Before 1 1 4 72
5 Before 1 1 5 41
6 Before 2 1 1 61
7 Before 2 1 2 48
8 Before 2 1 3 60
9 Before 2 1 4 55
10 Before 2 1 5 57
We begin by computing some basic summary statistics in the usual fashion:

                        stream 1               stream 2
  year   period       N   Mean    Std        N   Mean    Std
    1    Before       5   56.2   12.1        4   71.0    4.4
    2    Before       5   56.2    5.2        5   60.0    6.2
    3    Before       4   49.8    7.9        5   56.6    7.7
    4    Before       3   50.0    8.9        5   59.2   12.1
    5    Before       4   41.3    3.8        4   52.3   12.9
    6    After        5   68.8    9.8        5   70.6    8.7
    7    After        5   52.0    7.3        5   61.2    5.7
    8    After        5   58.8    6.8        3   73.0   14.0
    9    After        5   67.8    4.9        5   80.6    8.2
   10    After        4   56.3    6.4        4   64.3    7.2
The standard deviations appear to be approximately equal over time, and a plot of ln(s) vs ln(Ȳ)
doesn't show any evidence of a relationship between the mean and the variance, which is often found when
dealing with count data, especially if the counts are smallish.
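The same check can be done numerically. Here is a sketch (Python for illustration; the text itself uses SAS) that regresses ln(s) on ln(Ȳ) over the 20 stream-year cells from the summary table above:

```python
import math

# (mean, sd) for each stream-year cell, taken from the summary table above
cells = [
    (56.2, 12.1), (56.2, 5.2), (49.8, 7.9), (50.0, 8.9), (41.3, 3.8),
    (68.8, 9.8), (52.0, 7.3), (58.8, 6.8), (67.8, 4.9), (56.3, 6.4),
    (71.0, 4.4), (60.0, 6.2), (56.6, 7.7), (59.2, 12.1), (52.3, 12.9),
    (70.6, 8.7), (61.2, 5.7), (73.0, 14.0), (80.6, 8.2), (64.3, 7.2),
]

# least-squares slope of ln(s) on ln(mean)
x = [math.log(m) for m, s in cells]
y = [math.log(s) for m, s in cells]
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
print(round(slope, 2))  # a modest slope relative to the scatter in the plot
```

A slope near 0 (relative to its noise) is consistent with the text's conclusion that there is no strong mean-variance relationship here.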
A plot of the time trend appears to show a shift in the mean starting in year 6, but there is large year-to-year variation in both streams.
However, the trend lines in both streams appear to be parallel (indicating little evidence of a stream-year
interaction).
13.2.1 Analysis of stream 1 - yearly averages
We start with the analysis of a single stream as it is quite common for Before-After studies to have only one
experimental unit. Of course, inference will be limited to that specific stream chosen, and the results from
this stream cannot be extrapolated to other streams in the study.
The multiple quadrats measured every year are pseudo-replicates and cannot be treated as independent
observations. As in most cases of pseudo-replication, one approach is to take the average over the
pseudo-replicates and analyze the averages. If the number of quadrats taken each year was the same, the analysis
of the averages doesn't lose any information and is perfectly valid. If the number of quadrats varies from
year to year, the analysis of the averages is only approximate, but is usually sufficient.
The averages are computed in the usual way and the actual code to do this is not shown. After the
averages are taken, the data have been reduced to 10 values (one for each year in the study), with five measurements
taken before the impact and five measurements taken after the impact. Consequently, the simple two-sample
t-test can be used to analyze the data. As noted earlier, autocorrelation in the measurements over time has
been ignored.
We use Proc Ttest to fit the model:
ods graphics on;
proc ttest data=mean_stream1 plots=all;
title2 'Analysis of stream 1 using the mean over the quadrats';
var mean_count;
class period;
ods output ttests = TtestTest100;
ods output ConfLimits=TtestCL100;
ods output Statistics=TtestStat100;
footnote 'Inference limited to that single stream and cannot be generalized to other streams';
run;
ods graphics off;
The sample means and other summary statistics
Variable    period      N   Mean     Std Dev  Std Err  Lower CL  Upper CL
mean_count  After       5   60.7300  7.3335   3.2796   51.6243   69.8357
mean_count  Before      5   50.6800  6.1480   2.7495   43.0462   58.3138
mean_count  Diff (1-2)      10.0500  6.7667   4.2797    0.1811   19.9189
show that the variances before and after are about equal and there appears to be a small shift in the mean. As
seen in earlier chapters, there are two test statistics that can be computed: one assuming equal variances in
the before/after periods and one allowing for different variances in the two periods:
Variable    Method         Variances  t Value  DF      Pr > |t|
mean_count  Pooled         Equal      2.35     8       0.0468
mean_count  Satterthwaite  Unequal    2.35     7.7636  0.0477
In this case, the two approaches (the equal-variance and unequal-variance t-tests) yield virtually the same
results: weak evidence of an effect, with a two-sided p-value just under 0.05.
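The pooled t statistic can be verified by hand from the summary statistics above. A sketch in Python (the document itself uses SAS):

```python
import math

# Summary statistics from the Proc Ttest output above
n_a, mean_a, sd_a = 5, 60.73, 7.3335   # After period
n_b, mean_b, sd_b = 5, 50.68, 6.1480   # Before period

# pooled (equal-variance) two-sample t statistic
sp2 = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
se = math.sqrt(sp2 * (1 / n_a + 1 / n_b))
t = (mean_a - mean_b) / se
print(round(t, 2), n_a + n_b - 2)  # t = 2.35 on 8 df, matching the output
```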
The effect size can be estimated:
Variable    period      Method         Variances  Mean     Lower CL  Upper CL
mean_count  Diff (1-2)  Pooled         Equal      10.0500  0.1811    19.9189
mean_count  Diff (1-2)  Satterthwaite  Unequal    10.0500  0.1285    19.9715
As expected, the confidence interval for the effect size just barely excludes the value of zero. A power
analysis could be conducted to see how many years of monitoring would be required to detect a reasonably
sized effect.
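As a rough illustration of such a power analysis, here is a normal-approximation sketch (not the formal mixed-model power analysis developed later); the effect size and pooled SD are taken from the t-test output above, and years are the experimental units, so n is years of monitoring per period:

```python
import math
from statistics import NormalDist

delta = 10.05   # effect size to detect (the observed before/after shift)
sd = 6.7667     # pooled SD of the yearly means (from the t-test output)
alpha, power = 0.05, 0.80

# standard two-sample sample-size formula, normal approximation
z = NormalDist()
d = delta / sd
n = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2
print(math.ceil(n))  # approximate years of monitoring needed per period
```

The normal approximation slightly understates the n required for a t-test, so treat the result as a lower bound.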
13.2.2 Analysis of Stream 1 - individual values
The number of quadrats varies a bit over time, and so the analysis of the averages for each year is only
approximate. A more refined analysis, which also provides estimates of the year-to-year and
quadrat-to-quadrat variation, is possible.
Again note that quadrats are pseudo-replicates taken in each year. This analysis is functionally equivalent
to the analysis of means with sub-sampling discussed in earlier chapters.
A mixed-model analysis is performed. The statistical model is
Y = Period Year(Period)(R)
If each year is labelled with a separate value (e.g. using the actual calendar year as the label) rather than
year 1 within the before period and year 1 within the after period, the nesting of year within period can be
dropped and the model can be written as
Y = Period Year(R)
Note that these two models are equivalent; we are just letting the software deal with the nesting of years
within each period. The Period term refers to the before vs. after periods and represents the effect of interest.
This can be fit using Proc Mixed in SAS in the usual fashion:
/* analysis of the individual quadrats for Stream 1 */
ods graphics on;
proc mixed data=stream1 plots=all;
   title2 'Analysis of stream 1 using the individual quadrats';
   class period year;
   model count = period / ddfm=kr;
   random year(period);
   lsmeans period / diff cl;
   footnote 'Inference limited to that single stream and cannot be generalized to other streams';
   ods output tests3   =Mixed200Test;  /* needed for the pdmix800 */
   ods output lsmeans  =Mixed200Lsmeans;
   ods output diffs    =Mixed200Diffs;
   ods output covparms =Mixed200CovParms;
run;
ods graphics off;
Note the use of the random statement to specify that yearly effects are random effects and should be
separated from the residual (or quadrat) random noise.
The test for a period effect (i.e. is there evidence of a difference in the mean response between the before
and after period) is found as:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
period 1 8.21 5.24 0.0505
Inference limited to that single stream and cannot be generalized to other
streams
The p-value indicates weak evidence of an effect, similar to that found with the analysis of the averages.
Don't live and die by the 0.05 rule; surely the two analyses are saying similar things!
The estimated means in the two periods can be found:
Effect  period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower    Upper
period  After   60.7852   3.0397     7.89  20.00    <.0001    0.05   53.7580  67.8125
period  Before  50.8346   3.1081     8.55  16.36    <.0001    0.05   43.7461  57.9230
The estimated difference in the means between the two periods can also be estimated, and usually is of most
interest.
Effect  period  _period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower     Upper
period  After   Before   9.9507    4.3475     8.21  2.29     0.0505    0.05   -0.02927  19.9306
Both results are similar to those from the analysis of the averages.
There are two sources of variation in this experiment. First, the year-to-year variation represents the
effects of year-specific factors such as total rainfall or other random events. Second, the residual
(or quadrat-to-quadrat) variation measures the variation over quadrats in different parts of the stream. The
two estimated variance components are:
Cov Parm      Estimate
year(period)  33.7853
Residual      59.1734
As you will see later when the power analysis is discussed, you can reduce the impact of the quadrat-to-quadrat
variation on the estimates by sampling more quadrats each year. However, sampling more quadrats
has no impact on the year-to-year random variation, and this is often the limiting feature of simple
before/after designs. As a heads-up, the year-to-year variation can be controlled by conducting a paired BACI
experiment where both impact and control sites are measured in each year of the study.
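This limit can be made concrete with the variance components just estimated. The sketch below (Python for illustration; it treats the design as balanced, which the actual data are not, so the first value only roughly matches the SE of 4.35 reported by Proc Mixed) shows the SE of the before/after difference as the number of quadrats per year grows:

```python
import math

# Variance components from the stream-1 mixed model above
var_year = 33.7853     # year-to-year variance component
var_quadrat = 59.1734  # quadrat-to-quadrat (residual) variance component

def se_diff(years_per_period, quadrats_per_year):
    # variance of one yearly mean, then SE of the difference of two period means
    var_yearly_mean = var_year + var_quadrat / quadrats_per_year
    return math.sqrt(2 * var_yearly_mean / years_per_period)

print(se_diff(5, 5))                # roughly the SE reported by Proc Mixed
print(se_diff(5, 1000))             # many quadrats per year: little further gain
print(math.sqrt(2 * var_year / 5))  # the floor set by year-to-year variation
```

No matter how many quadrats are taken each year, the SE cannot drop below the floor set by the year-to-year component; only more years of monitoring (or a paired design) reduce it.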
It should be noted that, in the case of a single stream, the year-to-year variation really is a combination
of the actual year-specific effects and a year-stream interaction, where the time trend of this particular
stream may not be completely parallel with other streams in the study. Of course, with only a single stream
measured, these two sources cannot be disentangled.
Diagnostic plots (not shown) show no evidence of problems.
13.2.3 Analysis of all streams - yearly averages
We now move to the analysis of both streams. By sampling more than one stream, the inference is greatly
strengthened as now the conclusions are more general. Rather than referring to one specic stream, the
results now refer to all streams in the population of streams on this project.
As before, we start by averaging over the pseudo-replicates (the multiple quadrats) measured in each
stream-year combination. The statistical model is
YearlyAvg = Period Stream(R) Year(Period)(R)
Once again, if each year is uniquely labelled, the model can be written as:
YearlyAvg = Period Stream(R) Year(R)
As before, the Period term represents the difference in the mean between the before and after periods, and
the other terms represent the random effects of streams and years. The residual variation represents a combination
of the stream-year interaction and the quadrat-noise terms; because the averages over the quadrats
were taken, these cannot be separated.
This can be fit using Proc Mixed in SAS in the usual fashion:
ods graphics on;
proc mixed data=mean_invert plots=all;
   title2 'Analysis of means of quadrat counts';
   class stream year period;
   model mean_count = period / ddfm=kr;
   random year(period) stream;
   lsmeans period / diff cl;
   footnote 'Inference is now for ALL streams in the project';
   ods output tests3   =Mixed300Test;  /* needed for the pdmix800 */
   ods output lsmeans  =Mixed300Lsmeans;
   ods output diffs    =Mixed300Diffs;
   ods output covparms =Mixed300CovParms;
run;
ods graphics off;
Note the use of the random statement to specify the random effects.
The test for a period effect (i.e. is there evidence of a difference in the mean response between the before
and after period) is found as:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
period 1 8 5.73 0.0436
Inference is now for ALL streams in the project
The p-value indicates weak evidence of an effect.
The estimated marginal means in the two periods are:
Effect  period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower    Upper
period  After   65.3300   5.4239     1.92  12.04    0.0079    0.05   41.0319  89.6281
period  Before  55.2450   5.4239     1.92  10.19    0.0109    0.05   30.9469  79.5431
The estimated difference in the marginal means between the two periods is usually of most interest.
Effect  period  _period  Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower   Upper
period  After   Before   10.0850   4.2127     8   2.39     0.0436    0.05   0.3704  19.7996
The confidence interval just barely excludes the value of zero.
There are four sources of variation in this experiment, and not all can be separated when the analysis
is done on the yearly averages. First, the year-to-year variation represents the effects of year-specific factors
such as total rainfall or other random events. Second, the stream-to-stream variation measures the effect
of the different streams on the mean response. The plot produced earlier shows that there is some stream
effect, as the lines for the streams are consistently separated. Third is the stream-year interaction, where
the individual streams do not have exactly parallel trend lines. Lastly, the quadrat-to-quadrat variation
measures the variation over quadrats in different parts of a stream. Unfortunately, when the yearly averages
are analyzed, the last two variance components cannot be separated.
The estimated variance components are:
Cov Parm      Estimate
year(period)  39.8279
stream        41.0907
Residual       9.0795
Notice that the residual variance component represents a combination of the stream-year interaction and the
quadrat-to-quadrat variation (divided by the average number of quadrats sampled per year). The stream and
year variance components are substantially larger than the residual variance!
Diagnostic plots (not shown) show no evidence of problems.
13.2.4 Analysis of all streams - individual values
Finally, we analyze both streams using the individual values. By sampling more than one stream, the
inference is greatly strengthened, as the conclusions are now more general. Rather than referring to one
specific stream, the results now refer to all streams in the population of streams on this project.
The statistical model is
Y = Period Stream(R) Year(Period)(R) Stream*Year(Period)(R)
Once again, if each year is uniquely labelled, the model can be written as:
Y = Period Stream(R) Year(R) Stream*Year(R)
As before, the Period term represents the difference in the mean between the before and after periods and
the other terms represent the random effects of streams and years. We now can separate out the stream-year
interaction from the quadrat-to-quadrat variation.
This can be fit using Proc Mixed in SAS in the usual fashion:
/* analysis of the individual quadrats */
ods graphics on;
proc mixed data=invert plots=all nobound;
   title2 'Analysis of all streams using the individual quadrats';
   class period year stream;
   model count = period / ddfm=kr;
   random year(period) stream stream*year(period);
   lsmeans period / diff cl;
   footnote 'Inference is now for ALL streams in the project';
   ods output tests3   =Mixed400Test;  /* needed for the pdmix800 */
   ods output lsmeans  =Mixed400Lsmeans;
   ods output diffs    =Mixed400Diffs;
   ods output covparms =Mixed400CovParms;
run;
ods graphics off;
Note the use of the random statement to specify the random effects.
The test for a period effect (i.e. is there evidence of a difference in the mean response between the before
and after period) is found as:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
period 1 8.07 5.70 0.0438
Inference is now for ALL streams in the project
The p-value indicates weak evidence of an effect and the result is similar to that found when the yearly
averages were analyzed.
The estimated marginal means in the two periods are:
Effect  period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower    Upper
period  After   65.1743   5.2611     1.97  12.39    0.0068    0.05   42.1882  88.1605
period  Before  55.1828   5.2671     1.98  10.48    0.0093    0.05   32.2656  78.1001
and the estimated difference in the marginal means between the two periods is:
Effect  period  _period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower   Upper
period  After   Before   9.9915    4.1853     8.07  2.39     0.0438    0.05   0.3544  19.6286
The confidence interval just barely excludes the value of zero.
There are four sources of variation in this experiment, but all can be separated when the analysis is done
on the individual values. First, the year-to-year variation represents the effects of year-specific factors such
as total rainfall or other random events. Second, the stream-to-stream variation measures the effect of the
different streams on the mean response. The plot produced earlier shows that there is some stream effect, as
the lines for the streams are consistently separated. Third is the stream-year interaction, where the individual
streams do not have exactly parallel trend lines. Lastly, the quadrat-to-quadrat variation measures the
variation over quadrats in different parts of a stream. These can now be separated.
The estimated variance components are:
Cov Parm             Estimate
year(period)         38.6333
stream               37.9094
year*stream(period)  -5.1972
Residual             68.9852
Somewhat surprisingly, the stream-year variance component is estimated to be negative. If the model is refit
without the nobound option on the Proc Mixed statement, the model is fit forcing all variance components
to be non-negative. The results are very similar to those presented here. A negative estimate of a variance
component usually indicates that the actual variance component is very small (i.e. close to zero).
We see that the stream-year interaction is very small; this is not surprising given the parallel responses of the
two streams over the years of the study seen in the preliminary plots.
The usual diagnostic plots (not shown) fail to show any problem with the model t.
13.3 Simple BACI - One year before/after; one site impact; one site
control
The very simplest BACI design has the treatment site and a single control site measured before and after the
impact occurs. As noted earlier, the evidence of an environmental impact is the non-parallelism of the
response between the control and the treatment site. Several examples of possible responses are shown
below:
The results in the first row above both show no environmental impact; the results in the bottom row all
show evidence of an environmental impact.
The number of replicates at each site in each year does not have to be equal; however, power to detect
non-parallelism is maximized when the sample sizes are equal in each site-time combination.
The usual assumptions for ANOVA are made:

- The response variable is interval or ratio scale (i.e. continuous in JMP parlance). This assumption can
be modified, e.g. if responses are categorical, but this is not covered in this course. Please contact me
for details.

- The standard deviations in each site-year combination should be equal. This is checked by looking at
the sample standard deviations. If these are reasonably equal (a rough rule of thumb is that the ratio of
the largest to smallest standard deviation should be less than 5:1), then the analysis can proceed. If
the standard deviations are grossly unequal, then consider a transformation (e.g. taking the logarithm
of each observation).

- The observations are independent within each site-year combination and across years. For example,
if the measurements at each site are from quadrats, then the quadrats are sufficiently far apart and the
same location isn't measured after impact. [If the same spot is measured before and after impact, this
is an example of blocking and will be discussed later.]

- The distribution of responses within each site has an approximately normal distribution. This assumption
is not crucial; it is far more important that the standard deviations be approximately equal and the
observations be independent. In any case, most impact studies have too small a sample size to
detect anything but gross violations of normality.
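The 5:1 rule of thumb is trivial to check in code. A sketch with illustrative standard deviations (substitute the cell SDs from your own study):

```python
# Rule-of-thumb check on equality of standard deviations across
# site-year cells (illustrative values, not from any particular data set)
sds = [3.11, 3.39, 3.63, 2.99]
ratio = max(sds) / min(sds)
print(round(ratio, 2), ratio < 5)  # well under the 5:1 rule of thumb
```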
The analysis is straightforward. This is a classical two-way ANOVA, a.k.a. a two-factor completely
randomized design. The model to be fit (in the standard simplified syntax) is:
Y = Treatment Time Treatment*Time
where Treatment takes the values of control or impact, and Time takes the value of before or after. The
random variation among the multiple measurements is the only source of random variation and is implicit at
the lowest level of the design.
Most standard statistical packages can analyze this design without problems.
13.4 Example: Change in density in crabs near a power plant - one
year before/after; one site impact; one site control
A simple BACI design was used to assess the impact of cooling water discharge on the density of shore
crabs. The beach near the outlet of the cooling water was sampled using several quadrats before and after
the plant started operation, as was a control beach on the other side of the body of water.
A schematic of the BACI design is:
The raw data are available in the baci-crabs.csv file at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into
SAS in the usual way:
data crabs;
   infile 'baci-crabs.csv' dlm=',' dsd missover firstobs=2;
   length SiteClass Site Period quadrat $10 trt $20;
   input SiteClass site Period quadrat density;
   trt = compress(SiteClass || "." || Period);
run;
Part of the raw data are shown below:
Obs SiteClass Site Period quadrat trt density
1 Control C1 After Q16 Control.After 25
2 Control C1 After Q17 Control.After 27
3 Control C1 After Q18 Control.After 32
4 Control C1 After Q19 Control.After 26
5 Control C1 Before Q10 Control.Before 32
6 Control C1 Before Q11 Control.Before 28
7 Control C1 Before Q12 Control.Before 35
8 Control C1 Before Q13 Control.Before 26
9 Control C1 Before Q14 Control.Before 33
10 Control C1 Before Q15 Control.Before 29
Notice that only one site was measured at both the impact and control areas, and multiple, independent
quadrats were measured at each site. We have labelled the quadrats using separate identifiers to remind
ourselves that new quadrats were measured in both the before and after phases of the experiment. A
common notation, which is not recommended, is to reuse the labels, e.g. you would have a Q1 in each
site-time combination. This notation is NOT recommended as it is not clear that separate quadrats were measured
at each opportunity.
The variability in observations from a simple BACI (and more complex BACI as well) can be partitioned
into several components, as will be reported in the ANOVA table.
First is the main effect of time. This is simply the difference in the mean (over all sites and quadrats)
before and after the impact took place:
This temporal effect is not of interest as it may represent the effects of changes in the global conditions
unrelated to the impact.
Next, not all sites have exactly the same mean even before the impact takes place. The Site main effect
represents the differences in the means among the sites when averaged over all times and quadrats:
The Site effect is not of interest as it represents the effects of site-specific factors that we can neither measure
nor control.
Finally, and of greatest interest, is the BACI effect, which is the DIFFERENTIAL CHANGE, i.e. the
difference of the two changes in the means, that occurs between the before and after periods:
The BACI, or interaction, effect represents the potential environmental impact, and so a test for no interaction
is equivalent to a test for no environmental impact.
13.4.1 Analysis
A table of means and standard deviations can be computed in the usual ways.
proc tabulate data=crabs;  /* proc tabulate is not for the faint of heart */
   title2 'Summary table of means, std devs';
   class SiteClass Site Period;
   var Density;
   table SiteClass*Site*Period,
         density*(n*f=5.0 mean*f=5.2 std*f=5.2 stderr*f=5.2) / rts=15;
run;
giving
                           density
SiteClass  Site  Period    N   Mean   Std   StdErr
Control    C1    After     4  27.50  3.11   1.55
                 Before    6  30.50  3.39   1.38
Impact     I1    After     5  20.20  3.63   1.62
                 Before    4  26.25  2.99   1.49
The sample standard deviations are approximately equal indicating that the assumption of equal population
standard deviations is tenable. The quadrats were sampled far enough apart that they can be considered to
be independent. Separate quadrats were sampled in the two time periods.
This is a simple two-factor CRD experiment as seen in previous chapters, because no quadrat is repeatedly
measured over time and all observations are independent of each other. The model is
Density = SiteClass Period SiteClass*Period
where SiteClass represents the effect of the Control vs. Impact sites; Period represents the before vs. after
contrast; and the SiteClass*Period interaction represents the BACI effect where, as noted earlier, the
non-parallelism (the interaction) represents evidence of an environmental impact.
There are two main procedures in SAS for fitting ANOVA and regression models (collectively called
linear models). First is Proc GLM, which performs the traditional sums-of-squares decomposition. Second is
Proc Mixed, which uses restricted maximum likelihood (REML) to fit models. Proc Mixed has the advantage
of being able to fit more complex BACI models with more than one site and more than one year measured
in the before or after periods. Consequently, I prefer to use Proc Mixed.
ods graphics on;
proc mixed data=crabs plots=all;
   title2 'Mixed BACI analysis';
   class SiteClass Period;  /* class statement identifies factors */
   model density = SiteClass Period SiteClass*Period / ddfm=kr;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'BACI effect' SiteClass*Period 1 -1 -1 1 / cl;
   ods output tests3   =MixedTest;  /* needed for the pdmix800 */
   ods output lsmeans  =MixedLsmeans;
   ods output diffs    =MixedDiffs;
   ods output estimates=MixedEsts;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical
model in the Model statement; it is very similar to the statistical model seen earlier.
The order of the terms in the model is not important.
The output from most packages is voluminous and only part of it will be explored here.
First, the Effects tests:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
SiteClass 1 15 13.90 0.0020
Period 1 15 8.54 0.0105
SiteClass*Period 1 15 0.97 0.3404
indicate significant effects of sites (the mean density varied between the two sites regardless of the potential
impact) and of time (the mean density changed over time due to temporal effects unrelated to the impact), but
there was no evidence of an environmental impact, as the test for an interaction effect was not statistically
significant (p = 0.34).
We can find the sample means and confidence limits for each population mean quite simply because this
is a CRD.
proc sort data=crabs;
   by SiteClass Period;
run;

proc means data=crabs noprint;  /* find simple summary statistics */
   by SiteClass Period;
   var density;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series y=mean x=SiteClass / group=Period;
   highlow x=SiteClass high=uclm low=lclm / group=Period;
   yaxis label='Density' offsetmin=.05 offsetmax=.05;
   xaxis label='SiteClass' offsetmin=.05 offsetmax=.05;
run;
and then get the profile plot:
which shows two approximately parallel lines.
The diagnostic plot from Proc Mixed does not show any problems:
What is of primary interest is the difference of the differences, i.e. how much the change differed
between the control and impact sites. This is called the BACI contrast and is the estimated environmental
impact. Look first at the estimates of the least-squares means. In simple designs, the least-squares means
are equal to the simple raw means, but in more complex designs they will differ. The least-squares means will
adjust for missing data and unbalance in the experimental design.
The estimates of the marginal means (and standard errors) are requested using the LSmeans statement.
The LSMeans estimates are equal to the raw sample means; this will be true ONLY for balanced data. In
the case of unbalanced data (see later), the LSMEANS seem like a sensible way to estimate marginal means.
Effect            SiteClass  Period  Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
SiteClass*Period  Control    After   27.5000   1.6636     15  16.53    <.0001    0.05   23.9542  31.0458
SiteClass*Period  Control    Before  30.5000   1.3583     15  22.45    <.0001    0.05   27.6048  33.3952
SiteClass*Period  Impact     After   20.2000   1.4880     15  13.58    <.0001    0.05   17.0285  23.3715
SiteClass*Period  Impact     Before  26.2500   1.6636     15  15.78    <.0001    0.05   22.7042  29.7958
The estimated mean density in the control site dropped by 3 crabs; the mean density in the impact site dropped
by 6.05 crabs. The mean difference-in-the-difference is 3.05 = 6.05 − 3 crabs. But we would like to get a
standard error for this estimate. This is obtained by specifying a contrast among the means.
The BACI contrast is:

(µ_CA − µ_CB) − (µ_TA − µ_TB) = µ_CA − µ_CB − µ_TA + µ_TB
This is specified by the Estimate statement in Proc Mixed (see above).
The estimate of the BACI effect is:
Label        Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
BACI effect  3.0500    3.0974     15  0.98     0.3404    0.05   -3.5520  9.6520
The estimated mean difference-in-the-difference is 3.05 with a standard error of 3.097.
The p-value (p = 0.34) is the same as for the test for interaction (as it must be). It is pretty clear that the 95%
confidence interval for the difference-in-the-difference will include 0, indicating no evidence of an effect.
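As a check, the BACI contrast and its standard error can be reproduced from the cell summary statistics reported earlier. A sketch (Python for illustration) assuming the pooled MSE of the two-way ANOVA:

```python
import math

# cell summaries from the tabulate output above: (n, mean, sd)
cells = {
    ("Control", "After"):  (4, 27.50, 3.11),
    ("Control", "Before"): (6, 30.50, 3.39),
    ("Impact",  "After"):  (5, 20.20, 3.63),
    ("Impact",  "Before"): (4, 26.25, 2.99),
}

# BACI contrast: (CA - CB) - (IA - IB)
baci = ((cells[("Control", "After")][1] - cells[("Control", "Before")][1])
        - (cells[("Impact", "After")][1] - cells[("Impact", "Before")][1]))

# pooled MSE over the four cells, then the SE of the contrast
df = sum(n - 1 for n, _, _ in cells.values())
mse = sum((n - 1) * sd**2 for n, _, sd in cells.values()) / df
se = math.sqrt(mse * sum(1 / n for n, _, _ in cells.values()))
print(round(baci, 2), round(se, 3))  # matches the Proc Mixed estimate above
```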
The key advantage of the estimate and 95% confidence interval over the p-value is that it enables the
analyst to see whether the test for interaction was not statistically significant because the effect was small (good
news) or because the experiment has so much noise that the standard error for the effect is enormous (bad
news).
There is no statistical rule of thumb to distinguish between these two cases. The analyst must rely on
biological knowledge. In this case, is a difference-in-the-difference of 3 crabs of biological importance when
the base densities are around 20-30 crabs per quadrat?
13.5 Simple BACI design - limitations
The simple BACI design is less than satisfactory for most impact studies for several reasons.
First, the experimental and observational units are different sizes. The impact is applied at the site level,
but the measurements are taken within each unit at the quadrat level, i.e., they are pseudo-replicates as outlined
by Hurlbert (1984). This means that the scope of inference is limited: if you do detect an effect, you can really
only conclude that this impact site behaved differently than this particular control site. This leaves the
study open to the criticism that the control site was accidentally badly chosen and so the effect detected was
an artefact of the study.
Second, the measurements are taken at a single point in time before and after the impact. Again, one
could argue that the times were (accidentally) poorly chosen and the results are specific to these
time points. Multiple time points before and after the impact could be measured.
Third, there is only one control and one impact site. While impacts are normally applied to a single site,
there is no reason why only a single control site must be measured. Again, if only a single control site was chosen,
one could argue that this single control site was poorly chosen and that the results are again specific to this
single control site. Multiple control sites could be measured to ensure that the results are generalizable to
more than this specific control site.
Fourth, if separate quadrats are measured at each site-time combination, additional variability is
introduced, making it harder to detect effects. Following the principles of experimental design, blocking, where
the same unit is measured over time, improves the design.
13.6 BACI with Multiple sites; One year before/after
As noted in the previous section, a pointed critique of the simple BACI design is the use of a single control
site. The main criticism is that with a single control site, the replicated measurements within each site
are pseudo-replicates. The problem is that the treatment (control or impact) operates at the site level, but
measurements are taken at the quadrat-within-site level. This implies that inference is limited to these
specific impact and control sites. While limiting inference to the single impact site seems reasonable (e.g.
only one dam is built), limiting inference to the single control site seems too narrow.
Consequently, it is almost always advisable to have multiple control sites in the impact-assessment design.
The theory presented below is general enough to allow replicated impact sites as well.
c 2012 Carl James Schwarz 737 December 21, 2012
CHAPTER 13. ANALYSIS OF BACI EXPERIMENTS
A conceptual diagram of the BACI design with multiple control sites is:
We will assume that all the sites will be measured both before and after the impact, but only one year
of monitoring is done before or after impact. In later sections, we will extend this approach to allow for
multiple years of monitoring as well.
All sites are usually measured in both years, i.e. both before and after the impact occurs. The analysis below
can also deal with sites that are measured only before the impact or only after the impact (i.e. have missing
data). However, for these cases, it is important that good software be used (e.g. SAS, JMP, R) because
sophisticated algorithms are required. These incomplete sites do have information that should be integrated
into the analysis. In particular, sites that are measured only before or only after impact provide information
about the variance components which are needed when the formal tests of hypotheses are done.
The use of multiple controls avoids the criticism that the results of the BACI experiment are solely due to
a poor choice of control sites. For example, in the above diagram, if only a single control site was monitored,
one could conclude that the impact was positive, neutral, or negative depending on which control site was
selected.
There are several, mathematically equivalent, ways to analyze these types of designs. The choice among
the analyses is often dictated by the availability of software. However, if you want to use this assessment to
investigate alternatives, such as increasing the subsampling, increasing the number of years of monitoring,
or increasing the number of sites, then the full analysis that provides estimates of the individual variance
components is needed.
We will proceed by example.
13.7 Example: Density of crabs - BACI with Multiple sites; One year
before/after
Let us extend the density of crabs example seen in an earlier section.
The beach near the outlet of the cooling water was sampled using several quadrats before and after the
plant started operation. Two control beaches at other locations in the inlet were also sampled.
A schematic of the experimental design is:
The raw data is available in the baci-crabs-mod.csv file available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported
into SAS in the usual way:
data crabs;
   infile 'baci-crabs-mod.csv' dlm=',' dsd missover firstobs=2;
   length SiteClass Site Period quadrat $10 trt $20;
   input SiteClass site Period quadrat density;
   trt = compress(SiteClass || "." || Site || "." || Period);
run;
Part of the raw data are shown below:
Obs SiteClass Site Period quadrat trt density
1 Control C1 After Q16 Control.C1.After 25
2 Control C1 After Q17 Control.C1.After 27
3 Control C1 After Q18 Control.C1.After 32
4 Control C1 After Q19 Control.C1.After 26
5 Control C1 Before Q10 Control.C1.Before 32
6 Control C1 Before Q11 Control.C1.Before 28
7 Control C1 Before Q12 Control.C1.Before 35
8 Control C1 Before Q13 Control.C1.Before 26
9 Control C1 Before Q14 Control.C1.Before 33
10 Control C1 Before Q15 Control.C1.Before 29
There was one impact site monitored (Site I1, before and after the impact) and two control beaches (C1 and C2,
both measured before and after impact). In each site-year combination, between four and six quadrats were
measured. It is not necessary that the same number of quadrats be measured in each site-year combination,
and most good software can deal with this automatically. Notice that each site has a unique label as does
each quadrat. This is good data management practice as it makes it clear that different sites are used and
that all of the quadrats were located and measured independently.
We can again partition the variation in the data into that due to Period (before vs. after), SiteClass (treatment
vs. control), and the BACI contrast (the differential change). In addition, with multiple sites, there are now
additional sources of random variation that must be accounted for:
The quadrat-to-quadrat variation (within a beach) represents micro-habitat effects that affect the very
localized density of the crabs. The site-to-site variation within each site class (i.e. the multiple sites within
the control group) represents the effect of site-specific factors that cause the means of the individual sites (all
within the control class) to vary. Finally, the Site-Period or Site-Year interaction represents the inconsistency
in the response of the sites to the before vs. after period. For example, not all of the control sites have exactly
the same increase in the mean between the before and after periods.
Note that if only one quadrat is measured per site in each year, then the quadrat-to-quadrat variation and
the site-year variation are confounded together and cannot be separated.
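In symbols (a sketch; the notation here is mine, not the notes'), a single quadrat measurement varies as

\[
\operatorname{Var}(y) \;=\; \sigma^2_{\text{site}} \;+\; \sigma^2_{\text{site}\times\text{period}} \;+\; \sigma^2_{\text{quadrat}},
\]

and with only one quadrat per site-period combination the data provide information only on the sum \(\sigma^2_{\text{site}\times\text{period}} + \sigma^2_{\text{quadrat}}\), never on the two pieces separately.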
The analyses presented below start with the analysis of a highly "distilled" dataset and progressively
move to the analysis of the raw data.
13.7.1 Converting to an analysis of differences
The measurements taken in each site-year combination are pseudo-replicates. Different quadrats were
measured in each year, so there is no linking among the quadrats. This idea is reinforced in the dataset by
using separate quadrat labels for each quadrat.
We first find the average count over the four to six quadrats in each site-year combination. If the number
of quadrats varied among sites, this will give an approximate analysis because the design is unbalanced,
but it is likely good enough.
Use Proc Means to compute the mean density for each combination of SiteClass, Site, and Period:
proc sort data=crabs;
by SiteClass Site Period;
run;
proc means data=crabs noprint;
by SiteClass Site Period;
var density;
output out=meancrabs mean=mean_density;
run;
proc print data=meancrabs;
run;
This gives the following means:
Obs SiteClass Site Period mean_density
1 Control C1 After 27.50
2 Control C1 Before 30.50
3 Control C2 After 31.20
4 Control C2 Before 38.00
5 Impact I1 After 20.20
6 Impact I1 Before 26.25
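To make the averaging step concrete, here is a sketch of the same computation in plain Python (standard library only), using just the C1 rows printed earlier; the full baci-crabs-mod.csv file would of course supply all three sites.

```python
# Sketch of the Proc Means step in Python, using only the C1 rows
# printed above (the full data file has all three sites).
from collections import defaultdict
from statistics import mean

# (SiteClass, Site, Period, density) for the ten quadrats shown
rows = [
    ("Control", "C1", "After", 25), ("Control", "C1", "After", 27),
    ("Control", "C1", "After", 32), ("Control", "C1", "After", 26),
    ("Control", "C1", "Before", 32), ("Control", "C1", "Before", 28),
    ("Control", "C1", "Before", 35), ("Control", "C1", "Before", 26),
    ("Control", "C1", "Before", 33), ("Control", "C1", "Before", 29),
]

# group the quadrat densities by (SiteClass, Site, Period)...
cells = defaultdict(list)
for site_class, site, period, density in rows:
    cells[(site_class, site, period)].append(density)

# ...and average within each cell, mimicking 'output out=meancrabs mean=...'
mean_density = {key: mean(vals) for key, vals in cells.items()}
print(mean_density)
# {('Control', 'C1', 'After'): 27.5, ('Control', 'C1', 'Before'): 30.5}
```

The means agree with the Proc Means output above (27.50 and 30.50 for site C1).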
Next, you need to compute the difference in the sample means between the before and after periods within
each site. Use Proc Transpose to place the two means side-by-side on the same data line, and then compute
the difference:
proc transpose data=meancrabs out=TransMeanCrabs;
by SiteClass Site;
var mean_density;
id Period;
run;
data TransMeanCrabs;
set TransMeanCrabs;
diff = After-Before;
run;
proc print data=TransMeanCrabs;
run;
This gives the following difference in the means:
Obs SiteClass Site _NAME_ After Before diff
1 Control C1 mean_density 27.5 30.50 -3.00
2 Control C2 mean_density 31.2 38.00 -6.80
3 Impact I1 mean_density 20.2 26.25 -6.05
Finally, we test if the mean difference is the same for the control and impact groups. This is a simple
two-sample t-test. Use Proc Ttest to test if the mean difference is the same for the Control and Impact site classes:
ods graphics on;
proc ttest data=TransMeanCrabs plot=all dist=normal;
   title2 'T-test on the difference (After-Before)';
   class SiteClass;
   var diff;
   ods output ttests    = TtestTest;
   ods output ConfLimits= TtestCL;
   ods output Statistics= TtestStat;
run;
ods graphics off;
This gives:

Variable  Method         Variances  t Value  DF  Pr > |t|
diff      Pooled         Equal         0.35   1    0.7860
diff      Satterthwaite  Unequal          .   .         .

Variable  SiteClass   N     Mean  Std Error  Lower Limit of Mean  Upper Limit of Mean
diff      Control     2  -4.9000     1.9000             -29.0418              19.2418
diff      Impact      1  -6.0500          .                    .                    .
diff      Diff (1-2)  _   1.1500     3.2909             -40.6648              42.9648
Note that because of the very small effective sample size, only the equal variance t-test can be performed.
The p-value for the test of the hypothesis of no difference in the mean difference between Control and
Impact sites is 0.79, indicating no evidence of a difference in the means, i.e. no evidence of an impact. The
output also includes the estimated difference in the mean differences along with a 95% confidence interval.
Not surprisingly, the 95% confidence interval contains the value of 0.
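The whole chain of calculations (site differences, then a pooled two-sample t-test) is small enough to verify by hand; here is a sketch in plain Python (standard library only) that reproduces the SAS output above, using the fact that the t distribution with 1 df is the Cauchy distribution, so its CDF has a closed form.

```python
# Sketch of the Proc Transpose / Proc Ttest steps in plain Python,
# using the cell means printed above.
from math import atan, pi, sqrt

mean_density = {  # (Site, Period) -> mean density over quadrats
    ("C1", "Before"): 30.50, ("C1", "After"): 27.50,
    ("C2", "Before"): 38.00, ("C2", "After"): 31.20,
    ("I1", "Before"): 26.25, ("I1", "After"): 20.20,
}

# After-minus-Before difference within each site
diff = {s: mean_density[(s, "After")] - mean_density[(s, "Before")]
        for s in ("C1", "C2", "I1")}
control, impact = [diff["C1"], diff["C2"]], [diff["I1"]]

# Pooled two-sample t-test on the differences (equal-variance version;
# with a single impact site only the control group contributes to the
# pooled variance, so there is just 1 degree of freedom)
n1, n2 = len(control), len(impact)
m1, m2 = sum(control) / n1, sum(impact) / n2
ss1 = sum((x - m1) ** 2 for x in control)
ss2 = sum((x - m2) ** 2 for x in impact)
df = n1 + n2 - 2                    # = 1
sp2 = (ss1 + ss2) / df              # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))  # SE of (control mean - impact mean)
t = (m1 - m2) / se

# two-sided p-value; the t distribution with 1 df is the Cauchy
# distribution, whose CDF is 0.5 + atan(x)/pi
p = 2 * (0.5 - atan(abs(t)) / pi)
print(round(m1 - m2, 2), round(se, 4), round(t, 2), round(p, 3))
# 1.15 3.2909 0.35 0.786
```

The estimate (1.15), standard error (3.2909), t-statistic (0.35), and p-value (0.786) match the Proc Ttest output.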
At first glance, it looks as if we have thrown away much information: the multiple quadrats were
averaged for each site-year combination, and only the difference between the averages before and after
impact was used. In fact, all of the relevant information has been captured. The multiple quadrats within
each site-year combination contain information only on the quadrat-to-quadrat variation within each site-year
combination. The mean of these values will automatically capture the fact that 5 or 50 quadrats were
measured: if a large number of quadrats were measured, the variation of the sample mean will be
reduced, which implies a reduction in the variation of the differences, which in turn implies that
the power of the test to detect differences in the mean difference will be improved.
Similarly, a BACI design finds evidence of an impact if the change between the before and after periods
differs between the control and treatment groups; the two-sample t-test above tests exactly the hypothesis
that the mean difference between the before and after periods is the same for the treatment and control
groups.
This was a very small experiment so it is not surprising that no effect was detected. Later on, the power
of this design will be examined.
13.7.2 Using ANOVA on the averages
Rather than going through the contortions of splitting a column and finding differences, it is possible to do
an ANOVA on the averages directly.
This design is an example of a Split-Plot design which is covered in more detail elsewhere in my course
notes on my web pages.
In split-plot designs, there are two levels to the experiment. The first level is at the site level. Here sites
are chosen (at random) from each of the treatment and control areas. Each site is then "split" by taking
measurements at two time periods (before and after).
Split-plot designs are examples of two-factor designs (see my notes on my web page). They differ from
two-factor completely randomized designs where there is only a single level in the experiment. The model
used to analyze data of this form must include terms for both the two factors (and their interaction which is
the BACI effect of interest) but also for the two levels in the experiment. The experimental unit at the lowest
level (in this case the time period within each site) is usually not specified in the model and is implicit;
it corresponds to the Error line in the ANOVA table.
The model to be fit is:
MeanDensity = SiteClass Site(R) Period SiteClass*Period
where SiteClass is the classification of sites as either Impact or Control; Site(R) is the random site effect
within SiteClass (as noted in previous chapters, if each site has a unique label, it is not necessary to use the
nesting notation Site(SiteClass)(R); the sites are treated as a random effect because we wish to make inference
about the impact over all sites in the study area); Period is the classification of time as before or after; and
the SiteClass*Period term represents the BACI contrast of interest.
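Written out in symbols (my notation, not the notes'), the model statement above corresponds to

\[
\bar{y}_{ijk} \;=\; \mu + SC_i + S_{j(i)} + P_k + (SC \times P)_{ik} + \varepsilon_{ijk},
\qquad S_{j(i)} \sim N(0, \sigma^2_{\text{site}}), \quad \varepsilon_{ijk} \sim N(0, \sigma^2),
\]

where \(\bar{y}_{ijk}\) is the cell average, \(SC_i\) is the SiteClass effect, \(S_{j(i)}\) is the random effect of site \(j\) within SiteClass \(i\), \(P_k\) is the Period effect, and \((SC \times P)_{ik}\) is the BACI interaction of interest.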
There are two main procedures in SAS for fitting ANOVA and regression models (collectively called
"linear models"). First is Proc GLM which performs the traditional sums-of-squares decomposition. Second is
Proc Mixed which uses restricted maximum likelihood (REML) to fit models. Proc Mixed has the advantage
of being able to fit more complex BACI models with more than one site and more than one year measured
in the before or after periods. Consequently, I prefer to use Proc Mixed.
ods graphics on;
proc mixed data=meancrabs plots=all;
   title2 'Mixed BACI analysis on averages';
   class SiteClass Period site;  /* class statement identifies factors */
   model Mean_density = SiteClass Period SiteClass*Period / ddfm=kr;
   random site;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'BACI effect' SiteClass*Period 1 -1 -1 1 / cl;
   ods output tests3   =MixedTest;  /* needed for the pdmix800 */
   ods output lsmeans  =MixedLsmeans;
   ods output diffs    =MixedDiffs;
   ods output estimates=MixedEsts;
   ods output covparms =MixedCovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical
model in the Model statement: it is very similar to the statistical model seen earlier. You will also see the
Random statement that specifies that sites are random effects.
Many packages will use a technique called REML (Restricted Maximum Likelihood estimation) to obtain the
test statistics and estimates. REML is a variant of maximum likelihood estimation and extracts the maximal
information from the data. With balanced data, the F-tests are identical to an older method called the
Expected Mean Squares (EMS) method. Modern statistical theory prefers REML over the EMS methods.
The REML method comes with two variants. In one variant (the default with JMP), estimated variance
components are allowed to go negative. This might seem to be problematic as variances must be non-negative!
However, the same problem can occur with the EMS method, and usually only occurs when sample
sizes are small and variance components are close to 0. It turns out that the F-tests are unbiased when
the "unbounded" variance component option is chosen even if some of the variance estimates are negative. As
noted in the JMP help files,
"If you remain uncomfortable about negative estimates of variances, please consider that the
random effects model is statistically equivalent to the model where the variance components are
really covariances across errors within a whole plot. It is not hard to think of situations in which
the covariance estimate can be negative, either by random happenstance, or by a real process
in which deviations in some observations in one direction would lead to deviations in the other
direction in other observations. When random effects are modeled this way, the covariance
structure is called compound symmetry.
So, consider negative variance estimates as useful information. If the negative value is small,
it can be considered happenstance in the case of a small true variance. If the negative value is
larger (the variance ratio can get as big as 0.5), it is a troubleshooting sign that the rows are not
as independent as you had assumed, and some process worth investigating is happening within
blocks.
So if a variance component is negative and has a large absolute value, this is an indication that your model
may not be appropriate for the data at hand. Please contact me for assistance. The default for SAS is to
constrain the variance components to be non-negative and so the F-tests may vary slightly from those of
JMP.
Part of the output (the Fixed Effect Tests) is presented below:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
SiteClass 1 1 3.13 0.3277
Period 1 1 11.07 0.1859
SiteClass*Period 1 1 0.12 0.7860
The test for no interaction between the Time and Treatment effects is the test of no BACI effect, i.e. no
impact. The p-value of 0.786 is identical to that found in the previous analyses. The other F-tests represent
the tests for the main effects of Treatment or Time and are usually not of interest.
The estimates of the variance components provide some information on the relative sizes of variation
found in this experiment:
Cov Parm    Estimate
Site         13.8750
Residual      3.6100
The estimated variance of site-to-site effects is 13.9, while the variation of the average within a site is 3.61.
Note that the latter must be interpreted carefully as it represents the variance of the AVERAGE of the
replicated quadrats and not the quadrat-to-quadrat variation. It also includes the site-time interaction. This
variance component represents the potential for different changes between before and after depending on the site.
For example, in some of the sites, the effect of time may be to increase the response, while in other sites, the
effect of time is to decrease the response (this is after accounting for random error). If this occurs, then you
have serious problems in detecting an environmental impact, as it is not even clear how to properly define an
impact when the direction of the impact is not consistent.
A "naked" p-value should never be reported. Regardless of whether the test of interaction (the BACI effect) is
statistically significant or not, an estimate of the effect should also be reported.
An estimate of the BACI contrast that is independent of the internal parameterization used can be obtained
by looking at the Effect Details, in particular at the Time*Treatment estimates. The BACI effect is the
difference in the differences, i.e.

BACI = (μ_CA - μ_CB) - (μ_TA - μ_TB) = μ_CA - μ_CB - μ_TA + μ_TB

The estimate is found by substituting in the estimated means (called the Least Squares Means):

BACI-hat = 29.35 - 34.25 - 20.2 + 26.25 = 1.15
This is specified by the Estimate statement in Proc Mixed (see above) and gives:

Label        Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha     Lower    Upper
BACI effect    1.1500          3.2909   1     0.35    0.7860   0.05  -40.6648  42.9648
The estimated mean difference-in-the-difference is 1.15 with a standard error of 3.29.
Note that the p-values for the test of the interaction (the BACI effect) are all identical (as they must be)
regardless of the parameterization used.
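The Estimate statement is just a linear contrast of the four SiteClass*Period least-squares means; a sketch of that arithmetic in plain Python:

```python
# Apply the contrast coefficients (1, -1, -1, 1) to the four
# SiteClass*Period least-squares means reported above.
lsmeans = {  # (SiteClass, Period) -> least-squares mean
    ("Control", "After"): 29.35, ("Control", "Before"): 34.25,
    ("Impact", "After"): 20.20, ("Impact", "Before"): 26.25,
}
# SAS orders the cells Control.After, Control.Before, Impact.After,
# Impact.Before, which is why the coefficients are written 1 -1 -1 1
coefs = {("Control", "After"): 1, ("Control", "Before"): -1,
         ("Impact", "After"): -1, ("Impact", "Before"): 1}
baci = sum(coefs[cell] * lsmeans[cell] for cell in lsmeans)
print(round(baci, 2))  # 1.15
```

This reproduces the 1.15 estimate of the BACI effect reported by Proc Mixed.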
We can find the estimates of the marginal means for each effect

proc sort data=crabs;
   by SiteClass Site descending Period;
run;

proc means data=crabs noprint;  /* find simple summary statistics */
   by SiteClass Site descending Period;
   var density;
   output out=means n=n mean=mean std=stddev stderr=stderr lclm=lclm uclm=uclm;
run;

proc sgplot data=means;
   title2 'Profile plot with 95% ci on each mean';
   series y=mean x=Period / group=Site;
   highlow x=Period high=uclm low=lclm / group=Site;
   yaxis label='Density' offsetmin=.05 offsetmax=.05;
   xaxis label='Period' offsetmin=.05 offsetmax=.05 discreteorder=data;
run;
and then get the profile plot:

This shows that the response is almost parallel; it is no wonder no BACI effect was detected.
13.7.3 Using ANOVA on the raw data
Finally, it is possible to do an ANOVA on the raw data. This will correctly account for the imbalance in
the number of quadrats sampled at each site-year combination (and so the p-values will be slightly different
from the above analyses based on averages). The advantage of this analysis method is that it also produces
estimates of all of the variance components, which can then be used for a power analysis.
The analysis of the raw data is a variant of a split-plot design with sub-sampling (the quadrats). The
statistical model will include terms for the site level (the random effects of site) and also a term for the
possible site-time interaction that was subsumed into the residual error in the previous analysis.
The model is
Density = SiteClass Site(R) Period SiteClass*Period Site*Period(R)
where SiteClass is the classification of sites as either Impact or Control; Site(R) is the random site effect
within SiteClass (again, with unique site labels the nesting notation Site(SiteClass)(R) is not necessary; the
sites are treated as a random effect because we wish to make inference about the impact over all sites in the
study area); Period is the classification of time as before or after; the SiteClass*Period term represents
the BACI contrast of interest; and the Site*Period term represents non-parallel changes in the sites over
the two periods. We hope that this variance component will be negligible; otherwise, it indicates that the
effect of before/after varies among the sites as well as between the control and impact classifications. This
is generally not good news.
We again use Proc Mixed to fit the model.

ods graphics on;
proc mixed data=crabs plots=all;
   title2 'Mixed BACI analysis on raw data';
   class SiteClass Period site;  /* class statement identifies factors */
   model density = SiteClass Period SiteClass*Period / ddfm=kr;
   random site site*period;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'BACI effect' SiteClass*Period 1 -1 -1 1 / cl;
   ods output tests3   =Mixed2Test;  /* needed for the pdmix800 */
   ods output lsmeans  =Mixed2Lsmeans;
   ods output diffs    =Mixed2Diffs;
   ods output estimates=Mixed2Ests;
   ods output covparms =Mixed2CovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical
model in the Model statement: it is very similar to the statistical model seen earlier. You will also see the
Random statement that specifies that the Site and Site*Period terms are random effects.
The Effect Tests are:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
SiteClass 1 1.01 2.92 0.3351
Period 1 1.11 9.99 0.1746
SiteClass*Period 1 1.11 0.10 0.8002
Because the number of quadrats was not identical in all site-year combinations, the degrees of freedom can
be fractional (except in R, which does not have the same methods of computing the df as SAS or JMP). Also notice
that the p-value is slightly different from the previous results (but very similar). As long as the imbalance
in the number of quadrats is small, the analyses on the averages, as presented earlier, should not be affected
much. The p-value for the test of interaction is 0.80 and there is no evidence of a BACI effect.
Again, regardless of the p-value, an estimate of the BACI effect should be obtained. The reported
parameter estimates or the indicator parameterization estimates should not be used directly (unless you are
very sure of what they mean!), and again construct the BACI contrast based on the Effect Details:
The estimates of the marginal means (and standard errors) are requested using the LSmeans statement.
The LSMeans estimates are equal to the raw sample means; this will be true ONLY in balanced data. In
the case of unbalanced data (see later), the LSMeans seem like a sensible way to estimate marginal means.
Effect            SiteClass  Period  Estimate  Standard Error    DF  t Value  Pr > |t|  Alpha     Lower    Upper
SiteClass*Period  Control    After    29.3038          3.0704  1.28     9.54    0.0380   0.05    5.7217  52.8859
SiteClass*Period  Control    Before   34.2500          3.0153  1.2     11.36    0.0363   0.05    8.1089  60.3911
SiteClass*Period  Impact     After    20.2000          4.3068  1.25     4.69    0.0967   0.05  -14.5302  54.9302
SiteClass*Period  Impact     Before   26.2500          4.3697  1.32     6.01    0.0644   0.05   -5.6551  58.1551
The BACI contrast is:

BACI = (μ_CA - μ_CB) - (μ_TA - μ_TB) = μ_CA - μ_CB - μ_TA + μ_TB

and

BACI-hat = 29.30 - 34.25 - 20.2 + 26.25 = 1.10

The BACI contrast was specified by the Estimate statement in Proc Mixed (see above) and gives:
Label        Estimate  Standard Error    DF  t Value  Pr > |t|  Alpha     Lower    Upper
BACI effect    1.1038          3.4794  1.11     0.32    0.8002   0.05  -33.8575  36.0650
The estimated mean difference-in-the-difference is 1.10 with a standard error of 3.48, which are very
similar to those from the analysis on the averages seen earlier. As seen earlier, the profile plot does not show
any evidence of non-parallelism.
The estimated variance components are:

Cov Parm     Estimate
Site          14.6854
Period*Site    1.6775
Residual      10.9274
The estimated site-to-site variation is 14.68; the estimated interaction of sites and time is 1.68. This
is small, which is good, because it indicates that the response of the sites over time is fairly consistent. If
this value were large, it would throw into doubt the whole BACI paradigm of this experiment. Finally,
the residual variance component of 10.92 measures the quadrat-to-quadrat variation within each site-year
combination.
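The two analyses can be (approximately) reconciled: the residual variance of 3.61 in the analysis on the averages estimates the variance of a cell average, which is the site-period component plus the quadrat component divided by the number of quadrats. A sketch of that check in Python (approximate because the number of quadrats per cell varies between four and six; n = 5 is used here as a rough typical value):

```python
# Reconcile the variance components from the raw-data analysis with the
# residual variance from the analysis on the averages. The agreement is
# only approximate because the design is unbalanced (4-6 quadrats per
# site-period cell, so n = 5 is only a rough typical value).
var_site_period = 1.6775   # Period*Site component from the raw analysis
var_quadrat = 10.9274      # Residual (quadrat-to-quadrat) component
n_quadrats = 5             # approximate quadrats per site-period cell

# variance of a cell average of n quadrats
var_cell_average = var_site_period + var_quadrat / n_quadrats
print(round(var_cell_average, 2))  # 3.86, close to the 3.61 reported as
                                   # Residual in the averages analysis
```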
13.7.4 Model assessment
Regardless of the model chosen, you should carefully examine the data for outliers and unusual points. If
there are outliers, and the data was incorrectly entered, then, of course, the data should be corrected. If the
outlier is real, then you may wish to run the analysis with and without the outlier present to see what effect
it has. If the final conclusions are the same, then there isn't a problem. However, if inclusion of the outlier
greatly distorts the results, then you have a problem, but it is not something that statistical theory can help
with. If the data point is real, it should be included!
You should also look at the residual plots (in JMP, look under the red triangles to plot residuals). Check to see
that the residuals appear to be randomly scattered around zero.
Here are the model assessment plots from the final model on the raw data:
If you have a large number of sites, it is possible to estimate the site-residuals and plots of these may
also be informative. In many cases, the number of sites is small so these are less than informative.
The assumption of Normality is often made. Unfortunately, with small sample sizes, it is very difficult
to assess this assumption. Fortunately, ANOVA is fairly robust to non-Normality if the standard deviations
are roughly equal in each site-year combination. A transformation (e.g. the ln() transformation) may be
required.
13.8 BACI with Multiple sites; Multiple years before/after
The previous section looked at designs with only a single year of measurement before and after the impact.
This design can be improved by also monitoring for several years before and after the impact. While the
term "year" is used generically in these notes, the same methods can be used when sampling takes place
more frequently than yearly. For example, samples could be taken every season, which would add additional
structure to the data.
Once additional years of monitoring are used, there is now a concern about the potential types of responses
that can occur following impact. For example, the impact could cause a permanent step change in the impact
site(s) that persists for many years. Or, the impact could cause a temporary change in the response variable
that slowly returns to pre-impact levels.
The simplest case is that of a permanent step change after impact that persists for several years. It is
implicitly assumed that the relationship between the control and treatment areas is consistent in the
pre-impact years.
For these types of studies, the most sensible design is the BACI-paired design that was outlined earlier.
Here, permanent monitoring sites are established in each of the control and treatment areas. All sites are
measured each year. [The analyses presented below can deal with cases where not all sites are measured
in all years as long as the missing data are missing completely at random (MCAR). This implies that the
missingness of a measurement at a site is not related to the response or any other covariate in the study.] The
year effects are assumed to operate in parallel across all sites. [This assumption can be relaxed if you have
multiple sites and every site is measured in every year.]
We will also allow for sub-sampling at each year-site combination and will assume that all sub-samples
are independent both within and across years. A case where this is not true, would be the use of permanent
sub-sampling sites within each permanent site. Please contact me for more details on this design.
Conceptually this is diagrammed as:
Note that the BACI contrast looks at changes in the mean response between the before and after periods.
As before, you should enter the data carefully using unique labels for every distinct experimental or
observational unit. For example, if separate quadrats are measured within each site-time combination, then
each quadrat should have a different label. So if five quadrats are measured at site I1 (the treatment site)
and at the two control sites (C1 and C2), for two years both before and after the impact, then there are
5 × 3 × 2 × 4 = 120 quadrats and they should be labelled Q1, Q2, ..., Q120, rather than Q1 to Q5 within
each site-year combination. The latter (but common) labelling is confusing because the data appear to be
coming from a design with permanent sub-sampling quadrats. Even more confusing, how can quadrat Q1
be sampled from two different sites?
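As a sketch of this data-management advice (with a hypothetical layout of three sites, four sampling years, and five quadrats per site-year cell), globally unique labels can be generated with a running counter:

```python
# Generate globally unique quadrat labels: every quadrat in the study
# gets its own label (Q1, Q2, ...) instead of reusing Q1-Q5 within each
# site-year cell. The layout here is hypothetical, for illustration.
from itertools import product

sites = ["I1", "C1", "C2"]
years = ["B1", "B2", "A1", "A2"]  # two years before, two after
quadrats_per_cell = 5

labels = {}
counter = 0
for site, year in product(sites, years):
    for _ in range(quadrats_per_cell):
        counter += 1
        labels.setdefault((site, year), []).append(f"Q{counter}")

all_labels = [q for cell in labels.values() for q in cell]
print(len(all_labels), len(set(all_labels)))  # 60 60 -> all unique
```

With unique labels, no two sites can ever appear to share a quadrat, so the software cannot mistake the design for one with permanent sub-sampling quadrats.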
The key to the analysis of these experiments is to recognize the different sizes of experimental and
observational units in the impact designs. At the top level, the multiple sites within the control and treatment
areas are again the starting point of a split-plot type of design. The effect of year is what is known in statistical
parlance as a "strip" as it operates across all sites regardless of whether it is a treatment or control site. The
quadrats measured in each site-year combination are again treated as pseudo-replicates.
We will proceed by example.
13.9 Example: Counting fish - Multiple years before/after; One site
impact; one site control
This is the example that appears in Smith (2002, Table 6). Samples of fish were taken for a period of 12
months before and 13 months after a nuclear power plant began operations. The power plant is cooled by
water that is drawn from a river. When the water exits the plant, its temperature is elevated. The concern
is that the warmed water will adversely affect the abundance and composition of fish below the plant. [It
wasn't clear from Smith (2002) what the response measure is, so I'll arbitrarily assume that it is counts.]
A schematic of the design is:
except that for this example, there is only one reading per site per year rather than multiple readings.
As before, we can identify the Period (before vs. after), SiteClass (impact vs. control) and the BACI
effect (the interaction between Period and SiteClass). There are also several sources of variation:
The quadrat-to-quadrat variation represents localized effects within each site. The year-to-year variation
represents year-specific factors that cause the response to move higher or lower at both site classes. For
example, a very cool year could cause productivity of a population to be lower than normal in both the
impact and control sites. Again, the site-year interaction variance component represents the non-parallel
behavior of sites over time. We expect year-specific factors to shift the means in the sites up and down
in parallel, but not all sites respond in the same way.
Again note that if only a single measurement is taken in each site-year combination then the quadrat-to-
quadrat and the site-year interaction variance components are completely confounded together and cannot
be separated.
The raw data is available in the baci-fish.csv file available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into
SAS in the usual way:
data fish;
   infile 'baci-fish.csv' dlm=',' dsd missover firstobs=2;
   length SiteClass Site Period trt $20;
   input SampleTime Year Period SiteClass Site Count;
   trt = compress(SiteClass || "." || put(year,2.0) || "." || Period);
run;
Part of the raw data are shown below:
Obs SiteClass Site Period trt SampleTime Year Count
1 Control C1 Before Control.75.Before 1 75 1
2 Impact I1 Before Impact.75.Before 1 75 1
3 Control C1 Before Control.75.Before 2 75 0
4 Impact I1 Before Impact.75.Before 2 75 1
5 Control C1 Before Control.75.Before 3 75 0
6 Impact I1 Before Impact.75.Before 3 75 0
7 Control C1 Before Control.75.Before 4 75 2
8 Impact I1 Before Impact.75.Before 4 75 6
9 Control C1 Before Control.75.Before 5 75 47
10 Impact I1 Before Impact.75.Before 5 75 103
In this study there was one (permanent) site measured in each of the control and impact areas; the
sites are labelled as C1 and I1 respectively. At each site, monthly (paired) readings were taken at sampling
times 1, . . . , 25 denoted by the sampling time variable. Notice that in the pre-impact year, all months were
measured, while in the post-impact period the sampling spans two years. The actual months of sampling were
not given by Smith (2002), so without further assumptions it is not possible to use the calendar months as
another covariate (e.g. counts in January are expected to differ from those in July in some systematic
fashion).
No sub-sampling was conducted so it is not necessary to average over the sub-samples.
As only one site is measured for each of the control and impact areas, this experiment is an example of
pseudo-replication and so conclusions will be limited to inference about these two sites only.
We begin with a simple plot of the data over time. We use Proc SGplot to plot the two series over
time:
proc sgplot data=fish;
   title2 'Profile Plot over time';
   yaxis label='Count' offsetmin=.05 offsetmax=.05;
   xaxis label='Month' offsetmin=.05 offsetmax=.05;
   series x=SampleTime y=count / markerattrs=(symbol=circlefilled) group=Site;
   refline 12.5 / axis=x;
run;
giving:
Notice that the two curves track each other over time, indicating that the pairing by month will be effective
in accounting for the natural variation over the seasons.
As in the previous section, there are several (equivalent) ways to analyze this data.
13.9.1 Analysis of the differences
This is an example of a simple paired design where the sampling times induce the pairing in the observations
across the sites. We start by reshaping the data to create two columns - one for the impact and one for the
control site.
We use Proc Transpose to put the control and impact counts on the same data line and then the difference
in counts is easily computed:
proc sort data=fish;
by SampleTime Period;
run;
proc transpose data=fish out=Transfish;
by SampleTime Period;
var Count;
id SiteClass;
run;
data Transfish;
set Transfish;
diff = Impact - Control;
run;
proc print data=Transfish;
run;
giving the first few lines of the revised data:
Obs SampleTime Period _NAME_ Control Impact diff
1 1 Before Count 1 1 0
2 2 Before Count 0 1 1
3 3 Before Count 0 0 0
4 4 Before Count 2 6 4
5 5 Before Count 47 103 56
6 6 Before Count 63 36 -27
7 7 Before Count 78 6 -72
8 8 Before Count 16 143 127
9 9 Before Count 143 145 2
10 10 Before Count 28 9 -19
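The same reshaping can be sketched outside SAS. The following Python fragment (an illustrative cross-check, not part of the original SAS workflow) rebuilds the wide layout that Proc Transpose produces and computes the differences for the first five sampling times listed in the raw data above:

```python
# Illustrative sketch (Python, not SAS): rebuild the wide layout produced by
# Proc Transpose and compute diff = Impact - Control for the first five
# sampling times shown in the raw-data listing above.
rows = [  # (SampleTime, SiteClass, Count)
    (1, "Control", 1), (1, "Impact", 1),
    (2, "Control", 0), (2, "Impact", 1),
    (3, "Control", 0), (3, "Impact", 0),
    (4, "Control", 2), (4, "Impact", 6),
    (5, "Control", 47), (5, "Impact", 103),
]
wide = {}
for sample_time, site_class, count in rows:
    wide.setdefault(sample_time, {})[site_class] = count

diffs = [wide[t]["Impact"] - wide[t]["Control"] for t in sorted(wide)]
print(diffs)  # [0, 1, 0, 4, 56]
```

These values match the first five entries of the diff column in the transposed data.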
We start with a plot of the differences over the sampling times:
proc sgplot data=Transfish;
   title2 'Plot of the difference (Impact-Control) over time';
   yaxis label='Difference in Count' offsetmin=.05 offsetmax=.05;
   xaxis label='Month' offsetmin=.05 offsetmax=.05;
   series x=SampleTime y=diff / markerattrs=(symbol=circlefilled);
   refline 12.5 / axis=x;
run;
We notice that there is a very large swing in the difference in the before period. This is unfortunate as
this will give rise to a tremendous residual variation and make it difficult to detect effects.
A two-sample t-test is used to compare the difference between the two Periods. Use Proc Ttest to test if
the mean difference is the same in the before and after Periods:
ods graphics on;
proc ttest data=Transfish plot=all dist=normal;
   title2 'T-test on the differences (Impact-Control)';
   class Period;
   var diff;
   ods output ttests    = TtestTest;
   ods output ConfLimits= TtestCL;
   ods output Statistics= TtestStat;
run;
ods graphics off;
This gives:
Variable  Method         Variances  t Value  DF      Pr > |t|
diff      Pooled         Equal      0.59     23      0.5631
diff      Satterthwaite  Unequal    0.57     17.019  0.5734

Variable  Period      N   Mean     Std Error  Lower Limit of Mean  Upper Limit of Mean
diff      After       13  16.0769  7.4408     -0.1351              32.2889
diff      Before      12  7.0833   13.7849    -23.2569             37.4236
diff      Diff (1-2)      8.9936   15.3280    -22.7149             40.7020
Notice that we used the two-sample t-test allowing for unequal variances in the two time periods. If the pooled-variance t-test is used, it replicates the results from the ANOVA in the next section exactly.
The p-value is 0.57 which provides no evidence of a difference in the differences between the two periods
(the BACI contrast). The estimate of the BACI contrast (with se) is also provided.
The large differences are worrisome, as they perhaps indicate a lack of fit of this model. If a non-parametric
test is done, it also fails to detect a difference:
Use Proc Npar1Way to test if the median difference is the same in the before and after Periods using the
Wilcoxon test:
ods graphics on;
proc npar1way data=Transfish plot=all wilcoxon;
   title2 'T-test on the differences (Impact-Control)';
   class Period;
   var diff;
   ods output WilcoxonTest = WilcoxonTest;
run;
ods graphics off;
This gives:
Wilcoxon Two-Sample Test
Statistic 137.0000
Normal Approximation
Z -1.0102
One-Sided Pr < Z 0.1562
Two-Sided Pr > |Z| 0.3124
t Approximation
One-Sided Pr < Z 0.1612
Two-Sided Pr > |Z| 0.3225
Z includes a continuity correction of 0.5.
13.9.2 ANOVA on the raw data
The model for the raw data is
Count = SiteClass Period SiteClass*Period SampleTime(R)
where the term SiteClass represents the control vs. impact effect; the term Period represents the before
vs. after effect; and the term SiteClass*Period represents the BACI effect, i.e. the non-parallel response.
Because there is only one site measured in each of the control or treatment areas, the site effects are completely
confounded with the treatment effects (as happens with pseudo-replicated data). Consequently, it
is not necessary to enter the Site effect into the model. (If Site is entered as a random effect, it will be
displayed with 0 degrees of freedom.) We do have multiple times measured before and
after the impact. If the sampling times use unique labels, we can once again enter SampleTime as a random
effect in the model. Alternatively, you could enter SampleTime(Period) as the random-effect term if the
sampling times were not uniquely labelled.
We again use Proc Mixed to fit the model.
ods graphics on;
proc mixed data=fish plots=all;
   title2 'BACI analysis on raw data using MIXED';
   class SiteClass Period site sampletime;  /* class statement identifies factors */
   model count = SiteClass Period SiteClass*Period / ddfm=kr;
   random SampleTime;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'BACI effect' SiteClass*Period 1 -1 -1 1 / cl;
   ods output tests3   =Mixed2Test;     /* needed for the pdmix800 */
   ods output lsmeans  =Mixed2Lsmeans;
   ods output diffs    =Mixed2Diffs;
   ods output estimates=Mixed2Ests;
   ods output covparms =Mixed2CovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical
model in the Model statement; it is very similar to the statistical model seen earlier. You will also see the
Random statement that specifies that the SampleTime terms are random effects. Be sure that the SampleTime
variable appears in the Class statement. This variable serves as a blocking variable and lets SAS know that
the data are paired at each sampling point.
The tests for the fixed effects are:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
SiteClass 1 23 2.28 0.1444
Period 1 23 0.93 0.3438
SiteClass*Period 1 23 0.34 0.5631
The interaction term is the BACI contrast and the large p-value provides no evidence of an impact.
Never report a naked p-value. Use the LSMeans for the SiteClass*Period interaction to estimate the BACI
contrast as in previous sections:
BACI = (μ_CA − μ_CB) − (μ_TA − μ_TB) = μ_CA − μ_CB − μ_TA + μ_TB
The estimates of the marginal means (and standard errors) are requested using the LSMeans statement.
The LSMeans estimates are equal to the raw sample means; this will be true ONLY with balanced data. In
the case of unbalanced data (see later), the LSMeans seem like a sensible way to estimate marginal means.
Effect            SiteClass  Period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower     Upper
SiteClass*Period  Control    After   13.0000   11.5658    34.5  1.12     0.2688    0.05   -10.4929  36.4929
SiteClass*Period  Control    Before  31.8333   12.0381    34.5  2.64     0.0122    0.05   7.3812    56.2855
SiteClass*Period  Impact     After   29.0769   11.5658    34.5  2.51     0.0168    0.05   5.5840    52.5698
SiteClass*Period  Impact     Before  38.9167   12.0381    34.5  3.23     0.0027    0.05   14.4645   63.3688
The BACI contrast was specified by the Estimate statement in Proc Mixed (see above) and gives:
Label        Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
BACI effect  -8.9936   15.3280    23  -0.59    0.5631    0.05   -40.7020  22.7149
The estimated mean difference-in-the-differences is -8.99 with a standard error of 15.32, matching (up to
the sign convention of the contrast) the results from the analysis of the differences seen earlier.
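As a quick arithmetic cross-check (an illustrative Python sketch, not part of the original SAS output), the BACI estimate can be reproduced from the four cell means in the LSMeans table:

```python
# Reproduce the BACI contrast from the four LSMeans reported above:
# BACI = (CA - CB) - (TA - TB), where C/T = Control/Impact, A/B = After/Before.
mu = {
    ("Control", "After"): 13.0000,
    ("Control", "Before"): 31.8333,
    ("Impact", "After"): 29.0769,
    ("Impact", "Before"): 38.9167,
}
baci = (mu[("Control", "After")] - mu[("Control", "Before")]) - (
    mu[("Impact", "After")] - mu[("Impact", "Before")]
)
print(round(baci, 2))  # -8.99, agreeing with the reported -8.9936
```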
The estimates of the variance components are:
Cov
Parm Estimate
SampleTime 1005.95
Residual 733.04
As seen in the plot, there is huge residual variation over and above the variation among sampling times. This
large residual variation is what makes it difficult to detect any impact. From the previous analysis, much of
this huge residual variation comes from a single sampling time where a very large difference was seen.
Notice that the sampling times do have additional structure. For example, it appears that counts of
fish vary seasonally. It is possible to fit a model that accounts for this seasonality, e.g. if the actual month
of the year were known, but the simple paired analysis already removes the effects of this structure, so this more
complicated analysis is not needed.
13.9.3 Model assessment
The very large differences seen in the first analysis are worrisome. If you extract the residuals and plot them,
the impact of these large residuals is clear. A normal quantile plot also indicates evidence of a lack of
normality.
Here are the model assessment plots from the final model on the raw data:
There is some evidence of non-normality in the residuals, which isn't too surprising given the huge swing in
the differences seen in the earlier analysis.
13.10 Example: Counting chironomids - Paired BACI - Multiple-
years B/A; One Site I/C
This example comes from Krebs (1999), Ecological Methodology, 2nd Edition, Box 10.3.
Estimates of chironomid (a type of invertebrate (an oligochaete worm) present in aquatic sediments)
abundance in sediments were taken at one station above and one below a pulp mill outflow pipe for three years
before plant operation and six years after the plant started.
The raw data is available in the baci-chironomid.csv file available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported
into SAS in the usual way:
data chironomid;
   infile 'baci-chironomid.csv' dlm=',' dsd missover firstobs=2;
   length Period $20;
   input Year ControlCount TreatmentCount Period;
   Diff = TreatmentCount - ControlCount;
run;
The raw data are shown below:
Obs Period Year ControlCount TreatmentCount Diff
1 Before 1988 14 17 3
2 Before 1989 12 14 2
3 Before 1990 15 17 2
4 After 1991 16 21 5
5 After 1992 13 24 11
6 After 1993 12 20 8
7 After 1994 15 21 6
8 After 1995 17 26 9
9 After 1996 14 23 9
In this study there was one (permanent) site measured in each of the control and impact areas. At each site,
yearly (paired) readings were taken, with three baseline years and six post-baseline years. No sub-sampling
was conducted in each year so it is not necessary to average over sub-samples.
As only one site is measured for each of the control and impact areas, this experiment is an example of
pseudo-replication and so conclusions will be limited to inference about these two sites only.
We begin with a simple plot of the data over time. We use Proc SGplot to plot the two series over
time:
proc sgplot data=chironomid;
   title2 'Plot of counts over time';
   yaxis label='Count' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
   series x=Year y=ControlCount / markerattrs=(symbol=circlefilled);
   series x=Year y=TreatmentCount / markerattrs=(symbol=circlefilled);
   refline 1990.5 / axis=x;
run;
giving:
Notice that the two curves track each other over time prior to the impact event, and the shapes are
generally the same after the impact event, indicating that the pairing by year will be effective in accounting
for the natural variation over the years.
There are two (equivalent) ways to analyze this data.
13.10.1 Analysis of the differences
This is an example of a simple paired design where the sampling times (years) induce the pairing in the
observations across the sites. We find the difference between the impact and control readings and then plot
the differences over time:
proc sgplot data=Chironomid;
   title2 'Plot of the difference (Impact-Control) over time';
   yaxis label='Difference in Count' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
   series x=Year y=diff / markerattrs=(symbol=circlefilled);
   refline 1990.5 / axis=x;
run;
A two-sample t-test is used to compare the difference between the two Time periods. Use Proc Ttest to
test if the mean difference is the same in the before and after Periods:
ods graphics on;
proc ttest data=Chironomid plot=all dist=normal;
   title2 'T-test on the differences (Impact-Control)';
   class Period;
   var diff;
   ods output ttests    = TtestTest;
   ods output ConfLimits= TtestCL;
   ods output Statistics= TtestStat;
run;
run;
ods graphics off;
This gives:
Variable  Method         Variances  t Value  DF     Pr > |t|
Diff      Pooled         Equal      4.27     7      0.0037
Diff      Satterthwaite  Unequal    5.94     6.187  0.0009

Variable  Period      N  Mean    Std Error  Lower Limit of Mean  Upper Limit of Mean
Diff      After       6  8.0000  0.8944     5.7008               10.2992
Diff      Before      3  2.3333  0.3333     0.8991               3.7676
Diff      Diff (1-2)     5.6667  1.3274     2.5279               8.8054
Notice that we used the two-sample t-test allowing for unequal variances in the two periods. If the pooled-variance
t-test is used, it replicates the results from the ANOVA in the next section exactly.
The p-value is 0.0009 which provides strong evidence of a difference in the differences between the
two periods (the BACI contrast). The estimate of the BACI contrast is 5.67 (SE 0.95 from the unpooled-variance
analysis; SE 1.33 from the pooled-variance analysis).
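The unequal-variance (Satterthwaite) results can be verified with a short computation. This Python sketch (illustrative only, using just the standard library) reproduces the t-statistic and approximate degrees of freedom from the differences listed in the data table:

```python
# Welch (Satterthwaite) two-sample t-test on the chironomid differences.
import math
from statistics import mean, variance

before = [3, 2, 2]           # Diff = TreatmentCount - ControlCount, 1988-1990
after = [5, 11, 8, 6, 9, 9]  # 1991-1996

est = mean(after) - mean(before)        # the BACI contrast
se2_a = variance(after) / len(after)    # squared SE of the After mean
se2_b = variance(before) / len(before)  # squared SE of the Before mean
t = est / math.sqrt(se2_a + se2_b)
df = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (len(after) - 1) + se2_b ** 2 / (len(before) - 1)
)
print(round(est, 4), round(t, 2), round(df, 3))  # 5.6667 5.94 6.187
```

These agree with the Satterthwaite line of the SAS output above.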
13.10.2 ANOVA on the raw data
The model for the raw data is
Count = SiteClass Period SiteClass*Period Year(R)
where the term SiteClass represents the control vs. impact effect; the term Period represents the before vs.
after effect; and the term SiteClass*Period represents the BACI effect, i.e. the non-parallel response. Because
there is only one site measured in each of the control or treatment areas, the site effects are completely
confounded with the treatment effects (as happens with pseudo-replicated data). Consequently, it is not
necessary to enter the Site effect into the model. (If Site is entered as a random effect, it will be displayed
with 0 degrees of freedom.) We do have multiple years measured before and after the
impact. If the Year variable uses unique labels, we can once again enter Year as a random effect
in the model. Alternatively, you could enter Year(Period) as the random-effect term if the years
were not uniquely labelled.
We again use Proc Mixed to fit the model. (Note that the wide-format chironomid data must first be stacked
into one observation per site-year combination; the stacked data set is called StackChironomid below.)
ods graphics on;
proc mixed data=StackChironomid plots=all;
   title2 'BACI analysis on raw data using MIXED';
   class SiteClass Period year;  /* class statement identifies factors */
   model count = SiteClass Period SiteClass*Period / ddfm=kr;
   random year;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'BACI effect' SiteClass*Period 1 -1 -1 1 / cl;
   ods output tests3   =Mixed2Test;     /* needed for the pdmix800 */
   ods output lsmeans  =Mixed2Lsmeans;
   ods output diffs    =Mixed2Diffs;
   ods output estimates=Mixed2Ests;
   ods output covparms =Mixed2CovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical
model in the Model statement; it is very similar to the statistical model seen earlier. You will also see the
Random statement that specifies that the Year term is a random effect. Be sure that Year is defined in
the Class statement. This variable serves as a blocking variable and lets SAS know that the data are paired
in each year.
The tests for the fixed effects are:
Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
SiteClass 1 7 60.60 0.0001
Period 1 7 9.11 0.0194
SiteClass*Period 1 7 18.23 0.0037
The interaction term is the BACI contrast and the small p-value (0.0037) indicates strong evidence of an
impact. Notice that the p-value from the test for no interaction from the ANOVA matches the p-value of the
Period effect in the t-test assuming that the variances were equal in the two periods.
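That equivalence is easy to verify numerically. This Python sketch (illustrative, standard library only) computes the pooled-variance t-statistic from the differences; squaring it recovers the ANOVA F-statistic for the interaction:

```python
# Pooled-variance t-test on the differences; t^2 equals the ANOVA F for the
# SiteClass*Period (BACI) interaction on 1 and 7 df.
import math
from statistics import mean, variance

before = [3, 2, 2]
after = [5, 11, 8, 6, 9, 9]
n_b, n_a = len(before), len(after)

sp2 = ((n_b - 1) * variance(before) + (n_a - 1) * variance(after)) / (n_b + n_a - 2)
se = math.sqrt(sp2 * (1 / n_b + 1 / n_a))
t = (mean(after) - mean(before)) / se
print(round(t, 2), round(t * t, 2), n_b + n_a - 2)  # 4.27 18.23 7
```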
Never report a naked p-value. Use the LSMeans for the SiteClass*Period interaction to estimate the BACI
contrast as in previous sections:
BACI = (μ_CA − μ_CB) − (μ_TA − μ_TB) = μ_CA − μ_CB − μ_TA + μ_TB
The estimates of the marginal means (and standard errors) are requested using the LSMeans statement.
The LSMeans estimates are equal to the raw sample means; this will be true ONLY with balanced data. In
the case of unbalanced data (see later), the LSMeans seem like a sensible way to estimate marginal means.
Effect            SiteClass  Period  Estimate  Std Error  DF    t Value  Pr > |t|  Alpha  Lower    Upper
SiteClass*Period  Control    After   14.5000   0.7993     10.8  18.14    <.0001    0.05   12.7375  16.2625
SiteClass*Period  Control    Before  13.6667   1.1304     10.8  12.09    <.0001    0.05   11.1741  16.1592
SiteClass*Period  Treatment  After   22.5000   0.7993     10.8  28.15    <.0001    0.05   20.7375  24.2625
SiteClass*Period  Treatment  Before  16.0000   1.1304     10.8  14.15    <.0001    0.05   13.5074  18.4926
The BACI contrast was specified by the Estimate statement in Proc Mixed (see above) and gives:
Label        Estimate  Std Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
BACI effect  -5.6667   1.3274     7   -4.27    0.0037    0.05   -8.8054  -2.5279
The estimated mean difference-in-the-differences is -5.67 with a standard error of 1.33, which (up to the
sign convention of the contrast) is the same as found in the t-test earlier.
The estimates of the variance components are:
Cov
Parm Estimate
Year 2.0714
Residual 1.7619
The year-to-year variance is comparable in size to the residual variation which implies that the year effects
are about the same size as measurement error.
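The Residual component can also be cross-checked against the paired analysis. Because the year effect cancels in the difference Treatment - Control, the variance of a difference is twice the residual variance, so half the pooled variance of the differences should recover the Residual estimate. A sketch (illustrative Python, not SAS):

```python
# Half the pooled variance of the yearly differences should match the
# Residual variance component (1.7619) reported by Proc Mixed.
from statistics import variance

before = [3, 2, 2]           # differences in the Before period
after = [5, 11, 8, 6, 9, 9]  # differences in the After period
pooled = ((len(before) - 1) * variance(before)
          + (len(after) - 1) * variance(after)) / (len(before) + len(after) - 2)
print(round(pooled / 2, 4))  # 1.7619
```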
13.10.3 Model assessment
Here are the model assessment plots from the final model on the raw data:
There is no evidence of any problems in the diagnostic plots.
13.11 Example: Fry monitoring - BACI with Multiple sites; Multiple
years before/after
This example is based (loosely) on a consulting project from an Independent Power Producer who was
interested in monitoring the effects of an in-stream hydroelectric project. The response variable for the
project was the minnow density at different locations in the stream.
The monitoring design has the river divided into six segments of which three are upstream of the diver-
sion and three are downstream of the diversion. In each reach, several sites have been located where minnow
fry congregate. In each of the sites, minnow traps are set for various lengths of time. At the end of the soak-
ing period, the traps are removed and the number of minnows are counted and classied by species. The
counts are standardized to a common period of time to adjust for the different soak-times. [This could be
done directly in the analysis by using the soak-time as a covariate, but the soak-time data was not available.]
An initial pre-project monitoring study was run in 2000 to 2002. The project became operational in late
2002, and post-project monitoring continued in 2003 and 2004.
This design is a variant of a classic Before-After-Control-Impact (BACI) design where presumably the
control sites will be located in the upper reaches and potentially impacted sites in the lower reaches.
13.11.1 A brief digression
A major concern for this study is proper replication and the avoidance of pseudo-replication (Hurlbert,
1984). Replication provides information about the variability of the collected data under identical treatment
conditions so that differences among treatments can be compared to variation within treatments. This is the
fundamental principle of ANOVA.
Consider first a survey to investigate fry density at various locations on a river. A simple design may
take a single sample (i.e. a single fry trap) at each of 4 locations.
These four values are insufficient for any comparison of the fry density across the four locations because
the natural variation present in readings at a particular location is not known.
In many ecological field studies, the concepts of experimental units and randomization of treatments to
experimental units are not directly applicable, making replication somewhat problematic. Replication is
consequently defined as the taking of multiple INDEPENDENT samples from a particular location. The
replicated samples should be located sufficiently far from the first location so that local influences that are
site specific do not operate in common on the two samples. The exact distance between samples depends
upon the biological process. For example, if the locations are tens of kilometers apart, then spacing the
samples hundreds of meters apart will likely do for most situations. This gives rise to the following design
where two SITES are sampled at each location, each with a single trap operating:
Now a statistical comparison can be performed to investigate if the mean response is equal at four lo-
cations. The key point is that the samples should be independent but still representative of that particular
location. Hence, taking two samples from the exact same location, or splitting the sample in two and doing
two analyses on the split sample will not provide true replication.
Consequently, a design where duplicate samples (e.g. multiple traps) are taken from the exact same
location and time would be an example of pseudo-replication.
Note that the data from the last two designs looks identical, i.e. pairs of replicated observations from
four locations. Consequently, it would be very tempting to analyze both experiments using exactly the same
statistical model. However, there is a major difference in interpretation of the results.
The second design with real replicates enables statements to be made about differences in the mean
response among those four general locations. However, the third design with pseudo-replication only allows
statements to be made about differences among those four particular sampling sites, which may not truly
reflect differences among the broader locations.
Obviously the line between real and pseudo-replication is somewhat ill-defined. Exactly how far apart
do sampling sites have to be before they can be considered to be independent? There is no hard and fast rule,
and biological consideration and knowledge of the processes involved in the environmental impact must be
used to make a judgment call.
In terms of the current monitoring design, we hope that the different sites are far enough apart to
be considered independent of each other. The multiple minnow traps within each site would be considered
pseudo-replicates and cannot be treated as independent samples.
Here is a schematic of this general BACI design:
As before, we can identify the Period (before vs. after), SiteClass (impact vs. control) and the BACI
interaction. As well, there are several sources of variation in the monitoring design that influence how much
sampling and when sampling should take place.
At the lowest level is trap-to-trap variation within a single site-year combination. Not all traps will catch
the exact same number of minnows. The number of minnows captured is also a function of how long the
traps soak with longer soak periods leading, in general, to more minnows being captured. However, by
converting to a common soak-time, this extra variation has been reduced.
Next is site-to-site variation within the reaches of the river. This corresponds to natural variation in
minnow density among sites due to habitat and other ecological variation.
Third is yearly variation. The number of minnows captured within a site varies over time (e.g. 2002 vs.
2003) because of uncontrollable ecological factors. For example, one year may have more precipitation or
be warmer, which may affect the fry density in all locations simultaneously.
Finally, there is the site-year interaction which represents the INconsistency in how the sites respond over
time. For example, not all of the control sites respond exactly the same way across the multiple years prior
to impact. Note that if only one trap is measured each year, then the trap-to-trap and the site-year interaction
variance components are completely confounded together and cannot be separated.
13.11.2 Some preliminary plots
The raw data is available in the baci-fry.csv file available at the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual way:
data fry;
   infile 'baci-fry.csv' dlm=',' dsd missover firstobs=2;
   length SiteClass $20 Site $20 Period $20;
   input SiteClass Site Sample Year Period Fry;
   log_Fry = log(fry);
   attrib log_fry label='log(Fry)';
run;
A portion of the raw data are shown below.
Obs SiteClass Site Period Sample Year Fry log_Fry
1 Upstream A Before 1 2000 4 1.38629
2 Upstream A Before 2 2000 47 3.85015
3 Upstream B Before 1 2000 188 5.23644
4 Upstream B Before 2 2000 425 6.05209
5 Upstream C Before 1 2000 50 3.91202
6 Upstream C Before 2 2000 199 5.29330
7 Downstream D Before 1 2000 185 5.22036
8 Downstream E Before 1 2000 258 5.55296
9 Downstream E Before 2 2000 630 6.44572
10 Downstream E Before 3 2000 116 4.75359
As you will see in a few moments, there is a good reason to analyze the log(fry) rather than the fry count
directly.
It is always desirable to examine the data closely for unusual patterns, outliers, etc.
First, let us examine the pattern of when sites are measured over time and how many traps are used each
time, etc. In SAS, use Proc Means to compute the mean and standard deviation of the number of fry in each
year-site combination.
proc sort data=fry;
by year site;
run;
proc means data=fry noprint;
by year site ;
var fry log_fry;
output out=mean_fry n=ncages mean=mean_fry mean_log_fry std=std_fry std_log_fry;
id SiteClass period;
run;
This gives the following (partial) table of means and standard deviations:
Obs  Year  Site  SiteClass   Period  ncages  mean fry  mean log(Fry)  std fry  std log(Fry)
1    2000  A     Upstream    Before  2       25.500    2.61822        30.406   1.74221
2    2000  B     Upstream    Before  2       306.500   5.64427        167.584  0.57675
3    2000  C     Upstream    Before  2       124.500   4.60266        105.359  0.97671
4    2000  D     Downstream  Before  1       185.000   5.22036        .        .
5    2000  E     Downstream  Before  3       334.667   5.58409        265.438  0.84649
6    2000  F     Downstream  Before  2       234.500   5.45103        37.477   0.16050
7    2001  A     Upstream    Before  2       43.000    3.48537        39.598   1.09929
8    2001  B     Upstream    Before  3       316.333   5.72446        91.271   0.32279
9    2001  C     Upstream    Before  1       365.000   5.89990        .        .
10   2001  D     Downstream  Before  1       321.000   5.77144        .        .
The design is unbalanced, with sites having from 1 to 3 samples taken and not all sites being
measured in all years (e.g. site D is only measured in 2000, 2001, and 2003).
The variability in the number of fry is very large, with the standard deviation often being of the same
magnitude as the mean number of fry. It is often quite useful to plot the log(std dev) vs. the log(mean)
to see if there is a relationship between the two. The slope of the line of the relationship can be used
to determine the appropriate transformation based on Taylor's Power Law, which is covered elsewhere in
these notes.
ANOVA assumes that variation in the data is roughly constant here the differences in variation are
too large for ANOVA.
6
Based on the relationship between the log(std dev) and the log(mean), the natural
logarithmic transformation or log(fry)
7
is to be used. This was the reason why the log(fry) was computed
earlier in the data step. A plot of the standard deviation vs. the mean on the log-scale shows a dramatic
improvement:
with the standard deviation not appears to be dependent upon the mean.
Note that the average of values on the log-scale corresponds to the logarithm of the geometric mean on the ordinary scale. The geometric mean is less sensitive to outliers and will always be less than the arithmetic mean. Differences on the log-scale correspond to RATIOS on the anti-log scale. For example, if the counts of fry are reduced by a factor of 2 (i.e. halved), this corresponds to a subtraction on the log-scale of log(2) = 0.69.
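These two facts (mean of logs = log of geometric mean; differences of logs = ratios) can be checked with a few lines of Python. The counts below are hypothetical values chosen for illustration, not data from this example:

```python
import math

# Hypothetical fry counts from one site -- assumed values for illustration.
counts = [40, 80, 160]

# The average on the log scale is the log of the geometric mean.
mean_log = sum(math.log(c) for c in counts) / len(counts)
geo_mean = math.exp(mean_log)            # (40*80*160)^(1/3) = 80
arith_mean = sum(counts) / len(counts)   # 93.33..., always >= geometric mean

# Halving every count subtracts log(2) ~ 0.69 on the log scale.
halved_mean_log = sum(math.log(c / 2) for c in counts) / len(counts)
print(geo_mean, arith_mean, round(mean_log - halved_mean_log, 4))
```

The shift on the log scale is exactly log(2) = 0.6931 no matter what the counts are, which is why effects estimated on the log scale back-transform to multiplicative effects.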
Next look at the plot of the mean log(fry) density over time.
[6] A rough rule of thumb is that the standard deviations in the various groups should not vary by a factor of more than about 5:1.
[7] In statistics, the log() transformation is always the natural logarithm. If you want the common (base 10) logarithm, use the log10() function.
There is no obvious evidence of an impact of the diversion. There is some evidence of a year-to-year effect, with the curves generally moving together (e.g. the downward movement in 2004 at all sites).
13.11.3 Analysis of the averages
We start first with the analysis of the average log(fry) count. The number of traps monitored in each year varies from 1 to 3, so the results from this analysis will not match precisely those from the analysis of the raw data, but the results should be close enough.
We previously saw the following (partial) table of means:
Obs  Year  Site  SiteClass   Period  ncages  mean_fry  mean_log_fry  std_fry  std_log_fry
  1  2000  A     Upstream    Before       2    25.500       2.61822   30.406      1.74221
  2  2000  B     Upstream    Before       2   306.500       5.64427  167.584      0.57675
  3  2000  C     Upstream    Before       2   124.500       4.60266  105.359      0.97671
  4  2000  D     Downstream  Before       1   185.000       5.22036        .            .
  5  2000  E     Downstream  Before       3   334.667       5.58409  265.438      0.84649
  6  2000  F     Downstream  Before       2   234.500       5.45103   37.477      0.16050
  7  2001  A     Upstream    Before       2    43.000       3.48537   39.598      1.09929
  8  2001  B     Upstream    Before       3   316.333       5.72446   91.271      0.32279
  9  2001  C     Upstream    Before       1   365.000       5.89990        .            .
 10  2001  D     Downstream  Before       1   321.000       5.77144        .            .
The model to be fit is:

MeanLogFry = SiteClass Site(R) Period Year(R) SiteClass*Period

where SiteClass is the classification of sites as either Impact or Control; Site(R) is the random site effect within SiteClass [8]; Period is the classification of time as before or after; Year(R) represents the multiple years within each period [9]; and the SiteClass*Period term represents the BACI contrast of interest.

The response variable is the mean log(fry). The fixed effects are SiteClass, Period, and their interaction (which is the BACI contrast of interest).
Next, the sites again serve as the larger experimental unit in a split-plot sense. If the sites are uniquely labelled, then the site-to-site variation can be entered as Site&Random as in previous examples. The year-to-year variation is entered in a similar fashion, i.e. if the years are uniquely labelled, then simply enter Year&Random.

Whenever there is more than one random effect in a model, the usual practice is to enter the interaction of the random effects as well (the Site*Year interaction). This term allows the site effects to vary among years. A large site-by-year interaction is worrisome, as it implies that the effects of years are not consistent across sites and vice versa. In this case, because we are dealing with averages, we don't need to add such a term: it is automatically fit as the residual variance component. If you do add this term, you get exactly the same results.
Proc Mixed is used to fit the model. Be sure that Year is entered on the Class statement.

ods graphics on;
proc mixed data=mean_fry plots=all;
   title2 'BACI analysis on AVERAGES of log(fry)';
   class SiteClass year period site;
   model mean_log_fry = SiteClass period SiteClass*period / ddfm=kr;
   /* because site and year labels are unique, we don't need the
      nesting syntax of year(Period) site(SiteClass) */
   random year site;
[8] As noted in previous chapters, if you label each site with a unique label, it is not necessary to use the nesting notation Site(SiteClass)(R). The sites are considered a random effect because we wish to make inference about the impact over all sites in the study area.
[9] As noted in previous chapters, if you label each year with a unique label, it is not necessary to use the nesting notation Year(Period)(R). The years are considered a random effect because we wish to make inference about the impact over all years in the study area.
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   estimate 'baci contrast' SiteClass*period 1 -1 -1 1 / cl;
   ods output tests3   =MixedTest;   /* needed for the pdmix800 */
   ods output lsmeans  =MixedLsmeans;
   ods output diffs    =MixedDiffs;
   ods output estimates=MixedEsts;
   ods output covparms =MixedCovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical model in the Model statement: it is very similar to the statistical model seen earlier. You will also see the Random statement that specifies that sites are random effects.

REML is a variant of maximum likelihood estimation and extracts the maximal information from the data. With balanced data, the F-tests are identical to an older method called the Expected Mean Squares (EMS) method. Modern statistical theory prefers REML over the EMS method.
The REML method comes in two variants. In one variant (the default with JMP), estimated variance components are allowed to go negative. This might seem problematic, as variances must be non-negative! However, the same problem can occur with the EMS method, and usually only occurs when sample sizes are small and variance components are close to 0. It turns out that the F-tests are unbiased when the unbounded variance component option is chosen, even if some of the variance estimates are negative. As noted in the JMP help files,

"If you remain uncomfortable about negative estimates of variances, please consider that the random effects model is statistically equivalent to the model where the variance components are really covariances across errors within a whole plot. It is not hard to think of situations in which the covariance estimate can be negative, either by random happenstance, or by a real process in which deviations in some observations in one direction would lead to deviations in the other direction in other observations. When random effects are modeled this way, the covariance structure is called compound symmetry.

So, consider negative variance estimates as useful information. If the negative value is small, it can be considered happenstance in the case of a small true variance. If the negative value is larger (the variance ratio can get as big as 0.5), it is a troubleshooting sign that the rows are not as independent as you had assumed, and some process worth investigating is happening within blocks."

So if a variance component is negative and has a large absolute value, this is an indication that your model may not be appropriate for the data at hand. Please contact me for assistance. The default in SAS is to constrain the variance components to be non-negative, and so the F-tests may vary slightly from those of JMP.
The Fixed Effect Tests are:

Type 3 Tests of Fixed Effects

Effect            Num DF  Den DF  F Value  Pr > F
SiteClass              1    4.31     1.92  0.2335
Period                 1    2.94     0.41  0.5692
SiteClass*Period       1    20.3     0.01  0.9310
There is no evidence of a BACI effect, as the p-value is very large. Never simply report naked p-values, so an estimate of the BACI contrast is needed. The BACI effect is the difference in the differences, i.e.

BACI = (mu_CA - mu_CB) - (mu_TA - mu_TB) = mu_CA - mu_CB - mu_TA + mu_TB

where C is the control and T is the treatment site classification, and B is before and A is after the intervention. The estimates of the marginal means (and standard errors) are requested using the LSmeans statement.
Effect            SiteClass   Period  Estimate  Std Error    DF  t Value  Pr > |t|  Alpha   Lower   Upper
SiteClass*Period  Downstream  After     5.4217     0.5192  5.78    10.44    <.0001   0.05  4.1395  6.7040
SiteClass*Period  Downstream  Before    5.6240     0.4974  5.02    11.31    <.0001   0.05  4.3471  6.9009
SiteClass*Period  Upstream    After     4.6394     0.5124  5.51     9.05    0.0002   0.05  3.3583  5.9205
SiteClass*Period  Upstream    Before    4.7320     0.4943  4.91     9.57    0.0002   0.05  3.4545  6.0096
The contrast among these estimates is specified by the Estimate statement in Proc Mixed (see above) and gives:

Label          Estimate  Std Error    DF  t Value  Pr > |t|  Alpha    Lower   Upper
baci contrast   -0.1097     0.2986  17.2    -0.37    0.7178   0.05  -0.7390  0.5196
The estimated mean difference-in-the-differences is -0.11 with a standard error of 0.30. The BACI estimate is small relative to the standard error, which is not surprising given that the hypothesis of no BACI effect was not rejected.

Because we analyzed the data on the log-scale, the estimated effect on the anti-log scale is exp(-0.11) = 0.896, which is interpreted to say that the change in fry density between before and after is about 0.9 times as large in the impact area as in the control area.
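The back-transformation applies to the confidence limits as well as the point estimate, since exp() is monotone. A small Python sketch using the estimate and 95% limits from the Proc Mixed output above:

```python
import math

# BACI estimate and 95% CI on the log scale, from the Proc Mixed output above
est, lcl, ucl = -0.1097, -0.7390, 0.5196

# Back-transform: differences of logs become ratios on the original scale.
ratio = math.exp(est)                     # change in impact area relative to control
ratio_ci = (math.exp(lcl), math.exp(ucl))
print(round(ratio, 3), tuple(round(r, 3) for r in ratio_ci))
```

The interval for the ratio spans 1 (no effect), consistent with the large p-value: the data are compatible with the impact-area change being anywhere from about half to about 1.7 times the control-area change.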
The estimated variance components can be obtained.
Cov Parm   Estimate
Year        0.06060
Site        0.6239
Residual    0.1460
The variance components show that site-to-site effects are about 10x larger than year-to-year effects (see the earlier plot) and about 4x the residual variation (of the averages). Because we are dealing with the averages, the residual variation is a composite of the site*year interaction and the cage-to-cage variation (see the next section).
As always, don't forget to look at the diagnostic plots:
13.11.4 Analysis of the raw data
The raw data can also be analyzed directly. The model to be fit is:

LogFry = SiteClass Site(R) Period Year(R) SiteClass*Period Site*Year(R)

where SiteClass is the classification of sites as either Impact or Control; Site(R) is the random site effect within SiteClass [10]; Period is the classification of time as before or after; Year(R) represents the multiple years within each period (as noted in previous chapters, if you label each year with a unique label, it is not necessary to use the nesting notation Year(Period)(R); the years are considered a random effect because we wish to make inference about the impact over all years in the study area); and the SiteClass*Period term represents the BACI contrast of interest. Because subsampling occurs at each combination of site and year, we also need the interaction of the two random effects, Year*Site&Random.
Proc Mixed is used to fit the model. Be sure that Year is entered on the Class statement.

ods graphics on;
proc mixed data=fry plots=all nobound;
   title2 'BACI analysis on INDIVIDUAL values of log(fry) - unbounded variance components';
   class SiteClass year period site;
   model log_fry = SiteClass period SiteClass*period / ddfm=kr;
[10] As noted in previous chapters, if you label each site with a unique label, it is not necessary to use the nesting notation Site(SiteClass)(R). The sites are considered a random effect because we wish to make inference about the impact over all sites in the study area.
   random year(period) site(SiteClass) year*site(period SiteClass);
   estimate 'baci contrast' SiteClass*period 1 -1 -1 1 / cl;
   lsmeans SiteClass / cl adjust=tukey;
   lsmeans Period / cl adjust=tukey;
   lsmeans SiteClass*Period / cl adjust=tukey;
   ods output tests3   =Mixed2Test;   /* needed for the pdmix800 */
   ods output lsmeans  =Mixed2Lsmeans;
   ods output diffs    =Mixed2Diffs;
   ods output estimates=Mixed2Ests;
   ods output covparms =Mixed2CovParms;
run;
ods graphics off;
The Class statement specifies the categorical factors for the model. Notice how you specify the statistical model in the Model statement: it is very similar to the statistical model seen earlier. You will also see the Random statement that specifies that sites are random effects.
The Fixed Effect Tests again show no evidence of a BACI effect:

Type 3 Tests of Fixed Effects

Effect            Num DF  Den DF  F Value  Pr > F
SiteClass              1      23     2.28  0.1444
Period                 1      23     0.93  0.3438
SiteClass*Period       1      23     0.34  0.5631
The estimate of the BACI contrast:

BACI = (mu_CA - mu_CB) - (mu_TA - mu_TB) = mu_CA - mu_CB - mu_TA + mu_TB

where C is the control and T is the treatment site classification, and B is before and A is after the intervention, is found in the usual way. The estimates of the marginal means (and standard errors) are requested using the LSmeans statement.
Effect            SiteClass   Period  Estimate  Std Error    DF  t Value  Pr > |t|  Alpha   Lower   Upper
SiteClass*Period  Downstream  After     5.4751     0.5198  6.48    10.53    <.0001   0.05  4.2257  6.7245
SiteClass*Period  Downstream  Before    5.6710     0.4942  5.44    11.47    <.0001   0.05  4.4307  6.9113
SiteClass*Period  Upstream    After     4.5908     0.5096  6.13     9.01    <.0001   0.05  3.3500  5.8315
SiteClass*Period  Upstream    Before    4.7597     0.4796  4.92     9.92    0.0002   0.05  3.5204  5.9990
The contrast among these estimates is specified by the Estimate statement in Proc Mixed (see above) and gives:

Label          Estimate  Std Error    DF  t Value  Pr > |t|  Alpha    Lower   Upper
baci contrast  -0.02694     0.3073  20.3    -0.09    0.9310   0.05  -0.6672  0.6134
The estimated mean difference-in-the-differences is -0.027 with a standard error of 0.31. Although this value appears to be smaller than that obtained from the analysis of the averages, the difference is just an artefact of the differing number of samples taken in some years.
The estimated variance components:

Cov Parm              Estimate
Year(Period)           0.06915
Site(SiteClass)        0.5742
Year*Site(Site*Peri)  -0.1884
Residual               0.6536
tell a similar story to the analysis of the averages. Note that the estimate of the Year*Site variance component is very small (and is actually reported as negative by SAS or JMP if the variance components are left unbounded). Of course this is silly, as variances must be positive, but it just indicates that the actual variance component is small (i.e. close to 0) and the negative value is an artefact of the data. You could re-run the model constraining the variance components to be non-negative; the results are similar. Consult the SAS code for more details.
Also note that because we have analyzed the individual measurements, we have separated the residual variance component (from the analysis of the averages in the previous section) into its site*year interaction and cage-to-cage variation components. Ignoring for now the problem of negative variance components, you see that the residual variance from the analysis of the averages (0.15 as reported by SAS and JMP), which was a composite of the site*year and cage-to-cage (residual) variation, is approximately equal to the current site*year component plus the variance of the cage averages (based on an average of 1.85 cages per year), or 0.15 ~ -0.19 + 0.65/1.85 = 0.16.
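The arithmetic behind that reconciliation can be checked directly. A short Python sketch using the variance component estimates reported above:

```python
# Check that the residual variance from the analysis of averages is roughly
# the site*year component plus the cage-level variance of a mean of ~1.85 cages.
site_year = -0.1884      # site*year interaction (negative; unbounded REML)
cage      = 0.6536       # cage-to-cage (residual) variance
n_cages   = 1.85         # average number of cages per site-year

composite = site_year + cage / n_cages
print(round(composite, 2))   # close to the 0.15 from the analysis of averages
```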
13.11.5 Power analysis
A power analysis of these types of designs is very difficult to do by hand (!) and still surprisingly tedious (but possible) to do in JMP. Here is where SAS and R are extremely useful, as you can construct a power analysis fairly simply as outlined in the papers:
Stroup, W. W. (1999). Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15-24 in Proceedings of the Biopharmaceutical Section, American Statistical Association, Baltimore, MD. Available at: http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/Power/stroup-1999-power.pdf.
or
Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D., and Schabenberger, O. (2006), SAS for Mixed Models. 2nd Edition. Chapter 12.
Some examples of the use of this method are available at: http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/Power/.
The basic idea of Stroup's method is to generate dummy data with NO variability, with each observation taking on the mean of the treatment applied to it. In the fry example, you would generate data corresponding to the 5 years of data, 6 sites, and multiple fry traps, with the log(fry) values taking one of the 4 means corresponding to treatment-before impact, treatment-after impact, control-before impact, and control-after impact. The BACI contrast among these values should represent a biologically interesting difference. This mean data is then analyzed using the BACI model, using estimates of the unknown variance components you think are appropriate for your study. These values could be obtained from real-life data (as above) or from expert opinion. Then a simple function of the F-statistic gives you the estimated power of your design.
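The "simple function of the F-statistic" is a noncentral-F probability: the F value from fitting the model to the exemplary (noise-free) means supplies the noncentrality parameter. SAS computes this exactly; as a rough stdlib-only Python sketch of the same idea for a single-df contrast (like the BACI contrast), a normal approximation can stand in for the noncentral F. The F value and df below are assumed values for illustration, not from the fry study:

```python
from statistics import NormalDist

def stroup_power_1df(F_exemplary, alpha=0.05):
    """Normal approximation to Stroup's method for a 1-df contrast.
    Fitting the model to exemplary (noise-free) means yields an F whose
    noncentrality is nc = F; with 1 numerator df the test behaves roughly
    like a two-sided z-test shifted by sqrt(nc) (large denominator df)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = F_exemplary ** 0.5
    # two-sided power: P(|Z + shift| > z_crit)
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

# Hypothetical exemplary-data F statistic -- an assumed value
print(round(stroup_power_1df(6.0), 2))
```

With F from the exemplary data equal to 0 (no effect), the function returns the alpha level, as it should; larger exemplary F values give larger power. For real designs, use the exact noncentral-F calculation as in the Stroup (1999) paper and the SAS programs cited below.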
A SAS program illustrating the method is available in the Sample Program Library (http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms). For this study, we fixed the estimated variance components as:

sigma_site - STANDARD DEVIATION of the site-to-site effects = 0.75;
sigma_year - STANDARD DEVIATION of the year-to-year effects = 0.25;
sigma_site*year - STANDARD DEVIATION of the site-by-year interaction = 0.10;
sigma_residual - STANDARD DEVIATION of the residual variation = 0.75.
Note that because the same sites are measured in all years (both before and after the impact starts), the sites serve as blocks and so the site-to-site variation does NOT affect the power. Similarly, because all sites are measured in all years, the years also serve as blocks and so the year-to-year variation also does not affect the power. Only the site-by-year variance and the residual variance affect the power of the design directly.
We also examined a number of scenarios for different numbers of years and different numbers of sub-samples. Here are some results:

alpha  n_TA  n_TB  n_CA  n_CB  ns_T  ns_C  ny_B  ny_A  mu_TA  mu_TB  mu_CA  mu_CB  Power
 0.05     3     3     3     3     3     3     3     2      5    5.5    4.5    4.5   0.30
 0.05     6     6     6     6     3     3     3     2      5    5.5    4.5    4.5   0.51
 0.05     9     9     9     9     3     3     3     3      5    5.5    4.5    4.5   0.76
 0.05     6     6     6     6     3     3     3     4      5    5.5    4.5    4.5   0.67
Given the very large sub-sampling STANDARD DEVIATIONS, it certainly is useful to increase the sub-sampling from 3 to 9 units per year-site combination to obtain power around the target of 80%. If the sampling is extended in the future, i.e. more years of sampling after the impact, the power also increases.

The power programs assume balanced data with no missing values, but can be modified to account for planned imbalance and missingness if needed. More details on the power computations are available later in this chapter.
13.12 Closing remarks about the analysis of BACI designs
As the examples above show, the analysis of complex BACI designs takes some care. It is inadvisable to simply bash the numbers through a statistical package without a good understanding of exactly how the data were collected!

The key to a successful analysis of a complex experiment is the recognition of the multiple levels (or sizes of experimental units) in the study. For example, with multiple control sites and multiple quadrats measured in each site-time combination, there are two sizes of experimental units: the site and the quadrat.
It is impossible to provide a cookbook to give examples of all the possible BACI analyses. Rather than
relying on a cookbook approach, it is far better to use a no-name approach to model building.
The no-name approach starts by adding terms to the model corresponding to the treatment effects:

Y = Treatment Time Treatment*Time . . .

where the term Treatment takes the values impact and control, and the term Time represents the values before and after. These terms in the model represent the main effects (i.e. in the absence of an interaction, these represent the site and temporal effects). The interaction term in the model again represents the potential environmental impact, i.e. is the change in the mean between before and after the same for the control and treatment sites?
To this model, terms must be added representing the various levels in the experiment, i.e. sites, years, etc. These terms can be specified in one of two ways. For example, the site term can be added as:

M1: Y = Treatment Time Treatment*Time Site(Treatment)&R . . .
M2: Y = Treatment Time Treatment*Time Site&R . . .

Both models are equivalent. The additional term Site(Treatment) is the traditional specification for the multiple sites within each treatment when the sites are NOT given unique labels, while the term Site can be used (with modern software) when the site labels are unique.

The &R notation indicates that you need to specify these as random effects, i.e. you wish to generalize the results to more than the specific sites used in this experiment.
Similarly, if you have multiple years, add a term for the multiple years in a similar fashion.
Finally, if you have subsampling, you need to specify a term that indicates to JMP how the sub-samples should be grouped. This is often a combination of year and site, and so the Year*Site&R effect was added to the model.

Some textbooks recommend that additional interaction terms be added to the model (e.g. the site*treatment interaction term). I find that unless you have a very large experiment, the data are often too sparse (especially if not all sites are measured in all years) to add these extra interaction terms. I haven't seen a real live experiment where they were useful.
After fitting the model, be sure to assess the fit of the model by looking at residual plots and the like.
13.13 BACI designs: power analysis and sample size determination
13.13.1 Introduction
The basic concepts of power analysis are explained elsewhere in these course notes (e.g. in the first chapter on ANOVA). Briefly, power is the ability to detect a difference when that difference exists. The power of a design depends on several features:

The alpha level. This is often set to 0.05 or 0.10. This represents the strength of evidence required before an effect is declared statistically significant.
The variance components. You will need information on the variance components of all parts of your design. Depending upon the exact BACI design, there can be up to four variance components for which estimates will be needed: the site-to-site variance, the year-to-year variance, the site-year interaction variance, and the within site-year (typically quadrat or measurement) variance component. A preliminary study or literature review can be very helpful in getting these components. As you will see later, some of the variance terms have no impact on the power analysis. For example, measuring the same sites over time, or pairing measurements across sites by year, will allow you to block and remove these sources of variation from the analysis. This is the key advantage of re-using the sites in each year or taking paired measurements over time.
The sample size. Here there are different levels to the samples. First is the number of quadrats measured in each site-year combination. This could vary, but for most power analyses it is commonly assumed to be constant. Second is the number of sites in the treatment or control areas. In many cases there is only a single treatment site, but the theory below is completely general and allows for multiple sites in the impacted areas. Lastly, how many years are measured before and after the impact?
The effect size of interest. This is the most difficult part of the power analysis. You need to figure out how large a BACI effect is important to detect. This seems rather obtuse, but can be obtained by drawing pictures of possible BACI outcomes (using hypothetical values) until people are sure that such an effect needs to be detected. This will be illustrated below.
What is a reasonable target for power? There are two rules of thumb commonly used. First, at alpha = 0.05 a target power of 0.80 = 80% is required; second, at alpha = 0.10 a target power of 0.90 = 90% is required. These guidelines are somewhat arbitrary, but are widely used.
For many of the designs in this chapter, I have provided an Excel spreadsheet and JMP, R, or SAS code
to help you determine power and sample size for various types of designs.
Let us consider the elements in more detail for general BACI designs, before considering power and sample size for a specific design.
Effect size of interest. The effect size used in the power analysis is the BACI contrast. For many people this is difficult to ascertain. The spreadsheet contains an interactive graph that shows the BACI plot for choices of the means before/after and control/treatment. Modify the values in the table until the graph shows a biologically significant difference that is important to detect.

The spreadsheet baci-power-prep has a section in the worksheet that can be used to determine the size of the BACI contrast that is of biological importance.
For example, in the crab example, we started with both sites having mean densities of about 35 crabs/quadrat. We suspect that both populations may decline over time, but if the difference in decline is more than about 5 crabs, we will be highly interested in detecting this. We varied the means in the table until the corresponding graph showed an effect that appears to be biologically important.
The BACI contrast value is then obtained in the usual way:

BACI = (mu_CA - mu_CB) - (mu_TA - mu_TB) = mu_CA - mu_CB - mu_TA + mu_TB = 30 - 35 - 25 + 35 = 5

If a negative value is obtained for the BACI contrast, use the absolute value.
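The difference-in-differences computation above can be packaged as a tiny helper, here using the crab-example means from the text:

```python
def baci_contrast(mu_CA, mu_CB, mu_TA, mu_TB):
    """Difference-in-differences: (Control After - Control Before)
    minus (Treatment After - Treatment Before)."""
    return (mu_CA - mu_CB) - (mu_TA - mu_TB)

# Crab example from the text: control declines 35 -> 30, treatment 35 -> 25
effect = baci_contrast(30, 35, 25, 35)
print(abs(effect))   # use the absolute value if the contrast is negative
```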
Variance components

The site-to-site variance component measures how much the mean response varies among sites because of local, site-specific effects (e.g. one site may have slightly better habitat than another). Fortunately, a good estimate of this variance component is not crucial, as it cancels because the same sites are repeatedly measured over time. A very rough rule of thumb is to take the range in means over sites and use range/4 as the standard deviation.
The year-to-year variance component measures how year-specific factors (e.g. a year may be wetter than normal, which tends to increase productivity at all sites) affect the response. Again, good estimates of this variance component are not that crucial because if all sites are measured in all years (i.e. a blocked design), the year effect is accounted for in the analysis and again cancels, in the same way that blocking effects are removed.
The site-year interaction variance component measures how inconsistent the responses at the sites are over time. Hopefully, it is small. Typical values found in practice are a site-year standard deviation of about 20-50% of the site-to-site standard deviation. If yours is smaller than this, you are doing well! This typically is the limiting factor in a BACI study, as nothing in the experiment can be modified to account for this variance component.
The sub-sampling (or quadrat-to-quadrat or trap-to-trap) variance component measures how variable the repeated measurements taken in the same site-year combination are. A very rough rule of thumb for count data is that the standard deviation is about the square root of the average count, under the assumption of a Poisson distribution. It can be, and often is, higher.

Estimates of these variance components can be obtained from a preliminary study, but notice that at least two years of preliminary data are required to estimate the year-to-year and year-site variance components.
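The square-root rule of thumb also gives a quick check for overdispersion: compare the observed standard deviation to sqrt(mean). A Python sketch using the (mean, sd) pairs from the fry table earlier in this section:

```python
import math

# Rule of thumb: counts that are roughly Poisson have sd ~ sqrt(mean), so
# the ratio of the observed sd to sqrt(mean) flags extra dispersion.
# The (mean, sd) pairs are from the fry table earlier in this section.
for mean_count, obs_sd in [(25.5, 30.406), (306.5, 167.584), (124.5, 105.359)]:
    poisson_sd = math.sqrt(mean_count)
    print(round(obs_sd / poisson_sd, 1))
```

The ratios are well above 1 for these data: the fry counts are far more variable than Poisson, which is consistent with the need for the log transformation used in the analysis.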
Sample sizes. There are two levels to the sample size specification.

At the lowest level is the number of quadrats (sub-samples) measured at each site-time combination. This is usually kept fixed for all sites, all times, and both treatment and control sites, but it is allowed to vary. However, all sites that are measured in the same treatment-time combination are assumed to have the same number of quadrats. If you wish different sites to have different numbers of sub-samples, please contact me.

At the upper level, you need to specify the number of sites measured in the treatment and control areas. In many cases a single site is measured in the treatment area, but there are cases where multiple treatment sites can be measured. It is assumed that you will have at least 2 control sites measured for this design. Similarly, at the upper level, you need to specify the number of years (before and after impact) for which monitoring will take place.
Alpha level. Specify the alpha level (usually 0.05 or 0.10). In many cases, the choice of power and alpha level are given externally. For example, the Canadian Department of Environment often specifies that a power of 0.90 is the target at alpha = 0.10.
Next we will examine how to determine the power/sample size for many of the BACI designs considered
in this chapter.
13.13.2 Power: Before-After design
In Before-After studies, there are three attributes under the control of the experimenter:

The number of years of monitoring before and after the impact;
The number of locations monitored (the number of streams in the earlier example);
The number of measurements taken at each year-location combination (the number of quadrats measured in the earlier example).
Each of these attributes attempts to control a particular source of variation. In the Before-After study, there are four sources of variation:

Year-to-Year variation. This is caused by year-specific factors such as increased rainfall in a year. A year-specific effect would impact all of the locations in the study.

Location-to-Location variation (Stream-to-Stream variation in the earlier example). Not all locations have exactly the same mean response in a particular year. For example, some locations may be better habitat and generally have higher densities of organisms. We try to control for the location-specific effects by measuring the same locations in all years of the study. In this way, location is like a blocking factor, and so the effects of location can be removed from the study.
Location-Year interaction variation (Stream-Year interaction in the earlier example). Even though there are systematic differences among the streams, they may not all respond the same to year-specific factors. For example, locations with better habitat may be better able to weather a downturn in rainfall, and so the change in density of organisms may not be as great as in locations with poorer habitat. A large Location-Year interaction variance would indicate that the results of the before-after study are not very reproducible across locations.

Residual variation (Quadrat-to-quadrat variation in the earlier example). Not all sampling sites at a location in a year have identical responses. For example, if the responses are smallish counts, then the number of organisms in different sample units tends to follow a Poisson distribution.
In some cases, some of the variance components are confounded. For example, if only a single location (stream) is measured, then it is impossible to disentangle the year-to-year variation from the stream-year interaction variation, and the combined variation is simply labelled as year-to-year variation.

In the following sections, we will show how to compute the power for both single- and multiple-location before-after studies.
Single Location studies
In these studies, only a single location is measured for multiple years before and after the impact, but at
several sampling sites (quadrats) in each year. The full statistical model is:

Y_ijk = mu_i + year_ij + quadrat_ijk

where Y_ijk is the measured response in period i (either before or after the impact), in year j (of that period), and in quadrat k; mu_i is the mean response in period i; year_ij is the year-specific (random) effect with variance sigma^2_year; and finally quadrat_ijk is the residual random error with variance sigma^2.

For simplicity, assume that the same number of quadrats (n_q) is measured in each year. Then, as seen in the analysis of subsampling designs, the analysis of the averages is identical to that of the original data. The statistical model for the average response is then:

Ybar_ij = mu_i + year_ij + quadratbar_ij

where now quadratbar_ij is the residual random error with variance sigma^2 / n_q. The variance of the yearly average is:

V[Ybar_ij] = sigma^2_year + sigma^2 / n_q
We are now in the situation of a simple two-sample t-test whose power depends on the above variance
and the number of years before and after the impact. Standard power programs for a two-sample t-test can
be used. From the previous example, we found that the estimate of the combined year-to-year and stream-year interaction variance (labelled as the year variance) was 33.79 and the quadrat-to-quadrat variation (the residual variation) was 59.17. We had 5 years of before studies and want to know the power for various numbers of years after impact, assuming that a difference of 10 in the mean density is important to detect.
We will assume that we can measure 10 quadrats each year.
Then the variance of the yearly average is

$$V[\overline{Y}_{ij}] = \sigma^2_{year} + \frac{\sigma^2}{n_q} = 33.79 + \frac{59.17}{10} = 39.707$$

and the standard deviation of the yearly average is $\sqrt{39.707} = 6.3$.
A sample program called invert-power.sas is available in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The power is then computed using
Proc Power as:
ods graphics on;
proc power;
   title 'Power analysis for before/after analysis with one stream based on analysis of averages';
   footnote  'Residual std dev of 6.3 assumes an average of 10 quadrats/year';
   footnote2 'Inference limited to that single stream and cannot be generalized to other streams';
   /* We vary the total sample size to see what power is obtained */
   twosamplemeans
      test=diff                 /* test for differences in the mean */
      meandiff=10               /* size of difference to be detected */
      stddev=6.3                /* sqrt(33.79 + 59.17/10), i.e. year var + quad var/n-quad */
      power=.                   /* solve for power */
      alpha=.05                 /* alpha level for the test */
      sides=2                   /* a two-sided test for a difference in the mean */
      groupns=5 | 5 to 20 by 1  /* number of years BEFORE (5) and AFTER (5 to 20 by 1) */
   ;  /* end of the twosamplemeans statement - DON'T FORGET THIS */
   ods output output=power200;
run;
ods graphics off;
which gives the following power computations:
Power analysis for before/after analysis with one stream based on analysis
of averages
Obs Analysis Index Sides Alpha MeanDiff StdDev N1 N2 NullDiff Power
1 TwoSampleMeans 1 2 0.05 10 6.3 5 5 0 0.596
2 TwoSampleMeans 2 2 0.05 10 6.3 5 6 0 0.647
3 TwoSampleMeans 3 2 0.05 10 6.3 5 7 0 0.686
4 TwoSampleMeans 4 2 0.05 10 6.3 5 8 0 0.718
5 TwoSampleMeans 5 2 0.05 10 6.3 5 9 0 0.743
6 TwoSampleMeans 6 2 0.05 10 6.3 5 10 0 0.764
7 TwoSampleMeans 7 2 0.05 10 6.3 5 11 0 0.781
8 TwoSampleMeans 8 2 0.05 10 6.3 5 12 0 0.796
9 TwoSampleMeans 9 2 0.05 10 6.3 5 13 0 0.808
10 TwoSampleMeans 10 2 0.05 10 6.3 5 14 0 0.819
11 TwoSampleMeans 11 2 0.05 10 6.3 5 15 0 0.828
12 TwoSampleMeans 12 2 0.05 10 6.3 5 16 0 0.836
13 TwoSampleMeans 13 2 0.05 10 6.3 5 17 0 0.843
14 TwoSampleMeans 14 2 0.05 10 6.3 5 18 0 0.849
15 TwoSampleMeans 15 2 0.05 10 6.3 5 19 0 0.855
16 TwoSampleMeans 16 2 0.05 10 6.3 5 20 0 0.860
Residual std dev of 6.3 assumes an average of 10 quadrats/year
Inference limited to that single stream and cannot be generalized to other
streams
It turns out that you would need at least 13 years measured after impact to have an 80% power to detect a difference of 10(!).

Note that the limiting factor for the power of this design is the combined year-to-year and year-stream interaction variance. Even if a very large number of quadrats (i.e. $n_q$) were measured each year, the repeated quadrats only drive down the final (quadrat) variance contribution and have no effect on the influence of the year-specific and year-stream interaction effects.
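If SAS is not available, the same two-sample t-test power computation can be reproduced from first principles. The sketch below (Python, purely illustrative; all function names are my own) evaluates the power by integrating the noncentral t distribution over its chi-square mixing variable, so no statistics library is needed:

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chi2_pdf(v, df):
    """Chi-square density with df degrees of freedom."""
    if v <= 0.0:
        return 0.0
    return v ** (df / 2.0 - 1.0) * math.exp(-v / 2.0) / \
        (2.0 ** (df / 2.0) * math.gamma(df / 2.0))

def t_upper_tail(c, df, delta=0.0, steps=4000):
    """P(T > c) for T = (Z + delta)/sqrt(V/df), V ~ chi-square(df),
    computed by midpoint integration over V (central t when delta = 0)."""
    vmax = df + 10.0 * math.sqrt(2.0 * df)   # covers essentially all chi-square mass
    h = vmax / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        s = math.sqrt(v / df)
        total += (1.0 - norm_cdf(c * s - delta)) * chi2_pdf(v, df) * h
    return total

def t_crit(alpha, df):
    """Two-sided critical value: solve P(T > c) = alpha/2 by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_upper_tail(mid, df) > alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_sample_power(meandiff, sd, n1, n2, alpha=0.05):
    """Power of the two-sided two-sample t-test with equal variances."""
    df = n1 + n2 - 2
    se = sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    delta = meandiff / se
    c = t_crit(alpha, df)
    # reject if T > c or T < -c; by symmetry the lower tail equals the
    # upper tail with the sign of the noncentrality flipped
    return t_upper_tail(c, df, delta) + t_upper_tail(c, df, -delta)

# Two rows of the Proc Power table: 5 before-years vs 5 or 13 after-years
print(round(two_sample_power(10, 6.3, 5, 5), 3))
print(round(two_sample_power(10, 6.3, 5, 13), 3))
```

The computed values should agree closely with the Proc Power column (0.596 and 0.808 for 5 and 13 after-years respectively).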
Multiple Location studies
To improve the power, the same multiple locations can be measured in all years. As you will see in a minute, the multiple locations serve as blocks and so the location effects are removed from the analysis. Furthermore, the multiple locations (streams) reduce the location-year (stream-year) interaction contribution. But the year-to-year variation is still limiting unless you move to a full BACI design, where years now serve as blocks and the year effect can be removed.
The full statistical model is:

$$Y_{ijkl} = \mu_i + year_{ij} + stream_k + streamyear_{ijk} + quadrat_{ijkl}$$

where $Y_{ijkl}$ is the measured response in period $i$ (either before or after the impact), in year $j$ (of that period), stream $k$, and in quadrat $l$; $\mu_i$ is the mean response in period $i$; $year_{ij}$ is the year-specific (random) effect with variance $\sigma^2_{year}$; $stream_k$ is the stream-specific effect with variance $\sigma^2_{stream}$; $streamyear_{ijk}$ is the stream-year interaction effect with variance $\sigma^2_{streamyear}$; and finally $quadrat_{ijkl}$ is the residual random error with variance $\sigma^2$.
From the previous analysis of the raw data, the estimated variance components are: $\sigma^2_{year} = 36.64$; $\sigma^2_{stream} = 38.312$; $\sigma^2_{stream\text{-}year} = 0$; and $\sigma^2_{quadrat} = 66.35$.
The method of Stroup can be used with SAS to estimate the power to detect a specified difference for any general design with varying numbers of locations measured each year, changing numbers of quadrats measured at each year-location combination, etc. The code is too complicated to present here; check the invert-power.sas program in the Sample Program Library. For the special case of 4 streams and 5 quadrats measured per stream per year, the resulting output is:
 Obs  before  after  streams  quadrats  YearVar  StreamVar  StreamYearVar  QuadVar  Effect  Power
   1       5      5        4         5    36.64     38.312              0    66.35  period   0.59
   2       5      6        4         5    36.64     38.312              0    66.35  period   0.64
   3       5      7        4         5    36.64     38.312              0    66.35  period   0.68
   4       5      8        4         5    36.64     38.312              0    66.35  period   0.71
   5       5      9        4         5    36.64     38.312              0    66.35  period   0.74
   6       5     10        4         5    36.64     38.312              0    66.35  period   0.76
   7       5     11        4         5    36.64     38.312              0    66.35  period   0.78
   8       5     12        4         5    36.64     38.312              0    66.35  period   0.79
   9       5     13        4         5    36.64     38.312              0    66.35  period   0.81
  10       5     14        4         5    36.64     38.312              0    66.35  period   0.82
  11       5     15        4         5    36.64     38.312              0    66.35  period   0.83
  12       5     16        4         5    36.64     38.312              0    66.35  period   0.83
  13       5     17        4         5    36.64     38.312              0    66.35  period   0.84
  14       5     18        4         5    36.64     38.312              0    66.35  period   0.85
  15       5     19        4         5    36.64     38.312              0    66.35  period   0.85
  16       5     20        4         5    36.64     38.312              0    66.35  period   0.86

(before/after = number of years before/after impact; streams and quadrats = numbers measured per year.)
There is no simple way to use the Stroup method in R or JMP. However, in the special case when all locations are measured in all years, there are $n_s$ locations (streams), and the same number of quadrats ($n_q$) is measured in each year in each location, rather simple computations can be done and the power analysis from a two-sample t-test can be used.

Because all locations are measured in all years, they play the same role as blocks and so the location (stream) effect is removed from the analysis. The location (stream) variance component plays no part in the power determination. In order to use the power functions from a two-sample t-test, compute the variance:

$$s^2 = \sigma^2_{year} + \frac{\sigma^2_{location\text{-}year}}{n_s} + \frac{\sigma^2_{quadrat}}{n_s n_q}$$

Using the estimated variance components above gives

$$s^2 = 36.64 + \frac{0}{4} + \frac{66.35}{4 \times 5} = 39.96$$

so that $s = \sqrt{39.96} = 6.32$.
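As a quick arithmetic check, the effective standard deviation can be computed directly (Python, with the variance components hard-coded from the output above):

```python
import math

# Variance components estimated from the earlier analysis of the raw data
year_var        = 36.64   # year-to-year
stream_year_var = 0.0     # stream-year interaction
quad_var        = 66.35   # quadrat-to-quadrat (residual)
n_s, n_q        = 4, 5    # streams, and quadrats per stream per year

# Effective variance for the two-sample t-test on yearly averages
s2 = year_var + stream_year_var / n_s + quad_var / (n_s * n_q)
s  = math.sqrt(s2)
print(round(s, 2))  # 6.32, matching the stddev used in Proc Power below
```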
This can now be used in the standard power programs. A sample program called invert-power.sas is
available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/
Notes/MyPrograms. The power is then computed using Proc Power as:
%let year_to_year_var            = 36.64;
%let stream_to_stream_var        = 38.312;
%let stream_year_interaction_var = 0;
%let quad_to_quad_var            = 66.35;

ods graphics on;
%let stddev=%sysevalf((&year_to_year_var +
                       &stream_year_interaction_var/4 +
                       &quad_to_quad_var / 4 / 5)**0.5);
proc power;
   title 'Power analysis for before/after analysis with 4 streams, 5 quadrats/stream/year based on analysis of averages';
   footnote 'Scope of inference is for ALL streams';
   /* We vary the total sample size to see what power is obtained */
   twosamplemeans
      test=diff                 /* test for differences in the mean */
      meandiff=10               /* size of difference to be detected */
      stddev=&stddev            /* the standard deviation within each group */
      power=.                   /* solve for power */
      alpha=.05                 /* alpha level for the test */
      sides=2                   /* a two-sided test for a difference in the mean */
      groupns=5 | 5 to 20 by 1  /* number of years BEFORE (5) and AFTER (5 to 20 by 1) */
   ;  /* end of the twosamplemeans statement - DON'T FORGET THIS */
   ods output output=power400;
run;
ods graphics off;
which gives the following power computations:

Power analysis for before/after analysis with 4 streams, 5 quadrats/stream/year based on analysis of averages
Obs Analysis Index Sides Alpha MeanDiff StdDev N1 N2 NullDiff Power
1 TwoSampleMeans 1 2 0.05 10 6.32 5 5 0 0.594
2 TwoSampleMeans 2 2 0.05 10 6.32 5 6 0 0.644
3 TwoSampleMeans 3 2 0.05 10 6.32 5 7 0 0.683
4 TwoSampleMeans 4 2 0.05 10 6.32 5 8 0 0.715
5 TwoSampleMeans 5 2 0.05 10 6.32 5 9 0 0.740
6 TwoSampleMeans 6 2 0.05 10 6.32 5 10 0 0.761
7 TwoSampleMeans 7 2 0.05 10 6.32 5 11 0 0.779
8 TwoSampleMeans 8 2 0.05 10 6.32 5 12 0 0.793
9 TwoSampleMeans 9 2 0.05 10 6.32 5 13 0 0.806
10 TwoSampleMeans 10 2 0.05 10 6.32 5 14 0 0.816
11 TwoSampleMeans 11 2 0.05 10 6.32 5 15 0 0.826
12 TwoSampleMeans 12 2 0.05 10 6.32 5 16 0 0.834
13 TwoSampleMeans 13 2 0.05 10 6.32 5 17 0 0.841
14 TwoSampleMeans 14 2 0.05 10 6.32 5 18 0 0.847
15 TwoSampleMeans 15 2 0.05 10 6.32 5 19 0 0.853
16 TwoSampleMeans 16 2 0.05 10 6.32 5 20 0 0.858
Scope of inference is for ALL streams
Notice that the power from the simpler method matches that from the Stroup method in this simplified design.
13.13.3 Power: Simple BACI design - one site control/impact; one year before/after;
independent samples
Let us now turn our minds to power/sample size determination for the simplest BACI design. As noted, the simple BACI design has one site measured at impact and control; one year measured before and after; and independent samples taken at each year-site combination. This design is equivalent to the simple two-factor CRD considered in previous chapters. Unlike in the previous chapters, we are now specifically interested in detecting the interaction, i.e. the BACI effect.

If you have a program to determine power/sample size directly for the interaction in a two-factor CRD design, it can be used given the means and standard deviations as noted below. In some cases, you may not have access to such a program, but the computations for power and sample size are relatively straightforward.
The full statistical model for a general BACI design is:

$$Y_{ijkyl} = \mu_{ik} + site_{ij} + year_{ky} + siteyear_{ijky} + quadrat_{ijkyl}$$

where $Y_{ijkyl}$ is the measured response in treatment $i$ (either control or treatment), site $j$ (in treatment $i$), at period $k$ (either before or after), in year $y$ (within period $k$), and finally in subsample $l$ (within treatment-site-period-year combination $ijky$); $\mu_{ik}$ is the mean response at treatment $i$ and period $k$ (i.e. the four means used to determine the BACI contrast, being the means at treatment-before, treatment-after, control-before, and control-after); $site_{ij}$ is site-to-site variation with variance (the square of the standard deviation) of $\sigma^2_{site}$; $year_{ky}$ is the year-to-year variation within the same period that affects all sites simultaneously, with variance $\sigma^2_{year}$; $siteyear_{ijky}$ is the site-year interaction with variance of $\sigma^2_{site\text{-}year}$; and $quadrat_{ijkyl}$ is the subsampling variation with variance of $\sigma^2_{quadrat}$.
In the simplest BACI experiment, there is only one site in each of the treatment or control SiteClasses (so the $j$ subscript only takes the value of 1), and only one year measured in each of the two periods (Before/After), so the $y$ subscript also only takes the value of 1.
The variance of an individual observation is (by statistical theory):

$$V(Y_{ijkyl}) = \sigma^2_{site} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \sigma^2_{quadrat}$$
We start by averaging over the $n_{ijky}$ subsamples taken in each site. This gives the set of means:

$$\overline{Y}_{ijky} = \mu_{ik} + site_{ij} + year_{ky} + siteyear_{ijky} + \overline{quadrat}_{ijky}$$

Notice that because the average is over the subsamples within each treatment-site-period-year combination, the only thing that varies is the quadrats, and so the average occurs only over the quadrat-to-quadrat error. The variance of these averages is:

$$V(\overline{Y}_{ijky}) = \sigma^2_{site} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{ik}}$$

Only the final variance term is affected by averaging.
Now for each site, we take the difference between the after and before (average) readings; recall that in the simple BACI design there is only 1 year in each period, i.e. the $y$ subscript is always 1. This gives:

$$\begin{aligned}
\overline{Y}_{ijA1} - \overline{Y}_{ijB1} &= \mu_{iA} + site_{ij} + year_{A1} + siteyear_{ijA1} + \overline{quadrat}_{ijA1} \\
&\quad - (\mu_{iB} + site_{ij} + year_{B1} + siteyear_{ijB1} + \overline{quadrat}_{ijB1}) \\
&= \mu_{iA} - \mu_{iB} + year_{A1} - year_{B1} + siteyear_{ijA1} - siteyear_{ijB1} + \overline{quadrat}_{ijA1} - \overline{quadrat}_{ijB1}
\end{aligned}$$
Notice that the site-to-site variance terms cancel out. This occurs because the same site is measured in both the before and after periods, which accounts for the extra site variation, much like in paired designs. The variance of the site differences is:

$$\begin{aligned}
V(\overline{Y}_{ijA} - \overline{Y}_{ijB}) &= \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iA}} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iB}} \\
&= 2\sigma^2_{year} + 2\sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iA}} + \frac{\sigma^2_{quadrat}}{n_{iB}}
\end{aligned}$$
Finally, the BACI contrast is found as the difference of the differences in the treatment and control sites:

$$\begin{aligned}
BACI &= \overline{Y}_{T1A1} - \overline{Y}_{T1B1} - (\overline{Y}_{C1A1} - \overline{Y}_{C1B1}) \\
&= \mu_{TA} - \mu_{TB} - (\mu_{CA} - \mu_{CB}) \\
&\quad + site_{T1} + year_{A1} + siteyear_{T1A1} + \overline{quadrat}_{T1A1} \\
&\quad - (site_{T1} + year_{B1} + siteyear_{T1B1} + \overline{quadrat}_{T1B1}) \\
&\quad - (site_{C1} + year_{A1} + siteyear_{C1A1} + \overline{quadrat}_{C1A1}) \\
&\quad + (site_{C1} + year_{B1} + siteyear_{C1B1} + \overline{quadrat}_{C1B1}) \\
&= \mu_{TA} - \mu_{TB} - \mu_{CA} + \mu_{CB} \\
&\quad + siteyear_{T1A1} - siteyear_{T1B1} - siteyear_{C1A1} + siteyear_{C1B1} \\
&\quad + \overline{quadrat}_{T1A1} - \overline{quadrat}_{T1B1} - \overline{quadrat}_{C1A1} + \overline{quadrat}_{C1B1}
\end{aligned}$$
Now notice that both the year and site effects cancel out (as they must with any blocking variable) because both sites were measured in the same years before and after impact. The site-year interaction variance component still remains, and this is problematic! The problem is that the site-year random effects are completely confounded with the respective means in the simple BACI design. This is exactly the problem that occurs in pseudo-replication, where it is impossible to distinguish treatment effects from random effects. Because you only measured one site in the impact and control site classes and one year in the before and after periods, you cannot tell if the observed interaction effect is due to the underlying BACI effect (the difference in the means) or due to random, unknowable effects specific to each site-year combination. For example, suppose that your control site tended to have lower than average counts in one year because of some hidden problem at that site in that year. Then your control site would be lower than normal for one year and you might detect an interaction effect in that experiment that is simply random chance.
Consequently, the best we can do in a simple BACI is define a pseudo-BACI effect:

$$BACI^* = \mu^*_{TA} - \mu^*_{TB} - \mu^*_{CA} + \mu^*_{CB} + \overline{quadrat}_{T1A1} - \overline{quadrat}_{T1B1} - \overline{quadrat}_{C1A1} + \overline{quadrat}_{C1B1}$$

where the respective pseudo-means are specific to that choice of sites and years.
The final variance of the pseudo-BACI effect of the simple BACI design (where we must assume that the site-year variance is negligible and we are happy to restrict our findings to these particular years and particular sites) is found as:

$$V(BACI^*) = \frac{\sigma^2_{quadrat}}{n_{TA}} + \frac{\sigma^2_{quadrat}}{n_{TB}} + \frac{\sigma^2_{quadrat}}{n_{CA}} + \frac{\sigma^2_{quadrat}}{n_{CB}}$$

(equal to $4\sigma^2_{quadrat}/n$ when the same number of quadrats $n$ is measured in each site-year combination). By taking multiple subsamples, the contribution from the quadrat-to-quadrat variance can be reduced to a fairly small value, but no amount of sampling within a site-year combination will resolve the confounding of the site-year interaction terms with the means.
For example, in the crab example, we started with both sites having mean densities of about 35 crabs/quadrat. We suspect that both populations may decline over time, but if the difference in the declines is more than about 5 crabs, we will be highly interested in detecting this. We varied the means in the table until the corresponding graph showed an effect that appeared to be biologically important:
The BACI contrast value is then obtained in the usual way:

$$BACI = \mu_{CA} - \mu_{CB} - (\mu_{TA} - \mu_{TB}) = \mu_{CA} - \mu_{CB} - \mu_{TA} + \mu_{TB} = 30 - 35 - 25 + 35 = 5$$
If a negative value is obtained for the BACI contrast, use the absolute value in the power program.
We need information on the STANDARD DEVIATIONS for the quadrat-to-quadrat variation in the experiment. A very rough rule of thumb for count data is that the standard deviation is about $\sqrt{\text{average count}}$, assuming the counts follow a Poisson distribution, but it can be considerably higher. If you have access to a preliminary survey, the average standard deviation among the quadrats measured in each site-year can be used, or if you have a preliminary ANOVA, the Root Mean Square Error (RMSE) also measures the average standard deviation. Based on the results from the analysis presented earlier, the estimated standard deviation is approximately 3.32.
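The square-root rule of thumb can be checked by simulation. A small Python sketch (illustrative only; the Poisson sampler is hand-rolled with Knuth's algorithm since the standard library has none, and the mean count of 11 per quadrat is a hypothetical choice, for which $\sqrt{11} \approx 3.3$):

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Draw one Poisson(lam) variate via Knuth's algorithm:
    multiply uniforms until the product drops below exp(-lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

rng = random.Random(2012)
lam = 11.0                                   # hypothetical mean count per quadrat
counts = [poisson(lam, rng) for _ in range(20000)]
sd = statistics.stdev(counts)
# For Poisson counts the standard deviation should be close to sqrt(mean)
print(round(sd, 2), round(math.sqrt(lam), 2))
```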
I've created a general power-analysis function for BACI designs called baci-power.sas, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. This general method is based on the paper:

Stroup, W. W. (1999). Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15-24 in Proc. Biopharmaceutical Section, Am. Stat. Assoc., Baltimore, MD.

which is also discussed in Chapter 12 of:

Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D., and Schabenberger, O. SAS for Mixed Models, 2nd Edition.

Basically, the program generates fake data that is used to estimate the power of the design.
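The same generate-and-test idea can be sketched outside SAS. The fragment below (Python; illustrative only, and not the baci-power.sas algorithm itself) estimates the power of the simple BACI interaction test by simulating quadrat data with no site-year random effect; the two-sided t critical value is approximated by 2.0, a hedge since the exact value depends on the error degrees of freedom:

```python
import random
import statistics

def baci_sim_power(mu, sd, n, nsim=2000, tcrit=2.0, seed=1):
    """Monte-Carlo power for the BACI interaction in the simplest design.

    mu    : dict of the four cell means, keys 'TA', 'TB', 'CA', 'CB'
    sd    : quadrat-to-quadrat standard deviation
    n     : quadrats per site-year cell
    tcrit : approximate two-sided critical value (exact value depends
            on the 4*(n-1) error df)
    """
    rng = random.Random(seed)
    reject = 0
    for _ in range(nsim):
        # generate fake quadrat data for the four cells
        cells = {cell: [rng.gauss(m, sd) for _ in range(n)]
                 for cell, m in mu.items()}
        means = {cell: statistics.fmean(xs) for cell, xs in cells.items()}
        # pooled within-cell variance on 4*(n-1) df
        ss = sum((x - means[cell]) ** 2
                 for cell, xs in cells.items() for x in xs)
        s2 = ss / (4 * (n - 1))
        baci = means['TA'] - means['TB'] - means['CA'] + means['CB']
        se = (4 * s2 / n) ** 0.5       # SE of the interaction contrast
        if abs(baci / se) > tcrit:
            reject += 1
    return reject / nsim

# A BACI contrast of 5 with quadrat sd 3.32 and 15 quadrats per cell
print(baci_sim_power({'TA': 30, 'TB': 35, 'CA': 35, 'CB': 35}, 3.32, 15))
```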
In the case of the simple BACI design, the site-year random effect is confounded with the four treatment means, and so we simply specify that its value is 0 in the function.

Here is some sample code to do the power analysis based on the baseline values and several other scenarios:
proc datasets;  /* delete any existing data set */
   delete all_power;
run;

data scenarios;
   input alpha
         sdSite sdYear sdSiteYear sdResid
         n_TA n_TB n_CA n_CB
         ns_T ns_C
         ny_B ny_A
         mu_TA mu_TB mu_CA mu_CB;
   datalines;
.05 10 10 0 3.32  5  5  5  5 1 1 1 1 30 35 35 35
.05 10 10 0 3.32 10 10 10 10 1 1 1 1 30 35 35 35
.05 10 10 0 3.32 15 15 15 15 1 1 1 1 30 35 35 35
.05 10 10 0 3.32 20 20 20 20 1 1 1 1 30 35 35 35
;
run;

options mprint;
data _null_;  /* compute the power for the various scenarios */
   set scenarios;
   call execute('%baci_power(alpha=' || alpha ||
                " , sdSite= "    || sdSite ||
                " , sdYear= "    || sdYear ||
                " , sdSiteYear=" || sdSiteYear ||
                " , sdResid="    || sdResid ||
                " , n_TA=" || n_TA ||
                " , n_TB=" || n_TB ||
                " , n_CA=" || n_CA ||
                " , n_CB=" || n_CB ||
                " , ns_T=" || ns_T ||
                " , ns_C=" || ns_C ||
                " , ny_B=" || ny_B ||
                " , ny_A=" || ny_A ||
                " , mu_TA=" || mu_TA ||
                " , mu_TB=" || mu_TB ||
                " , mu_CA=" || mu_CA ||
                " , mu_CB=" || mu_CB || ");" );
run;
giving the following power computations:

alpha  n_TA  n_TB  n_CA  n_CB  ns_T  ns_C  ny_B  ny_A  mu_TA  mu_TB  mu_CA  mu_CB  Power
 0.05     5     5     5     5     1     1     1     1     30     35     35     35   0.35
 0.05    10    10    10    10     1     1     1     1     30     35     35     35   0.64
 0.05    15    15    15    15     1     1     1     1     30     35     35     35   0.82
 0.05    20    20    20    20     1     1     1     1     30     35     35     35   0.91
For this example, if a BACI contrast of 5 is biologically important, then about 15 sub-samples are required in each site-year combination. Varying the number of subsamples has some, but limited, effect on the overall power compared to changes in the total sample size.
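Because only the quadrat-to-quadrat variance enters the pseudo-BACI contrast in this design, these powers can also be approximated with a one-sample calculation. A Python sketch (normal approximation, so it runs slightly above the exact t-based values in the table above):

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_baci_power(effect, sd_quad, n):
    """Normal-approximation power for the pseudo-BACI contrast with one
    site per class and n quadrats in each of the four site-year cells."""
    se = math.sqrt(4 * sd_quad ** 2 / n)   # V(BACI*) = 4 * sigma^2_quadrat / n
    z = 1.959964                           # two-sided 5% normal critical value
    ncp = effect / se
    return (1.0 - norm_cdf(z - ncp)) + norm_cdf(-z - ncp)

# Compare with the simulated powers 0.35, 0.64, 0.82, 0.91 in the table above
for n in (5, 10, 15, 20):
    print(n, round(approx_baci_power(5, 3.32, n), 2))
```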
13.13.4 Power: Multiple sites in control/impact; one year before/after; independent
samples
As you saw in the previous section, the simple BACI design with one site measured in one year at impact
and control has a fundamental problem of pseudo-replication. We now extend the design to having multiple
sites (with still a single year of measurements in each period) in this section.
We start again with the full statistical model for a general BACI design:

$$Y_{ijkyl} = \mu_{ik} + site_{ij} + year_{ky} + siteyear_{ijky} + quadrat_{ijkyl}$$

where $Y_{ijkyl}$ is the measured response in treatment $i$ (either control or treatment), site $j$ (in treatment $i$), at period $k$ (either before or after), in year $y$ (within period $k$), and finally in subsample $l$ (within treatment-site-period-year combination $ijky$); $\mu_{ik}$ is the mean response at treatment $i$ and period $k$ (i.e. the four means used to determine the BACI contrast, being the means at treatment-before, treatment-after, control-before, and control-after); $site_{ij}$ is site-to-site variation with variance (the square of the standard deviation) of $\sigma^2_{site}$; $year_{ky}$ is the year-to-year variation within the same period that affects all sites simultaneously, with variance $\sigma^2_{year}$; $siteyear_{ijky}$ is the site-year interaction with variance of $\sigma^2_{site\text{-}year}$; and $quadrat_{ijkyl}$ is the subsampling variation with variance of $\sigma^2_{quadrat}$.
In this BACI experiment, there are now (possibly) multiple sites in either or both of the treatment and control SiteClasses, but still only one year measured in each of the two periods (Before/After), so the $y$ subscript again only takes the value of 1.
The variance of an individual observation is (by statistical theory):

$$V(Y_{ijkyl}) = \sigma^2_{site} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \sigma^2_{quadrat}$$

We start by averaging over the $n_{ijky}$ subsamples taken in each site. This gives the set of means:

$$\overline{Y}_{ijky} = \mu_{ik} + site_{ij} + year_{ky} + siteyear_{ijky} + \overline{quadrat}_{ijky}$$

Notice that because the average is over the subsamples within each treatment-site-period-year combination, the only thing that varies is the quadrats, and so the average occurs only over the quadrat-to-quadrat error. The variance of these averages is:

$$V(\overline{Y}_{ijky}) = \sigma^2_{site} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{ik}}$$

Only the final variance term is affected by averaging.
Now for each site, we take the difference between the after and before (average) readings; recall that in this BACI design there is still only 1 year in each period, i.e. the $y$ subscript is always 1. This gives:

$$\begin{aligned}
\overline{Y}_{ijA1} - \overline{Y}_{ijB1} &= \mu_{iA} + site_{ij} + year_{A1} + siteyear_{ijA1} + \overline{quadrat}_{ijA1} \\
&\quad - (\mu_{iB} + site_{ij} + year_{B1} + siteyear_{ijB1} + \overline{quadrat}_{ijB1}) \\
&= \mu_{iA} - \mu_{iB} + year_{A1} - year_{B1} + siteyear_{ijA1} - siteyear_{ijB1} + \overline{quadrat}_{ijA1} - \overline{quadrat}_{ijB1}
\end{aligned}$$
Notice that the site-to-site variance terms cancel out. This occurs because the same site is measured in both the before and after periods, which accounts for the extra site variation, much like in paired designs. The variance of the site differences is:

$$\begin{aligned}
V(\overline{Y}_{ijA} - \overline{Y}_{ijB}) &= \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iA}} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iB}} \\
&= 2\sigma^2_{year} + 2\sigma^2_{site\text{-}year} + \frac{\sigma^2_{quadrat}}{n_{iA}} + \frac{\sigma^2_{quadrat}}{n_{iB}}
\end{aligned}$$
We now average over the (possibly multiple) sites in the control areas:

$$\overline{Y}_{CA1} - \overline{Y}_{CB1} = \mu_{CA} - \mu_{CB} + year_{A1} - year_{B1} + \overline{siteyear}_{CA1} - \overline{siteyear}_{CB1} + \overline{quadrat}_{CA1} - \overline{quadrat}_{CB1}$$
which has variance:

$$V(\overline{Y}_{CA} - \overline{Y}_{CB}) = 2\sigma^2_{year} + \frac{2\sigma^2_{site\text{-}year}}{n_{sites_C}} + \frac{\sigma^2_{quadrat}}{n_{CA}\, n_{sites_C}} + \frac{\sigma^2_{quadrat}}{n_{CB}\, n_{sites_C}}$$

A similar expression can be obtained for the average of the (possibly) replicated treatment sites.
Finally, the BACI contrast is found as the difference of the differences in the treatment and control sites:

$$\begin{aligned}
BACI &= \overline{Y}_{TA1} - \overline{Y}_{TB1} - (\overline{Y}_{CA1} - \overline{Y}_{CB1}) \\
&= \mu_{TA} - \mu_{TB} - (\mu_{CA} - \mu_{CB}) \\
&\quad + \overline{site}_{T} + year_{A1} + \overline{siteyear}_{TA1} + \overline{quadrat}_{TA1} \\
&\quad - (\overline{site}_{T} + year_{B1} + \overline{siteyear}_{TB1} + \overline{quadrat}_{TB1}) \\
&\quad - (\overline{site}_{C} + year_{A1} + \overline{siteyear}_{CA1} + \overline{quadrat}_{CA1}) \\
&\quad + (\overline{site}_{C} + year_{B1} + \overline{siteyear}_{CB1} + \overline{quadrat}_{CB1}) \\
&= \mu_{TA} - \mu_{TB} - \mu_{CA} + \mu_{CB} \\
&\quad + \overline{siteyear}_{TA1} - \overline{siteyear}_{TB1} - \overline{siteyear}_{CA1} + \overline{siteyear}_{CB1} \\
&\quad + \overline{quadrat}_{TA1} - \overline{quadrat}_{TB1} - \overline{quadrat}_{CA1} + \overline{quadrat}_{CB1}
\end{aligned}$$
Now notice that both the year and site effects cancel out (as they must with any blocking variable) because ALL sites were measured in the same years before and after impact. The site-year interaction variance component still remains, but it is now no longer problematic as it is no longer confounded with the means. This is the primary reason why multiple sites (at least multiple control sites) should be used with any BACI experiment.
The variance of the BACI contrast is found to be:

$$V(BACI) = \frac{2\sigma^2_{site\text{-}year}}{n_{sites_C}} + \frac{2\sigma^2_{site\text{-}year}}{n_{sites_T}} + \frac{\sigma^2_{quadrat}}{n_{CA}\, n_{sites_C}} + \frac{\sigma^2_{quadrat}}{n_{CB}\, n_{sites_C}} + \frac{\sigma^2_{quadrat}}{n_{TA}\, n_{sites_T}} + \frac{\sigma^2_{quadrat}}{n_{TB}\, n_{sites_T}}$$
The standard deviation of the BACI contrast is the square root of this value. Finally, the BACI value
and the standard deviation of the BACI contrast can be used with any one-sample power analysis program
to determine the power of the design.
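The variance formula is easy to evaluate directly. A Python sketch using the standard-deviation inputs from the scenarios below (1.296 for site-year, 3.30 for quadrats; one treatment site, two control sites, 5 quadrats per cell):

```python
import math

def baci_var(sd_site_year, sd_quad, n_TA, n_TB, n_CA, n_CB, ns_T, ns_C):
    """V(BACI) for (possibly) multiple sites per class, one year per period.
    Arguments are standard deviations and per-cell quadrat/site counts."""
    v_sy = sd_site_year ** 2
    v_q  = sd_quad ** 2
    return (2 * v_sy / ns_C + 2 * v_sy / ns_T
            + v_q / (n_CA * ns_C) + v_q / (n_CB * ns_C)
            + v_q / (n_TA * ns_T) + v_q / (n_TB * ns_T))

v = baci_var(1.296, 3.30, 5, 5, 5, 5, ns_T=1, ns_C=2)
print(round(math.sqrt(v), 2))   # standard deviation of the BACI contrast
```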
The previous analysis of this data gave the following estimates of the variance components:
The BACI effect size is determined in the usual way. Again assume that a BACI contrast of 5 is of
biological interest.
The baseline numbers of subsamples and sites appear in the scenarios below.
As in the previous section, we used the baci-power.sas routine. Here is some sample code to do the
power analysis based on the baseline values and several other scenarios:
proc datasets;  /* delete any existing data set */
   delete all_power;
run;

data scenarios;
   input alpha
         sdSite sdYear sdSiteYear sdResid
         n_TA n_TB n_CA n_CB
         ns_T ns_C
         ny_B ny_A
         mu_TA mu_TB mu_CA mu_CB;
   datalines;
.05 3.831 10 1.296 3.30  5  5  5  5 1  2 1 1 30 35 35 35
.05 3.831 10 1.296 3.30 40 40 40 40 1  2 1 1 30 35 35 35
.05 3.831 10 1.296 3.30  5  5  5  5 1  4 1 1 30 35 35 35
.05 3.831 10 1.296 3.30 20 20 20 20 1  4 1 1 30 35 35 35
.05 3.831 10 1.296 3.30 20 20  5  5 1  4 1 1 30 35 35 35
.05 3.831 10 1.296 3.30  5  5  5  5 1 10 1 1 30 35 35 35
.05 3.831 10 1.296 3.30  5  5  5  5 2 10 1 1 30 35 35 35
.05 3.831 10 1.296 3.30 40 40 40 40 2 10 1 1 30 35 35 35
;
run;

data _null_;  /* compute the power for the various scenarios */
   set scenarios;
   call execute('%baci_power(alpha=' || alpha ||
                " , sdSite= "    || sdSite ||
                " , sdYear= "    || sdYear ||
                " , sdSiteYear=" || sdSiteYear ||
                " , sdResid="    || sdResid ||
                " , n_TA=" || n_TA ||
                " , n_TB=" || n_TB ||
                " , n_CA=" || n_CA ||
                " , n_CB=" || n_CB ||
                " , ns_T=" || ns_T ||
                " , ns_C=" || ns_C ||
                " , ny_B=" || ny_B ||
                " , ny_A=" || ny_A ||
                " , mu_TA=" || mu_TA ||
                " , mu_TB=" || mu_TB ||
                " , mu_CA=" || mu_CA ||
                " , mu_CB=" || mu_CB || ");" );
run;
giving the following power computations:
alpha  n_TA  n_TB  n_CA  n_CB  ns_T  ns_C  ny_B  ny_A  mu_TA  mu_TB  mu_CA  mu_CB  Power
 0.05     5     5     5     5     1     2     1     1     30     35     35     35   0.10
 0.05    40    40    40    40     1     2     1     1     30     35     35     35   0.13
 0.05     5     5     5     5     1     4     1     1     30     35     35     35   0.21
 0.05    20    20    20    20     1     4     1     1     30     35     35     35   0.32
 0.05    20    20     5     5     1     4     1     1     30     35     35     35   0.28
 0.05     5     5     5     5     1    10     1     1     30     35     35     35   0.34
 0.05     5     5     5     5     2    10     1     1     30     35     35     35   0.56
 0.05    40    40    40    40     2    10     1     1     30     35     35     35   0.84
Notice that increasing the sub-sampling from 5 to 40 in each site-year combination has little effect on the
power. The problem is that these are pseudo-replicates and not true replicates. Consequently, replication of
the number of sites is much more effective. Now, because the number of treatment sites is limited (there is
only 1), it may be sensible to increase sampling at the treatment site at the quadrat level as well. For this
particular experiment, quite a few additional sites would be needed to have a reasonable chance of detecting
this size of a biologically important difference.
13.13.5 Power: One site in control/impact; multiple years before/after; no subsampling
The derivation for the power analysis for this design is relatively straightforward because there is no subsampling in each site.
We start again with the full statistical model for a general BACI design:

$$Y_{ijkyl} = \mu_{ik} + site_{ij} + year_{ky} + siteyear_{ijky} + quadrat_{ijkyl}$$

where $Y_{ijkyl}$ is the measured response in treatment $i$ (either control or treatment), site $j$ (in treatment $i$), at period $k$ (either before or after), in year $y$ (within period $k$), and finally in subsample $l$ (within treatment-site-period-year combination $ijky$); $\mu_{ik}$ is the mean response at treatment $i$ and period $k$ (i.e. the four means used to determine the BACI contrast, being the means at treatment-before, treatment-after, control-before, and control-after); $site_{ij}$ is site-to-site variation with variance (the square of the standard deviation) of $\sigma^2_{site}$; $year_{ky}$ is the year-to-year variation within the same period that affects all sites simultaneously, with variance $\sigma^2_{year}$; $siteyear_{ijky}$ is the site-year interaction with variance of $\sigma^2_{site\text{-}year}$; and $quadrat_{ijkyl}$ is the subsampling variation with variance of $\sigma^2_{quadrat}$.
In this BACI experiment, there is now only one site in each of the treatment or control SiteClasses, so the $j$ subscript only takes the value of 1, but multiple years are measured in each of the two periods (Before/After), so the $y$ subscript now takes many values. There is only one measurement at each site-year combination, so the $l$ subscript only takes the value of 1 as well.
The variance of an individual observation is (by statistical theory):

$$V(Y_{ijkyl}) = \sigma^2_{site} + \sigma^2_{year} + \sigma^2_{site\text{-}year} + \sigma^2_{quadrat}$$
Because both sites are measured at every sampling event (a paired design), we first take the difference in response between the treatment and control sites at each time point:

$$\begin{aligned}
d_{ky1} &= Y_{T1ky1} - Y_{C1ky1} \\
&= \mu_{Tk} + site_{T1} + year_{ky} + siteyear_{T1ky} + quadrat_{T1ky1} - (\mu_{Ck} + site_{C1} + year_{ky} + siteyear_{C1ky} + quadrat_{C1ky1}) \\
&= \mu_{Tk} + site_{T1} + siteyear_{T1ky} + quadrat_{T1ky1} - (\mu_{Ck} + site_{C1} + siteyear_{C1ky} + quadrat_{C1ky1})
\end{aligned}$$
Notice that the sampling variation caused by different sampling times cancels; this is exactly what happens with paired designs.
The average differences in the before and after periods are then found:

$$\overline{d}_{B1} = \mu_{TB} + site_{T1} + \overline{siteyear}_{T1B} + \overline{quadrat}_{T1B1} - (\mu_{CB} + site_{C1} + \overline{siteyear}_{C1B} + \overline{quadrat}_{C1B1})$$
$$\overline{d}_{A1} = \mu_{TA} + site_{T1} + \overline{siteyear}_{T1A} + \overline{quadrat}_{T1A1} - (\mu_{CA} + site_{C1} + \overline{siteyear}_{C1A} + \overline{quadrat}_{C1A1})$$
Finally, the BACI contrast is the difference of these differences:

$$\begin{aligned}
BACI &= \overline{d}_{B1} - \overline{d}_{A1} \\
&= \mu_{TB} - \mu_{CB} - \mu_{TA} + \mu_{CA} \\
&\quad + \overline{siteyear}_{T1B} - \overline{siteyear}_{C1B} - \overline{siteyear}_{T1A} + \overline{siteyear}_{C1A} \\
&\quad + \overline{quadrat}_{T1B1} - \overline{quadrat}_{C1B1} - \overline{quadrat}_{T1A1} + \overline{quadrat}_{C1A1}
\end{aligned}$$

Notice how the site effects again cancel out because the same sites are measured in both periods (before and after).
The BACI contrast has variance:

$$V(BACI) = \frac{2\sigma^2_{site\text{-}year}}{n_B} + \frac{2\sigma^2_{site\text{-}year}}{n_A} + \frac{2\sigma^2}{n_B} + \frac{2\sigma^2}{n_A}$$
This can then be used with any power program for a single mean.
Note, however, that with only one measurement per year, the site-year and residual (subsampling) variances cannot be disentangled and are completely confounded. Consequently, the estimated residual variation is a combination of the site-year and residual variation and is used as is. From the previous analysis of the fish data, we found that the year-to-year variance component (the SamplingTime variance component) had a standard deviation of 31.71; the residual variation (the combination of the site-year and subsampling variation) had a standard deviation of 27.07.
As in the previous section, we used the baci-power.sas routine. Here is some sample code to do the
power analysis based on the baseline values and several other scenarios:
proc datasets;  /* delete any existing data set */
   delete all_power;
run;

data scenarios;
   input alpha
         sdSite sdYear sdSiteYear sdResid
         n_TA n_TB n_CA n_CB
         ns_T ns_C
         ny_B ny_A
         mu_TA mu_TB mu_CA mu_CB;
   datalines;
.05 99 31.71 0 27.07 1 1 1 1 1 1 12 13 20 30 30 30
.05 99 31.71 0 27.07 1 1 1 1 1 1 24 26 20 30 30 30
.05 99 31.71 0 27.07 1 1 1 1 1 1 12 13 70 30 30 30
.05 99 31.71 0 27.07 1 1 1 1 1 1 12 13 80 30 30 30
;
run;

data _null_;  /* compute the power for the various scenarios */
   set scenarios;
   call execute('%baci_power(alpha=' || alpha ||
                ', sdSite='     || sdSite ||
                ', sdYear='     || sdYear ||
                ', sdSiteYear=' || sdSiteYear ||
                ', sdResid='    || sdResid ||
                ', n_TA='  || n_TA ||
                ', n_TB='  || n_TB ||
                ', n_CA='  || n_CA ||
                ', n_CB='  || n_CB ||
                ', ns_T='  || ns_T ||
                ', ns_C='  || ns_C ||
                ', ny_B='  || ny_B ||
                ', ny_A='  || ny_A ||
                ', mu_TA=' || mu_TA ||
                ', mu_TB=' || mu_TB ||
                ', mu_CA=' || mu_CA ||
                ', mu_CB=' || mu_CB || ');' );
   * put "*** temp ***" temp;
run;
giving the following power computations:

alpha  n_TA n_TB n_CA n_CB  ns_T ns_C  ny_B ny_A  mu_TA mu_TB mu_CA mu_CB  Power
0.05    1    1    1    1     1    1     12   13    20    30    30    30    0.10
0.05    1    1    1    1     1    1     24   26    20    30    30    30    0.15
0.05    1    1    1    1     1    1     12   13    70    30    30    30    0.71
0.05    1    1    1    1     1    1     12   13    80    30    30    30    0.88
Note that the power is extremely small for all but very large differences, even with monthly samples taken before and after impact. The key problem is the large variation in the differences among the different months. This may indicate that the design should be modified to incorporate sub-sampling to try and reduce some of the variation seen across months.
The power and sample size for the chironomid example is very similar and not repeated here.
13.13.6 Power: General BACI: Multiple sites in control/impact; multiple years before/after; subsampling

Whew! Now for a power/sample size analysis for the general BACI design with multiple sites in the SiteClasses (typically only the Control sites are replicated), multiple years in each period (before/after), and multiple subsamples.

The derivation of the BACI contrast and its variance follows the example shown in the previous section and only the final results will be shown.
The BACI contrast is the difference of the differences after averaging over subsamples, multiple sites in
each SiteClass, and multiple years in each period:
BACI = \bar{d}_{B} - \bar{d}_{A}
     = \mu_{TB} - \mu_{CB} - \mu_{TA} + \mu_{CA}
       + \overline{siteyear}_{TB} - \overline{siteyear}_{CB} - \overline{siteyear}_{TA} + \overline{siteyear}_{CA}
       + \overline{quadrat}_{TB} - \overline{quadrat}_{CB} - \overline{quadrat}_{TA} + \overline{quadrat}_{CA}
Notice how the site-effects and year-effects again cancel out because the same sites are measured in all years (before and after).

The general BACI contrast has variance:
V(BACI) = \frac{\sigma^2_{site\text{-}year}}{ny_B \, ns_T} + \frac{\sigma^2_{site\text{-}year}}{ny_B \, ns_C} + \frac{\sigma^2_{site\text{-}year}}{ny_A \, ns_T} + \frac{\sigma^2_{site\text{-}year}}{ny_A \, ns_C}
        + \frac{\sigma^2_{quadrat}}{ny_B \, ns_T \, n_{TB}} + \frac{\sigma^2_{quadrat}}{ny_B \, ns_C \, n_{CB}} + \frac{\sigma^2_{quadrat}}{ny_A \, ns_T \, n_{TA}} + \frac{\sigma^2_{quadrat}}{ny_A \, ns_C \, n_{CA}}
where ny_B is the number of years monitored before impact; ny_A is the number of years monitored after impact; ns_T is the number of sites measured in the Treatment SiteClass; ns_C is the number of sites measured in the Control SiteClass; and n_{TB}, n_{TA}, n_{CB}, and n_{CA} are the number of subsamples taken in each site and year in the treatment-before combination etc. This can then be used with any power program for a single mean in the same fashion as before.
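The variance formula above is easy to code directly as a cross-check on the SAS macro. The following Python sketch (function and argument names are my own; it uses a normal approximation, which tends to be slightly liberal relative to the SAS routine's answer) computes V(BACI) and the approximate power:

```python
from scipy.stats import norm

def baci_power_general(alpha, sd_site_year, sd_resid,
                       ny_B, ny_A, ns_T, ns_C,
                       n_TB, n_TA, n_CB, n_CA, baci):
    """Approximate power for the general BACI design with multiple sites,
    multiple years in each period, and subsampling within each site-year."""
    # V(BACI) from the formula above
    v = (sd_site_year**2 * (1/(ny_B*ns_T) + 1/(ny_B*ns_C) +
                            1/(ny_A*ns_T) + 1/(ny_A*ns_C)) +
         sd_resid**2 * (1/(ny_B*ns_T*n_TB) + 1/(ny_B*ns_C*n_CB) +
                        1/(ny_A*ns_T*n_TA) + 1/(ny_A*ns_C*n_CA)))
    z = abs(baci) / v**0.5            # standardized effect size
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)

# Fry example, first scenario below: 3 cages per site-year, 3 sites per class,
# 3 years before and 2 after, BACI contrast of 0.5
power = baci_power_general(0.05, 0.1, 0.75, 3, 2, 3, 3, 3, 3, 3, 3, 0.5)
```

Because the normal approximation ignores the degrees of freedom used to estimate the variance components, expect its powers to run a few percentage points above the SAS values.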
Consider the fry example. Here the variance components are:

- site-to-site standard deviation 0.75; this is irrelevant because all sites are measured in all years.
- year-to-year standard deviation 0.25; this is also irrelevant because all sites are measured in all years.
- site-year standard deviation 0.1;
- sub-sampling standard deviation 0.75.

A BACI contrast of 0.5 is of interest, so choose 4 means to give the relevant value. Finally, various combinations of the number of cages, number of sites, and number of years are tried to obtain the following power.
As in the previous section, we used the baci-power.sas routine. Here is some sample code to do the
power analysis based on the baseline values and several other scenarios.
proc datasets;  /* delete any existing data set */
   delete all_power;
run;

data scenarios;
   input alpha
         sdSite sdYear sdSiteYear sdResid
         n_TA n_TB n_CA n_CB
         ns_T ns_C
         ny_B ny_A
         mu_TA mu_TB mu_CA mu_CB;
   datalines;
.05 .75 .25 .10 .75 3 3 3 3 3 3 3 2 5.0 5.5 4.5 4.5
.05 .75 .25 .10 .75 6 6 6 6 3 3 3 2 5.0 5.5 4.5 4.5
.05 .75 .25 .10 .75 9 9 9 9 3 3 3 3 5.0 5.5 4.5 4.5
.05 .75 .25 .10 .75 6 6 6 6 3 3 3 4 5.0 5.5 4.5 4.5
;
run;

options mprint;

data _null_;  /* compute the power for the various scenarios */
   set scenarios;
   call execute('%baci_power(alpha=' || alpha ||
                ', sdSite='     || sdSite ||
                ', sdYear='     || sdYear ||
                ', sdSiteYear=' || sdSiteYear ||
                ', sdResid='    || sdResid ||
                ', n_TA='  || n_TA ||
                ', n_TB='  || n_TB ||
                ', n_CA='  || n_CA ||
                ', n_CB='  || n_CB ||
                ', ns_T='  || ns_T ||
                ', ns_C='  || ns_C ||
                ', ny_B='  || ny_B ||
                ', ny_A='  || ny_A ||
                ', mu_TA=' || mu_TA ||
                ', mu_TB=' || mu_TB ||
                ', mu_CA=' || mu_CA ||
                ', mu_CB=' || mu_CB || ');' );
   * put "*** temp ***" temp;
run;
giving the following power computations:

alpha  n_TA n_TB n_CA n_CB  ns_T ns_C  ny_B ny_A  mu_TA mu_TB mu_CA mu_CB  Power
0.05    3    3    3    3     3    3     3    2    5.0   5.5   4.5   4.5    0.30
0.05    6    6    6    6     3    3     3    2    5.0   5.5   4.5   4.5    0.51
0.05    9    9    9    9     3    3     3    3    5.0   5.5   4.5   4.5    0.76
0.05    6    6    6    6     3    3     3    4    5.0   5.5   4.5   4.5    0.67
You see that about 9 cages need to be employed in each site-year combination to get reasonable power.
Chapter 14

Comparing proportions - Chi-square (χ²) tests
Contents
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
14.2 Response variables vs. Frequency Variables . . . . . . . . . . . . . . . . . . . . . . . 816
14.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
14.4 Single sample surveys - comparing to a known standard . . . . . . . . . . . . . . . . 820
14.4.1 Resource selection - comparison to known habitat proportions . . . . . . . . . . . 820
14.4.2 Example: Homicide and Seasons . . . . . . . . . . . . . . . . . . . . . . . . . . 826
14.5 Comparing sets of proportions - single factor CRD designs . . . . . . . . . . . . . . . 830
14.5.1 Example: Elk habitat usage - Random selection of points . . . . . . . . . . . . . 830
14.5.2 Example: Ownership and viability . . . . . . . . . . . . . . . . . . . . . . . . . 834
14.5.3 Example: Sex and Automobile Styling . . . . . . . . . . . . . . . . . . . . . . . 839
14.5.4 Example: Marijuana use in college . . . . . . . . . . . . . . . . . . . . . . . . . 843
14.5.5 Example: Outcome vs. cause of accident . . . . . . . . . . . . . . . . . . . . . . 847
14.5.6 Example: Activity times of birds . . . . . . . . . . . . . . . . . . . . . . . . . . 851
14.6 Pseudo-replication - Combining tables . . . . . . . . . . . . . . . . . . . . . . . . . . 853
14.7 Simpson's Paradox - Combining tables . . . . . . . . . . . . . . . . . . . . . . . . . 857
14.7.1 Example: Sex bias in admissions . . . . . . . . . . . . . . . . . . . . . . . . . . 857
14.7.2 Example: - Twenty-year survival and smoking status . . . . . . . . . . . . . . . . 858
14.8 More complex designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
14.9 Final notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
14.10 Appendix - how the test statistic is computed . . . . . . . . . . . . . . . . . . . . . 860
14.11 Fisher's Exact Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
14.11.1 Sampling Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
14.11.4 Example: Relationship between Aspirin Use and MI . . . . . . . . . . . . . . . . 867
14.11.5 Avoidance of cane toads by Northern Quolls . . . . . . . . . . . . . . . . . . . . 870
14.1 Introduction

In previous chapters, the Analysis of Variance (ANOVA) was used as the general methodology for testing hypotheses about population MEANS. In many cases, the response variable can often be dichotomized into classes. Often the proportion of the population that falls into the various classes is of interest. There are two types of hypotheses that are often considered:

- Are the observed proportions comparable to a known set of proportions? For example, a sample of habitat usage by animals is collected, and we wish to know if this is the same as the actual proportion of available habitat as determined using a GIS. Notice that the proportion of habitat available by type is known EXACTLY because the entire area is covered by the GIS system.

- Are the observed proportions across treatment groups the same? For example, both male and female animals could be sampled for habitat preference and it is of interest to know if males and females behave similarly. In this case the proportions in both groups are determined from samples and NOT known exactly - they include sampling error.
For both cases, the standard methodology to test hypotheses about population proportions is called a chi-square test, sometimes written as a χ²-test.

Warning: The use of the generic term chi-square test is unfortunate, as there are many statistical tests where the final test statistic is also compared to a chi-squared distribution which are NOT tests of proportions. As well, not all tests of proportions lead to chi-square tests.
Some examples of the types of studies to be considered in this chapter are:

- Resource selection studies where a random sample of habitat usages by a group of animals is compared to a known habitat classification.

- Resource selection studies where random samples are selected for two or more groups and the proportions between the groups are to be compared.

- An experiment is conducted to investigate the effects of dissolved gases upon fish survival. Fish are randomly assigned to either a control group or an experimental group. At the end of the experiment each fish is classified as live or dead. Is the proportion of live/dead the same in both treatment groups?

- People are cross-classified by the amount of education (e.g. high school, college, post-graduate) and their socio-economic status (low, middle, high). Is there a relationship?

- Animals are cross-classified by breeding status and age. Is there a relationship between the two variables?

- Students are cross-classified by their usage of marijuana and alcohol compared to their parents' usage. Is there evidence of a relationship?
Warning: This chapter is limited to single-factor CRDs. As mentioned many times in the previous chapters, it is important that the analysis match the experimental design. This chapter will review ONLY the analysis of single-factor completely randomized designs - the analysis of blocked designs, multi-factor designs, split-plot designs, or sub-sampling designs when testing proportions is VERY complex. Please consult suitable help before proceeding. For example, if the data are paired, then McNemar's test, rather than the chi-square tests of this chapter, would be used. See, for example, Hoffman, J.I.E. (1976). The incorrect use of Chi-square analysis for paired data. Clinical and Experimental Immunology, 24, 227-229. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538510/?tool=pubmed.
Warning: Sacrificial pseudo-replication. Hurlbert (1984) found that pseudo-replication is a particular problem with experiments that compare proportions - particularly sacrificial pseudo-replication, where studies are (inappropriately) pooled before conducting a chi-square test. A related problem of pooling is Simpson's Paradox. Both are explored in more detail later in this chapter. For example, what is the difference between a study with 10 fish in each tank and 1 fish in each of 10 tanks?
Warning: Power and sample size are hard to determine. There is no easy way to determine power and sample size for these tests as the hypothesis is so nebulous. It is possible to use JMP to determine power and sample size when testing for changes in a single proportion (e.g. differences in death rates among treatments). Please seek help if you are interested in doing this.
Warning: This is NOT compositional data. This chapter does NOT deal with the analysis of compositional data. For example, you may observe birds for 2-hour periods and compute the proportion of time spent in various activities. Despite the similarities to contingency tables (where row or column percentages add to 100%), compositional data are NOT analyzed using the chi-square tests discussed in this chapter.

Warning: Avoid duplicate counts. The techniques presented in this chapter are NOT applicable if a subject can appear in more than one category. For example, you will often see surveys where respondents are asked to check all the sports that they enjoy watching; or animals are watched and ALL of their activities are recorded. A key assumption of the analyses in this chapter is that every experimental unit appears once and only once in the contingency table. Methods have been developed to deal with multi-response data - seek help if you have a research problem in this area.
14.2 Response variables vs. Frequency Variables

The data for a test of proportions can come in two formats:

- individual records
- summarized records

If data are individual records, each observation in the dataset consists of the classification for a single individual. For example, consider a study that examined the condition factor of deer after a particularly nasty winter. As each deer is spotted, its sex and condition factor are noted, where the condition factor is classified into two classes (good and poor). Here are some data:
Individual Sex Condition
1 m good
2 f poor
3 m poor
4 m good
5 m poor
6 f good
7 f good
. . .
Each deer has two variables recorded: sex (nominal scale) and condition factor (ordinal scale). The
study would consider the sex to be the explanatory variable (the X variable) and the condition factor to be
the response variable (the Y variable).
In some cases, the data are summarized and information on individual animals is not given. For example,
the summary information for the above study could take the form:
Sex Condition Count
m good 10
f good 10
m poor 23
f poor 18
The variables recorded are sex (nominal scale), condition factor (ordinal scale), and a count. The study would now consider sex to be the explanatory variable (the X variable), the condition factor to be the response variable (the Y variable), and the count as a frequency variable. The count variable is NOT the response variable - it merely summarizes the information from a number of individuals. This often causes confusion when it comes to an analysis, as it is tempting to consider the count variable as the response. The easiest way to avoid this confusion is to ask what variable would be the response variable if information on individuals was recorded - the same variable will be the response variable when summary data are given.

JMP requires that both the X and Y variables be nominal or ordinal scale.
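The two formats carry the same information, and most packages can convert between them. As a sketch (using pandas; the column names follow the deer example above), summarized records can be expanded into individual records by repeating each row according to its count:

```python
import pandas as pd

# Summarized records from the deer condition study
summary = pd.DataFrame({
    "Sex":       ["m", "f", "m", "f"],
    "Condition": ["good", "good", "poor", "poor"],
    "Count":     [10, 10, 23, 18],
})

# Expand to one row per deer, dropping the now-redundant Count column
individual = (summary.loc[summary.index.repeat(summary["Count"])]
                     .drop(columns="Count")
                     .reset_index(drop=True))

# Collapsing back reproduces the original frequency table
back = individual.value_counts().reset_index(name="Count")
```

Either layout leads to the same analysis; the frequency column simply plays the role of a weight, as the Weight statement does in SAS Proc Freq.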
14.3 Overview

Regardless of the statistical procedure used or the type of response variable examined, it is extremely important that the analysis match the experimental/survey design. As noted earlier, most computer packages assume that you have done the matching of design with the analysis.

Recall that a single factor completely randomized design has the following structures:

- treatment structure: A single factor with at least 2 levels describes the population groups for which comparisons of the population parameter are to be made. As there is only one factor, the treatments correspond to the levels of the factor.

- experimental unit structure: There is a single size of experimental unit; the observational unit is the experimental unit.

- randomization structure: Each experimental unit is randomly and independently assigned to each treatment group, or in the case of analytical studies, the units are a simple random sample from the relevant treatment groups.

There are two variables of interest:

- The explanatory variable (X variable) that defines the treatment groups. This should be nominal or ordinal scale.¹

- The response variable (Y variable), which is the category that the unit is classified into. This should be nominal or ordinal scale.

If the data have been summarized, there will also be a frequency variable which counts the number of individuals for each combination of the explanatory and response variables.

The response and explanatory variables can also be interval or ratio variables if the interval or ratio variables are first broken into categories and the categories are used as nominal or ordinal variables. For example, after measuring a person's height, you could create a new variable called height group that has the values short, average, or tall; or the actual amount of fiber in a cereal could be used to classify a cereal as a low, medium, or high fiber cereal.
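Breaking an interval variable into ordered categories like this is a one-line operation in most packages. A sketch using pandas (the cut-points here are invented for illustration, not recommendations):

```python
import pandas as pd

heights_cm = pd.Series([150, 162, 175, 181, 168, 159])

# Classify each height into an ordered category;
# the bins are the intervals (0,160], (160,175], (175,300]
height_group = pd.cut(heights_cm,
                      bins=[0, 160, 175, 300],
                      labels=["short", "average", "tall"])
```

The resulting variable is treated as ordinal in any subsequent contingency-table analysis.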
The hypotheses of interest can be written in several (equivalent) ways:

- Equality of proportions. The null hypothesis is that the proportion in each of the response variable categories is the same for all treatment groups. The alternate hypothesis is that the set of proportions differs somewhere among the treatment groups.

- Independence. The null hypothesis is that the response category is independent of the treatment group. The alternate hypothesis is that there is some sort of (ill-defined) association.

¹ In the case where a single group is being compared to known proportions, the explanatory variable is not used.
Both of the above hypotheses are exactly equivalent and can be used interchangeably.

It is possible to write the hypotheses in terms of population parameters. Let \pi_{ij} be the proportion of treatment population i that is found in response category j. There are G treatment groups and K response categories. Note that \sum_j \pi_{ij} = 1, i.e., the proportions must add to 100% for each treatment population. The null hypothesis is:

H: \pi_{11} = \pi_{21} = \pi_{31} = \ldots = \pi_{G1} and
   \pi_{12} = \pi_{22} = \pi_{32} = \ldots = \pi_{G2} and
   \ldots
   \pi_{1K} = \pi_{2K} = \pi_{3K} = \ldots = \pi_{GK}
The basic summary statistic in a chi-square analysis is the contingency table, which summarizes the number of observations for each combination of the explanatory and response variables. Sample percentages are often computed (denoted as p_{ij} to distinguish them from the \pi_{ij} in the population). The basic summary graph is the mosaic chart, which consists of side-by-side segmented bar charts.
The test statistic can be computed in several ways:

- Pearson chi-square is a comparison of the observed and expected counts when the null hypothesis is true.

- Likelihood ratio chi-square is a weighted average of the ratio of the observed and expected counts.

- Fisher's exact test looks at how many contingency tables are more unusual than the one observed here.

The first two test statistics often are very similar and there is no objective way to choose between them. [It can be shown that in large samples the two are equivalent.] However, both test statistics assume that the counts in the contingency table are reasonably large. An often quoted rule of thumb is that the expected count in each cell of the contingency table should be at least 5. The Fisher test statistic may be more suitable for very sparse tables, i.e., tables with lots of 0, 1, or 2 counts. For tables that are not sparse, Fisher's test statistic often takes too long to compute even with modern computers.
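Both large-sample statistics are easy to obtain in any package. As an illustration (a sketch in Python using scipy, applied to the deer condition data from the previous section), the Pearson and likelihood-ratio statistics come from the same contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Deer condition data: rows = sex (m, f), columns = condition (good, poor)
table = np.array([[10, 23],
                  [10, 18]])

# Pearson chi-square (no continuity correction, to match the large-sample formula)
chi2, p_pearson, dof, expected = chi2_contingency(table, correction=False)

# Likelihood-ratio (G) statistic computed from the same table
g, p_lr, _, _ = chi2_contingency(table, lambda_="log-likelihood", correction=False)
```

For these data the two statistics are nearly identical, as the paragraph above suggests; both are referred to a chi-squared distribution with (rows − 1)(columns − 1) degrees of freedom.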
Regardless of which test procedure is used, the ultimate end-point is the p-value. This is interpreted in exactly the same way as in all previous studies, i.e., it is a measure of how consistent the data are with the null hypothesis. It does NOT measure the probability that the hypothesis is true! As before, small p-values are strong evidence that the data are not consistent with the hypothesis, leading to a conclusion against the null hypothesis.

As in the ANOVA procedure, if strong enough evidence is found against the null hypothesis, you still don't know where the hypothesis apparently failed. Some sort of multiple comparison procedure is required. Unfortunately, statistical theory is not yet well developed in this area because of the complication that the percentages within a treatment group must add to 100%. Consequently, it can be confusing (as you will see later) to try and see which category is different among treatment groups. About the best that can be done is to look at the individual cell contributions to the overall test-statistic (called the individual cell chi-squared values) to see where the large contributions have occurred. A rough rule of thumb is that entries larger than 4 or 5 indicate some problems with the fit.

In the ANOVA, a great deal of emphasis was placed on estimation of effect sizes (e.g., confidence intervals for differences). It is possible to construct such intervals for contingency tables, but these are more readily done when log-linear models are used - these are beyond the scope of this course.
14.4 Single sample surveys - comparing to a known standard

The simplest survey is when data are collected from a single population via a single sample, and the question of interest compares the proportions of each category in the population with a KNOWN set of proportions (the standard). For example, to test if a coin is fair, a series of tosses is performed. The proportion of heads for that coin is tested against the standard of .50 for a fair coin.

The sampling design is a simple random sample from the relevant population. For each unit in the sample, a categorical response (e.g. heads or tails) is recorded. The sample statistics of interest are the observed proportions of these categories (e.g. the observed proportion of heads or tails in the sample). Notice that unlike previous chapters, the summary statistic is a sample proportion (refer back to the creel survey where the proportion of angling parties with sufficient life jackets was found).

As before, never present naked estimates - some measure of precision needs to be reported, either the se or a confidence interval or both.
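The standard error and a large-sample confidence interval for a sample proportion follow directly from the binomial distribution. A sketch (the counts are invented for illustration and are not from the text):

```python
import math

heads, n = 60, 100                 # say, 60 heads observed in 100 tosses
p_hat = heads / n                  # estimated proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error of a proportion
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)   # approximate 95% (Wald) interval
```

The estimate would then be reported together with its se or interval, never alone.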
The hypothesis of interest is H: \pi_{heads} = .50; \pi_{tails} = .50. Again notice that we are now testing a population PROPORTION rather than a population mean. A measure of discrepancy of the data relative to the hypothesis is computed (the chi-square test statistic) and this eventually leads to a p-value which is interpreted in the usual fashion. Again, never report a naked p-value - always report an effect size along with a measure of precision.
14.4.1 Resource selection - comparison to known habitat proportions
Neu et al. (1974) considered selection of habitat by moose (Alces alces) in the Little Sioux Burn area of Minnesota in 1971-72. The authors determined the proportion of four habitat categories (see table below) using an aerial photograph, and later classified the habitat usage of 116 moose during subsequent aerial surveys.

Habitat                 Actual Proportion   Moose observations
In burn, interior            .340                  24
In burn, edge                .101                  22
Out of burn, edge            .104                  30
Out of burn, further         .455                  40
Total                       1.000                 116
The data are entered in the usual fashion in three columns. The data are available in the moosehabitat.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:

data moose;
   length habitat $20.;
   infile 'moosehabitat.csv' dlm=',' dsd missover firstobs=2;
   input habitat $ RealProp MooseCount;
run;
The raw data are shown below:
Obs habitat RealProp MooseCount
1 In burn, interior 0.340 24
2 In burn, edge 0.101 22
3 Out of burn, edge 0.104 30
4 Out of burn, further 0.455 40
The habitat variable is the response variable. The number of moose locations is the frequency of each habitat observed - this is summarized data as noted in the introduction.

Does this survey meet the assumptions required for this procedure?

- The proportions of actual habitat are determined by an aerial photograph and are known exactly (or with negligible error). This seems satisfied.

- Are the moose sightings a random sample of moose? The paper is not clear on how moose were sighted, but these represent observations collected over a number of days and different animals.

- Is each unit measured only once? Moose are indistinguishable, so it is unknown if the same moose is measured multiple times. However, the survey was over a number of days and the moose move considerable distances, so it seems reasonable that the observations are independent even though the same moose may have been measured multiple times.
We use Proc Freq in SAS to do the hypothesis test and to compute the individual proportions for each of the categories (along with 95% confidence intervals).

ods graphics on;
proc freq data=moose;
   title2 'compute actual proportions and test';
   /* the testp= option specifies the proportions in order of habitat classes (alphabetical) */
   /* Only the Pearson chi-square test statistic is available - the likelihood ratio test
      statistic cannot be obtained. Hence the test statistic (but not the p-value) differs
      from that from GENMOD below */
   table habitat / nocum testp=(.101 .340 .104 .455) chisq cl all plots=all;
   exact chisq lrchi;
   weight moosecount;
   table habitat / binomial(level=1);
   table habitat / binomial(level=2);
   table habitat / binomial(level=3);
   table habitat / binomial(level=4);
   ods output OneWayChiSq=chisq1;
   ods output OneWayFreqs=freq1;
   ods output BinomialProp=binprop1;
run;
ods graphics off;
The Table statements do the heavy lifting. In the first Table statement, we specify the KNOWN proportions from the null hypothesis and request the (large-sample) χ² test-statistic. In the remainder of the Table statements, we request the proportion for each level in the table along with the se and 95% confidence intervals using the Binomial option. Notice that (unfortunately) we must specify the Binomial option several times (once for each level in the table) to get the estimates and confidence intervals for each level in the table (groan!).

Notice that because the input data were summarized, the Weight statement specifies the variable that contains the frequency count for each habitat class.
Several tables are produced:
First is the table of the raw proportions and the hypothesized probabilities:

habitat                 Frequency   Percent   Test Percent
In burn, edge               22       18.97       10.10
In burn, interior           24       20.69       34.00
Out of burn, edge           30       25.86       10.40
Out of burn, further        40       34.48       45.50
along with a bar chart of the counts (but neither the proportions nor standard errors are given - groan):

These side-by-side bar charts are not totally satisfactory as the bars must add to 100%. A mosaic plot can be created in JMP or R but is not easily created in SAS - contact me for details.

The mosaic plot is a segmented bar chart that divides the bar into habitat usage by the number of moose points observed. It adds to 100% and is an alternate display to side-by-side histograms.
We now compute the raw proportions and their standard errors. Proc Freq computes the individual proportions, se, and confidence limits - unfortunately, there is no easy way to label these with the individual habitats, so one must know the order in which the habitats are sorted by comparing the values to the previous table:

Habitat (in sort order)   Proportion    ASE    95% Lower CL   95% Upper CL
In burn, edge               0.1897     0.0364     0.1183         0.2610
In burn, interior           0.2069     0.0376     0.1332         0.2806
Out of burn, edge           0.2586     0.0407     0.1789         0.3383
Out of burn, further        0.3448     0.0441     0.2583         0.4313
We observe that the sample proportion in "Out of burn, further" appears to be less than the actual proportion as measured by the aerial photographs, as its confidence interval does not include the actual proportion of 45.5%. Similarly, the observed proportion of moose in "Out of burn, edge" (26%) appears to be higher than the actual proportion of 10.4%, as again the confidence interval does not cover the actual value. In fact, none of the confidence intervals cover the actual values.
At this point, one could stop, but often a formal hypothesis test is conducted. The hypothesis of interest is

H: \pi_{in burn, interior} = .340; \pi_{in burn, edge} = .101; \pi_{out of burn, edge} = .104; \pi_{out of burn, further} = .455

where the \pi represent the proportions of habitat classes used by all the moose (and not just the sample observed). The alternate hypothesis is that the used proportions do not match the known proportions of habitat.
Finally, we get the result of the chi-square test of the specified proportions:

Statistic                 Value
Chi-Square               44.8321
DF                        3
Asymptotic Pr > ChiSq    <.0001
Exact Pr >= ChiSq        3.558E-08
SAS also provides a graphic representing the deviations of each proportion from the hypothesized value. It would be nice to place confidence limits (from the previous table) on this plot so that you could easily see visually which categories deviated substantially from the hypothesized values - too bad this isn't done automatically.

The p-values from both the Pearson and likelihood ratio tests are very small. There is strong evidence against the hypothesis that moose use the habitat in the same proportion as its availability. The confidence intervals clearly show where the preferences and avoidances lie.
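For readers not using SAS, the same goodness-of-fit statistic can be verified in one call. A sketch in Python using scipy, with the counts and known proportions taken from the tables above:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([24, 22, 30, 40])            # moose counts by habitat
known_prop = np.array([.340, .101, .104, .455])  # proportions from the aerial photo

# Expected counts under the null hypothesis of use proportional to availability
expected = observed.sum() * known_prop
chi2, p_value = chisquare(observed, f_exp=expected)
```

The statistic reproduces the SAS value of 44.8321 on 3 degrees of freedom, with a correspondingly tiny p-value.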
The book:

Manly, B.F.J., McDonald, L.L., Thomas, D.L., McDonald, T.L. and Erickson, W.P. (2002). Resource Selection by Animals. Kluwer Academic Publishers.

continues with this example and shows how to compute resource selection probabilities that can be interpreted as the probability that a randomly selected moose will select each habitat at a randomly chosen point in time.
14.4.2 Example: Homicide and Seasons
Is there a relationship between weather and violent crime? In the paper:
Cheatwood, D. (1988).
Is there a season for homicide? Criminology, 26, 287-306.
http://dx.doi.org/10.1111/j.1745-9125.1988.tb00842.x
the author classified 1361 homicides in Baltimore from 1974-1984 by season² as follows:
Winter 328
Spring 334
Summer 372
Fall 327
Is there evidence of a difference in the number of homicides by season?
The data is available in the homicideseason.csv file in the Sample Program Library at http://www.
stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in
the usual way:
data homicide;
length season $20.;
infile 'homicideseason.csv' dlm=',' dsd missover firstobs=2;
input season $ HomCount;
run;
The raw data are shown below:
Obs season HomCount
1 Winter 328
2 Spring 334
3 Summer 372
4 Fall 327
² Refer to Table 7 of the paper.
Let π_winter, π_spring, π_summer and π_fall represent the true proportion of homicides.
We first formulate the null hypothesis. This is always in terms of population parameters:

H: π_winter = π_spring = π_summer = π_fall = .25

where π_season represents the true proportion of all homicides in each of the seasons.
The assumptions seem to be satisfied for this analysis:
- The proportion of the year in each season is known exactly.
- Each homicide is measured once and only once; presumably multiple murders are counted only once.
We use Proc Freq in SAS to do the hypothesis test and to compute the individual proportions for each of
the categories (along with 95% confidence intervals).
ods graphics on;
proc freq data=homicide;
title2 'compute actual proportions and test';
/* the testp specifies the proportions in order of season classes (alphabetical) */
table season / nocum testp=(.25 .25 .25 .25) chisq cl all plots=all;
table season / binomial(level=1);
table season / binomial(level=2);
table season / binomial(level=3);
table season / binomial(level=4);
weight HomCount;
ods output OneWayChiSq=chisq1;
ods output OneWayFreqs=freq1;
ods output BinomialProp=binprop1;
run;
ods graphics off;
The Table statements do the heavy lifting. In the first Table statement, we specify the KNOWN proportions
from the null hypothesis and request the (large-sample) χ² test-statistic. In the remaining Table statements,
we request the proportion for each level in the table along with the se and 95% confidence intervals using
the Binomial option. Notice that (unfortunately) we must specify the Binomial option several times (once
for each level in the table) to get the estimates and confidence intervals for each level in the table (groan!).
Notice that because the input data was summarized, the Weight statement specifies the variable that
contains the frequency count for each level of season.
Several tables are produced:
First is the table of the raw proportions and the hypothesized probabilities:

season   Frequency   Percent   Test Percent
Fall     327         24.03     25.00
Spring   334         24.54     25.00
Summer   372         27.33     25.00
Winter   328         24.10     25.00
Second is the result of the chi-square test of the specified proportions.
Statistic Value
Chi-Square 4.0345
DF 3
Pr > ChiSq 0.2578
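The arithmetic behind this statistic is easy to check by hand. Here is a minimal sketch in Python (purely illustrative; the notes themselves use SAS) that computes the goodness-of-fit statistic from first principles:

```python
# Goodness-of-fit chi-square for the homicide counts against the
# hypothesized proportion of 0.25 per season.
counts = {"Winter": 328, "Spring": 334, "Summer": 372, "Fall": 327}
testp = 0.25

n = sum(counts.values())                   # 1361 homicides in total
expected = {s: n * testp for s in counts}  # 340.25 expected per season

# Sum of (observed - expected)^2 / expected over the four seasons
chi2 = sum((obs - expected[s]) ** 2 / expected[s] for s, obs in counts.items())
print(round(chi2, 4))  # 4.0345, matching the SAS output
```

The p-value then comes from comparing 4.0345 to a χ² distribution with 4 − 1 = 3 degrees of freedom, which gives the 0.2578 shown above.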
Finally, we get the individual proportions, se, and confidence limits; unfortunately, there is no easy way to
label these with the individual seasons, and so one must know the order in which the seasons are sorted.
Statistic Value
Proportion 0.2403
ASE 0.0116
95% Lower Conf Limit 0.2176
95% Upper Conf Limit 0.2630
Proportion 0.2454
ASE 0.0117
95% Lower Conf Limit 0.2225
95% Upper Conf Limit 0.2683
Proportion 0.2733
ASE 0.0121
95% Lower Conf Limit 0.2497
95% Upper Conf Limit 0.2970
Statistic Value
Proportion 0.2410
ASE 0.0116
95% Lower Conf Limit 0.2183
95% Upper Conf Limit 0.2637
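Each of these rows can be reproduced with the usual large-sample (Wald) formulas: the estimate is p̂ = x/n, the asymptotic standard error is sqrt(p̂(1 − p̂)/n), and the 95% interval is p̂ ± 1.96 se. A sketch in Python (illustrative only; the notes use SAS) for the Fall row, which is printed first because the seasons are sorted alphabetically:

```python
import math

# Proportion, asymptotic SE, and 95% Wald interval for Fall homicides
n_total = 1361
x_fall = 327

p_hat = x_fall / n_total
ase = math.sqrt(p_hat * (1 - p_hat) / n_total)
lower = p_hat - 1.96 * ase
upper = p_hat + 1.96 * ase
# These reproduce the SAS values 0.2403, 0.0116, 0.2176, 0.2630
```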
SAS also provides a graphic representing the deviations of each proportion from the hypothesized value:
It would be nice to place confidence limits (from the previous table) on this plot so that you could easily
see visually which category deviated substantially from the hypothesized value; too bad this isn't done
automatically.
As the p-value is quite large, there is no evidence that homicides occur unequally over the year. All of
the confidence intervals for the individual proportions include the hypothesized value of 0.25. Note that we
have made no adjustment in computing the confidence intervals for each season for the multiple-testing problem
(e.g. similar to the Tukey-adjustment following ANOVA). This can be done, but is beyond the scope of this
course.
14.5 Comparing sets of proportions - single factor CRD designs
In many cases, the question of interest is whether the proportions in the categories are the same across different
(treatment) groups. For example, is the proportion of heads the same for pennies produced in the United
States and Canada? Now the hypothesis of interest is written as:
H: π_heads,US = π_heads,Canada; AND π_tails,US = π_tails,Canada
The actual proportion of heads/tails in either population really is NOT of interest (we are not testing if both
populations are fair); we are only interested in testing if BOTH populations have the same set of proportions.
In this chapter, only single factor completely randomized designs are considered. This implies that every
object is independent of every other object and that the observational unit is the same as the experimental
unit. More advanced designs are covered in more advanced courses.
A VERY COMMON ERROR is to use the simple χ² test presented in this chapter even for more complex
designs!
Note that if there are only two categories, an alternative (and equivalent) approach is to use logistic
regression and logistic ANOVA. Please refer to the appropriate chapter for more details.
14.5.1 Example: Elk habitat usage - Random selection of points
Marcum and Loftsgaarden (1980) use data from Marcum (1975) where 200 randomly selected points were
located on a map of a study area which contained a mixture of forest-canopy cover classes. These were
broken into 4 cover classes as shown below.
At the same time, 325 locations of elk in the region were recorded as shown below.
Cover Class   Random Location   Elk Location
0%            15                3
01-25%        61                90
26-75%        84                181
76-100%       40                51
Total         200               325
The data is available in the elkhabitat.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual way:
data elkhabitat;
infile 'elkhabitat.csv' dlm=',' dsd missover firstobs=2;
input canopy $ random elk;
/* stack the data */
source = 'random';
count = random;
output;
source = 'elk';
count = elk;
output;
keep source canopy count;
run;
Part of the raw data are shown below:
Obs canopy source count percent
1 0% elk 3 1
2 01-25% elk 90 28
3 26-75% elk 181 56
4 76-100% elk 51 16
5 0% random 15 8
6 01-25% random 61 31
7 26-75% random 84 42
8 76-100% random 40 20
The hypothesis of interest is whether the proportion of elk usage of the canopy classes matches the actual
proportions. Notice how this differs from the moose habitat example earlier in the chapter, as the actual
proportion of each cover class is NOT known exactly and is only estimated using the sample of random
points. In modern times, with GIS and other aerial methods, there is little to be gained by sampling a random
sample of points rather than measuring the entire region.
We again consider how the data were collected to see if it meets the assumptions of the analysis:
- Is this a CRD? The article is not clear, but it appears that if the same elk were measured multiple times,
the timings are far enough apart to treat the locations as independent measurements.
- Observational and experimental unit the same? Yes, the elk or random point.
- Randomization structure is random in location and animals.
In order to get a mosaic chart, we need to compute the percentages in advance of making the segmented
bar-chart:
/* In order to make a mosaic chart, we need to compute the percents ourselves */
proc sort data=elkhabitat;
by source;
run;
proc means data=elkhabitat noprint;
by source;
var count;
output out=totalcount sum=totalcount;
run;
data elkhabitat;
merge elkhabitat totalcount;
by source;
percent = count / totalcount * 100;
format percent 7.0;
drop _type_ _freq_ totalcount;
run;
proc print data=elkhabitat;
title2 'percents computed';
run;
and then we use Proc SGplot to create the segmented bar-chart:
proc sgplot data=elkhabitat;
title2 'Side-by-side segmented barcharts';
vbar source / group=canopy
response=percent
groupdisplay=stack
stat=sum;
run;
finally giving:
In order to do a statistical test, we must first stack the data into a format similar to that used for ANOVA.
Notice how we stacked the data in the data step so that we have columns for the two types of points, the
canopy class, and the count of the number of elk locations.
We use Proc Freq to summarize the data into a contingency table and to compute the appropriate
percentages and the test statistics (see later):
ods graphics on;
proc freq data=elkhabitat;
title2 'Contingency table analysis';
table source * canopy / nocol nopercent chisq plot=all;
weight count;
ods output ChiSq=chisq1;
ods output CrossTabFreqs=freq1;
run;
ods graphics off;
and we get the standard table:
Frequency
Row Pct

Table of source by canopy

source    0%      01-25%   26-75%   76-100%   Total
elk       3       90       181      51        325
          0.92    27.69    55.69    15.69
random    15      61       84       40        200
          7.50    30.50    42.00    20.00
Total     18      151      265      91        525
We finally get the test statistic and p-value of the test of the hypothesis:
Table Statistic DF Value Prob
Table source * canopy Chi-Square 3 21.8835 <.0001
Table source * canopy Likelihood Ratio Chi-Square 3 21.9650 <.0001
The mosaic plot shows some evidence of a difference in usage by the elk compared to the random points.
The contingency table and row percentages show where the observed differences lie. Finally, the p-value
is very small indicating that there is very strong evidence that elk are not found in the canopy classes in
proportion to the availability.
To investigate where the differences occurred, look at the mosaic plot, and the table of observed and
expected counts.
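The expected counts and the overall statistic can be computed directly from the table using e_ij = (row total)(column total)/(grand total). A minimal Python sketch (illustrative only; the notes use SAS):

```python
# Pearson chi-square for the elk/random by canopy-class table
table = {
    "elk":    [3, 90, 181, 51],   # canopy classes 0%, 01-25%, 26-75%, 76-100%
    "random": [15, 61, 84, 40],
}
row_tot = {r: sum(c) for r, c in table.items()}
col_tot = [sum(col) for col in zip(*table.values())]
grand = sum(row_tot.values())  # 525 points in total

chi2 = 0.0
for r, counts in table.items():
    for j, obs in enumerate(counts):
        exp = row_tot[r] * col_tot[j] / grand  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 4))  # 21.8835, as in the SAS output, with (2-1)(4-1) = 3 df
```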
The book:
Manly, B.F.J., McDonald, L.L., Thomas, D.L., McDonald, T.L. and Erickson, W.P. (2002).
Resource Selection by Animals. Kluwer Academic Publishers
continues with this example and shows how to compute resource selection probabilities that can be
interpreted as the probability that a randomly selected elk will select each habitat at a randomly chosen point
in time.
14.5.2 Example: Ownership and viability
Does the type of ownership of a natural resource (e.g. oyster leases) affect the long term prospects? A sample
of oyster leases was classified by type of ownership and the long-term prospects as shown below:
Outlook
Unfavorable Neutral Favorable
Non-corporation 70 55 63
Corporation 90 77 75
The data is available in the oyster-lease.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual way:
data leases;
infile 'oyster-lease.csv' dlm=',' dsd missover firstobs=2;
length ownership $12. outlook $12.;
input ownership $ outlook $ count;
run;
Part of the raw data are shown below:
Obs ownership outlook count percent
1 corp unfavorable 90 37
2 corp neutral 77 32
3 corp favorable 75 31
4 non-corp unfavorable 70 37
5 non-corp neutral 55 29
6 non-corp favorable 63 34
This is an analytical survey as there is no manipulation of experimental units.
Does this survey meet the conditions of a single factor complete randomized design? There is a single
factor (type of ownership) with 2 levels (corporate or non-corporate ownership). There is a single size
of survey unit (the lease) and the observational unit is the same as the survey unit. There is insufficient
information to assess if the sampled leases are a simple random sample from the respective populations, so
we will have to assume that they were selected appropriately.
This data is in the form of a summary table, i.e. there are 430 leases that have been collapsed into a
simple table.
The null hypothesis is that:

H: outlook is independent of ownership, or π_ij = π_i π_j

where π_j is the overall proportion of leases with outlook j (the columns) and π_i is the overall proportion
of leases with ownership i (the rows). This is yet another way to write out the hypothesis of independence,
because independence implies that if you know the marginal proportions (the π_i and π_j), then you know
the individual cell proportions (the π_ij).
In order to get a mosaic chart, we need to compute the percentages in advance of making the segmented
bar-chart:
/* In order to make a mosaic chart, we need to compute the percents ourselves */
proc sort data=leases;
by ownership;
run;
proc means data=leases noprint;
by ownership;
var count;
output out=totalcount sum=totalcount;
run;
data leases;
merge leases totalcount;
by ownership;
percent = count / totalcount * 100;
format percent 7.0;
drop _type_ _freq_ totalcount;
run;
proc print data=leases;
title2 'percents computed';
run;
and then we use Proc SGplot to create the segmented bar-chart:
proc sgplot data=leases;
title2 'Side-by-side segmented barcharts';
vbar ownership / group=outlook
response=percent
groupdisplay=stack
stat=sum;
run;
finally giving:
The side-by-side segmented bar charts are very similar, so we don't expect to find much evidence of an
association between the long-term outlook and ownership status of the lease.
We use Proc Freq to summarize the data into a contingency table and to compute the appropriate
percentages and the test statistics (see later):
ods graphics on;
proc freq data=leases;
title2 'Contingency table analysis';
table ownership * outlook / nocol nopercent chisq plot=all;
weight count;
ods output ChiSq=chisq1;
ods output CrossTabFreqs=freq1;
run;
ods graphics off;
and we get the standard table:
Frequency
Row Pct

Table of ownership by outlook

ownership   favorable   neutral   unfavorable   Total
corp        75          77        90            242
            30.99       31.82     37.19
non-corp    63          55        70            188
            33.51       29.26     37.23
Total       138         132       160           430
Looking at the row percentages within the table, we see that the percentages for each outlook category
are similar for the two types of ownership, confirming our impression from the mosaic chart.
We finally get the test statistic and p-value of the test of the hypothesis:
Table Statistic DF Value Prob
Table ownership * outlook Chi-Square 2 0.4356 0.8043
Table ownership * outlook Likelihood Ratio Chi-Square 2 0.4358 0.8042
The Pearson chi-square test statistic is computed by comparing the observed counts (n_ij) and the expected
counts (e_ij) when the hypothesis of independence is true. The idea behind the test is that if the data are
consistent with the null hypothesis, then the expected and observed counts should be close; if the data are
not consistent with the null hypothesis, then the observed and expected counts should be substantially
different. The last entry in the cell is a measure of closeness of the two counts.
The Pearson chi-square value for each cell is formed as (observed − expected)² / expected (refer to the
technical details in the appendix for derivations). The overall Pearson chi-square test statistic is formed as
the sum of these entries over all cells in the table, i.e.,

χ² = 0.0915 + 0.1177 + 0.0990 + 0.1274 + 0.0000 + 0.0000 = 0.436.

The p-value is computed by comparing this test statistic to the χ² distribution - hence the name of the
test (refer to the appendix).
The p-value is 0.8043 which is very large. Hence we conclude that there is no evidence of a difference
in the proportion of outlooks among the ownership types, or that there is no evidence against outlook being
independent of ownership type.
We also check that all expected counts are reasonably large (a rough rule of thumb is that they should
all be about 5 or greater). This can be relaxed somewhat but, as shown later, the interpretation of p-values in
cases with many cells with small counts is problematic.
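Both checks - the cell contributions that sum to the overall statistic and the size of the expected counts - can be sketched directly. A minimal Python illustration (the notes use SAS; the column order assumed here is favorable, neutral, unfavorable):

```python
# Expected counts and per-cell chi-square contributions for the
# ownership-by-outlook table
table = {
    "corp":     [75, 77, 90],
    "non-corp": [63, 55, 70],
}
row_tot = {r: sum(c) for r, c in table.items()}
col_tot = [sum(col) for col in zip(*table.values())]
grand = sum(row_tot.values())  # 430 leases

contributions = []
min_expected = float("inf")
for r, counts in table.items():
    for j, obs in enumerate(counts):
        exp = row_tot[r] * col_tot[j] / grand
        min_expected = min(min_expected, exp)
        contributions.append((obs - exp) ** 2 / exp)

chi2 = sum(contributions)
print(round(chi2, 3))      # 0.436, the sum of the six cell contributions
print(min_expected >= 5)   # True - the rule of thumb is satisfied
```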
14.5.3 Example: Sex and Automobile Styling
The example we will look at is the relationship between sex and the type of automobile preferred.
A random sample of 303 people were asked for the person's sex, marital status, and type of car preferred
(Japanese, European, or American).
Here is part of the raw data:
Sex Marital status Age Car pref
Male Married 34 American
Male Single 36 Japanese
Male Married 23 Japanese
Male Single 29 American
...
There are 303 records, one for each person.
The data is available in the carpoll.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data carpoll;
infile 'carpoll.csv' firstobs=2 dlm=',' dsd missover;
input sex $ marital_status $ age country $ size $ type $;
count = 1; /* number of people with these characteristics */
run;
Part of the raw data are shown below:

Obs   sex      marital_status   age   country    size     type     count   percent
1     Female   Married          42    American   Large    Family   1       1
2     Female   Married          40    European   Medium   Family   1       1
3     Female   Married          26    American   Medium   Family   1       1
4     Female   Married          26    European   Small    Sporty   1       1
5     Female   Married          37    American   Medium   Sporty   1       1
6     Female   Single           27    Japanese   Medium   Family   1       1
7     Female   Married          25    Japanese   Small    Family   1       1
8     Female   Single           29    European   Small    Sporty   1       1
9     Female   Married          31    European   Medium   Family   1       1
10    Female   Married          26    European   Medium   Sporty   1       1
This is an analytical survey as sex or marital status cannot (yet) be randomly assigned to individuals.
The factor of interest is sex, which has two levels. The marital status could also be used as a factor. If both
variables are to be analyzed together, then this would be a two-factor experiment which is beyond the scope
of this course.
The experimental unit is the same as the observational unit in this study and is the person interviewed.
Note that only one member of a family, i.e., either the husband, or the wife, or another adult over 18 would
be included in the survey, and only one response from a household would be used so that family influences
would be minimized. If both husband and wife had been included in the dataset, it would no longer be a
completely randomized design as families would now tend to be treated as blocks.
We assume that people have been selected at random from the relevant treatment groups.
The response variable is automobile preference.
One could argue that there is no clear distinction between a response and an explanatory variable. Is sex
the explanatory variable to try and explain the car preference response, or is car preference the explanatory
variable to try and explain the sex of the subject? This is analogous to correlation between two interval/ratio
variables where there is no clear distinction. It turns out that the test statistic and results are identical
regardless of which variable is treated as the explanatory or response variable.
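This symmetry is easy to demonstrate numerically: the Pearson statistic computed from a table and from its transpose is identical. A small Python sketch (illustration only; the analysis itself uses SAS) with the sex-by-country counts from this example:

```python
def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / grand) ** 2
               / (row_tot[i] * col_tot[j] / grand)
               for i in range(len(table)) for j in range(len(table[0])))

# Rows = sex (Female, Male); columns = country (American, European, Japanese)
sex_by_country = [[54, 19, 65], [61, 21, 83]]
country_by_sex = [list(row) for row in zip(*sex_by_country)]  # the transpose

# The statistic is the same whichever variable plays the "row" role
print(round(pearson_chi2(sex_by_country), 4))  # 0.3118 either way
```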
The first thing to be done is to create a contingency table, which is a table of counts that cross-classifies
the data.
In order to get a mosaic chart, we need to compute the percentages in advance of making the segmented
bar-chart:
/* In order to make a mosaic chart, we need to compute the percents ourselves */
proc sort data=carpoll;
by sex;
run;
proc means data=carpoll noprint;
by sex;
var count;
output out=totalcount sum=totalcount;
run;
data carpoll;
merge carpoll totalcount;
by sex;
percent = count / totalcount * 100;
format percent 7.0;
drop _type_ _freq_ totalcount;
run;
and then we use Proc SGplot to create the segmented bar-chart:
proc sgplot data=carpoll;
title2 'Side-by-side segmented barcharts';
vbar sex / group=country
response=percent
groupdisplay=stack
stat=sum;
run;
finally giving:
The mosaic chart shows that both sexes appear to have similar preferences for country of manufacture. It
is easy to compare cumulative percentages using this chart, and it is easy to compare the relative sizes of the
top and bottom segments (the American or Japanese brands), but it is more difficult to compare the middle
segment (the European brand). For this reason, mosaic charts are often drawn to put the two most important
categories at the top and bottom of each bar so that they are easily compared across treatment groups.
If you want a mosaic chart based on row percentages, you will need to reverse the roles of the country of
manufacture and sex variables, i.e., make country of manufacture an X variable, and sex a Y variable.
We use Proc Freq to summarize the data into a contingency table and to compute the appropriate
percentages and the test statistics (see later):
ods graphics on;
proc freq data=carpoll;
title2 'Contingency table analysis';
table sex * country / nocol nopercent chisq plot=all;
ods output ChiSq=chisq1;
ods output CrossTabFreqs=freq1;
run;
ods graphics off;
and we get the standard table:
Frequency
Row Pct

Table of sex by country

sex      American   European   Japanese   Total
Female   54         19         65         138
         39.13      13.77      47.10
Male     61         21         83         165
         36.97      12.73      50.30
Total    115        40         148        303
Each row percentage is computed as the cell count divided by the row total. Each row is done separately, and
each row's percentages add to 100%. Don't report too many decimal places - I would be inclined only to
report the percentages rounded to a whole number rather than two decimal places.
Comparing the row percentages we see that there doesn't seem to be much of a difference between the
country of manufacture preferences of males and females as the percentages are fairly close.
Finally, we examine the statistical test. The null hypothesis is that the proportion who choose a car
manufactured in the three countries is the same for all levels of sex. Another way to state the null hypothesis
is that preference for country of manufacturing is independent of sex.
We finally get the test statistic and p-value of the test of the hypothesis:
Table Statistic DF Value Prob
Table sex * country Chi-Square 2 0.3118 0.8556
Table sex * country Likelihood Ratio Chi-Square 2 0.3119 0.8556
Not unexpectedly, the p-value is quite large (.8556) which indicates that there is little evidence against the
null hypothesis, i.e., we find no evidence against the hypothesis of independence.
Finally check that the expected counts in each cell are at least 5. NOTE: the EXPECTED counts need
to be checked, not the observed counts.
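The expected count for a cell is (row total × column total) / grand total; these are the numbers that the rule of thumb applies to. A quick Python sketch (illustrative) for the sex-by-country table:

```python
# Expected counts for the sex-by-country table under independence
rows = {"Female": [54, 19, 65],   # American, European, Japanese
        "Male":   [61, 21, 83]}
row_tot = {r: sum(c) for r, c in rows.items()}
col_tot = [sum(col) for col in zip(*rows.values())]
grand = sum(row_tot.values())  # 303 people

expected = {r: [row_tot[r] * ct / grand for ct in col_tot] for r in rows}
smallest = min(min(e) for e in expected.values())
print(round(smallest, 2))  # 18.22 (Female, European) - comfortably above 5
```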
14.5.4 Example: Marijuana use in college
The following was taken from the paper "Marijuana Use in College" (Youth and Society (1979): 323-34).
Four hundred and forty-five college students were classified according to both frequency of marijuana use
and parental use of alcohol and psychoactive drugs.
The individual student responses are not available, only the summary data were given in the paper:
Parental use of     Student level of
Alcohol and Drugs   marijuana use      Count
Neither             Never              141
Neither             Occasional         54
Neither             Regular            40
One                 Never              68
One                 Occasional         44
One                 Regular            51
Both                Never              17
Both                Occasional         11
Both                Regular            19
The data is available in the marijuana.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data marijuana;
infile 'marijuana.csv' dlm=',' dsd missover firstobs=2;
length parentuse $12. studentuse $12.;
input parentuse $ studentuse $ count;
run;
Part of the raw data are shown below:
Obs parentuse studentuse count percent
1 Both Never 17 36
2 Both Occasional 11 23
3 Both Regular 19 40
4 Neither Never 141 60
5 Neither Occasional 54 23
6 Neither Regular 40 17
7 One Never 68 42
8 One Occasional 44 27
9 One Regular 51 31
Does this study meet the requirements for a single factor completely randomized design?
In order to get a mosaic chart, we need to compute the percentages in advance of making the segmented
bar-chart:
/* In order to make a mosaic chart, we need to compute the percents ourselves */
proc sort data=marijuana;
by parentuse;
run;
proc means data=marijuana noprint;
by parentuse;
var count;
output out=totalcount sum=totalcount;
run;
data marijuana;
merge marijuana totalcount;
by parentuse;
percent = count / totalcount * 100;
format percent 7.0;
drop _type_ _freq_ totalcount;
run;
proc print data=marijuana;
title2 'percents computed';
run;
and then we use Proc SGplot to create the segmented bar-chart:
proc sgplot data=marijuana;
title2 'Side-by-side segmented barcharts';
vbar parentuse / group=studentuse
response=percent
groupdisplay=stack
stat=sum;
run;
finally giving:
We use Proc Freq to summarize the data into a contingency table and to compute the appropriate
percentages and the test statistics (see later):
ods graphics on;
proc freq data=marijuana;
title2 'Contingency table analysis';
table parentuse * studentuse / nocol nopercent chisq plot=all;
weight count;
ods output ChiSq=chisq1;
ods output CrossTabFreqs=freq1;
run;
ods graphics off;
and we get the standard table:
Frequency
Row Pct

Table of parentuse by studentuse

parentuse   Never    Occasional   Regular   Total
Both        17       11           19        47
            36.17    23.40        40.43
Neither     141      54           40        235
            60.00    22.98        17.02
One         68       44           51        163
            41.72    26.99        31.29
Total       226      109          110       445
The mosaic plot shows substantial differences (how can you tell?). This impression is confirmed in the
contingency table (how can you tell?) where it appears that students are more prone to use these substances
if the parents also used these substances.
The Analysis of Deviance (no pun intended) table is shown below. The null hypothesis is that student
usage is independent of parental usage.
We nally get the test statistic and p-value of the test of the hypothesis:
Table Statistic DF Value Prob
Table parentuse * studentuse Chi-Square 4 22.3731 0.0002
Table parentuse * studentuse Likelihood Ratio Chi-Square 4 22.2536 0.0002
The p-value is very small (.0002) indicating very strong evidence against the null hypothesis.
Check to see that all expected counts are at least 5 (how is this done?) to ensure that the conditions
necessary for the validity of the chi-square test are satisfied.
At this point, it is fairly clear looking at the mosaic plots and contingency table where the non-independence
is occurring; there appears to be a shift toward regular usage among students as the use by parents increases.
A more sophisticated analysis could be performed taking into account that the levels of X and Y are
ordinal scale, but this is beyond the scope of this course.
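As a hint for the "how is this done?" question above: form the expected counts e_ij = (row total)(column total)/n and look at the smallest one. A minimal Python sketch (illustrative; the notes use SAS) for the parental-use by student-use table:

```python
# Expected counts for the marijuana-use table under independence
rows = {
    "Both":    [17, 11, 19],    # Never, Occasional, Regular
    "Neither": [141, 54, 40],
    "One":     [68, 44, 51],
}
row_tot = {r: sum(c) for r, c in rows.items()}
col_tot = [sum(col) for col in zip(*rows.values())]
grand = sum(row_tot.values())  # 445 students

expected = {r: [row_tot[r] * ct / grand for ct in col_tot] for r in rows}
smallest = min(min(e) for e in expected.values())
print(round(smallest, 2))  # 11.51 (Both parents, Occasional use) - above 5,
                           # so the chi-square approximation is reasonable
```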
14.5.5 Example: Outcome vs. cause of accident
Accidents along the Trans-Canada highway were cross-classified by the cause of the accident and by the
outcome of the accident. Here are some summary statistics:
Cause Outcome Number of Accidents
speeding death 42
speeding no death 88
drinking death 61
drinking no death 185
reckless death 20
reckless no death 74
other death 12
other no death 86
The data is available in the accident.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual
way:
data accident;
length cause $12. result $12.;
infile 'accident.csv' dlm=',' dsd missover firstobs=2;
input cause $ result $ count;
run;
The raw data are shown below:
Obs   cause      result     count   percent
1     drinking   death      61      25
2     drinking   no death   185     75
3     other      death      12      12
4     other      no death   86      88
5     reckless   death      20      21
6     reckless   no death   74      79
7     speeding   death      42      32
8     speeding   no death   88      68
Does this study meet the assumptions for a single factor completely randomized design? In particular,
how are multiple people within a car treated? Is it reasonable to treat these as independent events?
We use Proc Freq to summarize the data into a contingency table and to compute the appropriate
percentages and the test statistics (see later):
ods graphics on;
proc freq data=accident;
title2 'Contingency table analysis';
table cause * result / nocol nopercent chisq plot=all;
weight count;
ods output ChiSq=chisq1;
ods output CrossTabFreqs=freq1;
run;
ods graphics off;
and we get the standard table:
Frequency
Row Pct

Table of cause by result

cause      death   no death   Total
drinking   61      185        246
           24.80   75.20
other      12      86         98
           12.24   87.76
reckless   20      74         94
           21.28   78.72
speeding   42      88         130
           32.31   67.69
Total      135     433        568
In order to get a mosaic chart, we need to compute the percentages in advance of making the segmented
bar-chart:
/* In order to make a mosaic chart, we need to compute the percents ourselves */
proc sort data=accident;
by cause;
run;
proc means data=accident noprint;
by cause;
var count;
output out=totalcount sum=totalcount;
run;
data accident;
merge accident totalcount;
by cause;
percent = count / totalcount * 100;
format percent 7.0;
drop _type_ _freq_ totalcount;
run;
proc print data=accident;
title2 'percents computed';
run;
and then we use Proc SGplot to create the segmented bar-chart:
proc sgplot data=accident;
title2 'Side-by-side segmented barcharts';
vbar cause / group=result
response=percent
groupdisplay=stack
stat=sum;
run;
finally giving:
Our hypotheses are:
H: the outcome (fatal or not fatal) is independent of the cause of the accident.
A: the outcome (fatal or not fatal) is not independent of the cause of the accident.
We finally get the test statistic and p-value of the test of the hypothesis:
Table Statistic DF Value Prob
Table cause * result Chi-Square 3 12.8800 0.0049
Table cause * result Likelihood Ratio Chi-Square 3 13.6420 0.0034
The overall test statistic is found by summing the individual cell chi-square values (the squares of the residuals, if present) and is 12.880. The test statistic will be compared to a chi-square distribution with 3 df.
The observed p-value is .0049, which provides strong evidence against the null hypothesis.
We find the evidence sufficiently strong against the null hypothesis and conclude that there is evidence
against the independence of outcome and cause of accident.
Check to see that the expected counts are at least 5 to ensure the validity of the chi-square test.
Looking at the cell chi-square values in the contingency table, we see that the major differences appear to occur with other and speeding as the cause of the accident, with a lower and higher fatality rate respectively compared to the other two causes. Their contributions to the overall chi-square test statistic are about 4 or higher.
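The cell-by-cell computation behind this output is easy to verify by hand. The following Python sketch (these notes use SAS, so this is only an illustrative cross-check; the function name `chi2_sf_3df` is my own) recomputes the expected counts, the cell contributions, and the overall statistic from the observed counts, and uses the closed-form survival function of the chi-square distribution with 3 df for the p-value:

```python
import math

# Observed (death, no death) counts for each cause, from the table above
observed = {
    "drinking": (61, 185),
    "other":    (12,  86),
    "reckless": (20,  74),
    "speeding": (42,  88),
}

row_tot = {c: sum(v) for c, v in observed.items()}
col_tot = [sum(v[j] for v in observed.values()) for j in (0, 1)]
n = sum(row_tot.values())

x2 = 0.0
for cause, counts in observed.items():
    for j, obs in enumerate(counts):
        exp = row_tot[cause] * col_tot[j] / n   # e_ij = r_i * c_j / n
        x2 += (obs - exp) ** 2 / exp

def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 df (closed form)."""
    phi = 0.5 * (1.0 + math.erf(math.sqrt(x / 2.0)))
    return 2.0 * (1.0 - phi) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

print(round(x2, 2), round(chi2_sf_3df(x2), 4))  # 12.88 0.0049
```

This reproduces the Pearson statistic of 12.88 and the p-value of .0049 quoted above.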
14.5.6 Example: Activity times of birds
A survey was conducted on the feeding behaviour of sandpipers at two sites close to Vancouver, B.C. The two sites are Boundary Bay (BB) and Sidney Island (SI).
An observer went to each site and observed flocks of sandpipers. The observer recorded the time spent by a flock between landing (most often to feed) and departing (either because of a predator nearby or other cues).
The protocol called for the observer to stop recording and to move to a new flock if the time spent exceeded 10 minutes. This is a form of censoring - the actual time the flock spent on the ground is unknown; all that is known in these cases is that the time exceeded 10 minutes.
Had the exact times been recorded for every flock, the analysis would be straightforward. It would be analyzed as a single factor (site with 2 levels) with a completely randomized design. Of course, the inference is limited to these two sites, and we are making assumptions that once flocks leave, their subsequent return, reformation, and activity may be treated as introducing a new, independent flock.
Sophisticated methods are available to deal with the censored data, but a simple analysis can be conducted by classifying the time spent into three categories: 0-5 minutes, 5-10 minutes, and 10+ minutes. [The division into 5 minute time blocks is somewhat arbitrary.]
The hypothesis is that the proportion of flocks that appear in each time block is independent of site.
The data is available in the flocks.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data flocks;
infile 'flocks.csv' dlm=',' dsd missover firstobs=2;
length site $12. timeclass $12.;
input site $ time;
if 0 <= time < 300 then timeclass = '00-05 minutes';
if 300 <= time < 600 then timeclass = '05-10 minutes';
if 600 <= time then timeclass = '10+ minutes';
count = 1;
run;
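The same classification logic can be written in any language. Here is a small Python equivalent of the data-step cut-points above (the function name `time_class` is my own; the 300/600 cut-points are seconds, i.e. 5 and 10 minutes, and times are assumed non-negative with censored times recorded as 600):

```python
def time_class(time_sec):
    """Classify a flock's time on the ground, mirroring the SAS data step."""
    if 0 <= time_sec < 300:
        return "00-05 minutes"
    if 300 <= time_sec < 600:
        return "05-10 minutes"
    return "10+ minutes"   # censored observations are recorded as 600 seconds

print(time_class(600))  # 10+ minutes
```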
Part of the raw data are shown below:
Obs site timeclass time count percent
1 bb 10+ minutes 600 1 4
2 bb 10+ minutes 600 1 4
3 bb 10+ minutes 600 1 4
4 bb 10+ minutes 600 1 4
5 bb 10+ minutes 600 1 4
6 bb 10+ minutes 600 1 4
7 bb 10+ minutes 600 1 4
8 bb 10+ minutes 600 1 4
9 bb 10+ minutes 600 1 4
10 bb 10+ minutes 600 1 4
The usual mosaic plots, contingency tables, and test statistics are produced:
Frequency
Row Pct
Table of site by timeclass

site    00-05 minutes   05-10 minutes   10+ minutes   Total
bb        1    4.35       2    8.70      20   86.96     23
si       20   37.04       7   12.96      27   50.00     54
Total    21               9              47             77
Table Statistic DF Value Prob
Table site * timeclass Chi-Square 2 10.1804 0.0062
Table site * timeclass Likelihood Ratio Chi-Square 2 12.2183 0.0022
The p-value is very small, giving very strong evidence against the null hypothesis, and it is quite clear where the difference lies between the two sites. However, before writing up the results, look at the expected counts: one is quite small. Fortunately, the contribution of that cell to the total test statistic is very small (it contributes only .1762 to the overall value of 12.218), so this is not a problem.
Notice that one of the expected values is quite small (less than 5). In cases like this the chi-square test should be treated with caution, as very small expected counts tend to inflate the test statistic and make the results look more significant than they really are.
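It is easy to check which expected count triggers this concern. A short Python sketch (an illustrative cross-check; the analysis itself was done in SAS) computes the expected count and the cell chi-square contribution for every cell of the table above and flags any expected count below 5:

```python
# Observed counts per site for the three time classes, from the table above
obs = {"bb": [1, 2, 20], "si": [20, 7, 27]}
classes = ["00-05 minutes", "05-10 minutes", "10+ minutes"]

col_tot = [sum(obs[s][j] for s in obs) for j in range(3)]
n = sum(col_tot)

for site, counts in obs.items():
    r = sum(counts)
    for j, o in enumerate(counts):
        e = r * col_tot[j] / n                 # expected count under independence
        cell = (o - e) ** 2 / e                # this cell's chi-square contribution
        flag = "  <-- expected count < 5" if e < 5 else ""
        print(f"{site} {classes[j]}: expected {e:5.2f}, cell chi-square {cell:5.3f}{flag}")
```

The flagged cell (bb, 05-10 minutes) has an expected count of about 2.69 but contributes only about 0.176 to the overall statistic, matching the discussion above.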
14.6 Pseudo-replication - Combining tables
Hurlbert (1984) states:
Chi-square is one of the most misapplied of all statistical procedures.
According to Hurlbert (1984), the major problem is that individual units are treated as independent
objects, when in fact they are not. Experimenters often pool experimental units from disparate sets of
observations. He specifically labels this pooling as sacrificial pseudo-replication.
Hurlbert (1984) cites the example of an experiment to investigate the effect of fox predation upon the
sex ratio of mice. Four plots are established. Two of the plots are randomly chosen and a fox-proof fence is
erected around the plots. The other two plots are controls.
Here are the data (Table 6 of Hurlbert (1984)):
            Plot   % Males   Number males   Number females
Foxes       A1       63%          22              13
            A2       56%           9               7
No foxes    B1       60%          15              10
            B2       43%          97             130
Many researchers would pool over the replicates to give the pooled table:
            Plot      % Males   Number males   Number females
Foxes       A1+A2       61%          31              20
No foxes    B1+B2       44%         112             140
If a χ² test is applied to the pooled data, the p-value is less than 5%, indicating there is evidence that the
sex ratio is not independent of the presence of foxes.
Hurlbert (1984) identifies at least 4 reasons why the pooling is not valid:
- non-independence of observations. The 35 mice caught in A1 can be regarded as 35 observations all subject to a common cause, as can the 16 mice in A2, as each group was subject to a common influence in the patches. Consequently, the pooled mice are NOT independent; they represent two sets of interdependent or correlated observations. The pooled data set violates the fundamental assumption of independent observations.
- throws away some information. The pooling throws out the information on the variability among replicate plots. Without such information there is no proper way to assess the significance of the differences between treatments. Note that in previous cases of ordinary pseudo-replication (e.g. multiple fish within a tank), this information is also discarded but is not needed - what is needed is the variation among tanks, not among fish. In the latter case, averaging over the pseudo-replicates causes no problems.
- confusion of experimental and observational units. If one carries out a test on the pooled data, one is implicitly redefining the experimental unit to be individual mice and not the field plots. The enclosures (treatments) are applied at the plot level and not the mouse level. This is similar to the problem of multiple fish within a tank that is subject to a treatment.
- unequal weighting. Pooling weights the replicate plots differentially. For example, suppose that one enclosure had 1000 mice with 90% being male, and a second enclosure had 10 mice with 10% being male. The pooled data would have 1000 + 10 mice with 900 + 1 being male, for an overall male ratio of about 89%. Had the two enclosures been given equal weight, the average male percentage would be (90% + 10%)/2 = 50%. In the above example, the number of mice captured in the plots varies from 16 to over 200; the plot with over 200 mice essentially drives the results.
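The unequal-weighting point is simple arithmetic; here is the hypothetical two-enclosure example worked in Python (an illustration only - the counts are the made-up ones from the text above):

```python
males  = [900, 1]      # 90% of 1000 mice; 10% of 10 mice
totals = [1000, 10]

# pooling weights the big enclosure heavily
pooled_pct = 100 * sum(males) / sum(totals)                            # ~89% male

# equal weighting averages the per-enclosure percentages
equal_weight_pct = 100 * sum(m / t for m, t in zip(males, totals)) / 2  # 50% male

print(round(pooled_pct, 1), round(equal_weight_pct, 1))  # 89.2 50.0
```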
Hurlbert (1984) suggests the proper way to analyze the above experiment is to essentially compute a single number for each plot and then do a two-sample t-test on the percentages. [This is equivalent to the ordinary averaging process that takes place in ordinary pseudo-replication.] For example, with the above table, the data for the t-test would be:
Treatment % males
Foxes 63
Foxes 56
No Foxes 60
No Foxes 43
The results from a simple t-test conducted in SAS are:
Variable   treatment    N   Mean     Std Error   Lower Limit of Mean   Upper Limit of Mean
p_males    foxes        2   0.5955    0.0330           0.1758                1.0153
p_males    no.foxes     2   0.5137    0.0863          -0.5834                1.6108
p_males    Diff (1-2)       0.0819    0.0924          -0.3159                0.4796
The estimated difference in the sex ratio between colonies that are subject to fox predation and colonies
not subject to fox predation is .082 (SE .092) with p-values of .46 (pooled t-test) and .51 (unpooled t-test)
respectively. As the p-values are quite large, there is NO evidence of a predation effect.
With only two replicates (the colonies), this experiment is likely to have very poor power to detect
anything but gross differences.
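The pooled two-sample t-test can be reproduced with a few lines of arithmetic. This Python sketch (illustrative only; `pooled_t` is my own function name) uses the raw percentages from the small table above, whereas the SAS run used the exact proportions computed from the counts, so its p-values differ slightly. With n1 = n2 = 2 the test has 2 df, for which the two-sided p-value of the t distribution has a simple closed form:

```python
import math

foxes    = [63.0, 56.0]   # percent males, one value per plot
no_foxes = [60.0, 43.0]

def pooled_t(a, b):
    """Pooled two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb)), na + nb - 2

t, df = pooled_t(foxes, no_foxes)
# two-sided p-value; this closed form is valid only for df = 2
p = 1 - abs(t) / math.sqrt(2 + t * t)
print(round(t, 2), df, round(p, 2))  # 0.87 2 0.48
```

The large p-value again indicates no evidence of a predation effect.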
The above analysis is not entirely satisfactory. The proportions of males have different variabilities because they are based on different numbers of total mice. A more refined analysis is now available using generalized linear mixed models (GLMMs).
For this experiment, the model would be specified (in the usual short-hand notation) as:

logit(p_males) = Treatment Colony(Treatment)(R)

where the Colony(Treatment) term is the random effect of the experimental units (the colonies). A logistic-type model is used.
The output from GLIMMIX (SAS 9.3) follows.
First is an estimate of the variability among colonies (on the logit scale):
Parameter           Estimate   Standard Error
colony(treatment)    0.06892       0.1675
Next is the test of the overall treatment effect:
Effect      Num DF   Den DF   F Value   Pr > F
treatment      1      1.847     1.22    0.3919
The p-value is .39; again no evidence of a predation effect on the proportion of males in the colonies. Finally,
an estimate of the treatment effect:
Effect      Label        Estimate   Standard Error     DF    t Value   Pr > |t|   Alpha     Lower    Upper
treatment   trt effect     0.5269       0.4763       1.847    1.11      0.3919     0.05   -1.6940   2.7478
Some caution is required. The estimate of .53 (SE .47) is the difference, on the logit scale, in the proportion of males between colonies with predators and colonies without predators. If you take exp(.53) = 1.69, this is the estimated odds-ratio of males to females comparing colonies with predators to colonies without predators. The 95% confidence interval for the odds-ratio is exp(-1.6940) = 0.18 to exp(2.7478) = 15.61, which includes the value of 1 (indicating no effect). Consult the chapter on logistic regression for an explanation of odds and odds-ratios. Consult the chapter on advanced logistic regression for more details.
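The back-transformation from the logit scale is just exponentiation; a quick Python check of the numbers quoted in the GLIMMIX output above:

```python
import math

# estimate and 95% confidence limits on the logit scale, from the output above
est, lower, upper = 0.5269, -1.6940, 2.7478

odds_ratio = math.exp(est)
ci = (math.exp(lower), math.exp(upper))
print(round(odds_ratio, 2), round(ci[0], 2), round(ci[1], 1))  # 1.69 0.18 15.6
```

Because the interval (0.18, 15.6) straddles 1, the odds-ratio is consistent with no predation effect.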
Hurlbert (1984) then continues:
The commonness of this type of chi-square misuse probably is traceable to the kinds of examples
found in statistics texts, which too often are only from genetics, or from mensurative rather
than manipulative experiments, or from manipulative experiments (e.g. medical ones) in which
individual organisms are the experimental units and not simply components of them, as in the
mammal field studies cited. ... I know of no statistics textbook that provides clear and reliable
guidance on this matter.
We (statisticians) are guilty as charged! Hopefully, things will change in the future.
One of the reasons for the urge to pool is that, until recently, good software was not readily available.
This is no longer the case - please seek assistance before pooling.
14.7 Simpson's Paradox - Combining tables
Another problem related to pooling of tables is that the pooled table may show different results than the
individual tables. This is generically known as Simpson's Paradox.
Simpson's paradox is an example of the dangers of lurking variables.
14.7.1 Example: Sex bias in admissions
Is there a sex bias in admissions? Consider the following tables on the number of admissions to an MBA
program and a Law program cross-classified by sex. For each table compute the % admitted for each sex
(row percentage).
Business School
            Admit        Deny
Male      480 (80%)   120 (20%)
Female    180 (90%)    20 (10%)
It would appear that females are admitted at a slightly higher rate than males in the Business school.
Similarly, look at a table for Law school.
Law School
            Admit        Deny
Male       10 (10%)    90 (90%)
Female    100 (33%)   200 (66%)
Again, it appears that females are admitted at a higher rate than males in the Law school.
And what happens when the tables are combined?
Business and Law Schools
            Admit        Deny
Male      490 (70%)   210 (30%)
Female    280 (56%)   220 (44%)
Now females seem to be admitted at a lower rate than males!
Why has this happened? This is caused by the different percentages in admission in the two tables; they
really shouldn't be combined. It is not caused by different sample sizes.
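The reversal is easy to verify by recomputing the admission rates from the counts. A short Python sketch (the `rate` helper is my own name for the row percentage):

```python
# (admit, deny) counts by school and sex, from the tables above
tables = {
    "business": {"male": (480, 120), "female": (180, 20)},
    "law":      {"male": (10, 90),   "female": (100, 200)},
}

def rate(admit, deny):
    return admit / (admit + deny)

# within each school, females are admitted at the higher rate
for school, by_sex in tables.items():
    print(school, {sex: round(rate(*c), 2) for sex, c in by_sex.items()})

# but in the pooled table the ordering reverses
pooled = {sex: rate(sum(tables[s][sex][0] for s in tables),
                    sum(tables[s][sex][1] for s in tables))
          for sex in ("male", "female")}
print(pooled)  # male 0.70, female 0.56
```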
14.7.2 Example: Twenty-year survival and smoking status
This is based on Ignoring a covariate: An example of Simpson's Paradox by Appleton, D.R., French, J.M.,
and Vanderpump, M.P. (1996, American Statistician, 50, 340-341).
In 1972-1974, a one-in-six survey of the electoral roll, largely concerned with thyroid disease and heart
disease, was carried out in Wickham, a mixed urban and rural district near Newcastle-upon-Tyne in the
UK. Twenty years later, a follow-up study was conducted.
Here are the results for two age groups of females. Each table shows the twenty-year survival status for
smokers and non-smokers.
Age 55-64
                 Dead        Alive
Smokers        51 (44%)    64 (56%)
Non-smokers    40 (33%)    81 (67%)
Age 65-74
                 Dead        Alive
Smokers        29 (80%)     7 (20%)
Non-smokers   101 (78%)    28 (22%)
It appears that smokers die at a higher rate than non-smokers in each table.
And what happens when the tables are combined?
Ages 55-74 combined
                 Dead        Alive
Smokers        80 (53%)    71 (47%)
Non-smokers   141 (56%)   109 (44%)
Now smokers seem to have a lower death rate!
What has happened? Most of the smokers have died off before reaching the older age classes, and so the
higher number of deaths (in absolute numbers) for the non-smokers in the older age classes has obscured the
result.
The original article had 7 age classes. In each age class the smokers had from 1.2 to 2 times the death rate
of non-smokers, yet in the pooled table, the non-smokers had a 50% higher death rate!
14.8 More complex designs
It is possible to extend the above analyses to more complex situations, e.g., multi-factor designs, blocked
designs, sub-sampling etc. As with the analysis of means where the general methodology of ANOVA was
developed, comparable methods, log-linear modeling or generalized linear modeling, can handle all of these
situations. Routines to do these analyses are available in SAS and many other computer packages. As before,
be sure that your brain is in gear before engaging the computer!
The hypotheses can remain fairly simple, like "Is there a dependence in the occurrences of the two
species on fungus brackets of the same age?" They can also become more complex. One might also be
interested in, for example, the possibility of differences in the dependence patterns in young vs. old brackets.
Or you may have a four-way breakdown of animals by sex, breeding status, age, and condition factor and
are interested in relationships among these attributes.
Unfortunately, these types of analyses are beyond the scope of these notes.
14.9 Final notes
- There are special formulas for 2x2 tables. However, the methods presented in this chapter are general to all sizes of tables and will give the same results.
- The concept of one-sidedness and two-sidedness for the hypothesis of independence does not exist except when the contingency table is a 2x2 table. A one-sided analysis is covered in more advanced classes in statistics.
- Caution: if some of the cells have expected values < 5 there may be problems, since the individual X²_ij value may be unusually large (e.g. if e_ij = 0.01, then a difference between the observed and expected values of 1 is expanded to an X²_ij of 100!) and the total X² statistic may no longer be reliable. This is a particular problem if you conclude that there is evidence against the null hypothesis - always check the expected counts in each cell and the individual X²_ij statistics to see if a single cell's results are distorting the final statistic.
- If the contingency table has small counts, then the Fisher Exact Test is a better way to compute the test statistic and p-value, as it doesn't make the large-sample approximations used in the Pearson chi-square or the likelihood-ratio G² tests.
- The preferred language is to conclude that there is evidence against independence, or that there is insufficient evidence against the hypothesis of independence. The following phrases should be avoided:
  - "conclude that the variables are dependent" - you don't know; all you have done is collected evidence against independence.
  - "conclude that the variables are independent" - you don't know; all you have done is failed to find evidence against independence.
  It sounds picky, but it is the same reason that juries either convict or fail to convict people, rather than declaring the party innocent or guilty.
- As always, Type I and II errors are possible. A Type I error (a false positive result) would be to conclude that the two variables are not independent (find evidence against the null hypothesis) when in fact they are independent (the null is true). A Type II error (a false negative result) would be to conclude that there is no evidence against independence (fail to find evidence against the null hypothesis) when in fact the variables are not independent.
14.10 Appendix - how the test statistic is computed
This section is not required for Stat 403/650.
Refer to an earlier section, where a study was conducted to determine if the type of ownership of a
natural resource (e.g. oyster leases) affects the long-term prospects. A sample of oyster leases was classified
by type of ownership and the long-term prospects as shown below:
                                Outlook
                  Favorable   Neutral   Unfavorable   Total
Corporation           75         77          90        242
Non-corporation       63         55          70        188
Total                138        132         160        430
Define the notation:
- π_ij = true proportion of outlook j (favourable, neutral, or unfavourable) in group i (corp or non-corp ownership). These are the population parameters.
- n_ij = observed count of outlook j (favourable, neutral, or unfavourable) in group i (corp or non-corp ownership). These are the sample statistics.
- e_ij = expected count in cell (i, j) if the hypothesis of independence is true.
The Pearson chi-square test statistic is computed by comparing the observed counts (n_ij) to the expected counts (e_ij) when the hypothesis is true. The idea behind the test is that if the data are consistent with the null hypothesis, then the expected and observed counts should be close; if the data are not consistent with the null hypothesis, then the observed and expected counts should be substantially different.
How would an expected count be found when the hypothesis is true? Well, it seems reasonable to estimate π_i (the proportion of leases of ownership i) by the marginal count (in this case the row sum) divided by the overall sample size. Hence π̂_i = (row i marginal total)/(total sample size) and

π̂_1 = r_1/n = 242/430
π̂_2 = r_2/n = 188/430

where r_1 and r_2 are the marginal totals for rows 1 and 2 and n is the total sample size.
Similarly, it seems reasonable (if the hypothesis were true and there was no difference between the proportions in each column) to estimate π_j (the proportion of leases with outlook j) by the marginal column total divided by the overall sample size. Hence π̂_j = (column j marginal total)/(total sample size) and:

π̂_1 = c_1/n = 138/430
π̂_2 = c_2/n = 132/430
π̂_3 = c_3/n = 160/430

where c_1, c_2, and c_3 are the marginal totals for columns 1-3 and n is the total sample size.
Putting these together, it seems reasonable (if the hypothesis were true and there was no difference between proportions in each column) to estimate π_ij = π_i π_j and, furthermore, an estimate of the expected number in each cell as the total sample size × π̂_ij, or

e_ij = n π̂_ij = n π̂_i π̂_j = n (r_i/n)(c_j/n) = r_i c_j / n.

This is the third entry in each cell labeled Expected. For example:
- the entry in the (favourable, corp) cell is found as 138(242)/430 = 77.67
- the entry in the (neutral, non-corp) cell is found as 132(188)/430 = 57.71
The test statistic is computed for that cell as:

X²_ij = (observed_ij - expected_ij)² / expected_ij = (n_ij - e_ij)² / e_ij

which is the last entry in the table. For example:

X²_11 = (75 - 77.67)²/77.67 = 0.0915
X²_22 = (55 - 57.71)²/57.71 = 0.1274
The overall test statistic is then found by summing all these individual entries:

X² = Σ (observed_ij - expected_ij)² / expected_ij = Σ (n_ij - e_ij)² / e_ij
   = 0.0915 + 0.1177 + 0.0990 + 0.1274 + 0.0000 + 0.0000 = 0.436.
It is compared to a chi-square distribution with (r-1)(c-1) degrees of freedom, where r = number of rows and c = number of columns. In this case, it would be compared to a chi-square distribution with (2-1)(3-1) = 2 degrees of freedom.
The p-value is found as the probability that the appropriate chi-square distribution exceeds the test statistic. [Despite its appearance, this is not a one-sided test in the usual sense of the word. As in ANOVA, the concept of one- or two-sidedness really has no meaning in chi-square tests.]
The approximate p-value can be obtained from the chi-square distribution table below by looking at the 2 df line. We see that this table indicates that the p-value is larger than 0.30, which is consistent with JMP.
0.3000 0.2000 0.1500 0.1000 0.0500 0.0250 0.0100 0.0050 0.0010 0.0005
df -------------------------------------------------------------------------------------------
1 1.074 1.642 2.072 2.706 3.841 5.024 6.635 7.879 10.828 12.116
2 2.408 3.219 3.794 4.605 5.991 7.378 9.210 10.597 13.816 15.202
The rough rule of thumb to investigate where differences could occur is derived from the fact that each cell's contribution to the overall test statistic has an approximate chi-square distribution with a single degree of freedom, and the 95th percentile of such a distribution is about 4.
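The whole appendix calculation fits in a few lines. Here is a Python sketch (illustrative only; the notes use SAS and JMP) that implements e_ij = r_i c_j / n for the oyster-lease table and sums the cell contributions:

```python
# Observed counts: rows = (corporation, non-corporation);
# columns = (favorable, neutral, unfavorable)
obs = [[75, 77, 90],
       [63, 55, 70]]

r = [sum(row) for row in obs]                              # row totals: 242, 188
c = [sum(obs[i][j] for i in range(2)) for j in range(3)]   # column totals: 138, 132, 160
n = sum(r)                                                 # grand total: 430

x2 = 0.0
for i in range(2):
    for j in range(3):
        e = r[i] * c[j] / n          # expected count under independence
        x2 += (obs[i][j] - e) ** 2 / e

df = (len(obs) - 1) * (len(obs[0]) - 1)
print(round(x2, 3), df)  # 0.436 2
```

This reproduces X² = 0.436 on 2 df, and the first expected count of 138(242)/430 = 77.67.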
14.11 Fisher's Exact Test
The traditional χ² test for independence relies upon an approximation to the sampling distribution of the test statistic (the X² value) in order to obtain the p-value. However, this approximation works well only when the EXPECTED count in each cell of the table is reasonably large; the rules of thumb are that each expected count should be at least 5, but the approximation still works well if the expected counts are as low as 3.
In some cases with very sparse data, the simple χ² test for independence does not work well. For example,
consider an experiment similar to that of:
McCleery, R.A., Lopez, R.R, Silvy, N.J., and Gallant, D.L. (2008).
Fox squirrel survival in urban and rural environments.
Journal of Wildlife Management, 72, 133-137.
http://dx.doi.org/10.2193/2007-138
They released fox squirrels with radio collars and tracked their subsequent survival.
Here is a (fictitious) data summary for rural areas:
Site Sex Predated Not-predated
Rural M 0 15
F 3 12
The traditional χ² test gives:
Frequency
Expected
Row Pct

Table of sex by outcome

sex     not-pred               predated           Total
f        12   13.5    80.00     3   1.5   20.00     15
m        15   13.5   100.00     0   1.5    0.00     15
Total    27                     3                   30
Statistic DF Value Prob
Chi-Square 1 3.3333 0.0679
Likelihood Ratio Chi-Square 1 4.4929 0.0340
Continuity Adj. Chi-Square 1 1.4815 0.2235
Mantel-Haenszel Chi-Square 1 3.2222 0.0726
Phi Coefficient -0.3333
Contingency Coefficient 0.3162
Cramer's V -0.3333
WARNING: 50% of the cells have expected counts less than 5.
(Asymptotic) Chi-Square may not be a valid test.
i.e. a test statistic of 4.49 with a (likelihood) p-value of .0340, and the Pearson χ² test p-value of .068.
However, the expected counts[3] are small in the second column and the χ² approximation is dubious. The key problem is that when the X² is computed, each cell's discrepancy between the observed and expected value is inflated by the expected value (i.e. divided by the expected value). Consequently, small expected counts lead to very large contributions from that particular cell.
Fisher's Exact Test is traditionally used in these circumstances. It is most often used in 2x2 tables, but can be used in larger tables as well, although the computations quickly become tedious. Modern statistical software (e.g. SAS) has good algorithms for large problems.
[3] The expected count in the cell in row i and column j is obtained as E_ij = r_i c_j / n, where r_i is the row total, c_j is the column total, and n is the grand total. For example, E_11 = r_1 c_1 / n = 15(27)/30 = 13.5.
There is also a version of Fisher's Exact Test suitable for goodness-of-fit tests; this is not examined in
this section of the notes.
14.11.1 Sampling Protocol
The same sampling protocol as for ordinary χ² tests is required, i.e. a completely randomized design with a single observation per experimental unit. A common problem is the unthinking use of Fisher's Exact Test (and other statistics) in cases where they are not appropriate.
14.11.2 Hypothesis
The null hypothesis is the same in Fisher's Exact Test as in the ordinary χ² test, i.e. the null hypothesis is that the row and column variables are independent, or that the proportions in each row (or column) are equal.
In the example of the fox squirrels above, the hypothesis is that the predation rate of males is the same
as the predation rate of females in the population of interest. Again, the hypothesis refers to the population
of interest and NOT to the observed sample.
The alternate hypothesis in a 2x2 table can either be one-sided (the predation rate of males is less than the predation rate of females) or two-sided (the predation rate of males is different than the predation rate of females). In larger tables, the concept of one-sidedness or two-sidedness isn't relevant, and the alternate hypothesis is that there is a difference in proportions somewhere in the experiment.
14.11.3 Computation
This section will outline the conceptual framework for the computation of Fisher's Exact Test using complete enumeration. The algorithms used in modern software do NOT do a complete enumeration and use sophisticated algorithms to speed up the computations.
Fisher's Exact Test starts by enumerating all the possible tables where the row and column totals are fixed.
In the fox-squirrel example, it enumerates all of the possible tables where there are 27 squirrels not predated;
3 squirrels predated; 15 females; and 15 males.
There are 4 possible tables:
Site Sex Predated Not-predated
Rural M 0 15
F 3 12
Site Sex Predated Not-predated
Rural M 1 14
F 2 13
Site Sex Predated Not-predated
Rural M 2 13
F 1 14
Site Sex Predated Not-predated
Rural M 3 12
F 0 15
For each table, the probability of the table is computed (conditional upon the set of possible tables). Note that for the tables above, we can index them by the number of male squirrels that were predated (the number in the upper left corner of the tables). Once n_11 is specified, all of the other values are found by subtraction since the row and column totals are fixed. The probability of each table is found as:

P(N_11 = n_11) = ( Π_i n_i+! ) ( Π_j n_+j! ) / ( n_++! Π_ij n_ij! )

where n_i+ are the row totals for row i, n_+j are the totals for column j, and n_++ is the grand total. For example,

P(n_11 = 0) = (15! 15! 27! 3!) / (30! 0! 15! 3! 12!) = .112
The complete table of probabilities is:

n_11   Probability
  0       .112
  1       .388
  2       .388
  3       .112

In general, the table is NOT symmetric in the probabilities.
For the one-sided alternatives, we add together the probabilities of all tables that correspond to the observed outcome or a more extreme one. For example, the observed number of predated male squirrels is 0. If the alternate hypothesis is that male squirrels have a lower predation rate than females, then the number of predated males should be 0 or less. There is only 1 such table, and hence the one-sided p-value for this alternative is .112.
If the alternate hypothesis is that male squirrels have a higher predation rate than females, then the probabilities of tables corresponding to n_11 >= 0 are added together. This is actually all of the tables, so the one-sided p-value for this alternative is 1.00.
There are several suggested ways to compute the p-value for two-sided alternatives - refer to Agresti (2002), p. 93. One method is to simply double the smallest one-sided p-value, using a rule similar to that in normal-theory tests. This would give a two-sided p-value of .224. A second method is to add the probabilities of all tables whose probability is <= the probability of the observed table. In this case, this corresponds to the tables with n_11 equal to 0 or 3, again giving a two-sided p-value of .224.
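For a 2x2 table, the enumeration above is just the hypergeometric distribution, and it is small enough to compute directly. This Python sketch (an illustration of the computation, not the SAS internals) reproduces the table of probabilities and the p-values:

```python
from math import comb

males, females, predated, n = 15, 15, 3, 30   # fixed margins

# P(N11 = k): hypergeometric probability that k of the predated squirrels are male
probs = {k: comb(males, k) * comb(females, predated - k) / comb(n, predated)
         for k in range(predated + 1)}

p_lower = probs[0]                    # one-sided: males predated less (observed k = 0)
p_upper = sum(probs.values())         # one-sided: males predated more (all tables) = 1
p_two   = sum(p for p in probs.values() if p <= probs[0] + 1e-12)

print({k: round(p, 3) for k, p in probs.items()})  # {0: 0.112, 1: 0.388, 2: 0.388, 3: 0.112}
print(round(p_lower, 3), round(p_two, 3))          # 0.112 0.224
```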
The data are read into SAS in the usual way:
data squirrel;
length sex outcome $10.;
input sex $ outcome $ count;
datalines;
m predated 0
m not-pred 15
f predated 3
f not-pred 12
;;;;
Proc Freq is used to obtain the contingency tables; notice the use of the Exact statement to request Fisher's
Exact Test:
ods graphics on;
proc freq data=squirrel;
title2 'simple two-way table analysis';
table sex * outcome / chisq measures cl relrisk riskdiff
nocol nopercent expected;
exact fisher or chisq ;
weight count;
ods output FishersExact=FishersExact;
run;
ods graphics off;
This gives the output for the Fisher Exact Test:
Statistic Value
Cell (1,1) Frequency (F) 12
Left-sided Pr <= F 0.1121
Right-sided Pr >= F 1.0000
Statistic Value
Table Probability (P) 0.1121
Two-sided Pr <= P 0.2241
SAS also computes a number of measures of relative risk and odds ratios.
14.11.4 Example: Relationship between Aspirin Use and MI
This example is taken from Agresti (2002), p. 72, Table 3.1. It concerns a long-term Swedish study to
examine the impact of taking aspirin upon subsequent heart attacks (myocardial infarctions).
The data are:
                 MI
Aspirin    Yes     No
Yes         18    658
No          28    656
This is a classical randomized design and fulfills the sampling requirements for Fisher's Exact Test.
The data are read into SAS in the usual way:
data aspirin;
length aspirin mi $4.;
input aspirin $ mi $ count;
datalines;
yes yes 18
yes no 658
no yes 28
no no 656
;;;;
Proc Freq is used to obtain the contingency tables; notice the use of the Exact statement to request Fisher's
Exact Test:
ods graphics on;
proc freq data=aspirin;
title2 'simple two-way table analysis';
table aspirin * mi / chisq measures cl relrisk riskdiff nocol nopercent;
exact fisher or chisq ;
weight count;
ods output FishersExact=FishersExact;
run;
ods graphics off;
This gives the output:
Statistic Value
Cell (1,1) Frequency (F) 656
Left-sided Pr <= F 0.0949
Right-sided Pr >= F 0.9468
Table Probability (P) 0.0417
Two-sided Pr <= P 0.1768
SAS also computes a number of measures of relative risk and odds ratios.
Mechanics of the test
There are now 47 possible tables with the following probabilities, where n_11 is the number of people who
took aspirin and had a heart attack. Our observed value is 18.
n11    Probability
 0     8.547087e-15
 1     4.159315e-13
 2     9.870249e-12
 3     1.522164e-10
 4     1.715339e-09
 5     1.505870e-08
 6     1.072153e-07
 7     6.364053e-07
 8     3.212936e-06
 9     1.400604e-05
10     5.334182e-05
11     1.791460e-04
12     5.345671e-04
13     1.426018e-03
14     3.418037e-03
15     7.392312e-03
16     1.447590e-02
17     2.574072e-02
18     4.166081e-02
19     6.148833e-02
20     8.288309e-02
21     1.021500e-01
22     1.152002e-01
23     1.189359e-01
24     1.124306e-01
25     9.729742e-02
26     7.704779e-02
27     5.578509e-02
28     3.688792e-02
29     2.224374e-02
30     1.220853e-02
31     6.084544e-03
32     2.745707e-03
33     1.117974e-03
34     4.090136e-04
35     1.337738e-04
36     3.887400e-05
37     9.961707e-06
38     2.230215e-06
39     4.311260e-07
40     7.088462e-08
41     9.716430e-09
42     1.080170e-09
43     9.354616e-11
44     5.919893e-12
45     2.434601e-13
46     4.882510e-15
The one-sided alternative, that the probability of a heart attack is LOWER when taking aspirin, is found
by adding up the individual cell probabilities for all tables where n11 <= 18, or

P = 8.54 x 10^-15 + ... + .04166 = .09489

The two-sided alternative, that the probability of a heart attack is DIFFERENT when taking aspirin
compared to not taking aspirin, is found by adding up all the probabilities of tables whose probability is
<= .04166. This corresponds to tables with n11 in the range 0, ..., 18 and 28, ..., 46, for a total of .1768.
14.11.5 Avoidance of cane toads by Northern Quolls
This example was created by Nathan Nastili in 2012.
In 1935, the highly toxic cane toad was introduced to Australia to aid with pest control of scarab beetles.
The beetles were wreaking havoc on sugarcane crops. Unfortunately, this decision led to an unforeseen and
devastating effect on Australia's wildlife due to animals consuming the toxic toad. Damage has included
the possible extinction of the Northern Australian quoll. Although initiatives such as relocating the quoll to
nearby islands have been taken in order to save the species, there is no guarantee the new habitats will
remain toadless. Scientists have developed a new plan of attack using conditioned taste aversion (CTA) in
order to save the quoll population.
The goal of CTA is to have a subject associate a negative experience (illness) with an edible substance.
More specifically, CTA is a conditioning technique applied to subjects (predators) in order to deter them
from consuming poisonous substances (prey). CTA works as follows: subjects (quolls) are given a non-lethal
dose of the poisonous substance (toad) injected with a nausea-inducing chemical. Scientists hope the
subject will remember the experience and avoid the substance in the future. In this case, scientists are hoping
to have the quolls avoid the highly toxic toad.
The paper:
O'Donnell, S., Webb, J.K. and Shine, R. (2010).
Conditioned taste aversion enhances the survival of an endangered predator imperiled by a toxic
invader.
Journal of Applied Ecology, 47, 558-565.
http://dx.doi.org/10.1111/j.1365-2664.2010.01802.x
discusses an experiment with Northern Quolls being conditioned to avoid cane toads.
A sample of 62 quolls was taken (32 males and 30 females). The quolls were then split into two
treatment groups: toad smart (quolls that were given the CTA treatment; 15 males and 16 females) and
toad naive (the control group; 17 males and 14 females). A sample of 34 quolls (21 males, 13 females) was
subjected to a prescreening trial (prior to release into the wild).
The trial proceeded as follows: both toad naive and toad smart quolls were exposed to a live cane
toad in a controlled environment. The quolls were monitored using hidden cameras and their response was
recorded. The response variable had three levels: attack (attacked the toad), reject (sniffed but did not pursue)
and ignore. The observations for male and female quolls have been tabulated in the contingency tables below.
Male Quolls
              Attack  Avoid  Reject
Toad Naive         4      1       5
Toad Smart         1      0      10

Female Quolls
              Attack  Avoid  Reject
Toad Naive         0      4       1
Toad Smart         0      2       6
The data are available in quolls.csv in the http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms directory.
It is read into SAS in the usual way.
data quolls;
infile 'quolls.csv' dlm=',' dsd missover firstobs=2;
length sex trt response $10.;
input sex $ trt $ response count;
if count = 0 then delete;
run;
giving
Obs sex trt response count
1 F Naive Avoid 4
2 F Naive Reject 1
3 F Smart Avoid 2
Obs sex trt response count
4 F Smart Reject 6
5 M Naive Attack 4
6 M Naive Avoid 1
7 M Naive Reject 5
8 M Smart Attack 1
9 M Smart Reject 10
We are interested in determining whether there is a treatment effect on the response of the male and female
quolls. In other words, we are testing the row (treatment) and column (response) classifications of each table for
independence. Notice that in tables bigger than 2 x 2, the concept of a one-sided or two-sided test does not exist,
and so the alternative is simply "not independent".
H0: CTA had no effect on the proportion in each reaction class between naive and smart quolls.

Ha: CTA had an effect on the proportion in each reaction class between naive and smart quolls.
We start with the usual χ² test for each sex separately and print out the expected counts for each cell.
Proc Freq is used to compute the χ² test and obtain the expected counts for each cell:
proc freq data=quolls;
title2 'simple two-way table analysis';
by sex;
table trt * response / chisq measures cl relrisk riskdiff nocol nopercent expected;
exact fisher chisq ;
weight count;
ods output FishersExact=FishersExact;
run;
giving
sex=F

Table of trt by response
(each cell shows: Frequency / Expected / Row Pct)

trt      Avoid       Reject      Total
Naive    4           1           5
         2.3077      2.6923
         80.00       20.00
Smart    2           6           8
         3.6923      4.3077
         25.00       75.00
Total    6           7           13

sex=M

Table of trt by response
(each cell shows: Frequency / Expected / Row Pct)

trt      Attack      Avoid       Reject      Total
Naive    4           1           5           10
         2.3810      0.4762      7.1429
         40.00       10.00       50.00
Smart    1           0           10          11
         2.6190      0.5238      7.8571
         9.09        0.00        90.91
Total    5           1           15          21
The χ² test statistics are:

sex=F
  Pearson Chi-Square Test
  Chi-Square               3.7452
  DF                       1
  Asymptotic Pr > ChiSq    0.0530
  Exact Pr >= ChiSq        0.1026

sex=M
  Pearson Chi-Square Test
  Chi-Square               4.4291
  DF                       2
  Asymptotic Pr > ChiSq    0.1092
  Exact Pr >= ChiSq        0.0777
As we can see from above, the estimated expected values in the majority of cells of both contingency
tables (male and female) are less than 5, which suggests that a chi-squared test might not be trustworthy.
However, a Fisher's exact test may be used. Fisher's exact test is a procedure which determines the
probability of observing tables with the same row and column totals as the observed table. Using these
probabilities it is possible to determine how much the observed table may deviate from what is expected if
the hypothesis of independence is true.

The Fisher Exact test is requested using the Exact statement in Proc Freq. The results are:
sex=F
  Fisher's Exact Test
  Cell (1,1) Frequency (F)    4
  Left-sided Pr <= F          0.9953
  Right-sided Pr >= F         0.0862
  Table Probability (P)       0.0816
  Two-sided Pr <= P           0.1026

sex=M
  Fisher's Exact Test
  Table Probability (P)       0.0426
  Pr <= P                     0.0777
For neither sex is there enough evidence to reject the null hypothesis of independence between the
treatment and response variables. This is not surprising given the small sample sizes.
It is possible to combine the results from the two tables. You should NOT simply pool the data from
the two sexes; this would be an example of sacrificial pseudo-replication. Rather, you should combine the
p-values from the two sexes using the discrete version of Fisher's method for combining p-values. Yes, this is
the same Fisher who created the Exact Test.
For large-sample cases, the distribution of the p-value will follow a Uniform distribution if the null
hypothesis is true. Then you can combine the results using

χ² = -2 Σ log(p_i)

where p_i is the p-value from the i-th test.[4] This is compared to a χ² distribution with 2k df, where k is the
number of tests being combined.
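For illustration, the large-sample combination can be computed directly (a Python sketch using the two exact p-values from the output above; the closed-form chi-square tail below is valid only for even df, which is all Fisher's method needs, and the caveat about sparse tables still applies):

```python
from math import exp, log

def chi2_sf_even_df(x, df):
    """Upper-tail P(X > x) for a chi-square with EVEN df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    k = df // 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= (x / 2) / (i + 1)
    return exp(-x / 2) * total

# Exact p-values for the two sexes, from the Fisher tests above
p_values = [0.1026, 0.0777]

chi2 = -2 * sum(log(p) for p in p_values)   # Fisher's combining statistic
df = 2 * len(p_values)                      # 2k degrees of freedom
combined_p = chi2_sf_even_df(chi2, df)
print(round(chi2, 3), round(combined_p, 4))
```

With these two p-values the combined p-value is about 0.046, not far from the 0.048 given by the discrete analog discussed below.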
This method doesn't work well in the case of sparse contingency tables because the null distribution
of the p-value is no longer uniform. The paper:
Mielke, P.W., Johnston, J.E., and Berry, K.J. (2004).
Combining probability values from independent permutation tests: A discrete analog of Fisher's
classical method.
Psychological Reports, 95, 449-458.
http://dx.doi.org/10.2466/pr0.95.2.449-458
describes a method of combining the two p-values. Following this method, the joint p-value is found to be
.0483, and so there is some (but not very strong) evidence of an effect of training the quolls. Consult the R
code for details on computing the test.
How is Fisher's Exact test computed? This section will outline the conceptual framework for the
computation of Fisher's Exact Test using complete enumeration. The algorithms used in modern software
do NOT do a complete enumeration and use sophisticated algorithms to speed up the computations.
Computation for Female data.
The Female data form a 2 x 2 table after dropping the column with all zeros. We start by enumerating all the
possible tables where the row and column totals are fixed, in the same way as previously done. There are six
possible tables, with n11 ranging from 0 to 5, with the following probabilities:
n11  n12  n21  n22   Probability
  0    5    6    2   0.016317016
  1    4    5    3   0.163170163
  2    3    4    4   0.407925408
  3    2    3    5   0.326340326
  4    1    2    6   0.081585082
  5    0    1    7   0.004662005
[4] Refer to http://en.wikipedia.org/wiki/Fishers_method.
For each table, the probability of the table is computed (conditional upon the set of possible tables).
Note that the tables above can be indexed by n11, the count in the upper-left corner of the table (the
number of Naive quolls that avoided the toad). Once n11 is specified, all of the other values are found by
subtraction since the row and column totals are fixed. The probability of each table is found as:
P(N11 = n11) = [(Π_i n_i+!)(Π_j n_+j!)] / [n_++! Π_ij n_ij!]

where n_i+ are the row totals for row i, n_+j are the totals for column j, and n_++ is the grand total. For example,

P(n11 = 0) = (5! 8! 6! 7!) / (13! 0! 5! 6! 2!) = .0163
In general, the table is NOT symmetric in the probabilities.
For the one-sided alternatives, we add together the probabilities of all tables that correspond to the
observed outcome or a more extreme one. For example, the observed number of Naive-Avoid quolls is 4. If the
alternative hypothesis is that naive quolls avoid the toad at a higher rate than smart quolls, then the number of Naive-Avoid
quolls should be 4 or more. There are 2 such tables, and hence the one-sided p-value for this alternative
is .08159 + .004662 = .08625.
There are several suggested ways to compute the p-value for two-sided alternatives; refer to Agresti
(2002), p. 93. One method is to simply double the smallest one-sided p-value, using a rule similar to that in
normal-theory tests. This would give a two-sided p-value of .1725. A second method is to add the probabilities of all
tables whose probability is <= the probability of the observed table. In this case, this corresponds to
the tables with n11 equal to 0, 4 or 5, giving a two-sided p-value of .08159 + .004662 + .016317 = .1025641,
which is the value reported.
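The six-table enumeration above can be verified with a few lines (a Python sketch with exact integer arithmetic; not the R code mentioned earlier):

```python
from math import comb

# Female quoll 2x2 table: Naive (4, 1), Smart (2, 6)
r1, r2, c1 = 5, 8, 6    # row totals (Naive, Smart) and Avoid column total
n = r1 + r2             # 13

def table_prob(n11):
    """Probability of the table indexed by n11 = Naive-Avoid count."""
    return comb(r1, n11) * comb(r2, c1 - n11) / comb(n, c1)

probs = {k: table_prob(k) for k in range(r1 + 1)}   # n11 = 0, ..., 5

obs = 4
p_one = sum(p for k, p in probs.items() if k >= obs)        # n11 = 4 or 5
p_two = sum(p for p in probs.values() if p <= probs[obs])   # adds the n11 = 0 table
print(round(p_one, 5), round(p_two, 5))
```

The six probabilities reproduce the table above, and the two p-values match the .08625 and .1025641 just computed.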
Computation for Male data.
It is possible to do a Fisher Exact test for tables larger than 2 x 2, but it can be tedious, and good computing
algorithms have been developed to enumerate all of the possible tables subject to the row and column
constraints. Here, we simply enumerate them by brute force.
n11  n12  n13  n21  n22  n23   Probability
  0    1    9    5    0    6   0.014189886
  1    1    8    4    0    7   0.091220699
  2    1    7    3    0    8   0.182441398
  3    1    6    2    0    9   0.141898865
  4    1    5    1    0   10   0.042569659
  5    1    4    0    0   11   0.003869969
  0    0   10    5    1    5   0.008513932
  1    0    9    4    1    6   0.070949432
  2    0    8    3    1    7   0.182441398
  3    0    7    2    1    8   0.182441398
  4    0    6    1    1    9   0.070949432
  5    0    5    0    1   10   0.008513932
For each table, the probability of the table is computed (conditional upon the set of possible tables).
The probability of each table is found as:

P(Table) = [(Π_i n_i+!)(Π_j n_+j!)] / [n_++! Π_ij n_ij!]

where n_i+ are the row totals for row i, n_+j are the totals for column j, and n_++ is the grand total. For example,
for the first table listed above,

P(Table with n11 = 0) = (10! 11! 5! 1! 15!) / (21! 0! 1! 9! 5! 0! 6!) = 0.014189886
Notice that, in general, you would not do all that multiplication of factorials followed by division by all
those factorials, as this is VERY subject to round-off and other numerical difficulties. The above computation
is for illustration purposes only.
In larger tables, the concept of one-sided or two-sided doesn't really exist (similar to what happens in
ANOVA with 3+ levels). There are several suggested ways to compute the p-value for larger tables. One
method is to add the probabilities of all tables whose probability is <= the probability of the observed table.
In this case, this corresponds to the table in the 5th row above along with all tables whose probability is <= .04256966.
This gives a p-value of .04256966 + 0.014189886 + 0.003869969 + 0.008513932 + 0.008513932 = 0.07766,
which is the value reported.
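The brute-force enumeration for the 2 x 3 male table can be sketched as follows (Python, not the notes' own code; once the margins are fixed, the first row determines the whole table):

```python
from math import factorial as fact

row_tot = [10, 11]       # Naive, Smart
col_tot = [5, 1, 15]     # Attack, Avoid, Reject
n = sum(row_tot)         # 21

def table_prob(first_row):
    """P(Table) = (prod of row totals!)(prod of col totals!) / (n! prod of cells!)."""
    second_row = [c - x for c, x in zip(col_tot, first_row)]
    num = fact(row_tot[0]) * fact(row_tot[1])
    for c in col_tot:
        num *= fact(c)
    den = fact(n)
    for cell in list(first_row) + second_row:
        den *= fact(cell)
    return num / den

# Enumerate every first row (n11, n12, n13) consistent with the margins
tables = []
for n11 in range(col_tot[0] + 1):
    for n12 in range(col_tot[1] + 1):
        n13 = row_tot[0] - n11 - n12        # remaining cell by subtraction
        if 0 <= n13 <= col_tot[2]:
            tables.append((n11, n12, n13))

probs = {t: table_prob(t) for t in tables}
obs = (4, 1, 5)                             # observed Naive row
p_value = sum(p for p in probs.values() if p <= probs[obs])
print(len(tables), round(p_value, 5))
```

This reproduces the 12 tables listed above and the reported p-value of 0.07766.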
Combining the p-values from the two tables.
The joint p-value is computed by first finding the product of the probabilities of the observed tables, i.e.
p_obs,joint = .08159 x .04256966 = 0.003473049. This is compared to the products for all possible pairs of
tables (one from M and one from F). In this case there are 12 x 6 = 72 possible pairs of tables. We find
the sum of all the joint table probabilities that don't exceed the value .00347 and get a joint p-value of
0.0483845.
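The discrete combination can be spelled out numerically. The sketch below (Python) hardcodes the two probability lists exactly as enumerated above; the small relative tolerance is needed because one pair of products ties exactly with the observed product and floating point would otherwise break the tie arbitrarily:

```python
# Probabilities of all possible tables for each sex, as enumerated above
p_female = [0.016317016, 0.163170163, 0.407925408,
            0.326340326, 0.081585082, 0.004662005]
p_male = [0.014189886, 0.091220699, 0.182441398, 0.141898865,
          0.042569659, 0.003869969, 0.008513932, 0.070949432,
          0.182441398, 0.182441398, 0.070949432, 0.008513932]

p_obs = 0.081585082 * 0.042569659   # product for the two observed tables

# Sum the joint probabilities of the 6 x 12 = 72 pairs of tables whose
# product does not exceed the observed product (ties included)
joint_p = sum(pf * pm
              for pf in p_female for pm in p_male
              if pf * pm <= p_obs * (1 + 1e-6))
print(round(joint_p, 5))
```

The result matches the reported joint p-value of 0.0483845.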
Chapter 15
Correlation and simple linear regression
Contents
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
15.2 Graphical displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
15.2.1 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
15.2.2 Smoothers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 882
15.3 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886
15.3.1 Scatter-plot matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 887
15.3.2 Correlation coefcient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
15.3.3 Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
15.3.4 Principles of Causation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
15.4 Single-variable regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
15.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
15.4.2 Equation for a line - getting notation straight (no pun intended) . . . . . . . . . . 895
15.4.3 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 896
15.4.4 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
15.4.5 Obtaining Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
15.4.6 Obtaining Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 902
15.4.7 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
15.4.8 Example - Yield and fertilizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
15.4.9 Example - Mercury pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
15.4.10 Example - The Anscombe Data Set . . . . . . . . . . . . . . . . . . . . . . . . . 923
15.4.11 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
15.4.12 Example: Monitoring Dioxins - transformation . . . . . . . . . . . . . . . . . . . 925
15.4.13 Example: Weight-length relationships - transformation . . . . . . . . . . . . . . . 937
15.4.14 Power/Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946
15.4.15 The perils of R^2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 947
15.5 A no-intercept model: Fulton's Condition Factor K . . . . . . . . . . . . . . . . . . 950
15.6 Frequently Asked Questions - FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . 957
15.6.1 Do I need a random sample; power analysis . . . . . . . . . . . . . . . . . . . . 957
15.1 Introduction
A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and
Creighton, L. (2003) Regression using JMP. Wiley Interscience.
Much of statistics is concerned with relationships among variables and whether observed relationships
are real or simply due to chance. In particular, the simplest case deals with the relationship between two
variables.
Quantifying the relationship between two variables depends upon the scale of measurement of each of
the two variables. The following table summarizes some of the important analyses that are often performed
to investigate the relationship between two variables.
Type of variables              X is Interval or Ratio          X is Nominal or Ordinal
                               (what JMP calls Continuous)

Y is Interval or Ratio         Scatterplots                    Side-by-side dot plot
(what JMP calls Continuous)    Running median/spline fit       Side-by-side box plot
                               Regression                      ANOVA or t-tests
                               Correlation

Y is Nominal or Ordinal        Logistic regression             Mosaic chart
                                                               Contingency tables
                                                               Chi-square tests
In JMP these combinations of two variables are obtained by the Analyze->Fit Y-by-X platform, the
Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.
When analyzing two variables, one question becomes important as it determines the type of analysis that
will be done. Is the purpose to explore the nature of the relationship, or is the purpose to use one variable
to explain variation in another variable? For example, there is a difference between examining height and
weight to see if there is a strong relationship, as opposed to using height to predict weight.
Consequently, you need to distinguish between a correlational analysis in which only the strength of the
relationship will be described, or regression where one variable will be used to predict the values of a second
variable.
The two variables are often called either a response variable or an explanatory variable. A response
variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory
variable (also known as an independent or X variable) is the variable that attempts to explain the observed
outcomes.
15.2 Graphical displays
15.2.1 Scatterplots
The scatter-plot is the primary graphical tool used when exploring the relationship between two interval or
ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X platform be sure that both
variables have a continuous scale.
In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis)
and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly
clear which is the response and which is the explanatory variable. If there is no distinction between the two
variables, then it doesn't matter which variable is plotted on which axis; this usually only happens when
finding the correlation between variables is the primary purpose.
For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP.
[We will create the graph in class at this point.]
What to look for in a scatter-plot
Overall pattern. - What is the direction of association? A positive association occurs when above-average
values of one variable tend to be associated with above-average variables of another. The plot will
have an upward slope. A negative association occurs when above-average values of one variable are
associated with below-average values of another variable. The plot will have a downward slope. What
happens when there is no association between the two variables?
Form of the relationship. Does a straight line seem to fit through the middle of the points? Is the relationship
linear (the points seem to cluster around a straight line) or curvi-linear (the points seem to form
a curve)?
Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter
above and below the trend line, then the association is not very strong. On the other hand, if the
amount of scatter above and below the trend line is very small, then there is a strong association.
Outliers Are there any points that seem to be unusual? Outliers are values that are unusually far from the
trend curve - i.e., they are further away from the trend curve than you would expect from the usual
level of scatter. There is no formal rule for detecting outliers - use common sense. [If you set the
role of a variable to be a label, and click on points in a linked graph, the label for the point will be
displayed making it easy to identify such points.]
One's usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every
effort should be made to trace the data back to its original source and correct the value if possible. If
the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point
in even though it doesn't follow the trend line, or do you drop the data point because it appears to be
anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an
outlier; if there is very little difference in the final outcome, don't worry about it.
In some cases, the outliers are the most interesting part of the data. For example, for many years the
ozone hole in the Antarctic was missed because the computers were programmed to ignore readings
that were so low that they must be in error!
Lurking variables. A lurking variable is a third variable that is related to both variables and may confound
the association.
For example, the amount of chocolate consumed in Canada and the number of automobile accidents
are positively related, but most people would agree that this is coincidental and each variable is inde-
pendently driven by population growth.
Sometimes the lurking variable is a grouping variable of sorts. This is often examined by using
a different plotting symbol to distinguish between the values of the third variables. For example,
consider the following plot of the relationship between salary and years of experience for nurses.
The individual lines show a positive relationship, but the overall pattern, when the data are pooled,
shows a negative relationship.
It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points.
From the Row menu, use Where to select rows. Then assign those rows using the Rows->Markers
menu.
15.2.2 Smoothers
Once the scatter-plot is plotted, it is natural to try and summarize the underlying trend line. For example,
consider the following data:
There are several common methods available to t a line through this data.
By eye The eye has remarkable power for providing a reasonable approximation to an underlying trend,
but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between
the individual data points and the underlying trend line (technically called residuals) are small. As well, a
good trend curve tries to minimize the total of the residuals. And the trend line should try and go through
the middle of most of the data.
Although the eye often gives a good t, different people will draw slightly different trend curves. Several
automated ways to derive trend curves are in common use - bear in mind that the best ways of estimating
trend curves will try and mimic what the eye does so well.
Median or mean trace The idea is very simple. We choose a window width of size w, say. For
each point along the bottom (X) axis, the smoothed value is the median or average of the Y -values for
all data points with X-values lying within the window centered on this point. The trend curve is then
the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the
wider the window chosen the smoother the result. However, wider windows make the smoother react more
slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by
hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see
below).
The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the
simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs.
(Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize
a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help
you to see the trend.
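Although JMP cannot compute the trace, the idea is simple enough to sketch directly (Python; the window width and toy data are ours, chosen for illustration):

```python
from statistics import median

def median_trace(x, y, window):
    """For each x-value, return the median of all y whose x lies within
    a window of width `window` centered on that x."""
    pairs = sorted(zip(x, y))
    trace = []
    for x0, _ in pairs:
        in_window = [yi for xi, yi in pairs if abs(xi - x0) <= window / 2]
        trace.append(median(in_window))
    return [x0 for x0, _ in pairs], trace

# Toy example: an exactly linear trend
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
grid, smoothed = median_trace(xs, ys, window=2.0)
print(smoothed)
```

Note how the trace pulls the endpoints inward (3 and 11 instead of 2 and 12); the same averaging effect is what flattens peaks and troughs, as discussed above.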
Box plots for strips The following gives a conceptually simple method which is useful for exploring a
weak relationship in a large data set. The X-axis is divided into equal-sized intervals. Then separate box
plots of the values of Y are found for each strip. The box-plots are plotted side-by-side and the means or
median are joined. Again, we are able to see what is happening to the variability as well as the trend. There
is even more detailed information available in the box plots about the shape of the Y -distribution etc. Again,
this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that
groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these
groupings. This is illustrated below:
Spline methods A spline is a series of short smooth curves that are joined together to create a larger
smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline
indicates how straight the resulting curve will be. The following shows two spline fits to the same data with
different stiffness measures:
15.3 Correlation
WARNING!: Correlation is probably the most abused concept in statistics. Many people use the
word correlation to mean any type of association between two variables, but it has a very strict technical
meaning: the strength of the apparent linear relationship between two interval or ratio scaled
variables.
The correlation measure does not distinguish between explanatory and response variables and it treats the
two variables symmetrically. This means that the correlation between Y and X is the same as the correlation
between X and Y.
Correlations are computed in JMP using the Analyze->Correlation of Ys platform. If there are several
variables, then the data will be organized into a table. Each cell in the table shows the correlation of the
two corresponding variables. Because of symmetry (the correlation between variable1 and variable2 is the
same as between variable2 and variable1), only part of the complete matrix will be shown. As well, the
correlation between any variable and itself is always 1.
15.3.1 Scatter-plot matrix
To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP.
This is a dataset on 31 people at a fitness centre; the following variables were measured on each subject:
name
gender
age
weight
oxygen consumption (high values are typically more t people)
time to run one mile (1.6 km)
average pulse rate during the run
the resting pulse rate
maximum pulse rate during the run.
We are interested in examining the relationships among the variables. At the moment, ignore the fact
that the data contain both genders. [It would be interesting to assign different plotting symbols to the two
genders to see if gender is a lurking variable.]
One of the first things to do is to create a scatter-plot matrix of all the variables. Use the
Analyze->Correlation of Ys platform to get the following scatter-plot:
Interpreting the scatter plot matrix
The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1
column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis
and oxygen consumption along the horizontal axis, while the entry in row 3 column 1 has age along the
horizontal axis and oxygen consumption along the vertical axis.
There is clearly a difference in the strength of relationships. Compare the scatter plot for average
running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse
rate (row 5 column 6) to that of running pulse rate and weight (row 5 column 2).
Similarly, there is a difference in the direction of association. Compare the scatter plot for the average
running pulse rate and maximum pulse rate (row 5 column 7) and that for oxygen consumption and running
time (row 3, column 4).
15.3.2 Correlation coefficient
It is possible to quantify the strength of association between two variables. As with all statistics, the way the
data are collected influences the meaning of the statistics.
The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and
is computed as:

ρ = (1/N) Σ_{i=1}^{N} [(X_i - μ_X)/σ_x] [(Y_i - μ_Y)/σ_y]

The corresponding sample correlation coefficient, denoted r, has a similar form:[1]

r = (1/(n-1)) Σ_{i=1}^{n} [(X_i - X̄)/s_x] [(Y_i - Ȳ)/s_y]
If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate
of ρ. This is a crucial assumption. If the sampling is not a simple random sample, the above
definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval
for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in
ecological research and so will not be pursued further in this course.
The form of the formula does provide some insight into interpreting its value.
- ρ and r (unlike other population parameters and statistics) are unitless measures.

- The sign of ρ and r is largely determined by the pairing of each (X, Y) value with the respective means, i.e. if both X and Y are above their means, or both X and Y are below their means, the pair contributes a positive value towards ρ or r, while if X is above and Y is below (or X is below and Y is above) their respective means, the pair contributes a negative value towards ρ or r.
- ρ and r range from -1 to 1. A value of ρ or r equal to -1 implies a perfect negative correlation; a value of ρ or r equal to 1 implies a perfect positive correlation; a value of ρ or r equal to 0 implies no correlation. A perfect correlation (i.e. ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted; give some examples.

Footnote 1: Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.
- ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

- ρ and r only measure the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
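These last two properties are easy to verify numerically. A small Python sketch (made-up data; the variable names are ours) checks that r is unchanged by a unit conversion and equals 1 for any perfect positive linear relation, whatever the slope:

```python
import math

def corr(x, y):
    # sample correlation coefficient
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

age   = [23, 35, 41, 29, 52, 44]          # hypothetical subjects
wt_lb = [150, 180, 200, 165, 210, 190]    # weight in pounds
wt_kg = [w * 0.45359237 for w in wt_lb]   # the same weights in kilograms

# a linear (unit-change) transformation leaves r untouched
print(corr(age, wt_lb) - corr(age, wt_kg))  # essentially zero

# a perfect linear relation gives r = 1 whether the slope is steep or shallow
steep   = [10 * a + 5 for a in age]
shallow = [0.1 * a + 5 for a in age]
print(corr(age, steep), corr(age, shallow))  # both essentially 1
```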
Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute
the correlation
- between gender and oxygen consumption (gender is nominal scale data);

- between variables that are non-linearly related (not shown on graph);

- for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.
The data collection scheme for the fitness data set is unknown; we will have to assume that some sort
of random sample from the relevant population was taken before we can make much sense of the numbers
computed.
Before looking at the details of its computation, look at the sample correlation coefficients for each
scatter plot above. These can be arranged into a matrix:
Variable Age Weight Oxy Runtime RunPulse RstPulse MaxPulse
Age 1.00 -0.24 -0.31 0.19 -0.31 -0.15 -0.41
Weight -0.24 1.00 -0.16 0.14 0.18 0.04 0.24
Oxy -0.31 -0.16 1.00 -0.86 -0.39 -0.39 -0.23
Runtime 0.19 0.14 -0.86 1.00 0.31 0.45 0.22
RunPulse -0.31 0.18 -0.39 0.31 1.00 0.35 0.92
RstPulse -0.15 0.04 -0.39 0.45 0.35 1.00 0.30
MaxPulse -0.41 0.24 -0.23 0.22 0.92 0.30 1.00
Notice that the sample correlation between any two variables is the same regardless of the ordering of the
variables; this explains the symmetry in the matrix between the above- and below-diagonal elements. As
well, each variable has a perfect sample correlation with itself; this explains the value of 1 along the main
diagonal.
Compare the sample correlations between the average running pulse rate and the other variables to the
corresponding scatter-plots above.
15.3.3 Cautions
Random Sampling Required. Sample correlation coefficients are only valid under simple random
samples. If the data were collected in a haphazard fashion, or if certain data points were oversampled,
then the correlation coefficient may be severely biased.
There are examples of high correlation but no practical use and low correlation but great practical use.
These will be presented in class. This illustrates why I almost never talk about correlation.
correlation measures the strength of a linear relationship; a curvilinear relationship may have a correlation
of 0, yet there can still be a strong (non-linear) relationship.
the effect of outliers and high leverage points will be presented in class
effects of lurking variables. For example, suppose there is a positive association between the wages of
male nurses and years of experience, and between the wages of female nurses and years of experience,
but males are generally paid more than females. There can be a positive correlation within each group,
but an overall negative correlation when the data are pooled together.
ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation
between two variables on their averages, it does not imply that there is a correlation between individual
data values.
For example, if you look at the average consumption of alcohol and the consumption of cigarettes,
there is a high correlation among the averages when the 12 values from the provinces and territories
are plotted on a graph. However, the individual relationships within provinces can be reversed or
non-existent as shown below:
The relationship between cigarette consumption and alcohol consumption shows no relationship for
each province, yet there is a strong correlation among the per-capita averages. This is an example of
the ecological fallacy.
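A small simulation makes the fallacy concrete. In the Python sketch below (entirely synthetic data; four hypothetical "provinces"), the two consumption measures are generated independently within each province, so the individual-level correlation is near 0, yet the provincial averages are almost perfectly correlated:

```python
import math
import random

def corr(x, y):
    # sample correlation coefficient
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
group_means, within_corrs = [], []
for level in [1.0, 2.0, 3.0, 4.0]:            # four synthetic "provinces"
    # within a province, the two consumptions are generated INDEPENDENTLY
    alc = [level + random.gauss(0, 0.2) for _ in range(500)]
    cig = [level + random.gauss(0, 0.2) for _ in range(500)]
    within_corrs.append(corr(alc, cig))
    group_means.append((sum(alc) / 500, sum(cig) / 500))

mean_r = corr([m[0] for m in group_means], [m[1] for m in group_means])
print([round(w, 2) for w in within_corrs])   # each near 0: no individual relation
print(round(mean_r, 3))                      # near 1: averages highly correlated
```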
correlation does not imply causation. This is the most frequent mistake made by people. There is a
set of principles of causal inference that need to be satisfied in order to imply cause and effect.
15.3.4 Principles of Causation
Types of association
An association may be found between two variables for several reasons (show causal modeling figures):
There may be direct causation, e.g. smoking causes lung cancer.
There may be a common cause, e.g. ice cream sales and number of drownings both increase with
temperature.
There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were
reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people drove
fewer miles.
There may be a coincidence, e.g., the population of Canada has increased at the same time as the
moon has gotten closer by a few miles.
Establishing cause-and-effect.
How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of
Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by
many epidemiological researchers. It is generally agreed that most or all of the following must be considered
before causation can be declared.
Strength of the association. The stronger an observed association appears over a series of different studies,
the less likely it is that the association is spurious because of bias.

Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or
level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to
establish this time pattern will depend upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar
findings. Of course, studies dealing with a given question may all have serious bias problems that can
diminish the importance of observed associations.

Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current
biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to
explain certain findings.

Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome
variable being studied.

Specificity of the association. The observed effect is associated with only the suspected cause (or few other
causes that can be ruled out).
IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!
Examples:
Discuss the above in relation to:
amount of studying vs. grades in a course.
amount of clear cutting and sediments in water.
fossil fuel burning and the greenhouse effect.
15.4 Single-variable regression
15.4.1 Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in
ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression
analysis.
There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of regres-
sion analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general statistical
methodology called General Linear Models which in turn are special cases of Generalized Linear Models
(covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which in turn are
special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is nominally
scaled in ANOVA, while in regression analysis the X variable is continuously scaled. This implies that in
ANOVA the shape of the response profile was unspecified (the null hypothesis was that all means were
equal, while the alternative was that at least one mean differs), while in regression the response profile must
be a straight line.

Because both ANOVA and regression are from the same class of statistical models, many of the assumptions
are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.
15.4.2 Equation for a line - getting notation straight (no pun intended)
In order to use regression analysis effectively, it is important that you understand the concepts of slopes and
intercepts and how to determine these from data values.
This will be QUICKLY reviewed here in class.
In previous courses at high school or in linear algebra, the equation of a straight line was often written
y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors
decided to write the equation of a line as y = a + bx; now a is the intercept and b is the slope. Statisticians,
for good reasons, have rationalized this notation and usually write the equation of a line as y = β₀ + β₁x or
as Y = b₀ + b₁X (the distinction between β₀ and b₀ will be made clearer in a few minutes). The use of the
subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then
readily extends to more complex cases.
Review the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per
unit change in X.
15.4.3 Populations and samples
All of statistics is about detecting signals in the face of noise and in estimating population parameters from
samples. Regression is no different.
First consider the population. As in previous chapters, the correct definition of the population is
important as part of any study. Conceptually, we can think of the large set of all units of interest. On
each unit there are, conceptually, both an X and a Y variable present. We wish to summarize the relationship
between Y and X, and furthermore wish to make predictions of the Y value for future X values that may
be observed from this population. [This is analogous to having different treatment groups corresponding to
different values of X in ANOVA.]
If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT.
However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot
of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight
line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is
analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]
We denote this relationship as

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

where now β₀ and β₁ are the POPULATION intercept and slope respectively. We say that

\[ E[Y] = \beta_0 + \beta_1 X \]

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean;
here in regression we assume that the means must fit on a straight line.]
The term ε represents random variation of individual units in the population above and below the expected
value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data
points in the population is constant over the entire regression line). [This is analogous to the assumption of
equal treatment population standard deviations in ANOVA.]
Of course, we can never measure all units of the population, so a sample must be taken in order to
estimate the population slope, population intercept, and population standard deviation. Unlike a correlation
analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate
schemes can be used. The bare minimum that must be achieved is that, for any individual X value found in
the sample, the units in the population that share this X value must have been selected at random.
This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the
extremes and then, only at those X values, randomly select from the relevant subset of the population, rather
than having to select at random from the population as a whole. [This is analogous to the assumptions made
in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit
(e.g. we can't assign sex to an animal), we must ensure that animals are randomly selected from each group.]
Once the data points are selected, the estimation process can proceed, but not before assessing the
assumptions!
15.4.4 Assumptions
The assumptions for a regression analysis are very similar to those found in ANOVA.
Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot between
Y and X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some
caution is required with transformations in dealing with the error structure, as you will see in later examples.

Plot the residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a
quadratic curve), this usually indicates that the relationship between Y and X is not linear. Or, fit a model
that includes X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could
fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test
of goodness-of-fit can be performed where the variation of the responses at the same X value is compared
to the variation around the regression line.
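The residual-plot diagnostic can be demonstrated with a deliberately non-linear, noise-free data set. In the Python sketch below (synthetic data), a straight line fitted to a purely quadratic response has a slope of essentially zero, yet the residuals trace out the missed parabola; the pattern, not the slope, reveals the lack of linearity:

```python
def fit_line(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [i / 10 for i in range(-20, 21)]     # symmetric design points on [-2, 2]
y = [xi ** 2 for xi in x]                # purely quadratic response, no noise

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# the fitted slope is essentially 0 (by symmetry), so a slope test sees nothing,
# yet the residuals form a perfect parabola in X: the tell-tale pattern
print(round(b1, 6))
print(round(min(resid), 3), round(max(resid), 3))
```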
Correct scale of predictor and response
The response and predictor variables must both have interval or ratio scale. In particular, using a numerical
value to represent a category and then using this numerical value in a regression is not valid. For example,
suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a
regression either as predictor variable or as a response variable is not sensible.
Correct sampling scheme
The Y values must be a random sample from the population of Y values for every X value in the sample.
Fortunately, it is not necessary to have a completely random sample from the population, as the regression
line is valid even if the X values are deliberately chosen. However, for a given X, the values from the
population must be a simple random sample.
No outliers or influential points
All the points must belong to the relationship; there should be no unusual points. The scatter-plot of Y vs.
X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a
difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single
point is an outlier and an influential point:
Equal variation along the line
The variability about the regression line should be similar for all values of X, i.e. the scatter of the points
above and below the fitted line should be roughly constant over the entire line. This is assessed by looking
at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and
no decrease in spread over the entire line.
Independence
Each value of Y is independent of any other value of Y. The most common case where this fails is
time-series data, where X is a time measurement. In these cases, time-series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors
The difference between the value of Y and the expected value of Y is assumed to be normally distributed.
This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of
Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring
the X's. The assumption only states that the residuals, the differences between the values of Y and the
corresponding points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small
sample sizes you have little power to detect non-normality, and for large sample sizes it is not that
important.
X measured without error
This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was
always exact, i.e. the treatment applied to an experimental unit was known without ambiguity. However,
in regression, it can turn out that the X value may not be known exactly.
This general problem is called the error in variables problem and has a long history in statistics.
It turns out that there are two important cases. If the value reported for X is a nominal value and the
actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is
called the Berkson case, after Berkson, who first examined this situation. The most common cases are where
the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs
varies randomly around this target value.
However, if the value used for X is an actual measurement of the true underlying X, then there is
uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero
(i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates
are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population
values! For example, suppose that the yield of a crop is related to the amount of rainfall. A rain gauge may
not be located exactly at the plot where the crop is grown; the rainfall may be recorded at a nearby weather
station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the
test plot.
This latter case of error in variables is very difficult to analyze properly and there are no universally
accepted solutions. Refer to the reference books listed at the start of this chapter for more details.
The problem is set up as follows. Let

\[ Y_i = \eta_i + \epsilon_i \qquad X_i = \xi_i + \delta_i \]
with the straight-line relationship between the true (but unobserved) values:

\[ \eta_i = \beta_0 + \beta_1 \xi_i \]

Note that the (true, but unknown) regression equation uses ξᵢ rather than the observed (with error) values Xᵢ.
Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression
equation reduces to:

\[ Y_i = \beta_0 + \beta_1 X_i + (\epsilon_i - \beta_1 \delta_i) \]

Now this violates the independence assumption of ordinary least squares because the new error term is not
independent of the Xᵢ variable.
If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90), with

\[ E[\hat{\beta}_1] = \beta_1 - \beta_1\,\frac{r(\rho + r)}{1 + 2\rho r + r^2} \]

where ρ is the correlation between ε and δ, and r is the ratio of the variance of the error in X to the error in
Y.
The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is
known as attenuation of the estimate and, in general, pulls the estimate towards zero.
The bias will be small in the following cases:

- the error variance of X is small relative to the error variance in Y. This means that r is small (i.e.
close to zero), and so the bias is also small. In the case where X is measured without error, then r = 0
and the bias vanishes as expected.

- if the X are fixed (the Berkson case) and actually used², then ρ + r = 0 and the bias also vanishes.
The proper analysis of the error-in-variables case is quite complex; see Draper and Smith (1998, p. 91)
for more details.
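Attenuation is easy to see by simulation. The Python sketch below (entirely synthetic data) uses the simplest textbook setting where the errors ε and δ are independent; in that case the large-sample slope is β₁ multiplied by the classical reliability factor Var(ξ)/(Var(ξ) + Var(δ)), here 1/(1 + 1) = 0.5:

```python
import random

def ls_slope(x, y):
    # ordinary least-squares slope
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

random.seed(42)
beta0, beta1, n = 2.0, 1.5, 20000
xi   = [random.gauss(0, 1) for _ in range(n)]                  # true X, variance 1
y    = [beta0 + beta1 * t + random.gauss(0, 0.5) for t in xi]  # response with noise
xobs = [t + random.gauss(0, 1) for t in xi]                    # X measured with error, variance 1

print(round(ls_slope(xi, y), 2))    # ~1.5 : regressing on the true X recovers beta1
print(round(ls_slope(xobs, y), 2))  # ~0.75: attenuated toward zero by factor 0.5
```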
15.4.5 Obtaining Estimates
To distinguish between population parameters and sample estimates, we denote the sample intercept by b₀
and the sample slope by b₁. The equation for a particular sample of points is expressed as

\[ \hat{Y}_i = b_0 + b_1 X_i \]

where b₀ is the estimated intercept and b₁ is the estimated slope. The symbol Ŷ indicates that we are
referring to the estimated line and not to a line in the entire population.

Footnote 2: For example, a thermostat measures (with error) the actual temperature of a room. But if the
experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds
to the Berkson case.
How is the best-fitting line found when the points are scattered? We typically use the principle of least
squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data
points from the line in the vertical direction as small as possible.

Mathematically, the least-squares line is the line that minimizes

\[ \frac{1}{n}\sum \left(Y_i - \hat{Y}_i\right)^2 \]

where Ŷᵢ is the point on the line corresponding to each X value. This is also known as the predicted value
of Y for a given value of X. This formal definition of least squares is not that important; the concept as
expressed in the previous paragraph is more important. In particular, it is the SQUARED deviation in the
VERTICAL direction that is used.
It is possible to write out a formula for the estimated intercept and slope, but who cares - let the computer
do the dirty work.
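Still, it is worth seeing once that the closed-form estimates really do minimize the sum of squared vertical deviations. A Python sketch (hypothetical data; the function names are ours) compares the least-squares line against nearby perturbed lines:

```python
def least_squares(x, y):
    """Closed-form least-squares estimates of intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1

def sse(x, y, b0, b1):
    # sum of SQUARED deviations in the VERTICAL direction
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5, 6]                     # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
b0, b1 = least_squares(x, y)

best = sse(x, y, b0, b1)
# every perturbed line does at least as badly as the least-squares line
for db0 in (-0.5, 0, 0.5):
    for db1 in (-0.2, 0, 0.2):
        assert sse(x, y, b0 + db0, b1 + db1) >= best
print(round(b0, 3), round(b1, 3))
```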
The estimated intercept (b₀) is the estimated value of Y when X = 0. In some cases, it is meaningless
to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs.
year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of
the intercept, and it merely serves as a placeholder for the line.
The estimated slope (b₁) is the estimated change in Y per unit change in X. For every unit change in the
horizontal direction, the fitted line increases by b₁ units. If b₁ is negative, the fitted line points downwards,
and the "increase" in the line is negative, i.e., actually a decrease.
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of
each of the estimates. Again, there are computational formulae but, in this age of computers, these are not
important. As before, approximate 95% confidence intervals for the corresponding population parameters
are found as estimate ± 2(se).
Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter, as
this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no
relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally,
the null hypothesis is:

\[ H\colon \beta_1 = 0 \]

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a
sample statistic.

The alternate hypothesis is typically chosen as:

\[ A\colon \beta_1 \neq 0 \]

although one-sided tests looking for either a positive or negative slope are possible.
The test statistic is found as

\[ T = \frac{b_1 - 0}{se(b_1)} \]

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is
usually done automatically by most computer packages. The p-value is interpreted in exactly the same way
as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship
were true.

As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must
be determined and assessed.
15.4.6 Obtaining Predictions
Once the best-fitting line is found, it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them,
as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a particular
X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a
particular X.³ The prediction interval for an individual response is sometimes called a "confidence interval
for an individual response", but this is an unfortunate (and incorrect) use of the term confidence interval.
Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction
intervals are computed for future random variables.
Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner: substitute the new value of X into the equation
and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new
dummy observation in the dataset with the value of Y missing but the value of X present. The missing
Y value prevents this new observation from being used in the fitting process, but the X value allows the
package to compute an estimate for this observation.
What differs between the two predictions is the estimate of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the
uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional
uncertainty that the value could be above or below the predicted line. This interval is often called a
prediction interval at a new X.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This
interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval
for the mean of all future responses because it must account for the uncertainty from the fitted line plus
individual variation around the fitted line.
Many textbooks have the formulae for the se of the two types of predictions but, again, there is little to
be gained by examining them. What is important is that you read the documentation carefully to ensure that
you understand exactly what interval is being given to you.

Footnote 3: There is actually a third interval, for the mean of the next m individual values, but this is rarely
encountered in practice.
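For a simple linear regression, the usual formulas (quoted here from standard regression references, not derived in these notes) for the two standard errors at a new value X₀ are s·sqrt(1/n + (X₀ − X̄)²/Sxx) for the mean and s·sqrt(1 + 1/n + (X₀ − X̄)²/Sxx) for an individual response; the extra 1 under the root is the individual variation about the line. A Python sketch using the fertilizer data of Section 15.4.8 at X₀ = 16:

```python
import math

# fertilizer data from the example in Section 15.4.8
x = [12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18]
y = [24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))          # residual standard deviation, n-2 df

x0 = 16                               # new fertilizer amount of interest
yhat = b0 + b1 * x0                   # same point estimate for both intervals (~30.48 L)
se_mean  = s * math.sqrt(1/n + (x0 - xbar) ** 2 / sxx)      # for the mean at x0
se_indiv = s * math.sqrt(1 + 1/n + (x0 - xbar) ** 2 / sxx)  # for an individual at x0
print(round(yhat, 2), round(se_mean, 3), round(se_indiv, 3))
```

The point estimate is identical for both intervals; only the standard errors, and hence the interval widths, differ, with the individual (prediction) interval much wider.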
15.4.7 Residual Plots
After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using
residuals. The residual for a point is the difference between the observed value and the predicted value, i.e.,
the residual from fitting a straight line is found as:

\[ \text{residual}_i = Y_i - (b_0 + b_1 X_i) = (Y_i - \hat{Y}_i) \]
There are several standard residual plots:

- plot of residuals vs. predicted values (Ŷ);
- plot of residuals vs. X;
- plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't
plot residuals vs. Y; this will lead to odd-looking plots which are an artifact of the plot and don't mean
anything.
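When computing residuals by hand, two algebraic identities of least squares with an intercept provide a quick correctness check: the residuals sum to zero and are uncorrelated with X, so any visible trend in a residual plot reflects lack of fit, not arithmetic. A Python sketch using the fertilizer data of Section 15.4.8:

```python
x = [12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18]   # fertilizer (kg/ha)
y = [24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29]  # yield (L)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# algebraic consequences of least squares with an intercept term:
assert abs(sum(resid)) < 1e-9                            # residuals sum to zero
assert abs(sum(r * xi for r, xi in zip(resid, x))) < 1e-9  # orthogonal to X
print("residual identities hold")
```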
15.4.8 Example - Yield and fertilizer
We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An
experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of
fertilizer was varied and the yield measured at the end of the season.
The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While
the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest
values), they represent commonly used amounts based on a preliminary survey of producers. At the end of
the experiment, the yields were measured and the following data were obtained.
Interest also lies in predicting the yield when 16 kg/ha are assigned.
Fertilizer Yield
(kg/ha) (Liters)
12 24
5 18
15 31
17 33
14 30
6 20
11 25
13 27
15 31
8 21
18 29
The data are available in the fertilizer.csv file in the Sample Program Library at http://www.stat.
sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual fashion:
data tomato;
   infile 'fertilizer.csv' dlm=',' dsd missover firstobs=2;
   input fertilizer yield;
run;
Note that both variables are numeric (SAS doesn't have the concept of scale of variables) and that an
extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing. The
ordering of the rows is NOT important; however, it is often easier to find individual data points if the data
are sorted by the X value and the rows for future predictions are placed at the end of the dataset.
The raw data are shown below:
Obs fertilizer yield
1 5 18
2 6 20
3 8 21
4 11 25
5 12 24
6 13 27
Obs fertilizer yield
7 14 30
8 15 31
9 15 31
10 17 33
11 18 29
12 16 .
In this study, it is quite clear that fertilizer is the predictor (X) variable, while the response variable
(Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under
all possible fertilizer levels between about 5 and 18 kg/ha.
If all of the population could be measured (which it can't), you could find the relationship between the yield
and the amount of fertilizer applied. This relationship would have the form:

\[ Y = \beta_0 + \beta_1(\text{amount of fertilizer}) + \varepsilon \]

where β₀ and β₁ represent the true population intercept and slope respectively. The term ε represents
random variation that is always present, i.e. even if the same plot was grown twice in a row with the same
amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β₀, the true average yield when the amount of fertilizer
is 0, and β₁, the true average change in yield per unit change in the amount of fertilizer. These are taken
over all plants in all possible field plots of this type. The values of β₀ and β₁ are impossible to obtain, as the
entire population could never be measured.
The ordering of the rows in the data table is NOT important; however, it is often easier to find individual
data points if the data are sorted by the X value and the rows for future predictions are placed at the end of
the dataset. Notice how missing values are represented.
Start by plotting the data using Proc SGplot:
proc sgplot data=tomato;
   title2 'Preliminary data plot';
   scatter y=yield x=fertilizer / markerattrs=(symbol=circlefilled);
   yaxis label='Yield'      offsetmin=.05 offsetmax=.05;
   xaxis label='Fertilizer' offsetmin=.05 offsetmax=.05;
run;
The relationship looks approximately linear; there don't appear to be any outlier or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
We use Proc Reg to fit the regression model:
ods graphics on;
proc reg data=tomato plots=all;
   title2 'fit the model';
   model yield = fertilizer / all;
   ods output OutputStatistics =Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;
The model statement is what tells SAS that the response variable is yield because it appears to the left of the equal sign, and that the predictor variable is fertilizer because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statements request that some output statistics are placed into a dataset called Predictions.
Part of the output includes the coefficients and their standard errors:

Variable     DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                  Estimate    Error                           CL Parameter   CL Parameter
Intercept     1   12.85602    1.69378      7.59     <.0001      9.02442       16.68763
fertilizer    1    1.10137    0.13175      8.36     <.0001      0.80333        1.39941
The estimated regression line is

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(amount of fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the estimated slope.
The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the value of Y when X = 1.
The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?
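The least-squares estimates can be reproduced from the usual summary formulas using the raw data in the listing above. A minimal sketch (written in Python purely for illustration; the course itself uses SAS):

```python
# Fertilizer example: reproduce the least-squares estimates by hand.
x = [5, 6, 8, 11, 12, 13, 14, 15, 15, 17, 18]       # fertilizer (kg/ha)
y = [18, 20, 21, 25, 24, 27, 30, 31, 31, 33, 29]    # yield (L)

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx           # estimated slope
b0 = ybar - b1 * xbar    # estimated intercept
print(round(b0, 3), round(b1, 5))   # 12.856 1.10137
```

The printed values agree with the SAS parameter estimates to the displayed precision.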
Once again, these are the results from a single experiment. If another experiment was repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
The formulae for the standard errors of b0 and b1 are messy and hopeless to compute by hand. And just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates.
The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.
Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 to 1.365) L/kg

of fertilizer applied.
The exact confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between (.837 to 1.365) L/kg.
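The approximate and exact intervals can be checked directly from the reported estimate and standard error. A sketch (Python for illustration; the t critical value 2.262 for 9 degrees of freedom is taken as given from standard tables):

```python
# Approximate vs exact 95% CI for the slope (fertilizer example).
b1, se_b1 = 1.10137, 0.13175          # from the regression output
approx = (b1 - 2 * se_b1, b1 + 2 * se_b1)

t_crit = 2.262                        # t(0.975) with df = n - 2 = 9
exact = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print([round(v, 3) for v in approx])  # [0.838, 1.365]
print([round(v, 3) for v in exact])   # [0.803, 1.399]
```

The exact interval matches the Lower/Upper 95% CL columns of the SAS output.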
Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
In linear regression problems, one hypothesis of interest is if the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.
The test of hypothesis about the intercept is not of interest (why?).
Let
β1 be the true (unknown) slope.
b1 be the estimated slope. In this case b1 = 1.1014.
The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.
1. Specify the null and alternate hypothesis:
H: β1 = 0
A: β1 ≠ 0.
Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.
2. Find the test statistic and the p-value. The test statistic is computed as:

T = (estimate - hypothesized value) / (estimated se) = (1.1014 - 0) / .132 = 8.36

In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
3. Conclusion. There is strong evidence that the true slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the true slope are from about .8 to about 1.4.
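The test statistic itself is a one-line computation from the estimate and its standard error; a sketch for illustration:

```python
# Test statistic for H0: slope = 0 (fertilizer example).
b1, se_b1 = 1.10137, 0.13175   # from the regression output
hypothesized = 0.0

T = (b1 - hypothesized) / se_b1
print(round(T, 2))   # 8.36 -- compare to a t-distribution with 9 df
```

Replacing `hypothesized` with any other value gives the test statistic for a hypothesis about a non-zero slope, as described below.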
It is possible to construct tests of the slope equal to some value other than 0. Most packages can't do this. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.
It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the packages divided by 2.
If sufficient evidence is found against the hypothesis, a natural question to ask is "well, what values of the parameter are plausible given this data?". This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer is applied?
The predicted value is found by substituting the new X into the estimated regression line.

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L
Predictions for existing and new X values are obtained automatically with the All option on the model
statement and can be output to a dataset using the ODS statement and appropriate table name and then
merging it back with the original data:
data predictions;   /* merge the original data set with the predictions */
   merge tomato predictions;
run;
Here is a listing of the prediction dataset.

Obs  fertilizer  yield  Predicted  Std Err  Lower 95%  Upper 95%  Lower 95%   Upper 95%
                        Value      Mean     CL Mean    CL Mean    CL Predict  CL Predict
  1       5       18    18.3629    1.0901    15.8969    20.8288    13.6120     23.1138
  2       6       20    19.4643    0.9779    17.2521    21.6764    14.8400     24.0885
  3       8       21    21.6670    0.7723    19.9198    23.4141    17.2463     26.0877
  4      11       25    24.9711    0.5632    23.6971    26.2451    20.7151     29.2271
  5      12       24    26.0725    0.5418    24.8469    27.2981    21.8308     30.3142
  6      13       27    27.1738    0.5519    25.9254    28.4223    22.9255     31.4222
  7      14       30    28.2752    0.5919    26.9363    29.6142    23.9994     32.5511
  8      15       31    29.3766    0.6564    27.8918    30.8614    25.0529     33.7003
  9      15       31    29.3766    0.6564    27.8918    30.8614    25.0529     33.7003
 10      17       33    31.5793    0.8342    29.6922    33.4665    27.1015     36.0572
 11      18       29    32.6807    0.9384    30.5579    34.8035    28.0985     37.2629
 12      16        .    30.4780    0.7389    28.8064    32.1495    26.0866     34.8693
As noted earlier, there are two types of estimates of precision associated with predictions using the
regression line. It is important to distinguish between them as these two intervals are the source of much
confusion in regression problems.
First, the experimenter may be interested in predicting a single FUTURE individual value for a particular
X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.
Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added. The prediction interval for an individual response is sometimes called a confidence interval for an individual response but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
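The two standard errors differ only by an extra "1 +" term under the square root, which is why the prediction interval is always wider. A sketch reproducing both intervals at fertilizer = 16 (Python for illustration; the t critical value 2.262 for 9 df is taken as given):

```python
import math

# Confidence interval for the MEAN response vs prediction interval for a
# single FUTURE response at fertilizer = 16 (fertilizer example).
x = [5, 6, 8, 11, 12, 13, 14, 15, 15, 17, 18]
y = [18, 20, 21, 25, 24, 27, 30, 31, 31, 33, 29]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))      # residual standard deviation

x0 = 16
yhat = b0 + b1 * x0
se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)      # mean response
se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)  # single response
t_crit = 2.262                    # t(0.975), df = 9

ci_mean = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)
pi_ind = (yhat - t_crit * se_pred, yhat + t_crit * se_pred)
print(round(yhat, 2))             # 30.48
```

The resulting intervals agree with row 12 of the prediction listing above: roughly (28.8, 32.1) for the mean and (26.1, 34.9) for a single future plot.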
Both of these intervals and the tted line can be plotted using Proc SGplot:
proc sgplot data=Predictions;
   title2 'Fitted line and confidence curves for mean and individual values';
   band x=fertilizer lower=lowerCL upper=upperCL;
   band x=fertilizer lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue X=Fertilizer;
   scatter y=yield x=fertilizer / markerattrs=(symbol=circlefilled);
   yaxis label='Yield'      offsetmin=.05 offsetmax=.05;
   xaxis label='Fertilizer' offsetmin=.05 offsetmax=.05;
run;
Later versions of SAS also produce these plots using the ODS Graphics options:
The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.
Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.
Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced using the ODS Graphics option prior to invoking Proc Reg:
There is no evidence of any problems except perhaps for some excess leverage for the last observation as measured by Cook's D statistic.
The residuals are simply the difference between the actual data point and the corresponding spot on the line measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.
15.4.9 Example - Mercury pollution
Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.
Here are the raw data:
Methyl Mercury Mercury in
Intake whole blood
(ug Hg/day) (ng/g)
180 90
200 120
230 125
410 290
600 310
550 290
275 170
580 375
600 150
105 70
250 105
60 205
650 480
The data is available in the mercury.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual fashion:
data mercury;
   infile 'mercury.csv' dlm=',' dsd missover firstobs=2;
   input intake blood;
   plot_symbol=1;
   if intake = 60 then plot_symbol=2;   /* identify two potential outliers */
   if intake = 600 and blood < 200 then plot_symbol=2;
run;
Note that both variables are numeric (SAS doesn't have the concept of scale of variables). I create a new variable plot_symbol based on the analysis that follows to illustrate the presence of potential outliers. The raw data are shown below:
Obs intake blood plot_symbol
1 180 90 1
2 200 120 1
3 230 125 1
4 410 290 1
5 600 310 1
6 550 290 1
7 275 170 1
8 580 375 1
9 600 150 2
10 105 70 1
11 250 105 1
12 60 205 2
13 650 480 1
The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.
The population of interest is the people around recently flooded lakes.
This experiment is an analytical survey as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).
The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.
Start by plotting the data using Proc SGplot. Notice the use of the Proc Template procedure to assign different plotting symbols to the points depending on the value of the plot_symbol variable defined when the data was read.
proc template;
Define style styles.mystyle;
Parent=styles.default;
Style graphdata1 from graphdata1 / MarkerSymbol="CircleFilled" Color=black Contrastcolor=black;
Style graphdata2 from graphdata2 / MarkerSymbol="X" Color=black Contrastcolor=black;
end;
run;
ods html style=mystyle;
proc sgplot data=mercury;
   title2 'Preliminary data plot';
   scatter y=blood x=intake / group=plot_symbol markerattrs=(size=10px);
   yaxis label='Blood Mercury'  offsetmin=.05 offsetmax=.05;
   xaxis label='Intake Mercury' offsetmin=.05 offsetmax=.05;
run;
There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was first fit using all of the data.
We use Proc Reg to fit the regression model:
ods graphics on;
proc reg data=mercury plots=all;
   title2 'Fit the model to all of the data';
   model blood = intake / all;
   ods output OutputStatistics =Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;
The model statement is what tells SAS that the response variable is the blood mercury level because it appears to the left of the equal sign, and that the predictor variable is the food intake mercury level because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statement requests that some output statistics are placed into a dataset called Predictions.
Part of the output includes the coefficients and their standard errors:

Variable    DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                 Estimate    Error                           CL Parameter   CL Parameter
Intercept    1   50.443      47.771      1.06      0.31       -54.69        155.58
intake       1    0.4529      0.1154     3.92      0.002        0.199         0.707
Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced using the ODS Graphics option prior to invoking Proc Reg:
The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).
The data were rechecked and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.
We remove the outliers:
data mercury2;   /* delete the outliers */
   set predictions;
   if plot_symbol ^= 1 then delete;
   if residual > 100 then delete;
   keep blood intake;
run;
The revised raw data are shown below:
Obs intake blood
1 180 90
2 200 120
3 230 125
4 410 290
5 600 310
6 550 290
7 275 170
8 580 375
9 105 70
10 250 105
We use Proc Reg to again fit the data as shown previously. Part of the output includes the coefficients and their standard errors:

Variable    DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                 Estimate    Error                           CL Parameter   CL Parameter
Intercept    1   -1.95169    22.71513    -0.09     0.9336     -54.33288      50.42950
intake       1    0.58122     0.05983     9.71     <.0001       0.44325       0.71919
The estimated regression line (after removing outliers) is

Blood = -1.951691 + 0.581218(Intake).

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment. The negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this and what implications does this have for worrying that it is not zero?).[4]
What would the impact of the outliers upon the estimated slope and intercept have been if they had been retained?
The estimated slope has been determined relatively well (relative standard error of about 10%; how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
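The slope, its standard error, and the relative standard error (se(b1)/b1) can be checked from the cleaned data in the listing above; a sketch in Python for illustration:

```python
import math

# Mercury example (outliers removed): slope and its relative standard error.
intake = [180, 200, 230, 410, 600, 550, 275, 580, 105, 250]
blood  = [90, 120, 125, 290, 310, 290, 170, 375, 70, 105]

n = len(intake)
xbar = sum(intake) / n
ybar = sum(blood) / n
Sxx = sum((x - xbar) ** 2 for x in intake)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(intake, blood)) / Sxx
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(intake, blood))
se_b1 = math.sqrt(sse / (n - 2) / Sxx)

rel_se = se_b1 / b1     # relative standard error of the slope
print(round(b1, 5), round(se_b1, 5), round(rel_se, 3))   # 0.58122 0.05983 0.103
```

The relative standard error of about 10% is simply the standard error expressed as a fraction of the estimate.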
The two types of predictions would also be of interest in this study. First, an individual would like to
know the impact upon personal health. Secondly, the average level would be of interest to public health
authorities.
Predictions for existing and new X values are obtained automatically with the All option on the model
statement and can be output to a dataset using the ODS statement and appropriate table name and then
merging it back with the original data:
data predictions2;   /* merge the original data set with the predictions */
   merge mercury2 predictions;
run;
Here is a listing of the prediction dataset.

Obs  intake  blood  Predicted  Std Err  Lower 95%  Upper 95%  Lower 95%   Upper 95%
                    Value      Mean     CL Mean    CL Mean    CL Predict  CL Predict
  1    180     90   102.6676   14.0140   70.3511   134.9840    20.5945    184.7406
  2    200    120   114.2919   13.2364   83.7687   144.8151    32.9083    195.6756
  3    230    125   131.7285   12.1977  103.6004   159.8565    51.2125    212.2444
  4    410    290   236.3477   11.2067  210.5051   262.1903   156.6014    316.0940
  5    600    310   346.7791   18.7816  303.4687   390.0896   259.7881    433.7701
  6    550    290   317.7182   16.3680  279.9734   355.4630   233.3600    402.0764
  7    275    170   157.8833   11.0109  132.4921   183.2745    78.2821    237.4844
  8    580    375   335.1548   17.7951  294.1191   376.1904   249.2737    421.0358
  9    105     70    59.0762   17.3598   19.0443    99.1081   -26.3298    144.4822
 10    250    105   143.3528   11.6083  116.5840   170.1216    63.3016    223.4041
[4] It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.
and the two intervals can be plotted on the same graph in a similar fashion as in the Fertilizer example giving:
Later versions of SAS also produce these plots using the ODS Graphics options:
Residual plots (and other diagnostic plots to assess the fit of the model) are again produced using the ODS Graphics option prior to invoking Proc Reg, and now show no problems.
15.4.10 Example - The Anscombe Data Set
Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets gave exactly the same results when a regression line was fit, yet are quite different in their interpretation.
The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.
15.4.11 Transformations
In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0·length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.
There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption of the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1·trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale - in particular that the population standard deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other transform. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1(t + 1)
ln(Y_t) = b0 + b1·t

and

ln(Y_{t+1}) - ln(Y_t) = ln(Y_{t+1}/Y_t) = b1(t + 1 - t) = b1
exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1)

Hence a one unit increase in X causes Y to be MULTIPLIED by exp(b1). As an example, suppose that on the log-scale the estimated slope was -.07. Then every unit change in X causes Y to change by a multiplicative factor of exp(-.07) = .93, i.e. roughly a 7% decline per year.[5]
Similarly, predictions made on the transformed scale must be back-transformed to the untransformed scale.
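A small numeric sketch of both back-transformations (Python for illustration; the slope of -0.07 and the logged prediction of 2.60 are illustrative values):

```python
import math

# Back-transforming a slope fit on the ln scale: a slope of -0.07 means
# each unit change in X MULTIPLIES Y by exp(-0.07).
b1 = -0.07
factor = math.exp(b1)
print(round(factor, 3))   # 0.932 -> roughly a 7% decline per unit of X

# A prediction made on the ln scale must also be back-transformed:
log_pred = 2.60
print(round(math.exp(log_pred), 2))   # 13.46
```

Note that the back-transformed prediction estimates the MEDIAN, not the mean, of the original variable, a point that comes up again in the dioxin example below.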
In some problems, scientists search for the "best" transform. This is not an easy task and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.
JMP makes it particularly easy to fit regressions to transformed data as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.
15.4.12 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample.[6] The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
Here is the raw data.
[5] It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is -.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.
[6] Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site Year TEQ
a 1990 179.05
a 1991 82.39
a 1992 130.18
a 1993 97.06
a 1994 49.34
a 1995 57.05
a 1996 57.41
a 1997 29.94
a 1998 48.48
a 1999 49.67
a 2000 34.25
a 2001 59.28
a 2002 34.92
a 2003 28.16
The data is available in the dioxinTEQ.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual fashion:
data teq;
   infile 'dioxinTEQ.csv' dlm=',' dsd missover firstobs=2;
   input site $ year TEQ;
   logTEQ = log(TEQ);   /* compute the log TEQ values */
   attrib logTEQ label='log(TEQ)' format=7.2;
run;
Note that both variables are numeric (SAS doesn't have the concept of scale of variables) and that an extra row was added to the data table with the value of 2010 for the year and the TEQ left missing. The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. The value of ln(TEQ) is computed in the data step.
The raw data are shown below:
Obs site year TEQ log(TEQ)
1 a 1990 179.05 5.19
2 a 1991 82.39 4.41
3 a 1992 130.18 4.87
4 a 1993 97.06 4.58
5 a 1994 49.34 3.90
6 a 1995 57.05 4.04
7 a 1996 57.41 4.05
8 a 1997 29.94 3.40
9 a 1998 48.48 3.88
10 a 1999 49.67 3.91
11 a 2000 34.25 3.53
12 a 2001 59.28 4.08
13 a 2002 34.92 3.55
14 a 2003 28.16 3.34
15 a 2010 . .
As with all analyses, start with a preliminary plot of the data. We use Proc SGplot to get a scatterplot:
proc sgplot data=teq;
   title2 'Preliminary data plot';
   scatter y=TEQ x=year / markerattrs=(symbol=circlefilled);
   yaxis label='TEQ'  offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:

TEQ = C·r^t

where C is the initial concentration, r is the rate reduction per year, and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t·log(r)

which can be expressed as:

log(TEQ) = β0 + β1·t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
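That the log transform exactly linearizes a constant-fraction decline can be checked numerically; a sketch in Python (the values of C and r here are made up purely for illustration):

```python
import math

# Noise-free exponential decay TEQ = C * r**t is linear on the log scale,
# and the least-squares slope of log(TEQ) vs year recovers log(r) exactly.
C, r = 150.0, 0.90                      # illustrative values only
years = list(range(1990, 2004))
log_teq = [math.log(C * r ** (yr - 1990)) for yr in years]

n = len(years)
xbar = sum(years) / n
ybar = sum(log_teq) / n
Sxx = sum((x - xbar) ** 2 for x in years)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(years, log_teq)) / Sxx
print(round(b1, 6), round(math.log(r), 6))   # both print -0.105361
```

With real data the decline is not exactly constant, so the fitted slope estimates log(r) rather than recovering it exactly.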
The log(TEQ) was computed in the dataset seen earlier.
A plot of log(TEQ) vs. year gives the following:
The relationship looks approximately linear; there don't appear to be any outlier or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
We use Proc Reg to fit the regression model:
ods graphics on;
proc reg data=teq plots=all;
   title2 'fit the model';
   model logTEQ = year / all;
   ods output OutputStatistics =Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;
The model statement is what tells SAS that the response variable is logTEQ because it appears to the left of the equal sign, and that the predictor variable is year because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statement requests that some output statistics are placed into a dataset called Predictions.
Part of the output includes the coefficients and their standard errors:

Variable    DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                 Estimate    Error                           CL Parameter   CL Parameter
Intercept    1   218.91364   42.79187    5.12      0.0003     125.67816      312.14911
year         1    -0.10762    0.02143   -5.02      0.0003      -0.15432       -0.06092
The fitted line is:

log(TEQ) = 218.9 - .11(year).

The intercept (218.9) would be the log(TEQ) in the year 0 which is clearly nonsensical. The slope (-.11) is the estimated log(ratio) from one year to the next. For example, exp(-.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year. The standard error of the estimated slope is .02.
The 95% confidence interval for the slope is (-.154 to -.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between (0.86 to 0.94) of the TEQ in one year remains to the next year.
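The slope and its back-transformed interpretation can be reproduced from the raw data in the listing above; a sketch in Python for illustration:

```python
import math

# Dioxin example: fit log(TEQ) vs year and back-transform the slope.
year = list(range(1990, 2004))
TEQ = [179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
       29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16]
log_teq = [math.log(v) for v in TEQ]

n = len(year)
xbar = sum(year) / n
ybar = sum(log_teq) / n
Sxx = sum((x - xbar) ** 2 for x in year)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(year, log_teq)) / Sxx
print(round(b1, 5))              # -0.10762, matching the SAS output
print(round(math.exp(b1), 3))    # 0.898 -> about 89.8% of last year's TEQ remains

# Back-transform the 95% CI endpoints for the slope (from the output above):
print(round(math.exp(-0.15432), 2), round(math.exp(-0.06092), 2))  # 0.86 0.94
```

The anti-logged interval (0.86, 0.94) is the confidence interval for the year-to-year retention fraction quoted in the text.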
Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010?
Predictions for existing and new X values are obtained automatically with the All option on the model
statement and can be output to a dataset using the ODS statement and appropriate table name and then
merging it back with the original data:
data predictions;   /* merge the original data set with the predictions */
   merge teq predictions;
run;
Here is a listing of the prediction dataset.
Obs  year  TEQ     log(TEQ)  Predicted  Std Err       Lower 95%  Upper 95%  Lower 95%   Upper 95%
                             Value      Mean Predict  CL Mean    CL Mean    CL Predict  CL Predict
1 1990 179.05 5.19 4.7516 0.1639 4.3944 5.1088 3.9618 5.5413
2 1991 82.39 4.41 4.6440 0.1462 4.3255 4.9624 3.8710 5.4170
3 1992 130.18 4.87 4.5364 0.1295 4.2542 4.8185 3.7776 5.2951
4 1993 97.06 4.58 4.4287 0.1144 4.1794 4.6780 3.6815 5.1759
5 1994 49.34 3.90 4.3211 0.1017 4.0996 4.5426 3.5827 5.0595
6 1995 57.05 4.04 4.2135 0.0922 4.0126 4.4144 3.4810 4.9459
7 1996 57.41 4.05 4.1059 0.0871 3.9162 4.2956 3.3764 4.8353
8 1997 29.94 3.40 3.9983 0.0871 3.8086 4.1880 3.2688 4.7277
9 1998 48.48 3.88 3.8906 0.0922 3.6898 4.0915 3.1582 4.6231
10 1999 49.67 3.91 3.7830 0.1017 3.5615 4.0045 3.0446 4.5214
11 2000 34.25 3.53 3.6754 0.1144 3.4261 3.9247 2.9282 4.4226
12 2001 59.28 4.08 3.5678 0.1295 3.2856 3.8499 2.8090 4.3266
13 2002 34.92 3.55 3.4602 0.1462 3.1417 3.7786 2.6871 4.2332
14 2003 28.16 3.34 3.3525 0.1639 2.9954 3.7097 2.5628 4.1423
15 2010 . . 2.5992 0.3020 1.9413 3.2572 1.6353 3.5631
Both of these intervals and the fitted line can be plotted using Proc SGplot:
proc sgplot data=Predictions;
   title2 'Fitted line and confidence curves for mean and individual values';
   band x=Year lower=lowerCL upper=upperCL;
   band x=Year lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue X=Year;
   scatter y=logTEQ x=Year / markerattrs=(symbol=circlefilled);
   yaxis label='logTEQ' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;
Later versions of SAS also produce these plots using the ODS Graphics options:
The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between (6.96 and 26.05).[7] Note that the confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, the sampling distribution about the estimate is assumed to be symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
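This back-transform-to-the-median behaviour is easy to demonstrate numerically. In the sketch below (Python, purely illustrative data), the values are exactly symmetric on the log-scale, so the anti-log of the mean log-value recovers the median of the original values, not their mean:

```python
import math
import statistics

# Hypothetical positive data that are symmetric on the log-scale
# (logs are 1, 2, 3) but skewed on the original scale
data = [math.exp(1), math.exp(2), math.exp(3)]

back_transformed_mean = math.exp(statistics.mean(math.log(x) for x in data))
median_original = statistics.median(data)

# The anti-logged mean of the logs equals the MEDIAN of the original data,
# and falls below the (skew-inflated) mean of the original data
print(back_transformed_mean, median_original, statistics.mean(data))
```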
Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be
found. Be sure to understand the difference between the two intervals.
Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.

[7] A minor correction can be applied to estimate the mean if required.
Rather surprisingly, SAS does NOT have a function for inverse regression. A few people have written one-off functions, but the code needs to be checked carefully. For this class, I would plot the confidence and prediction intervals, and then work backward from the target Y value to see where it hits the confidence limits and then drop down to the X axis.

We compute the predicted values for a wide range of X values, get the plot of the two intervals, and then follow the example above. Note the use of the refline statement in Proc SGplot to get the horizontal reference line at Y = 2.302.
proc sgplot data=Predictions;
   title2 'Demonstrating how to do inverse predictions at logTEQ=2.302';
   band x=year lower=lowerCL upper=upperCL;
   band x=year lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue X=Year;
   scatter y=logTEQ x=Year / markerattrs=(symbol=circlefilled);
   refline 2.302 / axis=y;
   yaxis label='logTEQ' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;
The predicted year is found by solving

2.302 = 218.91364 - 0.10762(year)

which gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
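The point estimate for the inverse prediction can be checked by solving the fitted line for year. Note that the full-precision coefficients from the output must be used; the heavily rounded display values (218.9 and 0.11) give a very different answer. A sketch in Python:

```python
import math

b0, b1 = 218.91364, -0.10762   # intercept and slope from the regression output
target = math.log(10)          # log(TEQ) = 2.302... when TEQ = 10 units

# Solve target = b0 + b1*year for year
year = (target - b0) / b1
print(round(year, 1))          # about 2012.7, matching the text
```

Confidence limits for this crossing time are what the graphical method above provides; they cannot be read off from this simple point calculation.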
Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced using the ODS Graphics option prior to invoking Proc Reg:
The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern were apparent at other sites as well:
The application of regression to non-linear problems is fairly straightforward after the transformation is
made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED
scale and how these relate to the untransformed scale.
15.4.13 Example: Weight-length relationships - transformation
A common technique in fisheries management is to investigate the relationship between weight and length of fish.
This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew equally in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.
The traditional model between weight and length is often postulated to be of the form:

weight = a * length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.

How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b * log(length)

or

log(weight) = β0 + β1 * log(length)

where the usual linear relationship on the log-scale is now apparent.
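The linearization can be sketched numerically: generate weights that follow the power law exactly, fit a straight line to the logs, and the slope recovers b while the anti-log of the intercept recovers a. (A sketch in Python with numpy; the values of a and b below are arbitrary illustrations, not estimates from any dataset.)

```python
import numpy as np

a_true, b_true = 0.01, 3.0                    # hypothetical power-law coefficients
length = np.array([10.0, 20.0, 30.0, 40.0])
weight = a_true * length ** b_true            # exact power-law data, no noise

# Straight-line fit on the log-log scale: log(w) = log(a) + b*log(L)
slope, intercept = np.polyfit(np.log(length), np.log(weight), 1)

print(slope, np.exp(intercept))               # recovers b and a
```

With real data the points scatter about the line and the slope becomes the estimated power coefficient, exactly as in the SAS fit below.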
The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land,
and Air Protection.
Length (mm) Weight (g)
34 585
46 1941
33 462
36 511
32 428
33 396
34 527
34 485
33 453
44 1426
35 488
34 511
32 403
31 379
30 319
33 483
36 600
35 532
29 326
34 507
32 414
33 432
33 462
35 566
34 454
35 600
29 336
31 451
33 474
32 480
35 474
30 330
30 376
34 523
31 353
32 412
32 407
The data is available in the wtlen.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual fashion:
data wtlen;
   infile 'wtlen.csv' dlm=',' dsd missover firstobs=2;
   input length weight;
run;
Note that both variables are numeric (SAS doesn't have the concept of scale of variables). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.
Part of the raw data are shown below:
Obs length weight log_length log_weight
5 28.9 336 3.4 5.8
7 29.3 326 3.4 5.8
8 29.5 319 3.4 5.8
10 30.2 330 3.4 5.8
11 30.5 376 3.4 5.9
12 31.0 353 3.4 5.9
14 31.3 379 3.4 5.9
15 31.4 451 3.4 6.1
16 31.5 403 3.4 6.0
17 31.7 414 3.5 6.0
We create some additional plotting positions, and the log(weight) and log(length) values are added to the dataset. Note that the log() function is the natural logarithm (base e) function.
data wtlen2;  /* create some plotting points */
   do length = 25 to 50 by 1;
      weight = .;
      output;
   end;
run;
data wtlen;  /* append the plotting points and compute log() transformation */
   set wtlen wtlen2;
   log_length = log(length);
   log_weight = log(weight);
   format log_length log_weight 7.1;
run;
Start by plotting the data using Proc SGplot and add a lowess fit to the points:
proc sgplot data=wtlen;
   title2 'Preliminary data plot';
   scatter y=weight x=length / markerattrs=(symbol=circlefilled);
   loess y=weight x=length;
   yaxis label='Weight' offsetmin=.05 offsetmax=.05;
   xaxis label='Length' offsetmin=.05 offsetmax=.05;
run;
The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the region of 30-35 mm in length. If you look at the plot carefully, the variance appears to be increasing with the length, with the spread noticeably wider at 35 mm than at 30 mm.
We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a log transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural-log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back-transformation is required.
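The factor of roughly 2.3 between the two logarithm scales is just ln(10); either scale carries the same information. A quick check in Python:

```python
import math

# Natural-log values are ln(10) ~ 2.3026 times the corresponding base-10 values
factor = math.log(10)
x = 50.0
print(factor)                                # ~2.3026
print(math.log(x), factor * math.log10(x))   # the two expressions agree
```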
We use Proc Reg to fit the regression model:
ods graphics on;
proc reg data=wtlen plots=all;
   title2 'fit the model using all of the data';
   model log_weight = log_length / all;
   ods output OutputStatistics =Predictions;
ods output ParameterEstimates=Estimates;
run;
ods graphics off;
The model statement tells SAS that the response variable is log_weight because it appears to the left of the equal sign, and that the predictor variable is log_length because it appears to the right of the equal sign. The all option requests extensive output, part of which is discussed below. The ods statement requests that some output statistics be placed into a dataset called predictions.
Here is the fit:
Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced using the ODS Graphics option prior to invoking Proc Reg:
The fit is not very satisfactory. The curve doesn't seem to fit the two outlier points very well. At smaller lengths, the curve seems to be under-fitting the weight. The residual plot shows the two definite outliers and also shows some evidence of a poor fit, with positive residuals at lengths near 30 mm and negative residuals near 35 mm.
The fit was repeated dropping the two largest fish, with the following output. We delete all large fish:
data wtlen3;  /* delete the two largest fish */
   set wtlen;
   if length > 40 then delete;
run;
Part of the output includes the coefficients and their standard errors:
Variable    DF  Parameter  Standard  t      Pr > |t|  Lower 95%     Upper 95%
                Estimate   Error     Value            CL Parameter  CL Parameter
Intercept    1  -3.55305   0.74431   -4.77  <.0001    -5.06735      -2.03875
log_length   1   2.76722   0.21319   12.98  <.0001     2.33348       3.20095
Here is a plot of the fitted line:
Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.
The estimated power coefficient is 2.76 (SE .21). We find the 95% confidence interval for the slope (the power coefficient) from the previous output.
The 95% confidence interval for the power coefficient is (2.33 to 3.20), which includes the value of 3; hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much more.
The actual model in the population is:

log(weight) = β0 + β1 * log(length) + ε

This implies that the errors in growth act on the LOG-scale. This seems reasonable.
For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the "error" is 20/200 = 10% of the predicted value in the first case, but only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the errors operate on the log-scale, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.
Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.
A non-linear fit
It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize:

Σ (weight - β0 * length^β1)^2

directly.
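The idea behind the direct fit can be sketched without SAS. For any trial value of the exponent, the model is linear in the leading coefficient, so the best coefficient has a closed form; a one-dimensional search over the exponent then minimizes the error sum of squares. The sketch below (Python with numpy, exact noise-free data, arbitrary true values) is not Proc NLin's algorithm, which uses Gauss-Newton-type iterations, but it minimizes the same least-squares objective:

```python
import numpy as np

b0_true, b1_true = 0.03, 2.7
length = np.array([25.0, 30.0, 35.0, 40.0, 45.0])
weight = b0_true * length ** b1_true          # exact data, for illustration only

best = (np.inf, None, None)
for b1 in np.linspace(2.0, 3.5, 151):         # grid search over the exponent
    x = length ** b1
    b0 = np.sum(weight * x) / np.sum(x * x)   # closed-form least-squares b0 given b1
    sse = np.sum((weight - b0 * x) ** 2)      # the objective being minimized
    if sse < best[0]:
        best = (sse, b0, b1)

sse, b0_hat, b1_hat = best
print(b0_hat, b1_hat)                         # recovers the true coefficients
```

With noisy data the minimum is no longer zero and the grid would be replaced by a proper optimizer, but the objective is identical to the one Proc NLin minimizes below.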
The Proc NLin procedure can be used to fit the non-linear least-squares line directly:
ods graphics on;
proc nlin data=wtlen3 plots=all;
   title2 'non-linear least squares';
   parameters b0=.03 b1=3;
   bounds b0 > 0;
   model weight = b0 * length ** b1;
   ods output ParameterEstimates=NlinEstimates;
run;
ods graphics off;
Part of the output includes the coefficients and their standard errors:
non-linear least squares

Parameter  Estimate  Approx     Alpha  Lower    Upper   t      Approx
                     Std Error                          Value  Pr > |t|
b0         0.0323    0.0275     0.05   -0.0237  0.0883   1.17  0.2485
b1         2.7332    0.2427     0.05    2.2395  3.2270  11.26  <.0001
The estimated power coefficient from the non-linear fit is 2.73 with a standard error of .24. The estimated intercept is 0.0323 with an estimated standard error of .027. Both estimates are similar to the previous fit.
Which is the better method to fit this data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.
For this problem, both the non-linear fit and the fit on the log-scale gave the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish - this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed weight and predicted weight.
15.4.14 Power/Sample Size
A power analysis and sample size determination can also be done for regression problems, but is (unfortunately) rarely done in regression. This is for a number of reasons:
- The power depends not only on the total number of points collected, but also on the actual distribution of the X values. For example, a regression analysis is most powerful to detect a trend if half the observations are collected at a small X value and half are collected at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected, but this is often of more interest as lack-of-fit and non-linearity can be detected.
- Data collected for regression analysis is often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.
- The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression.
For a power analysis, the information required is similar to that requested for ANOVA designs:
- α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
- effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
- sample size. Recall that in ANOVA with more than two groups, the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at the two extremes of the X space - but at the cost of not being able to detect non-linearity.
- standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
This problem of power and sample size for regression is beyond what we can cover in this chapter. Neither JMP nor R currently includes a power computation module for regression analysis. However, SAS (Version 9+) includes a power analysis module (GLMPOWER) for this purpose. Please consult suitable help for details.
However, the problem simplifies considerably when the X variable is time, and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.
The analysis of trend data and power/sample size computations is treated in a following chapter.
15.4.15 The perils of R²
R² is a popular measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit, etc. However, there are several fundamental problems with R² which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, p. 245-246).
Before exploring this, how is R² computed and how is it interpreted?
While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure Error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:

Source           df           SS
Regression       p - 1        A
Lack-of-fit      n - p - ne   B
Pure error       ne           C
Corrected Total  n - 1        D

where there are n observations, ne degrees of freedom for pure error, and a regression model fit with p parameters (the intercept plus p - 1 X variables).
R² is computed as

R² = SS(regression)/SS(total) = A/D = 1 - (B + C)/D

where SS() represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R².
For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  p-value
Model      1  225.18035       225.180      69.8800  <.0001
Error      9   29.00147         3.222
C. Total  10  254.18182

Here R² = 225.18035/254.18182 = .886 = 88.6%.
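The arithmetic can be confirmed directly from the ANOVA table entries; both forms of the formula necessarily agree, since here Model SS plus Error SS equals the Corrected Total SS. A check in Python:

```python
# Sums of squares from the fertilizer ANOVA table above
ss_model, ss_error, ss_total = 225.18035, 29.00147, 254.18182

r2_from_model = ss_model / ss_total        # the A/D form
r2_from_error = 1 - ss_error / ss_total    # the 1 - (B+C)/D form

print(round(r2_from_model, 3), round(r2_from_error, 3))
```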
R² is interpreted as the proportion of variance in Y accounted for by the regression. In this case, almost 90% of the variation in Y is accounted for by the regression. The value of R² must range between 0 and 1.
It is tempting to think that R² must be a measure of the goodness of fit. In a technical sense it is, but R² is not a very good measure of fit, and other characteristics of the regression equation are much more informative. In particular, the estimate of the slope and the se of the slope are much more informative. Here are some reasons why I decline to use R² very much:
- Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R² = 1 - B/D. B has n - p degrees of freedom. As more and more X variables are added to the model, n - p decreases and B becomes smaller, and R² must increase even if the additional variables are useless.
- Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among the set of replicate X values), or B if the outlier occurs at a singleton X value. In either case, they reduce R², so R² is not resistant to outliers.
- People misinterpret a high R² as implying the regression line is useful. It is tempting to believe that a higher value of R² implies that a regression line is more useful. But consider the pair of plots below: The graph on the left has a very high R², but the change in Y as X varies is negligible. The graph on the right has a lower R², but the average change in Y per unit change in X is considerable. R² measures the tightness of the points about the line - the higher value of R² on the left indicates that the points fit the line very well. The value of R² does NOT measure how much actual change occurs.
- Upper bound is not always 1. People often assume that a low R² implies a poorly fitting line. If you have replicate X values, then C > 0. The maximum value of R² for this problem can be much less than 100% - it is mathematically impossible for R² to reach 100% with replicated X values. In the extreme case where the model fits perfectly (i.e. the lack-of-fit term is zero), R² can never exceed 1 - C/D.
- No-intercept models. If there is no intercept, then D = Σ(Yi - Ȳ)^2 does not exist, and R² is not really defined.
- R² gives no additional information. In actual fact, R² is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R².
- R² is not useful for non-linear fits. R² is really only useful for linear fits with the estimated regression line free to have a non-zero intercept. The reason is that R² is really a comparison between two types of models. For example, refer back to the length-weight relationship examined earlier.
In the linear fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 * log(length) + error
and so R² is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 * length^b1 + error

The model weight = 0 is silly, and so R² is silly.
Hence, the R² values reported are really all for linear fits - it is just that sometimes the actual linear fit is hidden.
- Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R² is again not defined.
- Cannot be used with different transformations of Y. R² cannot be used to compare models that are fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the highest R². This is not appropriate, as the D terms are no longer comparable between the two models.
- Cannot be used for non-nested models. R² cannot be used to compare models with different sets of X variables unless one model is nested within another model (i.e. all of the X variables in the smaller model also appear in the larger model). So using R² to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate, as these two models are not nested. In these cases, AIC should be used to select among models.
15.5 A no-intercept model: Fulton's Condition Factor K
It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.
The biggest problem lies in interpreting some of the output - some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.
The following is an example of where such a model may be sensible.
Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?
In general, the relationship between fish weight and length follows a power law:

W = a * L^b

where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.
There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.
One common measure is Fulton's[8] K:

K = Weight / (Length/100)^3

This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.
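For a single fish the computation is immediate. Using the first fish in the data listing shown later (length 360 mm, weight 686 g), a Python check reproduces the condition_factor column:

```python
weight = 686.0   # g
length = 360.0   # mm

# Fulton's K for one fish: weight divided by (length/100) cubed
K = weight / (length / 100) ** 3
print(round(K, 4))   # 14.7034, matching the condition_factor column in the listing
```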
How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.
The data is available in the rainbow-condition.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into SAS in the usual fashion:
data fish;
   infile 'rainbow-condition.csv' dlm=',' dsd missover firstobs=2;
   input net_type $ fish length weight species $ sex $ maturity $;
   lenmod = (length/100)**3;
   label lenmod='(length/100)**3';
   condition_factor = weight / lenmod;
run;
The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Part of the raw data are shown below:
Obs net_type fish length weight species sex maturity (length/100)**3 condition_factor
1 Sinking 1 360 686 RB F MATURING 46.6560 14.7034
2 Sinking 2 385 758 RB F MATURING 57.0666 13.2827
3 Sinking 3 295 284 RB M MATURING 25.6724 11.0625
4 Sinking 4 285 292 RB F MATURING 23.1491 12.6139
5 Sinking 5 380 756 RB F MATURING 54.8720 13.7775
[8] There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor - Setting the Record Straight. Fisheries, 31, 236-238.
Obs net_type fish length weight species sex maturity (length/100)**3 condition_factor
6 Sinking 6 350 598 RB F MATURING 42.8750 13.9475
7 Sinking 7 320 438 RB M MATURING 32.7680 13.3667
8 Sinking 8 250 250 RB F MATURING 15.6250 16.0000
9 Sinking 9 250 236 RB M MATURING 15.6250 15.1040
10 Sinking 10 360 600 RB F MATURING 46.6560 12.8601
K was computed for each individual fish, and the resulting histogram is displayed below:
proc sgplot data=fish;
   title2 'Preliminary data plot';
   scatter y=weight x=lenmod / markerattrs=(symbol=circlefilled);
   yaxis label='Weight' offsetmin=.05 offsetmax=.05;
   xaxis label='(Length/100)**3' offsetmin=.05 offsetmax=.05;
run;
There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.
Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.
Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, then it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try and ensure that fish of all sizes have an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.
Fulton's index is often re-expressed for regression purposes as:

W = K * (L/100)^3

This looks like a simple regression between W and (L/100)^3, but with no intercept.
A plot of these two variables:
proc sgplot data=fish;
   title2 'Preliminary data plot';
   scatter y=weight x=lenmod / markerattrs=(symbol=circlefilled);
   yaxis label='Weight' offsetmin=.05 offsetmax=.05;
   xaxis label='(Length/100)**3' offsetmin=.05 offsetmax=.05;
run;
shows a tight relationship among fish, but with possible increasing variance with length.
There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the error in the regression is in the vertical direction, i.e. they condition on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
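For a through-the-origin regression, the least-squares estimate has a simple closed form: K-hat = Σ(W·x) / Σ(x^2), with x = (L/100)^3. The sketch below (Python) applies it to just the ten fish shown in the data listing above, which already lands close to the full-data estimate reported below; it is an illustration of the estimator, not a reproduction of the full SAS fit.

```python
fish = [  # (length mm, weight g): the ten fish shown in the data listing
    (360, 686), (385, 758), (295, 284), (285, 292), (380, 756),
    (350, 598), (320, 438), (250, 250), (250, 236), (360, 600),
]

# No-intercept least squares: K = sum(W*x) / sum(x^2), x = (L/100)^3
num = sum(w * (l / 100) ** 3 for l, w in fish)
den = sum(((l / 100) ** 3) ** 2 for l, w in fish)
K_hat = num / den

print(round(K_hat, 2))   # about 13.56 for this ten-fish subset
```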
We use Proc Reg to fit the regression model:
ods graphics on;
proc reg data=fish plots=all;
   title2 'fit the model with NO intercept';
   model weight = lenmod / noint all;
   ods output OutputStatistics =Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;
Part of the output includes the coefficients and their standard errors:
Variable  DF  Parameter  Standard  t       Pr > |t|  Lower 95%     Upper 95%
              Estimate   Error     Value             CL Parameter  CL Parameter
lenmod    1   13.72947   0.09878   138.98  <.0001    13.53391      13.92502
Note that R² really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly.[9]
The estimated value of K is 13.73 (SE 0.099).
The residual plot:

[9] Consult any of the standard references on regression, such as Draper and Smith, for more details.
shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
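The through-the-origin estimator itself is simple: K̂ = sum(w·x·y)/sum(w·x²), with w = 1 for ordinary least squares and w proportional to 1/length² for the weighted fit. The Python sketch below illustrates the estimator only; the fish data are not reproduced here, so the lengths and the function name are hypothetical, and synthetic noiseless data are used.

```python
def no_intercept_slope(x, y, w=None):
    """Weighted least-squares slope for a regression through the origin:
    for the model y = K*x, K-hat = sum(w*x*y) / sum(w*x*x)."""
    if w is None:
        w = [1.0] * len(x)  # unweighted (ordinary least squares)
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

# Synthetic check: if weight is exactly K*(length/100)^3, either fit recovers K.
K_true = 13.7
lengths = [20, 30, 40, 50, 60]
x = [(L / 100) ** 3 for L in lengths]
weight = [K_true * xi for xi in x]

K_ols = no_intercept_slope(x, weight)
K_wls = no_intercept_slope(x, weight, w=[1 / L ** 2 for L in lengths])
```

With real (noisy) data the two estimates differ slightly, as in the text (13.73 unweighted versus 13.67 weighted).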
Comparing condition factors

This dataset has a number of sub-groups: do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).
15.6 Frequently Asked Questions - FAQ
15.6.1 Do I need a random sample? Power analysis
A student wrote:
I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologic/geologic homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b, where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.
I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a-priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites, but I have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly but will deal with slope by breaking my stream segments into slope classes. I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.
You are both correct!
If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected; you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose values of the X variables, but must randomly select from streams with similar X values.
As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure; indeed it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft tall, 5 ft tall, 6 ft tall, 7 ft tall and measure their height and arm length and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 feet tall, 1/4 who are 6 feet tall, and 1/4 who are 7 feet tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.
Now sample size is a bit more difficult, as the required sample size depends both on the number of streams selected and on how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest X and the very largest X value. However, without intermediate points, you can't assess linearity very well. So you will want points scattered around the range of X values.

If you have some preliminary data, a power/sample size analysis can be done using JMP, SAS, and other packages. If you do a Google search for power analysis regression, there are several direct links to examples. Refer to the earlier section of the notes.
Chapter 16
SAS CODE NOT DONE
Chapter 17
SAS CODE NOT DONE
Chapter 18
SAS CODE NOT DONE
Chapter 19
Estimating power/sample size using
Program Monitor
Contents
19.1 Mechanics of MONITOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
19.2 How does MONITOR work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
19.3 Incorporating process and sampling error . . . . . . . . . . . . . . . . . . . . . . . . 977
19.4 Presence/Absence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
19.5 WARNING about using testing for temporal trends . . . . . . . . . . . . . . . . . . . 989
J. Gibbs has written a Windoze program to estimate the power and sample size requirements for many common monitoring programs.

Gibbs, J. P., and Eduard Ene. 2010. Program MONITOR: Estimating the statistical power of ecological monitoring programs. Version 11.0.0. http://www.esf.edu/efb/gibbs/monitor/

CAUTION: Version 11.0 of MONITOR appears to have some features that result in incorrect power computations in certain cases. Please contact me in advance of using the results from MONITOR in a critical planning situation to ensure that you have not stumbled on some of these features.

Program MONITOR uses simulation procedures to evaluate how each component of a monitoring program influences its power to detect a linear (regression) change. The program has been cited in numerous peer-reviewed publications since it first became available in 1995.
Before using Program MONITOR, you will need to gather some basic information about the proposed study.

- What is the initial value of your population? This could be the initial population size, the initial density, etc.
- How precisely can you measure the population at a given sampling occasion? This can be given as the standard error you expect to see at any occasion, or the relative standard error (standard error/estimate), etc.
- What is the process variation? Do you really expect that the measurements would fall precisely on the trend line in the absence of measurement error?
- What is the significance level and target power? Traditional values are α = 0.05 with a power of 80% or α = 0.10 with a target power of 90%.
19.1 Mechanics of MONITOR

Let us first demonstrate the mechanics of MONITOR before looking at some real examples of how to use it for monitoring designs.

Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population size at the monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error (standard error) in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends and, to start, a 5% decline per year will be of interest. We will assume an UNREALISTIC process error of zero so that the sampling error is equal to the total variation in measurements over time.
Launch Program MONITOR:
The screen starts with default values. We make some changes:

- Change the sampling occasions to the values 0, 1, 2, 3, 4.
- Change the number of survey plots/year to 2.
- Check that the significance level is set to 0.05.
- Check that the desired power is set to 0.80.
- Check that the range of desired trends encompasses -5%. You might want to increase the number of trend powers computed to 21 to get power computations for every value rather than every second value.
- Check that the two-sided test is selected.
Then click on the Plots tab and enter the initial population size (1000) and a variation (the STANDARD DEVIATION) in measurements of 200 under Total Variation.
Press the Run icon and the following results are shown: [Because the power computations are based on
a simulation, your results may vary slightly.]
Notice that the net change over the five-year period with a 5% decline/year is only an 18.5% total decline. This is obtained as:

Year  Mean Abundance       % Total Decline
0     1000                          0.0%
1      950.0 = 1000(.95)            5.0%
2      902.5 = 1000(.95)^2          9.7%
3      857.4 = 1000(.95)^3         14.3%
4      814.5 = 1000(.95)^4         18.5%
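The multiplicative arithmetic behind this table can be checked directly (a small Python sketch, not part of MONITOR):

```python
# Year-by-year abundance under a constant 5% decline per year.
N0, rate = 1000, 0.05
rows = []
for t in range(5):
    abundance = N0 * (1 - rate) ** t
    total_decline = 1 - (1 - rate) ** t  # cumulative, NOT 5% * t
    rows.append((t, round(abundance, 1), round(100 * total_decline, 1)))
# rows[-1] is (4, 814.5, 18.5): four steps of a 5%/year decline
# lose only 18.5% of the initial abundance, not 20%.
```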
By clicking on the Trend vs. Power Chart tab, you see a graph of the power by the size of the trend:
This design has a power of around 15% for detecting this trend: hardly worthwhile doing the study!

How many years would be needed to detect this trend with an 80% power? Try modifying the number of sampling years until you get the approximate power needed:
So about 10 years of monitoring will be needed to detect a 5% decline PER YEAR with about an 80% power.
The differences in reported powers between the MONITOR and TRENDS programs are artifacts of the different ways the two programs compute power (and potentially because of some features of the MONITOR program). TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the number of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences; the key point is that this proposed study has virtually no power to detect a 5% decline/year.
Program MONITOR also has a hand calculator to convert between the trend per year and the total trend
over the course of the experiment.
For example, a 5% decline per year for 5 ADDITIONAL years translates into an overall decline of 22.6% over the six years of the study (the one initial year + 5 ADDITIONAL years). It is not a straight arithmetic conversion because the changes are actually multiplicative rather than additive, as shown earlier.
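The calculator's conversion is just compound-interest arithmetic, sketched here in Python:

```python
def total_change(per_year, n_years):
    """Overall proportional change after n_years at a constant yearly rate."""
    return (1 + per_year) ** n_years - 1

def per_year_change(total, n_years):
    """Constant yearly rate that produces the given overall change."""
    return (1 + total) ** (1 / n_years) - 1

overall = total_change(-0.05, 5)        # about -0.226, i.e. a 22.6% decline
per_year = per_year_change(overall, 5)  # recovers -0.05
```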
19.2 How does MONITOR work?

Program MONITOR estimates power using a simulation-based approach, as outlined in the help file. For example, consider the situation outlined in the previous section. Again set up the control parameters in the same way, except change the trend lines to look only at a single value for the decline (-5% per year).
Then press the Step icon. The following display is obtained:
First the underlying deterministic trend is generated (the black line in the middle of the plot). Then, based on the variation expected in the measurements, actual data are generated (shown by circles below; note that at time 1, the values are off the plot) and presented in the Survey count details tab:
Then it gets a bit odd, and the output is potentially misleading. A regression line is fit through the points (the red line in the first graph; estimates at the bottom of the data window). But this curve is not the one used to estimate the power. Rather, a regression line is fit through the log(data), and the results from the regression on the log(data) are used to determine if the trend was detected. The analysis is done on the log-scale because of the multiplicative way in which the deterministic trend is fit. Refer to the analyses from JMP below to see which statistics are used:
In this case, the estimated trend line (on the log-scale) was not statistically different from zero and the trend was NOT detected.

The simulation is repeated many hundreds of times; the proportion of simulations in which a statistically significant trend was detected is then the estimated power for this design.
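This simulate-analyze-tally loop is easy to mimic. The Python sketch below is a toy version only: it assumes normal measurement error with a constant CV, no process error, and an ordinary t-test on the log-scale regression slope, with hard-coded critical values for the stated degrees of freedom; the real program offers many more options.

```python
import math
import random

def detects_trend(n_years, plots_per_year, n0, cv, annual_change, t_crit, rng):
    """One simulated survey: noisy counts around a multiplicative trend,
    regress log(count) on year, and declare 'detected' if |t| > t_crit."""
    xs, ys = [], []
    for year in range(n_years):
        mean = n0 * (1 + annual_change) ** year
        for _ in range(plots_per_year):
            count = max(rng.gauss(mean, cv * mean), 1.0)  # crude floor at 1
            xs.append(year)
            ys.append(math.log(count))
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    rss = sum((y - ybar - slope * (x - xbar)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return abs(slope / se) > t_crit

def power(n_sims=500, seed=42, **kw):
    """Proportion of simulated surveys in which the trend is detected."""
    rng = random.Random(seed)
    return sum(detects_trend(rng=rng, **kw) for _ in range(n_sims)) / n_sims

# 5 years, 2 plots/year, N0 = 1000, CV = 20%, 5% decline/year;
# with 10 points, df = 8, so the two-sided 5% critical value is t = 2.306.
p5 = power(n_years=5, plots_per_year=2, n0=1000, cv=0.2,
           annual_change=-0.05, t_crit=2.306)
# 10 years of surveys (df = 18, t_crit = 2.101): power rises substantially.
p10 = power(n_years=10, plots_per_year=2, n0=1000, cv=0.2,
            annual_change=-0.05, t_crit=2.101)
```

With these settings the estimated power at 5 years lands in the 15-20% range, and at 10 years it climbs toward 80-90%, in line with the numbers quoted earlier.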
19.3 Incorporating process and sampling error

As noted in the chapter on Trend Analysis, there are often two sources of variation in any monitoring study.

First is sampling variation. This occurs because it is impossible to measure the population parameter exactly in any one year. For example, if we are measuring the mean DDT level in birds, we must take a sample (say of 10 birds), sacrifice them, and find the mean DDT in those 10 birds. If a different sample of 10 birds were to be selected, then the sample mean DDT would vary in the second sample. This is called sampling error (or the standard error) and can be estimated from the data taken in a single year. Or, the parameter of interest may be the number of smolt leaving a stream, and this is estimated using capture-recapture methods. Again we would have a measure of uncertainty (the standard error) for each measurement in each year. Sampling error (the standard error) can be reduced by increasing the effort in each year.
However, consider what happens when measurements are taken in different years. It is unlikely that the population values would fall exactly on the trend line even if the sampling error were zero. This is known as process error and is caused by random year effects (e.g. an El Niño). Process error CANNOT be reduced by increasing the sampling effort in a year.

The two sources of variation are diagrammed below:

Unfortunately, process error is often the limiting factor in a monitoring study!
In order to estimate the process and sampling variation, you will need at least two years of data or some educated guesses from previous years. The Program MONITOR website has a spreadsheet tool to help you in the decomposition of process and sampling error.
For example, consider a study to monitor the density of white-tailed deer obtained by distance sampling on Fire Island National Seashore (Underwood et al., 1998), presented as the example on the spreadsheet to separate process and sampling variation.
The estimated density (and se) are:
Year Density SE
1995 79.6 23.47
1996 90.1 11.67
1997 107.1 12.09
1998 74.1 10.45
1999 64.2 13.90
2000 40.8 12.38
2001 41.2 7.40
Consider the plot of density over time (with approximate 95% confidence intervals):
Assuming that the deer density is in steady state over the seven years of the study, you can see that there is considerable process error, as many of the 95% confidence intervals for the deer density do not cover the mean density over the study period. So even if the sampling error (the se) were driven to zero by adding more effort, the data points would not all lie exactly on the mean line over time.
There are many ways to separate process and sampling variation; the chapter on the analysis of BACI designs presents some additional ways. The following is an approximate analysis that should be sufficient for most planning purposes.
First, examine a plot of the estimated se versus the density estimates:

In many cases, there is a relationship between the se and the estimate, with larger estimates tending to have a higher se than smaller estimates. The previous plot shows that, except for one year, the se is relatively constant. If the se had a positive relationship to the estimate, a weighted procedure could be used (this is the procedure used in Underwood's spreadsheet).
We begin by finding the mean density and the total variation from the mean. [If the preliminary study had an obvious trend, you could fit the trend line and then find the total variation from the trend line in a similar fashion.]
We start by finding the total variation in the density estimates over time:

Var(Total) = var(79.6, 90.1, ..., 41.2) = 599.6

The total variation is equal to the process + sampling variation. An estimate of the average sampling variation is found by averaging the se²:

Var(Sampling) = (23.47² + 11.67² + ... + 7.40²) / 7 = 191.9

Finally, the process variance is found by subtraction:

Var(Process) = Var(Total) - Var(Sampling) = 599.6 - 191.9 = 407.7
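This decomposition, and the standard deviations that get entered into MONITOR in the next step, can be verified with a few lines of Python:

```python
from statistics import mean, variance

density = [79.6, 90.1, 107.1, 74.1, 64.2, 40.8, 41.2]
se      = [23.47, 11.67, 12.09, 10.45, 13.90, 12.38, 7.40]

var_total    = variance(density)           # sample variance, ~599.6
var_sampling = mean(s ** 2 for s in se)    # average se^2,    ~191.9
var_process  = var_total - var_sampling    # by subtraction,  ~407.7

# MONITOR wants standard deviations, i.e. the square roots:
sd_process, sd_sampling = var_process ** 0.5, var_sampling ** 0.5
```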
We now launch Program MONITOR, and are interested in a 10 year study to look at changes in the population density following some management action. Notice that we now specify a partitioning of the variation into process and sampling error:
We use the sqrt() of the two variances estimated above when specifying the two sources of variation:
and then press the Run button as before to get:
The power to detect a 5% decline PER YEAR is not very good.
It is instructive to see what would happen if you believed that there was NO process variation and simply
used the average sampling variation as the sole source of variation:
Now the (incorrect) estimated power is much higher.
19.4 Presence/Absence Data
Sometimes, only presence/absence data can be collected on each plot, rather than a measure of density. In
cases like this, you may wish to consider occupancy modelling, but that is a topic for another course.
Despite not having an absolute measure of abundance, presence/absence data can be used to monitor the density of species with relatively low abundances. This makes use of the Poisson distribution to predict presence/absence as a function of density.

For example, according to the Poisson distribution, if the average density per plot is λ, then the probability that a sampled plot will be labelled as a presence is 1 - exp(-λ) and the probability that a sampled plot will be labelled as an absence is exp(-λ). So a change in the overall proportion of sites that are occupied corresponds to a change in the overall average density.
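The Poisson link between density and the presence probability, and its inverse, are one-liners (using the 0.20 calls/visit baseline from the bittern example below):

```python
import math

def p_presence(density):
    """Poisson probability that a plot holds at least one individual."""
    return 1 - math.exp(-density)

def density_from_presence(p):
    """Average density implied by an observed presence rate."""
    return -math.log(1 - p)

p = p_presence(0.20)            # ~0.18: roughly 1 presence per 5 visits
lam = density_from_presence(p)  # recovers the density of 0.20
```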
Note that we are implicitly assuming that all absences are true absences, i.e. not a false negative. If false
negatives are possible, you really should be using an occupancy design rather than a simple presence/absence
design.
We will use the example that ships with Program MONITOR. This example focuses on the least bittern (Ixobrychus exilis), a secretive marsh bird. Least bittern populations are hard to monitor given their quirky habits, that is, their unpredictable calling behavior. Calls are the only way to detect the species' presence within the dense vegetation of the marshes where it lives. Consider that baseline surveys of least bitterns between May 15-June 15 indicate that an average of about 0.20 calling least bitterns were heard on any given visit. A water control structure on the marsh is being altered to generate a more stable water level that should improve the situation for bitterns at the site. How much of a trend can be detected with 10 years of monitoring and 10 visits to the marsh each year?

Here the average of 0.20 calls/visit implies that a presence was detected in about 1 in 5 visits to the marsh.
Start by entering the data on the main page and then on the plots page.
With presence/absence data, the plot mean should have the approximate base rate of presences, and there is no need for a standard deviation estimator. On the main page, tests for trend in presence/absence data are equivalent to chi-square tests (covered in another section of the notes). The Custom/ANOVA area indicates a doubling of the presence frequency in the second through tenth years of monitoring.
Before computing the power, press the step button to get a feel for the data that are generated (not shown). I think this is where Program MONITOR has a feature, as the data in the 3rd and subsequent visits don't ever have any non-detects.

Consequently, I won't continue with this example until I understand what MONITOR is doing! I have SAS programs that can help in the planning of presence/absence studies; please contact me for assistance.
19.5 WARNING about using testing for temporal trends
The Patuxent Wildlife Research Center has some sage advice about power analysis for temporal trends.
Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate sample size, or conversely, it will overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective.
We deal with some of these issues when we discuss the design and analysis of BACI surveys later in this
course.
Chapter 20
SAS CODE NOT DONE
Chapter 21
SAS CODE NOT DONE
Chapter 22
SAS CODE NOT DONE
Chapter 23
Logistic Regression - Advanced Topics
Contents
23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994
23.2 Sacrificial pseudo-replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
23.3 Example: Fox-proofing mice colonies . . . . . . . . . . . . . . . . . . . . . . . . . 996
23.3.1 Using the simple proportions as data . . . . . . . . . . . . . . . . . . . . . . . . 997
23.3.2 Logistic regression using overdispersion . . . . . . . . . . . . . . . . . . . . . . 999
23.3.3 GLIMM modeling the random effect of colony . . . . . . . . . . . . . . . . . . . 1000
23.4 Example: Over-dispersed Seeds Germination Data . . . . . . . . . . . . . . . . . . . 1002
23.1 Introduction

The previous chapters on chi-square tests, logistic regression, and logistic ANOVA considered only the simplest of experimental designs, where the data were collected under a completely randomized design, i.e. every observation is independent of every other observation, with complete randomization over experimental units and treatments.

It is possible to extend logistic regression and logistic ANOVA to more complex experimental designs. My course notes for the graduate course Stat-805 http://www.stat.sfu.ca/~cschwarz/Stat-805 have some details on these more advanced topics.

It is only recently that software has become readily available to analyze these types of experiments. The illustrations below will use Proc GLIMMIX, available in SAS v.9.1.3 or higher.

In this chapter some variations from the simple CRD will be discussed.
23.2 Sacrificial pseudo-replication

In many experiments, the experimental unit is a collection of individuals, but measurements take place on the individual.

Hurlbert (1984) cites the example of an experiment to investigate the effect of fox predation upon the sex ratio of mice. Four colonies of mice are established. Two of the colonies are randomly chosen and a fox-proof fence is erected around the plots. The other two colonies serve as controls without any fencing.

Here are the data (Table 6 of Hurlbert (1984)):
          Colony  % Males  Number males  Number females
Foxes     A1          63%            22              13
          A2          56%             9               7
No foxes  B1          60%            15              10
          B2          43%            97             130
These data have the characteristics of a chi-square test or logistic ANOVA. The factor (type of fencing) is categorical. The response, the sex of the mouse, is also categorical. Many researchers would simply pool over the replicates to give the pooled table:

          Colony   % Males  Number males  Number females
Foxes     A1+A2        61%            31              20
No foxes  B1+B2        44%           112             140
If a χ² test is applied to the pooled data, the p-value is less than 5%, indicating there is evidence that the sex ratio is not independent of the presence of foxes.
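That pooled chi-square test (which the next paragraph argues is invalid for these data) can be reproduced in a few lines of Python; the p-value uses the 1-df identity P(X > x) = erfc(sqrt(x/2)):

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the
    2x2 table [[a, b], [c, d]], with its p-value."""
    n = a + b + c + d
    chi2 = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        expected = row * col / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Pooled mice table: 31 males / 20 females with foxes, 112 / 140 without.
chi2, p = chi_square_2x2(31, 20, 112, 140)  # chi2 ~ 4.54, p ~ 0.03
```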
This pooled analysis is INCORRECT. According to Hurlbert (1984), the major problem is that individual units (the mice) are treated as independent objects when, in fact, they are not. Experimenters often pool experimental units from disparate sets of observations in order to do simple chi-square tests or logistic ANOVA. He specifically labels this pooling as sacrificial pseudo-replication.
Hurlbert (1984) identifies at least 4 reasons why the pooling is not valid:

- non-independence of observations. The 35 mice caught in A1 can be regarded as 35 observations all subject to a common cause, as can the 16 mice in A2, as each group was subject to a common influence in the patches. Consequently, the pooled mice are NOT independent; they represent two sets of interdependent or correlated observations. The pooled data set violates the fundamental assumption of independent observations.
- throws away some information. The pooling throws out the information on the variability among replicate plots. Without such information there is no proper way to assess the significance of the differences between treatments. Note that in previous cases of ordinary pseudo-replication (e.g. multiple fish within a tank), this information is also discarded but is not needed; what is needed is the variation among tanks, not among fish. In the latter case, averaging over the pseudo-replicates causes no problems.
- confusion of experimental and observational units. If one carries out a test on the pooled data, one is implicitly redefining the experimental unit to be the individual mice and not the field plots. The enclosures (treatments) are applied at the plot level and not the mouse level. This is similar to the problem of multiple fish within a tank that is subject to a treatment.
- unequal weighting. Pooling weights the replicate plots differentially. For example, suppose that one enclosure had 1000 mice with 90% being male, and a second enclosure had 10 mice with 10% being male. The pooled data would have 1000 + 10 mice with 900 + 1 being male, for an overall male ratio of about 90%. Had the two enclosures been given equal weight, the average male percentage would be (90% + 10%)/2 = 50%. In the above example, the number of mice captured in the plots varies from 16 to over 200; the plot with over 200 mice essentially drives the results.
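The unequal-weighting point in the list above is easy to see numerically with Hurlbert's hypothetical 1000-mouse and 10-mouse enclosures:

```python
males  = [900, 1]     # 90% of 1000 mice, 10% of 10 mice
totals = [1000, 10]

pooled = sum(males) / sum(totals)                             # ~0.89
equal_weight = sum(m / t for m, t in zip(males, totals)) / 2  # 0.50
# Pooling lets the 1000-mouse enclosure swamp the 10-mouse one.
```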
There are multiple ways to analyze these data that avoid the problems that render the pooled analysis invalid. Unfortunately, JMP 7.0 does NOT have the ability to properly analyze this type of data. SAS will be used to illustrate the various options in the sections that follow.
23.3 Example: Fox-proofing mice colonies

Hurlbert (1984) cites the example of an experiment to investigate the effect of fox predation upon the sex ratio of mice. Four colonies of mice are established. Two of the colonies are randomly chosen and a fox-proof fence is erected around the plots. The other two colonies serve as controls without any fencing.

Here are the data (Table 6 of Hurlbert (1984)):
          Colony  % Males  Number males  Number females
Foxes     A1          63%            22              13
          A2          56%             9               7
No foxes  B1          60%            15              10
          B2          43%            97             130
We begin by reading in the data:
data mice;
length treatment $10.;
input colony $ treatment $ sex $ count;
datalines;
a1 foxes m 22
a1 foxes f 13
a2 foxes m 9
a2 foxes f 7
b1 no.foxes m 15
b1 no.foxes f 10
b2 no.foxes m 97
b2 no.foxes f 130
;;;;
23.3.1 Using the simple proportions as data
Hurlbert (1984) suggests the proper way to analyze the above experiment is to essentially compute a single
number for each plot and then do a two-sample t-test on the percentages. [This is equivalent to the ordinary
averaging process that takes place in ordinary pseudo-replication or sub-sampling.]
We can have SAS compute the proportion of males directly using Proc Transpose:

/* transpose the data to compute the proportions */
proc sort data=mice; by colony treatment;
proc transpose data=mice out=trans_mice;
  by colony treatment;
  var count;
  id sex;
run;
data trans_mice;
set trans_mice;
p_males = m/(m+f);
drop _name_;
format m f 5.0 p_males 7.3;
run;
This gives:

colony  treatment   m    f  p_males
a1      foxes      22   13    0.629
a2      foxes       9    7    0.563
b1      no.foxes   15   10    0.600
b2      no.foxes   97  130    0.427
Proc Ttest is then used to analyze the data:
proc ttest data=trans_mice ci=none;
title2 'Simple ttest on the proportion of males';
class treatment;
var p_males;
ods output Statistics=ttest_statistics;
ods output TTests=ttest_tests;
run;
The simple summary statistics are:
Variable  treatment   N  Mean    Std Error  Lower Limit of Mean  Upper Limit of Mean
p_males   foxes       2  0.5955  0.0330      0.1758               1.0153
p_males   no.foxes    2  0.5137  0.0863     -0.5834               1.6108
p_males   Diff (1-2)     0.0819  0.0924     -0.3159               0.4796
The results from a simple t-test conducted in SAS are:
Variable  Method         Variances  t Value  DF      Pr > |t|
p_males   Pooled         Equal      0.89     2       0.4692
p_males   Satterthwaite  Unequal    0.89     1.2866  0.5102
The estimated difference in the sex ratio between colonies that are subject to fox predation and colonies
not subject to fox predation is .082 (SE .092) with p-values of .47 (pooled t-test) and .51 (unpooled t-test)
respectively. As the p-values are quite large, there is NO evidence of a predation effect.
With only two replicates (the colonies), this experiment is likely to have very poor power to detect
anything but gross differences.
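As a cross-check, the pooled t-test above can be reproduced directly from the colony counts. The following Python sketch (an illustrative re-computation, not part of the original SAS analysis) recomputes the difference, its standard error, and the pooled t-statistic:

```python
# An illustrative re-computation (not from the original SAS analysis) of the
# pooled two-sample t-test on the colony proportions of males.
from math import sqrt
from statistics import mean, variance

foxes    = [22 / 35, 9 / 16]     # proportion of males in colonies A1, A2
no_foxes = [15 / 25, 97 / 227]   # proportion of males in colonies B1, B2

diff = mean(foxes) - mean(no_foxes)

# pooled variance with two colonies per treatment, df = 2 + 2 - 2 = 2
sp2 = (variance(foxes) + variance(no_foxes)) / 2
se  = sqrt(sp2 * (1 / 2 + 1 / 2))
t   = diff / se

print(round(diff, 3), round(se, 3), round(t, 2))   # 0.082 0.092 0.89
```

These match the SAS values of .082 (SE .092) and t = 0.89 shown above.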
The above analysis is not entirely satisfactory. The proportions of males have different variabilities because they are based on different numbers of total mice. As well, there may be overdispersion among colonies under the same treatment, i.e. the variation in the proportion of males may be larger among the two colonies under the same treatment than expected.
23.3.2 Logistic regression using overdispersion
Another approximate method to deal with the potential overdispersion among the colonies within the same treatment group (the colony effect) is to use a standard logistic regression, but use the goodness-of-fit test to estimate an overdispersion effect. This overdispersion is then used to adjust the standard errors of estimates and the test statistics for hypothesis tests. Please consult the chapter on Logistic Regression for more details.
Proc Genmod is then used to analyze the data:
proc genmod data=mice descending;
title2 'Generalized Linear Model allowing for overdispersion';
class treatment colony;
model sex = treatment /
dist=binomial link=logit dscale aggregate=colony type3;
freq count;
lsmeans treatment / diff cl;
run;
The Dscale option on the model statement indicates that the deviance goodness-of-fit test is used to estimate
the overdispersion factor.
The bottom line of the parameter estimates table:
Parameter  Level1    DF  Estimate  Standard Error  95% Lower Confidence Limit  95% Upper Confidence Limit  Wald Chi-Square  Pr > ChiSq
Intercept            1   -0.2231   0.1527          -0.5225                     0.0762                      2.13             0.1441
treatment  foxes     1    0.6614   0.3778          -0.0791                     1.4019                      3.06             0.0800
treatment  no.foxes  0    0.0000   0.0000           0.0000                     0.0000                     .                .
Scale                0    1.2049   0.0000           1.2049                     1.2049
estimates the overdispersion factor as 1.20. This implies that standard errors will be inflated by √1.20 and test-statistics for effect tests will be deflated by a factor of 1.20.
The test for a treatment effect:
Source     Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq  Method
treatment  1       2       3.14     0.2185  3.14        0.0765      LR
gives a p-value of .0765 again indicating no evidence of an effect.
The estimated sex effect is:
Effect     treatment  _treatment  Est     SE      z Value  Pr > |z|  Alpha  Lower     Upper
treatment  foxes      no.foxes    0.6614  0.3778  1.75     0.0800    0.05   -0.07912  1.4019
The estimate of .66 implies that the odds-ratio of the proportion of males between colonies with foxes and without foxes is exp(.66) = 1.94, but the 95% confidence interval for the odds ratio is from exp(-.0791) = .92 to exp(1.4019) = 4.06, which includes the value of 1 (indicating no difference in the odds of males).
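The back-transformation from the log-odds scale can be verified with a few lines of Python (an illustrative sketch, not part of the SAS output):

```python
# An illustrative back-transformation (not part of the SAS output) of the
# treatment effect and its confidence limits from the log-odds scale.
from math import exp

est, lower, upper = 0.6614, -0.0791, 1.4019   # from the Genmod output above

print(round(exp(est), 2))     # 1.94 -> odds ratio of males, foxes vs no foxes
print(round(exp(lower), 2))   # 0.92
print(round(exp(upper), 2))   # 4.06
```

Because the interval (0.92, 4.06) spans 1, there is no evidence of a difference in the odds of being male.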
The use of a simple overdispersion factor is not completely satisfactory. It assumes a single correction factor for all of the estimates and again ignores the different numbers of mice in each colony.
23.3.3 GLIMM modeling the random effect of colony
A more rened analysis is now available using Generalized Linear Mixed Models (GLIMM) which have
been implemented in SAS.
GLIMMs allow the specification of random effects in much the same way as in advanced ANOVA models.
This is a very general treatment and now allows us to analyze data from very complex experimental designs.
The model would be specified as:

   logit(p_males) = Treatment Colony(Treatment)(R)

where the Colony(Treatment) term is the random effect of the experimental units (the colonies). A logistic-type model is used.
This is specied in SAS as:
proc glimmix data=mice;
   title2 'Glimmix analysis';
   class treatment colony;
   model sex(event='m') = treatment /
         distribution=binary link=logit ddfm=kr;
   random colony(treatment);
   freq count;
   lsmeans treatment / cl ilink;
   lsmestimate treatment "trt effect" 1 -1 / cl;
   ods output covparms=GlimMixCovParms;
   ods output tests3 =GlimMixTest3;
   ods output lsmestimates=GlimMixLSMestimates;
run;
The output from GLIMMIX (SAS 9.3.1) follows. First is an estimate of the variability among colonies
(on the logit scale):
Parameter          Estimate  Standard Error
colony(treatment)  0.06892   0.1675
Next is the test of the overall treatment effect:
Effect     Num DF  Den DF  F Value  Pr > F
treatment  1       1.847   1.22     0.3919
The p-value is .39; again there is no evidence of a predation effect on the proportion of males in the colonies.
Finally, an estimate of the treatment effect:
Effect     Label       Estimate  Standard Error  DF     t Value  Pr > |t|  Alpha  Lower    Upper
treatment  trt effect  0.5269    0.4763          1.847  1.11     0.3919    0.05   -1.6940  2.7478
Some caution is required. The estimate of .53 (SE .47) is the difference in the logit(proportion of males) between colonies with and without foxes. If you take exp(.53) = 1.69, this is the estimated ratio of the odds of being male comparing colonies with predators to colonies without predators. The 95% confidence interval for the odds-ratio is exp(-1.6940) = .18 to exp(2.7478) = 15.61, which includes the value of 1 (indicating no effect). Consult the chapter on logistic regression for an explanation of odds and odds-ratios.
23.4 Example: Over-dispersed Seeds Germination Data
This data is from the SAS manual.
In a seed germination test, seeds of two cultivars were planted in pots of two soil conditions. The following data contains the observed proportion of seeds that germinated for various combinations of cultivar and soil condition. Variable n represents the number of seeds planted in a pot, and r represents the number germinated. CULT and SOIL are indicator variables, representing the cultivar and soil condition, respectively.
Pot n r Cult Soil
1 16 8 0 0
2 51 26 0 0
3 45 23 0 0
4 39 10 0 0
5 36 9 0 0
6 81 23 1 0
7 30 10 1 0
8 39 17 1 0
9 28 8 1 0
10 62 23 1 0
11 51 32 0 1
12 72 55 0 1
13 41 22 0 1
14 12 3 0 1
15 13 10 0 1
16 79 46 1 1
17 30 15 1 1
18 51 32 1 1
19 74 53 1 1
20 56 12 1 1
The SAS program that analyzed this data is available in the file germination.sas in the Sample Program Library at: http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Notice that the experimental unit is the pot (i.e. soil and cult were applied at the pot level), but the observational unit (what is actually measured) is the individual seed. The response variable for each individual seed is either yes or no, depending on whether it germinated or not.
First, how big is the pot random effect? One way to estimate this would be to compare the variation of p among pots within the same soil-cultivar combination with the theoretical variation based on binomial sampling within each pot. In order to account for the differing sample sizes in each pot, we will compute a standardized normal variable for pot i within soil-cultivar combination j as:

   z_ij = (p_ij - pbar_j) / sqrt( p_ij (1 - p_ij) / n_ij )

where pbar_j = (sum_i r_ij) / (sum_i n_ij) is the average germination rate for the soil-cultivar combination j.
If the additional pot-to-pot random variation were negligible, then Z should have an approximate standard normal distribution with a variance of 1. The actual variance of Z was found to be 4.5, indicating that the pot-to-pot variation in p was about 4× larger than expected from a simple binomial variation.
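The extra-binomial check just described can be sketched in Python (an illustrative re-computation, not part of the original notes), using the pot data from the table above:

```python
# An illustrative re-computation (not from the original notes) of the
# extra-binomial variation check: standardize each pot's germination
# proportion by its binomial standard error and examine the variance of z.
from math import sqrt
from statistics import variance

# (n, r, cult, soil) for the 20 pots, from the table above
pots = [(16, 8, 0, 0), (51, 26, 0, 0), (45, 23, 0, 0), (39, 10, 0, 0), (36, 9, 0, 0),
        (81, 23, 1, 0), (30, 10, 1, 0), (39, 17, 1, 0), (28, 8, 1, 0), (62, 23, 1, 0),
        (51, 32, 0, 1), (72, 55, 0, 1), (41, 22, 0, 1), (12, 3, 0, 1), (13, 10, 0, 1),
        (79, 46, 1, 1), (30, 15, 1, 1), (51, 32, 1, 1), (74, 53, 1, 1), (56, 12, 1, 1)]

# average germination rate pbar_j for each cultivar-soil combination
totals = {}
for n, r, cult, soil in pots:
    tot_n, tot_r = totals.get((cult, soil), (0, 0))
    totals[(cult, soil)] = (tot_n + n, tot_r + r)
pbar = {k: r / n for k, (n, r) in totals.items()}

z = [(r / n - pbar[(cult, soil)]) / sqrt((r / n) * (1 - r / n) / n)
     for n, r, cult, soil in pots]

print(round(variance(z), 1))   # about 4.5, as stated above
```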
Because of this extra-binomial variation, it is not proper to simply ignore the pot and pool over the five pots for each cultivar-soil combination. This would be an example of sacrificial pseudo-replication as outlined by Hurlbert (1984). As you will see below, the pot-to-pot variation in the proportion that germinate is more than can be explained by simple binomial variation, i.e. there is a large random effect of pots that must be incorporated.
A naive analysis could proceed by finding the proportion of seeds that germinated in each pot (e.g. for pot 1, p = 8/16 = 0.50) and then doing a two-factor CRD analysis on these proportions using the model:

   p = Soil Cult Soil*Cult
This is not satisfactory because the number of seeds in each pot (n) varies considerably from pot to pot, and hence the variance of p also varies.^1 A weighted analysis could be performed which would partially solve this problem.
This naive analysis could be done using the Proc Mixed code:
proc mixed data=seeds;
   title2 'naive analysis on the proportions in each plot - using n as weight';
   class cult soil;
   model phat = cult soil cult*soil / ddfm=kr;
   lsmeans soil cult soil*cult / diff adjust=tukey;
   weight n;
   ods output CovParms=NaiveMixedCovparms;
   ods output Tests3=NaiveMixedTests3;
   ods output LSmeans=NaiveMixedLSMeans;
run;
This gives an estimate of the residual variance of:
^1 The binomial variance of each p for a pot would be found as p(1 - p)/n.
Cov Parm  Estimate
Residual  1.0057
The residual variance is a combination of the pot-to-pot variance and the variability of the p in each pot. As a rough guess, the average germination rate is around 0.5 with an average sample size of around 45. This would give a binomial variance of .5(1 - .5)/45 = .005. Hence the pot-to-pot variance is about .026 - .005 = .020, which is about 4× the binomial variance, as we saw earlier.
The following results were obtained for the tests of the main effects and interactions:
Effect     Num DF  Den DF  F Value  Pr > F
cult       1       16       1.57    0.2287
soil       1       16      10.86    0.0046
cult*soil  1       16       0.05    0.8177
Hence the naive analysis finds no evidence of an interaction effect of soil and cultivar, no evidence of a main effect of cultivar, but strong evidence of a main effect of soil upon the germination rate.
Here are the estimates of the marginal means:
Effect     cult  soil  Estimate  Standard Error  DF  t Value  Pr > |t|
soil             0     0.3720    0.04891         16   7.61    <.0001
soil             1     0.5952    0.04688         16  12.70    <.0001
cult       0           0.5260    0.05172         16  10.17    <.0001
cult       1           0.4412    0.04376         16  10.08    <.0001
cult*soil  0     0     0.4064    0.07334         16   5.54    <.0001
cult*soil  0     1     0.6455    0.07295         16   8.85    <.0001
cult*soil  1     0     0.3375    0.06473         16   5.21    <.0001
cult*soil  1     1     0.5448    0.05889         16   9.25    <.0001
The estimated marginal mean germination rates are relatively precise. The standard errors are not equal
because of the differing sample sizes in the pots in the various soil-cultivar combinations.
The pots serve as clusters in this experiment, so the ad hoc methods of correcting for overdispersion caused by cluster effects can also be used, i.e. estimating the overdispersion factor (c) and multiplying the standard errors by √c. This can be done using Proc Genmod in SAS:
proc genmod data=seeds;
   title2 'overdispersion model';
   class cult soil;
   model r/n = cult soil cult*soil /
         dist=binomial link=logit scale=deviance type3;
   lsmeans cult soil cult*soil / diff;
   ods output ModelFit=GenmodModelFit;
   ods output Type3   =GenmodType3;
   ods output LSmeans =GenmodLSmeans;
run;
This will use the ratio of the deviance to its degrees of freedom to estimate the overdispersion factor. This gives the following estimate for the overdispersion factor:
Criterion DF Value Value/DF
Deviance 16 68.3465 4.2717
Pearson Chi-Square 16 66.7619 4.1726
Once again we see that the overall variation is over 4× larger than expected under the binomial model.
It is not possible to obtain an explicit estimate of the actual pot-to-pot variance.
The test for effects are:
Source     Num DF  Den DF  F Value  Pr > F  Chi-Square  Pr > ChiSq  Method
cult       1       16       1.55    0.2317   1.55       0.2138      LR
soil       1       16      10.39    0.0053  10.39       0.0013      LR
cult*soil  1       16       0.05    0.8325   0.05       0.8298      LR
The results are similar to the naive analysis seen earlier.
The estimated means are now on the logit scale:
Statement Number  Effect     cult  soil  Estimate  Standard Error  z Value  Pr > |z|
1                 cult       0           0.1103    0.2199           0.50    0.6161
1                 cult       1          -0.2473    0.1864          -1.33    0.1846
1                 soil             0    -0.5266    0.2087          -2.52    0.0116
1                 soil             1     0.3896    0.1989           1.96    0.0501
1                 cult*soil  0     0    -0.3788    0.3077          -1.23    0.2183
1                 cult*soil  0     1     0.5993    0.3143           1.91    0.0565
1                 cult*soil  1     0    -0.6745    0.2821          -2.39    0.0168
1                 cult*soil  1     1     0.1798    0.2437           0.74    0.4607
These can be converted back to the regular scale using the inverse transformation:

   p_hat(cult=0) = expit(.1103) = .53

which is similar to the previous results. Note that the se must be converted using the delta-method, and not simply by applying the expit transformation, and is found to be

   se(p_hat(cult=0)) = se(logit(p_hat(cult=0))) * p_hat(cult=0) * (1 - p_hat(cult=0)) = .2199(.53)(1 - .53) = .055

which again matches the naive analysis.
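The expit back-transformation and the delta-method standard error can be verified with a short Python sketch (illustrative only, not part of the SAS output):

```python
# An illustrative back-transformation (not part of the SAS output) of the
# cult=0 marginal mean from the logit scale, with a delta-method SE.
from math import exp

logit_est, logit_se = 0.1103, 0.2199   # from the Genmod lsmeans output above

p  = exp(logit_est) / (1 + exp(logit_est))   # expit back-transformation
se = logit_se * p * (1 - p)                  # delta method: se(p) = se(logit) * p(1 - p)

print(round(p, 2), round(se, 3))   # 0.53 0.055
```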
Finally, a logistic ANOVA that explicitly models the random effects of pots can be done using Proc Glimmix in SAS:
proc glimmix data=seeds plots=all;   /* plots= requests residual and other plots */
   title2 'random effect model';
   class cult soil pot;
   model r/n = cult soil cult*soil / dist=binomial link=logit ddfm=kr;
   random pot(cult*soil) / type=vc;
   lsmeans cult / diff adjust=tukey plots=(meanplot(cl ilink) diffplot);
   lsmeans soil / diff adjust=tukey plots=(meanplot(cl ilink) diffplot);
   lsmeans soil / diff cl oddsratio;   /* get the odds ratio */
   lsmeans cult*soil / diff adjust=tukey plots=(meanplot(cl ilink) diffplot);   /* plots= requests ods plots */
   ods output CovParms=GlimmixCovParms;
   ods output Tests3  =GlimmixTests3;
   ods output LSmeans =GlimmixLSmeans;
   ods output Diffs   =GlimmixDiffs;
run;
This corresponds to the model in the shorthand notation:
   logit(p) = cult soil cult*soil pots(cult*soil)(R)

or the generalized linear model:

   r_ij ~ Binomial(n_ij, p_ij)
   theta_ij = logit(p_ij)
   theta_ij = cult soil cult*soil pots(cult*soil)(R)
Note that pots are nested within each cultivar-soil combination.
This procedure can also produce residual and other plots as requested using the plots keyword.^2
The estimated pot-to-pot variance (on the logit scale) is found as:
Parameter       Estimate  Standard Error
pot(cult*soil)  0.3208    0.1571
There is no easy way to convert this to an estimate of the pot-to-pot variation on the regular scale.
Tests for main effects and interactions are:
Effect     Num DF  Den DF  F Value  Pr > F
cult       1       15.07   1.00     0.3342
soil       1       15.07   7.71     0.0141
cult*soil  1       15.07   0.02     0.8791
These match the results seen earlier.
Similarly, estimates of main effects (on the logit scale) are found to be:
Effect     cult  soil  Estimate  Standard Error  DF     t Value  Pr > |t|
cult       0           0.02619   0.2164          16      0.12    0.9052
cult       1          -0.2705    0.2039          13.77  -1.33    0.2064
soil             0    -0.5350    0.2096          15.19  -2.55    0.0219
soil             1     0.2907    0.2109          14.96   1.38    0.1883
cult*soil  0     0    -0.4096    0.3001          15.87  -1.36    0.1913
cult*soil  0     1     0.4620    0.3118          16      1.48    0.1578
cult*soil  1     0    -0.6603    0.2927          14.51  -2.26    0.0400
cult*soil  1     1     0.1194    0.2841          13.04   0.42    0.6811

^2 The ODS GRAPHICS ON; statement must precede this request. See the program for details.
These are again comparable to the previous results.
The advantage of the Glimmix procedure is the availability of model diagnostic plots. For example, the residual plot panel indicates the potential presence of at least one outlier whose germination rate (on the logit scale) is well below that predicted. The normal-probability plot (on the logit scale) also identifies this one potential outlier.
Estimates of the differences between the mean logits can also be found and the listing is available in the
SAS output. For example, the estimated difference (on the logit scale) between the two marginal means of
the soil combinations is:
Effect  soil  _soil  Estimate  Standard Error  DF     t Value  Pr > |t|  Adjustment    Adj P
soil    0     1      -0.8257   0.2973          15.07  -2.78    0.0141    Tukey-Kramer  0.0141

Odds Ratio  Alpha  Lower Confidence Limit for Odds Ratio  Upper Confidence Limit for Odds Ratio
0.438       0.05   0.232                                  0.825
Hence the estimated difference in the mean logits is -.83 (SE .30). This can be converted to an odds ratio using the methods seen earlier, i.e.

   OR(soil 0:1) = exp(-.8257) = .44

which implies that the odds of germination in soil=0 is only 44% of the odds of germination in soil=1. The 95% confidence interval for the odds-ratio is from .23 to .83.
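The conversion from the logit-scale difference to an odds ratio can be sketched in Python (illustrative, not part of the SAS output):

```python
# An illustrative conversion (not part of the SAS output) of the estimated
# difference in mean logits between the two soils into an odds ratio.
from math import exp

diff_logit = -0.8257             # soil 0 minus soil 1, from the Diffs table
odds_ratio = exp(diff_logit)

print(round(odds_ratio, 3))      # 0.438
print(round(1 / odds_ratio, 2))  # 2.28: odds of germination about 2.3x higher in soil=1
```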
In this simple experimental design, there is no obvious advantage to using logistic ANOVA with random
effects; however, in more complex designs such as split-plot designs, this is the only way to proceed.
Chapter 24
SAS CODE NOT DONE
Chapter 25
A short primer on residual plots
Contents
25.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012
25.2 ANOVA residual plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
25.3 Logistic Regression residual plots - Part I . . . . . . . . . . . . . . . . . . . . . . . . 1015
25.4 Logistic Regression residual plots - Part II . . . . . . . . . . . . . . . . . . . . . . . . 1016
25.5 Poisson Regression residual plots - Part I . . . . . . . . . . . . . . . . . . . . . . . . . 1017
25.6 Poisson Regression residual plots - Part II . . . . . . . . . . . . . . . . . . . . . . . . 1019
Residual plots are one of the most important diagnostic tools available for model checking. However, residual plots can take a variety of forms depending upon the type of model fitted, and can appear confusing at first glance.
At its simplest, the residual is defined as:

   residual_i = observed_i - predicted_i

where the i-th residual is the difference between the observed and predicted values for the i-th observation.
These residuals are often standardized or studentized. Standardization occurs when all of the residuals are divided by a common, average standard deviation of the residuals. Studentization occurs when each individual residual is divided by its own standard deviation, which may vary among the residuals. For example, in simple linear regression, the standardized residuals are divided by sqrt(MSE), which is an estimate of the common standard deviation about the regression line. However, residuals near the middle of the regression line (i.e. near X-bar) are more variable than residuals near the extremities of the line. The studentized residual is divided by s*sqrt(1 - h_ii), where h_ii is the leverage value for the i-th observation.
Regardless of whether standardized or studentized residuals are used, these are plotted against the predicted values. A good model will have the residuals centered around zero, with a high proportion (about 95%) within ±2, and no pattern to the residuals.
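The standardization and studentization just described can be sketched with a small Python example (the data here are made up for illustration; they are not the Fitness data used below):

```python
# An illustrative computation (toy data, not the Fitness data) of standardized
# and studentized residuals for a simple linear regression, using the leverage
# h_ii = 1/n + (x_i - xbar)^2 / Sxx.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.7]

n    = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx  = sum((xi - xbar) ** 2 for xi in x)
b1   = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0   = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse   = sum(e ** 2 for e in resid) / (n - 2)
s     = sqrt(mse)

standardized = [e / s for e in resid]                         # common divisor sqrt(MSE)
leverage     = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]   # smallest at xbar
studentized  = [e / (s * sqrt(1 - h)) for e, h in zip(resid, leverage)]

print([round(t, 2) for t in studentized])   # [-0.16, -0.94, 1.1, 0.94, -1.4]
```

Note that the leverage is smallest at x = xbar, so the studentized divisor s*sqrt(1 - h) is largest there, consistent with the raw residuals being most variable near the middle of the line.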
25.1 Linear Regression
For example, consider the Fitness data set available in the JMP sample data library. This consists of measurements of males' and females' weight, age, pulse rates, and oxygen consumption as they completed a standardized fitness test.
Consider the model:

   Oxy_i = beta_0 + beta_1 * Weight_i + epsilon_i

or in a simplified notation

   Oxy = Weight
This model was fit, and the resulting residual plot^1 is:

This shows a random scatter around zero with only a few points outside the ±2 limits.

^1 This was constructed by (a) using the Analyze->Fit Model platform, (b) using the red-triangle to Save Columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, -2, and 2.
Notice that in simple regression, the Y variable is continuous, as is the X variable. Consequently, predictions are also continuous and so the plot of the residuals will show this random scatter (assuming the model fits well).
Similar plots are obtained in multiple regression, or ANCOVA models.
25.2 ANOVA residual plots
Consider now comparisons of Y values among different treatment groups. For example, is there a difference in the mean oxygen consumption between males and females as sampled in the Fitness data set?
The model is now:
Oxy = Sex
The model was fit, and the resulting residual plot^2 is:

^2 This was constructed by (a) using the Analyze->Fit Model platform with Sex as the X variable, (b) using the red-triangle to Save Columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, -2, and 2.
At first glance, this plot does not show a random scatter as there is a definite pattern with two vertical lines. However, on sober second thought, this is not surprising. There are only two levels of Sex and so there are at most two distinct predicted values, one for males and one for females. All females will have the same predicted value, and all males will have the same predicted value. These correspond to the two vertical positions on the plot. The scatter within each vertical line represents the variability of individuals in their oxygen consumptions within their respective group.
Points of concern would be those individuals whose studentized residual value is outside the ±2 lines.
If the X variable had k treatment groups, there would be k vertical lines.
25.3 Logistic Regression residual plots - Part I
Suppose we wish to predict membership in a category as a function of a continuous covariate. For example,
can we predict the sex of an individual based on their weight? This is known as logistic regression and is
discussed in another chapter in this series of notes.
Again refer to the Fitness dataset. The (Generalized Linear) model is:

   Y_i distributed as Binomial(p_i)
   theta_i = logit(p_i)
   theta_i = Weight

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like^3:
This plot looks a bit strange!

^3 I added reference lines at zero, -2, and 2 by clicking on the Y axis of the plot.
Along the bottom of the plot is the predicted probability of being female.^4 This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.
The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed (male) minus the predicted probability (.05).
Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females are assigned the value of 1. Hence the residual for this point is 0 - .05 = -0.05 which, after studentization, is plotted as shown.
The bottom line in the residual plot corresponds to the male subjects; the top line corresponds to the female subjects. Where are the areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the circled areas of the plot.
The residual plot's strange appearance is an artifact of the modeling process.
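The 0/1 recoding that produces this appearance can be sketched in Python (the sex labels and predicted probabilities below are hypothetical, chosen only for illustration):

```python
# An illustrative sketch (hypothetical numbers) of how the raw logistic
# regression residual is formed: the category is recoded to 0/1 and the
# predicted probability subtracted, which is why the residuals fall along
# two bands, one per observed category.
observations = [("m", 0.05), ("f", 0.70), ("f", 0.10), ("m", 0.60)]
# (sex, predicted probability of being female)

residuals = [(1 if sex == "f" else 0) - p for sex, p in observations]
print([round(r, 2) for r in residuals])   # [-0.05, 0.3, 0.9, -0.6]
```

The third and fourth points are the ones to worry about: a female predicted to be almost certainly male, and a male predicted to be probably female.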
25.4 Logistic Regression residual plots - Part II
What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case, you can expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.
For example, consider predicting survival rates of Titanic passengers as a function of their sex. This model is:

   Y_i distributed as Binomial(p_i)
   theta_i = logit(p_i)
   theta_i = Sex

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like^5:

^4 The first part of the output from the platform states that the probability of being female is being modeled.
^5 I added reference lines at zero, -2, and 2 by clicking on the Y axis of the plot.
The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to a 0 or 1 value, and the residuals are computed, which then correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!
This residual plot is rarely informative; after all, if there are only two outcomes and only two categories for the predictors, some observations must fall into the two outcomes for each of the two categories of predictors.
25.5 Poisson Regression residual plots - Part I
Poisson regression is similar to the case of multiple regression, but also has some features of the logistic
regression case. For example, the responses are counts which can only take discrete values (like the logistic
case), but there can be a wide range of counts (like the multiple regression case).
For example, consider predicting the number of satellite males around female horseshoe crabs as a function of the body mass of the female. The model fit is:

   Y_i distributed as Poisson(mu_i)
   theta_i = log(mu_i)
   theta_i = Weight

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like^6:
The plot now has a series of lines. These correspond to the distinct values of Y (as in the logistic regression case), with the lowest line corresponding to crabs with Y = 0, the next to Y = 1, then Y = 2, and so on.
Again the areas of concern are those points outside of ±2. In this plot, there are several females with large numbers of satellite males that were predicted to have only 2 or 3 satellite males.
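The banded appearance can be sketched in Python. Here the Pearson residual (y - mu)/sqrt(mu) is used as one common choice of standardization (an assumption; the notes do not state which residual JMP plots), and the counts and predicted means are hypothetical:

```python
# An illustrative sketch (hypothetical counts and means) of why Poisson
# residual plots show one band per observed count: for a fixed y, the
# Pearson residual (y - mu)/sqrt(mu) is a smooth function of the mean mu.
from math import sqrt

pairs = [(0, 1.8), (1, 1.8), (2, 2.5), (8, 2.5)]   # (observed y, predicted mu)
pearson = [(y - mu) / sqrt(mu) for y, mu in pairs]
print([round(r, 2) for r in pearson])   # [-1.34, -0.6, -0.32, 3.48]
```

The last point, a count of 8 where only about 2.5 were predicted, is the kind of observation that lands well above the +2 reference line.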
^6 I added reference lines at zero, -2, and 2 by clicking on the Y axis of the plot.
25.6 Poisson Regression residual plots - Part II
Finally, consider the case where the X variable is also discrete. For example, consider trying to predict the number of satellite males as a function of the color of the female crab. The model fit is:

   Y_i distributed as Poisson(mu_i)
   theta_i = log(mu_i)
   theta_i = Color

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like^7:
Because the X variable is nominally scaled with 4 levels, there are four vertical lines on the plot (note that two of the predicted values are very close together around the 2.25 area and can barely be distinguished). Because the Y values are restricted to non-negative integer values, there are again a series of lines corresponding to the discrete values of Y.
^7 I added reference lines at zero, -2, and 2 by clicking on the Y axis of the plot.
Again, points outside the ±2 reference lines may be of concern and may require further investigation.
Chapter 26
SAS CODE NOT DONE
Chapter 27
Tables
Contents
27.1 A table of uniform random digits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022
27.2 Selected Binomial individual probabilities . . . . . . . . . . . . . . . . . . . . . . . . 1026
27.3 Selected Poisson individual probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 1034
27.4 Cumulative probability for the Standard Normal Distribution . . . . . . . . . . . . . 1037
27.5 Selected percentiles from the t-distribution . . . . . . . . . . . . . . . . . . . . . . . . 1039
27.6 Selected percentiles from the chi-squared-distribution . . . . . . . . . . . . . . . . . . 1040
27.7 Sample size determination for a two sample t-test . . . . . . . . . . . . . . . . . . . . 1041
27.8 Power determination for a two sample t-test . . . . . . . . . . . . . . . . . . . . . . . 1043
27.9 Sample size determination for a single factor, fixed effects, CRD . . . . . . . . . . 1045
27.10 Power determination for a single factor, fixed effects, CRD . . . . . . . . . . . . . 1049
27.1 A table of uniform random digits
Row <--------- uniform random digits -------------------------->
1 57245 39666 18545 50534 57654 25519 35477 71309 12212 98911
2 42726 58321 59267 72742 53968 63679 54095 56563 09820 86291
3 82768 32694 62828 19097 09877 32093 23518 08654 64815 19894
4 97742 58918 33317 34192 06286 39824 74264 01941 95810 26247
5 48332 38634 20510 09198 56256 04431 22753 20944 95319 29515
6 26700 40484 28341 25428 08806 98858 04816 16317 94928 05512
7 66156 16407 57395 86230 47495 13908 97015 58225 82255 01956
8 64062 10061 01923 29260 32771 71002 58132 58646 69089 63694
9 24713 95591 26970 37647 26282 89759 69034 55281 64853 50837
10 90417 18344 22436 77006 87841 94322 45526 38145 86554 42733
11 78886 86557 11295 07253 29289 44814 58898 36929 66839 81250
12 39681 54696 38482 48217 73598 93649 92705 34912 18981 74299
13 38265 45196 31143 82190 27279 79883 20219 38823 84543 22119
14 34270 41885 00079 63600 59152 10670 27951 77830 05368 58315
15 73869 34748 75787 88844 89522 71436 04166 06246 20952 56808
16 21732 36017 69149 70330 90500 73110 92908 55789 73450 68282
17 72583 49811 67519 98476 97889 37112 94963 91140 24571 23446
18 72678 49483 57039 18420 74773 16869 72077 27720 14058 66743
19 88572 01294 14117 56884 77107 53023 02243 26415 52233 12818
20 82868 59988 42323 96542 96733 00056 74887 21914 48300 96404
21 09949 56572 28104 64281 01217 76250 39511 19059 85172 35273
22 41942 91440 81609 38147 59406 88491 18079 29786 81499 85390
23 46777 74928 91290 55022 56629 01335 61379 71134 86187 70717
24 58280 17867 07990 85055 55279 83390 37598 93350 05666 55402
25 87042 55080 76185 19947 79551 77594 87381 99430 44251 30896
26 72183 39856 94385 55160 50680 68443 95437 74302 06204 71004
27 76768 16066 94109 90685 92058 81744 99133 36354 34292 90092
28 21703 64616 03431 47610 31968 61593 36259 70600 53491 95542
29 78269 12087 32204 81177 30333 83630 06026 89308 94179 54907
30 49285 16579 22109 63651 34778 28631 27285 95751 91704 59819
31 90016 10303 81862 41351 88681 76632 15336 91955 38436 43892
32 63651 93677 08027 80384 71134 79937 23322 10577 21413 86688
33 02780 37186 74076 33376 03782 64199 77333 12812 78027 89926
34 49414 09022 38644 53038 34634 36565 01984 88477 83879 60943
35 53861 74046 04778 08365 83104 79004 88335 54047 99675 41864
36 78677 55123 73447 00158 61482 02808 83475 59932 19044 27318
37 74550 84403 56850 83780 88847 65591 03859 58670 60057 25225
38 22866 64152 35023 35701 98228 53388 82321 34392 09589 97340
39 17601 32926 06120 27626 48687 42885 25858 53920 95764 84716
40 20862 64222 96951 19524 15866 52508 03763 98033 87268 71167
41 71490 83428 78903 81931 24345 37331 03971 38118 01065 36010
42 21050 12825 28217 99510 86900 09987 91244 06520 81108 87266
43 91632 96199 54191 77480 33049 00849 96668 65865 25164 98330
44 46988 84607 55711 43874 26532 76307 38846 55961 83227 16069
45 72200 24023 55848 09162 44976 15663 34697 83365 82930 63392
46 88621 25822 78463 72191 00625 85945 72522 29613 46473 51177
47 15384 03326 32091 20199 70046 64343 20566 79050 43837 15831
48 46499 94631 17985 09369 19009 51848 58794 48921 22845 55264
49 13520 96795 79714 66338 79836 44430 89290 06167 69090 29476
50 24323 00280 73922 43447 00319 92899 75411 91840 39594 17621
© 2012 Carl James Schwarz December 21, 2012
51 99090 55543 87734 80685 74261 70848 87196 59085 28471 74971
52 97585 33311 68919 33189 49987 24081 79404 45363 46920 94760
53 97622 85282 58594 83977 25002 39124 58350 67845 17771 58031
54 24260 21646 75111 41560 90082 57613 93807 04060 94811 60124
55 65250 83876 34806 08796 53719 94310 94363 55289 81226 18190
56 45817 37470 73508 84200 73933 80187 26207 69917 58064 95000
57 48898 28088 77723 81458 18981 35389 17199 85718 18019 66290
58 23900 87304 91349 27541 42047 23002 47976 99586 96453 06861
59 38635 66539 55139 56894 01608 05068 21910 41858 15382 98701
60 58095 49005 59108 12315 35856 19651 55545 79711 42424 67008
61 76474 40345 47744 45224 42903 86698 09851 87819 81523 34272
62 03535 70021 61645 84268 65636 94414 06266 12237 43147 16894
63 14364 82782 07176 53522 06834 46016 42758 04753 00023 15300
64 91751 29817 90578 31800 13393 35965 41128 92983 61660 50106
65 56151 59329 22926 66357 41724 68645 04327 27543 18723 11957
66 57881 15295 43246 47103 15977 84216 78875 06677 77219 50803
67 36126 70899 51669 79958 93311 62555 70694 16626 35623 18758
68 73389 33283 66929 73444 31434 10263 16868 74346 84838 82770
69 77383 40683 84063 45412 21358 84024 88935 77583 33522 53090
70 62798 96248 60474 36149 21187 23194 03696 74445 54525 12869
71 12283 00561 29955 05775 34520 47217 26059 35414 65998 49766
72 78433 49762 41177 80949 32843 64714 40450 15064 11389 78409
73 26348 29480 65497 34615 12888 19977 17597 25914 36394 79315
74 26078 36705 83043 61592 12459 61255 40550 59892 66163 97848
75 40115 70829 00654 12791 85668 19015 82785 92889 35041 18949
76 81560 62666 77627 09123 63484 49481 60451 88073 71000 63511
77 34074 51484 59356 20301 22365 95862 46995 26284 45273 35706
78 42176 81350 05941 09754 16987 98248 90319 33116 39120 34765
79 63288 62381 58461 13225 57138 19619 30877 82640 24888 02600
80 88820 33240 78977 98928 41160 29671 33299 95592 38493 05321
81 63532 20433 25690 09557 90207 95808 57383 68622 13359 25371
82 39033 68857 74705 91718 77485 32496 30737 28551 69056 95615
83 46964 90715 01804 14953 97658 71613 90353 78189 03195 73795
84 03528 92683 29740 31679 22941 92131 69021 21325 70930 19548
85 67027 36641 74347 54500 80074 94364 10164 99309 66272 24925
86 65462 73352 17392 09552 74361 46123 13020 63169 98318 91666
87 55797 95254 84279 88885 65569 96791 66118 05817 17867 88254
88 58697 56009 20438 06653 93978 51961 97609 97367 02795 04718
89 97876 76551 19215 87623 55326 85282 86292 18328 55016 84126
90 72443 02607 13183 06156 76680 62398 79369 77374 78292 41027
91 96152 80526 62087 12197 59252 68312 39759 63535 23675 47358
92 10277 64926 33378 48335 35488 47577 85954 97588 75873 31350
93 77557 25011 86663 97410 99845 42709 48407 63841 14727 00484
94 68784 85951 54232 30976 48666 15927 73072 00907 76237 56914
95 67778 30262 16944 36130 77604 34923 92336 66565 94490 68039
96 94104 06985 81837 53674 36266 21688 68769 18492 12242 34164
97 70107 17900 53497 71908 18186 59909 00400 53236 23016 70860
98 07847 64852 37719 68837 60757 92158 80433 17687 08916 01706
99 33167 35411 27473 13393 17714 59680 30888 98213 93364 03219
100 84527 88986 01665 23547 74666 25487 34977 59681 38520 57293
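Tables of random digits such as Table 27.1 predate cheap computing; today an equivalent table can be generated directly. A minimal sketch in Python — the seed and the 5-digit-group formatting here are illustrative assumptions, not those used to produce Table 27.1:

```python
import random

random.seed(2012)  # illustrative seed only; the table's own source is unknown

def random_digit_row(groups=10, digits_per_group=5):
    """Return one row of uniformly distributed decimal digits,
    formatted as space-separated 5-digit groups."""
    return " ".join(
        "".join(str(random.randint(0, 9)) for _ in range(digits_per_group))
        for _ in range(groups)
    )

for row in range(1, 6):
    print(f"{row:3d} {random_digit_row()}")
```

In practice, rather than reading 3-digit groups from such a table to select, say, 25 units out of 500, one would simply call `random.sample(range(1, 501), 25)`.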
27.2 Selected Binomial individual probabilities
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
2 0 0.8100 0.6400 0.4900 0.3600 0.2500 0.1600 0.0900 0.0400 0.0100
2 1 0.1800 0.3200 0.4200 0.4800 0.5000 0.4800 0.4200 0.3200 0.1800
2 2 0.0100 0.0400 0.0900 0.1600 0.2500 0.3600 0.4900 0.6400 0.8100
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
3 0 0.7290 0.5120 0.3430 0.2160 0.1250 0.0640 0.0270 0.0080 0.0010
3 1 0.2430 0.3840 0.4410 0.4320 0.3750 0.2880 0.1890 0.0960 0.0270
3 2 0.0270 0.0960 0.1890 0.2880 0.3750 0.4320 0.4410 0.3840 0.2430
3 3 0.0010 0.0080 0.0270 0.0640 0.1250 0.2160 0.3430 0.5120 0.7290
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
4 0 0.6561 0.4096 0.2401 0.1296 0.0625 0.0256 0.0081 0.0016 0.0001
4 1 0.2916 0.4096 0.4116 0.3456 0.2500 0.1536 0.0756 0.0256 0.0036
4 2 0.0486 0.1536 0.2646 0.3456 0.3750 0.3456 0.2646 0.1536 0.0486
4 3 0.0036 0.0256 0.0756 0.1536 0.2500 0.3456 0.4116 0.4096 0.2916
4 4 0.0001 0.0016 0.0081 0.0256 0.0625 0.1296 0.2401 0.4096 0.6561
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
5 0 0.5905 0.3277 0.1681 0.0778 0.0313 0.0102 0.0024 0.0003 0.0000
5 1 0.3280 0.4096 0.3601 0.2592 0.1563 0.0768 0.0284 0.0064 0.0005
5 2 0.0729 0.2048 0.3087 0.3456 0.3125 0.2304 0.1323 0.0512 0.0081
5 3 0.0081 0.0512 0.1323 0.2304 0.3125 0.3456 0.3087 0.2048 0.0729
5 4 0.0005 0.0064 0.0283 0.0768 0.1563 0.2592 0.3601 0.4096 0.3281
5 5 0.0000 0.0003 0.0024 0.0102 0.0313 0.0778 0.1681 0.3277 0.5905
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
6 0 0.5314 0.2621 0.1176 0.0467 0.0156 0.0041 0.0007 0.0001 0.0000
6 1 0.3543 0.3932 0.3025 0.1866 0.0938 0.0369 0.0102 0.0015 0.0001
6 2 0.0984 0.2458 0.3241 0.3110 0.2344 0.1382 0.0595 0.0154 0.0012
6 3 0.0146 0.0819 0.1852 0.2765 0.3125 0.2765 0.1852 0.0819 0.0146
6 4 0.0012 0.0154 0.0595 0.1382 0.2344 0.3110 0.3241 0.2458 0.0984
6 5 0.0001 0.0015 0.0102 0.0369 0.0938 0.1866 0.3025 0.3932 0.3543
6 6 0.0000 0.0001 0.0007 0.0041 0.0156 0.0467 0.1176 0.2621 0.5314
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
7 0 0.4783 0.2097 0.0824 0.0280 0.0078 0.0016 0.0002 0.0000 0.0000
7 1 0.3720 0.3670 0.2471 0.1306 0.0547 0.0172 0.0036 0.0004 0.0000
7 2 0.1240 0.2753 0.3177 0.2613 0.1641 0.0774 0.0250 0.0043 0.0002
7 3 0.0230 0.1147 0.2269 0.2903 0.2734 0.1935 0.0972 0.0287 0.0026
7 4 0.0026 0.0287 0.0972 0.1935 0.2734 0.2903 0.2269 0.1147 0.0230
7 5 0.0002 0.0043 0.0250 0.0774 0.1641 0.2613 0.3177 0.2753 0.1240
7 6 0.0000 0.0004 0.0036 0.0172 0.0547 0.1306 0.2471 0.3670 0.3720
7 7 0.0000 0.0000 0.0002 0.0016 0.0078 0.0280 0.0824 0.2097 0.4783
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
8 0 0.4305 0.1678 0.0576 0.0168 0.0039 0.0007 0.0001 0.0000 0.0000
8 1 0.3826 0.3355 0.1977 0.0896 0.0313 0.0079 0.0012 0.0001 0.0000
8 2 0.1488 0.2936 0.2965 0.2090 0.1094 0.0413 0.0100 0.0011 0.0000
8 3 0.0331 0.1468 0.2541 0.2787 0.2188 0.1239 0.0467 0.0092 0.0004
8 4 0.0046 0.0459 0.1361 0.2322 0.2734 0.2322 0.1361 0.0459 0.0046
8 5 0.0004 0.0092 0.0467 0.1239 0.2188 0.2787 0.2541 0.1468 0.0331
8 6 0.0000 0.0011 0.0100 0.0413 0.1094 0.2090 0.2965 0.2936 0.1488
8 7 0.0000 0.0001 0.0012 0.0079 0.0313 0.0896 0.1977 0.3355 0.3826
8 8 0.0000 0.0000 0.0001 0.0007 0.0039 0.0168 0.0576 0.1678 0.4305
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
9 0 0.3874 0.1342 0.0404 0.0101 0.0020 0.0003 0.0000 0.0000 0.0000
9 1 0.3874 0.3020 0.1556 0.0605 0.0176 0.0035 0.0004 0.0000 0.0000
9 2 0.1722 0.3020 0.2668 0.1612 0.0703 0.0212 0.0039 0.0003 0.0000
9 3 0.0446 0.1762 0.2668 0.2508 0.1641 0.0743 0.0210 0.0028 0.0001
9 4 0.0074 0.0661 0.1715 0.2508 0.2461 0.1672 0.0735 0.0165 0.0008
9 5 0.0008 0.0165 0.0735 0.1672 0.2461 0.2508 0.1715 0.0661 0.0074
9 6 0.0001 0.0028 0.0210 0.0743 0.1641 0.2508 0.2668 0.1762 0.0446
9 7 0.0000 0.0003 0.0039 0.0212 0.0703 0.1612 0.2668 0.3020 0.1722
9 8 0.0000 0.0000 0.0004 0.0035 0.0176 0.0605 0.1556 0.3020 0.3874
9 9 0.0000 0.0000 0.0000 0.0003 0.0020 0.0101 0.0404 0.1342 0.3874
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
10 0 0.3487 0.1074 0.0282 0.0060 0.0010 0.0001 0.0000 0.0000 0.0000
10 1 0.3874 0.2684 0.1211 0.0403 0.0098 0.0016 0.0001 0.0000 0.0000
10 2 0.1937 0.3020 0.2335 0.1209 0.0439 0.0106 0.0014 0.0001 0.0000
10 3 0.0574 0.2013 0.2668 0.2150 0.1172 0.0425 0.0090 0.0008 0.0000
10 4 0.0112 0.0881 0.2001 0.2508 0.2051 0.1115 0.0368 0.0055 0.0001
10 5 0.0015 0.0264 0.1029 0.2007 0.2461 0.2007 0.1029 0.0264 0.0015
10 6 0.0001 0.0055 0.0368 0.1115 0.2051 0.2508 0.2001 0.0881 0.0112
10 7 0.0000 0.0008 0.0090 0.0425 0.1172 0.2150 0.2668 0.2013 0.0574
10 8 0.0000 0.0001 0.0014 0.0106 0.0439 0.1209 0.2335 0.3020 0.1937
10 9 0.0000 0.0000 0.0001 0.0016 0.0098 0.0403 0.1211 0.2684 0.3874
10 10 0.0000 0.0000 0.0000 0.0001 0.0010 0.0060 0.0282 0.1074 0.3487
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
11 0 0.3138 0.0859 0.0198 0.0036 0.0005 0.0000 0.0000 0.0000 0.0000
11 1 0.3835 0.2362 0.0932 0.0266 0.0054 0.0007 0.0000 0.0000 0.0000
11 2 0.2131 0.2953 0.1998 0.0887 0.0269 0.0052 0.0005 0.0000 0.0000
11 3 0.0710 0.2215 0.2568 0.1774 0.0806 0.0234 0.0037 0.0002 0.0000
11 4 0.0158 0.1107 0.2201 0.2365 0.1611 0.0701 0.0173 0.0017 0.0000
11 5 0.0025 0.0388 0.1321 0.2207 0.2256 0.1471 0.0566 0.0097 0.0003
11 6 0.0003 0.0097 0.0566 0.1471 0.2256 0.2207 0.1321 0.0388 0.0025
11 7 0.0000 0.0017 0.0173 0.0701 0.1611 0.2365 0.2201 0.1107 0.0158
11 8 0.0000 0.0002 0.0037 0.0234 0.0806 0.1774 0.2568 0.2215 0.0710
11 9 0.0000 0.0000 0.0005 0.0052 0.0269 0.0887 0.1998 0.2953 0.2131
11 10 0.0000 0.0000 0.0000 0.0007 0.0054 0.0266 0.0932 0.2362 0.3835
11 11 0.0000 0.0000 0.0000 0.0000 0.0005 0.0036 0.0198 0.0859 0.3138
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
12 0 0.2824 0.0687 0.0138 0.0022 0.0002 0.0000 0.0000 0.0000 0.0000
12 1 0.3766 0.2062 0.0712 0.0174 0.0029 0.0003 0.0000 0.0000 0.0000
12 2 0.2301 0.2835 0.1678 0.0639 0.0161 0.0025 0.0002 0.0000 0.0000
12 3 0.0852 0.2362 0.2397 0.1419 0.0537 0.0125 0.0015 0.0001 0.0000
12 4 0.0213 0.1329 0.2311 0.2128 0.1208 0.0420 0.0078 0.0005 0.0000
12 5 0.0038 0.0532 0.1585 0.2270 0.1934 0.1009 0.0291 0.0033 0.0000
12 6 0.0005 0.0155 0.0792 0.1766 0.2256 0.1766 0.0792 0.0155 0.0005
12 7 0.0000 0.0033 0.0291 0.1009 0.1934 0.2270 0.1585 0.0532 0.0038
12 8 0.0000 0.0005 0.0078 0.0420 0.1208 0.2128 0.2311 0.1329 0.0213
12 9 0.0000 0.0001 0.0015 0.0125 0.0537 0.1419 0.2397 0.2362 0.0852
12 10 0.0000 0.0000 0.0002 0.0025 0.0161 0.0639 0.1678 0.2835 0.2301
12 11 0.0000 0.0000 0.0000 0.0003 0.0029 0.0174 0.0712 0.2062 0.3766
12 12 0.0000 0.0000 0.0000 0.0000 0.0002 0.0022 0.0138 0.0687 0.2824
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
13 0 0.2542 0.0550 0.0097 0.0013 0.0001 0.0000 0.0000 0.0000 0.0000
13 1 0.3672 0.1787 0.0540 0.0113 0.0016 0.0001 0.0000 0.0000 0.0000
13 2 0.2448 0.2680 0.1388 0.0453 0.0095 0.0012 0.0001 0.0000 0.0000
13 3 0.0997 0.2457 0.2181 0.1107 0.0349 0.0065 0.0006 0.0000 0.0000
13 4 0.0277 0.1535 0.2337 0.1845 0.0873 0.0243 0.0034 0.0001 0.0000
13 5 0.0055 0.0691 0.1803 0.2214 0.1571 0.0656 0.0142 0.0011 0.0000
13 6 0.0008 0.0230 0.1030 0.1968 0.2095 0.1312 0.0442 0.0058 0.0001
13 7 0.0001 0.0058 0.0442 0.1312 0.2095 0.1968 0.1030 0.0230 0.0008
13 8 0.0000 0.0011 0.0142 0.0656 0.1571 0.2214 0.1803 0.0691 0.0055
13 9 0.0000 0.0001 0.0034 0.0243 0.0873 0.1845 0.2337 0.1535 0.0277
13 10 0.0000 0.0000 0.0006 0.0065 0.0349 0.1107 0.2181 0.2457 0.0997
13 11 0.0000 0.0000 0.0001 0.0012 0.0095 0.0453 0.1388 0.2680 0.2448
13 12 0.0000 0.0000 0.0000 0.0001 0.0016 0.0113 0.0540 0.1787 0.3672
13 13 0.0000 0.0000 0.0000 0.0000 0.0001 0.0013 0.0097 0.0550 0.2542
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
14 0 0.2288 0.0440 0.0068 0.0008 0.0001 0.0000 0.0000 0.0000 0.0000
14 1 0.3559 0.1539 0.0407 0.0073 0.0009 0.0001 0.0000 0.0000 0.0000
14 2 0.2570 0.2501 0.1134 0.0317 0.0056 0.0005 0.0000 0.0000 0.0000
14 3 0.1142 0.2501 0.1943 0.0845 0.0222 0.0033 0.0002 0.0000 0.0000
14 4 0.0349 0.1720 0.2290 0.1549 0.0611 0.0136 0.0014 0.0000 0.0000
14 5 0.0078 0.0860 0.1963 0.2066 0.1222 0.0408 0.0066 0.0003 0.0000
14 6 0.0013 0.0322 0.1262 0.2066 0.1833 0.0918 0.0232 0.0020 0.0000
14 7 0.0002 0.0092 0.0618 0.1574 0.2095 0.1574 0.0618 0.0092 0.0002
14 8 0.0000 0.0020 0.0232 0.0918 0.1833 0.2066 0.1262 0.0322 0.0013
14 9 0.0000 0.0003 0.0066 0.0408 0.1222 0.2066 0.1963 0.0860 0.0078
14 10 0.0000 0.0000 0.0014 0.0136 0.0611 0.1549 0.2290 0.1720 0.0349
14 11 0.0000 0.0000 0.0002 0.0033 0.0222 0.0845 0.1943 0.2501 0.1142
14 12 0.0000 0.0000 0.0000 0.0005 0.0056 0.0317 0.1134 0.2501 0.2570
14 13 0.0000 0.0000 0.0000 0.0001 0.0009 0.0073 0.0407 0.1539 0.3559
14 14 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0068 0.0440 0.2288
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
15 0 0.2059 0.0352 0.0047 0.0005 0.0000 0.0000 0.0000 0.0000 0.0000
15 1 0.3432 0.1319 0.0305 0.0047 0.0005 0.0000 0.0000 0.0000 0.0000
15 2 0.2669 0.2309 0.0916 0.0219 0.0032 0.0003 0.0000 0.0000 0.0000
15 3 0.1285 0.2501 0.1700 0.0634 0.0139 0.0016 0.0001 0.0000 0.0000
15 4 0.0428 0.1876 0.2186 0.1268 0.0417 0.0074 0.0006 0.0000 0.0000
15 5 0.0105 0.1032 0.2061 0.1859 0.0916 0.0245 0.0030 0.0001 0.0000
15 6 0.0019 0.0430 0.1472 0.2066 0.1527 0.0612 0.0116 0.0007 0.0000
15 7 0.0003 0.0138 0.0811 0.1771 0.1964 0.1181 0.0348 0.0035 0.0000
15 8 0.0000 0.0035 0.0348 0.1181 0.1964 0.1771 0.0811 0.0138 0.0003
15 9 0.0000 0.0007 0.0116 0.0612 0.1527 0.2066 0.1472 0.0430 0.0019
15 10 0.0000 0.0001 0.0030 0.0245 0.0916 0.1859 0.2061 0.1032 0.0105
15 11 0.0000 0.0000 0.0006 0.0074 0.0417 0.1268 0.2186 0.1876 0.0428
15 12 0.0000 0.0000 0.0001 0.0016 0.0139 0.0634 0.1700 0.2501 0.1285
15 13 0.0000 0.0000 0.0000 0.0003 0.0032 0.0219 0.0916 0.2309 0.2669
15 14 0.0000 0.0000 0.0000 0.0000 0.0005 0.0047 0.0305 0.1319 0.3432
15 15 0.0000 0.0000 0.0000 0.0000 0.0000 0.0005 0.0047 0.0352 0.2059
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
16 0 0.1853 0.0281 0.0033 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000
16 1 0.3294 0.1126 0.0228 0.0030 0.0002 0.0000 0.0000 0.0000 0.0000
16 2 0.2745 0.2111 0.0732 0.0150 0.0018 0.0001 0.0000 0.0000 0.0000
16 3 0.1423 0.2463 0.1465 0.0468 0.0085 0.0008 0.0000 0.0000 0.0000
16 4 0.0514 0.2001 0.2040 0.1014 0.0278 0.0040 0.0002 0.0000 0.0000
16 5 0.0137 0.1201 0.2099 0.1623 0.0667 0.0142 0.0013 0.0000 0.0000
16 6 0.0028 0.0550 0.1649 0.1983 0.1222 0.0392 0.0056 0.0002 0.0000
16 7 0.0004 0.0197 0.1010 0.1889 0.1746 0.0840 0.0185 0.0012 0.0000
16 8 0.0001 0.0055 0.0487 0.1417 0.1964 0.1417 0.0487 0.0055 0.0001
16 9 0.0000 0.0012 0.0185 0.0840 0.1746 0.1889 0.1010 0.0197 0.0004
16 10 0.0000 0.0002 0.0056 0.0392 0.1222 0.1983 0.1649 0.0550 0.0028
16 11 0.0000 0.0000 0.0013 0.0142 0.0667 0.1623 0.2099 0.1201 0.0137
16 12 0.0000 0.0000 0.0002 0.0040 0.0278 0.1014 0.2040 0.2001 0.0514
16 13 0.0000 0.0000 0.0000 0.0008 0.0085 0.0468 0.1465 0.2463 0.1423
16 14 0.0000 0.0000 0.0000 0.0001 0.0018 0.0150 0.0732 0.2111 0.2745
16 15 0.0000 0.0000 0.0000 0.0000 0.0002 0.0030 0.0228 0.1126 0.3294
16 16 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0033 0.0281 0.1853
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
17 0 0.1668 0.0225 0.0023 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000
17 1 0.3150 0.0957 0.0169 0.0019 0.0001 0.0000 0.0000 0.0000 0.0000
17 2 0.2800 0.1914 0.0581 0.0102 0.0010 0.0001 0.0000 0.0000 0.0000
17 3 0.1556 0.2393 0.1245 0.0341 0.0052 0.0004 0.0000 0.0000 0.0000
17 4 0.0605 0.2093 0.1868 0.0796 0.0182 0.0021 0.0001 0.0000 0.0000
17 5 0.0175 0.1361 0.2081 0.1379 0.0472 0.0081 0.0006 0.0000 0.0000
17 6 0.0039 0.0680 0.1784 0.1839 0.0944 0.0242 0.0026 0.0001 0.0000
17 7 0.0007 0.0267 0.1201 0.1927 0.1484 0.0571 0.0095 0.0004 0.0000
17 8 0.0001 0.0084 0.0644 0.1606 0.1855 0.1070 0.0276 0.0021 0.0000
17 9 0.0000 0.0021 0.0276 0.1070 0.1855 0.1606 0.0644 0.0084 0.0001
17 10 0.0000 0.0004 0.0095 0.0571 0.1484 0.1927 0.1201 0.0267 0.0007
17 11 0.0000 0.0001 0.0026 0.0242 0.0944 0.1839 0.1784 0.0680 0.0039
17 12 0.0000 0.0000 0.0006 0.0081 0.0472 0.1379 0.2081 0.1361 0.0175
17 13 0.0000 0.0000 0.0001 0.0021 0.0182 0.0796 0.1868 0.2093 0.0605
17 14 0.0000 0.0000 0.0000 0.0004 0.0052 0.0341 0.1245 0.2393 0.1556
17 15 0.0000 0.0000 0.0000 0.0001 0.0010 0.0102 0.0581 0.1914 0.2800
17 16 0.0000 0.0000 0.0000 0.0000 0.0001 0.0019 0.0169 0.0957 0.3150
17 17 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0023 0.0225 0.1668
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
18 0 0.1501 0.0180 0.0016 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
18 1 0.3002 0.0811 0.0126 0.0012 0.0001 0.0000 0.0000 0.0000 0.0000
18 2 0.2835 0.1723 0.0458 0.0069 0.0006 0.0000 0.0000 0.0000 0.0000
18 3 0.1680 0.2297 0.1046 0.0246 0.0031 0.0002 0.0000 0.0000 0.0000
18 4 0.0700 0.2153 0.1681 0.0614 0.0117 0.0011 0.0000 0.0000 0.0000
18 5 0.0218 0.1507 0.2017 0.1146 0.0327 0.0045 0.0002 0.0000 0.0000
18 6 0.0052 0.0816 0.1873 0.1655 0.0708 0.0145 0.0012 0.0000 0.0000
18 7 0.0010 0.0350 0.1376 0.1892 0.1214 0.0374 0.0046 0.0001 0.0000
18 8 0.0002 0.0120 0.0811 0.1734 0.1669 0.0771 0.0149 0.0008 0.0000
18 9 0.0000 0.0033 0.0386 0.1284 0.1855 0.1284 0.0386 0.0033 0.0000
18 10 0.0000 0.0008 0.0149 0.0771 0.1669 0.1734 0.0811 0.0120 0.0002
18 11 0.0000 0.0001 0.0046 0.0374 0.1214 0.1892 0.1376 0.0350 0.0010
18 12 0.0000 0.0000 0.0012 0.0145 0.0708 0.1655 0.1873 0.0816 0.0052
18 13 0.0000 0.0000 0.0002 0.0045 0.0327 0.1146 0.2017 0.1507 0.0218
18 14 0.0000 0.0000 0.0000 0.0011 0.0117 0.0614 0.1681 0.2153 0.0700
18 15 0.0000 0.0000 0.0000 0.0002 0.0031 0.0246 0.1046 0.2297 0.1680
18 16 0.0000 0.0000 0.0000 0.0000 0.0006 0.0069 0.0458 0.1723 0.2835
18 17 0.0000 0.0000 0.0000 0.0000 0.0001 0.0012 0.0126 0.0811 0.3002
18 18 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0016 0.0180 0.1501
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
19 0 0.1351 0.0144 0.0011 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
19 1 0.2852 0.0685 0.0093 0.0008 0.0000 0.0000 0.0000 0.0000 0.0000
19 2 0.2852 0.1540 0.0358 0.0046 0.0003 0.0000 0.0000 0.0000 0.0000
19 3 0.1796 0.2182 0.0869 0.0175 0.0018 0.0001 0.0000 0.0000 0.0000
19 4 0.0798 0.2182 0.1491 0.0467 0.0074 0.0005 0.0000 0.0000 0.0000
19 5 0.0266 0.1636 0.1916 0.0933 0.0222 0.0024 0.0001 0.0000 0.0000
19 6 0.0069 0.0955 0.1916 0.1451 0.0518 0.0085 0.0005 0.0000 0.0000
19 7 0.0014 0.0443 0.1525 0.1797 0.0961 0.0237 0.0022 0.0000 0.0000
19 8 0.0002 0.0166 0.0981 0.1797 0.1442 0.0532 0.0077 0.0003 0.0000
19 9 0.0000 0.0051 0.0514 0.1464 0.1762 0.0976 0.0220 0.0013 0.0000
19 10 0.0000 0.0013 0.0220 0.0976 0.1762 0.1464 0.0514 0.0051 0.0000
19 11 0.0000 0.0003 0.0077 0.0532 0.1442 0.1797 0.0981 0.0166 0.0002
19 12 0.0000 0.0000 0.0022 0.0237 0.0961 0.1797 0.1525 0.0443 0.0014
19 13 0.0000 0.0000 0.0005 0.0085 0.0518 0.1451 0.1916 0.0955 0.0069
19 14 0.0000 0.0000 0.0001 0.0024 0.0222 0.0933 0.1916 0.1636 0.0266
19 15 0.0000 0.0000 0.0000 0.0005 0.0074 0.0467 0.1491 0.2182 0.0798
19 16 0.0000 0.0000 0.0000 0.0001 0.0018 0.0175 0.0869 0.2182 0.1796
19 17 0.0000 0.0000 0.0000 0.0000 0.0003 0.0046 0.0358 0.1540 0.2852
19 18 0.0000 0.0000 0.0000 0.0000 0.0000 0.0008 0.0093 0.0685 0.2852
19 19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0011 0.0144 0.1351
Selected values of p
n x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
----------------------------------------------------------------------------------
20 0 0.1216 0.0115 0.0008 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
20 1 0.2702 0.0576 0.0068 0.0005 0.0000 0.0000 0.0000 0.0000 0.0000
20 2 0.2852 0.1369 0.0278 0.0031 0.0002 0.0000 0.0000 0.0000 0.0000
20 3 0.1901 0.2054 0.0716 0.0123 0.0011 0.0000 0.0000 0.0000 0.0000
20 4 0.0898 0.2182 0.1304 0.0350 0.0046 0.0003 0.0000 0.0000 0.0000
20 5 0.0319 0.1746 0.1789 0.0746 0.0148 0.0013 0.0000 0.0000 0.0000
20 6 0.0089 0.1091 0.1916 0.1244 0.0370 0.0049 0.0002 0.0000 0.0000
20 7 0.0020 0.0545 0.1643 0.1659 0.0739 0.0146 0.0010 0.0000 0.0000
20 8 0.0004 0.0222 0.1144 0.1797 0.1201 0.0355 0.0039 0.0001 0.0000
20 9 0.0001 0.0074 0.0654 0.1597 0.1602 0.0710 0.0120 0.0005 0.0000
20 10 0.0000 0.0020 0.0308 0.1171 0.1762 0.1171 0.0308 0.0020 0.0000
20 11 0.0000 0.0005 0.0120 0.0710 0.1602 0.1597 0.0654 0.0074 0.0001
20 12 0.0000 0.0001 0.0039 0.0355 0.1201 0.1797 0.1144 0.0222 0.0004
20 13 0.0000 0.0000 0.0010 0.0146 0.0739 0.1659 0.1643 0.0545 0.0020
20 14 0.0000 0.0000 0.0002 0.0049 0.0370 0.1244 0.1916 0.1091 0.0089
20 15 0.0000 0.0000 0.0000 0.0013 0.0148 0.0746 0.1789 0.1746 0.0319
20 16 0.0000 0.0000 0.0000 0.0003 0.0046 0.0350 0.1304 0.2182 0.0898
20 17 0.0000 0.0000 0.0000 0.0000 0.0011 0.0123 0.0716 0.2054 0.1901
20 18 0.0000 0.0000 0.0000 0.0000 0.0002 0.0031 0.0278 0.1369 0.2852
20 19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0005 0.0068 0.0576 0.2702
20 20 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0008 0.0115 0.1216
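Each entry of Table 27.2 is the individual binomial probability P(X = x) = C(n, x) p^x (1 − p)^(n − x), so the table is easy to recompute exactly. A minimal check in Python using only the standard library (math.comb requires Python 3.8+):

```python
from math import comb

def binom_pmf(n, x, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Spot-check a few entries, rounded to 4 decimals as in Table 27.2:
print(round(binom_pmf(2, 1, 0.5), 4))    # 0.5    (table: 0.5000)
print(round(binom_pmf(10, 5, 0.3), 4))   # 0.1029
print(round(binom_pmf(20, 10, 0.5), 4))  # 0.1762
```

For n or p outside the table, the same function applies unchanged; the symmetry visible in the table, P(X = x | p) = P(X = n − x | 1 − p), is a useful sanity check on any computed value.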
27.3 Selected Poisson individual probabilities
Selected values of LAMBDA
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
---------------------------------------------------------------------------------
0 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679
1 0.0905 0.1637 0.2222 0.2681 0.3033 0.3293 0.3476 0.3595 0.3659 0.3679
2 0.0045 0.0164 0.0333 0.0536 0.0758 0.0988 0.1217 0.1438 0.1647 0.1839
3 0.0002 0.0011 0.0033 0.0072 0.0126 0.0198 0.0284 0.0383 0.0494 0.0613
4 0.0000 0.0001 0.0003 0.0007 0.0016 0.0030 0.0050 0.0077 0.0111 0.0153
5 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004 0.0007 0.0012 0.0020 0.0031
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0005
7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Selected values of LAMBDA
x 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
---------------------------------------------------------------------------------
0 0.3329 0.3012 0.2725 0.2466 0.2231 0.2019 0.1827 0.1653 0.1496 0.1353
1 0.3662 0.3614 0.3543 0.3452 0.3347 0.3230 0.3106 0.2975 0.2842 0.2707
2 0.2014 0.2169 0.2303 0.2417 0.2510 0.2584 0.2640 0.2678 0.2700 0.2707
3 0.0738 0.0867 0.0998 0.1128 0.1255 0.1378 0.1496 0.1607 0.1710 0.1804
4 0.0203 0.0260 0.0324 0.0395 0.0471 0.0551 0.0636 0.0723 0.0812 0.0902
5 0.0045 0.0062 0.0084 0.0111 0.0141 0.0176 0.0216 0.0260 0.0309 0.0361
6 0.0008 0.0012 0.0018 0.0026 0.0035 0.0047 0.0061 0.0078 0.0098 0.0120
7 0.0001 0.0002 0.0003 0.0005 0.0008 0.0011 0.0015 0.0020 0.0027 0.0034
8 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0003 0.0005 0.0006 0.0009
9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002
10 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Selected values of LAMBDA
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
---------------------------------------------------------------------------------
0 0.1225 0.1108 0.1003 0.0907 0.0821 0.0743 0.0672 0.0608 0.0550 0.0498
1 0.2572 0.2438 0.2306 0.2177 0.2052 0.1931 0.1815 0.1703 0.1596 0.1494
2 0.2700 0.2681 0.2652 0.2613 0.2565 0.2510 0.2450 0.2384 0.2314 0.2240
3 0.1890 0.1966 0.2033 0.2090 0.2138 0.2176 0.2205 0.2225 0.2237 0.2240
4 0.0992 0.1082 0.1169 0.1254 0.1336 0.1414 0.1488 0.1557 0.1622 0.1680
5 0.0417 0.0476 0.0538 0.0602 0.0668 0.0735 0.0804 0.0872 0.0940 0.1008
6 0.0146 0.0174 0.0206 0.0241 0.0278 0.0319 0.0362 0.0407 0.0455 0.0504
7 0.0044 0.0055 0.0068 0.0083 0.0099 0.0118 0.0139 0.0163 0.0188 0.0216
8 0.0011 0.0015 0.0019 0.0025 0.0031 0.0038 0.0047 0.0057 0.0068 0.0081
9 0.0003 0.0004 0.0005 0.0007 0.0009 0.0011 0.0014 0.0018 0.0022 0.0027
10 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0004 0.0005 0.0006 0.0008
11 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002
12 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Selected values of LAMBDA
x 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0
---------------------------------------------------------------------------------
0 0.0408 0.0334 0.0273 0.0224 0.0183 0.0150 0.0123 0.0101 0.0082 0.0067
1 0.1304 0.1135 0.0984 0.0850 0.0733 0.0630 0.0540 0.0462 0.0395 0.0337
2 0.2087 0.1929 0.1771 0.1615 0.1465 0.1323 0.1188 0.1063 0.0948 0.0842
3 0.2226 0.2186 0.2125 0.2046 0.1954 0.1852 0.1743 0.1631 0.1517 0.1404
4 0.1781 0.1858 0.1912 0.1944 0.1954 0.1944 0.1917 0.1875 0.1820 0.1755
5 0.1140 0.1264 0.1377 0.1477 0.1563 0.1633 0.1687 0.1725 0.1747 0.1755
6 0.0608 0.0716 0.0826 0.0936 0.1042 0.1143 0.1237 0.1323 0.1398 0.1462
7 0.0278 0.0348 0.0425 0.0508 0.0595 0.0686 0.0778 0.0869 0.0959 0.1044
8 0.0111 0.0148 0.0191 0.0241 0.0298 0.0360 0.0428 0.0500 0.0575 0.0653
9 0.0040 0.0056 0.0076 0.0102 0.0132 0.0168 0.0209 0.0255 0.0307 0.0363
10 0.0013 0.0019 0.0028 0.0039 0.0053 0.0071 0.0092 0.0118 0.0147 0.0181
11 0.0004 0.0006 0.0009 0.0013 0.0019 0.0027 0.0037 0.0049 0.0064 0.0082
12 0.0001 0.0002 0.0003 0.0004 0.0006 0.0009 0.0013 0.0019 0.0026 0.0034
13 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0005 0.0007 0.0009 0.0013
14 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0003 0.0005
15 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0002
16 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Selected values of LAMBDA
x 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0
---------------------------------------------------------------------------------
0 0.0055 0.0045 0.0037 0.0030 0.0025 0.0020 0.0017 0.0014 0.0011 0.0009
1 0.0287 0.0244 0.0207 0.0176 0.0149 0.0126 0.0106 0.0090 0.0076 0.0064
2 0.0746 0.0659 0.0580 0.0509 0.0446 0.0390 0.0340 0.0296 0.0258 0.0223
3 0.1293 0.1185 0.1082 0.0985 0.0892 0.0806 0.0726 0.0652 0.0584 0.0521
4 0.1681 0.1600 0.1515 0.1428 0.1339 0.1249 0.1162 0.1076 0.0992 0.0912
5 0.1748 0.1728 0.1697 0.1656 0.1606 0.1549 0.1487 0.1420 0.1349 0.1277
6 0.1515 0.1555 0.1584 0.1601 0.1606 0.1601 0.1586 0.1562 0.1529 0.1490
7 0.1125 0.1200 0.1267 0.1326 0.1377 0.1418 0.1450 0.1472 0.1486 0.1490
8 0.0731 0.0810 0.0887 0.0962 0.1033 0.1099 0.1160 0.1215 0.1263 0.1304
9 0.0423 0.0486 0.0552 0.0620 0.0688 0.0757 0.0825 0.0891 0.0954 0.1014
10 0.0220 0.0262 0.0309 0.0359 0.0413 0.0469 0.0528 0.0588 0.0649 0.0710
11 0.0104 0.0129 0.0157 0.0190 0.0225 0.0265 0.0307 0.0353 0.0401 0.0452
12 0.0045 0.0058 0.0073 0.0092 0.0113 0.0137 0.0164 0.0194 0.0227 0.0263
13 0.0018 0.0024 0.0032 0.0041 0.0052 0.0065 0.0081 0.0099 0.0119 0.0142
14 0.0007 0.0009 0.0013 0.0017 0.0022 0.0029 0.0037 0.0046 0.0058 0.0071
15 0.0002 0.0003 0.0005 0.0007 0.0009 0.0012 0.0016 0.0020 0.0026 0.0033
16 0.0001 0.0001 0.0002 0.0002 0.0003 0.0005 0.0006 0.0008 0.0011 0.0014
17 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0004 0.0006
18 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002
19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001
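Similarly, each entry of Table 27.3 is the Poisson probability P(X = x) = λ^x e^{−λ} / x!. A minimal sketch, again with only the Python standard library:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

# Spot-check a few entries, rounded to 4 decimals as in Table 27.3:
print(round(poisson_pmf(0, 1.0), 4))  # 0.3679
print(round(poisson_pmf(2, 2.0), 4))  # 0.2707
print(round(poisson_pmf(5, 5.0), 4))  # 0.1755
```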
27.4 Cumulative probability for the Standard Normal Distribution
Second digit of Z
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
----------------------------------------------------------------------------------
-3.5 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
-3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
-3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
-3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
-3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
-1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
-0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
Second digit of Z
© 2012 Carl James Schwarz 1037 December 21, 2012
CHAPTER 27. TABLES
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
----------------------------------------------------------------------------------
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
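Rather than interpolating in the printed table, the standard normal cumulative probabilities can be evaluated directly in software. A minimal sketch using Python's standard library (`statistics.NormalDist`, available in Python 3.8+):

```python
from statistics import NormalDist

# Standard normal cumulative probabilities, P(Z <= z)
nd = NormalDist()  # mean 0, standard deviation 1

# Table row 1.9, column 0.06 gives 0.9750
print(round(nd.cdf(1.96), 4))

# Table row -1.0, column 0.00 gives 0.1587
print(round(nd.cdf(-1.00), 4))
```

The same object also inverts the table: `nd.inv_cdf(0.975)` returns the z value (about 1.96) whose cumulative probability is 0.975.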
27.5 Selected percentiles from the t-distribution
Selected right tail areas with confidence levels underneath
0.2000 0.1500 0.1000 0.0500 0.0250 0.0100 0.0050 0.0010 0.0005
conf lev 0.6000 0.7000 0.8000 0.9000 0.9500 0.9800 0.9900 0.9980 0.9990
df ----------------------------------------------------------------------------------
1 1.376 1.963 3.078 6.314 12.706 31.821 63.657 318.31 636.62
2 1.061 1.386 1.886 2.920 4.303 6.965 9.925 22.327 31.599
3 0.978 1.250 1.638 2.353 3.182 4.541 5.841 10.215 12.924
4 0.941 1.190 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 0.920 1.156 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 0.906 1.134 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 0.889 1.108 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 0.883 1.100 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 0.879 1.093 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 0.876 1.088 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 0.851 1.050 1.303 1.684 2.021 2.423 2.704 3.307 3.551
50 0.849 1.047 1.299 1.676 2.009 2.403 2.678 3.261 3.496
75 0.846 1.044 1.293 1.665 1.992 2.377 2.643 3.202 3.425
100 0.845 1.042 1.290 1.660 1.984 2.364 2.626 3.174 3.390
200 0.843 1.039 1.286 1.653 1.972 2.345 2.601 3.131 3.340
1000 0.842 1.037 1.282 1.646 1.962 2.330 2.581 3.098 3.300
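As the degrees of freedom grow, the t percentiles converge to the corresponding standard normal percentiles; the df = 1000 row above (e.g. 1.962 for a right tail area of 0.0250) is already very close to the normal limit. A quick standard-library check of that limit:

```python
from statistics import NormalDist

# Normal limit of the t percentile for a right tail area of 0.025.
# Compare with 12.706 at df = 1 and 1.962 at df = 1000 in the table.
z = NormalDist().inv_cdf(1 - 0.025)
print(round(z, 3))  # 1.96
```

Exact t percentiles for finite df require the t quantile function (e.g. `scipy.stats.t.ppf` in Python), which is not in the standard library.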
27.6 Selected percentiles from the chi-squared-distribution
Selected right tail areas
0.3000 0.2000 0.1500 0.1000 0.0500 0.0250 0.0100 0.0050 0.0010 0.0005
df -------------------------------------------------------------------------------------------
1 1.074 1.642 2.072 2.706 3.841 5.024 6.635 7.879 10.828 12.116
2 2.408 3.219 3.794 4.605 5.991 7.378 9.210 10.597 13.816 15.202
3 3.665 4.642 5.317 6.251 7.815 9.348 11.345 12.838 16.266 17.730
4 4.878 5.989 6.745 7.779 9.488 11.143 13.277 14.860 18.467 19.997
5 6.064 7.289 8.115 9.236 11.070 12.833 15.086 16.750 20.515 22.105
6 7.231 8.558 9.446 10.645 12.592 14.449 16.812 18.548 22.458 24.103
7 8.383 9.803 10.748 12.017 14.067 16.013 18.475 20.278 24.322 26.018
8 9.524 11.030 12.027 13.362 15.507 17.535 20.090 21.955 26.124 27.868
9 10.656 12.242 13.288 14.684 16.919 19.023 21.666 23.589 27.877 29.666
10 11.781 13.442 14.534 15.987 18.307 20.483 23.209 25.188 29.588 31.420
11 12.899 14.631 15.767 17.275 19.675 21.920 24.725 26.757 31.264 33.137
12 14.011 15.812 16.989 18.549 21.026 23.337 26.217 28.300 32.909 34.821
13 15.119 16.985 18.202 19.812 22.362 24.736 27.688 29.819 34.528 36.478
14 16.222 18.151 19.406 21.064 23.685 26.119 29.141 31.319 36.123 38.109
15 17.322 19.311 20.603 22.307 24.996 27.488 30.578 32.801 37.697 39.719
16 18.418 20.465 21.793 23.542 26.296 28.845 32.000 34.267 39.252 41.308
17 19.511 21.615 22.977 24.769 27.587 30.191 33.409 35.718 40.790 42.879
18 20.601 22.760 24.155 25.989 28.869 31.526 34.805 37.156 42.312 44.434
19 21.689 23.900 25.329 27.204 30.144 32.852 36.191 38.582 43.820 45.973
20 22.775 25.038 26.498 28.412 31.410 34.170 37.566 39.997 45.315 47.498
21 23.858 26.171 27.662 29.615 32.671 35.479 38.932 41.401 46.797 49.011
22 24.939 27.301 28.822 30.813 33.924 36.781 40.289 42.796 48.268 50.511
23 26.018 28.429 29.979 32.007 35.172 38.076 41.638 44.181 49.728 52.000
24 27.096 29.553 31.132 33.196 36.415 39.364 42.980 45.559 51.179 53.479
25 28.172 30.675 32.282 34.382 37.652 40.646 44.314 46.928 52.620 54.947
26 29.246 31.795 33.429 35.563 38.885 41.923 45.642 48.290 54.052 56.407
27 30.319 32.912 34.574 36.741 40.113 43.195 46.963 49.645 55.476 57.858
28 31.391 34.027 35.715 37.916 41.337 44.461 48.278 50.993 56.892 59.300
29 32.461 35.139 36.854 39.087 42.557 45.722 49.588 52.336 58.301 60.735
30 33.530 36.250 37.990 40.256 43.773 46.979 50.892 53.672 59.703 62.162
40 44.165 47.269 49.244 51.805 55.758 59.342 63.691 66.766 73.402 76.095
50 54.723 58.164 60.346 63.167 67.505 71.420 76.154 79.490 86.661 89.561
75 80.908 85.066 87.688 91.061 96.217 100.84 106.39 110.29 118.60 121.94
100 106.91 111.67 114.66 118.50 124.34 129.56 135.81 140.17 149.45 153.17
200 209.99 216.61 220.74 226.02 233.99 241.06 249.45 255.26 267.54 272.42
1000 1023.0 1037.4 1046.4 1057.7 1074.7 1089.5 1107.0 1118.9 1143.9 1153.7
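The df = 1 row of this table can be checked against the normal table, because a chi-squared variable with 1 degree of freedom is the square of a standard normal variable: the upper-α critical value for df = 1 equals the square of the two-sided normal critical value z_{α/2}. A standard-library sketch of that check:

```python
from statistics import NormalDist

# df = 1 chi-squared critical value for a right tail area of 0.05
# equals the square of the two-sided normal critical value.
z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(round(z * z, 3))  # 3.841, the df = 1, 0.0500 entry above
```

Critical values for higher degrees of freedom need a chi-squared quantile function (e.g. `scipy.stats.chi2.ppf`), which is not in the standard library.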
27.7 Sample size determination for a two sample t-test
Number of observations in each group required for a two-sided t-test to detect
a specified relative difference between two means
alpha=.01 alpha = .05 alpha = .10
<------ Power ------> <------ Power ------> <------ Power ------>
delta 99 95 90 80 50 99 95 90 80 50 99 95 90 80 50
0.05 . . . . . . . . . . . . . . .
0.10 . . . . . . . . . . . . . . .
0.15 . . . . . . . . . . . . . . .
0.20 . . . . . . . . . . . . . . .
0.25 . . . . . . . . . 124 . . . . 88
0.30 . . . . . . . . . 87 . . . . 61
0.35 . . . . 110 . . . . 64 . . . 102 45
0.40 . . . . 85 . . . 100 49 . . 108 78 35
0.45 . . . 118 68 . . 105 79 39 . 108 86 62 28
0.50 . . 121 96 55 . 105 86 64 32 . 88 70 51 23
0.55 . 120 101 79 46 123 87 71 53 27 105 73 58 42 19
0.60 . 101 85 67 39 104 74 60 45 23 89 61 49 36 16
0.65 116 87 73 57 34 88 63 51 39 20 76 52 42 30 14
0.70 100 75 63 50 29 76 55 44 34 17 66 45 36 26 12
0.75 88 66 55 44 26 67 48 39 29 15 57 40 32 23 11
0.80 77 58 49 39 23 59 42 34 26 14 50 35 28 21 10
0.85 69 52 43 35 21 52 37 31 23 12 45 31 25 18 9
0.90 62 46 39 31 19 47 34 27 21 11 40 28 22 16 8
0.95 55 42 35 28 17 42 30 25 19 10 36 25 20 15 7
1.00 50 38 32 26 15 38 27 23 17 9 33 23 18 14 7
1.10 42 32 27 22 13 32 23 19 15 8 27 19 15 11 6
1.20 36 27 23 18 11 27 20 16 12 7 23 16 13 10 5
1.30 31 23 20 16 10 23 17 14 11 6 20 14 11 9 5
1.40 27 20 17 14 9 20 15 12 10 6 17 12 10 8 4
1.50 24 18 15 13 8 18 13 11 9 5 15 11 9 7 4
1.60 21 16 14 11 7 16 12 10 8 5 14 10 8 6 4
1.70 19 15 13 10 7 14 11 9 7 4 12 9 7 6 3
1.80 17 13 11 10 6 13 10 8 6 4 11 8 7 5 3
1.90 16 12 11 9 6 12 9 7 6 4 10 7 6 5 3
2.00 14 11 10 8 6 11 8 7 6 4 9 7 6 4 3
2.20 12 10 8 7 5 9 7 6 5 3 8 6 5 4 3
2.40 11 9 8 6 5 8 6 5 4 3 7 5 4 4 3
2.60 9 8 7 6 4 7 6 5 4 3 6 5 4 3 2
2.80 9 7 6 5 4 6 5 4 4 3 5 4 4 3 2
3.00 8 6 6 5 4 6 5 4 4 3 5 4 3 3 2
3.50 6 5 5 4 3 5 4 4 3 3 4 3 3 3 2
4.00 6 5 4 4 3 4 4 3 3 2 4 3 3 2 2
delta = (difference in means) / sigma
If sample size exceeds 125, then it is not printed
The table is indexed along the side by the relative effect size defined as:

δ = |μ₁ − μ₂| / σ

where μ₁ and μ₂ are the two means to be compared, and σ is the standard deviation of the responses around
their respective means. The latter is often a guess-estimate obtained from a pilot study or a literature search.
Along the top are several choices of α levels and several choices for power. Usually, you would like to
be at least 80% sure of detecting an effect. [Note that a 50% power is equivalent to flipping a coin!]
For example, if δ = .5, then at least 64 observations in each group are required for 80% power at α = 0.05.
What does this table indicate? First, notice that as you increase the power for a given relative effect size
the sample size increases. Similarly, as you decrease the relative effect size to be detected, the sample size
increases. And, most important, you need very large experiments to detect small differences!
Power is maximized if the two groups have equal sample sizes, but it is possible to do a power analysis
with unequal sample sizes - consult some of the references listed below for assistance.
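Entries like those in the table can be reproduced to within a unit or so without a table, using the standard normal-approximation formula n ≈ 2((z_{α/2} + z_β)/δ)² with the usual z²_{α/2}/4 correction toward the exact noncentral-t answer. This is a sketch under that approximation (the function name `n_per_group` is just for illustration), not the exact computation used to build the table:

```python
import math
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample t-test.

    Normal approximation 2*((z_{a/2} + z_b)/delta)^2, plus the usual
    z_{a/2}^2 / 4 correction toward the noncentral-t answer.
    """
    z = NormalDist().inv_cdf
    za = z(1 - alpha / 2)  # two-sided critical value
    zb = z(power)          # normal percentile matching the desired power
    return math.ceil(2 * ((za + zb) / delta) ** 2 + za ** 2 / 4)

print(n_per_group(0.50))  # 64, matching the delta = 0.50 table entry
print(n_per_group(1.00))  # 17
```

For unequal group sizes or more exact answers, dedicated power-analysis routines in a statistics package are preferable.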
27.8 Power determination for a two sample t-test
Power for a two-sided two-sample t-test at alpha=.05
Delta
n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
2 5 5 5 6 6 7 7 8 9 10 10 11 13 14 15 16 17 19 20 22
3 5 5 6 7 8 9 10 12 14 16 18 21 23 26 29 33 36 39 43 46
4 5 6 7 8 9 11 13 16 19 22 26 30 34 38 43 48 52 57 61 66
5 5 6 7 9 11 13 16 20 24 29 33 39 44 49 55 60 65 70 75 79
6 5 6 8 10 12 16 20 24 29 35 41 47 53 59 65 71 76 80 84 88
7 5 6 8 11 14 18 23 28 34 41 47 54 61 67 73 78 83 87 90 93
8 5 7 9 12 15 20 26 32 39 46 54 61 68 74 80 84 88 92 94 96
9 5 7 9 13 17 22 29 36 43 51 59 67 74 80 85 89 92 95 97 98
10 6 7 10 14 19 25 32 40 48 56 64 72 78 84 89 92 95 97 98 99
12 6 8 11 16 22 29 37 47 56 65 73 80 86 91 94 96 98 99 99 100
14 6 8 12 17 25 33 43 53 63 72 80 86 91 95 97 98 99 100 100 100
16 6 9 13 19 28 38 48 59 69 78 85 91 94 97 98 99 100 100 100 100
18 6 9 14 21 31 42 53 65 75 83 89 94 97 98 99 100 100 100 100 100
20 6 9 15 23 34 46 58 69 79 87 92 96 98 99 100 100 100 100 100 100
25 6 11 18 28 41 55 68 79 88 93 97 99 99 100 100 100 100 100 100 100
30 7 12 21 33 48 63 76 86 93 97 99 100 100 100 100 100 100 100 100 100
35 7 13 24 38 54 70 82 91 96 98 99 100 100 100 100 100 100 100 100 100
40 7 14 26 42 60 75 87 94 98 99 100 100 100 100 100 100 100 100 100 100
45 8 16 29 47 65 80 91 96 99 100 100 100 100 100 100 100 100 100 100 100
50 8 17 32 51 70 84 93 98 99 100 100 100 100 100 100 100 100 100 100 100
55 8 18 34 55 74 88 95 99 100 100 100 100 100 100 100 100 100 100 100 100
60 8 19 37 58 78 90 97 99 100 100 100 100 100 100 100 100 100 100 100 100
65 9 20 40 62 81 92 98 99 100 100 100 100 100 100 100 100 100 100 100 100
70 9 22 42 65 84 94 98 100 100 100 100 100 100 100 100 100 100 100 100 100
75 9 23 45 68 86 95 99 100 100 100 100 100 100 100 100 100 100 100 100 100
80 10 24 47 71 88 96 99 100 100 100 100 100 100 100 100 100 100 100 100 100
85 10 25 49 74 90 97 100 100 100 100 100 100 100 100 100 100 100 100 100 100
90 10 27 52 76 92 98 100 100 100 100 100 100 100 100 100 100 100 100 100 100
95 11 28 54 78 93 98 100 100 100 100 100 100 100 100 100 100 100 100 100 100
100 11 29 56 80 94 99 100 100 100 100 100 100 100 100 100 100 100 100 100 100
Power is in %
Delta = abs(difference in means)/ sigma
This table assumes equal sample sizes in both groups
This chart is indexed in much the same way as the sample size chart and assumes equal sample sizes in
each treatment group. If the sample sizes are unequal, use the average sample size to get the approximate
power.
For example, if δ = 0.50, then looking down the column indexed by 0.50, it appears that a sample size
of around 60 per treatment group is required to get a power of around 80% at α = 0.05.
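Values close to the table entries can also be computed directly from the normal approximation power ≈ Φ(δ√(n/2) − z_{α/2}). This sketch (the function name `approx_power` is illustrative) is slightly optimistic for very small n compared with the exact t-based table:

```python
from statistics import NormalDist

def approx_power(delta, n, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test, n per group.

    Normal approximation: power ~ Phi(delta * sqrt(n/2) - z_{alpha/2}).
    Slightly optimistic for small n relative to the exact t computation.
    """
    nd = NormalDist()
    za = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(delta * (n / 2) ** 0.5 - za)

# delta = 0.5 with n = 60 per group: about 78% power, as in the table.
print(round(100 * approx_power(0.5, 60)))  # 78
```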
27.9 Sample size determination for a single factor, xed effects, CRD
Number of observations in each group required for a single-factor fixed effects CRD
to detect a specified relative difference between the largest and smallest mean
alpha=.01 alpha = .05 alpha = .10
<------ Power ------> <------ Power ------> <------ Power ------>
r delta 99 95 90 80 50 99 95 90 80 50 99 95 90 80 50
--------------------------------------------------------------------------------
2 0.25 . . . . . . . . . 124 . . . . 88
2 0.50 . . 121 96 55 . 105 86 64 32 . 88 70 51 23
2 0.75 88 66 55 44 26 67 48 39 29 15 57 40 32 23 11
2 1.00 50 38 32 26 15 38 27 23 17 9 33 23 18 14 7
2 1.25 33 25 21 17 11 25 18 15 12 7 21 15 12 9 5
2 1.50 24 18 15 13 8 18 13 11 9 5 15 11 9 7 4
2 1.75 18 14 12 10 7 14 10 8 7 4 12 8 7 5 3
2 2.00 14 11 10 8 6 11 8 7 6 4 9 7 6 4 3
--------------------------------------------------------------------------------
3 0.25 . . . . . . . . . . . . . . 115
3 0.50 . . . 113 68 . 125 103 79 41 . 105 85 63 30
3 0.75 100 75 64 51 31 78 56 47 36 19 67 48 38 29 14
3 1.00 57 43 37 30 18 44 32 27 21 11 38 27 22 17 8
3 1.25 37 29 24 20 13 29 21 18 14 8 25 18 15 11 6
3 1.50 26 20 18 14 9 21 15 13 10 6 18 13 11 8 5
3 1.75 20 16 13 11 7 16 12 10 8 5 13 10 8 6 4
3 2.00 16 12 11 9 6 12 9 8 6 4 11 8 7 5 3
--------------------------------------------------------------------------------
4 0.25 . . . . . . . . . . . . . . .
4 0.50 . . . . 76 . . 115 89 48 . 118 96 72 35
4 0.75 108 83 70 57 35 85 63 52 40 22 74 53 43 33 16
4 1.00 62 47 40 33 21 49 36 30 23 13 42 30 25 19 10
4 1.25 40 31 27 22 14 32 23 20 15 9 28 20 16 13 7
4 1.50 28 22 19 16 10 22 17 14 11 7 20 14 12 9 5
4 1.75 21 17 15 12 8 17 13 11 9 5 15 11 9 7 4
4 2.00 17 13 12 10 7 13 10 9 7 4 12 9 7 6 4
--------------------------------------------------------------------------------
5 0.25 . . . . . . . . . . . . . . .
5 0.50 . . . . 84 . . 125 97 53 . . 104 79 39
5 0.75 115 88 76 61 38 91 67 56 44 24 80 58 47 36 18
5 1.00 65 50 43 35 22 52 39 32 25 14 45 33 27 21 11
5 1.25 43 33 28 23 15 34 25 21 17 10 30 22 18 14 7
5 1.50 30 23 20 17 11 24 18 15 12 7 21 15 13 10 6
5 1.75 23 18 15 13 9 18 14 12 9 6 16 12 10 8 4
5 2.00 18 14 12 10 7 14 11 9 7 5 12 9 8 6 4
--------------------------------------------------------------------------------
6 0.25 . . . . . . . . . . . . . . .
6 0.50 . . . . 90 . . . 104 57 . . 112 85 42
6 0.75 121 93 80 65 41 96 72 60 47 26 85 61 50 38 20
6 1.00 69 53 46 38 24 55 41 34 27 15 48 35 29 22 12
6 1.25 45 35 30 25 16 36 27 23 18 10 31 23 19 15 8
6 1.50 32 25 21 18 12 25 19 16 13 8 22 16 14 11 6
6 1.75 24 19 16 13 9 19 14 12 10 6 17 12 10 8 5
6 2.00 19 15 13 11 7 15 11 10 8 5 13 10 8 7 4
--------------------------------------------------------------------------------
7 0.25 . . . . . . . . . . . . . . .
7 0.50 . . . . 96 . . . 110 61 . . 118 90 46
7 0.75 . 98 84 69 43 101 76 63 50 28 89 65 53 41 21
7 1.00 72 56 48 39 25 58 43 36 29 16 51 37 31 24 12
7 1.25 47 36 31 26 17 37 28 24 19 11 33 24 20 16 8
7 1.50 33 26 22 18 12 26 20 17 14 8 23 17 14 11 6
7 1.75 25 19 17 14 9 20 15 13 10 6 17 13 11 9 5
7 2.00 19 15 13 11 8 15 12 10 8 5 14 10 9 7 4
--------------------------------------------------------------------------------
8 0.25 . . . . . . . . . . . . . . .
8 0.50 . . . . 101 . . . 116 65 . . 125 95 48
8 0.75 . 102 88 72 46 105 79 66 52 30 93 68 56 43 22
8 1.00 74 58 50 41 26 60 45 38 30 17 53 39 32 25 13
8 1.25 48 38 33 27 18 39 29 25 20 12 34 25 21 16 9
8 1.50 34 27 23 19 13 27 21 18 14 8 24 18 15 12 7
8 1.75 25 20 18 15 10 21 16 13 11 7 18 14 11 9 5
8 2.00 20 16 14 12 8 16 12 11 9 5 14 11 9 7 4
--------------------------------------------------------------------------------
9 0.25 . . . . . . . . . . . . . . .
9 0.50 . . . . 106 . . . 122 69 . . . 100 51
9 0.75 . 106 91 75 48 109 82 69 55 31 96 71 59 45 23
9 1.00 77 60 52 43 28 62 47 40 31 18 55 40 33 26 14
9 1.25 50 39 34 28 18 40 30 26 21 12 36 26 22 17 9
9 1.50 35 28 24 20 13 28 22 18 15 9 25 19 16 12 7
9 1.75 26 21 18 15 10 21 16 14 11 7 19 14 12 9 5
9 2.00 20 16 14 12 8 17 13 11 9 6 15 11 9 7 4
--------------------------------------------------------------------------------
10 0.25 . . . . . . . . . . . . . . .
10 0.50 . . . . 110 . . . . 72 . . . 104 54
10 0.75 . 109 94 78 50 113 85 72 57 33 100 73 61 47 25
10 1.00 79 62 54 44 29 64 49 41 33 19 57 42 35 27 14
10 1.25 51 40 35 29 19 42 32 27 21 13 37 27 23 18 10
10 1.50 36 29 25 21 14 29 22 19 15 9 26 19 16 13 7
10 1.75 27 21 19 16 10 22 17 14 12 7 19 15 12 10 6
10 2.00 21 17 15 12 8 17 13 11 9 6 15 11 10 8 5
--------------------------------------------------------------------------------
delta = (largest mean - smallest mean) / sigma
r = number of treatment groups
If sample size exceeds 125, then it is not printed
This table is conservative as it uses the worst
configuration of the means to estimate the sample size.
This table is only appropriate for FIXED EFFECTS - please
refer to a reference book for planning under random effect models
These tables are indexed using

δ = (max(μ) − min(μ)) / σ

where σ is the standard deviation of units around each population mean.
For example, suppose that an experiment had 6 treatment groups whose largest and smallest means differed
by 2 units with a standard deviation of 1 unit. Then δ = 2.
Scan the first table for r = 6 groups, power = 80%, δ = 2, α = .05; it indicates that about 8 observations are
needed for each treatment group.
27.10 Power determination for a single factor, xed effects, CRD
Power for a single factor, fixed effects, CRD at alpha=0.05
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
2 3 5 5 6 7 8 9 10 12 14 16 18 21 23 26 29 33 36 39 43 46
2 4 5 6 7 8 9 11 13 16 19 22 26 30 34 38 43 48 52 57 61 66
2 5 5 6 7 9 11 13 16 20 24 29 33 39 44 49 55 60 65 70 75 79
2 6 5 6 8 10 12 16 20 24 29 35 41 47 53 59 65 71 76 80 84 88
2 7 5 6 8 11 14 18 23 28 34 41 47 54 61 67 73 78 83 87 90 93
2 8 5 7 9 12 15 20 26 32 39 46 54 61 68 74 80 84 88 92 94 96
2 9 5 7 9 13 17 22 29 36 43 51 59 67 74 80 85 89 92 95 97 98
2 10 6 7 10 14 19 25 32 40 48 56 64 72 78 84 89 92 95 97 98 99
2 15 6 8 12 18 26 35 46 56 66 75 83 89 93 96 98 99 99 100 100 100
2 20 6 9 15 23 34 46 58 69 79 87 92 96 98 99 100 100 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
3 3 5 5 6 6 7 8 9 10 11 13 14 16 18 21 23 26 29 32 35 38
3 4 5 5 6 7 8 9 11 13 15 17 20 23 27 30 34 38 43 47 52 56
3 5 5 6 6 7 9 11 13 16 19 22 26 30 35 40 45 50 55 60 65 70
3 6 5 6 7 8 10 12 15 19 22 27 32 37 43 49 55 60 66 71 76 81
3 7 5 6 7 9 11 14 17 22 26 32 38 44 50 57 63 69 75 80 84 88
3 8 5 6 7 9 12 16 20 25 30 37 43 50 57 64 70 76 81 86 90 92
3 9 5 6 8 10 13 17 22 28 34 41 49 56 64 70 77 82 87 90 93 95
3 10 5 6 8 11 14 19 24 31 38 46 54 62 69 76 82 87 91 94 96 97
3 15 6 7 10 14 20 27 36 46 56 65 74 82 88 92 95 97 99 99 100 100
3 20 6 8 12 18 26 36 47 59 70 79 87 92 96 98 99 100 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
4 3 5 5 6 6 6 7 8 9 10 11 13 14 16 18 20 23 25 28 31 34
4 4 5 5 6 6 7 8 10 11 13 15 17 20 23 26 30 34 38 42 46 50
4 5 5 5 6 7 8 10 11 14 16 19 22 26 30 35 39 44 49 54 60 64
4 6 5 6 6 7 9 11 13 16 19 23 27 32 37 43 49 54 60 65 71 75
4 7 5 6 7 8 10 12 15 19 23 27 33 38 44 51 57 63 69 74 79 84
4 8 5 6 7 9 11 13 17 21 26 32 38 44 51 58 64 71 76 81 86 89
4 9 5 6 7 9 12 15 19 24 29 36 43 50 57 64 71 77 82 87 90 93
4 10 5 6 7 10 12 16 21 26 33 40 47 55 63 70 77 82 87 91 94 96
4 15 5 7 9 12 17 23 31 40 49 59 68 76 83 89 93 96 98 99 99 100
4 20 6 7 11 15 22 31 41 52 64 74 82 89 93 96 98 99 100 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
5 3 5 5 5 6 6 7 8 8 9 10 12 13 15 17 19 21 23 26 28 31
5 4 5 5 6 6 7 8 9 10 12 14 16 18 21 24 27 30 34 38 42 46
5 5 5 5 6 7 8 9 10 12 15 17 20 23 27 31 36 40 45 50 55 60
5 6 5 5 6 7 8 10 12 14 17 21 25 29 34 39 44 50 55 61 66 71
5 7 5 6 6 8 9 11 14 17 20 24 29 34 40 46 52 58 64 70 75 80
5 8 5 6 7 8 10 12 15 19 23 28 34 40 46 53 60 66 72 78 82 87
5 9 5 6 7 8 11 13 17 21 26 32 38 45 52 60 66 73 79 84 88 91
5 10 5 6 7 9 11 15 19 24 29 36 43 50 58 65 72 78 84 88 92 94
5 15 5 6 8 11 15 21 28 36 45 54 63 72 80 86 91 94 97 98 99 100
5 20 5 7 10 14 20 28 37 48 59 69 78 86 91 95 97 99 99 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
6 3 5 5 5 6 6 7 7 8 9 10 11 12 14 15 17 19 21 24 26 29
6 4 5 5 6 6 7 8 9 10 11 13 15 17 19 22 25 28 32 35 39 43
6 5 5 5 6 6 7 8 10 11 13 16 18 22 25 29 33 37 42 47 52 57
6 6 5 5 6 7 8 9 11 13 16 19 23 27 31 36 41 46 52 57 63 68
6 7 5 6 6 7 9 10 13 15 19 22 27 32 37 43 49 55 61 66 72 77
6 8 5 6 6 8 9 11 14 17 21 26 31 37 43 49 56 62 69 74 79 84
6 9 5 6 7 8 10 12 16 19 24 29 35 42 49 56 63 69 75 81 85 89
6 10 5 6 7 8 11 13 17 22 27 33 40 47 54 62 69 75 81 86 90 93
6 15 5 6 8 11 14 19 25 33 41 50 59 68 76 83 89 93 96 97 99 99
6 20 5 7 9 13 18 25 34 44 55 65 75 83 89 94 97 98 99 100 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
7 3 5 5 5 6 6 6 7 8 9 9 11 12 13 15 16 18 20 22 25 27
7 4 5 5 6 6 7 7 8 9 11 12 14 16 18 20 23 26 30 33 37 41
7 5 5 5 6 6 7 8 9 11 13 15 17 20 23 27 31 35 39 44 49 54
7 6 5 5 6 7 8 9 11 13 15 18 21 25 29 33 38 43 49 54 60 65
7 7 5 5 6 7 8 10 12 14 17 21 25 29 34 40 46 52 58 63 69 74
7 8 5 6 6 7 9 11 13 16 20 24 29 34 40 46 53 59 65 71 77 82
7 9 5 6 6 8 9 12 15 18 22 27 33 39 46 52 59 66 72 78 83 87
7 10 5 6 7 8 10 13 16 20 25 31 37 44 51 58 65 72 78 83 88 91
7 15 5 6 8 10 13 18 23 30 38 47 56 65 73 81 87 91 94 97 98 99
7 20 5 7 9 12 17 23 31 41 51 62 72 80 87 92 96 98 99 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
8 3 5 5 5 6 6 6 7 8 8 9 10 11 12 14 15 17 19 21 23 26
8 4 5 5 5 6 6 7 8 9 10 11 13 15 17 19 22 25 28 31 35 38
8 5 5 5 6 6 7 8 9 10 12 14 16 19 22 25 29 33 37 42 46 51
8 6 5 5 6 7 7 9 10 12 14 17 20 23 27 31 36 41 46 51 57 62
8 7 5 5 6 7 8 9 11 14 16 20 23 28 32 38 43 49 55 61 66 72
8 8 5 6 6 7 8 10 12 15 19 22 27 32 38 44 50 56 63 69 74 79
8 9 5 6 6 7 9 11 14 17 21 26 31 37 43 50 57 63 70 76 81 85
8 10 5 6 7 8 10 12 15 19 23 29 35 41 48 55 63 69 76 81 86 90
8 15 5 6 7 10 13 17 22 28 36 44 53 62 71 78 85 90 93 96 98 99
8 20 5 6 8 11 16 22 29 39 49 59 69 78 85 91 95 97 99 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
9 3 5 5 5 6 6 6 7 7 8 9 10 11 12 13 15 16 18 20 22 24
9 4 5 5 5 6 6 7 8 9 10 11 12 14 16 18 21 23 26 30 33 37
9 5 5 5 6 6 7 8 9 10 12 13 15 18 21 24 27 31 35 39 44 49
9 6 5 5 6 6 7 8 10 11 13 16 19 22 26 30 34 39 44 49 54 60
9 7 5 5 6 7 8 9 11 13 15 18 22 26 31 36 41 47 52 58 64 69
9 8 5 5 6 7 8 10 12 14 18 21 26 30 36 42 48 54 60 66 72 77
9 9 5 6 6 7 9 11 13 16 20 24 29 35 41 47 54 61 67 73 79 84
9 10 5 6 6 8 9 11 14 18 22 27 33 39 46 53 60 67 73 79 84 88
9 15 5 6 7 9 12 16 21 27 34 42 51 60 68 76 83 88 92 95 97 98
9 20 5 6 8 11 15 21 28 36 46 56 67 76 83 89 94 96 98 99 100 100
-------------------------------------------------------------------------------
Delta
r n 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
10 3 5 5 5 6 6 6 7 7 8 9 9 10 12 13 14 16 17 19 21 23
10 4 5 5 5 6 6 7 8 8 9 11 12 14 15 18 20 22 25 28 32 35
10 5 5 5 6 6 7 7 8 10 11 13 15 17 20 23 26 30 34 38 42 47
10 6 5 5 6 6 7 8 9 11 13 15 18 21 24 28 32 37 42 47 52 58
10 7 5 5 6 7 8 9 10 12 15 18 21 25 29 34 39 44 50 56 62 67
10 8 5 5 6 7 8 10 11 14 17 20 24 29 34 40 46 52 58 64 70 75
10 9 5 6 6 7 8 10 13 15 19 23 28 33 39 45 52 58 65 71 77 82
10 10 5 6 6 7 9 11 14 17 21 26 31 37 44 51 58 65 71 77 82 87
10 15 5 6 7 9 12 15 20 25 32 40 49 57 66 74 81 87 91 94 97 98
10 20 5 6 8 11 14 20 26 35 44 54 64 74 82 88 93 96 98 99 100 100
-------------------------------------------------------------------------------
Power is in %
Delta = (max difference in means)/ sigma
r = number of treatments
This table assumes equal sample sizes in all groups
The power tabulated is conservative because it assumes the worst possible configuration
for the means for a given delta and assumes equal sample sizes in all groups
These tables are indexed using

δ = (max(μ) − min(μ)) / σ

where σ is the standard deviation of units around each population mean.
For example, suppose that an experiment had 6 treatment groups whose largest and smallest means differed
by 2 units with a standard deviation of 1 unit. Then δ = 2.
Scan the first table for r = 6 groups, power = 80%, δ = 2, α = .05; it indicates that about 8 observations are
needed for each treatment group.
Chapter 28
THE END!
Contents
28.1 Statisfaction - with apologies to Jagger/Richards . . . . . . . . . . . . . . . . . . . . . 1053
28.2 ANOVA Man with apologies to Lennon/McCartney . . . . . . . . . . . . . . . . . . . 1055
Now that the classes are over, you might enjoy listening to some classical songs from http://www.
glicko.net/music.html that statisticians fondly remember.
28.1 Statisfaction - with apologies to Jagger/Richards
Statisfaction
Words: Mark Glickman
Music: Jagger/Richards ("Satisfaction")
Available at http://www.glicko.net/music.html
I can't get no statisfaction,
I can't get no statisfaction.
'Cause I try and I try and I try and I try.
I can't get no, I can't get no.
When I'm sitting down at lecture,
CHAPTER 28. THE END!
And that man begins to explain to me
That you must pay close attention
When you're fitting your regression
To heteroskedasticity!
I can't get no, oh no no no.
Hey hey hey, that's what I say.
I can't get no statisfaction,
I can't get no statisfaction.
'Cause I try and I try and I try and I try.
I can't get no, I can't get no.
When I'm working on my homework,
And I'm filled with great uncertainty,
So I choose the pooled procedure,
And my teacher points and laughs at me
'Cause our p-values don't agree!
I can't get no, oh no no no.
Hey hey hey, that's what I say.
I can't get no statisfaction,
I can't get no strong relation.
'Cause I try and I try and I try and I try.
I can't get no, I can't get no.
When I'm handed my diploma
For my hard-earned Bachelor's degree,
And the Dean says I cannot leave
Until I give an explanation
How to compute a correlation.
I can't get no, oh no no no.
Hey hey hey, that's what I say.
I can't get no, I can't get no,
I can't get no statisfaction,
no statisfaction, no statisfaction, no statisfaction.
28.2 ANOVA Man with apologies to Lennon/McCartney
ANOVA Man
Words: Mark Glickman
Music: Lennon/McCartney ("Nowhere Man")
Available at http://www.glicko.net/music.html
He's a real ANOVA man
Designing all his sampling plans
Calculating mean-squared errors and p-values.
Wants to test for equal mus
Knows which tables he must use
All his samples he will choose at random.
ANOVA man, please listen;
Where's the data that you're missing?
ANOVA man, what kinds of bias can you withstand?
Writes down two hypotheses;
Hopes to reject the first of these;
Needs to list out all degrees of freedom.
ANOVA man, try harder;
Don't give up, you're smarter;
ANOVA man, how come your students don't understand?
At 0.05 he rejects
Ignores the size of his effects
Now he's stuck; he's got selection bias!
ANOVA man, please listen;
Where's the data that you're missing?
ANOVA man, what kinds of bias can you withstand?
He's a real ANOVA man
Designing all his sampling plans
Calculating mean-squared errors and p-values.
Chapter 29
An overview of environmental field studies
Contents
29.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
29.1.1 Survey Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065
29.1.2 Permanent or temporary monitoring stations . . . . . . . . . . . . . . . . . . . . 1079
29.1.3 Refinements that affect precision . . . . . . . . . . . . . . . . . . . . . . . . 1080
29.1.4 Sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
29.2 Analytical surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
29.3 Impact Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
29.3.1 Before/After contrasts at a single site . . . . . . . . . . . . . . . . . . . . . . . . 1088
29.3.2 Repeated before/after sampling at a single site. . . . . . . . . . . . . . . . . . . . 1088
29.3.3 BACI: Before/After and Control/Impact Surveys . . . . . . . . . . . . . . . . . . 1089
29.3.4 BACI-P: Before/After and Control/Impact - Paired designs . . . . . . . . . . . . . 1092
29.3.5 Enhanced BACI-P: Designs to detect acute vs. chronic effects or to detect changes
in variation as well as changes in the mean. . . . . . . . . . . . . . . . . . . . . . 1094
29.3.6 Designs for multiple impacts spread over time . . . . . . . . . . . . . . . . . . . 1096
29.3.7 Accidental Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099
29.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1108
29.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110
29.6 Selected journal articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
29.6.1 Designing Environmental Field Studies . . . . . . . . . . . . . . . . . . . . . . . 1112
29.6.2 Beyond BACI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1113
29.6.3 Environmental impact assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 1113
29.7 Examples of studies for discussion - good exam questions! . . . . . . . . . . . . . . . 1114
29.7.1 Effect of burn upon salamanders . . . . . . . . . . . . . . . . . . . . . . . . . . 1114
This section of the notes is based on an article published in
Schwarz, C. J. (1998). Studies of uncontrolled events. In Statistical Methods for Adaptive
Management Studies, p. 19-40, Sit, V. and Taylor, B. (editors). BC Ministry of Forests.
This paper, and other papers (by various authors) on Adaptive Management, are available online at http://www.for.gov.bc.ca/hfd/pubs/docs/lmh/lmh42.htm. The volume has a number of nice chapters on using statistics in Forestry Management and is written at an accessible level.
Other nice books for studying wildlife populations or guidance on designing monitoring studies are:
Elzinga, C.L., Salzer, D.W., Willoughby, J.W., and Gibbs, J.P. (2001). Monitoring Plant and Animal
Populations. Blackwell Science, New York.
A nice how-to manual on designing monitoring studies.
Morrison, M. L., Block, W. M., Strickland, M. D., and Kendall, W. L. (2010). Wildlife Study Design.
2nd Edition. Springer, New York.
http://www.amazon.com/Wildlife-Design-Springer-Environmental-Management/dp/1441925945/ref=sr_1_3?s=books&ie=UTF8&qid=1285972814&sr=1-3
This is a very readable book on principles of survey, experimental design, and environmental impact and monitoring studies. It is non-technical, with lots of illustrations of real-world problems.
Skalski, J.R. and Robson, D.S. (1992). Techniques for Wildlife Investigations: Design and Analysis of Capture Data. New York: Academic Press.
This book presents methods for conducting experimental inference and mark-recapture statistical studies for fish and wildlife investigations.
29.1 Introduction
The rationale for carefully planned experiments in ecology is well documented (Hurlbert, 1984). A well
designed experiment will have a high probability of detecting important, biologically meaningful differences
among the experimental groups. Furthermore, because the manager directly manipulated the experimental
factors and randomly assigned the experimental units to the particular combination of experimental factors,
the manager is able to infer a causal relationship between the experimental factors and the response variable.
The manager who takes a similar approach and practices active adaptive management will be able to make
the strongest possible inferences about the role of the experimental factors.
In many cases, controlled experiments are impractical or too expensive, and surveys of existing ecological populations are performed, even though the resulting inferences will be weaker than those obtainable through controlled experimentation. Consequently, non-experimental studies, or passive adaptive management, lead to conclusions that are primarily a tool for generating hypotheses, eventually to be tested by careful and more efficient experimentation.
For example, observational studies of existing lakes showed that the more acidic lakes tended to have fewer fish. An alternate explanation for this result is that there is some unknown factor that causes both the lake to acidify and the fish to die; i.e., the relationship between numbers of fish and acidification is a result of a common response to another factor. However, experiments where lakes were deliberately acidified refute this alternate explanation. No such refutation is possible from surveys of existing populations. The primary message is that causation cannot be inferred without active manipulation.
Despite the weaker inferences from non-experimental studies, the same attention must be paid to the proper design of a survey so that the conclusions are not tainted by inadvertent biases. There are many different types of non-experimental studies; these are outlined in Figure 1 below.
Figure 1: A classification of the methods considered in this chapter.
To begin, consider the following series of examples to illustrate the differences among these types of
studies.
Example A. Descriptive Study. A manager is interested in examining the natural regeneration in a cutblock harvested by clearcutting. The objective is to measure the amount of regeneration. A suitable response measure will be the density of newly grown trees. A series of sample plots is systematically located within a single cutblock, and the density is measured on each sample plot. The mean density over all plots is computed along with a measure of precision, the standard error. There is only one response variable, the density on each plot, and no explanatory variables. This is a Descriptive Survey, as no comparisons will be made with other cutblocks and the information pertains only to that particular cutblock. No inferences about the density in other cutblocks are possible.
Example B. Observational study. This same manager now notices that north facing slopes seem to have lower insect infestation rates than south facing slopes. One block from a north facing slope and one block from a south facing slope are selected. Sample plots are located on each cutblock, and the
insect infestation is measured on each sample plot. The response variable is the amount of infestation
in each plot. The orientation of the slope is an explanatory variable. Estimates of the mean infestation
are obtained for each block. The sample means for each block likely differ, but with information on
the variation within each block, it is possible to determine if there is evidence that the population
means also differ, i.e., to determine if there is evidence that the true average infestation in the two
blocks differs. This is an Observational Study as two convenient blocks were selected and compared.
However, the results are only applicable to the two blocks sampled and cannot be extrapolated to other
blocks, nor to the effects of north and south facing slopes. The reason for this weak inference is that
the observed differences between the blocks may be a result of just natural variation unrelated to the
direction of the slope; no information has been collected on the variability among blocks with the
same orientation.
Example C. Analytical Survey. The manager expands the above survey. Within the Forest Management Unit, blocks are randomly chosen in pairs so that within each pair, one block is on a north facing slope and the other is on a south facing slope. Sample plots are randomly located on each block, and the insect infestation is measured on each sample plot. The response variable is the amount of infestation in each plot. The orientation is an explanatory variable. Estimates of the mean infestation are obtained for each type of slope along with measures of precision. The manager then compares the two means using information on both the within-block variability and the variability among blocks with the same orientation. It may appear that plots on the south facing slope have a higher infestation than plots on a north facing slope. This is an Analytical Survey, as a comparison was made over an entire population of cutblocks in the Forest Management Unit. This differs from a controlled experiment in that the orientation of the cutblocks cannot be controlled by the manager. An alternate explanation for this observed result is that some other unknown factor caused the insect infestations to differ between the two orientations.
Example D. Designed Experiment. The manager is interested in testing the effect of two different types of fertilizer on regeneration growth. Experimental plots in several homogeneous cutblocks are established. Within each cutblock, plots are randomly assigned to one of the fertilizers. The regeneration growth of the plots treated with the two fertilizers is then compared. The response variable is the amount of growth; the explanatory variable is the fertilizer type. Because plots were randomly assigned to the fertilizers, the effects of any other, uncontrollable, lurking factor should, on average, be about equal in the two treatment groups, and consequently any difference in the mean regeneration growth can be attributed to the fertilizer. The primary differences between this example and Example C are that the manager has control over the explanatory factor and can randomly assign experimental units to treatments. These two differences in the protocol allow stronger inferences than in Analytical Studies.
Example E. Impact Study. The manager wishes to examine if clear cutting is changing the water quality in nearby streams. A control site in a provincial park is selected with similar soil and topography as the experimental site. Water quality readings are taken from both streams several times before harvesting, and several times after harvesting. The response variable is the water quality; the explanatory variable is the presence or absence of nearby clearcutting. The changes in water quality at the control and experimental sites are compared. If the objective is to examine if there is a difference in water quality between these two specific sites, then the study will answer the question. This is similar to the strength of inference for observational studies (Example B). If the objective is to extrapolate from this pair of sites to the effects of clearcutting in general, the inference is much more limited. First, there is no replication of the control or impacted sites, and so it is not possible to know if the observed differences are within the range of natural variation. This could be partly resolved by adding multiple control sites and assuming that the variability among control sites is representative of that among impact sites. However, the lack of randomization of the impact will still limit the extent to which the results can be generalized. But in the longer term, if there are several such pairs of sites, and all show the same type of impact, there are good grounds for assigning a causal relationship, even though randomization never took place. This would be based on the idea of a super-population consisting of all possible pairs of sites; it is not likely that unobservable, latent factors would be operating in the same direction in all experiments. This last form is the closest to a designed experiment for an impact study.
These five examples differ in two important dimensions:
1. the amount of control over the explanatory factor. In descriptive studies there is the least amount of
control while in designed experiments there is maximal control.
2. the degree of extrapolation to other settings. Again, in descriptive studies, inference is limited to
those surveyed populations while in designed experiments on randomly selected experimental units,
inference can be made about future effects of the explanatory factors.
In general, the more control or manipulation present in a study, the stronger the inferences that can be
made as shown in Figure 2 below:
Figure 2: Relationship between degree of control, strength of inference, and type of study design.
This chapter will present an overview of some of the issues that arise in studies lacking experimental manipulations. It will start with an overview of the Descriptive Surveys used to obtain basic information about a population. Observational Studies will not be explicitly addressed, as they are of such limited usefulness and their design and analysis are very close to those of Analytical Surveys. Next, Analytical Surveys will be discussed, where the goal is to compare subsets of an existing population. Impact Surveys will then be discussed, where a comparison is made between one site affected by some planned or unplanned event and a control site where no such event occurs. Finally, some general principles will be reviewed.
A nice comparison of the differences in how studies can be designed is presented in Figure 1.3 of Elzinga et al. (2001), which is adapted below. The context is how to detect a treatment effect from a prescribed burn. The notation Y_{1,t} indicates a measurement taken at year 1 after the burn at the treatment site; Y_{1,c} indicates a measurement taken 1 year before the burn at a control site; etc.
Type of design             Measurements before   Measurements after    Typical Problems?

No monitoring              none                  none                  Unable to tell anything.

Post only;                 none                  Y_{1,t}               No control; subsequent
no control                                       Y_{2,t}               changes may be temporal
                                                                       change. No replication.

Pre & Post;                Y_{1,t}               Y_{1,t}               No control; the difference
no control                                                             may be a temporal artefact.

Pre & Post;                Y_{1,t}               Y_{1,t}               Basic BACI design.
Control & Treat            Y_{1,c}               Y_{1,c}               Is the control unique?
                                                                       No replication.

Pre & Post;                Y_{1,t}               Y_{1,t}               Enhanced BACI design.
Control & Treat;           Y_{1,c1}              Y_{1,c1}              Bad timing?
replication of control     Y_{1,c2}              Y_{1,c2}

Pre & Post;                Y_{2,t}  Y_{1,t}      Y_{1,t}  Y_{2,t}      Enhanced BACI design.
Control & Treat;           Y_{2,c1} Y_{1,c1}     Y_{1,c1} Y_{2,c1}     Bare minimum for an
replication of control;    Y_{2,c2} Y_{1,c2}     Y_{1,c2} Y_{2,c2}     environmental impact
replicated times                                                       design.

Pre & Post;                Y_{2,t1} Y_{1,t1}     Y_{1,t1} Y_{2,t1}     Research design.
Control & Treat;           Y_{2,t2} Y_{1,t2}     Y_{1,t2} Y_{2,t2}     Stratification?
replication of control     Y_{2,c1} Y_{1,c1}     Y_{1,c1} Y_{2,c1}     Adequate power?
AND treatment              Y_{2,c2} Y_{1,c2}     Y_{1,c2} Y_{2,c2}     Pseudoreplication?
There are many excellent references on descriptive survey methods (Thompson, 1992; Cochran, 1977; Krebs, 1989), and so this section is limited to a brief account of the main survey methods that could be used in field research. Details on actual field procedures are also available, e.g., Myers and Shelton (1980).
29.1.1 Survey Methods
Simple Random Sampling
This is the basic method of selecting survey units. Each unit in the population is selected with equal probability and all possible samples are equally likely to be chosen. This is commonly done by listing all the members of the population and then sequentially choosing units using a random number table. Units are usually chosen without replacement, i.e., each unit in the population can only be chosen once. In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement, i.e., a unit in the population may potentially be selected more than once. The analysis of a simple random sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the population total is obtained by multiplying the sample mean by the number of units in the population. The sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds 20%, an adjustment (the finite population correction) will result in better estimates of precision (a reduction in the standard error) to account for the fact that a substantial fraction of the population was surveyed.
An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots, and a random sample of 24 plots was selected and analyzed using aerial photos.
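The estimators just described (sample mean, expanded total, and a standard error carrying the finite population correction) can be sketched in a few lines of Python. The function name and the simulated plot densities below are illustrative only, not from the text:

```python
import random
import statistics

def srs_estimate(values, N):
    """Estimate the population mean and total from a simple random
    sample drawn without replacement from a population of N units,
    applying the finite population correction (FPC) to the SE."""
    n = len(values)
    mean = statistics.mean(values)
    var = statistics.variance(values)      # sample variance s^2
    fpc = 1 - n / N                        # finite population correction
    se_mean = (var / n * fpc) ** 0.5
    total = N * mean                       # expansion estimator of the total
    se_total = N * se_mean
    return mean, se_mean, total, se_total

# Simulated densities (trees/ha) for 24 of the 480 one-hectare plots
random.seed(1)
sample = [random.gauss(120, 15) for _ in range(24)]
mean, se, total, se_t = srs_estimate(sample, N=480)
```

Here the sampling fraction is 24/480 = 5%, so the FPC barely changes the SE; once the fraction exceeds roughly 20%, the correction becomes worthwhile, as noted above.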
A simple random sample will spread sampling effort over the entire population and will ensure that the
sample selected is a representative sample of the population. All other variables will occur in the sample in
roughly the same proportion as in the original population.
A disadvantage of a simple random sample is primarily logistical: it may be impractical to locate one-m² vegetation plots in a large grassland area.
A simple random sample design is often hidden in the details of many other survey designs. For
example, many surveys of vegetation are conducted using strip transects where the initial starting point of
the transect is randomly chosen, and then every plot along the transect is measured. Here the strips are the
sampling unit, and are a simple random sample from all possible strips. The individual plots are subsamples
from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically analyzed section-by-section, but it would be incorrect to treat the smaller parts as a simple random sample from the entire stand.
Note that a crucial element of simple random samples is that every sampling unit is chosen independently of every other sampling unit. For example, in strip transects, plots along the same transect are not chosen independently: when a particular transect is chosen, all plots along the transect are sampled, and so the selected plots are not a simple random sample of all possible plots. Strip transects are actually examples of cluster samples.
Systematic Surveys
In some cases, it is logistically inconvenient to randomly select sample units from the population. An alternative is to take a systematic sample where every k-th unit is selected (after a random starting point); k is chosen to give the required sample size. For example, if a stream is 2 km long and 20 samples are required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point.
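Selecting a 1-in-k systematic sample is simple to sketch; the helper below is illustrative (unit indices stand in for positions along the stream or grid):

```python
import random

def systematic_sample(N, n):
    """Select roughly n of N units systematically: a random start
    within the first skip interval, then every k-th unit after it."""
    k = N // n                        # skip interval
    start = random.randrange(k)       # random start in 0, 1, ..., k-1
    return list(range(start, N, k))

random.seed(42)
units = systematic_sample(N=480, n=24)   # k = 20, so 24 units are chosen
```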
A common alternative when the population does not naturally divide into discrete units is grid-sampling. Here sampling points are located using a grid that is randomly located in the area. All sampling points are a fixed distance apart.
In the previous example, there were 480 sampling units. So if a sample size of 24 is the objective, approximately every 480/24 = 20th unit needs to be selected:
Note that in this example, there appears to be some feature of the population that matches our skip interval, so that samples frequently occur together (see below).
If the population self-randomizes, then a systematic sample will be equivalent to a simple random sample. For example, it seems unlikely that if you take a systematic survey of every 5th angling party in a creel survey, the anglers will organize themselves so that every 5th party has a higher than average catch rate. This assumption of self-randomization cannot be checked statistically and must be evaluated on biological grounds.
If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to directly influence the response variable. A regression-type correction can be incorporated into the analysis. However, note that this trend must be known from external sources; it cannot be deduced from the survey.
Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample. However, the true precision of an estimator from a systematic sample can be either worse or better than that of a simple random sample of the same size, depending on whether units within the systematic sample are positively or negatively correlated among themselves. For example, if a systematic sample's sampling interval happens to match a cyclic pattern in the population, values within the systematic sample are highly positively correlated (the sampled units may all hit the peaks of the cyclic trend), and the true sampling precision is worse than that of an SRS of the same size. What is even more unfortunate is that because the units are positively correlated within the sample, the sample variance will underestimate the true variation in the population, and if the estimated precision is computed using the formula for an SRS, a double dose of bias in the estimated precision occurs (Krebs, 1989, p. 227). On the other hand, if the systematic sample is arranged perpendicular to a known trend to try to incorporate additional variability into the sample, the units within a sample are now negatively correlated, the true precision is better than that of an SRS of the same size, but the sample variance now overestimates the population variance, and the formula for precision from an SRS will now overstate the sampling error. While logistically simpler, a systematic sample is only equivalent to a simple random sample of the same size if the population units are in random order to begin with (Krebs, 1989, p. 227). Even worse, there is no information in the systematic sample that allows the manager to check for hidden trends and cycles.
Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be made for the bias in the estimated precision:
- it is easier to relocate plots for long-term monitoring;
- mapping can be carried out concurrently with the sampling effort because the ground is systematically traversed;
- it avoids the problem of poorly distributed sampling units which can occur with an SRS (but this can also be avoided by judicious stratification).
Solution: Because of the necessity for a strong assumption of randomness in the original population, systematic samples are discouraged, and statistical advice should be sought before starting such a scheme. If there are no other feasible designs, a slight variation of the systematic sample provides some protection from the above problems. Instead of taking a single systematic sample of every k-th unit, take 2 or 3 independent systematic samples of every 2k-th or 3k-th unit, each with a different starting point. For example, rather than taking a single systematic sample every 100 m along the stream, two independent systematic samples can be taken, each selecting units every 200 m along the stream starting at two random starting points. The total sample effort is still the same, but now some measure of the large-scale spatial structure can be estimated. This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
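A minimal sketch of replicated sub-sampling, under the simplifying assumptions that the population is a list of N units and N divides evenly by the sample size; the spread of the replicate means supplies the SE that a single systematic sample cannot provide:

```python
import random
import statistics

def replicated_systematic(pop, n, r):
    """Take r independent systematic samples of n/r units each
    (skip interval r*k instead of k), with separate random starts.
    The variability of the r replicate means gives an honest SE."""
    N = len(pop)
    k = N // n                           # interval a single sample would use
    rep_means = []
    for _ in range(r):
        start = random.randrange(r * k)
        rep = [pop[i] for i in range(start, N, r * k)]
        rep_means.append(statistics.mean(rep))
    overall = statistics.mean(rep_means)
    se = (statistics.variance(rep_means) / r) ** 0.5
    return overall, se

random.seed(7)
pop = [random.gauss(50, 10) for _ in range(480)]   # simulated population
est, se = replicated_systematic(pop, n=24, r=2)
```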
Cluster sampling
In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds were the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot, while trees within the plot are sub-samples.
In the same population as above, the population divides itself into 60 clusters of size 8. A cluster sample of 24 ultimate units requires selecting 3 clusters at random from the population of 60 clusters, and then measuring all units within each of the selected clusters:
The reason cluster samples are used is that costs can be reduced compared to a simple random sample giving the same precision. Because units within a cluster are close together, travel costs among units are reduced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable simple random sample.
A confusing aspect of cluster sampling is the question of sample size. Rather than relying upon a single number to represent sample size, I prefer to state that the sample size is 3x8 rather than 24, to reinforce the idea that 3 clusters of size 8 were selected.
Cluster samples also arise in transect sampling. For example, suppose that transects are taken starting at the left margin of the map and running across the entire map. A sample of three (unequally sized) clusters might look like:
Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e., the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.
Solution: In order to be confident that the reported standard error really reflects the uncertainty of the estimate, it is important that the analytical methods are appropriate for the survey design. The proper analysis treats the clusters as a random sample from the population of clusters. The methods of simple random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12; Nemec, 1993).
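For equal-sized clusters, the proper analysis reduces to applying the SRS formulas to the cluster means. The sketch below mirrors the 3-of-60 clusters example, with made-up measurements:

```python
import statistics

def cluster_estimate(clusters, M):
    """One-stage cluster sample with equal-sized clusters: treat the
    cluster means as a simple random sample of m of the M clusters
    in the population (cf. Thompson, 1992, Chapter 12)."""
    m = len(clusters)
    cluster_means = [statistics.mean(c) for c in clusters]
    est = statistics.mean(cluster_means)
    fpc = 1 - m / M                       # finite population correction
    se = (statistics.variance(cluster_means) / m * fpc) ** 0.5
    return est, se

# Made-up data: 3 clusters of 8 units from a population of 60 clusters
clusters = [
    [12, 14, 13, 15, 14, 12, 13, 14],
    [20, 22, 21, 19, 20, 21, 22, 20],
    [16, 15, 17, 16, 15, 16, 17, 16],
]
est, se = cluster_estimate(clusters, M=60)
```

Pooling all 24 values and applying the SRS formula to them would give a much smaller, and misleading, SE here, because values within each cluster are similar.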
Multi-stage sampling
In many situations, there are natural divisions of the population into several different sizes of units. For
example, a forest management unit consists of several stands, each stand has several cutblocks, and each
cutblock can be divided into plots. These natural divisions can be easily accommodated in a survey through
the use of multi-stage methods. Selection of units is done in stages. For example, several stands could be
selected from a management area; then several cutblocks are selected in each of the chosen stands; then
several plots are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any
stage are selected at random only from those larger units selected in previous stages.
Referring back to the previous population of 480 units, a two-stage sample of ultimate units could be selected by choosing clusters at random from the population of 60 clusters, followed by selecting 2 units at random from each of the chosen clusters:
Note that the second stage of sampling can be done in any number of ways, e.g., a systematic sample could be taken within each of the first-stage units selected.
The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample of the same size, primarily through improved logistics. The precision of the results is less than that of an equivalent simple random sample, but because costs are less, a larger multi-stage survey can often be done for the same cost as a smaller simple random sample. This often results in a more precise design for the same cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred and end up being more cost efficient when the costs associated with incorrect decisions are incorporated.
Pitfall: Although random selections are made at each stage, a common error is to analyze these types of surveys as if they arose from a simple random sample. The plots were not independently selected; if a particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate than is justified by the survey.
Solution: Again, it is important that the analytical methods are suitable for the sampling design. The proper analysis of multi-stage designs takes into account that random sampling takes place at each stage (Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by the number of first-stage units selected. Little is gained by extensive sampling at lower stages.
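A sketch of a standard two-stage variance for equal-sized primary units with simple random sampling at both stages (n of N primaries, m of M secondaries within each); the data are invented. The first term, driven by the between-primary variance, usually dominates when n/N is small, which is why heavy sampling at lower stages gains little:

```python
import statistics

def two_stage_estimate(primaries, N, M):
    """Estimate the population mean from a two-stage sample of
    equal-sized primary units.  The variance combines a
    between-primary and a within-primary component."""
    n = len(primaries)
    m = len(primaries[0])
    means = [statistics.mean(p) for p in primaries]
    s2_b = statistics.variance(means)                                    # between primaries
    s2_w = statistics.mean([statistics.variance(p) for p in primaries])  # within primaries
    var = (1 - n / N) * s2_b / n + (n / N) * (1 - m / M) * s2_w / (n * m)
    return statistics.mean(means), var ** 0.5

# Invented data: 3 of 60 clusters sampled, 2 of the 8 units measured in each
primaries = [[12, 15], [20, 19], [16, 17]]
est, se = two_stage_estimate(primaries, N=60, M=8)
```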
Multi-phase designs
In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of units is selected (usually by a simple random sample). Every unit is measured on some variable. Then, in subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the entire population.
For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second-phase sample of 6 units (circled below) is then measured using ground-based methods.
Multi-phase designs are commonly used in two situations. First, it is sometimes difficult to stratify a population in advance because the values of the stratification variables are not known. The first phase is used to measure the stratification variable on a random sample of units. The selected units are then stratified, and further samples are taken from each stratum as needed to measure a second variable. This avoids having to measure the second variable on every unit when the strata differ in importance. For example, in the first phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the amount of damage, and the second-phase allocation of units concentrates on plots with low insect damage to measure the total usable volume of wood. It would be wasteful to measure the volume of wood on plots with much insect damage.
The second common occurrence is when it is relatively easy to measure a surrogate variable (related to
the real variable of interest) on selected units, and then in the second phase, the real variable of interest is
measured on a subset of the units. The relationship between the surrogate and desired variable in the smaller
sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example,
managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging
trucks is weighed (which is easy to do), and weight will serve as a surrogate variable for volume. A smaller
sample of trucks (selected from those weighed) is scaled for volume, and the relationship between volume
and weight from the second phase sample is used to predict volume based on weight only for the first phase
sample. Another example is the count plot method of estimating volume of timber in a stand. A selection of
plots is chosen and the basal area determined. Then a sub-selection of plots is rechosen in the second phase,
and volume measurements are made on the second phase plots. The relationship between volume and area
in the second phase is used to predict volume from the area measurements made in the first phase.
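The weight-volume adjustment described above can be sketched as a classical ratio estimator. All truck weights and volumes below are invented for illustration.

```python
from statistics import mean

# Phase 1: a large sample of logging trucks, weight only (tonnes).
phase1_weights = [24.1, 26.5, 23.8, 27.2, 25.0, 24.6, 26.1, 25.5]

# Phase 2: a subset of those trucks is also scaled for volume (m^3).
phase2 = [(24.1, 30.5), (27.2, 34.8), (25.0, 31.9)]  # (weight, volume)

# Ratio of volume to weight from the phase-2 subsample...
r = sum(v for _, v in phase2) / sum(w for w, _ in phase2)

# ...applied to the mean weight of the full phase-1 sample.
est_mean_volume = r * mean(phase1_weights)
print(f"volume/weight ratio = {r:.3f}")
print(f"estimated mean volume per truck = {est_mean_volume:.1f} m^3")
```

The ratio estimator gains precision over using the phase-2 volumes alone whenever volume is roughly proportional to weight, because the cheap phase-1 weights pin down the mean of the surrogate very precisely.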
Summary comparison of designs
The following is a quick summary of the common sampling designs to be illustrated in this chapter based on
Table 8.1 of Elzinga et al. (2001).
Design: SRS
Recommended uses: Small areas with homogeneous sampling units.
Advantages: Easy to analyze.
Disadvantages: Logistically difficult for many surveys.

Design: Stratified
Recommended uses: Stratification variable is apparent. Units in strata are homogeneous.
Advantages: More efficient than unstratified. Methods can be tailored to strata. HIGHLY recommended!

Design: Systematic
Recommended uses: Population self-randomizes.
Advantages: Easy to implement. Same efficiency as SRS when population self-randomizes.
Disadvantages: Watch for cycles in population.

Design: Cluster
Recommended uses: Individual (ultimate) survey units not available. Units naturally come in clusters.
Advantages: Cheaper than sampling individual units.
Disadvantages: Inefficient if units in cluster are highly correlated. Tradeoff between number of clusters and cluster size. Caution in analysis!

Design: Two-stage
Recommended uses: Individual (ultimate) survey units not available. Units naturally come in clusters. Units highly correlated in clusters or cluster too large to enumerate.
Advantages: Cheaper than sampling individual units.
Disadvantages: Complex analysis - seek help!

Design: Two-phase
Recommended uses: Easy to measure auxiliary variable.
Advantages: Highly efficient if auxiliary variable is related to response.
Disadvantages: Complex analysis - seek help!
29.1.2 Permanent or temporary monitoring stations
One common objective of long-term studies is to investigate changes over time of a particular population.
This will involve repeated sampling from the population. There are three common designs.
First, separate independent surveys (temporary monitoring plots) can be conducted at each time point.
This is the simplest design to analyze because all observations are independent over time. For example,
independent surveys can be conducted at five-year intervals to assess regeneration of cutblocks. However,
precision of the estimated change may be poor because of the additional variability introduced by having
new units sampled at each time point.
At the other extreme, units are selected in the first survey and the same units are remeasured over time.
For example, permanent study plots can be established that are remeasured for regeneration over time. The
advantage of permanent study plots is that comparisons over time are free of additional variability introduced
by new units being measured at every time point. One possible problem is that survey units may become
damaged over time, and so the sample size will tend to decline. An analysis of these types of
designs is more complex because of the need to account for the correlation over time of measurements on
the same sample plot and the need to account for possible missing values when units become damaged and
are dropped from the study.
Intermediate to the above two designs are partial replacement designs (panel designs) where a portion
of the survey units are replaced with new units at each time point. For example, 1/5 of the units could be
replaced by new units at each time point - units would normally stay in the study for a maximum of 5 time
periods. The analysis of these types of designs must account for both the repeated measurements on the
same plots and the replacement of plots over time - seek help for the analysis of such designs.
29.1.3 Refinements that affect precision
Any of the following refinements to a sampling design may be considered for any of the designs listed above.
Stratification
All survey methods can potentially benefit from stratification (also known as blocking in the experimental
design literature). It is the most common method used to improve precision and can often be implemented
at little or no additional cost.
Stratification begins by grouping survey units into homogeneous groups before conducting the survey.
Some decision needs to be made on how to allocate the total sampling effort among the strata. Depending
upon the goals of the survey, an optimal allocation of sampling units can be one that is equal in all strata,
one that is proportional to the stratum size, or one that is related to the cost of sampling in each stratum
(Thompson, 1992, Chapter 11). Equal allocation (where all strata have the same sample size) is preferred
when equally precise estimates are required for each stratum as well as for the overall population. Proportional
allocation (where the sample size in each stratum is proportional to the population size) is preferred when
more precise estimates are required in larger strata. If the costs of sampling vary among the strata, then an
optimal allocation that accounts for costs would try to obtain the best overall precision at the lowest cost by
allocating units among the strata according to the costs of sampling in each stratum.
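The three allocation rules can be sketched as follows; the stratum sizes, anticipated standard deviations, and relative costs are invented for illustration, and the cost-adjusted rule is the usual Neyman-type allocation with n_h proportional to N_h * S_h / sqrt(c_h).

```python
# Hypothetical strata: population sizes, anticipated SDs, per-unit costs.
N = {"low": 500, "mid": 300, "high": 200}   # stratum sizes
S = {"low": 4.0, "mid": 8.0, "high": 12.0}  # anticipated std. deviations
c = {"low": 1.0, "mid": 1.0, "high": 4.0}   # relative sampling costs
n_total = 60

# Equal allocation: same sample size in every stratum.
equal = {h: n_total // len(N) for h in N}

# Proportional allocation: n_h proportional to stratum size N_h.
tot = sum(N.values())
prop = {h: round(n_total * N[h] / tot) for h in N}

# Cost-adjusted (Neyman-type): n_h proportional to N_h * S_h / sqrt(c_h),
# so variable and cheap strata receive more effort.
weight = {h: N[h] * S[h] / c[h] ** 0.5 for h in N}
wtot = sum(weight.values())
optimal = {h: round(n_total * weight[h] / wtot) for h in N}

print("equal       :", equal)
print("proportional:", prop)
print("cost-optimal:", optimal)
```

Note how the cost-adjusted rule shifts effort toward the variable "mid" stratum and away from the expensive "high" stratum relative to proportional allocation.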
Then a separate, independent survey is conducted in each stratum. Different survey methods may be used
in each stratum; hence stratification allows matching a survey protocol to features of the survey, e.g. aerial
surveys in low density strata, but ground surveys in high density strata.
Finally, at the end of the survey, the stratum results are combined and weighted appropriately. For example,
a watershed might be stratified by elevation into three strata, and separate surveys are conducted within
each elevation stratum. The separate results would be weighted proportional to the size of the elevation
strata. Stratification will be beneficial whenever variability among the sampling units can be anticipated and
strata can be formed that are more homogeneous than the original population.
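Combining the stratum results can be sketched as a weighted mean, with weights equal to the relative stratum sizes; because the stratum surveys are independent, their standard errors combine in quadrature. All numbers are invented for illustration.

```python
# Hypothetical watershed stratified by elevation; N is the number of
# sampling units (e.g. plots) in each elevation band.
strata = {
    "low":  {"N": 600, "mean": 42.0, "se": 2.1},
    "mid":  {"N": 300, "mean": 35.5, "se": 3.0},
    "high": {"N": 100, "mean": 21.0, "se": 4.2},
}
N_total = sum(s["N"] for s in strata.values())

# Overall estimate: stratum means weighted by relative stratum size W_h.
overall = sum(s["N"] / N_total * s["mean"] for s in strata.values())

# SE of the combined estimate: sqrt(sum of (W_h * se_h)^2), valid
# because the stratum surveys are independent.
se = sum((s["N"] / N_total * s["se"]) ** 2 for s in strata.values()) ** 0.5

print(f"overall mean = {overall:.2f}, SE = {se:.2f}")
```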
Consider the population examined earlier. Suppose that the population can be divided into three strata
and a different survey method can be used in each stratum:
In the first (topmost) stratum, a simple random sample was taken; in the second stratum a cluster sample
was taken; in the third stratum a cluster sample (via transects) was also taken.
Stratification can be carried out prior to the survey (pre-stratification) or after the survey (post-stratification).
Pre-stratification is used if the stratum variable is known in advance for every plot (e.g. elevation of a plot).
Post-stratification is used if the stratum variable can only be ascertained after measuring the plot, e.g. soil
quality or soil pH. The advantages of pre-stratification are that samples can be allocated to the various strata
in advance to optimize the survey and the analysis is relatively straightforward. With post-stratification,
there is no control over the sample size in each of the strata, and the analysis is more complicated (the problem
is that the sample sizes in each stratum are now random). Post-stratification can result in significant gains
in precision but does not allow the finer control of sample sizes found in pre-stratification.
Auxiliary variables
An association between the measured variable of interest and a second variable (a surrogate) can
be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related
to soil nitrogen content. A simple random sample of plots is selected, and the height of trees in each sample
plot is measured along with the soil nitrogen content in the plot. A regression model is fit (Thompson, 1992,
Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function
of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the
soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship
between the two variables; the stronger the relationship, the better it will perform. This technique is
often called ratio-estimation or regression-estimation.
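A minimal sketch of the regression estimator, assuming the population mean of the auxiliary variable (soil nitrogen) is known from an inexpensive wall-to-wall measurement; all plot values are invented for illustration.

```python
from statistics import mean

# Hypothetical SRS of plots: (soil nitrogen, mean tree height).
plots = [(2.1, 11.9), (3.4, 14.8), (1.8, 10.7), (2.9, 13.5), (2.5, 12.6)]
x = [p[0] for p in plots]
y = [p[1] for p in plots]

# Least-squares slope of height on nitrogen.
xbar, ybar = mean(x), mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in plots) / \
    sum((xi - xbar) ** 2 for xi in x)

# Regression estimator: adjust the sample mean height using the known
# population mean nitrogen content X_bar (assumed value for this sketch).
X_bar = 2.8
est_height = ybar + b * (X_bar - xbar)
print(f"slope = {b:.2f}, estimated mean height = {est_height:.2f}")
```

When the sampled plots happen to have below-average nitrogen (xbar < X_bar), the estimator adjusts the mean height upward, borrowing strength from the relationship between the two variables.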
Notice that multi-phase designs often use an auxiliary variable but this second variable is only measured
on a subset of the sample units.
Sampling with unequal probability
All of the designs discussed in previous sections have assumed that each sample unit was selected with equal
probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they
differ in their contribution to the overall total. This technique can be used with any of the sampling designs
discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. greater
precision) for the same total effort compared to an equal probability design. For example, forest stands may
be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected
with twice the probability of a stand of 100 ha) because large stands contribute more to the overall
population, and it would be wasteful of sampling effort to spend much effort on smaller stands.
The variable used to assign the probabilities of selection to individual study units does not need to have
an exact relationship with an individual unit's contribution to the total. For example, in probability proportional
to prediction (3P sampling), all trees in a small area are visited. A simple, cheap characteristic is measured
which is used to predict the value of the tree. A sub-sample of the trees is then selected with probability
proportional to the predicted value, remeasured using a more expensive measuring device, and the relationship
between the cheap and expensive measurement in the second phase is used with the simple measurement
from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase
sampling with unequal probability of selection.
Unequal probability sampling designs are most commonly used when the sampling units are land-based
polygons, but are rarely used in other areas of ecology.
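A sketch of estimation under probability-proportional-to-size selection, using the Hansen-Hurwitz form for a with-replacement PPS sample: each sampled value is divided by its selection probability and the results are averaged. The stand areas and volumes are invented for illustration.

```python
# Hypothetical population of forest stands: (area in ha, timber volume).
# In practice the volume is unknown and measured only on sampled stands.
stands = [(200, 5000), (100, 2300), (150, 3800), (50, 1100), (250, 6500)]
total_area = sum(a for a, _ in stands)

# Selection probability for a single PPS draw: p_i = area_i / total_area.
# Suppose a PPS (with replacement) sample drew stands 0, 4, and 2.
sample = [stands[0], stands[4], stands[2]]

# Hansen-Hurwitz estimator of the total volume:
# average over draws of (volume_i / p_i).
est_total = sum(v / (a / total_area) for a, v in sample) / len(sample)
print(f"estimated total volume = {est_total:.0f}")
```

Each term v_i/p_i is an unbiased estimate of the total on its own; averaging the draws reduces the variance, and the variance is small when volume is nearly proportional to area (so v_i/p_i is nearly constant).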
29.1.4 Sample size determination
A common problem in survey design is the choice of sample size. This is an important question because
the sample size is the primary determinant of both the cost of the survey and its precision. The sample size
should be chosen so that the final estimates have a precision that is adequate for the management question.
Paradoxically, in order to determine the proper sample size, some estimate of the population values needs
to be known before the survey is conducted! Historical data can sometimes be used. In some cases, pilot
studies will be needed to obtain preliminary estimates of the population values to plan the main survey.
[Pilot studies are also useful to test the protocol - refer to the conclusion for more advice on pilot studies.]
Unfortunately, sometimes even pilot studies cannot be done because of difficulty in sampling or because the
phenomenon is a one-time event. If there are multiple objectives, it may also be difficult to reconcile the
sample size requirements of each objective.
In these and many other cases, sample sizes are determined solely by the budget for the survey, i.e.
sample size = budget/cost per sample.
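The usual planning calculation, assuming a pilot estimate of the standard deviation is available, is sketched below; the finite population correction shows how a small population reduces the required sample size.

```python
import math

def sample_size(sd_pilot, margin, z=1.96, N=None):
    """Sample size to estimate a mean to within +/- margin (approx. 95%).

    sd_pilot: anticipated standard deviation (from a pilot study or
    historical data); N: population size, used for the finite
    population correction when given."""
    n0 = (z * sd_pilot / margin) ** 2
    if N is not None:                     # finite population correction
        n0 = n0 / (1 + n0 / N)
    return math.ceil(n0)

# Pilot study suggests sd = 15; we want the mean to within +/- 5 units.
print(sample_size(15, 5))             # large (effectively infinite) population
print(sample_size(15, 5, N=100))      # small population reduces the n needed
```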
29.2 Analytical surveys
In descriptive surveys, the objective was to simply obtain information about one large group. In observational
studies, two deliberately selected sub-populations are selected and surveyed, but no attempt is made
to generalize the results to the whole population. In analytical studies, sub-populations are selected and
sampled in order to generalize the observed differences among the sub-populations to this and other similar
populations.
As such, there are similarities between analytical and observational surveys and experimental design.
The primary difference is that in experimental studies, the manager controls the assignment of the explanatory
variables while measuring the response variables, while in analytical and observational surveys, neither
set of variables is under the control of the manager. [Refer back to Examples B, C, and D in Section 1.] The
analysis of complex surveys for analytical purposes can be very difficult (Kish, 1987; Kish, 1984; Rao, 1973;
Sedransk, 1965a, 1965b, 1966).
As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables
(similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided
into three categories depending on the type of stratification:
- the population is pre-stratified by the explanatory variables, and surveys are conducted in each stratum
to measure the outcome variables;
- the population is surveyed in its entirety and post-stratified by the explanatory variables;
- the explanatory variables are used as auxiliary variables in ratio or regression methods.
[It is possible that all three types of stratification take place - these are very complex surveys.]
The choice between the categories is usually made by the ease with which the population can be
pre-stratified and the strength of the relationship between the response and explanatory variables. For example,
sample plots can be easily pre-stratified by elevation or by exposure to the sun, but it would be difficult to
pre-stratify by soil pH.
Pre-stratification has the advantage that the manager has control over the number of sample points
collected in each stratum, whereas in post-stratification, the numbers are not controllable and may lead to very
small sample sizes in certain strata just because they form only a small fraction of the population.
For example, a manager may wish to investigate the difference in regeneration (as measured by the
density of new growth) as a function of elevation. Several cutblocks will be surveyed. In each cutblock, the
sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in
each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal.
The density of new growth will be measured on each selected sample plot. On the other hand, suppose that
the regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must
take a simple random sample over the entire stand, measure the density of new growth and the soil pH at
each sampling unit, and then post-stratify the data based on the measured pH. The number of sampling units
in each pH class is not controllable; indeed, it may turn out that certain pH classes have no observations.
If explanatory variables are treated as auxiliary variables, then there must be a strong relationship
between the response and explanatory variables, and the auxiliary variable must be able to be measured precisely
for each unit. Then, methods like multiple regression can also be used to investigate the relationship between
the response and the explanatory variable. For example, rather than classifying elevation into three broad
elevation classes or soil pH into broad pH classes, the actual elevation or soil pH must be measured precisely
to serve as an auxiliary variable in a regression of regeneration density vs. elevation or soil pH.
If the units have been selected using a simple random sample, then the analysis of the analytical surveys
proceeds along similar lines as the analysis of designed experiments (Kish, 1987; also refer to Chapter
2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a
hypothetical super-population of which the current conditions are just one realization. In the above example,
cutblocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample
plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on
mean density of regeneration occurs as if this were a planned experiment.
Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for
analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were
collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial
underestimates of the true standard error, i.e., the precision is thought to be far greater than is justied
based on the survey results. Consequently the manager may erroneously detect differences more often than
expected (i.e., make a Type I error) and make decisions based on erroneous conclusions.
Solution: As in experimental design, it is important to match the analysis of the data with the survey
design used to collect it. The major difficulties in the analysis of analytical surveys are:
1. Recognizing and incorporating the sampling method used to collect the data in the analysis. The
survey design used to obtain the sampling units must be taken into account in much the same way as
the analysis of the collected data is influenced by the actual experimental design. A table of equivalences
between terms in a sample survey and terms in experimental design is provided in Table 1.
Table 1: Equivalences between terms used in surveys and in experimental design.

Simple Random Sample: Completely randomized design.

Cluster Sampling: (a) clusters are random effects, with units within a cluster treated as sub-samples; or
(b) clusters are treated as main plots, with units within a cluster treated as sub-plots in a split-plot analysis.

Multi-stage sampling: (a) nested designs with units at each stage nested in units at higher stages, and the
effects of units at each stage treated as random effects; or (b) split-plot designs with factors operating at
higher stages treated as main-plot factors and factors operating at lower stages treated as sub-plot factors.

Stratification: Fixed factor or random block depending on the reasons for stratification.

Sampling Unit: Experimental unit or treatment unit.

Sub-sample: Sub-sample.
There is no quick, easy method for the analysis of complex surveys (Kish, 1987). The super-population
approach seems to work well if the selection probabilities of each unit are known (these are used to
weight each observation appropriately) and if random effects corresponding to the various strata or
stages are employed. The major difficulty caused by complex survey designs is that the observations
are not independent of each other.
2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors).
This typically occurs if post-stratification is used to classify units by the explanatory variables
but can also occur in pre-stratification if the manager decides not to allocate equal effort in each stratum.
The analysis of unbalanced data is described by Milliken and Johnson (1984).
3. Missing cells, i.e., certain combinations of explanatory variables may not occur in the survey. The
analysis of such surveys is complex, but refer to Milliken and Johnson (1984).
4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside
of the observed range is not recommended.
More sophisticated techniques can also be used in analytical surveys. For example, correspondence
analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for
post-hoc associations among measured variables that may give rise to hypotheses for further investigation.
Unfortunately, most of these methods assume that units have been selected independently of each other
using a simple random sample; extensions where units have been selected via a complex sampling design
have not yet been developed. Simpler designs are often strongly preferred to avoid erroneous conclusions
based on inappropriate analysis of data from complex designs.
Pitfall: While the analysis of analytical surveys and designed experiments are similar, the strength of the
conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship
in an analytical survey may be the result of a common response to a third, unobserved variable. For example,
consider the two following experiments. In the first experiment, the explanatory variable is elevation (high or
low). Ten stands are randomly selected at each elevation. The amount of growth is measured, and it appears
that stands at higher elevations have less growth. In the second experiment, the explanatory variable is
the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The
amount of growth is measured, and it appears that stands that receive the higher dose of fertilizer have greater
growth. In the first experiment, the manager is unable to say whether the differences in growth are a result
of differences in elevation, amount of sun exposure, or soil quality, as all three may be highly related. In
the second experiment, all uncontrolled factors are present in both groups and their effects will, on average,
be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor
that differs (on average) among the groups.
As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of the techniques
for survey sampling when conducting analytical surveys. Otherwise, they are likely to be subject to biases
of one sort or another. Experience and judgment are very important in evaluating the prospects for bias and
in attempting to find ways to control and account for these biases. The most common source of bias is the
selection of survey units, and the most common pitfall is to select units based on convenience rather than
on a probabilistic sampling design. The potential problems that this can lead to are analogous to those that
occur when it is assumed that callers to a radio phone-in show are representative of the entire population.
29.3 Impact Studies
Probably the most important and controversial use of surveys is to investigate the effects of large scale,
potentially unreplicated events. These are commonly referred to as impact studies where the goals of the
study are to investigate the impact of an event or process. In many cases, this must be done without having
the ability or resources to conduct a planned experiment.
Consider three examples: the impact of a hydroelectric dam on water quality of the dammed stream; the
impact of clearcuts on water quality of nearby streams; and the effect of different riparian zone widths along
streams near clearcuts. First, randomization and replication are not possible in the first example. Only one
dam will be built on one stream. In the other two examples, it is conceivably possible to randomize and
replicate the experiment, and so the principles of experimental design may be useful. Second, the impact in
the first two examples can be compared to a control or non-treated site, while in the last example, comparisons
are made between impacts: the two different riparian zone widths.
Regardless of the control over randomization and replication, the goal of impact studies is typically to
measure ecological characteristics (usually over time) to look for evidence of a difference (impact) between
the two sites. Presumably, this will be attributed to the event, but as shown later, the lack of replication and
randomization may limit the generalizability of the findings. Then, based on the findings, remediation or
changes in future events will be planned. In all cases, the timing of the event must be known in advance so
that baseline information can be collected.
A unifying example for this section will be an investigation of the potential effects of clearcuts on water
quality of nearby streams. Several, successively more complex, impact designs will be considered.
29.3.1 Before/After contrasts at a single site
This is the simplest impact design. A single survey is taken before and after a potential disturbance. This
design is widely used in response to obvious accidental incidents of potential impact (e.g. oil spills, forest
fires), where, fortuitously, some prior information is available. From this study, the manager obtains a single
measurement of water quality before and after the event. If the second survey reveals a change, it is
attributed to the event.
Pitfall: There may be no relationship between the observed event and the changes in the response
variable; the change may be entirely coincidental. Even worse, there is no information collected on the
natural variability of the water quality over time, and the observed differences may simply be due to natural
fluctuations over time. Decisions based on this design are extremely hard to justify. This design cannot be
used if the event cannot be planned and there is no prior data. In these cases, there is little that can be said
about the impact of the event.
29.3.2 Repeated before/after sampling at a single site.
An embellishment on the previous sampling scheme is to perform multiple surveys of the stream at multiple
time points before and after the event. In this design, information is collected on the mean water quality
before and after the impact. As well, information is collected on the natural variability over time. This
design is better than the previous design in that observed changes due solely to natural fluctuations over time
can be ruled out, and consequently any observed change in the mean level is presumably real.
The choice between regular intervals and random intervals depends upon the objectives of the study. If
the objective is to detect changes in trend, regularly spaced intervals are preferred because the analysis is
easier. On the other hand, if the objective is to assess differences before and after impact, then samples at
random time points, rather than on a fixed schedule, are advantageous because no cyclic differences unforeseen
by the sampler will influence the size of the difference between the before and after periods. For example,
surveys taken every summer for a number of years before and after the clearcut may show little difference
in water quality, but there may be great differences in the winter that go undetected.
Pitfall: Despite repeated surveys, this design suffers from the same flaw as the previous design. The
repeated surveys are pseudo-replication in time rather than real replicates (Hurlbert, 1984). The observed
change may have occurred regardless of the clearcut as a consequence of long-term trends over time. Again,
decisions based on this design are difficult to justify.
29.3.3 BACI: Before/After and Control/Impact Surveys
As Green (1979) pointed out, an optimal impact survey has several features:
- the type of impact, time of impact, and place of occurrence should be known in advance;
- the impact should not have occurred yet;
- control areas should be available.
The first feature allows the surveys to be efficiently planned to account for the probable change in the
environment. The second feature allows a baseline study to be established and to be extended as needed. The
last feature allows the surveyor to distinguish between temporal effects unrelated to the impact and changes
related to the impact.
The simplest BACI design will have two times of sampling (before and after impact) in two areas (a
treatment and a control) with measurements on biological and environmental variables in all combinations of
time and area. In this example, two streams would be sampled. One stream would be adjacent to the clearcut
(the treatment stream); the second stream would be adjacent to a control site that is not clearcut and should
have similar characteristics to the treatment stream and be exposed to similar climate and weather. Both
streams are sampled at the same time points before the clearcut occurs and at the same time points after the
clearcut takes place. Technically, this is known as an area-by-time factorial design, and evidence of an impact
is found by comparing the before and after samples for the control site with the before and after samples for
the treatment site. This contrast is technically known as the area-by-time interaction, as shown in Figure 3
below:
Figure 3. Simplified outcomes in a BACI design.
This design allows for both natural stream-to-stream variation and coincidental time effects. If there is
no effect of the clearcut, then the change in water quality between the two time points should be the same
for both streams, i.e. parallel lines in Figures 3a and 3b. On the other hand, if there is an impact, the time
trends will not be parallel (Figures 3c-3e).
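The area-by-time interaction contrast can be sketched directly: compute the before-to-after change in each stream, then difference the changes. The water-quality readings below are invented for illustration.

```python
from statistics import mean

# Hypothetical water-quality readings (e.g. turbidity) at several time
# points before and after the clearcut, for control and impact streams.
control = {"before": [4.1, 3.8, 4.3], "after": [4.5, 4.2, 4.6]}
impact  = {"before": [4.0, 4.2, 3.9], "after": [6.1, 5.8, 6.3]}

# Change over time in each stream.
d_control = mean(control["after"]) - mean(control["before"])
d_impact = mean(impact["after"]) - mean(impact["before"])

# The area-by-time interaction: how much more the impact stream changed
# than the control stream. Near zero implies parallel lines (no impact).
baci = d_impact - d_control
print(f"control change = {d_control:.2f}, impact change = {d_impact:.2f}")
print(f"BACI interaction contrast = {baci:.2f}")
```

Here both streams drift upward a little (the coincidental time effect), but the impact stream changes far more; the interaction contrast isolates that excess change.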
Pitfalls: Hurlbert (1984), Stewart-Oaten, Murdock, and Parker (1986), and Underwood (1991) discuss
the simple BACI design and point out concerns with its application.
First, because the impact was not randomly assigned to sites, it is possible that any observed difference
between the control and impact sites is related solely to some other factor that differs between the two sites.
One could argue that it is unfair to ascribe the effect to the impact. However, as pointed out by Stewart-Oaten
et al. (1986), the survey is concerned about a particular impact in a particular place, not the average of the
impact when replicated in many different locations. Consequently, it may be possible to detect a difference
between these two specific sites, but without randomization of replicate treatments at many different sites, it
is not possible to generalize the findings from this study to other events on different streams.
This concern can be reduced by monitoring several control sites (Underwood, 1991). Then, assuming
that the variation in the (After-Before) measurements of the multiple control sites is the same as the variation
among potentially impacted sites, and assuming that the changes over time at the control sites are not
correlated with each other, it is possible to assess whether the difference observed at the impacted site is plausible in light of
the observed variability in the changes at the control sites. In our example, several control streams could be
monitored at the same time points as the single impact stream. Then, if the observed change in the impact
stream is much larger than could be expected based on the multiple control streams, the event is said to
have caused an impact. When several control sites are monitored, there is less concern about the lack of
randomization because the replicated control sites provide some information about the potential effects of other
factors.
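The comparison just described can be sketched with hypothetical numbers: compute the (After-Before) change at each control stream and ask whether the impact stream's change is extreme relative to that spread:

```python
from statistics import mean, stdev

# (After - Before) changes at several control streams (hypothetical),
# and the change observed at the single impacted stream.
control_changes = [-1.2, 0.5, -0.3, 0.8, -0.6]
impact_change = -4.0

# Is the impact stream's change plausible given control variability?
# A crude t-like statistic: distance from the control mean, in control SDs.
m, s = mean(control_changes), stdev(control_changes)
t_like = (impact_change - m) / s
print(f"control mean={m:.2f}, sd={s:.2f}, t-like={t_like:.2f}")
```

A change many control-stream standard deviations away from the control mean is implausible under "no impact"; a formal analysis would use the appropriate reference distribution rather than this rough standardization.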
The second and more serious concern with the simple Before-After design with a single sampling point
before and after the impact is that it fails to recognize that there may be natural fluctuations in the
characteristic of interest that are unrelated to any impact (Hurlbert, 1984; Stewart-Oaten et al., 1986). For example,
consider Figure 4 below:
Figure 4: Problems with the simple BACI design
The change in a measured variable from two sampling occasions (dots at Before and After the impact) in
the control (solid line) or impacted (dashed line) sites. In (a), there is little natural variation in the response
over time and so the measured values indicate a change in the mean level. In (b) and (c), natural variation
is present, but because only one point was sampled before and after impact, it is impossible to distinguish
between no impact (b) and impact (c) on the mean level.
If there were no natural fluctuations over time, the single samples before and after the impact would be
sufficient to detect the effects of the impact. However, if the population also has natural fluctuations over and
above the long-term average, then it is impossible to distinguish between cases where there is no effect and
those where there was an impact. In terms of our example, differences in the water quality may be artifacts of
the sampling dates, and natural fluctuations may obscure differences or lead one to believe differences are
present when they are not.
29.3.4 BACI-P: Before/After and Control/Impact - Paired designs
Stewart-Oaten et al. (1986) extended the simple BACI design by pairing surveys at several selected time
points before and after the impact. Both sites are measured at the same time points. An analysis of how the
difference between the control and impact sites changes over time would reveal whether an impact has occurred, as
shown in Figure 5 below:
Figure 5: The BACI-P design
The change in a measured variable from multiple randomly chosen sampling occasions (dots at Before
and After the impact) in the control (solid line) or impacted (dashed line) sites. In (a), there is no impact and
the mean level of the difference (bottommost line) is constant over time. In (b), there is an impact, and the
mean level of the difference (bottommost line) changes over time.
The rationale behind the design is that repeated sampling before the development gives an indication of
the pattern of differences over several periods of potential change between the two sites. This study design
provides information both on the mean difference in the water quality before and after impact, and on the
natural variability of the water quality measurements. If the changes in the mean difference are large relative
to natural variability, the manager has detected an effect.
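A minimal sketch of this analysis (with hypothetical data): form the control-minus-impact difference at each paired time point, then compare the mean difference before vs. after with a two-sample t statistic:

```python
from statistics import mean, stdev
from math import sqrt

# Paired surveys: (control, impact) measurements at matched time points
# (hypothetical data). The analysis works on the per-time differences.
before = [(10.1, 9.8), (11.0, 10.6), (9.5, 9.2), (10.4, 10.1)]
after  = [(10.3, 7.9), (9.9, 7.6), (10.8, 8.3)]

d_before = [c - i for c, i in before]   # control minus impact, pre-impact
d_after  = [c - i for c, i in after]    # control minus impact, post-impact

# Two-sample t statistic comparing the mean difference before vs. after
# (pooled-variance form); a shift in the mean difference signals an impact.
n1, n2 = len(d_before), len(d_after)
sp2 = ((n1 - 1) * stdev(d_before) ** 2 + (n2 - 1) * stdev(d_after) ** 2) / (n1 + n2 - 2)
t = (mean(d_after) - mean(d_before)) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(f"mean diff before={mean(d_before):.2f}, after={mean(d_after):.2f}, t={t:.2f}")
```

In this hypothetical data set the control-minus-impact difference jumps after the impact, which a large t statistic flags; with no impact, the before and after mean differences would be similar.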
The decision between random and regularly spaced intervals has been discussed in an earlier section -
the same considerations apply here.
Pitfall: As with all studies, numerous assumptions need to be made during the analysis (Stewart-Oaten,
Bence, and Osenberg, 1992; Smith, Orvos, and Cairns, 1992). The primary assumption is that the responses
over time are independent of each other. A lack of independence over time tends to produce false positives
(Type I errors), where the manager may declare that an impact has occurred when, in fact, none has. In these
cases, formal time series methods may be necessary (Rasmussen et al., 1993). [The analysis of time series is
easiest with regularly spaced sampling points.]
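A quick diagnostic for this independence assumption is the lag-1 autocorrelation of the time-ordered differences. The helper below is a sketch with hypothetical data; a value far from 0 warns that formal time-series methods may be needed:

```python
from statistics import mean

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a time-ordered sequence."""
    m = mean(series)
    num = sum((series[t] - m) * (series[t + 1] - m) for t in range(len(series) - 1))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# Control-minus-impact differences over time (hypothetical).
diffs = [0.3, 0.4, 0.5, 0.4, 0.2, 0.3, 0.4, 0.5]
r1 = lag1_autocorr(diffs)
print(f"lag-1 autocorrelation = {r1:.2f}")
```

Values near 0 are consistent with independence; strongly positive values inflate the Type I error rate of the t-type analysis above.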
Pitfall: It is also assumed that the difference in mean level between the control and impact sites is constant
over time in the absence of an impact, and that the effect of the impact is to change the arithmetic
difference. In our example, it would be assumed that the difference in the mean water quality between
the two sites is constant over time. The mean water quality measurements may fluctuate over time, but
both sites are assumed to fluctuate in lock-step with each other, maintaining the same average arithmetic
difference. One common way this assumption is violated is when the response variable at the control site is
a constant multiple of the response variable at the impact site; then the arithmetic difference depends upon
the actual levels. For example, suppose that the water quality readings at the two sites were 200 vs. 100
at the first time point (an arithmetic difference of 100) and 20 vs. 10 at the second time point (an arithmetic
difference of 10); both pairs are in a 2:1 ratio. The remedy is simple: a logarithmic transform of the raw
data converts a multiplicative difference into a constant arithmetic difference on the logarithmic scale.
This is a common problem when water quality measurements are concentrations, e.g. pH.
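The 200-vs-100 and 20-vs-10 example can be checked directly: the arithmetic differences disagree, but after a log transform both pairs give the same constant difference, log 2:

```python
from math import log, isclose

# Control vs. impact readings at two time points (from the example above):
t1 = (200.0, 100.0)
t2 = (20.0, 10.0)

# Arithmetic differences depend on the level...
print(t1[0] - t1[1], t2[0] - t2[1])               # 100.0 vs 10.0

# ...but after a log transform the difference is constant (= log 2):
d1 = log(t1[0]) - log(t1[1])
d2 = log(t2[0]) - log(t2[1])
print(isclose(d1, d2) and isclose(d1, log(2.0)))  # True
```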
Underwood (1991) also considered two variations on the BACI-P design. First, it may not be possible
to sample both sites simultaneously for technical or logistical reasons. Underwood (1991) discussed a
modification where sampling is done at different times in each site before and after impact (i.e., sampling times
are no longer paired), but notes that this modification cannot detect changes in the two sites that occurred
before the impact. For example, differences in water quality may show a gradual change over time in the
paired design prior to impact; without paired sampling, it would be difficult to detect this change. Second,
sampling only a single control site still has the problem identified earlier of not knowing whether observed
differences between the impact and control sites are site specific. Again, Underwood (1991) suggests that
multiple control sites should be monitored. In our example, more than one control site would be measured at
each time point. The variability in the difference between each control site and the impact site provides
information on generalizability to other sites.
29.3.5 Enhanced BACI-P: Designs to detect acute vs. chronic effects or to detect
changes in variation as well as changes in the mean.
As Underwood (1991) pointed out, the previous designs are suitable for detecting long-term (chronic)
effects on the mean level of some variable. In some cases, the impact may have an acute effect (i.e., effects
only last for a short while) or may change the variability of the response (e.g., seasonal changes become more
pronounced). Underwood's solution is to modify the sampling schedule so that it occurs on two temporal
scales, as shown in Figure 6 below:
Figure 6: The enhanced BACI-P Design
The change in a measured variable from multiple randomly chosen sampling occasions in two periods
(dots at Before and After the impact) in the control (solid line) or impacted (dashed line) sites. The two
temporal scales (sampling periods vs. sampling occasions) allow the detection of a change in the mean and
a change in variability after impact.
For example, groups of surveys could be conducted every 6 months with three surveys 1 week apart
randomly located within each group. The analysis of such a design is presented in Underwood (1991).
Again, several control sites should be used to counter the argument that detected differences are site
specific.
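The two-scale schedule in the example (a block of three weekly surveys randomly placed within each 6-month group) can be sketched as follows; taking a 6-month period as 26 weeks is an assumption for illustration:

```python
import random

# Sketch of the two-temporal-scale schedule from the example above:
# a group of surveys every 6 months (taken as 26 weeks), with a block
# of three surveys 1 week apart randomly located within each group.
random.seed(1)

n_periods, weeks_per_period, surveys_per_group = 4, 26, 3
schedule = {}
for p in range(n_periods):
    first = p * weeks_per_period
    # random start for the 3-week block, kept inside the period
    start = random.randrange(first, first + weeks_per_period - surveys_per_group + 1)
    schedule[p] = [start + k for k in range(surveys_per_group)]

for period, weeks in schedule.items():
    print(f"period {period}: survey weeks {weeks}")
```

The long scale (periods) supports trend comparisons; the short scale (closely spaced surveys within a period) captures short-term variability.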
This design is also useful when there are different objectives. For example, the objective for one variable
may be to detect a change in trend. The pairing of sample points on the long time scale leads to efficient
detection of trend changes. The objectives for another variable may be to detect differences in the mean
level. The short time scale surveys randomly located in time and space are efficient for detecting differences
in the mean level.
29.3.6 Designs for multiple impacts spread over time
The designs examined in previous sections have a common short-coming: they cannot be used to assess
transient responses to management actions or environmental disturbances, because such responses manifest
themselves as time-treatment interactions. Furthermore, there are many situations where environmental
impacts occur over a series of years, e.g., several dams being built, or clear-cuts that are staggered over
time.
For example, consider a study to examine the effects of salmon enhancement through hatcheries. A long
study is conducted where tagged wild and tagged hatchery-raised fish are released. From the returns of the
tags, it appears that fish from hatcheries have substantially lower survival rates than wild fish. The obvious
hypothesis is that the survival rate of hatchery fish is lower than that of wild fish. However, the declines are
correlated with a warming trend in ocean temperature, and the correlation is used to argue that the hatchery
stocks are merely more sensitive to warming trends: as soon as the ocean conditions return to normal, the
survival rates will be comparable.
Such time-treatment interaction effects cannot be measured with simple experimental designs that involve
simultaneous initiation of treatment on treated replicates (along with no treatment of controls). Consequently,
people with a vested interest in a treatment that does not seem to work can argue that this is a
temporary effect. The only direct, empirical way to convincingly counter such arguments is to demonstrate
that treatments initiated later in time result in the same transient patterns as earlier treatments.
In Walters, C.J., Collie, J.S., and Webb, T. (1988), Experimental designs for estimating transient responses
to management disturbances, Canadian Journal of Fisheries and Aquatic Sciences 45, 530-538,
http://dx.doi.org/10.1139/f88-062, a new set of experimental designs, called staircase designs, is proposed.
[This paper is located in the supplemental readings portion of the course notes.] The paper is fairly heavy
going with a high level of statistics used, but the ideas are relatively simple and easily extracted.
For example, here is the simplest design that involves three experimental units, two of which are assigned
to treatment and one to control.
For this design to work, several assumptions need to be made:
treatment effects are irreversible, strongly persistent, or might be exhibited only after an unknown, and
perhaps long, time delay after treatment is initiated;
there is a direct management interest in estimating the full transient pattern of response;
interaction effects are independent of time since treatment. This means that if the effect of the intervention
varies by year (an interaction), only the calendar year is important, and not the fact that it is 2
or 3 years since the treatment started.
Under these conditions, Figure 1 (above) is the simplest possible design. In this figure, note that all
units have at least 1 year measured before treatment, and that treatments start in the treated units at one-year
intervals.
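Since the original Figure 1 is not reproduced here, the following sketch assumes the layout described in the text: three units, one control and two treated, staggered treatment starts, and at least one pre-treatment year for each unit:

```python
# A sketch of the simplest staircase layout described above: three units
# (one control, two treated), each with at least one pre-treatment year,
# and treatment starting in the treated units in successive years.
# (The layout is an assumption; the original Figure 1 is not reproduced.)
years = range(1, 6)                                   # study years 1..5
start_year = {"control": None, "treat1": 2, "treat2": 3}

def status(unit, year):
    """'T' once treatment has started for this unit, '.' otherwise."""
    start = start_year[unit]
    return "T" if start is not None and year >= start else "."

for unit in start_year:
    print(f"{unit:8s}", " ".join(status(unit, y) for y in years))
```

The staggered starts are what let the analysis separate an average treatment response from a time-treatment interaction.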
More complex designs are possible. For example, with more than three treated units, there must again be
at least one year of measurement before treatment begins, and there must be at least one control unit; but it
is no longer necessary that a strict staircase design be followed, as long as some replicates start treatment in
even years and some replicates start treatment in odd years. An illustration of this is found in Figure 2
(below):
Finally, it is possible to relax the assumption that interaction effects are independent of time since
treatment. This would require a full staircase design similar to Figure 1 above, but a new treatment must
be initiated in every year of the study. This can lead to a formidable experiment. However, if the transient
effects are only temporary, then a modified design can be used, where units are rolled in and out of the study
as time progresses (a type of panel design).
Walters et al. (1988) then go on to discuss how to best design these types of studies. While the exact
computations are complex, several general principles were derived:
it is best to spread treatment starting times over as many times as possible;
if replicated treatments are possible, it seems best to start the replicates as early as possible. This also
guards against pseudo-replication;
there is little gain in extensive monitoring of treated units before treatment beyond the initial one-year
premonitoring phase;
it is only necessary to measure the units until the transient phase is finished. Walters et al. recommend
a sequential approach where data are collected one year at a time until you are satisfied that the transient
effects are gone;
blocking or stratification (grouping of experimental units into homogeneous sets) is extremely
advantageous.
As Walters et al. (1988) state:
Experimental designs with a staircase of treatment applications may require formidable investments
by resource scientists and managers, with substantial delays before average response patterns
can be clearly distinguished from time-treatment interactions. But the alternative to such
investments is to continue managing in a twilight zone where any policy (treatment) failure can
be blamed on unlucky interactions with environmental factors and where successes simply reflect
good luck with such factors. In this zone, the cumulative cost of continuing bad practices
and missing opportunities for improved ones will eventually exceed the cost of doing a proper
experiment in the first place.
29.3.7 Accidental Impacts
The BACI and related designs all assume that the environmental impact study will start before the impact
occurs. What can be done with accidental impacts on biological resources, where no baseline data are
available?
This section is based on:
Wiens, J.A. and Parker, K.R. (1995). Analyzing the effects of accidental environmental impacts:
approaches and assumptions. Ecological Applications 5, 1069-1083.
http://dx.doi.org/10.2307/2269355.
In most accidental impacts, there is no before data, and data can only be collected from the point of
impact onwards. For example, an oil tanker may run aground and spill oil on the coast; a gasoline station's
tanks may spring a leak; or a fire may release toxic fumes.
As in the BACI suite of designs, accidental impacts are not experiments in the true sense of the word.
Obviously, the impact was not randomly assigned to an experimental unit, and the experimental unit itself
is not well defined.
The context for this section is the break-up of an oil tanker off the West Coast of Vancouver Island. The
specific effects of this event upon shellfish are to be examined.
The following have been proposed as possible statistical designs to assess accidental impacts. Simple
graphs have been drawn below illustrating, for each design, the results of the impact assessment in the case
of a negative impact on shellfish densities and in the case of no impact on shellfish densities.
Assume (if necessary) that the impact slowly dies off over time.
Spatial Impact-Reference design. Sampling is performed immediately after impact at two sites: the
impacted site and a non-impacted site.
Spatial Regression design. Sampling is performed at a number of sites over the range of exposure
(e.g., indexed by the amount of oil washed ashore). A regression of abundance against exposure is drawn.
Spatial Matched pair design. Sampling is done on randomly selected impact sites and control sites
that are matched on relevant natural factors, e.g., the type of substrate where the shellfish aggregate.
Temporal Baseline design. Sometimes, fortuitous surveys have been done at the same site before the
impact occurred. Sampling takes place at the same site after the impact.
Temporal Time-series design. The impacted site is surveyed repeatedly over a long period of time (e.g.,
bimonthly for 2 years) and the results plotted.
Temporal-Spatial Pre-post design. Similar to the classical BACI design except that pre/post samples
are taken at sites that vary in the degree of exposure to the impact.
Temporal-Spatial Level-by-time design. The impact site is measured over time from the time of
impact. A control site is also measured over time from the time of impact at the same sampling
occasions.
Temporal-Spatial Impact trend-by-time design. The Regression design is performed at the impacted
site just after impact, and again a year or longer after impact. Both response-vs.-dose relationships are
plotted on the same graph.
The following are some questions about the above designs:
1. Some scientists have suggested that to assess accidental impacts it is sufficient to answer the following
question:
Would the impacted area be what it is, had there not been an impact?
Briefly discuss this objective with respect to assessing accidental impacts and the above designs.
Solution:
At first glance, this seems like a reasonable objective. However, natural variation is always taking
place, so that every point on earth is different from every other point on earth. So, if you are examining,
for example, shellfish density, this varies over every square meter of the site. It will be impossible to
ever tell exactly what the impacted site was before impact.
Perhaps a better definition would be:
What is the range of responses for non-impacted sites that are as similar to the impacted
site as possible, and has the impact moved the site outside this range?
2. What are the two key assumptions for the Spatial designs? [Hint: one of these deals with the duration
of the sampling].
Solution:
The two key assumptions are:
Equal natural factors at the impacted and non-impacted areas. However, because contamination was
not randomized, there is no guarantee of equal natural factors.
Sampling interval is short relative to temporal variation. This guarantees that the measurements
show the effect of the impact and not just differences that would have occurred naturally over
time.
3. What are the two key assumptions for the Temporal designs? [Hint: one of these deals with changes
in personnel and methods over time.]
Solution:
The two key assumptions are:
Natural factors are in steady-state equilibrium, i.e. the population levels remain the same over
time in the absence of an impact.
Differences in sampling personnel and sampling methods over time are inconsequential so that
the observed differences are related to the impact and not to differences in methods.
4. What are the two key assumptions for the Spatial-Temporal designs?
Solution:
The two key assumptions are:
Natural factors and the biological resource are in dynamic equilibrium among areas. The level
of a resource changes similarly across areas, responding in the same way to changing climatic
conditions and populations.
Consistent sampling methods over time.
5. Classify the designs by their ability to assess an initial impact and their ability to assess the recovery
process. [A simple yes/no table would suffice, similar to the table below.]
Solution:
Design                     Assessing initial impact   Assessing recovery
Spatial Impact-reference              x
Spatial Regression                    x
Spatial Matched pairs                 x
Temporal Baseline                     x
Temporal Time series                  x                        x
ST Pre-post pairs                     x
ST Level-by-time                      x                        x
ST Trend-by-time                      x                        x
6. Rank the designs on a scale of defensibility for determining if an impact occurred. Use a four point
scale ranging from 1 (least) to 4 (most). [Several designs can all be ranked equally]
Solution:
Design                     Ranking
Spatial Impact-reference      2
Spatial Regression            2
Spatial Matched pairs         2
Temporal Baseline             1
Temporal Time series          2
ST Pre-post pairs             3
ST Level-by-time              4
ST Trend-by-time              4
These designs can also be compared on other dimensions using a tabular format.
Design                     Requires consistent      Covariates are       Exposure levels are
                           sampling methods         feasible and         adequately
                           over time                useful               sampled
Spatial Impact-reference                                  x
Spatial Regression                                        x                     x
Spatial Matched pairs                                     x
Temporal Baseline                   x
Temporal Time series                x                     x
ST Pre-post pairs                   x
ST Level-by-time                    x
ST Trend-by-time                    x                                           x
Design                     Requires steady-state   Requires dynamic   Requires natural
                           equilibrium             equilibrium        factors to be equal
Spatial Impact-reference                                                      x
Spatial Regression                                                            x
Spatial Matched pairs                                                         x
Temporal Baseline                   x
Temporal Time series                x
ST Pre-post pairs                                         x
ST Level-by-time                                          x
ST Trend-by-time                                          x
A steady-state equilibrium implies that the population levels remain the same over time in the absence
of an impact.
A dynamic equilibrium implies that the level of a resource changes similarly for different areas,
responding in the same way to changing climatic conditions and populations.
Equal natural factors implies that the resource, and the natural processes that could change the resource,
are the same at both sites.
29.4 Conclusion
Green (1979) gave 10 principles applicable to any sampling design. These have been paraphrased, reordered,
and extended below. Underwood (1994) also gives some advice on areas of common misunderstanding
between environmental biologists and statisticians.
1. Formulate a clear, concise hypothesis.
It cannot be over-emphasized that the success or failure of a sampling program often hinges on clear,
explicit hypotheses. Woolly thinking at this stage frequently leads to massive amounts of data collected
without enough thought in advance as to how, to what end, and at what cost the information can
be subsequently handled. Hypotheses should be stated in terms of direct, measurable variables (i.e.,
action X will cause a decrease in Y). The hypotheses to be tested have implications for what data are to
be collected and how they should be collected. Refer to Chapter *** for more details.
2. Ensure controls will be present.
Most surveys are concerned with changes over time, typically before and after some impact. Effects
of an impact cannot be demonstrated without the presence of controls serving as a baseline, so that
changes over time unrelated to the impact can be observed. Without controls, no empirical evidence is
available to refute the argument that observed changes might have occurred regardless of the impact.
3. Stratify in time and space to reduce heterogeneity.
If the area to be sampled is large and heterogeneous (highly variable), then sampling from the entire
area, ignoring the known heterogeneity, reduces precision of the estimate. Extra variation may be
introduced to the measured variable solely by differences within the study area unrelated to the treat-
ment. By stratifying the study area in advance (also known as blocking in the experimental design
literature), this extra variability can be accounted for. The judicious choice of auxiliary variables can
also be used to increase precision of the estimates.
4. Take replicate samples within each combination of time, space, or any other controlled variable.
Differences among treatments can only be demonstrated by comparing the observed difference among
treatments with differences within each treatment. Lack of replication often restricts the interpretation
of many experiments and surveys to the sampled units rather than to the entire population of interest.
It is imperative that the replicates be true replicates and not pseudo-replicates (Hurlbert, 1984) where
the same experimental unit is often measured many times.
5. Determine the size of a biologically meaningful, substantive difference that is of interest.
A sufficiently large study (i.e., with large sample sizes) can detect minute differences that may not be
of biological interest. It is important to quantify the size of a difference that is biologically meaningful
before a study begins, so that resources are not wasted either by performing a study with an excessive
sample size or by performing a study that has too low a power to detect this important difference.
6. Estimate the required sample sizes to obtain adequate power to detect substantive differences
or to ensure sufficient precision of the estimates.
In this era of fiscal restraint, it is unwise to spend significant sums of money on surveys or experiments
that have only a slight chance of detecting the effect of interest, or that give estimates so imprecise
as to be useless. Such designs are a waste of time and money.
If the goal of the study is to detect a difference among populations, the required sample sizes will
depend upon the magnitude of the suspected difference and the amount of natural variation present.
Estimates of these quantities can often be obtained from prior experience, literature reviews of similar
studies, or from pilot surveys. Simulation studies can play an important role in assessing the efficiency
of a design.
If the goal is descriptive, then the required sample sizes will depend only upon the natural variation
present. As above, estimates can be obtained from prior experience, literature reviews, or a pilot study.
As noted earlier, it may be infeasible to conduct a pilot study, historical data may not exist, or it may be
difficult to reconcile the sample sizes required for different objectives. Some compromise will be needed
(Cochran, 1977, p. 81-82).
One common misconception is that sample size is linked to the size of the population. To the contrary,
the sample sizes required to estimate a parameter in a small population with a specified precision are
the same as in a large population. This non-intuitive result has a direct analogue in testing a pot of
soup for salt: the cook tastes only a spoonful regardless of pot size. Another analogy is that you would
take the same size sample to test for a chemical in the Baltic Sea as in the Pacific Ocean,
even though the Pacific Ocean is much larger than the Baltic Sea.
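The power reasoning above can be sketched with the usual normal-approximation sample-size formula for comparing two means (the specific formula and numbers are illustrative, not taken from the text):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for comparing two means with a
    two-sided test (normal approximation):
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    delta = biologically meaningful difference, sigma = natural SD."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return ceil(2 * (za + zb) ** 2 * (sigma / delta) ** 2)

# To detect a difference of 5 units when the natural variation has SD 10:
print(n_per_group(delta=5, sigma=10))     # 63 per group

# Halving the detectable difference roughly quadruples the sample size:
print(n_per_group(delta=2.5, sigma=10))
```

Notice that the formula depends on the meaningful difference and the natural variation, but not on the population size, consistent with the soup-tasting analogy.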
7. Allocate replicate samples using probabilistic methods in time and space. There is a tendency
to allocate samples to "representative" or "typical" locations. Even worse are "convenience" samples,
where the data are collected at sampling points that are easily accessible or close at hand. The key to
ensuring representativeness is randomization. Randomization ensures that the effects of all other
uncontrollable variables are equal, on average, in the various treatment groups, or that they appear in the
sample, on average, in the same proportions as in the population. Unless the manager is omniscient,
it is difficult to ensure that "representative" or "typical" sites are not affected by other, unforeseen,
uncontrollable factors.
Notice that a large sample size does not imply representativeness. Randomization controls
representativeness; sample size controls statistical power.
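A minimal sketch of probabilistic allocation: draw the sampling sites at random from the frame of all candidate sites rather than choosing "typical" or convenient ones. The site labels below are hypothetical:

```python
import random

# Probabilistic allocation sketch: choose sampling sites at random from
# the frame of all candidate sites rather than "typical" or convenient
# ones. Site labels here are hypothetical.
random.seed(42)

frame = [f"site-{i:03d}" for i in range(1, 201)]   # 200 candidate sites
chosen = random.sample(frame, 12)                  # 12 sites, without replacement
print(sorted(chosen))
```

Fixing the seed documents the selection so it can be audited and reproduced later.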
8. Pretest the sampling design and sampling methods.
There is always insufficient time, and it is psychologically difficult to spend effort on a pilot study
knowing that the data collected may not contribute to the final study results and may be thrown away.
However, this is the only way to check for serious problems in the study; to check if the size of the
survey unit is appropriate; to check if the data collection forms are adequate; to check the actual
variability present in the field; etc.
After a pilot study has been conducted, its results can be used to modify the proposed design and
fine-tune aspects such as the required sample size. In many cases, a pilot study shows that the
objectives of the proposed study are unobtainable for the projected cost and effort, and the study must
be substantially modified or abandoned.
9. Maintain quality assurance throughout the study.
Despite best efforts, things will deviate from the plan during the course of the study, particularly
if the survey extends over many years and personnel change. Many of the principles of statistical
process control can be applied here (Montgomery, 1996). For example, ensure that instruments are
recalibrated at regular intervals; ensure that sampling protocols are being consistently followed among
different team members; and ensure that data is being keyed correctly.
10. Check the assumptions of any statistical analysis.
Any statistical procedure makes explicit and implicit assumptions about the data collected. Match
the analysis with the survey design. In many cases, a statistically significant result can be obtained
erroneously if assumptions necessary for the analysis were violated.
11. Use the Inter-Ocular Trauma Test
Presentation of final results is just as important as design, execution, and analysis. A study will be of
limited usefulness if it sits on a shelf because other readers are unable to interpret the findings. Good
graphical methods (figures, plots, charts, etc.) or presentations will pass the Inter-Ocular Trauma
Test, i.e., the results will hit you between the eyes!
Finally: Despite their limitations, the study of uncontrolled events can play a useful role in adaptive man-
agement. The study of uncontrolled events and designed experiments differ in two important dimensions:
1. the amount of control. As the name implies, the study of uncontrolled events does not give the
manager the ability to manipulate the explanatory variables.
2. the degree of extrapolation to other settings. The lack of randomization implies that the manager
must be careful in extrapolating to new situations because of the possible presence of latent, lurking
factors.
These differences imply that inferences are not as strong as those that can be made after carefully controlled
experiments, but the results often lead to new hypotheses to be tested in future research. Despite the weaker
inferences from studying uncontrolled events, the same attention must be paid to the proper design of a
survey so that the conclusions are not tainted by inadvertent biases.
29.5 References
Cochran, W.G. (1977). Sampling Techniques. New York:Wiley.
One of the standard references for survey sampling. Very technical.
© 2012 Carl James Schwarz 1110 December 21, 2012
CHAPTER 29. AN OVERVIEW OF ENVIRONMENTAL FIELD STUDIES
Eberhardt, L.L. and Thomas, J.M. (1991). Designing environmental field studies. Ecological Monographs, 61, 53-73.
An overview of the eight different field situations as shown in Figure 1.
Green, R.H. (1979). Sampling design and statistical methods for environmental biologists. New York:
Wiley.
One of the first comprehensive unified treatments of sampling issues for environmental biologists.
Very readable.
Hurlbert, S.H. (1984). Pseudoreplication and the design of ecological field experiments. Ecological
Monographs, 54, 187-211.
A critique of many common problems encountered in ecological field experiments.
Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys mostly from a social science perspective.
Kish, L. (1984). On Analytical Statistics from complex samples. Survey Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical surveys.
Kish, L. (1987). Statistical designs for research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in analytical surveys. Very
technical.
Krebs, C.J. (1989). Ecological Methodology. New York: Harper and Row.
A methods books for common techniques used in ecology.
Milliken, G.A. and Johnson, D.E. (1984). The Analysis of Messy Data: Volume 1, Designed Experiments. New York: Van Nostrand Reinhold.
A complete treatise on the analysis of unbalanced data in designed experiments. Requires a background
in the use of ANOVA methodology.
Montgomery, D.C. (1996). Introduction to Statistical Quality Control. New York: Wiley.
A standard introduction to the principles of process control.
Myers, W.L. and Shelton, R.L. (1980). Survey methods for ecosystem management. New York: Wiley.
Good primer on how to measure common ecological data using direct survey methods, aerial photography, etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology,
and human influences.
Nemec, A.F.L. (1993). Standard error formulae for cluster sampling (unequal cluster sizes). Biometric
Information Pamphlet No. 43. Victoria, B.C.: B.C. Ministry of Forestry Research Branch.
Rao, J.N.K. (1973). On double sampling for stratification and analytical surveys. Biometrika, 60,
125-133.
Rasmussen, P.W., Heisey, D.M., Nordheim, E.V. and Frost, T.M. (1993). Time series intervention
analysis: unreplicated large-scale experiments. In: Scheiner, S.M. and Gurevitch, J. (Editors). Design
and analysis of ecological experiments. New York: Chapman and Hall.
Sedransk, J. (1965a). A double sampling scheme for analytical surveys. Journal of the American
Statistical Association, 60, 985-1004.
Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical Society, Series B, 27, 264-278.
Sedransk, J. (1966). An application of sequential sampling to analytical surveys. Biometrika, 53,
85-97.
Stewart-Oaten, A., Murdoch, W.W. and Parker, K. (1986). Environmental impact assessment: pseudoreplication in time? Ecology, 67, 929-940.
One of the first extensions of the BACI design discussed in Green (1979).
Thompson, S.K. (1992). Sampling. New York:Wiley.
A good companion to Cochran (1977). Has many examples of using sampling for biological populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive
sampling.
Underwood, A.J. (1991). Beyond BACI: Experimental designs for detecting human environmental
impacts on temporal variations in natural populations. Australian Journal of Marine and Freshwater
Research, 42, 569-587. doi:10.1071/MF9910569
A discussion of current BACI designs, and an enhanced BACI design to detect changes in variability
as well as in the mean response.
Underwood, A.J. (1994). Things environmental scientists (and statisticians) need to know to receive
(and give) better statistical advice. In: Statistics in Ecology and Environmental Monitoring, Fletcher,
D.J. and Manly, B.F. (editors). Dunedin, New Zealand; University of Otago Press.
29.6 Selected journal articles
Here is a series of questions about each publication that should be understood as part of the course.
29.6.1 Designing Environmental Field Studies
Eberhardt, L. L., and Thomas, J.M. (1991)
Designing Environmental Field Studies
Ecological Monographs, 61, 53-73.
http://dx.doi.org/10.2307/1942999
Some of the material in this paper will be covered later in the course, e.g. fixed and random effects,
pseudo-replication, and the details on ANOVA tables. It is not crucial at this point to fully understand these
sections of the paper.
Compare and contrast the 8 different types of studies shown in Figure 1. What type of information do
they provide? What are the limitations of each type of study?
What are the tradeoffs involved in Type I and Type II errors?
The material on "Sampling for Modelling" may be quite different from what many have seen before,
but the concept of contrast in designing experiments is crucial. What does contrast mean and what
implications does it have in experimentation?
What are the key differences between analytical and descriptive surveys?
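The Type I/Type II tradeoff asked about above can be sketched numerically. The helper below is not from the paper; it uses the standard normal approximation to the power of a two-sided, two-sample test of means. Lowering α (fewer Type I errors) raises the critical value and so lowers power (more Type II errors).

```python
import numpy as np
from scipy.stats import norm

def approx_power(delta, sigma, n, alpha):
    """Approximate power of a two-sided, two-sample z-test for a true
    difference in means `delta`, common s.d. `sigma`, per-group size `n`."""
    se = sigma * np.sqrt(2.0 / n)               # s.e. of the difference in means
    z_crit = norm.ppf(1.0 - alpha / 2.0)        # two-sided critical value
    # Probability the standardized statistic lands outside +/- z_crit.
    return (1.0 - norm.cdf(z_crit - delta / se)
            + norm.cdf(-z_crit - delta / se))

# A tighter alpha buys fewer false positives at the cost of missed effects.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}  power={approx_power(1.0, 2.0, 20, alpha):.3f}")
```

Note that with `delta = 0` the function returns exactly α, as it must: the "power" against no effect is just the Type I error rate.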
29.6.2 Beyond BACI
Underwood, A.J. (1991).
Beyond BACI: Experimental designs for detecting human environmental impacts on temporal variations in
natural populations.
Australian Journal of Marine and Freshwater Research, 42, 569-587.
http://dx.doi.org/10.1071/MF9910569
The material on the construction of the F-tests will be covered later in the course. It is not crucial at this
point to fully understand these sections of the paper.
Distinguish between the various designs in the paper (including the obliquely mentioned design with
replicated control sites). What are the limitations of each design? What are the advantages of each
design?
Distinguish between impacts that cause changes in the mean response and other types of effects.
Which designs are best for detecting these types of changes?
29.6.3 Environmental impact assessment
Stewart-Oaten, A., Murdoch, W.W., and Parker, K.R. (1986).
Environmental impact assessment: Pseudoreplication in time?
Ecology, 67, 929-940.
http://dx.doi.org/10.2307/1939815
What was Green's proposed design and how does the authors' proposed solution differ? What limitations of Green's proposed design does the new design overcome?
Under what conditions will the proposed design fail?
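The paired-differences idea at the heart of the Stewart-Oaten et al. (1986) proposal can be sketched as follows: at each sampling date the impact-minus-control difference is formed, and the mean difference is compared before versus after the intervention. The data below are simulated purely for illustration; the effect size and sample sizes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical monitoring data: one impact and one control site sampled on
# the same dates before and after a putative impact.
n_before, n_after = 12, 12
control_before = rng.normal(20, 3, n_before)
impact_before = control_before + rng.normal(0, 1, n_before)   # tracks control
control_after = rng.normal(20, 3, n_after)
impact_after = control_after + rng.normal(-4, 1, n_after)     # simulated drop

# Form the impact-minus-control difference at each date, then compare the
# mean difference before vs. after the intervention.
d_before = impact_before - control_before
d_after = impact_after - control_after
t_stat, p_value = stats.ttest_ind(d_before, d_after)

print(f"mean difference before: {d_before.mean():.2f}")
print(f"mean difference after:  {d_after.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Differencing removes shared temporal variation common to both sites, which is exactly what lets the dates serve as replicates; the design fails when the impact and control sites do not track each other over time.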
29.7 Examples of studies for discussion - good exam questions!
The following are examples of studies and questions that have arisen in various contexts. Some were student
projects, some were questions from email discussion groups, etc.
29.7.1 Effect of burn upon salamanders
A student wrote on an email discussion forum:
I'm a grad student at ** studying the effects of prescribed fire on terrestrial salamanders. I have
a 900 sq. meter mark-recapture grid with 2 species in it. I'm planning to compare before and
after the burn, and have just finished taking the first season's before data. I used 4 visits, either
1 or 2 weeks apart.
Do you have any suggestions as to which model would be best for comparing before/after treatments?
Some questions to ponder:
What will be the response variable?
What are some of the problems with this design?
How could this design be improved?