Quantitative Methods
OXFORD LIBRARY OF PSYCHOLOGY
Editor-in-Chief
Peter E. Nathan
Area Editors:
Clinical Psychology
David H. Barlow
Cognitive Neuroscience
Kevin N. Ochsner and Stephen M. Kosslyn
Cognitive Psychology
Daniel Reisberg
Counseling Psychology
Elizabeth M. Altmaier and Jo-Ida C. Hansen
Developmental Psychology
Philip David Zelazo
Health Psychology
Howard S. Friedman
History of Psychology
David B. Baker
Neuropsychology
Kenneth M. Adams
Organizational Psychology
Steve W. J. Kozlowski
Volume 1: Foundations
Oxford University Press, Inc., publishes works that further
Oxford University’s objective of excellence in
research, scholarship, and education.
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
9 8 7 6 5 4 3 2 1
Printed in the United States of America
on acid-free paper
SHORT CONTENTS
Contributors xi
Chapters 1–479
Index 480
OXFORD LIBRARY OF PSYCHOLOGY
nation’s and world’s most productive and best-respected psychologists have agreed
to edit Library handbooks or write authoritative chapters in their areas of expertise.
For whom has the Oxford Library of Psychology been written? Because of its breadth,
depth, and accessibility, the Library serves a diverse audience, including graduate
students in psychology and their faculty mentors, scholars, researchers, and practi-
tioners in psychology and related fields. Each will find in the Library the information
they seek on the subfield or focal area of psychology in which they work or are
interested.
Befitting its commitment to accessibility, each handbook includes a comprehen-
sive index, as well as extensive references to help guide research. And because the
Library was designed from its inception as an online as well as a print resource,
its structure and contents will be readily and rationally searchable online. Further,
once the Library is released online, the handbooks will be regularly and thoroughly
updated.
In summary, the Oxford Library of Psychology will grow organically to provide
a thoroughly informed perspective on the field of psychology, one that reflects
both psychology’s dynamism and its increasing interdisciplinarity. Once published
electronically, the Library is also destined to become a uniquely valuable interactive
tool, with extended search and browsing capabilities. As you begin to consult this
handbook, we sincerely hope you will share our enthusiasm for the more than 500-
year tradition of Oxford University Press for excellence, innovation, and quality, as
exemplified by the Oxford Library of Psychology.
Peter E. Nathan
Editor-in-Chief
Oxford Library of Psychology
Todd D. Little
Todd D. Little, Ph.D., is a Professor of Psychology, Director of the Quantitative
training program, Director of the undergraduate Social and Behavioral Sciences
Methodology minor, and a member of the Developmental training program. Since
2010, Todd has been Director of the Center for Research Methods and Data Analysis
(CRMDA) at Kansas University. Little is internationally recognized for his quanti-
tative work on various aspects of applied SEM (e.g., indicator selection, parceling,
modeling developmental processes) as well as his substantive developmental research
(e.g., action-control processes and motivation, coping, and self-regulation). In
2001, Little was elected to membership in the Society for Multivariate Experimental
Psychology. In 2009, he was elected President of APA’s Division 5 (Evaluation, Mea-
surement, and Statistics) and in 2010 was elected Fellow of the division. In 2012,
he was elected Fellow in the Association for Psychological Science. He founded,
organizes, and teaches in the internationally renowned KU “Stats Camps” each June
(see crmda.KU.edu for details of the summer training programs). Little has edited
five books related to methodology, including The Oxford Handbook of Quantitative
Methods and the Guilford Handbook of Developmental Research Methods (with Brett
Laursen and Noel Card). Little has been principal investigator or co-principal
investigator on more than 15 grants and contracts, statistical consultant on more
than 60 grants, and he has guided the development of more than 10 different
measurement tools.
CONTRIBUTORS
Pascal R. Deboeck
Department of Psychology
University of Kansas
Lawrence, KS

Sarah Depaoli
Department of Educational Psychology
University of Wisconsin-Madison
Madison, WI

Cody S. Ding
College of Education
University of Missouri-Saint Louis
Saint Louis, MO

M. Brent Donnellan
Department of Psychology
Michigan State University
East Lansing, MI

Dawnté R. Early
Department of Human and Community Development
University of California at Davis
Davis, CA

Craig K. Enders
Department of Psychology
Arizona State University
Tempe, AZ

David M. Erceg-Hurn
School of Psychology
University of Western Australia
Crawley, WA, Australia

Aurelio José Figueredo
Department of Psychology
School of Mind, Brain, & Behavior
Division of Family Studies and Human Development
College of Agriculture and Life Sciences
University of Arizona
Tucson, AZ

Rafael Antonio Garcia
Department of Psychology
School of Mind, Brain, & Behavior
University of Arizona
Tucson, AZ

Amanda C. Gottschall
Department of Psychology
Arizona State University
Tempe, AZ

Michael J. Greenacre
Department of Economics and Business
Universitat Pompeu Fabra
Barcelona, Spain

Brian D. Haig
Department of Psychology
University of Canterbury
Canterbury, New Zealand

Kelly Hallberg
Institute for Policy Research
Northwestern University
Evanston, IL

Lisa L. Harlow
Department of Psychology
University of Rhode Island
Kingston, RI

Emily J. Hart
Department of Psychology
University at Buffalo
The State University of New York
Buffalo, NY

Kit-Tai Hau
The Chinese University of Hong Kong
Hong Kong

Joop J. Hox
Department of Methodology and Statistics
Utrecht University
Utrecht, The Netherlands

James Jaccard
Department of Psychology
Florida International University
Boca Raton, FL

Paul E. Johnson
Department of Political Science
Kansas University
Lawrence, KS

Kelly M. Kadlec
Department of Psychology
University of Southern California
Los Angeles, CA

David Kaplan
Department of Educational Psychology
University of Wisconsin-Madison
Madison, WI

Ken Kelley
Department of Management
University of Notre Dame
Notre Dame, IN

Harvey J. Keselman
Department of Psychology
University of Manitoba
Winnipeg, Canada
Neal M. Kingston
School of Education
University of Kansas
Lawrence, KS

Yasemin Kisbu-Sakarya
Department of Psychology
Arizona State University
Tempe, AZ

Laura B. Kramer
School of Education
University of Kansas
Lawrence, KS

Todd D. Little
Center for Research Methods and Data Analysis
Department of Psychology
University of Kansas
Lawrence, KS

Richard E. Lucas
Department of Psychology
Michigan State University
East Lansing, MI

David P. MacKinnon
Department of Psychology
Arizona State University
Tempe, AZ

Patrick Mair
Institute for Statistics and Mathematics
Vienna University of Economics and Business
Vienna, Austria

Herbert W. Marsh
Department of Education
University of Oxford
Oxford, UK

Katherine E. Masyn
Graduate School of Education
Harvard University
Cambridge, MA

John J. McArdle
Department of Psychology
University of Southern California
Los Angeles, CA

Roderick P. McDonald†
Sydney University
Sydney, Australia
Professor Emeritus, University of Illinois at Urbana-Champaign
Professor Emeritus, Macquarie University
† April 16, 1928 – October 29, 2011

Sarah E. Medland
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Peter C. M. Molenaar
Department of Human Development and Family Studies
Pennsylvania State University
University Park, PA

Alexandre J.S. Morin
Department of Psychology
University of Sherbrooke
Sherbrooke, Quebec, Canada

Miriam A. Mosing
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Keith E. Muller
Department of Health Outcomes and Policy
University of Florida
Gainesville, FL

Eun-Young Mun
Center of Alcohol Studies
Rutgers University
Piscataway, NJ

Alan T. Murray
GeoDa Center for Geospatial Analysis and Computation
School of Geographical Sciences and Urban Planning
Arizona State University
Tempe, AZ

Jay I. Myung
Department of Psychology
The Ohio State University
Columbus, OH

Benjamin Nagengast
Department of Education
Oxford University
Oxford, UK
Sally Gayle Olderbak
Department of Psychology
School of Mind, Brain, & Behavior
University of Arizona
Tucson, AZ

Jamie M. Ostrov
Department of Psychology
University at Buffalo
The State University of New York
Buffalo, NY

Trond Peterson
Department of Sociology
University of California-Berkeley
Berkeley, CA

Mark A. Pitt
Department of Psychology
The Ohio State University
Columbus, OH

Larry R. Price
College of Education and College of Science
Texas State University-San Marcos
San Marcos, TX

Nilam Ram
Department of Human Development and Family Studies
Pennsylvania State University
University Park, PA
Max Planck Institute for Human Development
Berlin, Germany

Sergio J. Rey
GeoDa Center for Geospatial Analysis and Computation
School of Geographical Sciences and Urban Planning
Arizona State University
Tempe, AZ

Joseph L. Rodgers
Department of Psychology
University of Oklahoma
Norman, OK

Robert Rosenthal
Department of Psychology
University of California, Riverside
Riverside, CA

Ralph L. Rosnow
Department of Psychology
Temple University
Philadelphia, PA

André A. Rupp
Department of Measurement, Statistics, and Evaluation (EDMS)
University of Maryland
College Park, MD

Gabriel Lee Schlomer
Division of Family Studies and Human Development
College of Agriculture and Life Sciences
University of Arizona
Tucson, AZ

Christof Schuster
Department of Psychology
Justus-Liebig-Universität Giessen
Giessen, Germany

James P. Selig
Department of Psychology
University of New Mexico
Albuquerque, NM

Paul E. Spector
Department of Psychology
University of South Florida
Tampa, FL

Peter M. Steiner
Department of Educational Psychology
University of Wisconsin-Madison
Madison, WI

Carolin Strobl
Institute of Statistics, Faculty of Mathematics, Informatics and Statistics
Ludwig-Maximilians-Universität München
Munich, Germany

Bruce Thompson
Baylor College of Medicine
Austin, TX

Terry T. Tomazic
Department of Sociology and Criminal Justice
Saint Louis University
Saint Louis, MO

James T. Townsend
Department of Psychological and Brain Sciences
Indiana University
Bloomington, IN

Trisha Van Zandt
Department of Psychology
The Ohio State University
Columbus, OH
Alexander von Eye
Departments of Psychology
Michigan State University
East Lansing, MI
University of Vienna
Vienna, Austria

Stefan von Weber
Department of Mechanical Engineering
Hochschule Furtwangen University
Furtwangen im Schwarzwald, Germany

Karin J.H. Verweij
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Theodore A. Walls
Department of Behavioral Science
University of Rhode Island
Kingston, RI

Lihshing Leigh Wang
Educational Studies Program
University of Cincinnati
Cincinnati, OH

Amber S. Watts
Center for Research Methods and Data Analysis and
Lifespan Institute, Gerontology Center
University of Kansas
Lawrence, KS

William W.S. Wei
Department of Statistics
Temple University
Philadelphia, PA

Zhonglin Wen
South China Normal University

Stephen G. West
Department of Psychology
Arizona State University
Tempe, AZ

Keith F. Widaman
Department of Psychology
University of California at Davis
Davis, CA

Rand R. Wilcox
Department of Psychology
University of Southern California
Los Angeles, CA

Lisa M. Willoughby
Department of Psychology
Saint Louis University
Saint Louis, MO

Coady Wing
Institute for Policy Research
Northwestern University
Evanston, IL

Pedro Sofio Abril Wolf
Department of Psychology
University of Cape Town
Cape Town, South Africa

Vivian Wong
School of Education and Social Policy
Northwestern University
Evanston, IL

Carol M. Woods
Center for Research Methods and Data Analysis and
Department of Psychology
University of Kansas
Lawrence, KS

Wei Wu
Center for Research Methods and Data Analysis and
Department of Psychology
University of Kansas
Lawrence, KS

Ke-Hai Yuan
Department of Psychology
University of Notre Dame
Notre Dame, IN
CONTENTS
1. Introduction 1
Todd D. Little
2. The Philosophy of Quantitative Methods 7
Brian D. Haig
3. Quantitative Methods and Ethics 32
Ralph L. Rosnow and Robert Rosenthal
4. Special Populations 55
Keith F. Widaman, Dawnté R. Early, and Rand D. Conger
5. Theory Construction, Model Building, and Model Selection 82
James Jaccard
6. Teaching Quantitative Psychology 105
Lisa L. Harlow
7. Modern Test Theory 118
Roderick P. McDonald
8. The IRT Tradition and its Applications 144
R.J. de Ayala
9. Survey Design and Measure Development 170
Paul E. Spector
10. High-Stakes Test Construction and Test Use 189
Neal M. Kingston and Laura B. Kramer
11. Effect Size and Sample Size Planning 206
Ken Kelley
12. Experimental Design for Causal Inference: Clinical Trials and
Regression Discontinuity Designs 223
Kelly Hallberg, Coady Wing, Vivian Wong, and Thomas D. Cook
13. Matching and Propensity Scores 237
Peter M. Steiner and David Cook
14. Designs for and Analyses of Response Time Experiments 260
Trisha Van Zandt and James T. Townsend
15. Observational Methods 286
Jamie M. Ostrov and Emily J. Hart
16. A Primer of Epidemiologic Methods, Concepts, and Analysis with
Examples and More Advanced Applications within Psychology 305
David E. Bard, Joseph L. Rodgers, and Keith E. Muller
17. Program Evaluation: Principles, Procedures, and Practices 332
Aurelio José Figueredo, Sally Gayle Olderbak, Gabriel Lee Schlomer,
Rafael Antonio Garcia, and Pedro Sofio Abril Wolf
18. Overview of Statistical Estimation Methods 361
Ke-Hai Yuan and Christof Schuster
19. Robust Statistical Estimation 388
David M. Erceg-Hurn, Rand R. Wilcox, and Harvey J. Keselman
20. Bayesian Statistical Methods 407
David Kaplan and Sarah Depaoli
21. Mathematical Modeling 438
Daniel R. Cavagnaro, Jay I. Myung, and Mark A. Pitt
22. Monte Carlo Analysis in Academic Research 454
Paul E. Johnson
Index 480
CHAPTER 1
Introduction
Todd D. Little
Abstract
In this introductory chapter to The Oxford Handbook of Quantitative Methods, I provide an overview of
the two volumes. More specifically, I describe the rationale and motivation for the topics selected for
presentation across the volumes. I also list my instructions to the chapter authors and then describe how
the chapters fit together into thematic groupings. I extend my sincerest gratitude to the persons
who assisted me along the way, as no work this comprehensive can be done without the considerable
help of many persons. I conclude with how pleased I am with the quality and
comprehensiveness of the chapters that are included.
Key Words: Overview; Quantitative Methods; Methodology; Statistics
authors were very responsive to my requests. The guidelines I asked authors to consider were:

• Handbook chapters should be comprehensive and authoritative; readers will rely heavily on these chapters, particularly when they move to the online format.
• Handbook chapters should present not only the strengths of the topic covered but also any limitations.
• Handbook chapters should make all assumptions underlying the topic explicit.
• Regarding citations, handbook chapters should cover the historical origins as well as the recent renditions of a given key topic.
• Handbook chapters should not present one-sided views on any debate; rather, they should report the issues and present the arguments—both pro and con. Authors can direct readers to other platforms where a position piece is presented.
• To facilitate the online linkages, handbook chapters should point to other online resources related to the topic presented.
• Every element of every formula presented must be explicitly explained; assume no knowledge of how to read formulae.
• Examples, examples, examples, and, when in doubt, provide an example! Concrete examples are absolutely critical to communicate quantitative content.
• Avoid jargon and acronyms. Please spell out acronyms, and if you use jargon, please remind the reader of the meaning or definition of the jargon every three to four times it is used; similarly, if you use an acronym, then remind the reader of what it means every three to four times it is used.
• Use active voice, and do not shy away from the use of I/me or we/us. Channel how you lecture on the topic. It will create a crisp and enjoyable read.
• Do not start a sentence with “This” followed by a verb. The referent of “this” must be restated because of the ambiguity this creates. This general guideline should be followed as a rule!

Authors, like editors, have preferences and habits, so you will find places, chapters, and so on where some of my admonitions were not followed. But the quality of the product that each chapter provides is nonetheless uncompromised. We have established a Wiki-based resource page for the handbook, which can be found at crmda.KU.edu/oxford. Each author has been asked to maintain and upload materials to support his or her chapter contribution. At the top of that page is a link that encourages you to offer comments and suggestions on the topics and coverage of the handbook. These comments will be reviewed and integrated into future editions of this handbook. I encourage you, therefore, to take advantage of this opportunity to help shape the directions and content coverage of this handbook.

Statistical software has blossomed with the advent of hardware that provides the necessary speed and memory, and of programming languages coupled with numerical algorithms that are more efficient and optimized than those of yesteryear. These software advances have allowed many of the advances in modern statistics to become accessible to the typical end-user. Modern missing data algorithms and Bayesian estimation procedures, for example, have been the beneficiaries of these advances. Of course, some of the software developments have included simplified interfaces with slick graphical user interfaces. The critical options are usually prefilled with default settings. These latter two aspects of advancing software are unfortunate because they lead to mindless applications of the statistical techniques. I would prefer that options not be set as defaults but, rather, that the software prompt the user to make a choice (and give good help for what each choice means). I would also prefer that a complete script of the GUI choices, and the order in which steps were taken, be automatically saved and displayed.

I have organized the handbook by starting with some basics. It begins with the philosophical underpinnings associated with science and quantitative methods (Haig, Chapter 2, Volume 1) followed by a discussion of how to construct theories and models so that they can be tested empirically and the best model selected (Jaccard, Chapter 5, Volume 1). I then turn to an enlightened discussion of ethics in the conduct of quantitative research (Rosnow & Rosenthal, Chapter 3, Volume 1) and related issues when quantitative methods are applied in special populations (Widaman, Early, & Conger, Chapter 4, Volume 1). Harlow (Chapter 6, Volume 1) follows with an encompassing and impassioned discussion of teaching quantitative methods.

The theme in the next grouping of chapters centers on measurement issues. First, the late McDonald (Chapter 7, Volume 1) provides a thorough overview of Modern Test Theory.¹ De Ayala (Chapter 8, Volume 1) adds a detailed discussion of Item Response Theory as an essential measurement and analysis tool. After these principles
of measurement are discussed, the principles and practices surrounding survey design and measure development are presented (Spector, Chapter 9, Volume 1). Kingston and Kramer (Chapter 10, Volume 1) further this discussion in the context of high-stakes testing.

A next grouping of chapters covers various design issues. Kelley (Chapter 11, Volume 1) begins this section by covering issues of power, effect size, and sample size planning. Hallberg, Wing, Wong, and Cook (Chapter 12, Volume 1) then address key experimental designs for causal inference: the gold-standard randomized clinical trial (RCT) design and the underutilized regression discontinuity design. Some key quasi-experimental procedures for comparing groups are discussed in Steiner and Cook’s (Chapter 13, Volume 1) chapter on using matching and propensity scores. Finally, Van Zandt and Townsend (Chapter 14, Volume 1) provide a detailed discussion of the designs for and analyses of response time experiments. I put observational methods (Ostrov & Hart, Chapter 15, Volume 1), epidemiological methods (Bard, Rodgers, & Muller, Chapter 16, Volume 1), and program evaluation (Figueredo, Olderbak, & Schlomer, Chapter 17, Volume 1) in with these chapters because they address more collection and design issues, although the discussion of program evaluation also addresses unique analysis and presentation issues.

I have a stellar group of chapters related to estimation issues. Yuan and Schuster (Chapter 18, Volume 1) provide an overview of statistical estimation methods; Erceg-Hurn, Wilcox, and Keselman (Chapter 19, Volume 1) provide a nice complement with a focus on robust estimation techniques. Bayesian statistical estimation methods are thoroughly reviewed in the Kaplan and Depaoli (Chapter 20, Volume 1) contribution. The details of mathematical modeling are synthesized in this section by Cavagnaro, Myung, and Pitt (Chapter 21, Volume 1). This section is completed by Johnson (Chapter 22, Volume 1), who discusses the many issues and nuances involved in conducting Monte Carlo simulations to address the what-would-happen-if questions that we often need to answer.

The foundational techniques for the statistical analysis of quantitative data start with a detailed overview of the traditional methods that have marked the social and behavioral sciences (i.e., the General Linear Model; Thompson, Chapter 2, Volume 2). Coxe, West, and Aiken (Chapter 3, Volume 2) then extend the General Linear Model to discuss the Generalized Linear Model. This discussion is easily followed by Woods (Chapter 4, Volume 2), who synthesizes the various techniques of analyzing categorical data. After the chapter on configural frequency analysis by von Eye, Mun, Mair, and von Weber (Chapter 5, Volume 2), I then segue into nonparametric techniques (Buskirk, Tomazic, & Willoughby, Chapter 6, Volume 2) and the more specialized techniques of correspondence analysis (Greenacre, Chapter 7, Volume 2) and spatial analysis (Anselin, Murray, & Rey, Chapter 8, Volume 2). This section is capped with chapters dedicated to special areas of research—namely, techniques and issues related to the analysis of imaging data (e.g., fMRI; Price, Chapter 9, Volume 2). The closely aligned worlds of behavior genetics (i.e., twin studies; Blokland, Mosing, Verweij, & Medland, Chapter 10, Volume 2) and genes (Medland, Chapter 11, Volume 2) follow.

The foundations of multivariate techniques are grouped beginning with Ding’s (Chapter 12, Volume 2) presentation of multidimensional scaling and Brown’s (Chapter 13, Volume 2) summary of the foundations of latent variable measurement models. Hox layers in the multilevel issues as handled in both the manifest regression framework and the latent variable work of structural equation modeling. McArdle and Kadlec (Chapter 15, Volume 2) detail, in broad terms, different structural equation models and their utility. MacKinnon, Kisbu-Sakarya, and Gottschall (Chapter 16, Volume 2) address the many new developments in mediation analysis, while Marsh, Hau, Wen, and Nagengast (Chapter 17, Volume 2) do the same for analyses of moderation.

The next group of chapters focuses on repeated measures and longitudinal designs. It begins with a chapter I co-wrote with Wu and Selig that provides a general overview of longitudinal models (Wu, Selig, & Little, Chapter 18, Volume 2). Deboeck (Chapter 19, Volume 2) takes things further into the burgeoning world of dynamical systems and continuous-time models for longitudinal data. Relatedly, Walls (Chapter 20, Volume 2) provides an overview of designs for intensive longitudinal data collection and analysis. The wonderful world of dynamic factor models (a multivariate model for single-subject data) is presented by Ram, Brose, and Molenaar (Chapter 21, Volume 2). Wei (Chapter 22, Volume 2) covers all the issues of traditional time-series models, and Peterson (Chapter 23,
Volume 2) rounds out this section with a thorough coverage of event history models.

The volume finishes with two small sections. The first focuses on techniques dedicated to finding heterogeneous subgroups in one’s data. Rupp (Chapter 24, Volume 2) covers traditional clustering and classification procedures. Masyn and Nylund-Gibson (Chapter 25, Volume 2) cover the model-based approaches encompassed under the umbrella of mixture modeling. Beauchaine (Chapter 26, Volume 2) completes this first group with his coverage of the nuances of taxometrics. The second of these final sections covers issues related to secondary analyses of extant data. I put the chapter on missing data here because it generally is applied after data collection occurs, but it is also a little out of order here because of the terrific and powerful features of planned missing data designs. In this regard, Baraldi and Enders (Chapter 27, Volume 2) could have gone into the design section. Donnellan and Lucas (Chapter 28, Volume 2) cover the issues associated with analyzing the large-scale archival data sets that are available via federal funding agencies such as NCES, NIH, NSF, and the like. Data mining can also be classified as a set of secondary modeling procedures, and Strobl’s (Chapter 29, Volume 2) chapter covers the techniques and issues in this emerging field of methodology. Card and Casper (Chapter 30, Volume 2) cover the still-advancing world of meta-analysis and current best practices in the quantitative synthesis of published studies. The final chapter of The Oxford Handbook of Quantitative Methods is one I co-authored with Wang, Watts, and Anderson (Wang, Watts, Anderson, & Little, Chapter 31, Volume 2). In this capstone chapter, we address the many pervasive fallacies that still permeate the world of quantitative methodology.

A venture such as this does involve the generous and essential contributions of expert reviewers. Many of the chapter authors also served as reviewers for other chapters, and I won’t mention them by name here. I do want to express gratitude to a number of ad hoc reviewers who assisted me along the way (in arbitrary order): Steve Lee, Kris Preacher, Mijke Rhemtulla, Chantelle Dowsett, Jason Lee, Michael Edwards, David Johnson (I apologize now if I have forgotten that you reviewed a chapter for me!). I also owe a debt of gratitude to Chad Zimmerman at OUP, who was relentless in guiding us through the incremental steps needed to herd us all to a final and pride-worthy end product, and to Anne Dellinger, who was instrumental in bringing closure to this mammoth project.

Author note
Partial support for this project was provided by grant NSF 1053160 (Todd D. Little & Wei Wu, co-PIs) and by the Center for Research Methods and Data Analysis at the University of Kansas (Todd D. Little, director). Correspondence concerning this work should be addressed to Todd D. Little, Center for Research Methods and Data Analysis, University of Kansas, 1425 Jayhawk Blvd., Watson Library 470, Lawrence, KS 66045. E-mail: yhat@ku.edu. Web: crmda.ku.edu.

Note
1. This chapter was completed shortly before Rod’s unexpected passing. His legacy and commitment to quantitative methods was uncompromising, and we will miss his voice of wisdom and his piercing intellect; R.I.P., Rod McDonald and, as you once said, pervixi….

References
Anselin, L., Murray, A. T., & Rey, S. J. (2012). Spatial analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 154–174). New York: Oxford University Press.
Baraldi, A. N. & Enders, C. K. (2012). Missing data methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 635–664). New York: Oxford University Press.
Bard, D. E., Rodgers, J. L., & Muller, K. E. (2012). A primer of epidemiology methods with applications in psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 305–349). New York: Oxford University Press.
Beauchaine, T. P. (2012). Taxometrics. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 612–634). New York: Oxford University Press.
Brown, T. A. (2012). Latent variable measurement models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 257–280). New York: Oxford University Press.
Buskirk, T. D., Tomazic, T. T., & Willoughby, L. (2012). Nonparametric statistical techniques. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 106–141). New York: Oxford University Press.
Card, N. A. & Casper, D. M. (2012). Meta-analysis and quantitative research synthesis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 701–717). New York: Oxford University Press.
Cavagnaro, D. R., Myung, J. I., & Pitt, M. A. (2012). Mathematical modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 438–453). New York: Oxford University Press.
Coxe, S., West, S. G., & Aiken, L. S. (2012). Generalized linear models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 26–51). New York: Oxford University Press.
De Ayala, R. J. (2012). The IRT tradition and its applications. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 144–169). New York: Oxford University Press.
Deboeck, P. R. (2012). Dynamical systems and models of continuous time. In T. D. Little (Ed.), The Oxford Handbook
of Quantitative Methods (Vol. 2, pp. 411–431). New York: Oxford University Press.
Ding, C. S. (2012). Multidimensional scaling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford University Press.
Donnellan, M. B. & Lucas, R. E. (2012). Secondary data analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 665–677). New York: Oxford University Press.
Erceg-Hurn, D. M., Wilcox, R. R., & Keselman, H. J. (2012). Robust statistical estimation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 388–406). New York: Oxford University Press.
Figueredo, A. J., Olderbak, S. G., & Schlomer, G. L. (2012). Program evaluation: Principles, procedures, and practices. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 332–360). New York: Oxford University Press.
Greenacre, M. J. (2012). Correspondence analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 142–153). New York: Oxford University Press.
Haig, B. D. (2012). The philosophy of quantitative methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 7–31). New York: Oxford University Press.
Hallberg, K., Wing, C., Wong, V., & Cook, T. D. (2012). Experimental design for causal inference: Clinical trials and regression discontinuity designs. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 223–236). New York: Oxford University Press.
Harlow, L. (2012). Teaching quantitative psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 105–117). New York: Oxford University Press.
Marsh, H. W., Hau, K-T., Wen, Z., & Nagengast, B. (2012). Moderation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 361–386). New York: Oxford University Press.
Masyn, K. E. & Nylund-Gibson, K. (2012). Mixture modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 551–611). New York: Oxford University Press.
McArdle, J. J. & Kadlec, K. M. (2012). Structural equation models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 295–337). New York: Oxford University Press.
McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.
Medland, S. E. (2012). Quantitative analysis of genes. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 219–234). New York: Oxford University Press.
Ostrov, J. M. & Hart, E. J. (2012). Observational methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 286–304). New York: Oxford University Press.
Peterson, T. (2012). Analyzing event history data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 486–516). New York: Oxford University Press.
Price, L. R. (2012). Analysis of imaging data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 175–197). New York: Oxford University Press.
Ram, N., Brose, A., & Molenaar, P. C. M. (2012). Dynamic factor analysis: Modeling person-specific process. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 441–457). New York: Oxford University Press.
Rosnow, R. L. & Rosenthal, R. (2012). Quantitative methods
Hox, J. J., (2012). Multilevel regression and multilevel struc- and ethics. In T. D. Little (Ed.), The Oxford Handbook of
tural equation modeling In T. D. Little (Ed.), The Oxford Quantitative Methods (Vol. 1, pp. 32–54). New York: Oxford
Handbook of Quantitative Methods (Vol. 2, pp. 281–294). University Press.
New York: Oxford University Press. Rupp, A. A. (2012). Clustering and classification. In T. D. Little
Jaccard, J. (2012). Theory construction, model building, and (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2,
model selection. In T. D. Little (Ed.), The Oxford Handbook pp. 517–611). New York: Oxford University Press.
of Quantitative Methods (Vol. 1, pp. 82–104). New York: Spector, P. E. (2012). Survey design and measure development.
Oxford University Press. In T. D. Little (Ed.), The Oxford Handbook of Quantitative
Johnson, P. E. (2012). Monte Carlo analysis in academic research. Methods (Vol. 1, pp. 170–188). New York: Oxford University
In T. D. Little (Ed.), The Oxford Handbook of Quantitative Press.
Methods (Vol. 1, pp. 454–479). New York: Oxford University Steiner, P. M. & Cook, D. (2012). Matching and Propensity
Press. Scores. In T. D. Little (Ed.), The Oxford Handbook of Quan-
Kaplan, D. & Depaoli, S. (2012). Bayesian statistical methods. titative Methods (Vol. 1, pp. 237–259). New York: Oxford
In T. D. Little (Ed.), The Oxford Handbook of Quantitative University Press.
Methods (Vol. 1, pp. 407–437). New York: Oxford University Strobl, C. (2012). Data mining. In T. D. Little (Ed.), The Oxford
Press. Handbook of Quantitative Methods (Vol. 2, pp. 678–700).
Kelley, K. (2012). Effect size and sample size planning. In New York: Oxford University Press.
T. D. Little (Ed.), The Oxford Handbook of Quantitative Meth- Thompson, B. (2012). Overview of traditional/classical statisti-
ods (Vol. 1, pp. 206–222). New York: Oxford University cal approaches. In T. D. Little (Ed.), The Oxford Handbook of
Press. Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford
Kingston, N. M. & Kramer, L. B. (2012). High stakes test con- University Press.
struction and test use. In T. D. Little (Ed.), The Oxford Van Zandt, T., & Townsend, J. T. (2012). Designs for and
Handbook of Quantitative Methods (Vol. 1, pp. 189–205). analyses of response time experiments. In T. D. Little
New York: Oxford University Press. (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1,
MacKinnon, D. P., Kisbu-Sakarya, Y., & Gottschall, A. C. pp. 260–285). New York: Oxford University Press.
(2012). Developments in mediation analysis. In T. D. von Eye, A., Mun, E. U., Mair, P., & von Weber, S. Configural
Little (Ed.), The Oxford Handbook of Quantitative Meth- frequency analysis. In T. D. Little (Ed.), The Oxford Hand-
ods (Vol. 2, pp. 338–360). New York: Oxford University book of Quantitative Methods (Vol. 2, pp. 73–105). New York:
Press. Oxford University Press.
little 5
Walls, T. A. (2012). Intensive longitudinal data. In T. D. Little Quantitative Methods (Vol. 1, pp. 55–81). New York: Oxford
(Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, University Press.
pp. 432–440). New York: Oxford University Press. Woods, C. M. (2012). Categorical methods. In T. D. Little
Wang, L. L., Watts, A. S., Anderson, R. A., & Little, T. D. (2012). (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2,
Common fallacies in quantitative research methodology. In pp. 52–73). New York: Oxford University Press.
T. D. Little (Ed.), The Oxford Handbook of Quantitative Meth- Wu, W., Selig, J. P., & Little, T. D. (2012). Longitudinal data
ods (Vol. 2, pp. 718–758). New York: Oxford University analysis. In T. D. Little (Ed.), The Oxford Handbook of Quan-
Press. titative Methods (Vol. 2, pp. 387–410). New York: Oxford
Wei. W. W. S. (2012). Time series analysis. In T. D. Little (Ed.), University Press.
The Oxford Handbook of Quantitative Methods (Vol. 2, pp. Yuan, K-H., & Schuster, C. (2012). Overview of statistical esti-
458–485). New York: Oxford University Press. mation methods. In T. D. Little (Ed.), The Oxford Handbook
Widaman, K. F., Early, D. R., & Conger, R. D. (2012). Special of Quantitative Methods. (Vol. 1, pp. 361–387). New York:
populations. In T. D. Little (Ed.), The Oxford Handbook of Oxford University Press.
6 introduction
CHAPTER

The Philosophy of Quantitative Methods

B. D. Haig
1992; Manicas &amp; Secord, 1983). Nineteenth century British empiricism had a major influence on the development of British statistics in the first half of the twentieth century (e.g., Mulaik, 1985). The statistical methods developed in that intellectual milieu remain an important part of psychology’s statistical research practice. For example, Karl Pearson’s product moment correlation coefficient was taken by its founder to be the quantitative expression of a causal relation viewed in empiricist terms. Similarly, Fisher’s endorsement of inductive methods as the proper view of scientific method stemmed from a commitment to the empiricism of his day. Even in the current postpositivist philosophical climate, authors of research methods textbooks sometimes portray quantitative research as essentially positivist in its empiricist commitments (Yu, 2006). Among other things, positivism restricts its attention to what can be observed and regards theories as instruments that organize claims about observables but that do not explain them by appeal to hidden causes.

Qualitative methodologists also often bolster their preferred conception of qualitative research by comparing it with an unflattering positivist picture of quantitative research. They tend to adopt the philosophy of social constructionism, which is opposed to the traditional notions of truth, objectivity, and reason, maintaining that our understanding of the world is determined by social negotiation. In one or another of its various forms, it is the philosophy of choice for many qualitative researchers, and it tends to be employed by those who are opposed, or indifferent, to quantitative methods. I shall not consider it further in this chapter.

In what follows, I will adopt a scientific realist perspective on research methods. Although the subject of considerable debate, and opposed by many antirealist positions, scientific realism is the dominant philosophy of science today. It is also the tacit philosophy of most working scientists. This fact, combined with its current heavy emphasis on the nature of scientific practice, makes scientific realism a philosophy for science—not just a philosophy of science.

Scientific Realism

The philosophies of positivism, social constructionism, and scientific realism just mentioned are really family positions. This is especially true of scientific realism, which comes in many forms. Most versions of scientific realism display a commitment to at least two doctrines: (1) that there is a real world of which we are part and (2) that both the observable and unobservable features of that world can be known by the proper use of scientific methods. Some versions of scientific realism incorporate additional theses (e.g., the claims that truth is the primary aim of science and that successive theories more closely approximate the truth), and some will also nominate optional doctrines that may, but need not, be used by scientific realists (e.g., the claim that causal relations are relations of natural necessity; see Hooker, 1987). Others who opt for an “industrial strength” version of scientific realism for the physical sciences are more cautious about its successful reach in the behavioral sciences. Trout (1998), for example, subscribes to a modest realism in psychology, based on his skepticism about the discipline’s ability to produce deeply informative theories like those of the physical sciences.

Given that this chapter is concerned with the philosophical foundations of quantitative methods, the remaining characterization of scientific realism will limit its attention to research methodology.

Scientific Realist Methodology

Scientific realism boasts a rich conception of methodology, which is of considerable help in understanding and guiding research. The resourcefulness of realist methodology is suggested in the following description of its major characteristics (see Hooker, 1987; Nickles, 1987). First, realist methodology has three major tasks: to describe how methods function; to evaluate methods critically against their rivals; and to recommend how to use particular methods to pursue chosen research goals.

Second, realist methodology is critically aim-oriented. At a broad level, it recommends the pursuit of valuable truth, explanatory understanding, and effective control as primary research goals; and it is concerned with the mutual adjustment of methods and research goals.

Third, realist methodology is naturalistic—that is, it is a substantive domain that uses the methods of the various sciences to study method itself. Proctor and Capaldi (2001) advocate a naturalistic approach to methodology in psychology in which the empirical justification of methodological ideas is emphasized.

A fourth feature of realist methodology is that it is both generative and consequentialist. Generative methodology involves reasoning to, and accepting, knowledge claims in question from warranted premises. Exploratory factor analysis is a prominent example of a method in psychology that involves
a generative justification of the factorial hypotheses to which it gives rise. By contrast, consequentialist methodology focuses on reasoning from knowledge claims in question to their testable consequences. The widely used hypothetico-deductive method, with its emphasis on predictive accuracy, clearly exhibits a consequentialist approach to justifying knowledge claims.

Fifth, realist methodology acknowledges the need for two quite different approaches to justifying knowledge claims. In philosophy these are commonly known as reliabilism and coherentism. With reliabilism, a belief is justified to the extent that it is acquired by reliable processes. In general, the innumerable methods that contribute to the detection of empirical phenomena are concerned with reliabilist justification. With coherentism, a belief is justified in virtue of its coherence with other beliefs. Thagard’s (1992) theory of explanatory coherence, used for the comparative evaluation of scientific theories, embodies an illuminating coherentist perspective on knowledge justification. These two forms of justification are different, complementary, and of equal importance.

As a sixth feature, realist methodology regards science as a problem-oriented endeavor in which problems are conceptualized as constraints on their effective solution (Haig, 1987; Nickles, 1981). On this formulation, the constraints are actually constitutive of the problem itself; they characterize the problem and give it structure. Further, by including all the constraints in the problem’s articulation, the problem enables the researcher to direct inquiry effectively by pointing the way to its own solution. In a real sense, stating the problem is half the solution!

Finally, realist methodology takes the researcher’s makeup as a “knowing subject” seriously. Among other things, the researcher is regarded as a satisficer who makes heavy use of heuristics to guide her inquiries. For example, McGuire (1997) discusses many useful heuristics that can be employed to facilitate the generation of hypotheses in psychological research.

Scientific realist methodology undergirds a wide variety of methods, strategies, and heuristics that have been successfully used to produce worthwhile knowledge about both empirical phenomena and explanatory theories. If quantitative researchers in psychology engage this literature seriously, then they will find resources for enhancing their understanding of research methods.

I turn now to a philosophical consideration of the selected research methods.

Exploratory Data Analysis

In psychological research, the major emphasis in data analysis is placed on statistical inference, where the task is to find out whether a data set exhibits a designated feature of interest characterized with reference to a probabilistic model. Unfortunately, the dominance of this goal has had the effect of discouraging a concerted examination of data sets in terms of their quality and structure. Detailed explorations of data are important in science, and it often makes good sense to conduct them in the absence of a probabilistic model, or before the model is formulated and adopted.

Consistent with this emphasis on the close examination of data, the last 30 years have witnessed the strong development of an empirical, data-oriented approach to statistics. One important part of this movement is exploratory data analysis, which contrasts with the more familiar traditional statistical methods, with their characteristic emphasis on the confirmation of knowledge claims.

Exploratory Data Analysis and John Tukey

Spelling out a philosophy of exploratory data analysis is difficult, and few methodologists have attempted to do so (for an initial attempt to do this from a Bayesian perspective, see Good, 1983). However, the intellectual progenitor of modern exploratory data analysis, John Tukey, has developed a systematic perspective on the subject that has helped to highlight its importance to research. It deserves to be considered as a philosophy of data analysis in its own right. Therefore, this brief examination of the philosophy of exploratory data analysis pays particular attention to Tukey’s thinking on the topic.

According to Tukey (1980), data analysis should be treated as a two-stage compound process in which the patterns in the data are first suggested by exploratory data analysis and then critically checked through the use of confirmatory data analysis procedures. Exploratory data analysis involves descriptive—and frequently quantitative—detective work designed to reveal structure or pattern in the data sets under scrutiny. The data analyst is encouraged to undertake an open-eyed investigation of the data and perform multiple analyses using a variety of intuitively appealing and easily used techniques.

The compendium of methods for the exploration of data, many of which were developed by Tukey (1977), is designed to facilitate both discovery and communication of information. These
methods are concerned with the effective organization of data, the construction of graphical and semi-graphical displays, and the examination of distributional assumptions and functional dependencies. Two additional attractive features of Tukey’s methods are their robustness to changes in underlying distributions and their resistance to outliers in data sets. Exploratory methods with these two features are particularly suited to data analysis in psychology, where researchers are frequently confronted with ad hoc sets of data on amenable variables, which have been acquired in convenient circumstances.

Exploratory Data Analysis and Scientific Method

In his writings on data analysis, Tukey (1969) has emphasized the related ideas that psychology is without an agreed-upon model of data analysis and that we need to think more broadly about scientific inquiry. In an invited address to the American Psychological Association in 1968, Tukey presented the following excerpt from a prominent psychologist for his audience to ponder. I quote in part:

I have the feeling that psychology is currently without a dominant viewpoint concerning a model for data analysis. In the forties and early fifties, a hypothetico-deductive framework was popular, and our mentors were keen on urging the design of “crucial” experiments for the refutation of specific predictions made from one or another theory. Inductive empiricism was said to be disorderly and inefficient. You and I knew then, as we know now, that no one approach is uniformly most powerful. (Tukey, 1969, p. 90)

Consider the hypothetico-deductive and inductive conceptions of scientific method, which are mentioned here as candidate models for data analysis. Most psychological researchers continue to undertake their research within the confines of the hypothetico-deductive method. Witness their heavy preoccupation with theory testing, where confirmatory data analyses are conducted on limited sets of data gathered in accord with the dictates of the test predictions of theories. In this regard, psychologists frequently employ tests of statistical significance to obtain binary decisions about the credibility of the null hypothesis and its substantive alternatives. However, the use of statistical significance tests in this way strongly blunts our ability to look for more interesting patterns in the data. Indeed, the continued neglect of exploratory data analysis in psychological research occurs in good part because there is no acknowledged place for such work in the hypothetico-deductive conception of inquiry (Wilkinson &amp; the Task Force on Statistical Inference, 1999).

I think the worth of the inductive method as a model for data analysis is dismissed too quickly in the above quotation. The major failing of the inductive account of scientific method lies not so much with its perspective on data analysis as with its prohibition of the formulation of explanatory theories. A modern conception of inductive method is embedded in the important scientific process of phenomena detection. Phenomena are relatively stable, recurrent, general features of the world that we seek to explain (Woodward, 1989), and their detection frequently involves an inductive process of empirical generalization. With its emphasis on phenomena detection, inductive method reserves an important place for the exploratory analysis of data. In detecting phenomena, one is concerned to extract a signal from the noise of data, and for this the intensive search of large amounts of data is frequently essential, precisely because securing a heavy information yield from our data is likely to throw up potentially interesting data patterns that might turn out to be genuine phenomena. In this context, data mining is encouraged, and the capabilities of exploratory techniques in this regard often make them the appropriate methods of choice.

By contrast, Behrens and Yu (2003) suggest that the inferential foundations of exploratory data analysis are to be found in the idea of abduction, or explanation (and, by implication, not in the notions of hypothetico-deductive testing and inductive generalization). However, exploratory data analysis is a descriptive pattern-detection process that is a precursor to the inductive generalizations involved in phenomena detection. As will be seen later in the consideration of exploratory factor analysis, abductive inference is reserved for the construction of causal explanatory theories that are introduced to explain empirical phenomena. Behrens and Yu’s suggestion conflates the quite different ideas of descriptive and explanatory inference.

Exploratory Data Analysis and a Model of Data Analysis

In the spirit of Tukey’s (1962, 1980) push for breadth of vision in data analysis, one might usefully take a perspective on data analysis that extends Tukey’s two-stage model (Haig, 2005b). Before exploring data for patterns of potential interest, researchers should assiduously screen their data for their quality. This initial data analysis involves
checking for the accuracy of data entries, identifying and dealing with missing and outlying data, and examining the data for their fit to the assumptions of the data analytic methods to be used. This important, and time-consuming, preparatory phase of data analysis has not received the amount of explicit attention that it deserves in behavioral science education and research practice. Fidell and Tabachnick (2003) provide a useful overview of the task and techniques of initial data analysis.

Confirmation of the initial data patterns suggested by exploratory data analysis is a “just checking” strategy and as such should be regarded as a process of close replication. However, it is essential to go further and undertake constructive replications to ascertain the extent to which results hold across different methods, treatments, subjects, and occasions. Seeking results that are reproducible through constructive replications requires data analytic strategies that are designed to achieve significant sameness rather than significant difference (Ehrenberg &amp; Bound, 1993). Exploratory data analysis, then, can usefully be regarded as the second in a four-stage sequence of activities that, in turn, attend to data quality, pattern suggestion, pattern confirmation, and generalization.

Resampling Methods and Reliabilist Justification

Since the 1980s, statisticians have been able to exploit the massive computational power of the modern computer and develop a number of computer-intensive resampling methods, such as the jackknife, the bootstrap, and cross-validation (Efron &amp; Tibshirani, 1993). These methods constitute one important set of confirmatory procedures that are well suited to the task of checking on the data patterns thrown up by exploratory data analysis. By exploiting the computer’s computational power, these resampling methods free us from the restrictive assumptions of modern statistical theory, such as the belief that the data are normally distributed, and permit us to gauge the reliability of chosen statistics by making thousands, even millions, of calculations on many data points.

It is important to appreciate that the resampling methods just mentioned make use of a reliabilist approach to justification. Here, the reliability checks on emergent data patterns are provided by consistency of test outcomes, which are time-honored validating strategies. Our willingness to accept the results of such checks is in accord with what Thagard (1992) calls the principle of data priority. This principle asserts that statements about observational data, including empirical generalizations, have a degree of acceptability on their own. Such claims are not indubitable, but they do stand by themselves better than claims justified solely in terms of what they explain. What justifies the provisional acceptance of data statements is that they have been achieved by reliable methods; what strengthens our provisional belief in the patterns thrown up by exploratory data analysis is their confirmation through use of computer-based resampling methods.

Further, it is important to appreciate that the acceptability of claims provided by the reliabilist justification of computer-intensive resampling methods can be enhanced by making appropriate use of a coherentist approach to justification. One important form of coherence is explanatory coherence, and one method that delivers judgments of explanatory coherence is the theory of explanatory coherence (Thagard, 1992). According to this theory, data claims, including empirical generalizations, receive additional justification if and when they enter into, and cohere with, the explanatory relations of the theory that explains them.

A Philosophy for Teaching Data Analysis

An underappreciated, but important, feature of Tukey’s writings on exploratory data analysis is the illuminating remarks on the teaching of data analysis that they contain. These remarks can be assembled into a constructive philosophy for teaching data analysis, which can properly be regarded as an aspect of an overall philosophy of exploratory data analysis. This philosophy of teaching advises us to think about and teach data analysis in a way that is quite different from the prevailing custom.

Provocatively, Tukey (1980) maintained that the proper role of statistics teachers is to teach that which is most difficult and leave that which is more manageable to good textbooks and computers. He recommended teaching data analysis the way he understood biochemistry was taught: concentrating on what the discipline of statistics has learned, perhaps with a discussion of how such things were learned. The detail of methods should be assigned to laboratory work, and the practice of learning data analytic techniques should be assigned to a different course in which problems arose. He foresaw that such a redirection in teaching data analysis would have to be introduced in phases. In Tukey’s (1962) words, “The proposal is really to go in the opposite direction from cookbookery; to teach not
‘what to do,’ nor ‘how we learned what to do,’ but rather, ‘what we have learned’” (p. 63). This advice is broadly consistent with the idea that we should teach research methods in terms of their accompanying methodology, a recommendation considered at the end of this chapter.

Another prominent feature of Tukey’s philosophy of teaching data analysis is his recommendation that we should teach both exploratory and confirmatory data analysis and that we have an obligation to do so. Tukey’s strong promotion of the value of exploratory data analysis was intended as a counter to the dominance of confirmatory data analysis in statistical practice. However, for Tukey, exploratory data analysis was not to be understood as more important than confirmatory data analysis because both are essential to good data analysis.

Tukey also suggested that exploratory data analysis should probably be taught before confirmatory data analysis. There are several reasons why this recommendation has merit. Properly taught, exploratory data analysis is probably easier to learn, and it promotes a healthy attitude to data analysis (encouraging one to be a dataphile without becoming a data junkie). It requires the investigator to get close to the data, analyze them in various ways, and seek to extract as much potentially important information from the data as possible. This is done to detect indicative patterns in the data before establishing through confirmatory data analysis that they are genuine patterns.

Tukey emphasized that learning exploratory data analysis centrally involves acquiring an appropriate attitude toward the data, which includes the following elements: exploratory data analysis is sufficiently important to be given a great deal of time; exploratory data analysis should be carried out flexibly, with multiple analyses being performed; and exploratory data analysis should employ a multiplicity of methods that enhance visual display.

Statistical Significance Testing

It is well known that tests of statistical significance are the most widely used methods for evaluating hypotheses in psychology (e.g., Hubbard &amp; Ryan, 2000). These tests have been popular in psychology for nearly 50 years and in statistics for about 75 years. Since the 1960s, a massive critical literature regarding their worth has developed in psychology. Important early contributions to this debate are collected in Morrison and Henkel (1970; see also Giere, 1972). Cohen (1994) provides a short, perceptive review of the controversy, whereas Nickerson (2000) has undertaken a useful, extensive review of the controversy since its beginning. Despite the plethora of critiques of statistical significance testing, most psychologists understand these tests poorly, frequently use them inappropriately, and pay little attention to the controversy they have generated (Gigerenzer, Krauss, &amp; Vitouch, 2004).

The significance testing controversy is multifaceted. This section will limit its attention to a consideration of the two major schools of significance testing, their hybridization and its defects, and the appropriateness of testing scientific hypotheses and theories using tests of statistical significance.

Psychologists tend to assume that there is a single unified theory of tests of statistical significance. However, there are two major schools of thought regarding significance tests: Fisherian and Neyman-Pearson. Initially, Neyman and Egon Pearson sought to build on and improve Fisher’s theory, but they subsequently developed their own theory as an alternative to it. There are many points of difference between the two schools, which adopt fundamentally different outlooks on the nature of scientific method. The uncritical combination of the two schools in psychology has led to a confused understanding of tests of statistical significance and to their misuse in research.

The Fisherian Significance Testing School

The Fisherian school of significance testing (e.g., Fisher, 1925) tests a hypothesis or theory of substantive interest against the null hypothesis that the experimental effect to be demonstrated is in fact absent. Fisher argued that an experiment is performed solely to give the data an opportunity to disprove the null hypothesis. No alternative hypothesis is specified, and the null hypothesis is the hypothesis to be nullified; it need not be the hypothesis of zero difference. Because one cannot accept the null hypothesis, no provision is made for Type II error, and, relatedly, there is no place for a statistical concept of power. Most importantly, and as noted earlier, Fisher subscribed to an inductive conception of scientific method and maintained that significance tests are vehicles of inductive reasoning. As such, they are concerned with evidence for beliefs.

Should We Use Fisher’s Significance Tests?

The question of whether behavioral scientists should use Fisherian significance tests as a defensible form of hypothesis testing largely centers on whether p-values are good measures of scientific evidence.
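To make the Fisherian recipe concrete (a single null hypothesis, no explicit alternative, and a p-value read as the probability of results at least as extreme as those observed), the following sketch computes an exact randomization-test p-value in Python. The two small groups of scores and the helper name are hypothetical illustrations, not an analysis taken from this chapter.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact randomization test, in the Fisherian spirit: under the null
    hypothesis of no treatment effect, the group labels are arbitrary, so
    every relabelling of the pooled scores is equally likely. The p-value
    is the proportion of relabellings whose absolute mean difference is at
    least as extreme as the one actually observed."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    count = total = 0
    for subset in combinations(indices, n_a):  # every possible relabelling
        a = [pooled[i] for i in subset]
        b = [pooled[i] for i in indices if i not in subset]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:  # at least as extreme
            count += 1
    return count / total

# Hypothetical scores for two small groups (illustrative data only).
treated = [12.0, 11.5, 13.1]
control = [9.8, 10.2, 10.9]
print(f"p = {permutation_p_value(treated, control):.3f}")  # prints "p = 0.100"
```

Note that, in keeping with Fisher's scheme as described above, the sketch reports only a p-value against the null of no effect; it makes no reference to an alternative hypothesis, to Type II error, or to power.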
Although many psychological researchers think, and some methodologists argue, that p-values can be used to measure strength of evidence, others hold them to be deeply problematic in this respect (e.g., Royall, 1997; Hubbard & Lindsay, 2008).

Some have argued that the concept of evidence adopted by Fisher is defective. For one thing, it is widely agreed by philosophers of science that theory or hypothesis evaluation is a comparative affair in which evidence against one hypothesis is evidence for another hypothesis (e.g., Sober, 2008). However, as just noted, Fisher only countenanced the null, and without an alternative hypothesis with which to compare the null, the logic of his significance testing is importantly incomplete; one cannot have evidence against the null without it being evidence for another hypothesis. The idea that one might allow for alternative hypotheses by joining Fisher's perspective with that of Neyman and Pearson will be seen by many to make matters worse. As will be noted shortly, Neyman and Pearson were concerned with the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment.

Others contend that a further major problem with Fisher's p-value is that it doesn't measure evidence properly (e.g., Goodman, 1993). In this regard, four claims about p-values are thought to disqualify it as a proper measure of evidence. First, the p-value is not a direct measure of the probability that the null is false; it is a conditional probability of obtaining the data, calculated on the assumption that the null hypothesis is true. Second, to take a number that quantifies rare data under the null is to confuse the strength of evidence with the probability of its occurrence (Royall, 1986). These two things are different because the probability is an indication of the long-run Type I error rate, which is separate from strength of evidence. Third, the calculation of a p-value combines the rarity of an obtained result with the probability of results that didn't happen. As Jeffreys (1939) stated long ago, "What the use of P implies . . . is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (p. 136). Finally, it is claimed that the p-value can exaggerate the strength of evidence against the point null and small interval hypotheses (Berger & Selke, 1987), which are frequently tested in psychology.

These criticisms are not convincing to all. For example, Hurlbert and Lombardi (2009) recently considered these arguments and recommended a shift in focus from the classical Fisherian framework to a neo-Fisherian alternative. Key elements of this alternative are that the probability of Type I error is not specified, p-values are not misleadingly described as "significant" or "nonsignificant," judgment is suspended about accepting the null hypothesis on the basis of high p-values, the "three-valued logic" that gives information about the direction of the effect being tested is adopted, accompanying effect size information is provided, and use is made of adjunct information such as confidence intervals, where appropriate. Two things to note here about this neo-Fisherian perspective are that it is concerned with significance assessments but not null hypothesis significance tests and that it is concerned with statistical tests as distinct from the tests of scientific hypotheses. There are empirical studies in psychology that approximate this modified Fisherian perspective on significance tests.

The Neyman-Pearson Hypothesis Testing School
Neyman and Pearson rejected Fisher's notion of a significance test and its use of a threshold p-value as a basis for rejecting the null hypothesis. In this regard, they added the requirement of the specification of an alternative hypothesis as well as the null hypothesis, and they replaced Fisher's evidential p-value with the Type I error rate, α (e.g., Neyman & Pearson, 1933). In addition, Neyman and Pearson permitted a more liberal formulation of the null hypothesis than did Fisher and regarded it as legitimate to speak of its acceptance. Thus, Type II error was admitted, and explicit provision was made for a statistical concept of power. To capture these differences, Neyman and Pearson spoke of their approach as hypothesis testing rather than significance testing.

However, the Neyman-Pearson school differs from the Fisherian school most fundamentally in maintaining that significance tests are rules of inductive behavior rather than vehicles for inductive reasoning. On this view, significance testing is regarded as a theory of prudential decision-making; accepting or rejecting a hypothesis amounts to adopting an appropriate course of action, rather than believing it to be probably true or false. At root, Neyman and Pearson held different views from Fisher about the nature of science.

Should We Use Neyman and Pearson's Hypothesis Tests?
It might seem that by focusing on hypothesis testing
haig 13
and making provision for alternative hypotheses and a statistical concept of power, Neyman and Pearson's approach marks an improvement over Fisher's significance tests. However, as noted earlier, Neyman and Pearson were concerned with measuring the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment; Type I and Type II error are objective probabilities, understood as long-run relative frequencies. As such, they belong to reference classes of endless series of trials that might have happened but never did. They are not about single events such as individual experiments. Further, being concerned with decision making understood as behavioral courses of action, Neyman-Pearson statistics do not measure strength of evidence for different hypotheses and thus do not tell us how confident we should be in our beliefs about those hypotheses.

It would seem, then, that the Neyman-Pearson approach is suitable for use only when the focus is on controlling errors in the long run (e.g., quality control experiments). However, this does not happen often in psychology.

The Hybrid
Although the Neyman-Pearson theory of testing is the official theory of statistical testing in the field of professional statistics, textbooks in the behavioral sciences ensure that researchers are instructed to indiscriminately adopt a hybrid account of tests of statistical significance, one that is essentially Fisherian in its logic but that is often couched in the decision-theoretic language of Neyman and Pearson. The hybrid logic is a confused and inconsistent amalgam of the two different schools of thought (Acree, 1979; Gigerenzer, 1993; Spielman, 1974; but see Lehmann, 1993, for the suggestion that the best elements of both can be combined in a unified position). To the bare bones of Fisherian logic, the hybrid adds the notion of Type II error (opposed by Fisher) and the associated notion of statistical power (which Fisher thought could not be quantified) but only at the level of rhetoric (thereby ignoring Neyman and Pearson), while giving a behavioral interpretation of both Type I and Type II errors (vigorously opposed by Fisher)! Because the authors of statistical textbooks in the behavioral sciences tend to present hybrid accounts of significance testing, aspiring researchers in these sciences almost always acquire a confused understanding of such tests. It is most unfortunate that many writers of statistics textbooks in the behavioral sciences have unwittingly perpetuated these basic misunderstandings.

To make matters worse, this confusion is compounded by a tendency of psychologists to misrepresent the cognitive accomplishments of significance tests in a number of ways. For example, levels of statistical significance are taken as measures of confidence in research hypotheses, likelihood information is taken as a gauge of the credibility of the hypotheses being tested, and reported levels of significance are taken as measures of the replicability of findings (Gigerenzer, Krauss, & Vitouch, 2004).

Significance Tests and Theory Testing
Meehl (1967, 1978, 1997) has made one of the strongest criticisms of the use of tests of statistical significance in psychology. He argued that the widespread use of tests of statistical significance to test substantive hypotheses and theories is deeply flawed because the support for a hypothesis or theory obtained by rejecting the null hypothesis is very weak.

Sometimes psychological researchers test a hypothesis of substantive interest against the point null hypothesis that the difference between the relevant population parameters is exactly zero. But a fact, long known to professional statisticians and appreciated by Meehl, is that the point null hypothesis is virtually always false in the behavioral and social sciences. The reason for this is that in these sciences, most things are related to one another at least to some small extent. In many parts of psychology, "everything in the brain is connected with everything else," resulting in a large positive manifold in which many variables correlate positively with one another to a significant degree. Thus, in the "softer" precincts of psychology, where "true" experiments are often not possible, obtaining a reasonable sample size makes the achievement of a statistically significant result the likely outcome of an empirical study. Meehl (1967) reasoned that if the null hypothesis of zero group differences is almost always false, then with sufficient power, directional hypotheses in these parts of psychology have a 50:50 chance of achieving statistical significance! Meehl and Lykken provided some empirical evidence for this claim more than 40 years ago (Meehl, 1967; see also Meehl, 1997). A recent simulation study on real data carried out by Waller (2004) confirmed Meehl's claim.

One can better appreciate what is wrong with using tests of statistical significance to appraise psychological theories by considering the logic involved in such testing.
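Meehl's 50:50 claim lends itself to a small simulation in the spirit of Waller's study, though the sketch below is my own construction with arbitrary numbers, not Waller's: every point null is false by a small amount, each theory's directional prediction is unrelated to the true direction, and samples are large enough to make power essentially perfect.

```python
import random
import statistics

def directional_success_rate(n_studies: int = 2000, n: int = 500,
                             seed: int = 1) -> float:
    """Fraction of studies in which a directional prediction reaches
    one-tailed significance (z > 1.645) when the null is never exactly
    true but the predicted direction is a coin flip."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        true_mean = rng.choice([-0.3, 0.3])    # the point null is always false
        predicted_sign = rng.choice([-1, 1])   # theory is blind to the truth
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = statistics.fmean(sample) / (statistics.pstdev(sample) / n ** 0.5)
        if predicted_sign * z > 1.645:         # significant, in predicted direction
            hits += 1
    return hits / n_studies

# With power near 1, the rate hovers around 0.5, as Meehl argued.
```

The quality of the theory never enters the simulation: the success rate is driven entirely by the coin-flip direction, which is precisely Meehl's point.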
It is helpful to begin by observing the important distinction between scientific hypotheses and the statistical hypotheses that may be derived from them (Bolles, 1962). Often in psychology, scientific theories about psychological processes or structures are formulated, and then statistical hypotheses are derived to facilitate their empirical testing. The former will characteristically invoke causal mechanisms for explanatory purposes, whereas the latter will appeal to statistical tests of null hypotheses about the population parameters of observed variables. Meehl has argued that psychological researchers tend to conflate the substantive theory and the statistical hypothesis and unwarrantedly take the successful refutation of the null hypothesis as grounds for concluding that the substantive theory has been strongly confirmed. However, if we have good grounds for believing that the point null hypothesis is probably false at the outset, and we use the null as the observation hurdle for our theories to surmount, then support for a theory by rejecting this implausible null alternative is quite feeble.

For good reason, then, Meehl (1990) has urged psychologists to abandon tests of statistical significance for purposes of substantive theory testing in psychology. He suggests that psychologists should replace them with a strategy that is adapted from the philosopher of science, Imre Lakatos (1970), who argued against Popper's strict falsificationist position on theory testing for the comparative theory appraisal of research programs over time. Meehl has maintained that one should defend and amend a theory only if it has a good track record of successful or near-miss predictions of low prior probability.

In conclusion, it is clear that there are fundamental philosophical differences between the Fisherian and Neyman-Pearson schools of statistical thought. Fisher's statistical contributions can be seen as a deliberate attempt to develop an objective alternative to Bayesian statistical thinking popular in Europe at the time, whereas those of Neyman and Pearson can be seen as an attempt to develop a position that is even more objective than Fisher's. Nevertheless, Bayesian thinking today is an attractive option for many statisticians who hold misgivings about one or both of these schools of thought. It is also the focus of much attention in the philosophy of science. It is to the elements of the Bayesian position that we now turn.

Bayesian Confirmation Theory
What is it for empirical evidence to provide confirmation or disconfirmation of a scientific hypothesis or theory? Methodologists of science have worked long and hard to answer this important and challenging question by developing theories of scientific confirmation. Despite the considerable fruits of their labors, there is widespread disagreement about which theory of confirmation we should accept. In recent times, a large number of philosophers of science have contributed to Bayesian confirmation theory (e.g., Earman, 1992; Howson & Urbach, 2006). Many philosophical methodologists now believe that Bayesianism, including Bayesian philosophy of science, holds the best hope for building a comprehensive and unified theory of scientific inference.

Bayesianism is a comprehensive position. It comprises a theory of statistical inference, an account of scientific method, and a perspective on a variety of challenging methodological issues. Today, it also boasts a fully fledged philosophy of science. In this section, attention is limited to a consideration of the strengths and weaknesses of Bayesian statistical inference, the ability of Bayesian confirmation theory to improve upon the hypothetico-deductive method, and the question of whether Bayesianism provides an illuminating account of the approach to theory evaluation known as inference to the best explanation.

Bayesian Statistical Inference
The Bayesian approach to statistical inference is so called because it makes central use of a theorem of the mathematical calculus of probability known as Bayes' theorem. This theorem can be written in a simple form as:

Pr(H/D) = Pr(H) × Pr(D/H) / Pr(D)

With the proviso that Pr(D) and Pr(H) cannot be zero, the theorem says that the posterior probability of the hypothesis is obtained by multiplying the prior probability of the hypothesis by the probability of the data, given the hypothesis (the likelihood), and dividing the product by the prior probability of the data. It is through use of this and other versions of Bayes' Theorem that Bayesians are able to implement their view of statistical inference, which is the orderly revision of opinion in the light of new information.

For Bayesians, a couple of features of this gloss on Bayesian statistical inference recommend themselves. Most importantly, the Bayesian approach squares with the stated purpose of scientific inquiry noted above—namely, securing the probability of a hypothesis in the light of the relevant evidence.
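The theorem just stated, together with the iterative use of it that Bayesians make, can be sketched in a few lines of Python; the hypotheses and numbers below are invented for illustration. Pr(D) is obtained by summing the prior-times-likelihood terms across the candidate hypotheses.

```python
def update(priors, likelihoods):
    """Bayes' theorem: posterior = prior * Pr(D|H), normalized by
    Pr(D) = sum over the hypotheses of prior * likelihood."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    pr_d = sum(joint)                      # prior probability of the data
    return [j / pr_d for j in joint]

# Two rival hypotheses, equal priors; H1 makes the data twice as likely as H2.
posterior = update([0.5, 0.5], [0.2, 0.1])    # ≈ [2/3, 1/3]

# Orderly revision of opinion: yesterday's posterior is today's prior.
posterior = update(posterior, [0.2, 0.1])     # ≈ [4/5, 1/5]
```

The same function covers the comparative, two-hypothesis form of the theorem given later in the chapter, since the denominator is simply the sum of the prior-times-likelihood terms for the rival hypotheses.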
The informational output of a traditional test of significance is the probability of the data, given the truth of our hypothesis, but it is just one input in the Bayesian scheme of things. A second stated desirable feature of the Bayesian view is its willingness to make use of relevant information about the hypothesis before the empirical investigation is conducted and new data are obtained, explicitly in the form of a prior probability estimate of our hypothesis. Traditional tests of statistical significance are premised on the assumption that inferences should be based solely on present data, without any regard for what we might bring to a study in the way of belief or knowledge about the hypothesis to be tested—a position that Bayesians contend is hardly designed to maximize our chances of learning from experience. To achieve their goal of the systematic revision of opinion on the basis of new information, Bayesians are able to employ Bayes' theorem iteratively. Having obtained a posterior probability assignment for their hypothesis via Bayes' theorem, they can then go on and use that posterior probability as the new prior probability in a further use of Bayes' theorem designed to yield a revised posterior probability, and so on. In this way, the Bayesian researcher learns from experience.

Criticisms of Bayesian Hypothesis Testing
Although my consideration of the merits of conventional significance tests and their Bayesian alternative is both sketchy and selective, some readers will sense that the Bayesian view provides an attractive alternative to the traditional approach, particularly when the latter assumes its hybrid form. However, as with all theories of confirmation, the Bayesian approach has come in for its share of criticism. These criticisms have tended to focus on the alleged problematic nature of prior probabilities. In this regard, it is objected that because Bayesians adopt a subjectivist conception of probability and resort to personal estimates of the prior probabilities of their hypotheses, they introduce an ineliminable, but highly undesirable, subjective element into their calculations. To this objection, the Bayesians have two plausible replies: they can concede that personal estimates of prior probabilities are subjective, that they may differ markedly from person to person, and that they are often very rough estimates and then go on to point out that when prior estimates err, they are brought into line by freshly obtained sets of data; or, they may appeal to the failure of strictly empiricist theories of confirmation, which hold that one may obtain an adequate test of a hypothesis solely on the basis of evidence and logic, and assert that in real-life situations, there is no alternative to relying on a subjective component in our testing efforts.

In deciding whether to adopt a Bayesian position on statistical inference, it should be kept in mind that one does not have to embrace a general Bayesian theory of scientific confirmation rather than, say, the hypothetico-deductive alternative. One might be a Bayesian when dealing with problems of statistical inference but remain wedded to a general hypothetico-deductive conception of scientific method. Or, more plausibly, one might employ Bayesian statistical methods when concerned with inferential problems about hypotheses for which we have the relevant probabilistic information, but adopt a non-probabilistic account of theory evaluation such as Thagard's theory of explanatory coherence, which will be referred to later in the chapter. The general point to be made here is that Bayes' theorem can help us deal with some problems of statistical inference, but clearly, a great deal of scientific work will be done with the use of other methods—some of them statistical and some of them not.

Bayesianism and the Hypothetico-Deductive Method
One of the clear achievements of Bayesianism is its ability to improve on the unsatisfactory approach to hypothesis and theory appraisal taken by the hypothetico-deductive method. The hypothetico-deductive method has long been the method of choice for the evaluation of scientific theories (Laudan, 1981), and it continues to have a dominant place in psychology. Despite its popularity, it is usually characterized in an austere manner: The researcher takes a hypothesis or theory of interest and tests it indirectly by deriving from it one or more observational predictions that are themselves directly tested. Predictions borne out by the data are taken to confirm the theory to some degree; those predictions that do not square with the data count as disconfirming instances of the theory. Normally, the theory is not compared with rival theories in respect of the data, only with the data themselves.

The hypothetico-deductive method, in something like this form, has been strongly criticized by methodologists on a number of counts (e.g., Glymour, 1980; Rozeboom, 1997). One major criticism of the method is that it is confirmationally lax. This laxity arises from the fact that any positive confirming instance of a hypothesis submitted to empirical test can confirm any hypothesis that is
conjoined with the test hypothesis, regardless of how plausible it might be. This state of affairs is known as the fallacy of irrelevant conjunction, or the tacking problem, because confirmation of a test hypothesis also confirms any conjunct that is attached to the test hypothesis. The fallacy of irrelevant conjunction arises with the hypothetico-deductive method because predictions are deduced from hypotheses only by making use of auxiliary hypotheses drawn from background knowledge.

Clearly, this is an unacceptable state of affairs. Bayesians have challenged the assumption that the occurrence of the consequences of a theory confirm the theory and its conjuncts holistically. They argue that the Bayesian approach enables the differential support of the elements of a theory, specifying conditions showing that E never increases the probability of H conjoined with any additional hypothesis by more than it increases the probability of H.

Another major criticism of the hypothetico-deductive method is that it tests a single hypothesis or theory of interest against the empirical evidence; it does not test a hypothesis or theory in relation to rivals in respect of the evidence. This is held to be a major flaw because it is widely agreed that theory evaluation is a comparative affair involving simultaneous evaluation of two or more hypotheses or theories.

The comparative nature of theory evaluation is straightforwardly handled by the Bayesian position by rewriting the simple form of Bayes' theorem given earlier to deal with two or more hypotheses. Here, Bayes' theorem is presented for the case of two hypotheses, where the theorem can be written for each hypothesis in turn. For the first hypothesis,

Pr(H1/D) = Pr(H1) × Pr(D/H1) / [Pr(H2) × Pr(D/H2) + Pr(H1) × Pr(D/H1)]

This says that the posterior probability of the first hypothesis is obtained by multiplying its prior probability by the probability of the data, given that hypothesis (the likelihood), and dividing the product by the value that results from adding the prior probability of the second hypothesis, multiplied by the likelihood for that hypothesis, to the prior probability of the first hypothesis, multiplied by its likelihood. Bayes' theorem for the second hypothesis is written in a similar way.

Bayesianism and Inference to the Best Explanation
Recently, some Bayesians have claimed that their perspective on scientific method can also provide an enhanced characterization of the important approach to theory evaluation known as inference to the best explanation. Inference to the best explanation is based on the belief that much of what we know about the world is based on considerations of explanatory worth. In contrast to the Bayesian approach, accounts of inference to the best explanation take theory evaluation to be a qualitative exercise that focuses on explanatory criteria rather than a quantitative undertaking in which one assigns probabilities to theories (Haig, 2009; Thagard, 1992).

Although inference to the best explanation has typically been regarded as a competitor for Bayesian theory evaluation, Lipton (2004) has recently argued that the two approaches are broadly compatible and that, in fact, their proponents "should be friends." In broad terms, he suggests that judgments of the loveliest explanation, which are provided by the evaluative criteria of inference to the best explanation, such as unificatory power, precision, and elaboration of explanatory mechanisms, contribute to assessments of the likeliest explanation, which are provided by the probabilities of the Bayesian approach. Specifically, Lipton maintains that the explanatory considerations invoked in inference to the best explanation guide determination of the prior probabilities (and the likelihoods) that are inserted in Bayes' Theorem.

However, although appeal to explanatory matters might be one way in which Bayesians can determine their prior probabilities, Lipton does not suggest how this might be done. Further, those who hold inference to the best explanation to be a normative approach to scientific theory evaluation, with its own distinctive character, will worry that Lipton relegates it to a descriptive role within a Bayesian normative framework (e.g., Psillos, 2004).

Another way of showing the compatibility of inference to the best explanation and Bayesianism is to translate the evaluative criteria employed within inference to the best explanation into probabilistic terms. McGrew (2003) has done this by taking the important theoretical virtue of consilience, or explanatory breadth, and showing that its Bayesian form leads to higher posterior probabilities of the hypotheses being evaluated. Nevertheless, McGrew has acknowledged that by translating consilience into its "flattened" probabilistic form, it no longer
remains a genuine explanatory virtue. Not only is there no guarantee that consilience will be concerned with an explanation of the evidence, there is no way that probabilistic translations of the explanatory virtues can refer to the causal connections that are often appealed to in scientific explanations. Further, Weisberg (2009) has recently argued that the explanatory loss incurred in such translations will occur for any distinctively explanatory virtue that is given such probabilistic treatment. In short, it would seem that Bayesianism cannot capture the intuitively important notion of explanatory power without significant loss.

What Should We Think About Bayesianism?
Philosophical assessments of the worth of Bayesianism range from claims that it is without peer as a theory of scientific reasoning to the view that it is fundamentally wrong-headed. Howson and Urbach (2006) exemplify the former view, claiming that scientific reasoning is both inductive and probabilistic and that the axioms of probability suffice to articulate such reasoning. The latter view is exemplified by Bunge (2008), who has argued that Bayesianism is fundamentally wrong for three reasons: (1) it assigns probabilities to statements rather than taking them as objective features of the world; (2) it conceives of probabilities as subjective; and (3) it appeals to probabilities in the absence of randomness.

To add to this mix of views, many statisticians take Bayesian statistical inference to be a superior alternative to classical statistical inference, for the reasons stated earlier. Finally, some advocates of Bayesianism see it as a comprehensive theory of confirmation, whereas others see it as having only context-specific application.

The difficulties of deciding just what to think about Bayesianism are captured well by the ambivalence of John Earman (1992), a Bayesian philosopher of science. He confesses to being an enthusiastic Bayesian on Mondays, Wednesdays, and Fridays. But on Tuesdays, Thursdays, and Saturdays, he holds doubts about the totalizing ambitions of Bayesianism and indeed whether it can serve as a proper basis for scientific inference. Faced with such difficulty, it is probably prudent to settle for a contextual application of Bayesian thinking, as indicated earlier in this section. For example, in particular domains such as medical diagnosis, where the relevant probabilistic information is often available, scientists sometimes appeal to the Bayesian corpus to justify the selective use of its methods. By contrast, in domains where the evaluation of explanatory hypotheses and theories is of primary concern, scientists have, for good reason, often employed something like inference to the best explanation. Like it or not, the intending Bayesian scientist will have to consult the relevant philosophical literature, among other methodological literatures, to furnish an informed justification for their Bayesian practices.

Meta-Analysis
In the space of three decades, meta-analysis has become a prominent methodology in behavioral science research, with the major developments coming from the fields of education and psychology (Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 2004). Meta-analysis is an approach to data analysis that involves the quantitative analysis of the data analyses of primary empirical studies. Hence the term meta-analysis, coined by Glass (1976). Meta-analysis, which comes in a variety of forms (Bangert-Drowns, 1986), is concerned with the statistical analyses of the results from many individual studies in a given domain for the purpose of integrating or synthesizing those research findings.

The following selective treatment of meta-analysis considers its possible roles in scientific explanation and evaluation research before critically examining one extended argument for the conclusion that meta-analysis is premised on a faulty conception of science.

Meta-Analysis and Explanation
Meta-analysis is a prominent example of a distinctive use of statistical methods by behavioral scientists to aid in the detection of empirical phenomena. By calculating effect sizes across primary studies in a common domain, meta-analysis helps us detect robust empirical generalizations (cf. Schmidt, 1992). By using statistical methods to ascertain the existence of such regularities, meta-analysis can be usefully viewed as the statistical analog of direct experimental replication. It is in this role that meta-analysis currently performs its most important work in science.

However, given that the detection of empirical phenomena and the construction of explanatory theories are quite different research tasks, the recent suggestion that meta-analysis can directly contribute to the construction of explanatory theory (Cook et al., 1992; Schmidt, 1993) is an arresting methodological claim.
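The effect-size aggregation involved in the phenomena-detection role described above can be sketched as follows. The chapter does not specify a pooling model, so this sketch assumes the standard inverse-variance fixed-effect scheme in the style of Hedges and Olkin, and the three primary studies are hypothetical.

```python
def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted (fixed-effect) meta-analytic pooling:
    returns the common-effect estimate and its sampling variance."""
    weights = [1.0 / v for v in variances]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return estimate, 1.0 / sum(weights)

# Three hypothetical primary studies: standardized effects and their variances.
effects = [0.30, 0.50, 0.40]
variances = [0.04, 0.08, 0.02]    # more precise studies get more weight
estimate, variance = fixed_effect_pool(effects, variances)   # ≈ 0.386, ≈ 0.011
```

A random-effects variant would add a between-study variance component to each weight; the phenomena-detection role described in the chapter is the same either way.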
In approving this extension of meta-analysis beyond a concern with phenomena detection, Schmidt has acknowledged that scientific explanation normally involves the causal explanation of observed phenomena. Nevertheless, he maintains that it is appropriate to take scientific explanation to include "all research processes that contribute ultimately to theory building, including the first step of determining what the relationships are among important variables or constructs and how stable these relationships are" (Schmidt, 1993, p. 1164). Thus, the demonstration of a general effect, such as the pervasive influence of psycho-educational treatments on adult surgical patients, is deemed to be a meta-analysis at the "lowest level of explanation." On the other hand, the use of meta-analysis to test competing theories of how patients cope with the stress of surgery is viewed as higher level explanatory meta-analysis.

However, this attempt to extend the role of meta-analytic methods beyond phenomena detection to explanation obscures the basic methodological distinction between phenomena detection and scientific explanation. As noted earlier in the chapter, the stable general effects gleaned from meta-analysis are empirical phenomena, and statements about phenomena are the objects of scientific explanations; they are not the explanations themselves. The question, "What do statements of empirical phenomena explain?" occasions no natural reply. This is not surprising, for the successful detection of phenomena is essentially a descriptive achievement that involves investigative practices that are, for the most part, quite different from explanatory endeavors. In psychology, these methods are often statistical in kind. By contrast, scientific explanation is often causal-mechanistic in nature (Salmon, 1984). On this view, explanation requires the identification of the mechanisms that underlie and give rise to empirical

thereby confer an explanatory role on meta-analysis itself. One does not assign status simply on the basis of association.

Meta-Analysis and Evaluative Inquiry
It is surprising that methodological discussions of meta-analysis and its applications have shown little regard for the rationale that Glass originally provided for its use. Glass claims that many researchers misunderstand meta-analyses of outcome research because they fail to take cognizance of his rationale. Specifically, this failure is offered by him as the reason for the widespread misunderstanding of Smith, Glass, and Miller's (1980) original meta-analysis of psychotherapy outcome studies.

In a number of different publications, Glass insists that meta-analysis should be understood as an exercise in evaluative research rather than in scientific research (Glass, 1972; Smith, Glass, & Miller, 1980; Glass & Kleigl, 1983). The core of Glass's underlying rationale for meta-analysis involves drawing a strong distinction between scientific and evaluative inquiry. Glass's position is that researchers as scientists are concerned to satisfy their curiosity by seeking truthful conclusions in the form of theories comprising explanatory laws. By contrast, evaluators undertake research on behalf of a client that is aimed at producing useful decisions based on descriptive determinations of the worth of particular products or programs. For Glass, the meta-analysis of outcome studies properly involves the integration of the products of evaluative research only.

The methodology for this conception of meta-analysis fashions the distinction between scientific and evaluative inquiry in terms of the relevance for each of the concepts of truth, explanation, values, problems, and generalizations. Because
of space limitations, I will consider just one of
phenomena, along with a detailing of the ways in
these contrasts—that of explanation. Glass contends
which those mechanisms produce the phenomena
that scientific inquiry involves the continual search
we seek to understand.
for subsurface explanations of surface phenomena.
When meta-analysis enters into the process of
Evaluative inquiry, on the other hand, does not seek
testing explanatory theories, it contributes to an
explanations:
evaluation of those theories in terms of predictive
success. However, this common strategy for evaluat- “A fully proper and useful explanation can be
ing scientific theories is not directly concerned with conducted without producing an explanation of why
their explanatory adequacy. To repeat, it is not being the product or program being evaluated is good or
denied that meta-analytic methods can be employed bad of how it operates to produce its effects . . . [It] is
in the course of testing theories, but meta-analysis usually enough for the evaluator to know that
itself is not an approach to theory testing (Chow, something attendant upon the [product or program]
1996). To employ meta-analysis to assist in the pre- is responsible for the valued outcomes.” (Glass, 1972,
dictive testing of an explanatory theory does not pp. 5–6)
haig 19
Glass’s position seems to be that although program treatments can be causally responsible for their measured outcomes, it matters little that knowledge of this gleaned from evaluation studies does not tell us how programs produce their effects, because such knowledge is not needed for policy action.

Glass is surely correct in asserting that scientists are centrally concerned with the construction of causal theories to explain phenomena, for this is the normal way in which they achieve understanding of the empirical regularities they discover. However, he is wrong to insist that proper evaluations should deliberately ignore knowledge of underlying causal mechanisms. The reason for this is that the effective implementation and alteration of social programs often benefits from knowledge of the relevant causal mechanisms involved (Gottfredson, 1984), and strategic intervention in respect of these is often the most effective way to bring about social change. Although standard versions of scientific realism are wrong to insist that the relevant causal mechanisms are always unobserved mechanisms, it is the case that appeal to knowledge of covert causal mechanisms will frequently be required for understanding and change.

To conclude this highly selective evaluation of Glass’s rationale for meta-analysis, science itself is best understood as a value-laden, problem-oriented human endeavor that tries to construct causal explanatory theories of the phenomena it discovers. There is no sound way of drawing a principled contrast between scientific and evaluative inquiry. These critical remarks are not directed against the worth of evaluation as such or against the use of meta-analysis in evaluating program or product effectiveness. They are leveled specifically at the conception of evaluation research that appears to undergird Glass’s approach to meta-analysis.

Meta-Analysis and the Nature of Science
Proponents of meta-analysis often justify the use of these methods by pointing out the need to glean valuable knowledge from the information in a domain that lies dormant in the cornucopia of scattered primary studies. However, Sohn (1996) has expressed concern about the quality of empirical psychological studies that are used in meta-analysis. He has urged resistance to the generally accepted view that meta-analysis is a form of research rather than a review of research, and he has balked at Schmidt’s (1992) revisionist model of possible future science as “. . . a two-tiered research enterprise [where] one group of researchers will specialize in conducting individual studies [and] another group will apply complex and sophisticated meta-analysis methods to those cumulative studies and will make the scientific discoveries” (p. 1180). Sohn’s primary concerns are to challenge the claim that meta-analysis is an important vehicle of scientific discovery and to identify the major problems besetting mainstream psychological research.

Sohn (1996) has questioned the basic idea of meta-analysis as a standalone literature review capable of discovering truths, whereas traditionally scientific discoveries were contained in the empirical findings of the primary studies themselves. For Sohn, the idea that meta-analytic literature reviews can make discoveries about nature rests on the assumption that the primary research literature is a proxy for nature. It is an assumption that he has roundly rejected.

Noting the tendency of meta-analysts to paint a bleak picture of progress in twentieth century psychology, Sohn (1996) has suggested that although meta-analysis has been introduced to improve matters in this regard, it is in fact symptomatic of its poor progress. In his judgment, this lack of good progress is a consequence of psychology adopting a hypothesis-testing view of science. For Sohn, this view of science seeks knowledge by testing research hypotheses about the relationship of descriptive variables without regard for causal mediating variables. Essentially, the approach amounts to the hypothetico-deductive testing of outcome studies through use of significance tests and effect size measures. Sohn maintains that there are, in fact, two deleterious consequences of such an approach to research: one is the lack of agreement about outcomes, and the other is the absence of knowledge of the causal mechanisms that are responsible for those alleged outcomes. Meta-analysis is indicted by Sohn for failing to remedy both types of defect.

However, Sohn has supported his claim that meta-analysis does not produce demonstrable evidence for treatment effects in a curious way. He has acknowledged that Smith, Glass, and Miller’s (1980) well-known meta-analytic treatment of the benefits of psychotherapy has been corroborated by subsequent meta-analyses yet has maintained that this does not constitute evidence for replicable effects. He has expressed a distrust of research that relies on statistical methods for making claims about replicable effects. This distrust appears to
be founded in part on an extension of the view attributed to Lord Rutherford that if an experimental study requires statistics, then the experiment is in need of improvement. For Sohn, “If one’s science needs [meta-analysis], one should have done better science.”

However, this view flies in the face of widely accepted scientific practice. Woodward’s (1989) detailed examination of the practice of phenomena detection in science strongly supports the view that different parts of the various sciences, from physics to anthropology, appropriately make extensive use of statistical methods in the detection of empirical phenomena. It is hard to imagine that statistics would exist as we currently know it unless it provided a necessary armament for science. In this regard, it is worth noting that Sohn has acknowledged the claim made by Hedges and Olkin (1985) that meta-analysis in some form or other has a long history of use in the hard sciences. Sohn has stated his disagreement with this position, but he has not argued against it. Space limitations preclude further analysis of Sohn’s (1996) argument, but perhaps enough has been said to suggest that it should be regarded with some skepticism.

More work on the philosophical foundations of meta-analysis is clearly needed. However, from this highly selective examination of its conceptual foundations, it can be concluded that meta-analysis receives its primary justification in scientific research by articulating one, but only one, way in which researchers can fashion empirical generalizations from the findings of primary studies. It derives its importance in this role directly from the importance accorded the goal of phenomena detection in science.

Exploratory Factor Analysis
Despite the advanced statistical state and frequent use of exploratory factor analysis in the behavioral sciences, debate about its basic nature and worth abounds. Thurstone (1947) has appropriately emphasized the exploratory nature of the method, and many methodologists take it to be a method for postulating latent variables that are thought to underlie patterns of correlations. Some, however, understand exploratory factor analysis as a method of data reduction that provides an economical description of correlational data. The present section considers this and other important foundational issues that have figured prominently in discussions of the method.

Factor Analytic Inference
Alongside the debate between the fictionalist and realist interpretations of factors, there is a difference of view about whether the basic nature of factor analytic inference is inductive or abductive. Expositions of exploratory factor analysis seldom consider its inferential nature, but when they do, the method is usually said to be inductive in character. This is not surprising, given that exploratory factor analysis can be plausibly located historically within seventeenth- and eighteenth-century empiricist philosophy of science and its inductive conception of inquiry (Mulaik, 1987). However, even if one relaxes the Baconian ideal that inductive method is an algorithm that produces incorrigible knowledge, an inductive characterization of exploratory factor analysis seems inappropriate. This is because inductive inference, being descriptive inference, cannot take the researcher from manifest effects to theoretical entities that are different in kind from those effects. However, abductive inference, which is concerned with the generation and evaluation of explanatory hypotheses, can do so. For this reason, exploratory factor analysis is better understood as an abductive method of theory generation (Haig, 2005a), a characterization that coheres well with its general acceptance as a latent variable method. With exploratory factor analysis, abductive inference is explanatory inference that leads back from presumed effects to underlying causes.

There are different forms of abductive reasoning. Exploratory factor analysis is a method that can facilitate the drawing of explanatory inferences that are known as existential abductions. Existential abductions enable researchers to hypothesize the existence, but not the nature, of entities previously unknown to them. The innumerable examples of existential abduction in science include the initial postulation of hidden entities such as atoms, genes, tectonic plates, and personality traits. In cases like these, the primary thrust of the initial abductive inferences is to claims about the existence of theoretical entities to explain empirical facts or phenomena. Similarly, the hypotheses given to us through the use of exploratory factor analysis postulate the existence of latent variables such as Spearman’s g and extraversion. It remains for further research to elaborate on the first rudimentary conception of these variables and their interrelation.

The factor analytic use of existential abduction to infer the existence of, say, the theoretical entity
g, can be coarsely reconstructed in accordance with the following argument schema:

The surprising empirical phenomenon of the positive correlations among different tests of ability is identified.
If g exists, and it is validly and reliably measured by a Wechsler intelligence scale (and/or some other objective test), then the positive manifold would follow as a matter of course.
Hence, there are grounds for judging the hypothesis of g to be initially plausible and worthy of further pursuit.

Note that the schema for abductive inference and its application to the generation of the hypothesis of g is concerned with the form of the arguments involved, rather than with the actual generation of the explanatory hypotheses. The explanatory hypothesis is given in the second premise of the argument. An account of the genesis of the explanatory hypothesis must, therefore, be furnished by some other means. It is plausible to suggest that reasoning to explanatory hypotheses trades on our evolved cognitive ability to abductively generate such hypotheses (Carruthers, 2002). Whatever its origin, an informative methodological characterization of the abductive nature of factor analytic inference must appeal to the scientist’s own psychological resources as well as those of logic. This injunction is motivated by the realist methodological thesis of naturalism stated near the beginning of the chapter.

Although exploratory factor analysis exemplifies well the character of existential abduction, it is clearly not an all-purpose method for abductively generating explanatory hypotheses and theories. With its focus on common factors, it can properly serve as a generator of elementary theories only in those multivariate domains that have common causal structures.

The Principle of the Common Cause
It is well known that exploratory factor analysis is a common factor analytic model in which the latent factors it postulates are referred to as common factors. Less well known is the fact that there is an important principle of scientific inference, known as the principle of the common cause (e.g., Sober, 1988), that can be used to drive the nature and shape of the existential abductive inferences involved in exploratory factor analysis. The principle of the common cause can be formulated concisely as follows: “Whenever two or more events are improbably, or significantly, correlated, infer one or more common causes unless there is good reason not to.” Clearly, the principle should not be taken as a hard-and-fast rule, for in many cases, proper inferences about correlated events will not be in terms of common causes. The qualifier, “unless there is good reason not to,” should be understood as an injunction to consider causal interpretations of the correlated events other than the common causal kind. For example, in a given research situation, the correlated events might be related as direct causes, or their relationship might be mediated by a third variable in a causal sequence.

Although exploratory factor analysis is used to infer common causes, expositions of common factor analysis that explicitly acknowledge the importance of the principle of the common cause are rare. Kim and Mueller’s (1978) textbook exposition of factor analysis is a noteworthy exception. In discussing the conceptual foundations of factor analysis, these authors evince the need to rely on what they call the postulate of factorial causation. The postulate of factorial causation is characterized by them as “the assumption that the observed variables are linear combinations of underlying factors and that the covariation between observed variables solely results from their common sharing of one or more of the common factors” (p. 78). The authors make clear that the common factors mentioned in the assumption are to be regarded as underlying causal variables. Understood as a methodological injunction, this postulate functions as a variant of the principle of the common cause. Without appeal to this principle, factor analysts could not identify the underlying factor pattern from the observed covariance structure.

There are two features of the principle of the common cause that make it particularly suitable for use in exploratory factor analysis. First, it can be applied in situations where we do not know how likely it is that the correlated effects result from a common cause. The abductive inference to common causes is a basic explanatory move that is non-probabilistic and qualitative in nature. It is judgments about the soundness of the abductive inferences, rather than the assignment of probabilities, that confer initial plausibility on the factorial hypotheses spawned by exploratory factor analysis. Second, the principle can also be used in situations where we are essentially ignorant of the nature of the common cause. With this second feature, the principle of the common cause accommodates the fact that exploratory factor analysis trades in existential abductions.
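The postulate of factorial causation discussed above can be given a small numerical sketch. In the simulation below, everything is hypothetical and chosen only for illustration (the four "test score" variables, the loading values, and the sample size are not drawn from the chapter): observed variables are generated as linear combinations of one common factor plus unique noise, so their covariation arises solely from the shared factor, producing exactly the positive manifold that the abductive schema for g takes as its starting point.

```python
import numpy as np

# Hypothetical one-common-factor model: x_i = lambda_i * g + e_i.
# Per the postulate of factorial causation, the observed variables
# are linear combinations of an underlying factor, and their
# covariation comes solely from sharing that factor.
rng = np.random.default_rng(0)
n = 100_000                                  # simulated examinees
loadings = np.array([0.8, 0.7, 0.6, 0.5])    # illustrative loadings

g = rng.standard_normal(n)                   # latent common cause
unique = rng.standard_normal((4, n)) * np.sqrt(1 - loadings**2)[:, None]
x = loadings[:, None] * g + unique           # four observed "test scores"

R = np.corrcoef(x)                           # 4 x 4 correlation matrix

# The model implies a positive manifold: every off-diagonal
# correlation is positive, approximately lambda_i * lambda_j.
off_diag = R[~np.eye(4, dtype=bool)]
print(off_diag.min() > 0)                    # positive manifold holds

implied = np.outer(loadings, loadings)       # model-implied correlations
np.fill_diagonal(implied, 1.0)
print(np.max(np.abs(R - implied)))           # only small sampling error
```

Under this model the implied correlation between any two variables is the product of their loadings, which is why reasoning back from a pattern of uniformly positive correlations to a single unobserved common cause is a natural existential abduction of the kind described above.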
Further, it is important to appreciate that the principle of the common cause does not function in isolation from other methodological constraints. Embedded in exploratory factor analysis, the principle helps one limit existential abductive inference to those situations where we reason back from correlated effects to one or more common causes. Although covariation is an important basic datum in science, not all effects are expressed as correlations, and of course, not all causes are of the common causal variety. It follows from this that one should not always look for common causal interpretations of multivariate data, for there are numerous alternative latent variable models.

The Underdetermination of Factors
The methodological literature on exploratory factor analysis has given considerable attention to the indeterminacy of factors in the common factor model. Factor indeterminacy arises from the fact that the common factors are not uniquely determined by their related manifest variables. As a consequence, a number of different common factors can be produced to fit the same pattern of correlations in the manifest variables.

Although typically ignored by factor analytic researchers, factor indeterminacy is an epistemic fact of life that continues to challenge factor analytic methodologists. Some methodologists regard factor indeterminacy as a serious problem for common factor analysis and recommend the use of alternative methods such as component analysis methods because they are considered to be determinate methods.

One constructive perspective on the issue of factor indeterminacy has been suggested by Mulaik and McDonald (Mulaik & McDonald, 1978; McDonald & Mulaik, 1979; Mulaik, 1987). Their position is that the indeterminacy involved in interpreting the common factors in exploratory factor analysis is just a special case of the general indeterminacy of theory by empirical evidence widely encountered in science, and it should not, therefore, be seen as a debilitating feature that forces us to give up on common factor analysis.

Indeterminacy is pervasive in science. It occurs in semantic, metaphysical, and epistemological forms (McMullin, 1995). Factor indeterminacy is essentially epistemological in nature. The basic idea of epistemological or, more precisely, methodological indeterminacy is that the truth or falsity (better, acceptance or rejection) of a hypothesis or theory is not determined by the relevant evidence (Duhem, 1954). In effect, methodological indeterminacy arises from our inability to justify accepting one theory among alternatives on the basis of empirical evidence alone.

Mulaik (1987) sees underdetermination in exploratory factor analysis as involving inductive generalizations that go beyond the data. However, inductive underdetermination should be seen as applying specifically to the task of establishing factorial invariance where one seeks constructive or external replication of factor patterns. However, for exploratory factor analysis there is also a need to acknowledge and deal with abductive underdetermination involved in the generation of explanatory factorial theories. The sound abductive generation of hypotheses is essentially educated guesswork. Thus, drawing from background knowledge, and constrained by correlational empirical evidence, the use of exploratory factor analysis can reasonably be expected to yield a plurality of factorial hypotheses or theories that are thought to be in competition. This contrasts strongly with the unrealistic expectation held by many earlier users of exploratory factor analysis that the method would deliver them strongly justified claims about the one best factorial hypothesis or theory.

How, then, can exploratory factor analysis deal with the specter of underdetermination in the context of theory generation? One plausible answer is that exploratory factor analysis narrows down the space of a potential infinity of candidate theories to a manageable subset by facilitating judgments of initial plausibility (Haig, 2005a). It seems clear enough that scientists often make judgments about the initial plausibility of the explanatory hypotheses and theories that they generate. However, it is less clear just to what this evaluative criterion amounts (cf. Whitt, 1992). With an abductive conception of exploratory factor analysis, judgments of the initial plausibility of theories are judgments about the soundness of the abductive arguments employed in generating those theories. It seems reasonable to suppose that those who employ exploratory factor analysis as an abductive method of theory generation often make compressed judgments of initial plausibility. By conferring judgments of initial plausibility on the theories it spawns, exploratory factor analysis deems them worthy of further pursuit, whereupon it remains for the factorial theories to be further developed and evaluated, perhaps through the use of confirmatory factor analysis. It should be emphasized that
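One familiar form of the indeterminacy discussed above, rotational indeterminacy, admits a minimal numerical sketch (the six-variable, two-factor loading matrix below is hypothetical, chosen only for illustration): two structurally different loading matrices, related by an orthogonal rotation, imply exactly the same covariance matrix, so correlational data alone cannot decide between them.

```python
import numpy as np

# Two-factor common factor model: Sigma = Lambda @ Lambda.T + Psi.
# Illustrative (hypothetical) loadings for six observed variables.
Lambda = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.6, 0.2],
    [0.1, 0.7],
    [0.2, 0.6],
    [0.0, 0.8],
])
Psi = np.diag(1.0 - (Lambda**2).sum(axis=1))  # unique variances

# Any orthogonal rotation T yields a rival loading matrix that
# reproduces the manifest covariances exactly.
theta = 0.7
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Lambda_rival = Lambda @ T

Sigma = Lambda @ Lambda.T + Psi
Sigma_rival = Lambda_rival @ Lambda_rival.T + Psi

# Two structurally different factor patterns, one implied covariance
# matrix: the data cannot distinguish the rival solutions.
print(np.allclose(Sigma, Sigma_rival))   # True
```

Because T @ T.T is the identity for any orthogonal T, infinitely many rival loading patterns fit the same pattern of correlations, which is precisely the kind of underdetermination that judgments of initial plausibility are then needed to narrow down.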
using exploratory factor analysis to facilitate judgments about the initial plausibility of hypotheses will still leave the domains being investigated in a state of considerable theoretical underdetermination. It should also be stressed that the resulting plurality of competing theories is entirely to be expected and should not be thought of as an undesirable consequence of employing exploratory factor analysis. To the contrary, it is essential for the growth of scientific knowledge that we vigorously promote theoretical pluralism (Hooker, 1987), develop theoretical alternatives, and submit them to critical scrutiny.

Exploratory Factor Analysis and Confirmatory Factor Analysis
The aforementioned consideration of exploratory factor analysis supports the conclusion that there is an important role for its use in factor analytic research. However, this conclusion raises the question of how exploratory factor analysis relates to its confirmatory namesake. In contrast to popular versions of the classical inductivist view of science that inductive method can generate secure knowledge claims, the use of exploratory factor analysis as an abductive method of theory generation can only furnish researchers with a weak logic of discovery—one that gives them educated guesses about underlying causal factors. It is for this reason that those who use exploratory factor analysis to generate theories need to supplement their generative assessments of the initial plausibility of those theories with additional consequentialist justification in the form of confirmatory factor analytic testing or some alternative approach to theory appraisal.

However, in the factor analytic literature, there is a division of opinion about whether exploratory factor analysis and confirmatory factor analysis should be viewed as complementary or competing methods of common factor analysis. Quite a number of factor analytic methodologists have expressed views that discourage their complementary use in factor analytic research. For example, Gorsuch (1983), in his well-known book on factor analysis, has expressed a view about the relative importance of exploratory and confirmatory factor analysis that seems to be quite widely held today:

Although the next three chapters [of Factor analysis] are primarily concerned with exploratory factor analysis, the space and time given to that technique is a function of the complexity of resolving its problems, not of its theoretical importance. On the contrary, confirmatory factor analysis is the more theoretically important—and should be the much more widely used—of the two major factor analytic approaches. (p. 134)

Although Gorsuch makes his claim in emphatic terms, he provides no justification for it. He seems to assume that theory testing is more important than theory generation. However, this belief is difficult to defend, given the fact that there are many other important phases of scientific inquiry that together demand most of the researcher’s methodological time. Recall, for example, the importance to science of the detection of empirical phenomena and the generation, development, and comparative appraisal of theories. Viewed in this light, theory testing is just one, albeit important, part of scientific method (cf. Simon, 1968). Given the fact that science is as much concerned with theory generation as it is with theory testing, and acknowledging that exploratory factor analysis is a useful abductive method of theory generation, exploratory factor analysis deserves to be regarded as being just as important as confirmatory factor analysis in the theory constructor’s toolkit.

To conclude, despite the fact that exploratory factor analysis has been frequently employed in psychological research, the extant methodological literature on the method seldom acknowledges the explanatory and ontological import of the method’s inferential nature. Abduction is a major form of creative reasoning in science, and the principle of the common cause is a maxim of scientific inference with important application in research. By incorporating these two related elements into its fold, exploratory factor analysis is ensured an important, albeit circumscribed, role in the construction of explanatory theories in psychology and other sciences. By generating structural models about common causes, exploratory factor analysis can serve as a valuable precursor to confirmatory factor analysis.

Causal Modeling
During the last 50 years, social and behavioral science methodologists have developed a variety of increasingly sophisticated statistical methods to help researchers draw causal conclusions from correlational data. These causal modeling methods, as they have sometimes been called, include path analysis, confirmatory factor analysis, and full structural equation modeling.

Despite the fact that psychological researchers are increasingly employing more sophisticated causal
modeling methods in place of simple regression as a tendency—that is, an existing state of an object,
and partial correlation procedures, worries about which, if unimpeded, will produce its effect. We
both their accompanying methodology and their are, therefore, able to infer abductively the presence
misuse have been expressed (e.g., Cliff, 1983). In of the causal mechanism on the basis of knowl-
this section, I consider some philosophical aspects edge of the triggering condition and/or its presumed
of three foundational issues that have been dis- effect.
cussed in the literature on causal modeling: the Advocates of the generative theory of causa-
different ideas of causation presupposed by causal tion claim it has a number of important advan-
modeling; the suggestion that causal modeling tages over the regularity theory. One advantage of
can be viewed as a form of inference to the best the generative theory is that it is able to accom-
explanation; and the contested nature of latent modate deep structural, explanatory theories that
variables. postulate unobserved generative mechanisms. It is
argued that we need a theory of causation that
Causal Modeling and Theories of Causation affords us the conceptual space to do this, because
One central methodological issue in the debates many of the world’s causal mechanisms are not
about causal modeling has to do with the appropriateness of the nature of causation involved in various causal modeling procedures. A popular view of the matter is clearly expressed by Kenny (1979), who points out that three conditions must be satisfied for a researcher to claim that one variable is the cause of another. The first condition is that the relationship be asymmetric. The second condition is that a functional relationship be present between cause and effect. The third condition is that the causal relationship be direct or non-spurious. These three conditions are exactly those of the regularity theory of causation, which depicts the causal relationship between events in terms of their regular succession, covariation, and contiguity. The regularity theory, which is more or less Humean in character, provides an important part of the epistemic backdrop against which traditional causal modeling methods like path analysis have been understood.

However, like other parts of the standard empiricist enterprise, this theory has received strong criticism. Its claimed limitations can best be appreciated by contrasting it with a scientific realist alternative known as the generative theory of causation (Harré & Madden, 1975). Briefly stated, the generative theory depicts causation as a relationship where, under appropriate conditions, a causal mechanism produces its effect. For this to happen, the causal mechanism must connect to its effect and have the power to generate that effect, usually when stimulated by the appropriate causal condition. It is the productivity of a generative mechanism that makes it a causal mechanism, and for this to occur, there must be a naturally necessary connection that allows for the transmission of power from cause to effect. This causal power exists irrespective of whether it is currently being exercised. As such, it is properly viewed as a power that is not open to direct inspection. The latent variables of many of our causal modeling methods are thought by many to be precisely of this kind. A related advantage of the generative theory is that it is needed for enlightened social policy because, as noted in the discussion of evaluation research earlier, the possibility of ameliorative action depends on effecting change based on an understanding of how things work, and for this, knowledge of the relevant underlying causal mechanisms is often essential.

A third, and significant, advantage of the theory of generative causation is that it enables us to draw the important distinction between empirical regularities and genuine causal laws. An adequate methodology of causal modeling must be able to draw the distinction between empirical regularities and causal laws, because the ability to do so is a conceptual requirement of being able to differentiate properly direct causal relations from spurious correlations. By collapsing this distinction, empiricists, with their regularity theory of causation, are unable to articulate a satisfactory notion of spuriousness. For example, Simon's (1985) influential analysis of spurious correlation explicitly rejects the generative theory of causation and endeavors to ground the distinction between true and spurious correlations on a commitment to an empiricist view of causation. The common or intervening causes that bring about spurious correlations will typically be unobserved. However, for a statistical treatment of these variables to be consistent with the regularity theory, Simon's view of causation forces researchers to focus on altogether different variables at the manifest level. But this cavalier ontological slide wrecks our efforts to obtain worthwhile causal knowledge, because the manifest replacement variables cannot act as effective surrogates for their
haig 25
presumed common and intervening causes. They are ontologically distinct from such causes and, although as causal conditions they may trigger their latent counterparts, they do not function as major causal mechanisms that can bring about spurious correlations.

Although it can plausibly be argued that a generative view of causation is required to make sense of research that embraces hidden causal mechanisms, it does not follow, as is often supposed (e.g., Manicas, 1989; Sayer, 1992), that the regularity theory has no place in a realist conception of science. With its emphasis on the ideas of regularity, it would seem to be a suitable account of causation for claims about phenomena that take the form of empirical generalizations. Nor should it be thought that the regularity theory and the generative theory together give one a full understanding of causation in science. For example, structural equation modeling provides knowledge of causal networks. As such, it does not so much encourage the development of detailed knowledge of the nature of latent variables as it specifies the range and order of causal relations into which latent and manifest variables enter. For this type of research, a network theory of causation is needed (Thagard, 1999).

The suggestion that different conceptions of causation are relevant to causal modeling fits with a philosophy of causal pluralism, which is increasingly being recommended in contemporary methodological studies of the nature of causation (Godfrey-Smith, 2009).

[...] the best explanation being considered here will allow the researcher to say that a good explanatory theory will rate well on the explanatory criteria and, at the same time, boast a measure of predictive success. Most methodologists and scientists will agree that an explanatory theory that also makes accurate predictions will be a better theory for doing so.

Although the use of structural equation modeling in psychology often involves testing models in hypothetico-deductive fashion, it also contains a minority practice that amounts to inference to the best explanation in the sense just noted. This latter practice involves the explicit comparison of models or theories in which an assessment of their goodness-of-fit to the empirical evidence is combined with the weighting of the fit statistics in terms of parsimony indices (Kaplan, 2000). Here goodness-of-fit provides information about the empirical adequacy of the model, whereas parsimony functions as a criterion having to do with the explanatory value of the model. Both are used in judgments of model goodness. Markus, Hawes, and Thasites (2008) have recently suggested that in structural equation modeling, model fit can be combined with model parsimony, understood as explanatory power, to provide an operationalized account of inference to the best explanation. They discussed the prospects of using structural equation modeling in this way to evaluate the comparative merits of two- and three-factor models of psychopathy.
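The combination of goodness-of-fit with a parsimony penalty can be given a concrete, if much simplified, illustration. The sketch below is not the procedure of Markus, Hawes, and Thasites: it uses synthetic data, scikit-learn's FactorAnalysis in place of full structural equation modeling, and BIC as a stand-in for the parsimony-adjusted fit indices Kaplan discusses. All variable names and numbers are illustrative assumptions.

```python
# Hedged sketch: comparing a two- and a three-factor model by combining
# goodness-of-fit (log-likelihood) with a parsimony penalty (BIC).
# Synthetic data stand in for a battery of nine psychological indicators.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n, n_vars = 500, 9

# Generate data from a known three-factor structure: three indicators per factor.
latent = rng.normal(size=(n, 3))
loadings = np.zeros((n_vars, 3))
for f in range(3):
    loadings[3 * f:3 * f + 3, f] = 0.8
X = latent @ loadings.T + rng.normal(scale=0.6, size=(n, n_vars))

def bic(model, X):
    """BIC = -2 * log-likelihood + (free parameters) * log(n)."""
    ll = model.score(X) * X.shape[0]           # score() is the mean log-likelihood
    k = model.components_.size + X.shape[1]    # loadings plus unique variances
    return -2 * ll + k * np.log(X.shape[0])

results = {}
for n_factors in (2, 3):
    fa = FactorAnalysis(n_components=n_factors).fit(X)
    results[n_factors] = bic(fa, X)

best = min(results, key=results.get)   # lower BIC = better fit-per-parameter
print({k: round(v, 1) for k, v in results.items()}, "-> prefer", best, "factors")
```

The point of the sketch is only that judgments of model goodness weigh empirical adequacy (the likelihood term) against explanatory parsimony (the penalty term), rather than relying on fit alone.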
26 the philosophy of quantitative methods
descriptions of such factors are properly considered to serve the instrumentalist function of economically redescribing the original correlational data. This interpretation of exploratory factor analysis has been quite influential in psychometrics (Block, 1976) and has been taught to generations of psychology students through textbooks on psychological testing (e.g., Anastasi & Urbina, 1997). Fictionalism seems to be the preferred option of many factor analytic researchers in the domains of personality and intelligence.

However, fictionalism is a difficult position to defend, and it seems to fail in factor analysis for the reason it fails in science generally: it inappropriately grants ontological significance to a sharp distinction between observation and theory to buttress the claim that only observable, or manifest, entities exist, when observability is really a matter of degree. Fictionalists argue that because we do not have perceptual experience of theoretical entities, we do not have grounds for saying they exist; we only have grounds for claiming that observable entities exist. But realist philosophers of science (e.g., Maxwell, 1962) assert in reply that fictionalists cannot maintain a sharp distinction between what is observable and what is unobservable. What cannot be seen directly by the unaided eye might be observable through a magnifying glass, and what cannot be observed through a magnifying glass might be observed through a microscope. Importantly, how we draw the observable/unobservable distinction at a particular time is a function of prior knowledge, our physiological make-up, and available instrumentation. Thus, the distinction provides no basis for deciding what entities do, and do not, exist. To assert that factors are theoretical entities is not to regard them as having a special existence; rather, it is to acknowledge that we come to know them indirectly in terms of their correlated effects. On this realist interpretation, the factors are regarded as latent variables that underlie, and give rise to, the correlated manifest variables. Borsboom (2005) has made a strong case for adopting a realist attitude to latent variables more generally by combining an argument similar to Maxwell's with other foundational considerations in philosophy of science and psychometrics.

This general argument against fictionalism simultaneously supports the doctrine of realism in science, but it does not by itself establish that the factors of exploratory factor analysis should be given a realist interpretation. Whether this should happen depends also on whether exploratory factor analysis can facilitate the drawing of sound abductive inference about the existence of latent factors.

This highly selective consideration of the philosophy of causal modeling points to three conclusions: (1) that causation in causal modeling manifests itself in a number of different ways; (2) that causal modeling can transcend the limitations of the hypothetico-deductive method and adopt the practice of inference to the best explanation; and (3) that latent variables deserve to be given a realist interpretation as genuine theoretical entities.

Conclusion

The philosophy of research methods is an aspect of research methodology that receives limited attention in behavioral science education. The majority of students and research practitioners in the behavioral sciences obtain the bulk of their knowledge of research methods from textbooks. However, a casual examination of these texts shows that they tend to pay little, if any, serious regard to the philosophy of science and its bearing on the research process. As Kuhn pointed out nearly 50 years ago (Kuhn, 1962; 1996), textbooks play a major role in dogmatically initiating students into the routine practices of normal science. Serious attention to the philosophy of research methods would go a considerable way toward overcoming this uncritical practice. As contemporary philosophy of science increasingly focuses on the contextual use of research methods in the various sciences, it is to be hoped that research methodologists and other behavioral scientists will avail themselves of the genuine methodological insights that it contains.

Future Directions

In this final section of the chapter, I suggest a number of directions that future work in the philosophy of quantitative methods might take. The first three suggestions are briefly discussed; the remaining suggestions are simply listed.

Understand Quantitative Methods Through Methodology

A proper understanding of research methods cannot be had without an appreciation of their accompanying methodology (see Proctor & Capaldi, 2001). Methodology is the interdisciplinary field that studies methods. It draws from the disciplines
of statistics, philosophy of science, and cognitive science, among others. And yet, the professional literature of these disciplines does not figure in the content of research methods courses. Further, it is important to appreciate that methodology has descriptive, critical, and advisory dimensions. Again, the typical methods curriculum does not systematically deal with research methods with these considerations in mind. It is not surprising, therefore, that psychologists' understanding of research methods often leaves a lot to be desired.

A realist-oriented methods curriculum would profitably consider methods in the light of the primary characteristics of realist methodology outlined early in the chapter. To mention just three of these: Greater prominence would be given to generative methodology, in which reasoning well to hypotheses and theories would figure in the assessment of those knowledge claims. The sound abductive reasoning to factorial hypotheses using exploratory factor analysis is perhaps psychology's best example of generative justification. Similarly, the coherentist justification of explanatory theories using methods of inference to the best explanation would feature much more prominently than it does at present. Finally, in adopting methods that are apt for us as knowing subjects, heuristic procedures would receive much more explicit attention in the methods curriculum as realistic guides to our thinking.

The British Psychological Society now takes conceptual and historical issues as one of psychology's seven core areas. Teaching methods through methodology is the appropriate way to employ this core area in the teaching of research methods. The American Psychological Association and the Association for Psychological Science would do well to follow suit, for it is only by making considered use of methodology that a genuine education in research methods can be achieved.

Rethink the Quantitative/Qualitative Distinction

A major feature of the methodological landscape has been the discussion of the distinction between quantitative and qualitative methods. Although perhaps necessary in establishing a legitimate role for the use of qualitative methods in research, the distinction is now the subject of critical scrutiny. The way the original distinction was drawn has been questioned (e.g., Michell, 2004), and the combination of qualitative and quantitative methods in mixed methods strategies has been strongly promoted in recent times.

However, the quantitative/qualitative debate has not considered the possibility that most methods have both quantitative and qualitative dimensions. In many cases, we are likely to gain a better understanding of the research methods we use not by viewing them as either qualitative or quantitative but by regarding them as having both qualitative and quantitative dimensions. Three examples are mentioned here. First, grounded theory (e.g., Strauss, 1987), the most prominent extant qualitative methodology, is in good part the product of a translation from some sociological quantitative methods of the 1950s. Moreover, there is nothing in principle to stop researchers using quantitative methods within the fold of grounded theory. Exploratory factor analysis, for example, could sometimes be used for generating grounded theory.

Second, although exploratory factor analysis itself is standardly characterized as a multivariate statistical method, the inferential heart of the method is the important scientific heuristic known as the principle of the common cause. Importantly, this principle, which guides the factor analytic inference from correlations to underlying common factors, can be effectively formulated in qualitative terms.

Finally, the theory of explanatory coherence (Thagard, 1992), which evaluates theories in terms of their explanatory power, is a qualitative method of theory appraisal, but it is implemented by a computer program that is part of the method proper, and that has a connectionist architecture that is mathematically constrained.

It is recommended, then, that methodologists and researchers seriously entertain the prospect that individual methods are likely to have a mix of qualitative and quantitative features—that is, that individual methods are themselves mixed methods.

Evaluate the Philosophical Critiques of Quantitative Research Methods

Most of the occasional references to scientific realism in psychology are to Bhaskar's (1975; 1979) critical realism (e.g., Manicas & Secord, 1983), a philosophy that has had considerable impact on the social sciences (e.g., Sayer, 1992). Interestingly, critical realists have expressed strong reservations about the use of statistical methods in quantitative research. Bhaskar himself goes so far as to say that causal models should be "totally discarded." There are various reasons for this attitude (see Pratschke, 2003), but perhaps the most fundamental one is the claim that statistical models themselves do not
provide researchers with the substantive models that are sought in causal modeling research.

However, this claim rests on a mistaken conception of the relation between statistical models and substantive theoretical models. It is hard to deny that consideration of much more than the statistical machinery of causal modeling is needed to ground substantive conclusions. Indeed, it is difficult to see how any statistical method could be properly understood and used in research without appeal to suprastatistical matters. Consider, for example, the oft-made claim that factors of exploratory factor analysis are statistical entities and that the method cannot, therefore, be used to favor one substantive factorial theory of intelligence over another (e.g., Gould, 1996). This claim is false because factor analysts typically transcend the statistical level of the method and make use of the relevant part of Latent Variable Theory to generate plausible hypotheses about the existence of latent variables. Of central relevance here is the fact that exploratory factor analysis exploits the so-called "principle of the common cause" to sanction inferences to the initial plausibility of interpreted latent variables. We saw earlier that inferences from manifest to latent variables made in accordance with this principle are abductive, or explanatory, in nature and are made by factor analysts themselves. Although the statistical machinery of multiple regression and partial correlation theory is obviously an important part of exploratory factor analysis, its primary function is to facilitate researchers' suprastatistical inferences to latent factors.

It is important to appreciate that the interpretive dimension of causal modeling methods is a proper part of their methodology. There is nothing in critical realism, or other variants of scientific realism, that prevents one from taking such an outlook on causal modeling. Indeed, scientific realism comports well with causal modeling methods that countenance latent variables.

Additional Directions

Space considerations prevent discussion of additional future directions in the philosophy of quantitative methods. However, the following points deserve to be on an agenda for future study.

• Develop a modern interdisciplinary conception of research methodology.
• Give more attention to investigative strategies in psychological research.
• Take major philosophical theories of scientific method seriously.
• Apply insights from the "new experimentalism" in the philosophy of science to the understanding of quantitative research methods.
• Develop the philosophical foundations of theory construction methods in the behavioral sciences.
• Assess the implications of different theories of causality for research methods.
• Examine the philosophical foundations of "new" research methods such as data mining, structural equation modeling, and functional neuroimaging.

Author note
Brian D. Haig, Department of Psychology, University of Canterbury, Christchurch, New Zealand. Email: brian.haig@canterbury.ac.nz

References
Acree, M. C. (1979). Theories of statistical inference in psychological research: A historico-critical study. Dissertation Abstracts International, 39, 5073B. (UMI No. 7907000).
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99, 388–399.
Behrens, J. T., & Yu, C-H. (2003). Exploratory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Vol. 2 (pp. 33–64). New York: Wiley.
Berger, J. O., & Selke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence (with comments). Journal of the American Statistical Association, 82, 112–139.
Bhaskar, R. (1975). A realist philosophy of science. Brighton, England: Harvester.
Bhaskar, R. (1979). The possibility of naturalism. Brighton, England: Harvester.
Block, N. J. (1976). Fictionalism, functionalism, and factor analysis. In R. S. Cohen, C. A. Hooker, & A. C. Michalos (Eds.), Boston Studies in the Philosophy of Science, Vol. 32 (pp. 127–141). Dordrecht, The Netherlands: Reidel.
Bolles, R. C. (1962). The difference between statistical and scientific hypotheses. Psychological Reports, 11, 629–645.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge, England: Cambridge University Press.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research and Perspectives, 6, 25–53.
Bunge, M. (2008). Bayesianism: Science or pseudoscience? International Review of Victimology, 15, 165–178.
Carruthers, P. (2002). The roots of scientific reasoning: Infancy, modularity, and the art of tracking. In P. Carruthers, S. Stich, & M. Siegal (Eds.), The cognitive basis of science (pp. 73–95). Cambridge, England: Cambridge University Press.
Chow, S. L. (1996). Statistical significance: Rationale, validity, and utility. London, England: Sage.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cook, T. D., Cooper, H., Cordray, D. S., Hartmann, H., Hedges, L. V., Light, R. J., et al. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Duhem, P. (1954). The aim and structure of physical theory (2nd ed., P. P. Weiner, Trans.). Princeton, NJ: Princeton University Press.
Earman, J. (1992). Bayes or bust? A critical examination of Bayesian confirmation theory. Cambridge, MA: MIT Press.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Ehrenberg, A. S. C., & Bound, J. A. (1993). Predictability and prediction. Journal of the Royal Statistical Society, Part 2, 156, 167–206.
Fidell, L. S., & Tabachnick, B. G. (2003). Preparatory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Vol. 2 (pp. 115–121). New York: Wiley.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Giere, R. N. (1972). The significance test controversy. British Journal for the Philosophy of Science, 23, 170–180.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (pp. 311–339). Hillsdale, NJ: Lawrence Erlbaum.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Glass, G. V. (1972). The wisdom of scientific inquiry on education. Journal of Research in Science Teaching, 9, 3–18.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Glass, G. V., & Kleigl, R. M. (1983). An apology for research integration in the study of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 28–41.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Glymour, C. N. (1980). Theory and evidence. Princeton, NJ: Princeton University Press.
Godfrey-Smith, P. (2009). Causal pluralism. In H. Beebee, C. Hitchcock, & P. Menzies (Eds.), The Oxford handbook of causation (pp. 326–337). Oxford, England: Oxford University Press.
Good, I. J. (1983). The philosophy of exploratory data analysis. Philosophy of Science, 50, 283–295.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485–495.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gottfredson, G. D. (1984). A theory-ridden approach to program evaluation. American Psychologist, 39, 1101–1112.
Gould, S. J. (1996). The mismeasure of man (2nd ed.). New York: Norton.
Greenwood, J. D. (1992). Realism, empiricism, and social constructionism. Theory and Psychology, 2, 131–151.
Haig, B. D. (1987). Scientific problems and the conduct of research. Educational Philosophy and Theory, 19, 22–32.
Haig, B. D. (2005a). Exploratory factor analysis, theory generation, and scientific method. Multivariate Behavioral Research, 40, 303–329.
Haig, B. D. (2005b). An abductive theory of scientific method. Psychological Methods, 10, 371–388.
Haig, B. D. (2009). Inference to the best explanation: A neglected approach to theory appraisal in psychology. American Journal of Psychology, 122, 219–234.
Harré, R., & Madden, E. H. (1975). Causal powers. Oxford, England: Blackwell.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Hooker, C. A. (1987). A realistic theory of science. New York: State University of New York Press.
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach (3rd ed.). La Salle, IL: Open Court.
Hubbard, R., & Lindsay, R. M. (2008). Why p values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18, 69–88.
Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology—and its future prospects. Educational and Psychological Measurement, 60, 661–681.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park, CA: Sage.
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46, 311–349.
Jeffreys, H. (1939). Theory of probability. Oxford, England: Clarendon.
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Thousand Oaks, CA: Sage.
Kenny, D. (1979). Correlation and causation. New York: Wiley.
Kim, J-O., & Mueller, C. W. (1978). Introduction to factor analysis. Beverly Hills, CA: Sage.
Kuhn, T. S. (1996). The structure of scientific revolutions (3rd ed.). Chicago, IL: University of Chicago Press (originally published, 1962).
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91–195). Cambridge, England: Cambridge University Press.
Laudan, L. (1981). Science and hypothesis. Dordrecht, The Netherlands: Reidel.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242–1249.
Lipton, P. (2004). Inference to the best explanation (2nd ed.). London, England: Routledge.
Manicas, P. T. (1989). Explanation and quantification. In B. Glassner & J. D. Moreno (Eds.), The qualitative-quantitative distinction in the social sciences (pp. 179–205). Dordrecht, The Netherlands: Kluwer.
Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399–413.
Markus, K., Hawes, S. S., & Thasites, R. (2008). Abductive inference to psychological variables: Steiger's question and best explanations in psychopathy. Journal of Clinical Psychology, 64, 1069–1088.
Maxwell, G. (1962). The ontological status of theoretical entities. In H. Feigl & G. Maxwell (Eds.), Minnesota Studies in the Philosophy of Science, Vol. 3 (pp. 3–28). Minneapolis, MN: University of Minnesota Press.
McDonald, R. P., & Mulaik, S. A. (1979). Determinacy of common factors: A nontechnical review. Psychological Bulletin, 86, 297–306.
McGrew, T. (2003). Confirmation, heuristics, and explanatory reasoning. British Journal for the Philosophy of Science, 54, 553–567.
McGuire, W. J. (1997). Creative hypothesis generating in psychology: Some useful heuristics. Annual Review of Psychology, 48, 1–30.
McMullin, E. (1995). Underdetermination. The Journal of Medicine and Philosophy, 20, 233–252.
Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108–141, 173–180.
Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Lawrence Erlbaum.
Michell, J. (2004). The place of qualitative research in psychology. Qualitative Research in Psychology, 1, 307–319.
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy. Chicago: Aldine.
Mulaik, S. A. (1985). Exploratory statistics and empiricism. Philosophy of Science, 52, 410–430.
Mulaik, S. A. (1987). A brief history of the philosophical foundations of exploratory factor analysis. Multivariate Behavioral Research, 22, 267–305.
Mulaik, S. A., & McDonald, R. P. (1978). The effect of additional variables on factor indeterminacy in models with a single common factor. Psychometrika, 43, 177–192.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A, 231, 289–337.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Nickles, T. (1981). What is a problem that we might solve it? Synthese, 47, 85–118.
Nickles, T. (1987). Twixt method and madness. In N. J. Nersessian (Ed.), The process of science (pp. 41–67). Dordrecht, The Netherlands: Martinus Nijhoff.
Pratschke, J. (2003). Realistic models? Critical realism and statistical models in the social sciences. Philosophica, 71, 13–38.
Proctor, R. W., & Capaldi, E. J. (2001). Empirical evaluation and justification of methodologies in psychological science. Psychological Bulletin, 127, 759–772.
Psillos, S. (2004). Inference to the best explanation and Bayesianism. In F. Stadler (Ed.), Induction and deduction in the sciences (pp. 83–91). Dordrecht, The Netherlands: Kluwer.
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. American Statistician, 40, 313–315.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. New York: Chapman & Hall.
Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335–391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific explanation and the causal structure of the world. Princeton, NJ: Princeton University Press.
Sayer, A. (1992). Methods in social science: A realist approach (2nd ed.). London, England: Routledge.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173–1181.
Schmidt, F. L. (1993). Meta-analysis and cumulative knowledge. Contemporary Psychology, 38, 1163–1165.
Simon, H. (1985). Spurious correlation: A causal interpretation. In H. M. Blalock (Ed.), Causal models in the social sciences (2nd ed., pp. 7–21). New York: Aldine.
Simon, H. A. (1968). On judging the plausibility of theories. In B. van Rootselaar & J. F. Staal (Eds.), Logic, methodology, and philosophy of science, Vol. 3 (pp. 439–459). Amsterdam, The Netherlands: North-Holland.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The benefits of psychotherapy. Baltimore, MD: Johns Hopkins University Press.
Sober, E. (1988). The principle of the common cause. In J. H. Fetzer (Ed.), Probability and causality (pp. 211–229). Dordrecht, The Netherlands: Reidel.
Sober, E. (2008). Evidence and evolution: The logic behind the science. Cambridge, England: Cambridge University Press.
Sohn, D. (1996). Meta-analysis and science. Theory and Psychology, 6, 229–246.
Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 14, 211–225.
Strauss, A. L. (1987). Qualitative analysis for social scientists. New York: Cambridge University Press.
Thagard, P. (1992). Conceptual revolutions. Princeton, NJ: Princeton University Press.
Thagard, P. (1999). How scientists explain disease. Princeton, NJ: Princeton University Press.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
Trout, J. D. (1998). Measuring the intentional world: Realism, naturalism, and quantitative methods in the behavioral sciences. New York, NY: Oxford University Press.
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33, 1–67.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison Wesley.
Tukey, J. W. (1980). We need both exploratory and confirmatory. American Statistician, 34, 23–25.
Waller, N. G. (2004). The fallacy of the null hypothesis in soft psychology. Applied and Preventive Psychology, 11, 83–86.
Weisberg, J. (2009). Locating IBE in the Bayesian framework. Synthese, 167, 125–143.
Whitt, L. A. (1992). Indices of theory promise. Philosophy of Science, 59, 612–634.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Woodward, J. (1989). Data and phenomena. Synthese, 79, 393–472.
Yu, C-H. (2006). Philosophical foundations of quantitative research methodology. Lanham, MD: University Press of America.
CHAPTER 3
Quantitative Methods and Ethics
Abstract
The purpose of this chapter is to provide a context for thinking about the role of ethics in quantitative
methodology. We begin by reviewing the sweep of events that led to the creation and expansion of
legal and professional rules for the protection of research subjects and society against unethical
research. The risk–benefit approach has served as an instrument of prior control by institutional
review boards. After discussing the nature of that approach, we sketch a model of the costs and
utilities of the “doing” and “not doing” of research. We illustrate some implications of the expanded
model for particular data analytic and reporting practices. We then outline a 5 × 5 matrix of general
ethical standards crossed with general data analytic and reporting standards to encourage thinking
about opportunities to address quantitative methodological problems in ways that may have mutual
ethical and substantive rewards. Finally, we discuss such an opportunity in the context of problems
associated with risk statistics that tend to exaggerate the absolute effects of therapeutic interventions
in randomized trials.
Key Words: Accountability, American Psychological Association (APA) Ethics Code, Belmont Report,
ethical principles, health statistics, institutional review board (IRB), moral dilemmas, Nuernberg
(Nuremberg) Code, quantitative methodology, risk–benefit assessment, statistical illiteracy,
transparency, volunteer bias
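The abstract's closing point, that relative risk statistics can exaggerate the practical import of therapeutic effects, is easy to illustrate numerically. The sketch below uses invented trial counts (not data from any study discussed in this chapter) to contrast relative risk with the absolute risk reduction and the number needed to treat:

```python
# Hypothetical randomized trial; all event counts are invented for illustration.
def risk_statistics(events_treated, n_treated, events_control, n_control):
    """Return (relative risk, absolute risk reduction, number needed to treat)."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    rr = risk_t / risk_c   # relative risk: ratio of the two risks
    arr = risk_c - risk_t  # absolute risk reduction: difference of the two risks
    nnt = 1 / arr          # number needed to treat to avert one event
    return rr, arr, nnt

# A rare outcome: 1 event per 1,000 treated patients vs. 2 per 1,000 controls.
rr, arr, nnt = risk_statistics(1, 1000, 2, 1000)
print(f"Relative risk: {rr:.2f}")             # 0.50
print(f"Absolute risk reduction: {arr:.4f}")  # 0.0010
print(f"Number needed to treat: {nnt:.0f}")   # 1000
```

A "50% reduction in risk" (RR = 0.50) sounds dramatic, yet here the absolute reduction is one event per 1,000 patients, and 1,000 patients would have to be treated to avert a single event. The absolute figures, not the relative one, are what a patient or health-care provider usually needs to weigh.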
matter, which researchers have trolled for patterns or relationships that can be quantified by the use of simple summary statistics (e.g., Bordia & Rosnow, 1995) or coded and visually mapped out using social network analysis to highlight links and nodes in the observed relationships (e.g., Kossinets & Watts, 2006; see also Wasserman & Faust, 1994). Whether blogs and online discussion groups' data are treated quantitatively or qualitatively, their use may raise ethical questions regarding the invasion of privacy. The fact that bloggers and participants in online discussion groups are typically fully aware that their communications are quite public minimizes the risk of invasion of privacy.

The term ethics was derived from the Greek ethos, meaning "character" or "disposition." We use the term here to refer to the dos and don'ts of codified and/or culturally ingrained rules by which morally "right" and "wrong" conduct can be differentiated. Conformity to such rules is usually taken to mean morality, and our human ability to make ethical judgments is sometimes described as a moral sense (a tradition that apparently goes back to David Hume's A Treatise of Human Nature in the eighteenth century). Philosophers and theologians have frequently disagreed over the origin of the moral sense, but on intuitive grounds it would seem that morality is subject to societal sensitivities, group values, and social pressures. It is not surprising that researchers have documented systematic biases in ethical judgments. For example, in a study by Kimmel (1991), psychologists were asked to make ethical judgments about hypothetical research cases. Kimmel reported that those psychologists who were more (as compared to less) approving in their ethical judgments were more often men; had held an advanced degree for a longer period of time; had received the advanced degree in an area such as experimental, developmental, or social psychology rather than counseling, school, or community psychology; and were employed in a research-oriented context as opposed to a service-oriented context. Citing this work of Kimmel's (1991), an American Psychological Association (APA) committee raised the possibility that inconsistent implementation of ethical standards by review boards might result not only from the expanded role of review boards but also from the composition of particular boards (Rosnow, Rotheram-Borus, Ceci, Blanck, & Koocher, 1993). Assuming that morality is also predicated on people's abilities to figure out the meaning of other people's actions and underlying intentions, it might be noted that there is also empirical evidence of (1) individual differences in this ability (described as interpersonal acumen) and (2) a hierarchy of intention–action combinations ranging from the least to most cognitively taxing (Rosnow, Skleder, Jaeger, & Rind, 1994).

Societal sensitivities, group values, and situational pressures are subject to change in the face of significant events. On the other hand, some moral values seem to be relatively enduring and universal, such as the golden rule, which is frequently expressed as "Do unto others as you would have them do unto you." In the framework of quantitative methods and ethics, a categorical imperative might be phrased as "Thou shalt not lie with statistics." Still, Huff, in his book, How to Lie with Statistics, first cautioned the public in 1954 that the reporting of statistical data was rife with "bungling and chicanery" (Huff, 1982, p. 6). The progress of science depends on the good faith that scientists have in the integrity of one another's work and the unbiased communication of findings and conclusions. Lying with statistics erodes the credibility of the scientific enterprise, and it can also present an imminent danger to the general public. "Lying with statistics" can refer to a number of more specific practices: for example, reporting only the data that agree with the researcher's bias, omitting any data not supporting the researcher's bias, and, most serious of all, fabricating the results of the research. For example, there was a case reported in 2009 in which an anesthesiologist fabricated the statistical data that he had published in 21 journal articles purporting to give the results of clinical trials of a pain medicine marketed by the company that funded much of the doctor's research (Harris, 2009). Another case, around the same time, involved a medical researcher whose accounts of a blood test for diagnosing prostate cancer had generated considerable excitement in the medical community, but who was now being sued for scientific fraud by his industry sponsor (Kaiser, 2009). As the detection of lying with statistics is often difficult in the normal course of events, there have been calls for the public sharing of raw data so that, as one scientist put it, "Anyone with the skills can conduct their own analyses, draw their own conclusions, and share those conclusions with others" (Allison, 2009, p. 522). That would probably help to reduce some of the problems of biased data analysis, but it would not help much if the shared data had been fabricated to begin with.
rosnow, rosenthal 33
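The network-analytic approach mentioned earlier, mapping the links and nodes in observed relationships, can be sketched in a few lines. The reply data below are invented, and the tally is the simplest possible network statistic, a degree count per participant (real social network analysis, as in Wasserman & Faust, 1994, goes much further):

```python
from collections import Counter

# Invented example: who replied to whom in an online discussion thread.
replies = [
    ("ann", "ben"), ("ben", "ann"), ("cal", "ann"),
    ("dee", "ben"), ("ann", "cal"), ("cal", "ben"),
]

# Treat each reply as an undirected link and count each node's degree,
# i.e., how many links touch each participant.
degree = Counter()
for sender, receiver in replies:
    degree[sender] += 1
    degree[receiver] += 1

print(degree.most_common())  # most-connected participants listed first
```

Even a toy tally like this makes the ethical point concrete: once postings are coded as links between named (or re-identifiable) nodes, the resulting map describes individuals' social ties, which is exactly the kind of derived information that can raise privacy questions.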
In the following section, we review the sweep of events that led to the development and growth of restraints for the protection of human subjects and society against unethical research.1 A thread running throughout the discussion is the progression of the APA's code of conduct for psychological researchers who work with human subjects. We assume that many readers of this Handbook will have had a primary or consulting background in some area of psychology or a related research area. The development of the APA principles gives us a glimpse of the specific impact of legal regulations and societal sensitivities in an area in which human research has been constantly expanding into new contexts, including "field settings and biomedical contexts where research priorities are being integrated with the priorities and interests of nonresearch institutions, community leaders, and diverse populations" (Sales & Folkman, 2000, p. ix). We then depict an idealized risk–benefit approach that review boards have used as an instrument of prior control of research, and we also describe an expanded model focused on the costs and utilities of "doing" and "not doing" research. The model can also be understood in terms of the cost–utility of adopting versus not adopting particular data analytic and reporting practices. We then outline a matrix of general ethical standards crossed with general data analytic and reporting standards as (1) a reminder of the basic distinction between ethical and technical mandates and (2) a framework for thinking about promising opportunities for ethical and substantive rewards in quantitative methodology (cf. Blanck, Bellack, Rosnow, Rotheram-Borus, & Schooler, 1992; Rosenthal, 1994; Rosnow, 1997). We discuss such an opportunity in the context of the way in which a fixation on relative risk (RR) in large sample randomized trials of therapeutic interventions can lead to misconceptions about the practical meaning to patients and health-care providers of the particular intervention tested.

The Shaping of Principles to Satisfy Ethical and Legal Standards

If it can be said that a single historical event in modern times is perhaps most responsible for initially galvanizing changes in the moral landscape of science, then it would be World War II. On December 9, 1946 (the year after the surrender of Germany on May 8, 1945 and the surrender of Japan on August 14, 1945), criminal proceedings against Nazi physicians and administrators who had participated in war crimes and crimes against humanity were presented before a military tribunal in Nuernberg, Germany. For allied atomic scientists, Hiroshima had been an epiphany that vaporized the old iconic image of a morally neutral science. For researchers who work with human participants, the backdrop to the formation of ethical and legal principles to protect the rights and welfare of all research participants was the shocking revelations of the war crimes documented in meticulous detail at the Nuernberg Military Tribunal. Beginning with the German invasion of Poland at the outbreak of World War II, Jews and other ethnic minority inmates of concentration camps had been subjected to sadistic tortures and other barbarities in "medical experiments" by Nazi physicians in the name of science. As methodically described in the multivolume report of the trials, "in every one of the experiments the subjects experienced extreme pain or torture, and in most of them they suffered permanent injury, mutilation, or death" (Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, p. 181). Table 3.1 reprints the principles of the Nuernberg Code, which have resonated to varying degrees in all ensuing codes for biomedical research with human participants as well as having had a generative influence on the development of principles for the conduct of behavioral and social research.

We pick up the story again in the 1960s in the United States, a period punctuated by the shocking assassinations of President John F. Kennedy in 1963 and then of Dr. Martin Luther King, Jr., and Senator Robert F. Kennedy in 1968. The 1960s were also the beginning of the end of what Pattullo (1982) called "the hitherto sacrosanct status" of the human sciences, which moved "into an era of uncommonly active concern for the rights and welfare of segments of the population that had traditionally been neglected or exploited" (p. 375). One highly publicized case in 1963 involved a noted cancer researcher who had injected live cancer cells into elderly, noncancerous patients, "many of whom were not competent to give free, informed consent" (Pattullo, p. 375). In 1966, the U.S. Surgeon General issued a set of regulations governing the use of subjects by researchers whose work was funded by the National Institutes of Health (NIH). Most NIH grants funded biomedical research, but there was also NIH support for research in the behavioral and social sciences. In 1969, following the exposure of further instances in which the welfare of subjects had been ignored or endangered in biomedical research (cf. Beecher, 1966, 1970; Katz,
34 quantitative methods and ethics
Table 3.1. The Nuernberg Principles of 1946–1949 for Permissible Medical Experiments∗
1. The voluntary consent of the human subject is absolutely essential.
This means that the person involved should have legal capacity to give consent; should be so situated as to be able to
exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching,
or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the
elements of the subject matter involved as to enable him to make an understanding and enlightened decision. This
latter element requires that before the acceptance of an affirmative decision by the experimental subject there should
be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is
to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person
which may possibly come from his participation in the experiment.
The duty and responsibility for ascertaining the quality of the consent rests upon each individual who initiates,
directs or engages in the experiment. It is a personal duty and responsibility which may not be delegated to another
with impunity.
2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or
means of study, and not random and unnecessary in nature.
3. The experiment should be so designed and based on the results of animal experimentation and a knowledge of the
natural history of the disease or other problem under study that the anticipated results will justify the performance
of the experiment.
4. The experiment should be so conducted as to avoid all unnecessary physical and mental suffering and injury.
5. No experiment should be conducted where there is an a priori reason to believe that death or disabling injury will
occur; except, perhaps, in those experiments where the experimental physicians also serve as subjects.
6. The degree of risk to be taken should never exceed that determined by the humanitarian importance of the problem
to be solved by the experiment.
7. Proper preparations should be made and adequate facilities provided to protect the experimental subject against even
remote possibilities of injury, disability, or death.
8. The experiment should be conducted only by scientifically qualified persons. The highest degree of skill and care
should be required through all stages of the experiment of those who conduct or engage in the experiment.
9. During the course of the experiment the human subject should be at liberty to bring the experiment to an end if he
has reached the physical or mental state where continuation of the experiment seems to him to be impossible.
10. During the course of the experiment the scientist in charge must be prepared to terminate the experiment at any
stage, if he has probable cause to believe, in the exercise of the good faith, superior skill and careful judgment required
of him that a continuation of the experiment is likely to result in injury, disability, or death to the experimental
subject.
∗ Reprinted from pp. 181–182 in Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, October
1946–April 1949, Vol. II. Washington, DC: U.S. Government Printing Office.
1972), the Surgeon General extended the earlier safeguards to all human research. In a notorious case (not made public until 1972), a study conducted by the U.S. Public Health Service (USPHS) simply followed the course of syphilis in more than 400 low-income African-American men residing in Tuskegee, Alabama, from 1932 to 1972 (Jones, 1993). Recruited from churches and clinics with the promise of free medical examinations and free health care, the men who were subjects in this study were never informed they had syphilis but only told they had "bad blood." They also were not offered penicillin when it was discovered in 1943 and became widely available in the 1950s, and they were warned not to seek treatment elsewhere or they would be dropped from the study. The investigators went so far as to have local doctors promise not to treat the men in the study with antibiotics (Stryker, 1997). As the disease progressed in its predictable course without any treatment, the men experienced damage to
their skeletal, cardiovascular, and central nervous systems and, in some cases, death. In 1972, the appalling details were finally made public by a lawyer who had been an epidemiologist for the USPHS, and the study was halted (Fairchild & Bayer, 1999). The following year, the Senate Health Subcommittee (chaired by Senator Edward Kennedy) aired the issue of scientific misconduct in public hearings.

The early 1960s were also a period when emotions about invasions of privacy were running high in the United States after a rash of reports of domestic wiretapping and other clandestine activities by federal agencies. In the field of psychology, the morality of the use of deception was being debated. As early as the 1950s, there had been concerned statements issued about the use of deception in social psychological experiments (Vinacke, 1954). The spark that lit a fuse in the 1960s in the field of psychology was the publication of Stanley Milgram's studies on obedience to authority, in which he had used an elaborate deception and found that a majority of ordinary research subjects were willing to administer an allegedly dangerous level of shock to another person when "ordered" to do so by a person in authority, although no shock was actually administered (cf. Blass, 2004; Milgram, 1963, 1975). Toward the end of the 1960s, there were impassioned pleas by leading psychologists for the ethical codification of practices commonly used in psychological research (Kelman, 1968; Smith, 1969). As there were new methodological considerations and federal regulations since the APA had formulated a professional code of ethics in 1953, a task force was appointed to draft a set of ethical principles for research with human subjects. Table 3.2 shows the final 10 principles adopted by the APA's Council of Representatives in 1972, which were elucidated in a booklet that was issued the following year, Ethical Principles in the Conduct of Research with Human Participants (APA, 1973). An international survey conducted 1 year later found there were by then two dozen codes of ethics that had been either adopted or were under review by professional organizations of social scientists (Reynold, 1975). Although violations of such professional codes were supported by penalties such as loss of membership in the organization, the problem was that many researchers engaged in productive, rewarding careers did not belong to these professional organizations.

By the end of the 1970s, the pendulum had swung again, as accountability had become the watchword of the decade (National Commission on Research, 1980). In 1974, the guidelines provided by the Department of Health, Education, and Welfare (DHEW) 3 years earlier were codified as government regulations by the National Research Act of July 12, 1974 (Pub. L. 93–348). Among the requirements instituted by the government regulations was that institutions that received federal funding establish an institutional review board (IRB) for the purpose of making prior assessments of the possible risks and benefits of proposed research.2 This federal act also created the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. Following hearings that were held over a 3-year period, the document called "The Belmont Report" was issued in April, 1979 (available online and also reprinted in Sales & Folkman, 2000). Unlike other reports of the Commission, the Belmont Report did not provide a list of specific recommendations for administrative action by the DHEW, but the Belmont Report recommended that the report be adopted in its entirety as a statement of DHEW policy. In the preamble, the report mentioned the standards set by the Nuernberg ("Nuremberg") Code as the prototype of many later codes consisting of rules, some general and others specific, to guide researchers and assure that research involving human participants would be carried out in an ethical manner. Noting that the rules were often inadequate to cover complex situations, that they were often difficult to apply or interpret, and that they often came into conflict with one another, the National Commission had decided to issue broad ethical principles to provide a basis on which specific rules could then be formulated, criticized, and interpreted. As we track the development of the APA principles in this discussion, we will see that there has been a similar progression, and later we will emphasize some broad ethical principles when we discuss the interface of ethical and technical standards in quantitative methodology. For now, however, it can be noted that the Belmont Report proposed that (1) respect for persons, (2) beneficence, and (3) justice provide the foundation for research ethics. The report also proposed norms for scientific conduct in six major areas: (1) the use of valid research designs, (2) the competence of researchers, (3) the identification of risk–benefit consequences, (4) the selection of research participants, (5) the importance of obtaining informed voluntary consent, and (6) compensation for injury.3

In 1982, the earlier APA code was updated, and a new version of Ethical Principles in the Conduct of Research with Human Participants was published
Table 3.2. The Ethical Principles Adopted in December, 1972, by the Council of Representatives of
the American Psychological Association∗
The decision to undertake research rests upon a considered judgment by the individual psychologist
about how best to contribute to psychological science and to human welfare. The responsible psychologist
weighs alternative directions in which personal energies and resources might be invested. Having made
the decision to conduct research, psychologists must carry out their investigation with respect for the
people who participate and with concern for their dignity and welfare. The Principles that follow make
explicit the investigator’s ethical responsibilities toward participants over the course of research, from the
initial decision to pursue a study to the steps necessary to protect the confidentiality of research data.
These Principles should be interpreted in terms of the context provided in the complete document offered
as a supplement to these Principles.
1. In planning a study the investigator has the personal responsibility to make a careful evaluation of
its ethical acceptability, taking into account these Principles for research with human beings. To
the extent that this appraisal, weighing of scientific and humane values, suggests a deviation from
any Principle, the investigator incurs an increasingly serious obligation to seek ethical advice and to
observe more stringent safeguards to protect the rights of the human research participants.
2. Responsibility for the establishment and maintenance of acceptable ethical practice in research always
remains with the individual investigator. The investigator is also responsible for the ethical treatment
of research participants by collaborators, assistants, students, and employees, all of whom, however,
incur parallel obligations.
3. Ethical practice requires the investigator to inform the participant of all features of the research that
reasonably might be expected to influence willingness to participate and to explain all other aspects of
the research about which the participant inquires. Failure to make full disclosure gives added emphasis
to the investigator’s responsibility to protect the welfare and dignity of the research participant.
4. Openness and honesty are essential characteristics of the relationship between investigator and research
participant. When the methodological requirements of a study necessitate concealment or deception,
the investigator is required to ensure the participant’s understanding of the reasons for this action
and to restore the quality of the relationship with the investigator.
5. Ethical research practice requires the investigator to respect the individual’s freedom to decline to
participate in research or to discontinue participation at any time. The obligation to protect this
freedom requires special vigilance when the investigator is in a position of power over the participant.
The decision to limit this freedom increases the investigator’s responsibility to protect the participant’s
dignity and welfare.
6. Ethically acceptable research begins with the establishment of a clear and fair agreement between the
investigator and the research participant that clarifies the responsibilities of each. The investigator
has the obligation to honor all promises and commitments included in that agreement.
7. The ethical investigator protects participants from physical and mental discomfort, harm, and danger.
If the risk of such consequences exists, the investigator is required to inform the participant of that
fact, secure consent before proceeding, and take all possible measures to minimize distress. A research
procedure may not be used if it is likely to cause serious and lasting harm to participants.
8. After the data are collected, ethical practice requires the investigator to provide the participant
with a full clarification of the nature of the study and to remove any misconceptions that may have
arisen. Where scientific or human values justify delaying or withholding information, the investigator
acquires a special responsibility to assure that there are no damaging consequences for the participant.
9. Where research procedures may result in undesirable consequences for the participant, the investigator
has the responsibility to detect and remove or correct these consequences, including, where relevant,
long-term aftereffects.
10. Information obtained about the research participants during the course of an investigation is con-
fidential. When the possibility exists that others may obtain access to such information, ethical
research practice requires that this possibility, together with the plans for protecting confidentiality,
be explained to the participants as a part of the procedure for obtaining informed consent.
∗ Quoted from pp. 1–2 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American
Psychological Association. Copyright © 1973 by the American Psychological Association.
by the APA. In the earlier version and in the 1982 version, the principles were based on actual ethical problems that researchers had experienced, and extensive discussion throughout the profession was incorporated in each edition of Ethical Principles. The principles in the 1982 code are reprinted in Table 3.3. Notice that there were several new terms (subject at risk and subject at minimal risk) and also an addendum sentence to informed consent (referring to "research with children or with participants who have impairments that would limit understanding and/or communication"). The concept of minimal risk (which came out of the Belmont Report) means that the likelihood and extent of harm to the participants are presumed to be no greater than what may be typically experienced in everyday life or in routine physical or psychological examinations (Scott-Jones & Rosnow, 1998, p. 149). In actuality, the extent of harm may not be completely anticipated, and estimating the likelihood of harm is frequently difficult or impossible. Regarding the expanded statement on deception, the use of deception in research had been frowned upon for some years although there had long been instances in which active and passive deceptions had been used routinely. An example was the withholding of information (passive deception). Randomized clinical trials would be considered of dubious value in medical research had the experimenters and the participants not been deprived of information regarding which condition was assigned to each participant. On the other hand, in some areas of behavioral experimentation, the use of deception has been criticized as having "reached a 'taken-for-granted' status" (Smith, Kimmel, & Klein, 2009, p. 486).4

Given the precedence of federal (and state) regulations since the guidelines developed by the DHEW were codified by the National Research Act in 1974 (and revised as of November 6, 1975), researchers were perhaps likely to take their ethical cues from the legislated morality and its oversight by IRBs as opposed to the aspirational principles embodied in professional codes, such as the APA code. Another complication in this case is that there was a fractious splintering of the APA in the late 1980s, which resulted in many members resigning from the APA and the creation of the rival American Psychological Society, subsequently renamed the Association for Psychological Science (APS). For a time in the 1990s, a joint task force of the APA and the APS attempted to draft a revised ethics code, but the APS then withdrew its participation following an apparently irresolvable disagreement. In 2002, after a 5-year revision process, APA adopted a reworked ethics code that emphasized the five general principles defined (by APA) in Table 3.4 and also "specific standards" that fleshed out these principles.5 The tenor of the final document was apparently intended to reflect the remaining majority constituency of the APA (practitioners) but also the residual constituency of psychological scientists who perform either quantitative or qualitative research in fundamental and applied contexts. Of the specific principles with some relevance to data analysis or quantitative methods, there were broadly stated recommendations such as sharing the research data for verification by others (Section 8.14), not making deceptive or false statements (Section 8.10), using valid and reliable instruments (Section 9.02), and drawing on current knowledge for design, standardization, validation, and the reduction or elimination of bias when constructing any psychometric instruments (Section 9.05). We turn next to the risk–benefit process, but we should also note that ethical values with relevance to statistical practices are embodied in the codes developed by statistical organizations (e.g., American Statistical Association, 1999; see also Panter & Sterba, 2011).
Table 3.3. The Revised Ethical Principles Adopted in August, 1982, by the Council of Representatives of the
American Psychological Association for Research with Human Participants∗
The decision to undertake research rests upon a considered judgment by the individual psychologist about how best to
contribute to psychological science and human welfare. Having made the decision to conduct research, the psychologist
considers alternative directions in which research energies and resources might be invested. On the basis of this consid-
eration, the psychologist carries out the investigation with respect and concern for the dignity and welfare of the people
who participate and with cognizance of federal and state regulations and professional standards governing the conduct
of research with human participants.
A. In planning a study, the investigator has the responsibility to make a careful evaluation of its ethical acceptability. To
the extent that the weighing of scientific and human values suggests a compromise of any principle, the investigator
incurs a correspondingly serious obligation to seek ethical advice and to observe stringent safeguards to protect the
rights of human participants.
B. Considering whether a participant in a planned study will be a “subject at risk” or a “subject at minimal risk,” according
to recognized standards, is of primary ethical concern to the investigator.
C. The investigator always retains the responsibility for ensuring ethical practice in research. The researcher is also
responsible for the ethical treatment of research participants by collaborators, assistants, students, and employees, all
of whom, however, incur similar obligations.
D. Except in minimal-risk research, the investigator establishes a clear and fair agreement with research participants, prior
to their participation, that clarifies the obligations and responsibilities of each. The investigator has the obligation
to honor all promises and commitments included in that agreement. The investigator informs the participants of all
aspects of the research that might reasonably be expected to influence willingness to participate and explains all other
aspects of the research about which the participants inquire. Failure to make full disclosure prior to obtaining informed
consent requires additional safeguards to protect the welfare and dignity of the research participants. Research with
children or with participants who have impairments that would limit understanding and/or communication requires
special safeguarding procedures.
E. Methodological requirements of a study may make the use of concealment or deception necessary. Before conducting
such a study, the investigator has a special responsibility to (1) determine whether the use of such techniques is justified
by the study’s prospective scientific, educational, or applied value; (2) determine whether alternative procedures are
available that do not use concealment or deception; and (3) ensure that the participants are provided with sufficient
explanation as soon as possible.
F. The investigator respects the individual’s freedom to decline to participate in or to withdraw from the research at any
time. The obligation to protect this freedom requires careful thought and consideration when the investigator is in
a position of authority or influence over the participant. Such positions of authority include, but are not limited to,
situations in which research participation is required as part of employment or in which the participant is a student,
client, or employee of the investigator.
G. The investigator protects the participant from physical and mental discomfort, harm, and danger that may arise
from research procedures. If risks of such consequences exist, the investigator informs the participant of that fact.
Research procedures likely to cause serious or lasting harm to a participant are not used unless the failure to use
these procedures might expose the participant to risk of greater harm or unless the research has great potential benefit
and fully informed and voluntary consent is obtained from each participant. The participant should be informed
of procedures for contacting the investigator within a reasonable time period following participation should stress,
potential harm, or related questions or concerns arise.
H. After the data are collected, the investigator provides the participant with information about the nature of the study
and attempts to remove any misconceptions that may have arisen. Where scientific or human values justify delaying
or withholding this information, the investigator incurs a special responsibility to monitor the research and to ensure
that there are no damaging consequences for the participant.
rosnow, rosenthal 39
Table 3.3. (Continued)
I. Where research procedures result in undesirable consequences for the individual participant, the investigator has the
responsibility to detect and remove or correct these consequences, including long-term effects.
J. Information obtained about a research participant during the course of an investigation is confidential unless otherwise
agreed upon in advance. When the possibility exists that others may obtain access to such information, this possibility,
together with the plans for protecting confidentiality, is explained to the participant as part of the procedure for
obtaining informed consent.
∗ Quoted from pp. 5–7 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American Psychological Association.
Expanding the Calculation of Risks and Benefits

After the Belmont Report, it seemed that everything changed permanently for scientists engaged in human subject research, and it made little difference whether they were engaged in biomedical, behavioral, or social research. As the philosopher John E. Atwell (1981) put it, the moral dilemma was to defend the justification of using human subjects as the means to an end that was beneficial in some profoundly significant way (e.g., the progression of science, public health, or public policy) while protecting the moral "ideals of human dignity, respect for persons, freedom and self-determination, and a sense of personal worth" (p. 89). Review boards were now delegated the responsibility of making prior assessments of the future consequences of proposed research on the basis of the probability that a certain magnitude of psychological, physical, legal, social, or economic harm might result, weighed against the likelihood that "something of positive value to health or welfare" might result. Quoting the Belmont Report, "risk is properly contrasted to probability of benefits, and benefits are properly contrasted with harms rather than risks of harms," where the "risks and benefits of research may affect the individual subjects, the families of the individual subjects, and society at large (or special groups of subjects in society)." The moral calculus of benefits to risks was said to be "in a favorable ratio" when the anticipated risks were outweighed by the anticipated benefits to the subjects (assuming this was applicable) and the anticipated benefit to society in the form of the advancement of knowledge. Put into practice, however, researchers and members of review boards found it difficult to "exorcize the devil from the details" when challenged by ethical guidelines that frequently conflicted with traditional technical criteria (Mark, Eyssell, & Campbell, 1999, p. 48). As human beings are not omniscient, there was also the problem that "neither the risks nor the benefits . . . can be perfectly known in advance" (Mark et al., 1999, p. 49).

These complications notwithstanding, another catch-22 of the risk–benefit assessment is that it focuses only on the doing of research. Some years ago, we proposed a way of visualizing this predicament—first, in terms of an idealized representation of the risk–benefit assessment and, second, in terms of an alternative model focused on the costs and benefits of both the doing and not doing of research (Rosenthal & Rosnow, 1984). The latter model also has implications for the risk–benefit (we prefer the term cost–utility) of using or not using particular quantitative methods (we return to this idea in a moment). First, however, Figure 3.1 shows an idealized representation of the traditional risk–benefit assessment. Risk (importance or probability of harm) is plotted from low (C) to high (A) on the vertical axis, and the benefit is plotted from low (C) to high (D) on the horizontal axis. In other words, studies in which the risk–benefit assessment is close to A would presumably be less likely to be approved; studies close to D would be more likely to be approved; and studies falling along the B–C "diagonal of indecision" exist in a limbo of uncertainty until relevant information nudges the assessment to either side of the diagonal. The idea of "zero risk" is a methodological conceit, however, because all human subject research can be understood as carrying some degree of risk. The potential risk in the most benign behavioral and social research, for example, is the "danger of violating someone's basic rights, if only the right of privacy" (Atwell, 1981, p. 89). However, the fundamental problem of the traditional model represented in Figure 3.1 is that it runs the risk of ignoring the "not doing of research." Put another way, there are also moral costs when potentially useful research is forestalled, or if the design or implementation is
Table 3.4. General Principles Adopted in 2003 by the American Psychological Association∗
General Principles
General Principles, as opposed to Ethical Standards, are aspirational in nature. Their intent is to guide and inspire
psychologists toward the very highest ethical ideals of the profession. General Principles, in contrast to Ethical Standards,
do not represent obligations and should not form the basis for imposing sanctions. Relying upon General Principles for
either of these reasons distorts both their meaning and purpose.
Principle A: Beneficence and Nonmaleficence
Psychologists strive to benefit those with whom they work and take care to do no harm. In their professional actions,
psychologists seek to safeguard the welfare and rights of those with whom they interact professionally and other affected
persons, and the welfare of animal subjects of research. When conflicts occur among psychologists’ obligations or
concerns, they attempt to resolve these conflicts in a responsible fashion that avoids or minimizes harm. Because
psychologists’ scientific and professional judgments and actions may affect the lives of others, they are alert to and
guard against personal, financial, social, organizational, or political factors that might lead to misuse of their influence.
Psychologists strive to be aware of the possible effect of their own physical and mental health on their ability to help
those with whom they work.
Principle B: Fidelity and Responsibility
Psychologists establish relationships of trust with those with whom they work. They are aware of their professional
and scientific responsibilities to society and to the specific communities in which they work. Psychologists uphold
professional standards of conduct, clarify their professional roles and obligations, accept appropriate responsibility for
their behavior, and seek to manage conflicts of interest that could lead to exploitation or harm. Psychologists consult
with, refer to, or cooperate with other professionals and institutions to the extent needed to serve the best interests
of those with whom they work. They are concerned about the ethical compliance of their colleagues’ scientific and
professional conduct. Psychologists strive to contribute a portion of their professional time for little or no compensation
or personal advantage.
Principle C: Integrity
Psychologists seek to promote accuracy, honesty, and truthfulness in the science, teaching, and practice of psychology.
In these activities psychologists do not steal, cheat, or engage in fraud, subterfuge, or intentional misrepresentation of
fact. Psychologists strive to keep their promises and to avoid unwise or unclear commitments. In situations in which
deception may be ethically justifiable to maximize benefits and minimize harm, psychologists have a serious obligation
to consider the need for, the possible consequences of, and their responsibility to correct any resulting mistrust or other
harmful effects that arise from the use of such techniques.
Principle D: Justice
Psychologists recognize that fairness and justice entitle all persons to access to and benefit from the contributions of
psychology and to equal quality in the processes, procedures, and services being conducted by psychologists. Psychol-
ogists exercise reasonable judgment and take precautions to ensure that their potential biases, the boundaries of their
competence, and the limitations of their expertise do not lead to or condone unjust practices.
Principle E: Respect for People’s Rights and Dignity
Psychologists respect the dignity and worth of all people, and the rights of individuals to privacy, confidentiality, and
self-determination. Psychologists are aware that special safeguards may be necessary to protect the rights and welfare
of persons or communities whose vulnerabilities impair autonomous decision making. Psychologists are aware of and
respect cultural, individual, and role differences, including those based on age, gender, gender identity, race, ethnicity,
culture, national origin, religion, sexual orientation, disability, language, and socioeconomic status and consider these
factors when working with members of such groups. Psychologists try to eliminate the effect on their work of biases based
on those factors, and they do not knowingly participate in or condone activities of others based upon such prejudices.
∗ Quoted from the American Psychological Association’s Ethical Principles of Psychologists and Code of Conduct (http://www.apa.org/
ethics/code2002.html). Effective date June 1, 2003, copyrighted in 2002 by the American Psychological Association.
compromised in a way that jeopardizes the integrity of the research (cf. Haywood, 1976).

Figure 3.2 shows an alternative representing a cost–utility assessment of both the doing and not doing of research. In Part A, the decision plane model on the left corresponds to a cost–utility appraisal of the "doing of research," and the model on the right corresponds to an appraisal of the "not doing of research." We use the terms cost and utility each in a collective sense. That is, the cost of doing and the cost of not doing a particular research study include more than only the risk of psychological or physical harm; they also include the cost to society, funding agencies, and to scientific knowledge when imagination and new scientifically based solutions are stifled. As one scientist observed, "Scientists know that questions are not settled; rather, they are given provisional answers for which it is contingent upon the imagination of followers to find more illuminating solutions" (Baltimore, 1997, p. 8).

[Figure 3.2 appears here. Part A shows two decision planes, each plotting Cost (Low to High, vertical axis) against Utility (Low to High, horizontal axis): a "DO" plane with corners A, B, C, D, and a "DON'T DO" plane with corners A', B', C', D', each crossed by a diagonal of indecision. Part B shows a composite plane whose axes are the decision diagonal A–D of the left plane ("Decision to do") and the decision diagonal A'–D' of the right plane ("Decision not to do").]

Figure 3.2 Decision-planes representing the ethical assessment of the costs and utilities of doing and not doing research (after Rosenthal & Rosnow, 1984, 2008). (A) Costs and utilities of doing (left plane) and not doing (right plane) research. (B) Composite plane representing both cases in Part A (above).
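The mapping that the surrounding text describes (a point in either plane is translated to a position on its decision diagonal according to its relative distances from the two corners, so a point twice as far from A as from D lands two-thirds of the way along A–D) can be sketched numerically. The following Python fragment is our own illustration, using hypothetical coordinates on a unit plane; none of the values come from the chapter.

```python
import math

def diagonal_position(point, corner_a, corner_d):
    """Fraction of the way from corner A toward corner D at which a
    (utility, cost) point lands on the decision diagonal, based on the
    point's relative distances to the two corners."""
    d_a = math.dist(point, corner_a)  # distance to A (low utility, high cost)
    d_d = math.dist(point, corner_d)  # distance to D (high utility, low cost)
    return d_a / (d_a + d_d)

# Hypothetical unit plane: both axes run from 0 (low) to 1 (high).
A = (0.0, 1.0)  # low utility, high cost
D = (1.0, 0.0)  # high utility, low cost

# A point lying twice as far from A as from D:
p = (2.0 / 3.0, 1.0 / 3.0)
print(round(diagonal_position(p, A, D), 3))  # 0.667, two-thirds of the way toward D
```

On this reading, fractions above one-half fall on the "do" side of the diagonal of indecision, fractions below one-half on the "don't do" side.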
We refer to A–D and A'–D' as the decision diagonals in Part A (in contrast to B–C and B'–C', the diagonals of indecision). For any point in the plane of doing, there would be a location on the cost axis and on the utility axis, where any point could be translated to an equivalent position on the decision diagonal. Thus, if a point were twice as far from A as from D, the transformed point would then be located two-thirds of the way on the decision diagonal A–D (closer to D than to A). Similar reasoning is applicable to not doing, with the exception that closeness to A' would mean "do" rather than "not do." Points near D tell us the research should be done, and points near D' tell us the research should not be done.6

Figure 3.2 can also be a way of thinking about cost–utility dilemmas regarding quantitative methods and statistical reporting practices. In the 2009 edition of the U.S. National Academy of Sciences (NAS) guide to responsible conduct in scientific research, there are several hypothetical scenarios, including one in which a pair of researchers (a postdoctoral and a graduate student) discuss how they should deal with two anomalous data points in a graph they are preparing to present in a talk (Committee on Science, Engineering, and Public Policy, 2009). They want to put the best face on their research, but they fear that discussing the two outliers will draw people's attention away from the bulk of the data. One option would be to drop the outliers, but, as one researcher cautions, this could be viewed as "manipulating" the data, which is unethical. The other person comments that if they include the anomalous points, and if a senior person then advises them to include the anomalous data in a paper they are drafting for publication, this could make it harder to have the paper accepted by a top journal. That is, the reported results will not be unequivocal (a potential reason for rejection), and the paper will also then be too wordy (another reason to reject it?). In terms of Figure 3.2, not including the two anomalous data points is analogous to the "not doing of research." There are, of course, additional statistical options, which can also be framed in cost–utility terms, such as using a suitable transformation to pull in the outlying stragglers and make them part of the group (cf. Rosenthal & Rosnow, 2008, pp. 310–311). On the other hand, outliers that are not merely recording errors or instrument errors can sometimes provide a clue as to a plausible moderator variable. Suppressing this information could potentially impede scientific progress (cf. Committee on Science, Engineering, and Public Policy, 2009, p. 8).

Unfortunately, there are also cases involving the suppression of data where the cost is not only that it impedes progress in the field, but it also undermines the authority and trustworthiness of scientific research and, in some instances, can cause harm to the broader society, such as when public policy is based on only partial information or when there is selective outcome reporting of the efficacy of clinical interventions in published reports of randomized trials (Turner, Matthews, Linardatos, Tell, & Rosenthal, 2008; Vedula, Bero, Scherer, & Dickersin, 2009). In an editorial in Science, Cicerone (2010), then president of the NAS, stated that his impression—based on information from scattered public opinion polls and various assessments of leaders in science, business, and government—was that "public opinion has moved toward the view that scientists often try to suppress alternative hypotheses and ideas and that scientists will withhold data and try to manipulate some aspects of peer review to prevent dissent" (p. 624). Spielmans and Parry (2010) described a number of instances of "marketing-based medicine" by pharmaceutical firms. Cases included the "cherry-picking" of data for publication, the suppression or understatement of negative results, and the publication (and distribution to doctors) of journal articles that were not written by the academic authors who lent their names, titles, and purported independence to the papers but instead had been written by ghost writers hired by pharmaceutical and medical-device firms to promote company products. Spielmans and Parry displayed a number of screen shots of company e-mails, which we do not usually get to see because they go on behind the curtain. In an editorial in PLoS Medicine (2009) lamenting the problem of ghost writers and morally dubious practices in the medical marketing of pharmaceutics, the editors wrote:

    How did we get to the point that falsifying the medical literature is acceptable? How did an industry whose products have contributed to astounding advances in global health over the past several decades come to accept such practices as the norm? Whatever the reasons, as the pipeline for new drugs dries up and companies increasingly scramble for an ever-diminishing proportion of the market in "me-too" drugs, the medical publishing and pharmaceutical industries and the medical community have become locked into a cycle of mutual dependency, in which truth and a lack of bias have come to be seen as optional extras. Medical journal editors need to decide whether they want to
    roll over and just join the marketing departments of pharmaceutical companies. Authors who put their names to such papers need to consider whether doing so is more important than having a medical literature that can be believed in. Politicians need to consider the harm done by an environment that incites companies into insane races for profit rather than for medical need. And companies need to consider whether the arms race they have started will in the end benefit anyone. After all, even drug company employees get sick; do they trust ghost authors?

Ethical Standards and Quantitative Methodological Standards

We turn now to Table 3.5, which shows a matrix of general ethical standards crossed with quantitative methodological standards (after Rosnow & Rosenthal, 2011). We do not claim that the row and column standards are either exhaustive or mutually exclusive but only that they are broadly representative of (1) aspirational ideals in the society as a whole and (2) methodological, data analytic, and reporting standards in science and technology. The matrix is a convenient way of reminding ourselves of the distinction between (1) and (2), and it is also a way of visualizing a potential clash between (1) and (2) and, frequently, the opportunity to exploit this situation in a way that could have rewarding ethical and scientific implications. Before we turn specifically to the definitions of the row and column headings in Table 3.5, we will give a quick example of what we mean by "rewarding ethical and scientific implications" in the context of the recruitment of volunteers. For this example, we draw on some of our earlier work on specific threats to validity (collectively described as artifacts) deriving from the volunteer status of the participants for research participation. Among our concerns when we began to study the volunteer was that ethical sensitivities seemed to be propelling psychological science into a science of informed volunteers (e.g., Rosenthal & Rosnow, 1969; Rosnow & Rosenthal, 1970). It was long suspected that people who volunteered for behavioral and social research might not be fully adequate models for the study of behavior in general. To the extent that volunteers differ from nonvolunteers on dimensions of importance, the use of volunteers could have serious effects on such estimated parameters as means, medians, proportions, variances, skewness, and kurtosis. The estimation of parameters such as these is the principal goal in survey research, whereas in experimental research the focus is usually on the magnitude of the difference between the experimental and control group means. Such differences, we and other investigators observed, were sometimes affected by the use of volunteers (Rosenthal & Rosnow, 1975, 2009).

With problems such as these serving as beginning points for empirical and meta-analytic investigations, we explored the characteristics that differentiated volunteers and nonvolunteers, the situational determinants of volunteering, some possible interactions of volunteer status with particular treatment effects, the implications for predicting the direction and, sometimes, the magnitude of the biasing effects in research situations, and we also thought about the broader ethical implications of these findings (Rosenthal & Rosnow, 1975; Rosnow & Rosenthal, 1997). For example, in one aspect of our meta-analytic inquiry, we put the following question to the research literature: What are the variables that tend to increase or decrease the rates of volunteering obtained? Our preliminary answers to this question may have implications for both the theory and practice of behavioral science. That is, if we continue to learn more about the situational determinants of volunteering, we can learn more about the social psychology of social influence processes. Methodologically, once we learn more about the situational determinants of volunteering, we should be in a better position to reduce the bias in our samples that derives from the volunteer subjects being systematically different from nonvolunteers in a variety of characteristics. For example, one situational correlate was that the more important the research was perceived, the more likely people were to volunteer for it. Thus, mentioning the importance of the research during the recruitment phase might coax more of the "nonvolunteers" into the sampling pool. It would be unethical to exaggerate or misrepresent the importance of the research. By being honest, transparent, and informative, we are treating people with respect and also giving them a well-founded justification for asking them to volunteer their valuable time, attention, and cooperation. In sum, the five column headings of Table 3.5 frequently come precorrelated in the real world of research, often with implications for the principles in the row headings of the table.

Turning more specifically to the row headings in Table 3.5, rows A, B, C, and E reiterate the three "basic ethical principles" in the Belmont Report, which were described there as respect for persons, beneficence, and justice. Beneficence (the ethical
Table 3.5. General Ethical Standards Crossed with Quantitative Methodology Standards (after Rosnow &
Rosenthal, 2011)

                             Quantitative methodological standards
General ethical standards    Transparency | Informativeness | Precision | Accuracy | Groundedness
A. Beneficence
B. Nonmaleficence
C. Justice
D. Integrity
E. Respect
ideal of "doing good") was conflated with the principle (b) of nonmaleficence ("not doing harm"), and the two were also portrayed as obligations assimilating two complementary responsibilities: (1) do not harm and (2) maximize possible benefits and minimize possible harms. Next in Table 3.5 is justice, by which we mean a sense of "fairness in distribution" or "what is deserved" (quoting from the Belmont Report). As the Belmont Report went on to explain: "Injustice occurs when some benefit to which a person is entitled is denied without good reason or when some burden is imposed unduly." Conceding that "what is equal?" and "what is unequal?" are often complex, highly nuanced questions in a specific research situation (just as they are when questions of justice are associated with social practices, such as punishment, taxation, and political representation), justice was nonetheless considered a basic moral precept relevant to the ethics of research involving human subjects. Next in Table 3.5 is integrity, an ethical standard that was not distinctly differentiated in the Belmont Report but that was discussed in detail in the NAS guide (Committee on Science, Engineering, and Public Policy, 2009). Integrity implies honesty and truthfulness; it also implies a prudent use of research funding and other resources and, of course, the disclosure of any conflicts of interest, financial or otherwise, so as not to betray public trust. Finally, respect was described in the Belmont Report as assimilating two obligations: "first, that individuals should be treated as autonomous agents, and second, that persons with diminished autonomy are entitled to protection." In the current APA code, respect is equated with civil liberties—that is, privacy, confidentiality, and self-determination.

Inspecting the column headings in Table 3.5, first by transparency, we mean here that the quantitative results are presented in an open, frank, and candid way, that any technical language used is clear and appropriate, and that visual displays do not obfuscate the data but instead are as crystal clear as possible. Elements of graphic design are explained and illustrated in a number of very useful books and articles, particularly the work of Tufte (1983, 1990, 2006) and Wainer (1984, 1996, 2000, 2009; Wainer & Thissen, 1981), and there is a burgeoning literature in every area of science on the visual display of quantitative data. Second, by informativeness, we mean that there is enough information reported to enable readers to make up their own minds on the basis of the primary results and enough to enable others to re-analyze the summary results for themselves. The development of meta-analysis, with emphasis on effect sizes and moderator variables, has stimulated ways of recreating summary data sets and vital effect size information, often from minimal raw ingredients. Third, the term precision is used not in a statistical sense (the likely spread of estimates of a parameter) but rather in a more general sense to mean that quantitative results should be reported to the degree of exactitude required by the given situation. For example, reporting the average scores on an attitude questionnaire to a high degree of decimal places is psychologically meaningless (false precision), and reporting the weight of mouse subjects to six decimal places is pointless (needless precision). Fourth, accuracy means that a conscientious effort is made to identify and correct mistakes in measurements, calculations, and the reporting of numbers. Accuracy also means not exaggerating results by, for example, making false claims that applications of the results are unlikely to achieve. Fifth, groundedness implies that the method of choice is appropriate to the question of interest, as opposed to using whatever is fashionable or having
a computer program repackage the data in a one-size-fits-all conceptual framework. The methods we choose must be justifiable on more than just the grounds that they are what we were taught in graduate school, or that “this is what everyone else does” (cf. Cohen, 1990, 1994; Rosnow & Rosenthal, 1995, 1996; Zuckerman, Hodgins, Zuckerman, & Rosenthal, 1993).

Clinical Significance and the Consequences of Statistical Illiteracy

To bring this discussion of quantitative methods and ethics full circle, we turn finally to a problem that has been variously described as innumeracy (Paulos, 1990) and statistical illiteracy. The terms are used to connote a lack of knowledge or understanding of the meaning of numbers, statistical concepts, or the numeric expression of summary statistics. As the authors of a popular book, The Numbers Game, put it: “Numbers now saturate the news, politics, life. . . . For good or for evil, they are today’s preeminent public language—and those who speak it rule” (Blastland & Dilnot, 2009, p. x). To be sure, even people who are most literate in the language of numbers are prone to wishful thinking and fearful thinking and, therefore, sometimes susceptible to those who use numbers and gimmicks to sway, influence, or even trick people. The mathematician who coined the term innumeracy told of how his vulnerability to whim “entrained a series of ill-fated investment decisions,” which he still found “excruciating to recall” (Paulos, 2003, p. 1). The launching point for the remainder of our discussion was an editorial in a medical journal several years ago, in which the writers lamented “the premature dissemination of research and the exaggeration of medical research findings” (Schwartz & Woloshin, 2003, p. 153). A large part of the problem is an emphasis on RR statistics that hook general readers into making unwarranted assumptions, a problem that may often begin with researchers, funders, and journals that “court media attention through press releases” (Woloshin, Schwartz, Casella, Kennedy, & Larson, 2009, p. 613). Confusion about risk and risk statistics is not limited to the general public (cf. Prasad, Jaeschke, Wyer, Keitz, & Guyatt, 2008), but it is the susceptible public (Carling, Kristoffersen, Herrin, Treweek, Oxman, Schünemann, Akl, & Montori, 2008) that must ultimately pay the price of the accelerating costs of that confusion. Stirring the concept of statistical significance into this mix can frequently produce a truly astonishing amount of confusion. For example, writing in the Journal of the National Cancer Institute, Miller (2007) mentioned that many doctors equate the level of statistical significance of cancer data with the “degree of improvement a new treatment must make for it to be clinically meaningful” (p. 1832).7

In the space remaining, we concentrate on misconceptions and illusions regarding the concepts of RR and statistical significance when the clinical significance of interventions is appraised through the lens of these concepts in randomized clinical trials (RCTs). As a case in point, a highly cited report on the management of depression, issued by the National Institute for Health and Clinical Excellence (NICE), used RR of 0.80 or less as a threshold indicator of clinical significance in RCTs with dichotomous outcomes and statistically significant results.8 We use the term clinical significance here in the way that it was defined in an authoritative medical glossary, although we recognize that it is a hypothetical construct laden with surplus meaning as well (cf. Jacobson & Truax, 1991). In the glossary, clinical significance was taken to mean that “an intervention has an effect that is of practical meaning to patients and health care providers” (NICHSR, 2010; cf. Jeans, 1992; Kazdin, 1977, 2008). By intervention, we mean a treatment or involvement such as a vaccine used in a public health immunization program to try to eradicate a preventable disease (e.g., the Salk poliomyelitis vaccine), or a drug that can be prescribed for a patient in the doctor’s office, or an over-the-counter medicine (e.g., aspirin) used to reduce pain or lessen the risk of an adverse event (e.g., heart attack), or a medication and/or psychotherapy to treat depression. By tradition, RCTs are the gold standard in evidence-based medicine when the goal is to appraise the clinical significance of interventions in a carefully controlled scientific manner. Claims contradicted by RCTs are not always immediately rejected in evidence-based medicine, as it has been noted that some “claims from highly cited observational studies persist and continue to be supported in the medical literature despite strong contradictory evidence from randomized trials” (Tatsioni, Bonitsis, & Ioannidis, 2007). Of course, just as gold can fluctuate in value, so can conclusions based on the belief that statistical significance is a proxy for clinical significance, or on the belief that, given statistical significance, clinical significance is achieved only if the reduction in RR reaches some arbitrary fixed magnitude (recall, for example, NICE, 2004). The challenge
46 quantitative methods and ethics
Table 3.6. Final results for myocardial infarction (MI) and hemorrhagic stroke (HS) for the
aspirin (325 mg every other day) component of the Physicians’ Health Study (Steering
Committee of the Physicians’ Health Study Research Group, 1989). The increase in relative
risk (RRI) for HS was more than twice the reduction in relative risk (RRR) for MI. Having one
more case of HS in the aspirin group would have yielded a chi-square significant at p < 0.05,
RR = 2.0, and RRI = 100%. In the combined samples, the event rate of MI (378/22,071 =
0.0171, or 1.71% ) exceeded the event rate of HS (35/22,071 = 0.0016, or 0.16% ) by a ratio
of about 10:1, and a difference of 1.71% – 0.16% = 1.55%. In the subtable on the right, RRI is
the relative risk increase, computed as RRR (see Table 3.7), but indicated as RRI when the
treatment increases the risk of the adverse outcome.
                 Myocardial infarction (MI)          Hemorrhagic stroke (HS)
                 MI      No MI     Total             HS     No HS     Total
Aspirin          139     10,898    11,037            23     11,014    11,037
Placebo          239     10,795    11,034            12     11,022    11,034
Total            378     21,693    22,071            35     22,036    22,071
                 RR = 0.58                           RR = 1.92
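The table’s two RRs, and the caption’s remark that one more stroke in the aspirin group would have pushed the chi-square past the 0.05 criterion with RR = 2.0, can be checked with a short Python sketch. The per-group counts used here (MI: 139 vs. 239; HS: 23 vs. 12, in groups of 11,037 and 11,034) are the published Physicians’ Health Study values, and the function names are ours:

```python
# 2x2 checks for the aspirin component of the Physicians' Health Study
# (Steering Committee, 1989). Group counts are the published values:
# 11,037 aspirin and 11,034 placebo participants; MI events 139 vs. 239;
# hemorrhagic stroke (HS) events 23 vs. 12.

def relative_risk(events_treated, n_treated, events_control, n_control):
    """RR: incidence rate in the treated sample over that in the control."""
    return (events_treated / n_treated) / (events_control / n_control)

def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table with cells A, B (top row) and
    C, D (bottom row); 3.84 is the 0.05 criterion on 1 df."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(relative_risk(139, 11_037, 239, 11_034), 2))  # MI: 0.58
print(round(relative_risk(23, 11_037, 12, 11_034), 2))    # HS: 1.92

# Observed strokes (23 vs. 12): chi-square comes to about 3.46, just
# short of 3.84. One more stroke in the aspirin group (24 vs. 12):
# about 4.00, significant at 0.05, with RR very close to 2.0.
print(round(chi_square(23, 11_014, 12, 11_022), 2))
print(round(chi_square(24, 11_013, 12, 11_022), 2))
```

The two chi-square values straddle the 3.84 criterion, which is the hair’s-breadth the text describes.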
is to reverse the accelerating cost curve of statistical illiteracy in an area that affects us all (see, for example, Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2008).

Table 3.6 helps us illustrate the folly of a delicate balancing act that is sometimes required between statistical significance and RR. The table shows a portion of the results from the aspirin component of a highly cited double-blind, placebo-controlled, randomized trial to test whether 325 milligrams of aspirin every other day reduces the mortality from cardiovascular disease and whether beta-carotene decreases the incidence of cancer (Steering Committee of the Physicians’ Health Study Research Group, 1989). The aspirin component of the study was terminated earlier than planned on finding “a statistically significant, 44 [sic] percent reduction in the risk of myocardial infarction for both fatal and nonfatal events . . . [although] there continued to be an apparent but not significantly increased risk of stroke” (p. 132). RR (for relative risk) refers to the ratio of the incidence rate of the adverse event (the illness) in the treated sample to that in the control sample; RRR is the relative risk reduction; and RRI is the relative risk increase (the computation of these indices is described in Table 3.7). When tables of independent counts are set up as shown in Tables 3.6 and 3.7, an RR less than 1.0 indicates that the treated sample fared better than the control sample (thereby implying RRR), and an RR greater than 1.0 indicates that the treated sample did more poorly than the control (thereby implying RRI). Observe that the “slightly increased risk of stroke” (RRI = 92%) was actually more than twice the reduction in risk of heart attack (RRR = 42%)! Suppose the study had continued, and one more case of stroke had turned up in the aspirin group. The p-value would have reached the 0.05 level, and the researchers might have arrived at a different conclusion, possibly that the benefit with respect to heart attack was more than offset by the increased risk of stroke. Apparently, a p-value only a hair’s-breadth greater than 0.05 can trump an RR increase of 92%. On the other hand, the event rate of stroke in the study as a whole was only 0.16%, less than one-tenth the magnitude of the event rate of 1.7% for heart attack in the study as a whole.9 However, we would never know this from the RR alone.

The fact is that RR statements are oblivious to event rates in the total N. To give a quick example, suppose in a study with 100 people each in the treated and control samples that 1 treated person and 5 untreated people (controls) became ill. RR and RRR would be 0.20 and 80%, respectively. Stating that there was an 80% reduction in risk of the adverse event conveys hope. However, suppose we increase each sample size to 1,000 but still assume 1 case of illness in the treated sample and 5 cases
Table 3.7. Six studies each with total sample size (N ) of 2,000 and 1% event rates in Studies 1 and 4 (20
cases out of 2,000), 25% event rates in Studies 2 and 5 (500 cases out of 2,000), and 50% event rates in
Studies 3 and 6 (1,000 cases out of 2,000). RR, the relative risk or risk ratio, indicates the ratio of the
incidence rate of level of risk in the treated group to the level of risk in the control group. With cells labeled
A, B, C, D from upper left (A) to upper right (B), to lower left (C), to lower right (D), RR =
[A/(A+B)]/[C/(C+D)], where RR < 1.0 favors the treatment effect (risk reduction) and RR > 1.0 favors the
control effect (risk increase). OR, the odds ratio, also called relative odds or the cross-product ratio, is the
ratio of A/B to C/D, or the cross-product AD/BC. RRR, the relative risk reduction, is the reduction in risk
of the adverse outcome (e.g., illness) in the treated sample relative to the control, which is indicated as a
percentage by dividing RD (defined next) by [C/(C+D)] and then multiplying by 100. RD, the risk
difference, also called the absolute risk reduction (ARR), is the reduction in risk of the particular adverse
outcome (e.g., cancer, heart attack, stroke) in the treated group compared with the level of baseline risk in
the control—that is, [A/(A+B)]–[C/(C+D)]. Multiplying RD (or ARR) times 10,000 estimates the number
of people in a group of 10,000 who are predicted to benefit from the treatment. NNT = 1/RD = 1/ARR is
the number needed to treat to prevent a single case of the particular adverse outcome.
[Table 3.7 body: only fragments survive. Legible are the column heads Study 1 (N = 2,000), Study 2 (N = 2,000), and Study 3 (N = 2,000); an adverse-outcome row (1%, 25%, 50%); a chi-square row (16.4, 540.0, and 1,620.0, each p < 0.0001); and an NNT = 1/ARR = 1/RD row (500, 20, 10).]
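The caption’s formulas can be collected into one helper (a Python sketch; the function name is ours, and the demonstration counts, 225 vs. 275 adverse outcomes per 1,000, are our inference from Study 5’s stated 25% event rate and the RR of 0.82 given in the Figure 3.3 caption, since the table’s cell entries did not survive reproduction):

```python
# Indices from the Table 3.7 caption, for a 2x2 table of counts with
# A = treated/adverse, B = treated/none, C = control/adverse, D = control/none.

def risk_indices(a, b, c, d):
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    rr = risk_treated / risk_control          # relative risk (risk ratio)
    odds_ratio = (a * d) / (b * c)            # cross-product ratio AD/BC
    rd = risk_treated - risk_control          # risk difference (ARR in size)
    # RRR expressed as a positive percentage; the caption's
    # RD / [C/(C+D)] x 100 yields the same magnitude, negatively signed.
    rrr = (risk_control - risk_treated) / risk_control * 100
    nnt = 1 / abs(rd)                         # number needed to treat
    return {"RR": rr, "OR": odds_ratio, "RD": rd, "RRR": rrr, "NNT": nnt}

# Counts consistent with Study 5: 225 of 1,000 treated and 275 of 1,000
# controls with the adverse outcome (a 25% event rate overall).
study5 = risk_indices(225, 775, 275, 725)
print({k: round(v, 3) for k, v in study5.items()})
```

With these cells the surviving entries are reproduced: RR of about 0.82, RRR of about 18.2%, and NNT = 20.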
[Figure 3.3: six histograms of group assignment (treatment vs. control); y-axis “Number of Patients” (0 to 1,000); event rates of 1%, 25%, and 50%. Panel annotations: RR = 0.05 and RRR = 95% for Studies 1 to 3 (top row); RR = 0.82 and RRR = 18% for Studies 4 to 6 (bottom row).]
Figure 3.3 Histograms based on the six studies in Table 3.7, in which the total sample size (N ) was 2,000 in each study. Darkened
areas of the bars indicate the number of adverse outcomes (event rates), which increased from 1% (20 cases out of 2,000) in Studies 1
and 4, to 25% (500 cases out of 2,000) in Studies 2 and 5, to 50% (1,000 cases out of 2,000) in Studies 3 and 6. However, the relative
risk (RR) and relative risk reduction (RRR) were insensitive to these vastly different event rates. In Studies 1, 2, and 3, the RR and
RRR remained constant at 0.05 and 94.7%, respectively, whereas in Studies 4, 5, and 6, the RR and RRR remained constant at 0.82
and 18.2%, respectively.
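The insensitivity of RR and RRR to absolute frequencies, discussed around this figure, can be verified directly by running the text’s 1-versus-5 example at two sample sizes (a Python sketch; the function name is ours, and phi uses the standard 2 × 2 formula):

```python
from math import sqrt

# RR ignores how common the adverse event is; the risk difference (RD)
# and phi do not. The text's example: 1 ill treated person and 5 ill
# controls, first with 100 per group, then with 1,000 per group.

def rr_rd_phi(a, b, c, d):
    risk_t, risk_c = a / (a + b), c / (c + d)
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return risk_t / risk_c, risk_t - risk_c, phi

small = rr_rd_phi(1, 99, 5, 95)      # 100 per group: overall event rate 3%
large = rr_rd_phi(1, 999, 5, 995)    # 1,000 per group: overall rate 0.3%

print(round(small[0], 4), round(large[0], 4))  # RR = 0.2 in both
print(round(small[1], 4), round(large[1], 4))  # RD: -0.04 vs. -0.004
print(round(small[2], 4), round(large[2], 4))  # phi shrinks as well
```

RR (and hence RRR = 80%) does not budge across the two sample sizes, while the absolute indices shrink tenfold, which is the figure’s point in miniature.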
of illness in the control sample. We would still find RR = 0.20 and RRR = 80%. It makes no difference how large we make the sample sizes, as RR and RRR will not budge from 0.20 and 80% so long as we assume 1 case of illness in the treated sample and 5 cases of illness in the control sample. Suppose we now hold the N constant and see what happens to the RR and RRR when the event rate in the overall N changes from one study to another. In Figure 3.3, we see the results of six hypothetical studies in which the event rates increased from 1% in Studies 1 and 4, to 25% in Studies 2 and 5, to 50% in Studies 3 and 6. Nonetheless, in Studies 1, 2, and 3, RR remained constant at 0.05 and RRR remained constant at an attention-getting 95%. In Studies 4, 5, and 6, RR and RRR stayed constant at 0.82 and 18%, respectively.

Further details of the studies in Figure 3.3 are given in Table 3.7. The odds ratio (OR), for the ratio of two odds, was for a time widely promoted as a measure of association in 2 × 2 tables of counts (Edwards, 1963; Mosteller, 1968) and is still frequently reported in epidemiological studies (Morris & Gardner, 2000). As Table 3.7 shows, OR and RR are usually highly correlated. The absolute risk reduction (ARR), also called the risk difference (RD), refers to the absolute reduction in risk of the adverse event (illness) in the treated patients compared with the level of baseline risk in the control group. Gigerenzer et al. (2008) recommended using the absolute risk reduction (RD) rather than the RR. As Table 3.7 shows, RD (or ARR) is sensitive to the differences in the event rates. There are other advantages as well to RD, which are discussed elsewhere (Rosenthal & Rosnow, 2008, pp. 631–632). Phi is the product-moment correlation (r) when the two correlated variables are dichotomous, and Table 3.7 shows it is sensitive to the event rates and natural frequencies. Another useful index is NNT, for the number of patients that need to be treated to prevent a single case of the adverse event. Relative risk may be an easy-to-handle description, but it is only an alerting indicator that tells us that something happened and we need to explore the data further. As Tukey (1977), the consummate exploratory data analyst, stated: “Anything
that makes a simpler description possible makes the description more easily handleable; anything that looks below the previously described surface makes the description more effective” (p. v). And, we can add, any index of the magnitude of effect that is clear enough, transparent enough, and accurate enough to inform the nonspecialist of exactly what we have learned from the quantitative data increases the ethical value of those data (Rosnow & Rosenthal, 2011).

Conclusion

In a cultural sphere in which so many things compete for our attention, it is not surprising that people seem to gravitate to quick, parsimonious forms of communication and, in the case of health statistics, to numbers that appear to speak directly to us. For doctors with little spare time to do more than browse abstracts of clinical trials or the summaries of summaries, the emphasis on parsimonious summary statistics such as RR communications in large-sample RCTs may seem heavily freighted with clinical meaning. For the general public, reading about a 94.7% reduction in the risk of some illness, either in a pharmaceutical advertisement or in a news story about a “miracle drug that does wonders,” is attention-riveting. It is the kind of information that is especially likely to arouse an inner urgency in patients, but also in anyone who is anxious and uncertain about their health. Insofar as such information exaggerates the absolute effects, it is not only the patient or the public that will suffer the consequences; the practice of medicine and the progress of science will as well. As Gigerenzer et al. (2008) wrote, “Statistical literacy is a necessary precondition for an educated citizenship in a technological democracy” (p. 53). There are promising opportunities for moral (and societal) rewards for quantitative methodologists who can help us educate our way out of statistical illiteracy. And that education will be beneficial, not only to the public but to many behavioral, social, and medical researchers as well. As that education takes place, there will be increased clarity, transparency, and accuracy of the quantitative methods employed, thereby increasing their ethical value.

Future Directions

An important theoretical and practical question remains to be addressed: To what extent is there agreement among quantitative methodologists in their evaluation of quantitative procedures as to the degree to which each procedure in a particular study meets the methodological standards of transparency, informativeness, precision, accuracy, and groundedness? The research program called for to address these psychometric questions of reliability will surely find that specific research contexts, specific disciplinary affiliations, and other specific individual differences (e.g., years of experience) will be moderators of the magnitudes of agreement (i.e., reliabilities) achieved. We believe that the results of such research will demonstrate that there will be some disagreement (that is, some unreliability) in quantitative methodologists’ evaluations of various standards of practice. And, as we noted above, that is likely to be associated with some disagreement (that is, some unreliability) in their evaluations of the ethical value of various quantitative procedures.

Another important question would be addressed by research asking the degree to which the specific goals and specific sponsors of the research may serve as causal factors in researchers’ choices of quantitative procedures. Teams of researchers (e.g., graduate students in academic departments routinely employing quantitative procedures in their research) could be assigned at random to analyze the data of different types of sponsors with different types of goals. It would be instructive to learn that choice of quantitative procedure was predictable from knowing who was paying for the research and what results the sponsors were hoping for. Recognition of the possibility that the choice of quantitative procedures might be affected by the financial interests of the investigator is reflected in the increased frequency with which scientific journals (e.g., medical journals) require a statement from all co-authors of their financial interest in the company sponsoring the research (e.g., pharmaceutical companies).

Finally, it would be valuable to quantify the costs and utilities of doing and not doing a wide variety of specific studies, including classic and not-so-classic studies already conducted, and a variety of studies not yet conducted. Over time, there may develop a disciplinary consensus over the costs and utilities of a wide array of experimental procedures. And, as such a consensus builds over time, it will be of considerable interest to psychologists and sociologists of science to study disciplinary differences in such consensus-building. Part of such a program of self-study of disciplines doing quantitative research would focus on the quantitative procedures used, but the primary goal would be to apply survey research methods to establish
the degree of consensus on research ethics of the behavioral, social, educational, and biomedical sciences. The final product of such a program of research would include the costs and utilities of doing, and of not doing, a wide variety of research studies.

Notes

1. Where we quote from a document but do not give the page numbers of the quoted material, it is because either there was no pagination or there was no consistent pagination in the online and hard copy versions that we consulted. Tables 3.1–3.4 reprint only the original material, as there were slight discrepancies between original material and online versions.

2. Pattullo (1982) described the logical basis on which “rule-makers” (like DHEW) had proceeded in terms of a syllogism emphasizing not the potential benefits of research but only the avoidance of risks of harm: “(1) Research can harm subjects; (2) Only impartial outsiders can judge the risk of harm; (3) Therefore, all research must be approved by an impartial outside group” (p. 376).

3. Hearings on the recommendations in the Belmont Report were conducted by the President’s Commission for the Study of Ethical Problems in Medicine, Biomedical, and Behavioral Research. Proceeding on the basis of the information provided at these hearings and on other sources of advice, the Department of Health and Human Services (DHHS) then issued a set of regulations in the January 26, 1981, issue of the Federal Register. A compendium of regulations and guidelines that now govern the implementation of the National Research Act and subsequent amendments can be found in the DHHS manual known as the “Gray Booklet,” specifically titled Guidelines for the Conduct of Research Involving Human Subjects at the National Institutes of Health (available online at http://ohsr.od.nih.gov/guidelines/index.html).

4. Smith, Kimmel, and Klein (2009) reported that 43.4% of the articles on consumer research in leading journals in the field in 1975 through 1976 described some form of deception in the research. By 1989 through 1990, the number of such articles increased to 57.7%, where it remained steady at 56% in 1996 through 1997, increased to 65.7% in 2001 through 2002, and jumped to 80.4% in 2006 through 2007. The issue of deception is further complicated by the fact that active and passive deceptions are far from rare in our society. Trial lawyers manipulate the truth in court on behalf of their clients; prosecutors surreptitiously record private conversations; journalists get away with using hidden cameras and undercover practices to obtain stories; and the police use sting operations and entrapment procedures to gather incriminating evidence (cf. Bok, 1978, 1984; Saxe, 1991; Starobin, 1997).

5. The document, titled “Ethical Principles of Psychologists and Code of Conduct,” is available online at http://www.apa.org/ETHICS/code2002.html.

6. Adaptations of the models in Figures 3.1 and 3.2 have been used to cue students about possible ethical dilemmas in research and data analysis (cf. Bragger & Freeman, 1999; Rosnow, 1990; Strohmetz & Skleder, 1992).

7. The confusion of statistical significance with practical importance may be a more far-reaching problem in science. In a letter in Science, the writers noted that “almost all reviews and much of the original research [about organic foods] report only the statistical significance of the differences in nutrient levels—not whether they are nutritionally important” (Clancy, Hamm, Levine, & Wilkins, 2009, p. 676).

8. NICE (2004) also recommended that researchers use a standardized mean difference (SMD) of half a standard deviation or more (i.e., d or g ≥ 0.5) with continuous outcomes as the threshold of clinical significance for initial assessments of statistically significant summary statistics. However, effects far below the 0.5 threshold for SMDs have been associated with important interventions. For example, in the classic Salk vaccine trial (Brownlee, 1955; Francis, Korns, Voight, Boisen, Hemphill, Napier, & Tolchinsky, 1955), phi = 0.011, which has a d-equivalent of 0.022 (Rosnow & Rosenthal, 2008). It is probably the case across the many domains in which clinical significance is studied that larger values of d or g are in fact generally associated with greater intervention benefit, efficacy, or clinical importance. But it is also possible for large SMDs to have little or no clinical significance. Suppose a medication was tested on 100 pairs of identical twins with fever, and in each and every pair the treated twin loses exactly one-tenth of 1 degree more than the control twin. The SMD will be infinite, inasmuch as the variability (the denominator of d or g) will be 0, but few doctors would consider this ES clinically significant. As Cohen (1988) wisely cautioned, “the meaning of any given ES is, in the final analysis, a function of the context in which it is embedded” (p. 535).

9. The high RR of HS in this study, in which participants (male physicians) took 325 milligrams every other day, might explain in part why the current dose for MI prophylaxis is tempered at only 81 milligrams per day.

Author Note

Ralph L. Rosnow is Thaddeus Bolton Professor Emeritus at Temple University, Philadelphia, PA (rosnow@temple.edu). Robert Rosenthal is a Distinguished Professor at the University of California at Riverside (robert.rosenthal@ucr.edu) and Edgar Pierce Professor of Psychology, Emeritus, Harvard University.

References

Allison, D. B. (2009). The antidote to bias in research. Science, 326, 522.
American Psychological Association. (1973). Ethical principles in the conduct of research with human participants. Washington, DC: Author.
American Psychological Association. (1982). Ethical principles in the conduct of research with human participants. Washington, DC: Author.
American Statistical Association. (1999). Ethical guidelines for statistical practice. http://www.amstat.org/profession/index.cfm?fusaction=ethical
Atwell, J. E. (1981). Human rights in human subjects research. In A. J. Kimmel (Ed.), New directions for methodology of social and behavioral science: Ethics of human subject research (No. 10, pp. 81–90). San Francisco, CA: Jossey-Bass.
Baltimore, D. (January 27, 1997). Philosophical differences. The New Yorker, p. 8.
Beecher, H. K. (July 2, 1966). Documenting the abuses. Saturday Review, 45–46.
Beecher, H. K. (1970). Research and the individual. Boston: Little, Brown.
Blanck, P. D., Bellack, A. S., Rosnow, R. L., Rotheram-Borus, M. J., & Schooler, N. R. (1992). Scientific rewards and conflicts of ethical choices in human subjects research. American Psychologist, 47, 959–965.
Blass, T. (2004). The man who shocked the world: The life and legacy of Stanley Milgram. New York: Basic Books.
Blastland, M., & Dilnot, A. (2009). The numbers game. New York: Gotham Books.
Bok, S. (1978). Lying: Moral choice in public and private life. New York: Pantheon.
Bok, S. (1984). Secrets: On the ethics of concealment and revelation. New York: Vintage Books.
Bordia, P., & Rosnow, R. L. (1995). Rumor rest stops on the information highway: A naturalistic study of transmission patterns in a computer-mediated rumor chain. Human Communication Research, 25, 163–179.
Bragger, J. D., & Freeman, M. A. (1999). Using a cost–benefit analysis to teach ethics and statistics. Teaching of Psychology, 26, 34–36.
Brownlee, K. A. (1955). Statistics of the 1954 polio vaccine trials. Journal of the American Statistical Association, 50, 1005–1013.
Carling, C., Kristoffersen, D. T., Herrin, J., Treweek, S., Oxman, A. D., Schünemann, H., et al. (2008). How should the impact of different presentations of treatment effects on patient choice be evaluated? PLoS One, 3(11), e3693.
Cicerone, R. J. (2010). Ensuring integrity in science. Science, 327, 624.
Clancy, K., Hamm, M., Levine, A. S., & Wilkins, J. (2009). Organics: Evidence of health benefits lacking. Science, 325, 676.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Committee on Science, Engineering, and Public Policy. (2009). On being a scientist: A guide to responsible conduct in research (3rd ed.). Washington, DC: National Academies Press.
Edwards, A. W. F. (1963). The measure of association in a 2×2 table. Journal of the Royal Statistical Society, 126, 109–114.
Fairchild, A. L., & Bayer, R. (1999). Uses and abuses of Tuskegee. Science, 284, 918–921.
Francis, T., Jr., Korns, R. F., Voight, R. B., Boisen, M., Hemphill, F., Napier, J., & Tolchinsky, E. (1955). An evaluation of the 1954 poliomyelitis vaccine trials: A summary report. American Journal of Public Health, 45(5), 1–63.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2008). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8, 53–96.
Harris, G. (March 10, 2009). Doctor admits pain studies were frauds, hospital says. The New York Times. Retrieved March 20, 2009, www.nytimes.com/2009/03/11/health/research/11pain.html?ref=us.
Haywood, H. C. (1976). The ethics of doing research . . . and of not doing it. American Journal of Mental Deficiency, 81, 311–317.
Huff, D. (1982). How to lie with statistics. New York: Norton.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian Journal of Nursing, 24, 1–2.
Jones, H. H. (1993). Bad blood: The Tuskegee syphilis experiment (Revised edition). New York: Free Press.
Kaiser, J. (September 18, 2009). Researcher, two universities sued over validity of prostate cancer test. Science, 235, 1484.
Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1, 427–451.
Kazdin, A. E. (2008). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. American Psychologist, 63, 146–159.
Katz, J. (1972). Experimentation with human beings. New York: Russell Sage.
Kelman, H. C. (1968). A time to speak: On human values and social research. San Francisco, CA: Jossey-Bass.
Kimmel, A. J. (1991). Predictable biases in the ethical decision making of psychologists. American Psychologist, 46, 786–788.
Kossinets, G., & Watts, D. J. (2006). Empirical analysis of an evolving social network. Science, 311, 88–90.
Mark, M. M., Eyssell, K. M., & Campbell, B. (1999). The ethics of data collection and analysis. In J. L. Fitzpatrick & M. Morris (Eds.), Ethical issues in program evaluation (pp. 47–56). San Francisco, CA: Jossey-Bass.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378.
Milgram, S. (1975). Obedience to authority: An experimental view. New York: Harper Colophon Books.
Miller, J. D. (2007). Finding clinical meaning in cancer data. Journal of the National Cancer Institute, 99(24), 1832–1835.
Morris, J. A., & Gardner, M. J. (2000). Epidemiological studies. In D. A. Altman, D. Machin, T. N. Bryant, & M. J. Gardner (Eds.), Statistics with confidence (2nd ed., pp. 57–72). London: British Medical Journal Books.
Mosteller, F. (1968). Association and estimation in contingency tables. Journal of the American Statistical Association, 63, 1–28.
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (April 18, 1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Retrieved April 8, 2009, http://ohsr.od.nih.gov/guidelines/belmont.html.
National Commission on Research. (1980). Accountability: Restoring the quality of the partnership. Science, 207, 1177–1182.
NICE. (2004). Depression: Management of depression in primary and secondary care (Clinical practice guideline No. 23). London: National Institute for Health and Clinical Excellence.
NICHSR. (2010). HTA 101: Glossary. Retrieved March 3, 2010, from http://www.nlm.nih.gov/nichsr/hta101/ta101014.html
Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17, 776–783.
Orne, M. T. (1969). Demand characteristics and the concept of quasi-control. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 143–179). New York:
Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 110–137).
Panter, A. T., & Sterba, S. K. (2011). Handbook of ethics in quantitative methodology. Taylor & Francis.
Pattullo, E. L. (1982). Modesty is the best policy: The federal role in social research. In T. L. Beauchamp, R. R. Faden, R. J. Wallace, Jr., & L. Walters (Eds.), Ethical issues in social research (pp. 373–390). Baltimore, MD: Johns Hopkins University Press.
Paulos, J. A. (1990). Innumeracy: Mathematical illiteracy and its consequences. New York: Vintage Books.
Paulos, J. A. (2003). A mathematician plays the stock market. New York: Basic Books.
PLoS Medicine Editors. (September, 2009). Ghostwriting: The dirty little secret of medical publishing that just got bigger. PLoS Medicine, 6(9), e1000156. Accessed September 19, 2009, from http://www.plosmedicine.org/static/ghostwriting.action.
Prasad, K., Jaeschke, R., Wyer, P., Keitz, S., & Guyatt, G. (2008). Tips for teachers of evidence-based medicine: Understanding odds ratios and their relationship to risk ratios. Journal of General Internal Medicine, 23(5), 635–640.
Reynold, P. D. (1975). Value dilemmas in the professional conduct of social science. International Social Science Journal, 27, 563–611.
Rosenthal, R. (1994). Science and ethics in conducting, analyzing, and reporting psychological research. Psychological Science, 5, 127–134.
Rosenthal, R., & Rosnow, R. L. (1969). The volunteer subject. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 59–118). New York: Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 48–92).
Rosenthal, R., & Rosnow, R. L. (1975). The volunteer subject. New York: Wiley-Interscience. (Reissued in Rosenthal & Rosnow, 2009, pp. 667–862).
Rosenthal, R., & Rosnow, R. L. (1984). Applying Hamlet's ques-
Rosnow, R. L., & Rosenthal, R. (2008). Assessing the effect size of outcome research. In A. M. Nezu & C. M. Nezu (Eds.), Evidence-based outcome research (pp. 379–401). New York: Oxford University Press.
Rosnow, R. L., & Rosenthal, R. (2011). Ethical principles in data analysis: An overview. In A. T. Panter & S. K. Sterba (Eds.), Handbook of ethics in quantitative methodology (pp. 37–58). New York: Routledge.
Rosnow, R. L., Rotheram-Borus, M. J., Ceci, S. J., Blanck, P. D., & Koocher, G. P. (1993). The institutional review board as a mirror of scientific and ethical standards. American Psychologist, 48, 821–826.
Rosnow, R. L., Skleder, A. A., Jaeger, M. E., & Rind, B. (1994). Intelligence and the epistemics of interpersonal acumen: Testing some implications of Gardner's theory. Intelligence, 19, 93–116.
Sales, B. D., & Folkman, S. (Eds.). (2000). Ethics in research with human participants. Washington, DC: American Psychological Association.
Saxe, L. (1991). Lying: Thoughts of an applied social psychologist. American Psychologist, 46, 409–415.
Schwartz, L. M., & Woloshin, S. (2003). On the prevention and treatment of exaggeration. Journal of General Internal Medicine, 18(2), 153–154.
Scott-Jones, D., & Rosnow, R. L. (1998). Ethics and mental health research. In H. Friedman (Ed.), Encyclopedia of mental health (Vol. 2, pp. 149–160). San Diego, CA: Academic Press.
Smith, M. B. (1969). Social psychology and human values. Chicago, IL: Aldine.
Smith, N. C., Kimmel, A. J., & Klein, J. G. (2009). Social contract theory and the ethics of deception in consumer research. Journal of Consumer Psychology, 19, 486–496.
Spielmans, G. I., & Parry, P. I. (2010). From evidence-based medicine to marketing-based medicine: Evidence from internal industry documents. Journal of Bioethical Inquiry:
tion to the ethical conduct of research. American Psychologist, doi:10.1007/s11673-010-9208-8.
45, 775–777. Starobin, P. (January 28, 1997). Why those hidden cameras hurt
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behav- journalism. The New York Times, p. A21.
ioral research: Methods and data analysis (3rd ed.). New York: Steering Committee of the Physicians’ Health Study Research
McGraw-Hill. Group. (1989). Final report on the aspirin component of the
Rosenthal, R., & Rosnow, R. L. (2009). Artifacts in behavioral ongoing Physicians’ Health Study. New England Journal of
research. New York: Oxford University Press. Medicine, 321, 129–135.
Rosnow, R. L. (1990). Teaching research ethics through role-play Strohmetz, D. B., & Skleder, A. A. (1992). The use of role-play
and discussion. Teaching of Psychology, 17, 179–181. in teaching research ethics: A validation study. Teaching of
Rosnow, R. L. (1997). Hedgehogs, foxes, and the evolving Psychology, 19, 106–108.
social contract in psychological science: Ethical challenges Stryker, J. (April 13, 1997). Tuskegee’s long arm still touches a
and methodological opportunities. Psychological Methods, 2, nerve. The New York Times, p. 4.
345–356. Tatsioni, A., Bonitsis, N. G., & Ioannidis, J. P. A. (2007). Per-
Rosnow, R. L., & Rosenthal, R. (1970). Volunteer effects in sistence of contradicted claims in the literature. Journal of the
behavioral research. In K. H. Craik, B. Kleinmuntz, R. L. American Medical Association, 298, 2517–2526.
Rosnow, R. Rosenthal, J. A. Cheyne, & R. H. Walters, Trials of War Criminals before the Nuernberg Military Tribunals
New directions in psychology (pp. 211–277). New York: Holt, under Control Council Law No. 10, October 1946-April
Rinehart & Winston. 1949, Vol. II. Washington, DC: U.S. Government Printing
Rosnow, R. L., & Rosenthal, R. (1995). “Some things you Office.
learn aren’t so”: Cohen’s paradox, Asch’s paradigm, and the Tufte, E. R. (1983). The visual display of quantitative information.
interpretation of interaction. Psychological Science, 6, 3–9. Cheshire, CT: Graphics Press.
Rosnow, R. L., & Rosenthal, R. (1996). Contrasts and inter- Tufte, E. T. (1990). Envisioning information. Cheshire, CT:
actions redux: Five easy pieces. Psychological Science, 7, Graphics Press.
253–257. Tufte, E. T. (2006). Beautiful evidence. Cheshire, CT: Graphics
Rosnow, R. L., & Rosenthal, R. (1997). People studying people: Press.
Artifacts and ethics in behavioral research. New York: W. H. Tukey, J. W. (1977). Exploratory data analysis. Reading, MA:
Freeman. Addison-Wesley.
rosnow, rosenthal 53
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Wainer, H. (2009). Picturing the uncertain world. Princeton, NJ:
Rosenthal, R. (2008). Selective publication of antidepressant Princeton University Press.
trials and its influence on apparent efficacy. New England Wainer, H., & Thissen, D. (1981). Graphical data analysis.
Journal of Medicine, 358(3), 32–40. Annual Review of Psychology, 32, 191–241.
VandenBos, G. R. (Ed.). (2007). APA dictionary of psychology. Wasserman, S., & Faust, K. (1994). Social network analysis:
Washington, DC: American Psychological Association. Methods and applications. Cambridge, England: Cambridge
Vedula, S. S., Bero, L., Scherer, R. W., & Dickersin, K. University Press.
(2009). Outcome reporting in industry-sponsored trials Woloshin, S., Schwartz, L. M., Casella, S. L., Kennedy, A. T., &
of Gabapentin for off-label use. New England Journal of Larson, R. J. (2009). Press releases by academic medical cen-
Medicine, 361(20), 1963–1971. ters: Not so academic? Annals of Internal Medicine, 150(9),
Vinacke, W. E. (1954). Deceiving experimental subjects. Ameri- 613–618.
can Psychologist, 9, 155. Zuckerman, M., Hodgins, H. S., Zuckerman, A., & Rosenthal,
Wainer, H. (1984). How to display data badly. American R. (1993). Contemporary issues in the analysis of data: A
Statistician, 38, 137–147. survey of 551 psychologists. Psychological Science, 4, 49–53.
Wainer, H. (1996). Depicting error. American Statistician, 50(2),
101–111.
Wainer, H. (2000). Visual revelations: Graphical tales of fate and
deception from Napoleon Bonaparte to Ross Perot. Mahwah, NJ:
Erlbaum.
54 q u a n t i tat i v e m e t h o d s a n d e t h i c s
CHAPTER 4

Special Populations

Keith F. Widaman, Dawnté R. Early, and Rand D. Conger
Abstract
Special populations offer unique opportunities and challenges for mathematical/statistical modeling of
data. First, we discuss several ways of construing the notion of special populations, including the basis
on which we argue that the general notion of special populations is a rather recent one. Then, we
discuss four key methodological implications when considering special populations: (1) properly
defining and accessing participants from the special population; (2) ensuring that the same dimensions
are present across multiple populations; (3) assessing whether empirical implications of psychological
theories hold across populations; and (4) exploiting unusual variation in special populations that may
allow tests of unique hypotheses. Next, we provide examples that illustrate how to deal with each of
the methodological issues. We close with a discussion of questions occasioned by our treatment of
special populations, questions that represent topics for future development.
Key Words: Special populations, group differences, factorial invariance, measurement invariance,
structural invariance, individual differences
the nature of special populations and the dynamics of psychological and biological processes in these groups. We close with conclusions and a series of questions that represent issues for future research.

In pursuing the nature and implications of special populations for quantitative methods, we performed several literature searches to help bound or circumscribe our thinking about special populations. A PsycINFO search in August 2010 using "special population" and "special populations" as possible title words yielded 397 citations. The oldest citation was to a publication by Uehling in 1952, the next oldest was to a publication in 1975, and 390 (or more than 98%) of the 397 citations were to publications in 1980 or later. Thus, the clear majority of publications listing "special population" as a title word have occurred relatively recently, within the last 30 years.

More inclusively, we next conducted a PsycINFO search using "special population" and "special populations" as possible keywords. This search turned up 1546 citations, a far broader net of citations than provided by the title-word search. Here, the oldest citation was to a chapter in a volume resulting from a White House Conference on Child Health and Protection, published in 1931 during the Hoover administration (see Folks, 1931). The second oldest citation was the Uehling (1952) paper that was the oldest publication in the title-word search, and 1503 (or more than 97%) of the 1546 citations were to publications in 1980 or later.

Finally, a PsycINFO search using "special population" or "special populations" as phrases to be found anywhere in the database led to 4266 citations, our most inclusive list of reference citations. The two oldest citations were the Folks (1931) and Uehling (1952) papers uncovered in the earlier searches, and a full 4221 (or more than 98.9%) of the citations were to works published in 1980 or later. Consistent with the preceding two searches, the vast bulk of citations under the explicit title or heading of "special populations" occurred within the past three decades.

Each of these three searches supports the conclusion that work on or thinking about special populations under that specific rubric is a fairly recent phenomenon. However, work on special populations has been a hallmark of research in psychology for more than 100 years, even if this research has not been published under the heading of special populations. For example, research and theory on persons with mental illness can be traced back more than 200 years, as can research on persons with mental retardation or intellectual disability. Clinical methods in use more than 200 years ago for dealing with persons with mental illness or intellectual disability appear barbaric to the twenty-first-century practicing scientist or informed citizen; indeed, many clinical methods in use only 50 or 75 years ago seem rather unfortunate and misguided. Thus, research on special populations has long been pursued in psychology, allied behavioral and social sciences, and medical sciences. But the time is ripe for renewed exploration of behavioral phenomena in special populations to provide a fuller understanding of persons in these populations, which should lead to improved ways of treating and improving the lives of persons in special populations.

Conceptions of Special Populations

To consider methodological implications and applications related to special populations, one must first consider what a special population is. The term special population has no obvious and ready referent. That is, if one asked 10 different psychologists what was meant by "special populations," one might get 10 different answers, with little in the way of overlap aside from the indication that "special population" implies deviation from general population norms. To confront this definitional problem, we read a random sample of publications identified in our literature searches to identify the ways different researchers used the term "special populations." In the rest of this section, we discuss several ways in which investigators have used the term, admitting that our categorization is not exhaustive.

Disability Groups

One common use of the term special population appears, implicitly, to represent a less pejorative way of referring to one or another disability group. Research on persons with a slowed rate of mental development was long published under the heading of research on mentally retarded persons, with a clear implication that this was a special population. More recently, the term mentally retarded persons was replaced by persons with mental retardation, a less pejorative term; still more recently, the accepted term is persons with intellectual disability. Historical changes in these terms are reflected in the name of the leading professional organization for research and practice in this domain, which has changed its name from the American Association on Mental Deficiency to the American Association on Mental Retardation and then to the American
Association on Intellectual and Developmental Disabilities. Regardless of the precise term used to refer to this population, the key defining feature is that a person is considered to be a member of the special population if he or she has exhibited a significantly slower rate of mental development relative to persons in the general population.

As noted above, persons with mental illness are another group that is treated as a special population, one defined on the basis of a perceived disability. Many additional disability groups can be identified, including persons with visual deficits or blindness, persons with hearing loss or deafness, persons with physical disabilities, and individuals with learning disabilities, to name only a few. The one family resemblance shared by all of these uses of the term "special population" is that the generic label is applied to a group to denote the presence of a behavioral disability that is common to all members of the group.

One problem with the use of the term "special population" in connection with disability groups arises in the context of comorbidity, or the presence of more than one identifier of disability. For example, consider the case of a person with intellectual disability who is also deaf. Is this person a member of two different special populations—that is, a member of the special population of persons with intellectual disability and also a member of the special population of persons who are deaf? Or is the person a member of a new and still more special population, the population of persons with dual diagnoses of intellectual disability and deafness? No resolution of this issue is proposed here. Rather, we raise it to highlight a vexing problem in evolving notions of disability groups.

Research on disability groups takes a number of forms, and we survey a few of these here. As one example, Martz and Daniel (2010) collected data in the United States and Kenya to determine whether disability prototypes for four disability groups (persons with AIDS, hearing impairment, mental illness, and spinal cord injury) differed across the four groups and whether any differences were moderated by country of residence. As a second example, Or, Cohen, and Tirosh (2010) used discriminant analysis to determine whether measures from a parent-rated questionnaire could discriminate among three groups of students: those with attention-deficit hyperactivity disorder (ADHD), those with a learning disability, and those with a combination of ADHD and learning disability. Finally, Cobb, Lehman, Newman-Gonchar, and Alwell (2009) synthesized results from prior meta-analyses of self-determination interventions for various categories of persons with disability. In some meta-analyses, research on students from any disability group was included, whereas other meta-analyses focused on more restricted groups, such as students with intellectual or developmental disabilities or students with ADHD. The synthesis by Cobb et al. ended on an unfortunate note, concluding that interventions to enhance levels of self-determination appear to have relatively weak effects, perhaps because of the heterogeneity of the special populations combined in the meta-analyses.

Echoing issues identified above, Barton (2009) has decried the lack of a common definition of the basic term disability by persons with disabilities or their advocates. In 1990, the U.S. Congress passed the landmark Americans with Disabilities Act, which outlawed discrimination on the basis of disability and outlined many ways in which accommodations for persons with disabilities must be provided. Given this legislation and the relatively large numbers of persons in the United States who fall in one or another disability category, the population of persons with disability could and should be a potent political force. But without a common and inclusive definition of the term disability, the political clout of this population typically is watered down, as advocates for particular disability groups (e.g., persons with intellectual disability, students with deafness) follow their own special agendas. If persons with disability and their advocates could unite subgroups under a commonly accepted and inclusive definition of disability, this would almost certainly lead to a more united stance on issues and thus to greater power. Until that time, the political power of persons with disabilities is likely to remain fragmented and therefore weaker.

"Superability" Groups

In contrast to the notion of disability grouping, "special population" can also be used to define individuals with superior levels of ability or performance, which can be termed superability groups to contrast with the term disability group. One of the most common "superability" groups is the intellectually gifted, often defined as persons with intelligence quotients (IQs) greater than or equal to 130. But giftedness has many dimensions; some individuals are considered gifted in general, whereas others may be deemed gifted in narrower domains such as the arts or specific academic domains.
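The IQ cutoff used above to define the intellectually gifted has a direct quantitative consequence: under the conventional normal model for IQ (mean 100, standard deviation 15; these parameters are the standard scaling convention, not values reported in this chapter), a cutoff of 130 marks off a fixed, small share of the general population. A minimal sketch:

```python
import math

def normal_sf(x, mean=100.0, sd=15.0):
    """Survival function of a normal distribution: P(X >= x)."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# IQ >= 130 is 2 standard deviations above the mean, so roughly
# 2.3% of the general population qualifies under the normal model.
p_gifted = normal_sf(130)
```

One practical implication is the recruiting difficulty discussed later in the chapter: a superability group defined this stringently is, by construction, rare in any unselected sample.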
(e.g., African-American) were selected using such a test, then one could justify the result by claiming that the lower level of selection resulted from lower levels of scores obtained by members of the special population on unbiased tests. Although so-called "Cleary tests" have been used routinely for more than four decades in college admissions and in personnel selection, concerns about the utility of the approach remain (e.g., Meade & Tonidandel, 2010; Colarelli, Han, & Yang, 2010).

In retrieving literature that references "special populations," we were struck by the frequency with which authors who referred to special populations cited groups such as women, children, the elderly, and minority ethnic groups. This designation of special populations implies that non-women, non-children, non-elderly, non-minority individuals—that is, adult White males and, perhaps, college sophomores—are "the" populations that serve as common reference groups, and anyone who deviates from these norms is a member of a special population.

Why White adult males and college sophomores became standard reference groups is anyone's guess. Most likely, the emergence of these reference groups resulted from not just one but a combination of factors. Since the founding of the United States of America, with few notable exceptions, adult White males have tended to occupy the highest positions of power in business, academia, and government. Regardless of the optimality of this distribution of power, the mere presence of this demographic in positions of power may have led, explicitly or implicitly, to acceptance of adult White males as the reference group against which outcomes for other groups would be compared. Additionally, the use of a single demographic, such as adult White males, might have been considered a way to reduce heterogeneity that might otherwise cloud research results. Assuming that results for different demographic groups should not differ dramatically, reduction of heterogeneity might allow trends in data to be seen more clearly.

Other reasons are, almost surely, responsible for the use of college sophomores as a standard or reference group. The top reason for selecting college sophomores as "typical" research participants must be convenience. When taking introductory psychology courses in college, students often must participate in experiments so that they learn about how studies are conducted, and college sophomores are a common demographic in introductory psychology classes. A quarter-century ago, Sears (1986, updated by Henry, 2008; Sears, 2008) decried the use of college sophomores as typical research participants, arguing that college sophomores may provide systematically different responses than members of the general population on many, if not most, experimental questionnaires and paradigms. College sophomores are in a stage of life when they are attempting to "find" or define their identities and thus may be much more susceptible to various influences, such as experimental inductions, than would others. Despite the potentially limited utility of college sophomores as research participants, given their unrepresentativeness relative to the population, convenience in obtaining their participation is a leading cause of their continued predominance in studies in social and personality psychology.

Political pressure often appears to be another factor related to the choice of research participants, either restricting or promoting research on a given topic. If a researcher chooses to study a topic (e.g., romantic love) or a group (e.g., homosexual males) that a member of Congress finds objectionable, the research project might be highlighted as a waste of taxpayer money. Senator William Proxmire (D-Wisconsin) made headlines in the 1970s and 1980s when announcing his Golden Fleece Awards, which derided programs of research he personally opposed. On the other hand, the decades-long support for research on persons with intellectual disability by the members of the Kennedy family in Congress led to far more research funding in this area than otherwise would have occurred. Indeed, continued research support for any special population usually requires the presence of a special champion in Congress for that population, given conflicting funding priorities at the national level.

Yet another reason why certain groups are not often represented as research participants is the sheer difficulty in finding and recruiting participants in these groups. Masten and colleagues (e.g., Obradović, Long, Cutuli, Chan, Hinz, Heistad, & Masten, 2009) have been studying homeless and highly mobile children, and identifying members of this group and then tracking them longitudinally has been difficult. Other challenges often face researchers who study court-related samples, such as victims of physical or sexual abuse or children or adolescents in foster care, where concurrence of legal entities is yet another impediment to research.

Regardless of the basis for the designation of reference populations, signs of the breakup of the hegemony of White adult males and college sophomores as reference groups are clearly in evidence.
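The regression logic behind the "Cleary tests" mentioned at the start of this section can be sketched in a few lines: a test is regarded as unbiased in Cleary's sense when the regression of the criterion on the test score is (approximately) the same across groups. The data and tolerances below are hypothetical illustrations, not values from any study cited in this chapter:

```python
from statistics import mean

def ols(x, y):
    """Ordinary least squares for a single predictor: returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical test scores (x) and criterion performance (y) for two groups
group_a_x, group_a_y = [1, 2, 3, 4, 5], [2.1, 4.0, 6.2, 7.9, 10.1]
group_b_x, group_b_y = [1, 2, 3, 4, 5], [2.0, 4.1, 5.9, 8.0, 9.9]

a_int, a_slope = ols(group_a_x, group_a_y)
b_int, b_slope = ols(group_b_x, group_b_y)

# Cleary's criterion: the test is unbiased if both groups share
# (approximately) the same regression line; these tolerances are arbitrary.
unbiased = abs(a_slope - b_slope) < 0.2 and abs(a_int - b_int) < 0.5
```

In practice one would test the equality of slopes and intercepts formally, for example with group-by-test interaction terms in a pooled regression, rather than with fixed tolerances.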
Biological or Genetic Markers of Group Membership

A fifth and final type of special population is perhaps the most current and state-of-the-art way of defining special populations—by the presence of specific biological or genetic markers that have ties to behavior. Since its rebirth in the early 1970s, behavior genetic research has accumulated at a seemingly ever-increasing rate. The general thrust of findings in this field is that most, if not all, behavioral traits have some heritability, and certain important traits (e.g., intelligence) have high levels of heritability. Consistent with this focus, research continues to focus on biological markers related to behavior, where biological markers are broadly defined, from the molecular level of particular genes or single-nucleotide polymorphisms (SNPs) to the molar level of performance on experimental tasks.

Searching for the molecular genetic bases of behavioral traits is a major focus at the present time (Petrill, 2010). The successful identification of particular genes related to a certain condition is often touted in the press. For example, Chakrabarti et al. (2010) recently reported that certain genes associated with sex steroids and neural growth are related to autistic traits and Asperger syndrome. But the mere presence of certain SNPs may not be the key. Rather, gene expression at the molecular level may be the important feature, as Simunovic et al. (2009) reported in their study of the pathology underlying Parkinson's disease. Although successes in finding certain SNPs or particular forms of gene expression related to a disease or behavior have been reported, most of these involve very small portions of variance explained and suffer from lack of replication in follow-up investigations.

Perhaps more promising is research at a more molar level. For example, Pennington et al. (2008) reported synaptic and metabolic abnormalities in the prefrontal cortex in persons with schizophrenia or bipolar disorder. Focusing on the anterior cingulate, Eastwood and Harrison (2010) found increased synaptic transmission and plasticity in persons with bipolar disorder. At a still more molar level, Koychev, El-Deredy, Haenschel, and Deakin (2010) reported visual information processing deficits in persons with schizotypy, which is a marker of vulnerability to schizophrenia. And Reichenberg, Caspi, Harrington, Houts, Keefe, Murray, Poulton, and Moffitt (2010) argued that they had identified patterns in performance on standard psychological tests that reflected cognitive deficits related to childhood schizophrenia. The range of levels at which this research is undertaken—from synaptic processes to molar patterns in behavior—is remarkable, yet the patterns uncovered all point to the biological nature of the processes involved.

Two additional ways of understanding biological markers and their effects deserve mention here. First, the search for a solitary SNP or a small number of SNPs responsible for a particular psychological or behavioral trait is almost certainly an unrealistic goal. A more likely outcome is reflected in research on phenylketonuria (PKU). As reported on the Phenylalanine Hydroxylase Locus Knowledgebase website (http://www.pahdb.mcgill.ca/), more than 500 mutations on the phenylalanine hydroxylase (PAH) gene have been identified, and all of these lead to reduced metabolism of phenylalanine into tyrosine, the underlying problem in PKU. Although a small number of mutations (e.g., 5 or 6) may account for the majority of mutations in particular populations (e.g., European), any of the mutations can cause PKU. Further, many of the mutations have been categorized with regard to severity, indexed by the degree of disruption of phenylalanine metabolism. If a large number of mutations are found to underlie a single, rather restricted phenotype such as PKU, we should expect that very large numbers of mutations or SNPs are related to broader phenotypes such as intelligence, intellectual disability, or personality disorders.

Second, we think that researchers must pay at least as much attention to the environment as to gene SNPs when searching for genes that affect behavior, an approach that has been characterized as the search for Gene × Environment, or G × E, interactions. Specific genes or SNPs are important in the G × E approach, but the SNPs alone do not directly herald the emergence of a behavioral trait. Rather, behavioral differences among groups identified with different genetic alleles may arise only in particular environmental circumstances or may be clearly exacerbated in such environments, so main effects of genes or environments supply less-than-complete information. Rather than main effects, the G × E interaction indicates that the effects of genetic alleles are moderated by environmental circumstances, so environments modulate how genetic factors are expressed in behavior. Two early papers on the G × E approach, by Caspi et al. (2002) and by Caspi et al. (2003), were both published in Science, the leading general journal in all of science. In the second of these papers, Caspi et al. (2003) investigated the 5-HTTLPR region, which is associated with
most graduate education in psychology, quantitative experts teach classes in which students are prepared to obtain data from a sample (typically described in nebulous terms); take standard, off-the-shelf statistical methods; estimate population parameters in a particular analytic model; and use their results to make conjectures about the population, as if the population were a single, monolithic entity. But, once one acknowledges the presence of special populations (which represent specially identified subsets of the larger population), our thinking about methodological and statistical procedures must be amended. Rather than estimating "the" population parameter in a monolithic population, we should begin trying to understand the nature of special populations, how parameter estimates might vary across populations, when it is possible to compare parameter estimates across groups, and similar difficult questions. In this section, we slice up this task under four headings—identifying and accessing special populations, measuring the same constructs across groups, exploring the bounds of psychological theories, and exploiting unusual variation in special populations—and discuss the methodological implications of the substantive issues that arise.

Identifying and Accessing Participants in Special Populations

The first task of any investigation into participants from special populations is gaining access to the participants. But this characterization of "gaining access to the participants" masks several necessary steps in the design and conduct of an empirical investigation. To provide some cognitive structuring of this first concern, we have broken down the research process associated with drawing a sample from a special population into four steps or issues.

The first step is the clear identification of the special population to be studied. As earlier sections of this chapter have shown, a single, overarching notion of what constitutes a special population is not present in the research literature. Rather, various ways of defining special populations can be used, and some of these will provide partially overlapping subsets of possible participants. Because of the seemingly murky nature of some special populations, we offer only the most general recommendations of how to deal with the problem of identification of the population. Researchers should be careful to develop a clear statement of the population to be studied. If the target population is all persons receiving mental health services in a catchment area, then no explicit or implicit exclusion criteria should be used, or one might inadvertently draw an unrepresentative sample. Of course, a more restricted population could be the topic of a study, such as persons from minority groups who seek mental health services with regard to alcohol problems but without attendant drug use problems. We hope the reader is clear on the issue of identification of a special population: both inclusion and exclusion criteria for participants must be clearly elucidated and justified by the nature of the study, and the research team should remain vigilant to ensure that no unexpected factors are biasing the nature of the population defined.

A second issue is the development of a plan to access participants from the special population. If one is interested in studying clinical populations, access to potential participants will often be sought through professional agencies, such as mental health centers, regional centers that offer services for members of particular populations (e.g., persons with intellectual disability), state agencies, or the like. If the goal is to study students in elementary school, developing ties to schools is the most obvious approach. But members of some populations are much harder to access. For example, researchers who study child physical or sexual abuse or neglect must often access their participants through arrangements with court systems, child welfare organizations, and so forth. In these research endeavors, investigators frequently find that a great deal of time must be spent and a large amount of red tape surmounted to gain access to research subjects. Still, one should not let the difficulty of access alter the research goals of a program of research. To provide the most valid data from the most representative settings, proper and optimal access must be developed.

A third step, once the special population has been identified and modes of accessing participants have been developed, is to develop a plan for sampling participants from the population. In some situations, a sampling plan can be relatively straightforward. For example, a researcher might select each birth (or every second birth) from a set of hospitals during a given time period as the sample to be drawn, and this will often lead to the drawing of a representative sample from the population that is served by the hospitals in question. Studying birth outcomes in low-SES families might be approached in this fashion, if the hospitals selected tended to serve the low-SES families in a community. But if a researcher intends to investigate differences across certain strata (e.g., African-American, Hispanic [or Latino]) and participants from these
64 Special Populations
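The straightforward sampling plan described above (selecting every second birth from a set of hospitals over a given time period) amounts to systematic sampling from an ordered frame. A minimal sketch; the record identifiers here are hypothetical, not drawn from any study:

```python
# Systematic sampling sketch: draw every k-th record from an ordered
# sampling frame, here hypothetical birth records from a set of hospitals.
def systematic_sample(frame, k, start=0):
    """Return every k-th element of `frame`, beginning at index `start`."""
    return frame[start::k]

# Hypothetical frame: births recorded in chronological order
births = [f"birth_{i:03d}" for i in range(1, 21)]

# Every second birth, beginning with the first recorded birth
sample = systematic_sample(births, k=2)
```

If the frame order is unrelated to the outcome, a systematic sample such as this behaves much like a simple random sample of the hospitals' population.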
have been published under several rubrics, including methods to assess measurement invariance, factorial invariance, or lack of differential item functioning (DIF). The upshot of this concern leads to an important implication when researching special populations:

Implication 2: Establishing the measurement invariance of instruments across groups is a key result in the investigation of any special population. Measurement invariance must hold for meaningful comparisons to be made across samples from different populations.

Measurement invariance is a broad topic, one that subsumes research and statistical models that cross many boundaries. For example, some work on measurement invariance has looked at prediction models, which may be used to predict some behavioral outcome (e.g., college grade point average) from relevant predictors (e.g., high school grade point average, admission test scores). Multiple models can be employed in such research, and regression analysis is often used. In regression analysis, questions arise regarding any potential intercept bias or slope bias across groups when predicting the outcome.

When measurement invariance is considered within the context of factor analysis models, the term applied is usually factorial invariance, which is a restricted form of measurement invariance (Meredith, 1993). In factor models, we typically begin with a data model of the form:

Yji = τj + λj1 η1i + · · · + λjr ηri + εji ,    (1)

In matrix notation, this data model can be written as Yi = τ + Ληi + εi (Equation 2), where Yi is a (p × 1) vector of scores of person i on the p manifest variables, τ is a (p × 1) vector of intercepts, Λ is a (p × r) matrix of loadings of the p manifest variables on the r latent variables, ηi is an (r × 1) vector of scores for person i on the r latent variables, and εi is a p × 1 vector of scores of person i on the p unique factors. One can use the model in Equation 2 to develop moment expectations for the manifest variables, moment expectations that yield expressions for the covariance structure and the mean structure of the manifest variables. In a single-group case, these expectations are:

Σ = ΛΨΛ′ + Θ    (3a)
μ = τ + Λα,    (3b)

where Σ is the p × p matrix of population covariances among manifest variables, Ψ is the (r × r) matrix of covariances among the latent variables, Θ is the p × p matrix (usually diagonal) of covariances among unique factors, μ is a p × 1 vector of population means on manifest variables, α is an (r × 1) vector of means on the latent variables, and other symbols were defined above. Equation 3a is the population covariance structure model, and Equation 3b is the population mean structure model.

In any sample from a population, we observe sample covariances among manifest variables, which we signify as S, and sample means on the manifest variables, which we can signify as Ȳ. Given these sample estimators of population values, we can write the sample covariance and mean models as:

S ≅ Λ̂Ψ̂Λ̂′ + Θ̂ = Σ̂    (4a)
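The moment expectations in Equations 3a and 3b are easy to verify numerically. The sketch below computes the model-implied covariance matrix and mean vector for a small one-factor model (p = 4, r = 1); all parameter values are illustrative, not estimates from this chapter:

```python
import numpy as np

# Hypothetical one-factor model: p = 4 manifest variables, r = 1 latent variable
Lam = np.array([[0.8], [0.7], [0.6], [0.5]])   # Lambda: (p x r) factor loadings
Psi = np.array([[1.0]])                        # Psi: (r x r) latent covariance matrix
Theta = np.diag([0.36, 0.51, 0.64, 0.75])      # Theta: (p x p) diagonal unique covariances
tau = np.array([2.0, 2.5, 3.0, 3.5])           # tau: (p,) vector of intercepts
alpha = np.array([0.5])                        # alpha: (r,) vector of latent means

# Equation 3a: covariance structure, Sigma = Lam Psi Lam' + Theta
Sigma = Lam @ Psi @ Lam.T + Theta

# Equation 3b: mean structure, mu = tau + Lam alpha
mu = tau + Lam @ alpha
```

With these illustrative values, each manifest variable has unit variance (e.g., the first diagonal element is 0.8² + 0.36 = 1.0), so the loadings double as correlations with the latent variable.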
manifest variables in the same fashion across groups. The resulting equations are:

Sg ≅ Λ̂Ψ̂gΛ̂′ + Θ̂g = Σ̂g    (7a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g ,    (7b)

where all symbols are defined above. Note that both the factor loading matrix and the vector of intercepts have the g subscript deleted, because these matrices have invariant estimates across groups. Two key outcomes accompany strong factorial invariance: (1) all differences across groups in the means on manifest variables are due to mean differences on the latent variables; and (2) group differences in means and variances on the latent variables are identified in a comparable metric across groups, enabling comparisons at the latent variable level across groups. Thus, the latent variable model in Equations 7a and 7b identifies the latent variables as the sources of differences across groups in mean levels on the manifest variables, and variances on latent variables can also be compared.

The fourth and most restricted level of invariance is strict factorial invariance. Under this level of invariance, the unique factor variances are additionally constrained to invariance across groups. The resulting equations for the covariance structure and mean structure, respectively, are:

Sg ≅ Λ̂Ψ̂gΛ̂′ + Θ̂ = Σ̂g    (8a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g ,    (8b)

where all symbols are defined above. As shown in Equation 8a, the g subscript is deleted from the unique factor covariance matrix Θ̂ because estimates in this matrix are invariant across groups. Under strict factorial invariance, all group differences in mean levels and in variances on the manifest variables are due to mean and variance differences, respectively, in the latent variables. Thus, under strict factorial invariance, we have a concise representation of all between-group differences on manifest variables. Although strict factorial invariance is the most concise of the levels of factorial invariance, researchers often find that equality constraints on unique factor variances across groups are too restrictive. This is not a problem, because comparisons across groups on the latent variables are justified if strong factorial invariance holds.

Representing mean and variance/covariance differences across groups. If at least strong factorial invariance is satisfied for a set of data, then latent variables are identified in a form that allows investigation of group differences on the latent variables. These differences across groups are contained in particular matrices in Equations 8a and 8b. Specifically, group differences in mean levels are obtained from the α̂g matrices in Equation 8b. Models are often identified by fixing factor means to zero in one group (e.g., group 1, which serves as the reference group), so mean values for the other groups (e.g., groups 2, . . . , G) are estimated as mean differences from the reference group.

Group differences in variance on the latent variables or covariances among latent variables are obtained from the Ψ̂g matrices in Equation 8a. If latent variables are identified by fixing latent variable variances to unity in the reference group (e.g., group 1), then variances on the latent variables are estimated relative to this reference group.

In summary, if strong factorial invariance holds for a set of data, latent variables are identified in an invariant fashion across groups. Given this, between-group comparisons on mean and/or variance on the latent variables are justified, and these comparisons can be made at the error-free, latent variable level.

Exploring the Bounds of Psychological Theories

As discussed above, obtaining strong factorial invariance allows the researcher to assume that mean and/or variance differences on dimensions of individual difference are interpretable across populations. However, psychological theories frequently lead to predicted relations among constructs. For example, many researchers have sought to outline dimensions of parenting styles, and the isolation and replication of such dimensions—particularly across groups (e.g., across ethnic groups)—is an important matter. But, once these initial steps have been taken, researchers are typically interested in whether parenting styles have impacts on, or at least consistent patterns of asymmetric relations with, other variables, such as child behavior. To pursue such research, we must focus on the relations among latent variables, determining whether the patterns and strength of relations among latent variables are similar or different across groups.

Implication 3: Results of studies of the general population should not be generalized to special populations without research that supports such generalization. Investigations of the structural relations among latent variables across populations hold the key to determining whether conclusions regarding relations among variables generalize to special populations.
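The first key outcome of strong factorial invariance noted above follows directly from Equation 7b: with τ̂ and Λ̂ invariant, the difference in manifest means between any group and the reference group is Λ̂(α̂g − α̂1). A numerical sketch with illustrative (hypothetical) values:

```python
import numpy as np

# Parameters that are invariant under strong factorial invariance (hypothetical values)
Lam = np.array([[0.8], [0.7], [0.6]])   # (p x r) loadings, no g subscript
tau = np.array([2.0, 2.5, 3.0])         # (p,) intercepts, no g subscript

# Latent means: factor mean fixed to zero in group 1 (the reference group)
alpha_1 = np.array([0.0])
alpha_2 = np.array([0.4])               # estimated as a difference from group 1

# Equation 7b applied within each group
mu_1 = tau + Lam @ alpha_1
mu_2 = tau + Lam @ alpha_2

# All manifest mean differences are carried by the latent mean difference:
# mu_2 - mu_1 equals Lam @ (alpha_2 - alpha_1)
diff = mu_2 - mu_1
```

Because the intercepts cancel, every manifest mean difference is proportional to the invariant loadings, which is why group comparisons can be made at the latent variable level.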
process, help establish a common nomenclature for discussing structural invariance.

The first and most basic form of structural invariance is configural structural invariance of the pattern of fixed and free regression weights in the B̂g matrices in Equation 14a. If a restricted pattern of directed paths is estimated from independent latent variables to dependent latent variables and among dependent latent variables, configural structural invariance implies that the same latent variables have directed effects on the same outcome latent variables in each group.

The second level of structural invariance is invariance of the regression weight estimates for predicting certain latent η variables from other latent variables. Paralleling distinctions made above for factorial invariance, placing invariance on the regression weights leads to what may be called weak structural invariance. These regression weights are contained in the B̂g matrices in Equation 14a. If parameter estimates in the B̂g matrices are constrained to invariance across groups, the g subscript would be dropped from these matrices, leading to

Sg ≅ Λ̂ (I − B̂)⁻¹ [ Ψ̂g(i) 0 ; 0 Ψ̂g(d) ] (I − B̂)⁻¹′ Λ̂′ + Θ̂g = Σ̂g ,    (15)

where the bracketed term is the block-diagonal covariance matrix of the independent (i) and dependent (d) latent variables, all other terms are defined above, and the lack of subscripts on the B̂ matrices indicates that across-groups invariance constraints have been imposed on parameter estimates in this matrix.

At least two issues must be mentioned with regard to constraints on regression weights in the B̂ matrices. First, across-group constraints on the B̂ matrices are interpretable only if weak or strong factorial invariance has been established. Thus, if only configural factorial invariance holds for a given set of data, no clear and convincing substantive interpretation can be placed on constraints on elements in the B̂ matrices. Data must support at least the hypothesis of weak factorial invariance to yield interpretable constraints on the B̂ matrices, and still stronger interpretations of such constraints accompany the successful specification of strong factorial invariance. For this reason, we placed the restricted B̂ matrices in the strong factorial invariance model in Equation 15, because strong factorial invariance allows a more adequate basis for discussing these constraints.

Second, if across-group constraints are imposed on the B̂ matrices, the constraints should generally be placed on raw score regression weights (or equivalents of these), rather than standard score regression weights. As in typical multiple regression analyses, raw score regression weights are presumed to be invariant across samples from a common population, whereas standardized regression weights are expected to vary as a result of range restriction resulting from sampling. Now, with latent variable models, the scale of each latent variable may be fixed in any of several ways, and the "raw score regression weights" among the latent η variables will vary as a result. However, if the scale of each latent variable is fixed in one group and strong factorial invariance constraints are placed on the Λ̂ and τ̂ matrices, then the latent η variables are on the same scale across groups. As a result, the regression weights in the B̂ matrices are analogous to raw score regression weights and are on a comparable metric, so invariance constraints on these weights are reasonable a priori hypotheses to test.

A third form of structural invariance involves intercepts for the dependent latent variables, contained in the α̂g(d) matrices. Placing invariance constraints on these latent intercepts leads to strong structural invariance. If across-group constraints on latent intercepts are imposed, then the resulting equation for mean expectations has the form:

Ȳg ≅ μ̂g = τ̂ + Λ̂ [ α̂g(i) ; α̂(d) ] ,    (16)

where all terms are defined above, and the lack of subscripts on the α(d) matrix indicates the presence of cross-group invariance constraints on parameter estimates in this matrix.

More importantly, if these latent intercept terms are constrained to invariance across groups with no significant loss in model fit, then any group differences in mean level on the dependent latent variables result from group differences in mean level on the independent latent variables. This condition is analogous to the distinction at the measured variable level that distinguishes weak factorial invariance from strong factorial invariance; hence, our distinction here between weak and strong structural invariance.

The fourth and final type of structural invariance to be represented and tested involves the residual covariances among the dependent latent variables, contained in the Ψ̂g(d) matrices. If these residual covariances are constrained to invariance, the resulting model is characterized as conforming to strict
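The covariance structure in Equation 15 can be computed numerically as well. The sketch below uses a hypothetical model in which one independent latent variable (η1) predicts one dependent latent variable (η2) through B̂, with Ψ̂ block-diagonal in the independent-factor variance and the dependent-factor residual variance; all numeric values are illustrative:

```python
import numpy as np

# Hypothetical model: eta1 (independent) -> eta2 (dependent), four indicators
Lam = np.array([[0.8, 0.0],
                [0.7, 0.0],
                [0.0, 0.6],
                [0.0, 0.5]])               # (p x r) factor loadings
B = np.array([[0.0, 0.0],
              [0.5, 0.0]])                 # directed effect of eta1 on eta2
Psi = np.diag([1.0, 0.75])                 # block-diagonal: Psi(i) and Psi(d)
Theta = np.diag([0.36, 0.51, 0.64, 0.75])  # unique factor covariances

# Equation 15: Sigma = Lam (I - B)^-1 Psi (I - B)^-1' Lam' + Theta
inv = np.linalg.inv(np.eye(2) - B)
Sigma = Lam @ inv @ Psi @ inv.T @ Lam.T + Theta

# Implied covariance matrix among the latent variables themselves
latent_cov = inv @ Psi @ inv.T
```

Setting the residual variance of η2 to 0.75 makes its total variance 0.5² · 1.0 + 0.75 = 1.0, so in this sketch the latent variables are on a correlation metric and the cross-factor indicator covariances are simply products of loadings and the structural effect.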
emphasizing this issue, we encourage researchers to be vigilant for the presence of unusual variation. Unusual variation may take any of several forms. Persons in a special population may exhibit variation on dimensions on which members of the general population may show no variability at all. An example of this is special artistic abilities, such as individual differences in composing symphonies or directing movies. No one doubts the presence of individual differences among composers and movie directors, and preference for certain music or movies is often a lively topic of discussion. But, we rarely reflect on the fact that relatively few individuals compose symphonies or direct movies, so no individual differences on these dimensions are evident outside the small group of persons who pursue these distinctive endeavors.

Alternately, members of special populations may exhibit unusually large or small variation on behavioral traits relative to the general population. Persons in a special population may show a particular symptomatic behavior and so exhibit essentially no variation in this behavior, which may fall at one end of a behavioral continuum on which the general population exhibits wide individual variability. For example, persons with Down syndrome often have facial features, including small chin, round face, almond eye shape, and so forth, that are an unusual combination in the general population. Thus, the common facial appearance of persons with Down syndrome, which is distinct from other members of the population, is an early signal that the individual is a member of this special population. Or, persons in a special population may exhibit enhanced or enlarged variability on a dimension relative to the general population. Regardless of the form of unusual variability, such variability may offer unique ways to understand differences across groups.

Examples of Quantitative Explorations of Special Populations

Having outlined various methodological issues and approaches that can be used to characterize special populations, we now describe empirical studies that embody each of the methodological issues that we have discussed. The approaches we have taken and the results obtained can be extended readily by other researchers to the study of special populations.

Identifying and Accessing Participants in Special Populations

One example of the problems that arise when recruiting members of a special population was reported by Nary, Froehlich-Grobe, and Aaronson (2011), who sought to recruit wheelchair users into a randomized controlled exercise trial. Nary et al. set out to recruit 180 participants for their study, or 60 persons in each of three cohorts. The research team used a wide variety of recruiting strategies, beginning with the traditional approach of contacting hospitals, health-care organizations, health-care providers, and disability agencies. When these sources of recruitment led to discouraging results, Nary et al. began using more innovative approaches, including putting flyers in movie theaters and public transportation (e.g., busses), advertising in newspapers, newsletters, and so forth, and employing direct mail coupon packets (identified as ValPak). These recruitment efforts were extremely varied, as Nary et al. listed approximately 30 different locations/activities associated with their recruitment. When participants were recruited, they were asked where they heard about the project. Somewhat surprisingly, the single most effective method was the ValPak direct mail coupon packet and other contacts through the media (e.g., radio and TV advertisements, newspaper advertisements); a full third of the sample of participants was recruited through the media contacts. Although the research team did not fully meet their recruitment goal of 180 participants, they did come close to the goal—and only did so by employing a wide array of approaches that cost much more in time and effort than the research team had anticipated. The Nary et al. paper is an entertaining, if sobering, account of the myriad approaches the researchers took to recruit participants and the relative value of the different approaches. Nary et al. have summarized their experiences in a series of five "lessons learned," such as needing to be cognizant that recruiting of persons with disabilities will be more difficult than recruiting persons without disabilities and that more time, effort, and money will be required in recruiting members of special populations than initially thought. The most unfortunate lesson learned was the fourth lesson, which related to the research team's finding that health-care agencies were not especially helpful in recruiting participants, although the special population of wheelchair users has, in general, greater reliance on such agencies. The Nary et al. paper is a wonderful place to start when planning to recruit members of populations with disabilities, and learning from the efforts of these researchers might ease recruitment in future studies.

A second example of ways researchers have dealt with identifying and accessing participants in a
money to cover expenses, including "difficulty paying bills" and "the amount of money left at the end of the month" (response scale: 1 = "no difficulty at all" to 4 = "a great deal of difficulty"); and (3) Financial Cutbacks (F), the number of 11 areas in which the family had to make cutbacks, including "the family postponing major household purchases, changing residence, and/or eliminating medical insurance" (each area scored dichotomously, 0 = "no cutbacks" and 1 = "cutbacks"). Husbands and wives provided answers on the preceding scales, and average responses across husbands and wives were used as analyzed scores. For families in which husbands did not participate, scores on the economic pressure variables were treated as missing data. Scores on all three indicators were reverse-scored, when necessary, so that higher scores indicated greater economic pressure on the family.

Across the three ethnic groups, the depression latent construct was obtained from the Mini-Mood and Anxiety Questionnaire (Mini-MASQ; Clark & Watson, 1997). For this analysis, we used the five items that measure general distress—depression. These items ask both husband and wife to self-report on how much they had felt depressed, discouraged, and/or worthless in the past week. Each item was answered on a scale ranging from 1 = "not at all" to 4 = "very much." Two parcels (one two-item and one three-item parcel) were formed, and these are denoted M1H, M2H, M1W, and M2W for parcels 1 and 2 from husband and wife, respectively.

Assessing factorial invariance across groups. We then fit a series of three-group confirmatory factor models to test factorial invariance of the model across the three samples. The first model, Model 1, was a three-factor model that had the same pattern of fixed and free factor loadings in each group. This configural invariance model fit the data well, χ²(33, N = 1063) = 41.99, with RMSEA of .028, and CFI and TLI values of 0.996 and 0.992, respectively.

Next, we imposed invariance constraints on factor loadings in the Λ matrix across gender and ethnicity. For example, the loadings for the two indicators of husband depression were constrained to be equal to the loadings for the two indicators of wife depression within each sample; these loadings were then constrained to equality across all three samples. The fit of the resulting model, Model 2, was slightly worse; however, indicators of model fit remained in an acceptable range, χ²(42, N = 1063) = 61.25, p = 0.02, with RMSEA of 0.036, and CFI and TLI scores of 0.991 and 0.987, respectively. The latter indices of practical fit suggest that the invariance constraints imposed on Model 2, which represent weak factorial invariance, are appropriate. In prior research, several authors have discussed changes in TLI or CFI values that might be considered practically significant when comparing nested models. Thus, Widaman (1985) considered a difference in the TLI of less than 0.01 when comparing two models to be a practically unimportant difference in fit, and Cheung and Rensvold (1999) and

Figure 4.1 A figural presentation of the relations between latent variables for Economic Pressure, Husband Depression, and Wife Depression and their respective indicators (Note: triangle stands for the unit constant, squares for manifest variables, circles or ellipses for latent variables, single-headed arrows for direct effects, and double-headed arrows for symmetric effects such as variances or covariances)
Δχ²(7, N = 1063) = 34.04, p < 0.001. However, practical fit indices were only modestly worse, with an RMSEA of 0.052, and CFI and TLI values of 0.979 and 0.973, respectively. Despite having worse fit that slightly exceeded the "change of 0.01" criterion for a practically significant difference in fit, the practical fit indexes for Model 3 were still acceptable (i.e., the CI for the RMSEA included 0.05, and both TLI and CFI values were above 0.95), indicating that the partial strong invariance constraints applied were appropriate.

The final confirmatory model, Model 4, imposed invariance constraints on the unique factor variances in Θ and therefore corresponded to strict factorial invariance. Model 4 resulted in relatively poor fit, χ²(65, N = 1063) = 317.84, p < 0.001, with RMSEA of 0.105, and CFI and TLI scores of 0.886 and 0.890, respectively. This represented a significant worsening of both the statistical index of fit, Δχ²(16, N = 1063) = 222.55, p < 0.001, and practical indices of fit when compared to Model 3. Based on the poor practical fit indexes of the strict factorial invariance model, we accepted the partial strong factorial invariance model (Model 3) as the optimal model for the data.

Group differences in means and variances on latent variables. Given the fit of the partial strong factorial invariance model, mean and variance differences across groups could be evaluated in an unbiased fashion. For the latent variables of Economic Pressure and Husband Depression, means were fixed at 0 and variances at unity for the FTP sample, and means and variances were estimated for all other latent variables. In the FTP sample, wives had significantly

Ethnic differences in effects of economic pressure on depression. We turn next to ways of exploring the bounds of psychological theories. A substantial body of prior research supports the conclusion that family economic pressure has negative effects on husband and wife depression. Much of this research has been based on data from European American families, and we wanted to see whether the effect of economic pressure on husband and wife depression was similar across ethnic groups.

To answer this question, we returned to the three-group data used to test factorial invariance but examined models that substituted direct effects of economic pressure on husband and wife depression in place of the correlations estimated in prior models. In the first of these models, we imposed invariance constraints on the path coefficients leading from economic pressure to depression across husbands and wives within each sample. This cross-gender constraint resulted in a significant worsening in fit compared to the partial strong invariance model, Δχ²(3, N = 1063) = 12.96, p = 0.01, with moderately worse practical fit indexes, RMSEA of 0.055, and CFI and TLI values of 0.975 and 0.969, respectively. Next, we freed the cross-gender constraints, but imposed equality constraints on the path coefficients of economic pressure on husband depression across groups and on wife depression across groups. This model also resulted in a significant worsening of fit when compared to the partial strong factorial invariance model, Δχ²(4, N = 1063) = 19.65, p < 0.001, with worse practical fit indexes, RMSEA of 0.057, and CFI and TLI values of 0.972 and 0.967, respectively.
In our final model, we imposed invariance constraints on the path coefficients for economic pressure on husband and wife depression across the two minority samples, allowing the corresponding coefficients to be free for the European American sample. This resulted in a nonsignificant change in model fit when compared to the partial strong invariance model, Δχ²(2, N = 1063) = 1.52, p = 0.47. The practical fit indexes for this model were essentially identical to those for the strong factorial invariance model, so we chose this model as the optimal representation of our data.

Parameter estimates in this model suggest that the impact of economic pressure on the level of depression for minority couples differs from that for European American couples. Specifically, economic pressure appears to have the strongest effect (β = 0.69) on depression in minority husbands (both African-American and Mexican-American husbands), and a significantly weaker effect (β = 0.47) on depression in minority wives. In contrast, economic pressure had a stronger effect (β = 0.54) on European American wives than on European American husbands (β = 0.23), who were least affected by economic pressure.

Parenting styles and child behavior. A second example of exploring the bounds of psychological theories developed on the general population is worthy of note. In this study, Widaman and Borthwick-Duffy (April, 1990) reported results from 109 families with a child with intellectual disability. Key findings on parenting styles, or styles of parenting behaviors, by researchers such as Baumrind (1968, 1971, 1991; Baumrind, Larzelere, & Owens, 2010) and Hoffman (1975, 1979, 1994) supported the presence of several dimensions of parenting behavior. The general finding of this research was that authoritative parenting (high control, plus high nurturance/warmth) was associated with optimal forms of child behavior and that authoritarian parenting (high control and power assertion, plus low nurturance and warmth) and permissive parenting (low control, plus high nurturance and warmth) both led to less optimal levels of child behavior.

Widaman and Borthwick-Duffy (April, 1990) isolated a total of seven dimensions of parenting behavior in their study, including (1) nurturance/warmth, (2) induction, (3) maturity demands, (4) promoting autonomy, (5) firm control, (6) love withdrawal, and (7) power assertion. Consistent with research on the general population, Widaman and Borthwick-Duffy found that the positive parenting behaviors of induction, maturity demands, and promoting autonomy had the strongest, positive effects on longitudinal changes in different forms of adaptive functioning, such as practical skills (e.g., dressing oneself), conceptual competence (e.g., functional academics), and social competence (e.g., making and keeping friends). However, contrary to research on the general population, Widaman and Borthwick-Duffy found that the authoritarian dimension of power assertion consistently had the strongest effects on longitudinal changes in negative behaviors such as social maladaption (e.g., aggression, property destruction) and personal maladaption (e.g., self-injurious behavior). Although the standardized regression weights were not large (βs ranged from 0.15 to 0.24), power assertion was the only parenting style to impact changes in maladaptive behaviors. Thus, the proscription of power assertion as a less useful form of parenting behavior was based on research on families in the general population. However, to reduce the levels of maladaptive behaviors in children with intellectual disability, power assertion appears to be the only viable parenting option.

Exploiting Unusual Variability in Special Populations

Adaptive behavior in persons with intellectual disability. The final analytic issue to illustrate is the exploiting of unusual variability in special populations. One example of unusual variability in special populations involves the domain of adaptive behavior, which attains significance in the population of persons with intellectual disability. The three key dimensions of adaptive behavior are practical competence (or independent living skills), conceptual competence (or cognitive skills), and social competence (or social skills); all three dimensions represent everyday behaviors that enable a person to live independently in the community.

Widaman, Gibbs, and Geary (1987) utilized a database maintained by the Department of Developmental Services (DDS) of the State of California. DDS required all persons receiving state services for developmental disability to be assessed on a 66-item instrument of adaptive behavior. The instrument assessed the three key dimensions of practical, conceptual, and social competence and also assessed three additional dimensions of motor competence, social maladaption, and personal maladaption. Widaman et al. extracted 14 groups of individuals based on a crossing of age (children,
Table 4.1. Descriptive Statistics on Variables from the Maternal PKU Study

Variable                               N    Mean   SD    Min   Max
Mother PHE level on unrestricted diet  413  22.03  9.18  3.30  51.10
Mother PHE level during pregnancy      412   8.23  4.49  1.30  28.30

Note: Mother full-scale IQ based on Wechsler Adult Intelligence Scale—Revised; Mother PHE levels on regular diet and during pregnancy are in mg/dL; Child verbal, performance, and full-scale IQ based on Wechsler Intelligence Scale for Children—Revised.
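The two competing models discussed next, an ordinary linear regression and a two-piece linear spline with an estimated knot, can be sketched in Python. This is an illustrative sketch on simulated data only; the generating values are assumptions that loosely echo the estimates reported in the text, and the grid-search knot estimator is a stand-in for whatever estimation procedure the original analyses used.

```python
import numpy as np

# Simulate prenatal PHE exposure (mg/dL) and child IQ under a threshold
# model: IQ is flat below a knot and declines linearly above it.
# All generating values here are invented for illustration.
rng = np.random.default_rng(0)
n = 400
phe = rng.uniform(1.0, 28.0, n)
iq = 104.0 - 4.1 * np.maximum(0.0, phe - 5.5) + rng.normal(0.0, 8.0, n)

# Model 1 -- ordinary linear regression: predicted IQ = b0 + b1 * PHE
X = np.column_stack([np.ones(n), phe])
b0, b1 = np.linalg.lstsq(X, iq, rcond=None)[0]

# Model 2 -- two-piece linear spline:
# predicted IQ = c0 + c1 * max(0, PHE - knot).
# Profile the knot over a grid; at each candidate knot the remaining two
# parameters form an ordinary linear least-squares problem.
def spline_fit(knot):
    Xk = np.column_stack([np.ones(n), np.maximum(0.0, phe - knot)])
    coef, resid, *_ = np.linalg.lstsq(Xk, iq, rcond=None)
    return coef, float(resid[0])

knots = np.linspace(2.0, 10.0, 161)
best_knot = min(knots, key=lambda k: spline_fit(k)[1])
(c0, c1), _ = spline_fit(best_knot)

print(f"linear: IQ = {b0:.1f} + ({b1:.2f}) * PHE")
print(f"spline: IQ = {c0:.1f} + ({c1:.2f}) * max(0, PHE - {best_knot:.2f})")
```

The linear model forces a constant slope across the whole PHE range, whereas the spline recovers both the flat region below the knot and the decline above it, which is the threshold pattern described in the text.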
would exhibit extremely low PHE values, below 1.0 and near 0, on these two measures. In contrast to this, mother PHE levels on an unrestricted diet varied between 3.3 and 51.1, and mother PHE levels during pregnancy varied between 1.3 and 28.3.

To model the effect of prenatal PHE exposure on child Full Scale IQ at 7 years, one could use a linear regression model. The result of fitting this model is the following equation:

predicted IQ = 119.5 − 3.53 (PHE)

The standard errors for the intercept and regression slope were 2.13 and 0.24, respectively, so interval estimates (95% CIs) for the two parameter estimates were approximately 115.2 to 123.7 and −4.00 to −3.05 for the intercept and slope, respectively. The above equation represents the unrealistic expectation that a child's IQ at age 7 years would be approximately 120 if his/her mother had maintained a PHE level of 0 during pregnancy, and that IQ would drop about 3.5 points for every 1 mg/dL increase above this value of 0.

An alternative regression model is a two-piece linear spline, with three parameter estimates: an intercept, a knot point that estimates the PHE exposure at which a teratogenic effect begins to occur, and the linear slope representing the teratogenic effect after the knot point. The results for this model were:

predicted IQ = 103.9 − 4.14 max(0, PHE − 5.50)

In this equation, children who experienced prenatal PHE levels between 0 and 5.50 all had an expected 7-year IQ of about 104, near the population mean. Moreover, after the knot point of 5.50, the predicted IQ drops over 4.1 IQ points for every 1 mg/dL increase. Because the standard errors for the three coefficients were 1.70, 0.32, and 0.62, the point estimates and interval estimates for the coefficients are: intercept = 103.9, 95% CI [100.5, 107.3], regression slope for PHE = −4.14, 95% CI [−4.78, −3.50], and knot point = 5.50, 95% CI [4.25, 6.75].

Exploiting the unusual variability of mother PHE levels during pregnancy allowed us to reach several important goals. First, we could verify the nature of the teratogenic effect of prenatal exposure to PHE on child cognitive outcomes, which is nonlinear in nature with a threshold for teratogenic effects. Second, we could estimate the level of exogenous PHE at which the teratogenic effect begins to occur, which is close to the level of endogenous PHE often assumed to have teratogenic effects for persons with PKU. Third, these results could be used as the basis for recommendations for monitoring PHE levels in mothers with PKU, attempting to ensure that mothers with PKU keep their PHE levels below the level at which teratogenic effects are likely to occur. Additionally, all of these goals were accomplished only because of the unusual variability in PHE levels exhibited by mothers with PKU in the MPKUC study. Readers are referred to Widaman (2009) for an informative summary of findings of the MPKUC study.

Conclusions

The presence of special populations—however delineated—provides opportunities beyond those afforded by drawing repeated samples that are representative of the general population. If we disregarded the presence of special populations, then we might develop a corpus of scientific findings that
One Shields Avenue, Davis, CA 95616. E-mail: kfwidaman@ucdavis.edu

References

Barton, B. (2009). Dreams deferred: Disability definitions, data, models, and perspectives. Journal of Sociology and Social Welfare, 36(4), 13–24.
Baumrind, D. (1968). Authoritarian vs. authoritative parent control. Adolescence, 3, 255–272.
Baumrind, D. (1971). Current patterns of parental authority. Developmental Psychology, 4(1, Pt. 2), 1–103.
Baumrind, D. (1991). The influence of parenting style on adolescent competence and substance abuse. The Journal of Early Adolescence, 11, 56–95.
Baumrind, D., Larzelere, R. E., & Owens, E. B. (2010). Effects of preschool parents' power assertive patterns and practices on adolescent development. Parenting: Science and Practice, 10, 157–201.
Belsky, J., Bakermans-Kranenburg, M. J., & van IJzendoorn, M. H. (2007). For better and for worse: Differential susceptibility to environmental influences. Current Directions in Psychological Science, 16, 300–304.
Bronfenbrenner, U. (1977). Toward an experimental ecology of human development. American Psychologist, 32, 513–531.
Bronfenbrenner, U. (1986a). Ecology of the family as a context for human development: Research perspectives. Developmental Psychology, 22, 723–742.
Bronfenbrenner, U. (1986b). Recent advances in research on the ecology of human development. In R. K. Silbereisen, K. Eyferth, & G. Rudinger (Eds.), Development as action in context: Problem behavior and normal youth development (pp. 287–309). Heidelberg: Springer-Verlag.
Bronfenbrenner, U. (1999). Environments in developmental perspective: Theoretical and operational models. In S. L. Friedman & T. D. Wachs (Eds.), Measuring environment across the life span: Emerging methods and concepts (pp. 3–28). Washington, DC: American Psychological Association.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.
Caspi, A., McClay, J., Moffitt, T. E., Mill, J., Martin, J., Craig, I. W., et al. (2002). Role of genotype in the cycle of violence in maltreated children. Science, 297, 851–854.
Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., et al. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science, 301, 386–389.
Chakrabarti, B., Dudbridge, E., Kent, L., Wheelwright, S., Hill-Cawthorne, G., Allison, C., et al. (2010). Genes related to sex steroids, neural growth, and social-emotional behavior are associated with autistic traits, empathy, and Asperger syndrome. Autism Research, 2, 157–177.
Chassin, L., Knight, G., Vargas-Chanes, D., Losoya, S. H., & Naranjo, D. (2009). Substance use treatment outcomes in a sample of male serious juvenile offenders. Journal of Substance Abuse Treatment, 36, 183–194.
Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12, 471–492.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.
Clark, L. A., & Watson, D. (1997). The Mini Mood and Anxiety Symptom Questionnaire (Mini-MASQ). Unpublished manuscript, University of Iowa.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Cobb, B., Lehmann, J., Newman-Gonchar, R., & Alwell, M. (2009). Self-determination for students with disabilities: A narrative metasynthesis. Career Development for Exceptional Individuals, 32, 108–114.
Colarelli, S. M., Han, K., & Yang, C. (2010). Biased against whom? The problems of "group" definition and membership in test bias analyses. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 228–231.
Conger, R. D., Song, H., Stockdale, G. D., Ferrer, E., Widaman, K. F., & Cauce, A. M. (2012). Resilience and vulnerability of Mexican origin youth and their families: A test of a culturally-informed model of family economic stress. In P. K. Kerig, M. S. Schulz, & S. T. Hauser (Eds.), Adolescence and beyond: Family processes and development (pp. 268–286). New York: Oxford University Press.
Conger, R. D., Stockdale, G. D., Song, H., Robins, R. W., & Widaman, K. F. (in press). Predicting change in substance use and substance use cognitions of Mexican origin youth during the transition from childhood to early adolescence. In Y. F. Thomas, L. N. Price, & A. V. Lybrand (Eds.), Drug use trajectories among African American and Hispanic youth. New York: Springer.
Diaz, N., Horton, E. G., McIlveen, J., Weiner, M., & Nelson, J. (2009). Dysthymia among substance abusers: An exploratory study of individual and mental health factors. International Journal of Mental Health and Addiction, 7, 357–367.
Eastwood, S. L., & Harrison, P. J. (2010). Markers of glutamate synaptic transmission and plasticity are increased in the anterior cingulate cortex in bipolar disorder. Biological Psychiatry, 67, 1010–1016.
Ferrer, E., Balluerka, N., & Widaman, K. F. (2008). Factorial invariance and the specification of second-order latent growth models. Methodology, 4, 22–36.
Folks, H. (1931). Committee C-1—Socially handicapped—Dependency and neglect. In White House Conference on Child Health and Protection, White House conference 1930: Addresses and abstracts of committee reports, White House conference on child health and protection called by President Hoover (pp. 319–340). New York: The Century Co.
Grigorenko, E. L., Klin, A., & Volkmar, F. (2003). Annotation: Hyperlexia: disability or superability? Journal of Child Psychology and Psychiatry, 44, 1079–1091.
Hancock, G. R., Kuo, W.-L., & Lawrence, F. R. (2001). An illustration of second-order latent growth models. Structural Equation Modeling, 8, 470–489.
Henry, P. J. (2008). College sophomores in the laboratory redux: Influences of a narrow data base on social psychology's view of the nature of prejudice. Psychological Inquiry, 19, 49–71.
Hoffman, M. L. (1975). Altruistic behavior and the parent-child relationship. Journal of Personality and Social Psychology, 31, 937–943.
Hoffman, M. L. (1979). Development of moral thought, feeling, and behavior. American Psychologist, 34, 958–966.
Hoffman, M. L. (1994). Discipline and internalization. Developmental Psychology, 30, 26–28.
Horn, J. L., McArdle, J. J., & Mason, R. (1983). When is invariance not invariant: A practical scientist's look at the
Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the Absorption scale. Journal of Personality and Social Psychology, 57, 1051–1058.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Widaman, K. F. (2007). Intrauterine environment affects infant and child intellectual outcomes: Environment as direct effect. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 387–436). Mahwah, NJ: Erlbaum.
Widaman, K. F. (2009). Phenylketonuria in children and mothers: Genes, environments, behavior. Current Directions in Psychological Science, 18, 48–52.
Widaman, K. F., & Borthwick-Duffy, S. A. (April, 1990). Parental influences on the development of adaptive behaviors. Paper presented at the Gatlinburg Conference on Research and Theory in Mental Retardation and Developmental Disabilities, Brainerd, MN.
Widaman, K. F., Gibbs, K. W., & Geary, D. C. (1987). The structure of adaptive behavior: I. Replication across fourteen samples of nonprofoundly retarded persons. American Journal of Mental Deficiency, 91, 348–360.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.
James Jaccard
Abstract
General issues involved in (1) building causal theories, (2) translating those theories into a set of
mathematical representations, (3) choosing an analytic strategy to estimate parameters in the
equations implied by the theory, and (4) choosing the “best” model from a set of competing models
are discussed. Theory construction fundamentally relies on six relationship types, including direct
effects, indirect effects, moderated effects, reciprocal effects, spurious effects, and unanalyzed
relationships. Once specified, each type of effect can be represented mathematically, thereby
translating a path diagram into a set of (linear) equations. Parameters in the equations are then
estimated using either limited information estimation approaches or full information estimation
approaches, taking into account measurement properties, population distributions, and matters of
robustness. Choices between competing models are based on indices of relative fit with the data and
relative fit with competing models, as well as more general theoretical criteria (e.g., parsimony,
consistency with related theories).
Key Words: Theory; theory construction; modeling; parameter estimation
Few people dispute that theory is at the heart of the scientific enterprise. We use theories to explain phenomena and to help solve important applied problems. With theoretical propositions in hand, we design research to gain perspectives on the viability of the theory and imbue those propositions with a certain degree of confidence based on the results of research and the quality of the research design. Theory is fundamental to the social sciences.

Advances in statistical analysis have been considerable during the past 30 years. Our statistical toolbox was at one time somewhat limited, and it was not uncommon for analysts to adopt practices that forced data to conform to the assumptions of our statistical tools so that we could apply inferential methods of analysis to those data. For example, although we knew that a construct like depression was non-normally distributed in the population and that our measures of depression and data reflected this, we would transform data on depression so that it would approximate a normal distribution and be amenable to analysis using statistical methods that assumed normality. Such days, fortunately, are over. The statistical tools available now allow us to give precedence to theory and model testing without being slave to many of the traditional assumptions made by methods of analysis. The present chapter discusses issues that analysts need to take into account as they move from theory to analysis. I focus first on the nature of theory in the social sciences, with an emphasis on describing theories that invoke the language of causality. Although there are many other approaches to theory and modeling, one cannot deny the prominence and pervasiveness of causal thinking in social science research. Hence, causal frameworks capture the bulk of my attention. Next, I discuss issues involved in moving from a well-specified causal
theory to a mathematical representation of that theory for purposes of statistical analysis. I then discuss a range of issues that must be considered as one moves from a set of equations representing a theory to formal data analysis, including measurement considerations, full versus limited information estimation strategies, distributional assumptions, and the possible presence of outliers. Finally, I consider issues in model selection, which refers to the process of choosing one model as being the "best" from a set of candidate models.

In the present chapter, I use the terms theory and model interchangeably, although some scientists do not do so. As examples, various authorities contend that models are a special type of theory (e.g., Coombs, Dawes, & Tversky, 1970, p. 4; Kaplan, 1964, p. 263), models are portions of theories (Sheth, 1967, p. 444; Torgerson, 1958, p. 4), models are derived from theories (e.g., Pap, 1962, p. 355), models are simplified versions of theories (e.g., Carnap, 1971, p. 54), models represent correspondence between two or more theories (Brodbeck, 1968), or theories represent specific interpretations of (i.e., are derived from) models (e.g., Green & Tull, 1975, p. 42). Others consider the terms to be synonymous (cf. Dubin, 1976; Simon & Newell, 1956). Although there may indeed be meaningful distinctions between theories and models, it also is the case that models, like theories, involve variables and relationships between variables, usually invoking the concept of causality. Accordingly, I will use the terms theory and model interchangeably.

Specifying a Causal Theory

The Nature of Causality

Theories take many forms in the social sciences. One common form involves specifying presumed relationships between variables while invoking the concept of causality. The nature of causality has been debated extensively by philosophers of science (e.g., Bunge, 1961; Cartwright, 2007; Frank, 1961; Morgan & Winship, 2007; Pearl, 2000; Russell, 1931; Rubin, 1974, 1978; Shadish, Cook, & Campbell, 2002) and most agree that causality is an elusive concept that is fraught with ambiguities. It is beyond the scope of this chapter to delve into the underlying philosophical issues (see Jaccard & Jacoby, 2010). Rather, I emphasize here a "working model" of causality that is adopted by most social scientists.

The concept of causality is usually thought of in terms of change—that is, X is a cause of Y if changes in X produce changes in Y (but see Sowa, 2000, and Lewis, 2000, for alternative conceptualizations). Four properties of causality are typically emphasized. First, a cause always must precede an effect in time. Second, the time that it takes for a change in X to produce a change in Y can vary, ranging from almost instantaneous change to years, decades, centuries, or millennia. Third, the nature and/or strength of the effect of X on Y can vary depending on context. X may influence Y in one context but not another context. Finally, cause and effect must be in some form of spatial contact or must be connected by a chain of intermediate events. We return to these ideas in later sections of this chapter.

The Nature of Theories that Use Causality

The focus of most causal theories is on explaining why variation in one or more variables exists. Some people make a great deal of money and others are poor. Why? What can account for this variation? Some people are able to remember complex material easily whereas for other people, it is difficult to do so. Why? What explains this variability? We answer these questions by specifying the presumed causes of the variability, and then we seek to test our theoretical proposition(s).

In any given causal theory, there are six fundamental types of relationships that can be referenced; these are illustrated in Figure 5.1. These six relationship types are the cornerstone of causal theories and define the universe of causal relationships that theorists draw upon. In Figure 5.1, a variable is indicated by a box, and a causal influence is represented by a straight arrow emanating from the cause and pointing to the effect. I discuss the bidirectional curved arrow in Figure 5.1 shortly.

Referring to Figure 5.1, a direct causal relationship is one in which a given cause is assumed to have a direct causal impact on some outcome variable. For example, exposure to violence in the media is assumed to influence aggressive behavior. Or, the quality of the relationship between a mother and her adolescent child is assumed to influence whether the child uses drugs. By contrast, an indirect causal relationship is when a variable influences another variable indirectly through its impact on an intermediary variable (see Fig. 5.1). For example, failing to accomplish a goal may lead to frustration that, in turn, causes someone to aggress against another. In this case, the failure to obtain a goal is an indirect cause of aggression. It influences aggression through its impact on frustration. Frustration is formally called a mediating variable or,
jaccard 83
[Figure 5.1. Diagrams of the six relationship types, drawn with boxed variables (X, Y, Z) and arrows; labeled panels include Direct Causal Relationship and Indirect Causal Relationship.]
more informally, a mediator, because other variables "work through" it to influence the outcome. Indirect relationships are sometimes called mediated relationships.

Whenever a theorist posits an indirect relationship, the issue of whether to specify partial or complete mediation presents itself. In partial mediation, the distal variable has a direct effect on the outcome variable over and above its effect on the mediator. In complete mediation, all of the impact of the distal variable on the outcome variable is accounted for by the mediator. With partial mediation, in addition to the two causal paths characterizing the indirect effect illustrated in Figure 5.1, one adds an additional causal path linking the distal variable (X) and the outcome variable (Y) directly. As an example, we might posit that the quality of the relationship between a mother and her adolescent child (X) impacts the child's motivation to perform well in school (Z) and that one's motivation to do well in school, in turn, impacts (negatively) the tendency for an adolescent to use drugs (Y). As we think about the matter of specifying partial versus complete mediation, we might decide that there are other mechanisms by which the quality of the relationship between parent and adolescent can impact drug use, such as by lessening the attention that adolescents give to peers who use drugs. We therefore decide to posit partial mediation to reflect this fact and add a direct causal path from X to Y.

If we are able to specify another mechanism by which X influences Y over and beyond Z, then why not just incorporate that additional mediator into the theory? Of course, we could very well do this, but then the issue becomes whether the two mediators in the theory, considered together, are complete or partial mediators of the causal effect of X on Y. This might lead us to speculate about yet a third mechanism, and once we have specified it, the issue of partial or complete mediation will present itself yet again. At some point, we must decide to close out the system and just let a direct path between X and Y stand so as to reflect partial mediation without formally bringing additional mediators into the model. If pressed, we could articulate one, but we simply do not want to complicate the theory further.

A spurious relationship is one in which two variables are related because they share a common cause but not because either causes the other (see Fig. 5.1). As an example, if we select a random sample of people in the United States and calculate the association between height and length of hair, then we would find a moderate relationship between the two variables: People with shorter hair grow
[Figure 5.2. Path diagram for the STD intervention example: Intervention versus Control influences the belief that unprotected sex leads to STDs (path a) and the belief that STDs are serious (path b); the two beliefs influence engaging in unprotected sex (paths c and d), which in turn influences having an STD (path e).]
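The path structure of Figure 5.2 can be mimicked with a small simulation. This is a hypothetical sketch: the linear-Gaussian equations and all path coefficients below are invented for illustration and are not estimates from the chapter. It shows two of the relationship types at work: the two beliefs correlate even though neither causes the other (a spurious relationship via their common cause, the intervention), and the intervention influences STD risk only indirectly.

```python
import numpy as np

# A sketch of the causal structure in Figure 5.2 as a simulation; the path
# coefficients are invented for illustration, not estimated from real data.
rng = np.random.default_rng(1)
n = 5000
intervention = rng.integers(0, 2, n).astype(float)  # randomized: treatment vs control

# Paths a and b: the intervention raises both beliefs
belief_leads_to_stds = 0.8 * intervention + rng.normal(0, 1, n)
belief_stds_serious = 0.6 * intervention + rng.normal(0, 1, n)

# Paths c and d: both beliefs reduce unprotected sex; path e: unprotected sex
# raises STD risk
unprotected = (-0.5 * belief_leads_to_stds - 0.4 * belief_stds_serious
               + rng.normal(0, 1, n))
std_risk = 0.7 * unprotected + rng.normal(0, 1, n)

# Spurious association: the two beliefs share a common cause (the
# intervention) but neither causes the other, yet they still correlate.
r_beliefs = np.corrcoef(belief_leads_to_stds, belief_stds_serious)[0, 1]

# Indirect effect: the intervention affects STD risk only through the two
# belief mediators and the unprotected-sex variable.
r_int_std = np.corrcoef(intervention, std_risk)[0, 1]
print(f"corr(belief1, belief2) = {r_beliefs:.2f}")
print(f"corr(intervention, STD risk) = {r_int_std:.2f}")
```

The positive belief-belief correlation and the negative intervention-STD correlation both arise purely from the specified paths; no direct arrow connects the correlated beliefs.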
that contracting an STD is bad for one's health (see path b in Fig. 5.2). These two beliefs, in turn, are assumed to influence the tendency for a person to engage in unprotected sex (paths c and d). The tendency to engage in unprotected sex, in turn, is thought to impact the extent to which the person contracts STDs (path e). Paths a through e each represent direct causal relationships. There also is a spurious relationship in this model, as seen by the fact that the two beliefs share a common cause, the intervention (paths a and b). Because of this common cause, we expect the two beliefs to be correlated to some extent, although there is no presumed causal connection between them. There also are several indirect causal relationships in the model. For example, the intervention indirectly affects the occurrence of STDs through the two belief mediators and, in turn, their influence on the tendency to engage in unprotected sex.

In causal theories, distinctions are made between exogenous and endogenous variables. Any variable that has a straight arrow going to it in a path diagram is called an endogenous variable. Endogenous variables, essentially, are outcome variables—that is, they are presumed to be influenced by another variable in the theoretical system. Variables that do not have a straight arrow going to them are called exogenous variables. They do not have presumed causes that are elaborated upon in the theory.

In sum, the fundamental orientation to constructing a causal theory is to explain variation in one or more outcomes. This is accomplished with reference to six types of relationships: direct effects, indirect effects, reciprocal effects, moderated effects, spurious effects, and unanalyzed relationships. In the theory construction process, it is not uncommon for the theorist to first identify the outcome variable he or she wishes to explain and then to specify a few direct causes of that variable. One or more of the direct causes can then be turned into an indirect effect by elaborating the mechanisms by which the cause produces the effect. The "mechanisms" are, essentially, mediators. The theory might be further elaborated by specifying the boundary conditions of effects, thereby introducing moderated relationships into the framework (e.g., the effect holds when condition A is operative but not when condition B is operative; or the effect holds for one type of individual but not another type of individual). For a discussion of the many heuristics scientists use to identify mechanisms and boundary conditions when developing causal theories see Jaccard and Jacoby (2010).

Theories can be represented in the form of path diagrams using the graphical schemes illustrated in Figures 5.1 and 5.2. Such representations are possible no matter what research design is used to test a theory—that is, we can represent the underlying theory with path diagrams for experiments just as readily as for purely observational designs. Figure 5.2 is an example of a causal theory that incorporates an experimental design.

Causal Theories with Explicit Temporal Dynamics

An important structure of many (but not all) causal theories is a focus on longitudinal dynamics. Causal theories that focus on longitudinal dynamics contain one or more of the six types of relationships described above, but there also is an explicit temporal dynamic that is of theoretical interest. A nomenclature has emerged around such theories, which I briefly describe here.

One popular type of causal model that includes longitudinal dynamics is called a panel model, in which multiple variables are modeled at multiple points in time, also called waves.
[Figure 5.3. Two-variable, four-wave panel model of alcohol use and delinquency, with autoregressive paths (a, b), contemporaneous paths (c), and lagged paths (d).]
Figure 5.3 presents an example of a two-variable, four-wave panel model. This model examines the relationship between adolescent alcohol use and delinquency across the 4 years of high school. Heavy alcohol use at time t is thought to have a direct effect on alcohol use at time t + 1, and the same type of dynamic is thought to hold for delinquency. A path that specifies a variable at time t + 1 as being a function of that same variable at time t is said to reflect an autoregressive effect (see paths a and b in Fig. 5.3). In the model in Figure 5.3, the autoregressive effects are first-order effects, because the variable at a given time, t, is assumed to be a function of that same variable at the time just prior to it. A second-order autoregressive effect is one where the variable at time t is impacted (also) by the same variable at time t − 2. A third-order autoregressive effect implies a direct effect between a variable at time t and that same variable at time t − 3. And so on.

The theory in Figure 5.3 also specifies that drinking alcohol at time t predisposes one to delinquency at that same time period under the rationale that people often commit acts of delinquency when they are drunk. When a causal relationship exists between two variables at the same period of time, it is said to represent a contemporaneous causal effect (see path c in Fig. 5.3). Finally, when a variable at time t has an effect on a different variable at time t + 1, it is referred to as a lagged effect (see path d in Fig. 5.3). In this case, alcohol use at time t is thought to have a delayed effect on delinquency at time t + 1 independent of the other indirect causal chains that link alcohol use at time t to delinquency at time t + 1.

In sum, it is common in longitudinal models to theorize about autoregressive effects, contemporaneous effects, and lagged effects. These effects are common features of panel models (Collins, 2006; Finkel, 2008).

A second type of theory that formally incorporates temporal dynamics is theory based on growth processes. This approach views variation in outcomes across time as arising from an unobserved "growth" process that causes changing values of an outcome over time. These models are typically associated with latent growth curve models in the social science literature or, more simply, latent curve models (LCMs). The classic form of these models is presented in Figure 5.4, as applied to the alcohol-delinquency example considered earlier. Observed variables or measures are represented by boxes, and latent (unobserved) variables are represented by circles. This model parameterizes a latent "growth" or "maturation" process for alcohol use as students progress through high school (represented by the variable called "Alcohol Slope" in Fig. 5.4) as well as a latent growth or maturation process for delinquency as students progress through high school (see the variable "Delinquency Slope" in Fig. 5.4). The "growth process" for alcohol use is assumed to impact the "growth process" for delinquency across the four time periods. One often will encounter longitudinal causal theories expressed in this form instead of in the more traditional panel model form (Collins, 2006).

Of course, it is possible that both types of dynamics in the two types of models operate. When the two types of processes are integrated into a single model, we obtain what is called an autoregressive latent trajectory model (Bollen & Curran, 2006). Figure 5.5 presents this model.

In sum, when constructing theories that incorporate longitudinal dynamics, one will explicitly take into account the possible causal dynamics described by panel models, by LCMs, or by autoregressive latent trajectory models.

Multilevel Causal Theories

Another common type of causal model that has received considerable attention in the social sciences is one that incorporates multiple levels of analysis, or what is called a multilevel model. These models deal with scenarios where there is nesting—for example, where individuals are nested within different, higher
[Figures 5.4 and 5.5. Latent curve model diagrams for the alcohol-delinquency example, with latent variables Baseline Alcohol, Alcohol Slope, Baseline Delinquency, and Delinquency Slope; Figure 5.5 combines this structure with autoregressive paths to form the autoregressive latent trajectory model.]
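The autoregressive, contemporaneous, and lagged paths of the panel model in Figure 5.3 can likewise be sketched as a simulation. All coefficients here are invented for illustration; the least-squares step simply recovers the wave-2 equation from the simulated data.

```python
import numpy as np

# A minimal simulation of the two-variable panel structure in Figure 5.3:
# autoregressive paths (alcohol_t -> alcohol_{t+1}, delinq_t -> delinq_{t+1}),
# a contemporaneous path (alcohol_t -> delinq_t), and a lagged path
# (alcohol_t -> delinq_{t+1}). Coefficients are invented, not estimated values.
rng = np.random.default_rng(2)
n, waves = 2000, 4
ar_alc, ar_del, contemp, lagged = 0.6, 0.5, 0.3, 0.2

alcohol = np.zeros((n, waves))
delinq = np.zeros((n, waves))
alcohol[:, 0] = rng.normal(0, 1, n)
delinq[:, 0] = contemp * alcohol[:, 0] + rng.normal(0, 1, n)
for t in range(1, waves):
    alcohol[:, t] = ar_alc * alcohol[:, t - 1] + rng.normal(0, 1, n)
    delinq[:, t] = (ar_del * delinq[:, t - 1]        # autoregressive (path b)
                    + lagged * alcohol[:, t - 1]     # lagged (path d)
                    + contemp * alcohol[:, t]        # contemporaneous (path c)
                    + rng.normal(0, 1, n))

# Recover the wave-2 equation by least squares: regress delinq_2 on
# delinq_1, alcohol_1, and alcohol_2 (all predetermined regressors).
X = np.column_stack([np.ones(n), delinq[:, 0], alcohol[:, 0], alcohol[:, 1]])
coefs = np.linalg.lstsq(X, delinq[:, 1], rcond=None)[0]
print("estimated [intercept, autoregressive, lagged, contemporaneous]:",
      np.round(coefs, 2))
```

Because the regression equation matches the generating equation, the estimates land close to the invented path values, which is the sense in which a panel model's paths are identified from multi-wave data.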
and drug use in adolescence are correlated is because they share the common cause of gender. In reality, there are many other common causes of these two constructs. For example, social class impacts both tobacco use and drug use during adolescence, with more economically disadvantaged youth having an increased tendency to smoke cigarettes and to use drugs. Essentially, social class resides within the disturbance term for smoking cigarettes and it also resides within the disturbance term for drug use. If the same unspecified cause is in each disturbance term, then we would expect the two disturbance terms to be correlated. Figure 5.6c presents a more plausible theory that includes this correlation between disturbances. According to this theory, there are two reasons why adolescent cigarette smoking and adolescent drug use are correlated. One reason is because they share the common cause of gender. Another reason is that they share other common causes that are unspecified by the theory and that reside in both disturbance terms.

A well-developed theory provides explicit statements about which disturbance terms in the framework are correlated and which disturbance terms are not. The lazy way out for a theorist is to simply assume all disturbance terms are correlated. But this is not satisfactory, and it can create considerable difficulties for testing the theory empirically. A better approach is to carefully consider every pair of disturbance terms and to articulate a possible common cause that resides in each of the separate disturbance terms. If such a common cause can be articulated, then it makes sense to posit correlated disturbances. If one cannot articulate any such variable, or if its effects are thought to be trivial, then one does not posit correlated disturbances.

For models with a longitudinal component, many theorists have a "knee-jerk" reaction that disturbances directed at the same variable at two different points in time must be correlated. Again, if one can articulate a compelling rationale for correlated disturbances, then by all means, correlated disturbances should be incorporated into the theory. Otherwise, correlated disturbances should be viewed with theoretical skepticism.

If a theorist is able to articulate a variable that resides in two disturbance terms to create correlated disturbances, then why not explicitly incorporate the variable into the theoretical system? For example, for the cigarette smoking and drug use example in Figure 5.6, why not explicitly bring social class into the theoretical system? This, of course, is desirable. But at some point, we want to close out the theoretical system and work just with the variables we have specified. By including disturbance terms and correlated disturbances, we are explicitly recognizing the operation of other variables, but we choose not to give them central focus in our theory.

Latent Variables, Structural Models, and Measurement Models
Some researchers take matters a step further and also incorporate a measurement theory into their conceptual frameworks when they are performing an empirical test of the theory. This goes beyond the typical province of theory construction per se, but I mention the ideas here as they ultimately impact data analysis and the choice of statistical methods. The integration of conceptual and measurement theories is something that should be standard practice for social science research.

An empirical test of a theory necessarily requires developing and using measures of the theoretical constructs in the theory. Just as one can build a theory linking one concept to another concept, so too can one build a theory linking a construct to a measure of that construct. Some theorists combine both types of theories into a single overarching framework. Traditional measurement models make a distinction between a latent variable and an observed measure of that variable. The latent variable is the true construct about which one is interested in making statements, such as depression. Although we can directly observe many of the symptoms of depression, we can't directly observe the "seat" of depression in a person's mind. Rather, we rely on some observable response(s) to assess the latent variable, such as a multi-item inventory of depression that a person completes. Figure 5.7a presents one representation of a measurement model. The latent variable of depression is contained in a circle, and the observed measure thought to reflect depression is contained in a square (the label "AR" stands for adolescent report of depression). A causal path is drawn from the latent variable to the observed measure, under the assumption that how depressed a person is influences how he or she responds to the questions on the inventory. There also is an error term that reflects measurement error—that is, there are factors other than depression that influence a person's responses on the inventory. Ideally, measurement error is minimal, but it is a fact of life for many research endeavors. The relationship between the latent construct and the observed indicator is
[Figure: latent variable model involving Support, Alcohol, Control, and Smoking factors (indicators S1–S3, A1–A3, C1–C3, K1–K3, W1, W2, with measurement errors ε) and Birthweight, with disturbances δ1–δ3.]
In sum, a well-developed theory will not only include causal relationships between constructs but also will include disturbance terms and a theory of the correlational structure of those disturbance terms. As one moves to an empirical test of a theory, one also specifies a measurement model that links theoretical constructs to measures obtained in the empirical test.

From Theories to Mathematical Representations
Specifying Core Model Equations
With a well-articulated theory in hand, the next step for choosing a form of statistical analysis is to translate the theory into a set of equations that then guide the statistical analysis. The basic strategy for doing so can be illustrated by making some simplifying assumptions, which I relax later. First, unless otherwise specified, I will assume that all variables in the theoretical system are continuous. Second, I will assume that the measures of all variables have at least interval level properties. Finally, I will assume that all relationships between variables are linear. I adopt a strategy whereby the theory under consideration is expressed in the form of a set of linear equations. This does not mean, however, that researchers always parameterize data analysis in terms of the linear model. Many well-known methods of analysis, such as t tests, analysis of variance, and analysis of covariance, evolved outside of the context of the linear model. However, these methods can be re-expressed as linear equations, and in this sense, they are compatible with the current approach. Complex models of dichotomous, nominal, and count variables also can be approached from the perspective of linear equations using what is known as the generalized linear model (see Yuan & Schuster, Chapter 18, Volume 1). Finally, newer methods of analysis that focus on robust indices of central tendency
[Figure 5.9: path diagram in which X influences Y3 (β4) and Y4 (β5), Y3 and Y4 influence Y2 (β2 and β3), and Y2 influences Y1 (β1), with disturbances ε1–ε4.]
and variability can be viewed in the context of linear equations but where the parameterization shifts from means and variances to trimmed means, M estimators, and the like (see Erceg-Hurn, Wilcox, & Keselman, Chapter 19, Volume 1).

Consider the case where the initial theory is represented as a path or influence diagram, such as the theory in Figure 5.9. A path diagram can be viewed as a pictorial representation of a set of equations. There is a separate equation for each endogenous variable in the theory. More specifically, given the aforementioned assumptions, each endogenous variable can be expressed as being a linear function of all variables with arrows going directly to and explicitly touching the box representing the endogenous variable. For the model in Figure 5.9, there are four linear equations that are of primary theoretical interest because there are four endogenous variables. Using the above rule, the relevant equations are:

Y1 = α1 + β1 Y2 + ε1,
Y2 = α2 + β2 Y3 + β3 Y4 + ε2,
Y3 = α3 + β4 X + ε3, and
Y4 = α4 + β5 X + ε4,

where the various α are intercepts, the various β are linear coefficients, and the various ε are disturbance terms. The primary interest of the research is estimating and interpreting the parameters α1 through α4, β1 through β5, and the variances of ε1 through ε4 relative to the variances of Y1 through Y4. We select statistical methods of analysis that provide the best and most well-behaved estimates of these parameters.

The rule for expressing a path diagram in terms of a set of core equations also applies to models with latent and observed variables, such as the model in Figure 5.10. For this model, the structural model has the following core equations:

LY = α1 + β1 LM + δ1

and

LM = α2 + β2 LX + β3 LZ + δ2,

with each equation focusing on a latent endogenous variable as an outcome. We are interested in estimating the parameters in these equations, but the task is more challenging statistically because we do not have direct access to a person's scores on the latent variables. Nevertheless, statisticians have derived methods for obtaining such estimates (see McDonald, Chapter 7, Volume 1).

The measurement model for the theory in Figure 5.10 implies the following equations (again, using the rule of specifying an equation for each endogenous variable, in this case, the observed endogenous variables):

X1 = α3 + β4 LX + ε1,
X2 = α4 + β5 LX + ε2,
Z1 = α5 + β6 LZ + ε3,
Z2 = α6 + β7 LZ + ε4,
M1 = α7 + β8 LM + ε5,
M2 = α8 + β9 LM + ε6,
Y1 = α9 + β10 LY + ε7, and
Y2 = α10 + β11 LY + ε8.
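The equation-per-endogenous-variable rule for a path diagram like Figure 5.9 can be made concrete with a small simulation. The sketch below, in Python, generates data consistent with the diagram (all coefficient values and the sample size are hypothetical) and recovers each core equation's parameters by ordinary least squares, one equation at a time:

```python
# Sketch (not from the chapter): simulating the Figure 5.9 model and
# estimating each core equation separately by ordinary least squares.
# All coefficient values and the sample size are hypothetical.
import random

random.seed(1)
N = 20_000

# Data generated to match the diagram:
# X -> Y3 (β4 = 0.5), X -> Y4 (β5 = -0.3),
# Y3 -> Y2 (β2 = 0.4), Y4 -> Y2 (β3 = 0.6), Y2 -> Y1 (β1 = 0.7)
X  = [random.gauss(0, 1) for _ in range(N)]
Y3 = [0.5 * x + random.gauss(0, 1) for x in X]
Y4 = [-0.3 * x + random.gauss(0, 1) for x in X]
Y2 = [0.4 * a + 0.6 * b + random.gauss(0, 1) for a, b in zip(Y3, Y4)]
Y1 = [0.7 * y + random.gauss(0, 1) for y in Y2]

def ols(y, *xs):
    """Least squares via the normal equations (X'X)b = X'y."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    A = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         for i in range(k)]
    v = [sum(p * q for p, q in zip(cols[i], y)) for i in range(k)]
    for i in range(k):                      # Gauss-Jordan elimination
        piv = A[i][i]
        A[i] = [a / piv for a in A[i]]
        v[i] /= piv
        for r in range(k):
            if r != i:
                f = A[r][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
                v[r] -= f * v[i]
    return v                                # [intercept, slope(s)]

b1 = ols(Y1, Y2)      # recovers α1 and β1
b2 = ols(Y2, Y3, Y4)  # recovers α2, β2, and β3
print(round(b1[1], 2), round(b2[1], 2), round(b2[2], 2))  # ≈ 0.7 0.4 0.6
```

Each endogenous variable is regressed only on the variables with arrows pointing directly at it—an equation-by-equation (limited information) strategy of the kind discussed later in the chapter.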
[Figure 5.10: latent variable model in which LX and LZ influence LM (β2, β3) and LM influences LY (β1); indicators X1, X2, Z1, Z2, M1, M2, Y1, Y2 load on their latent variables (β4–β11), with measurement errors ε1–ε8 and disturbances δ1, δ2.]
It is not uncommon to express the set of core equations in a theory in more succinct form than the above using matrix algebra. Readers unfamiliar with matrix algebra can skip to the summary paragraph at the end of this section. As an example, the above structural model is expressed by making distinctions between the latent endogenous variables and the latent exogenous variables, defining a vector of latent endogenous variables (η) and a vector of latent exogenous variables (ξ) as

η = [ η1 ]    ξ = [ ξ1 ]
    [ η2 ]        [ ξ2 ],

and a vector of intercepts (α) and a vector of disturbance terms (δ) for the latent η as

δ = [ δ1 ]    α = [ α1 ]
    [ δ2 ]        [ α2 ].

In the present example, there are m = 2 latent endogenous variables, q = 2 latent exogenous variables, r = 4 observed endogenous measures with respect to the two latent η, and p = 4 observed endogenous variables with respect to the two latent ξ. We further specify an m × m matrix (B) representing the linear (regression) coefficients regressing the η onto the η and an m × q matrix (Γ) representing the linear (regression) coefficients regressing the η onto the ξ:

B = [ β11  β12 ]    Γ = [ γ11  γ12 ]
    [ β21  β22 ]        [ γ21  γ22 ].

In the example in Figure 5.10, let LY = η1, LM = η2, LX = ξ1, and LZ = ξ2. Then

B = [ 0  β12 ]    Γ = [ 0    0   ]
    [ 0   0  ]        [ γ21  γ22 ],

and the structural model is defined as

η = α + Bη + Γξ + δ.

The measurement model for the latent endogenous and exogenous variables defines separate matrices of factor loadings for the linear (regression) coefficients from the latent endogenous variables to
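The matrix form of the structural model, η = α + Bη + Γξ + δ, can be made concrete numerically. A minimal sketch, with all numeric values hypothetical, solves the two-equation system in reduced form as η = (I − B)⁻¹(α + Γξ + δ):

```python
# Sketch (hypothetical values): the structural model in matrix form,
# η = α + Bη + Γξ + δ, solved in reduced form as
# η = (I - B)^(-1) (α + Γξ + δ) for the two-equation example.

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

beta12, gamma21, gamma22 = 0.7, 0.4, 0.6   # hypothetical coefficients
B     = [[0.0, beta12], [0.0, 0.0]]        # η1 (LY) depends on η2 (LM)
Gamma = [[0.0, 0.0], [gamma21, gamma22]]   # η2 depends on ξ1 (LX), ξ2 (LZ)
alpha = [0.1, 0.2]
xi    = [1.0, -1.0]                        # hypothetical latent scores
delta = [0.0, 0.0]                         # disturbances set to zero here

rhs = [a + g + d for a, g, d in zip(alpha, mat_vec(Gamma, xi), delta)]
I_minus_B = [[1.0 - B[0][0], -B[0][1]],
             [-B[1][0], 1.0 - B[1][1]]]
eta = mat_vec(inv2(I_minus_B), rhs)
# η2 (LM) = 0.2 + 0.4*1.0 + 0.6*(-1.0) = 0; η1 (LY) = 0.1 + 0.7*0 = 0.1
print([round(e, 6) for e in eta])   # [0.1, 0.0]
```

The zeros in B and Γ encode exactly which arrows the diagram omits, which is why the matrix form is a compact restatement of the path rules rather than a new model.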
Jaccard (2006) and Yuan and Schuster (Chapter 18, Volume 1). The above equation suggests that another way of diagramming a moderated effect is to include three separate direct effects, one from each of the component parts of the product and a third from the product term. Interaction analysis does not have to be pursued through product terms in a linear equation (Wilcox, 2005), but this is the most common approach.

Another qualification to specifying core equations is whether the relationship between variables is nonlinear rather than linear. For example, the relationship between an outcome variable and a causal variable might be characterized by a logarithmic function, an exponential function, a power function, a polynomial function, or a trigonometric function, among others. In such cases, the form of the core equation differs from those specified earlier. Nonlinear modeling can be complex, and consideration of such models is beyond the scope of this chapter. Readers are referred to Jaccard and Jacoby (2010) and Yuan and Schuster (Chapter 18, Volume 1).

Although there are other nuances that can further shape the nature of the equations one derives from a causal theory, the general point I stress here is (1) that it is possible to represent most causal models in the form of one or more equations and (2) that translating a model into equation form is a fundamental step for choosing an appropriate analytic method for testing the viability of a causal model.

Additional Considerations for Choosing Analytic Strategy
With a carefully specified set of equations that are theoretically guided, the next task of the researcher is to choose a data analytic method to estimate the parameters in those equations. The present section considers the role of facets that must be taken into account when making such decisions, including measurement/metric considerations, the use of full versus limited information estimation, distributional assumptions of estimation strategies, and the anticipated presence of outliers. Of course, there are a host of more nuanced considerations that researchers must account for, and these are developed in the different chapters in this volume. My focus is on categories of more general considerations.

Measurement and Metric Considerations
Strategies for estimating parameters in core equations, or the act of defining the parameters themselves, are influenced by the psychometric properties of the measures of the variables that comprise the theory. One important metric consideration is the level of measurement of the measures. The classic distinction between nominal, ordinal, interval, and ratio level measurement is particularly important, because different methods of analysis often are called for depending on the operative levels of measurement. Nominal measurement involves the assignment of numbers to levels of a variable that are categorical in nature and that have no inherent ordering on an underlying dimension. For the variable gender, a researcher might assign the values of 0 to females and 1 to males. The fact that one number is larger than the other has no substantive interest. The numbers function much like non-numeric labels.

Ordinal, interval, and ratio level measures can be defined relative to the measurement equations between quantitative latent variables and observed measures described earlier. In describing the properties of these levels of measurement, I will assume no measurement error is operating, to simplify the presentation. Assuming no measurement error, an interval level measure is one that is a linear function of the underlying latent variable—that is,

X = α + β LX,

where X is the observed measure and LX is the latent variable. The intercept and the linear coefficient can take on any value. For ratio level measures, the underlying latent variable, LX, should have a meaningful conceptual zero-point (such as zero weight when measuring how heavy an object is) and the intercept in the measurement equation must equal zero (as would be the case for a weight measure in pounds or a measure in grams). For ordinal measures, the relationship between X and LX is nonlinear but monotonic. For example, X might be a power function of LX, in which case the measure has ordinal properties.

As discussed in the other chapters in this volume, analytic strategies for parameter estimation vary as a function of these measurement levels. For example, when a measure of an exogenous variable in a causal model is nominal, it is common to represent that variable in a core equation using dummy variables (Chapter 18). If one wants to compare two groups on their average heights, then a measure of height that has ordinal properties is potentially problematic and not well-suited to comparing means on the underlying latent construct of height.

An important but underappreciated point when taking measurement levels into account is the fact
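The hazard of comparing group means on an ordinal measure can be shown in a few lines. In this sketch (the scores and the transform are hypothetical), the observed measure is a monotonic but nonlinear function of the latent variable, so the ordering of individual scores is preserved while the ordering of group means is not:

```python
# Sketch: why an ordinal measure can mislead mean comparisons.
# An ordinal X is a nonlinear but monotonic function of the latent LX;
# monotonicity preserves the order of individual scores but not, in
# general, the order of group means. Data and transform are hypothetical.

def ordinal_measure(lx):
    # A monotonic (order-preserving) but nonlinear transform
    return lx / (1 + abs(lx))

group1_latent = [1.0, 1.0, 1.0]        # latent mean = 1.0
group2_latent = [-1.0, 1.6, 2.7]       # latent mean = 1.1 (higher)

mean = lambda xs: sum(xs) / len(xs)
g1_obs = [ordinal_measure(x) for x in group1_latent]
g2_obs = [ordinal_measure(x) for x in group2_latent]

print(mean(group1_latent) < mean(group2_latent))   # True: group 2 higher on LX
print(mean(g1_obs) < mean(g2_obs))                 # False: order reverses on X
```

Here group 2 is higher on the latent construct but lower on the observed ordinal measure, which is exactly why ordinal height scores are ill-suited to comparing latent means.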
smoked marijuana) and whether measures are censored. Censoring occurs when a value occurs outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 300 pounds. If a 350-pound individual is weighed using the scale, then we would only know that the individual's weight is at least 300 pounds. For details of how these metric qualities affect analytic strategy, see Long (1997) and Yuan and Schuster (Chapter 18, Volume 1).

Full Information versus Limited Information Estimation
Given a set of equations derived from a causal theory, one approach to estimating parameters in those equations is to use a full information estimation approach. In full information estimation, the coefficients in all equations in the model are estimated simultaneously. This can be contrasted with a limited information estimation approach, where the coefficients in the equations are estimated one equation at a time or, alternatively, by focusing on only a portion of the larger model rather than the entire model. Full information estimation is common in structural equation modeling (SEM), but there also are limited information variants of SEM (see Bollen, 2001). Full information estimation has the advantage that parameter estimates typically (but not always) are more efficient (in a strict statistical sense of the term) than limited information estimators as long as the model being tested is correctly specified. Limited information estimation has the advantage that it often is less susceptible to adverse effects of specification error, because the effects of specification error are limited to the particular portion of the model where the error occurs. It also allows one to use the strongest methods of analysis available depending on the properties of measures and variables in different portions of the model, rather than applying the same homogeneous algorithm throughout the model. A decision point for analysts when choosing a statistical method to estimate parameters defined by a theoretical model is whether to adopt a full information or limited information estimation strategy.

Statistical Assumptions
Another issue that analysts must consider when moving from equations to estimation is the distributional assumptions of the estimation method and whether they are appropriate for the task at hand. Three assumptions are typical: (1) normality of scores in the population, (2) homogeneity of variances in the population, and (3) independence of replicates. The ways these assumptions are instantiated vary by analytic method, and other assumptions also can come into play (Maxwell & Delaney, 2004). I focus on the above assumptions primarily to make general points about approaches to distributional assumptions. I will make reference to the robustness of a test to violations of underlying assumptions. A statistical test is said to be robust to violations of assumptions if (1) the nominal Type I error rate (alpha level) set by the investigator a priori (usually 0.05) is maintained in the face of assumption violations and (2) the statistical power of the test is relatively unaffected by assumption violations. For a more technical and nuanced discussion of robustness, see Wilcox (2005).

A common perception of many researchers is that traditional F and t tests in analysis of variance (ANOVA) and regression are quite robust to violations of normality and homogeneity of variance assumptions. Because of this, these methods are often applied to data where the population assumptions are tenuous. Studies have shown that ANOVA and regression are not necessarily robust to assumption violations of normality and variance heterogeneity (see Keselman et al., 1998; Maxwell & Delaney, 2004; Wilcox, 2005). One strategy for dealing with assumption violations is to perform a preliminary test of the viability of the assumption in question and, if the test suggests a problem, perform a metric transformation or use a robust analytic alternative. This two-step strategy is controversial for several reasons. First, many preliminary tests lack power without large sample sizes and yield nonsignificant results for testing an assumption violation, even when the violation is problematic (Wilcox, Charlin, & Thompson, 1986; Wilcox, 2003). Second, the crucial issue is not whether the null hypothesis of normality or variance homogeneity can be rejected but, rather, estimating the degree to which the assumption is violated and making a decision as to whether the degree of violation is consequential. This requires documenting the magnitude of the assumption violation in the sample data and then taking sampling error into account when making decisions. For example, we might find that a variance ratio comparing the variances of two groups is 4.0, with a margin of error of plus or minus 3.0. The margin of error suggests that the variance ratio could be as large as 7.0, which could be problematic. Unfortunately, it is rare for researchers to take
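The logic of documenting the magnitude of a violation, rather than merely testing it, can be sketched as follows. The code below (hypothetical data; a simple percentile bootstrap stands in for a formal margin of error) computes a two-group variance ratio and an interval conveying its sampling error:

```python
# Sketch: documenting the magnitude of a variance-homogeneity violation.
# We compute the sample variance ratio for two groups and use a simple
# percentile bootstrap to gauge its sampling error. Data are hypothetical.
import random

random.seed(2)
g1 = [random.gauss(0, 1.0) for _ in range(60)]   # population SD = 1
g2 = [random.gauss(0, 2.0) for _ in range(60)]   # population SD = 2

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

ratio = var(g2) / var(g1)          # population value would be 4.0

# Bootstrap the ratio to see how much it varies under resampling
boots = []
for _ in range(2000):
    b1 = [random.choice(g1) for _ in g1]
    b2 = [random.choice(g2) for _ in g2]
    boots.append(var(b2) / var(b1))
boots.sort()
lo, hi = boots[50], boots[1949]    # rough 95% interval

print(round(ratio, 1), round(lo, 1), round(hi, 1))
```

The width of the interval, not a significance test, is what indicates whether a ratio such as 4.0 might plausibly be as large as 7.0 in the population.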
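Censoring, described earlier with the bathroom-scale example, can also be sketched in a couple of lines (the weights are hypothetical):

```python
# Sketch: censoring at the top of a scale's range. A bathroom scale
# that reads at most 300 pounds records min(weight, 300), so values
# above the limit are known only to be "at least 300."
true_weights = [150, 240, 350, 410]   # hypothetical true weights
scale_limit = 300
observed = [min(w, scale_limit) for w in true_weights]
print(observed)    # [150, 240, 300, 300]
censored = [w == scale_limit for w in observed]
print(censored)    # [False, False, True, True]
```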
results of an analysis when the case is deleted. Such approaches, however, can be nondiagnostic when multiple outliers are present. For example, if there are two individuals in an analysis who distort a mean upward, deleting only one of them may not reveal an "outlier" effect as long as the second outlier is still in the data. Only when both outliers are removed is their distorting character revealed. Outlier identification is a complex enterprise, with some of the most sophisticated work being pursued in the literature on robust statistics (see Wilcox, 2003, 2005, for elaboration).

Wilcox (1998, 2006) objects to applying traditional inferential methods to data that have eliminated outliers based on simple outlier detection methods. He argues that doing so invalidates the statistical theory on which the inferential tests are based because of dependencies that outlier elimination creates. Others recommend conducting analyses with and without outliers to determine if conclusions change. If conclusions do change, then one moves forward with any conclusions on a tentative basis. Probably the most effective strategy for dealing with outliers is to focus on parameter estimation methods that are resistant to outliers. For an introduction to these methods, see Wilcox (2005).

Model Selection
Model selection refers to the process of choosing what one believes is the "best" model for describing a phenomenon from a set of candidate models. The set of plausible models might consist of many models or it might consist of just one or two models. Criteria for model selection can be considered in general terms, using a mindset of specifying criteria to evaluate the overall quality of a theory in general; or they can be discussed in specific, quantitative terms when choosing between competing models within an experiment. I consider both perspectives.

General Criteria for Evaluating Theories
Consensual validation is one basis by which theories are evaluated. This refers to the degree of consensus among the scientific community about the validity of the theory. If a theory enjoys widespread acceptance, then it is seen as being a "good" theory. Shaw and Costanzo (1982) have argued that three criteria are crucial for the acceptance of a theory. First, the theory must be logically consistent—that is, the theoretical statements within the conceptual system must not be contradictory. Second, the theory must be in agreement with known data and facts. Third, the theory must be testable—that is, a theory must ultimately be subject to empirical evaluation.

In addition to these criteria, Shaw and Costanzo (1982) have specified six additional features of a theory that are desirable but not necessarily critical. First, a theory should be stated in terms that can be understood and communicated to most other scientists. Second, the theory should be parsimonious in that it adequately explains a phenomenon but with a minimum of concepts and principles. All other things being equal, preference is given to theories that make fewer assumptions. Third, although we recognize that theories are occasionally so novel that they upset the theoretical apple cart, a theory should be consistent with other accepted theories that have achieved consensus among the scientific community—that is, it should be able to be integrated into existing bodies of theory. A fourth desideratum is scope. Other things being equal, the greater the range of the theory, the better it is thought to be. That said, there are times when narrow range theories tend to hold up better over time than broad range theories. Creativity or novelty is a fifth criterion sometimes suggested for evaluating a theory. A theory that explains the obvious is generally not as highly valued by the scientific community as one that provides a novel insight into an interesting phenomenon. Finally, many scientists suggest that a good theory is one that generates research activity.

Brinberg and McGrath (1985) have noted that the various theory desiderata sometimes conflict with each other. For example, parsimonious theories tend to be more limited in scope. As such, theorists often make trade-offs as they construct theories to maximize what is valued by the scientific community.

Choosing Between Models in a Given Study
As one pursues the theory construction process, one may develop competing theories that either make opposite predictions or that account for the same phenomena but using different explanations and assumptions. Faced with such scenarios, we design research to help us choose between the competing theories. Many scientists consider research that chooses between two or more logical and plausible theories to be inherently more interesting than studies that yield results regarding a single theory (Platt, 1964).

In some cases, competing models making qualitatively opposite predictions about how data should
but tedious, algebra that the one-factor model predicts that the correlations should pattern themselves in a certain way—that is, they should follow certain mathematical regularities. For example, if we choose any two columns and ignore entries where a 1.00 occurs, then the ratios of the respective row entries for the two columns should be equal in value to the comparable ratio for any row in the matrix. For example,

For columns 1 and 2:
.63/.56 = .54/.48 = .45/.40 = 1.125

and

For columns 2 and 5:
.72/.45 = .56/.35 = .48/.30 = 1.600.

By contrast, models that posit more complex patterns of underlying determinants (e.g., a two-factor model, a three-factor model) predict a different set of regularities, which we do not elaborate here in the interest of space (see McDonald, 1985). The model selection task is to choose which of the competing models (e.g., a one-factor model, a two-factor model, etc.) is most consistent with the data.

Statisticians have developed quantitative indices of fit that represent the overall degree of correspondence between the pattern of data predicted by a model and the observed data pattern. When there are competing models, a fit index is derived for each model and then the fit indices are compared to identify the model that best accounts for the data. The technical details of such comparisons can be quite complex, and these are elaborated throughout the many chapters in this volume. The point I am emphasizing here is that even if two models do not make qualitatively opposite predictions, as was the case for the example on summation versus averaging, models often can be compared in terms of their relative consistency with the data, with one model ultimately being declared as being more consistent with the data than another model based on the comparison of quantitatively defined fit indices.

There is, however, an additional complication in model selection. In practice, it would be surprising if sample data yielded a set of correlations that perfectly followed the predicted regularities of, say, a one-factor model, even if the one-factor model was operative in the population. This is because sample correlations are subject to sampling error and will randomly deviate to a greater or lesser extent from the true population correlations. Even if this is the case, one expects that if the one-factor model is true in the population, then the sample correlation matrix should at least reasonably approximate the regularities predicted by the one-factor model. A challenge for scientists when comparing models vis-à-vis indices of fit (or, for that matter, evaluating a single model using an index of fit) is to take into account such sampling error.

A final strategy that one encounters when scientists evaluate competing models is the case where scientists prefer the model that explains the most variation in an outcome variable. In these cases, one typically cannot differentiate models in terms of degree of fit to the data, as in the previous examples we considered. Nevertheless, one can differentiate between them in terms of whether one model accounts for significantly more variation in an outcome than the other model. For example, when explaining adolescent drug use, one might compare a model that assumes drug use is a simple additive function of gender and grade in school versus a model that assumes drug use is an additive function of gender, grade, plus the interaction effect between gender and grade. If the models account for about the same amount of variation in drug use, then preference for the first model will take precedence on grounds of parsimony and because taking into account the interaction effect does not seem to matter. By contrast, if the three-parameter model explains substantially more variation in drug use than the two-parameter model, then the three-parameter model is preferred.

As scientists consider the complex issues surrounding model selection, a paramount concern is not to mistakenly embrace a model that is misspecified. A misspecified model is a model that is wrong because it (1) left out an important variable from the model whose omission biases parameter estimates and leads to faulty conclusions (also called left-out-variable-error or LOVE problems); (2) assumed one type of function between variables (e.g., linear) when, in fact, a different type of function was operative (e.g., a threshold function), with such misspecification leading to faulty conclusions; and/or (3) incorrectly modeled the temporal dynamics within a model in such a way that parameter estimates and conclusions are non-trivially biased. Throughout this volume, chapters elaborate on issues of model misspecification and ways to gain perspectives on it. Cudeck (1989) has argued that the crucial issue is not whether a model is misspecified, because misspecification is inevitable in so much of the research we conduct. Rather, the more central issue is whether the degree of misspecification is
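The column-ratio regularity described in the model selection discussion can be verified directly. Under a one-factor model the correlation between measures i and j is the product of their loadings, so ratios of column entries (skipping the 1.00 diagonal cells) are constant. The loadings below are a hypothetical set chosen to reproduce the quoted correlations:

```python
# Sketch: the regularity a one-factor model implies for a correlation
# matrix. Under one factor, r_ij = l_i * l_j, so the ratio of any two
# columns' entries (ignoring cells involving the 1.00 diagonal) is
# constant. Loadings are hypothetical, chosen to match the quoted values.
loadings = [0.9, 0.8, 0.7, 0.6, 0.5]
k = len(loadings)
R = [[1.0 if i == j else loadings[i] * loadings[j] for j in range(k)]
     for i in range(k)]

def column_ratios(c1, c2):
    """Ratios of column c1 to column c2, skipping rows touching a 1.00."""
    return [R[row][c1] / R[row][c2]
            for row in range(k) if row not in (c1, c2)]

print([round(r, 3) for r in column_ratios(0, 1)])  # all 1.125
print([round(r, 3) for r in column_ratios(1, 4)])  # all 1.6
```

A two-factor model would break this constancy, which is what makes the pattern a testable implication rather than a tautology.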
Keyes, T., & Levy, M. S. (1997). Analysis of Levene's test under design imbalance. Journal of Educational and Behavioral Statistics, 22, 227–236.
Lewis, D. (2000). Causation as influence. Journal of Philosophy, 97, 182–197.
Long, S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective. Mahwah, NJ: Erlbaum.
McDonald, R. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.
Milligan, G. (1987). The use of the arc-sine transformation in the analysis of variance. Educational and Psychological Measurement, 47, 563–573.
Morgan, S., & Winship, C. (2007). Counterfactuals and causal inference. New York: Cambridge University Press.
Pap, A. (1962). An introduction to the philosophy of science. Glencoe, IL: The Free Press of Glencoe (MacMillan).
Parra-Frutos, I. (2009). The behaviour of the modified Levene's test when data are not normally distributed. Computational Statistics, 24, 671–693.
Pearl, J. (1999). Causality: Models, reasoning, and inference. New York: Cambridge University Press.
Platt, J. (1964). Strong inference. Science, 146, 347–353.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.
Russell, B. (1931). The scientific outlook. London: George Allen & Unwin.
Schumacker, R. E., & Lomax, R. G. (2004). A beginner's guide to structural equation modeling. Mahwah, NJ: Erlbaum.
Shadish, W. R., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591–611.
Shaw, M., & Costanzo, P. (1982). Theories of social psychology. New York: McGraw-Hill.
Sheth, J. N. (1967). A review of buyer behavior. Management Science, 13, B718–B758.
Simon, H. A., & Newell, A. (1956). Models: Their uses and limitations. In L. D. White (Ed.), The state of the social sciences (pp. 61–83). Chicago, IL: University of Chicago Press.
Sowa, J. (2000). Processes and causality. From www.jfsowa.com/ontology/causal.htm. Accessed August 15, 2011.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Wilcox, R. R. (1996). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods. American Psychologist, 53, 300–314.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic Press.
Wilcox, R. R. (2006). Graphical methods for assessing effect size: Some alternatives to Cohen's d. Journal of Experimental Education, 74, 353–367.
Wilcox, R. R., Charlin, V. L., & Thompson, K. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F* statistics. Communications in Statistics: Simulation and Computation, 15, 933–944.
Yuan, K.-H., & Schuster, C. (2012). Overview of statistical estimation methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 360–386). New York: Oxford University Press.
Abstract
This chapter strives to enliven quantitative psychology teaching and encourage statistical literacy.
Summaries of quantitative training, at the undergraduate and graduate levels, offer guidelines to
improve instruction and call for greater emphasis on measurement, research, and quantitative
methods. Strategies for effectively teaching quantitative psychology are suggested, including active,
hands-on learning for engaging students, e-learning and Web-based instruction, mentoring and role
models, and encouraging conceptual understanding. Quantitative students are encouraged to focus on
the nature of research questions, similarities and differences in statistical methods, and interpreting
findings with statistical tests, effect sizes, and confidence intervals. Future directions are offered
regarding model building, quantitative learning beyond the classroom through workshops, membership
in quantitative societies, and reading the quantitative literature. The reach of quantitative training
should be widened to more readily include those from disadvantaged, early career, and
under-represented groups to further strengthen the field and enlighten greater numbers about the
wonders of quantitative psychology.
Key Words: quantitative training, statistical literacy, engaging students, strategies for teaching, active
learning, mentoring, underlying concepts, widening quantitative reach
pre-course quantitative skill in a class with engaging activities to get students involved, suggesting that it is worthwhile to try to reduce quantitative anxiety by making quantitative learning more appealing (Harlow, Burkholder, & Morrow, 2002).

Thus, it is probably not extraordinary that I find it my greatest joy to try to engage and enlighten students with the remarkable perspective of quantitative thinking and analysis. I go so far as to argue that any student interested in psychology has the making of a quantitative scientist, if only at some latent, implicit level. What student has not posed a possible theory for why something is, or is not, a certain way? Doesn't everyone wake up each morning with one or more hypotheses about how the day will emerge and which variables or factors will help bring about the desired effect, whether small, moderate, or large? And don't we all think about how much confidence we can place on our hypotheses or expectations, depending on specific mitigating or confounding circumstances? I believe that we all engage in these activities, however formal or informal, in which quantitative psychology can play a pivotal role. Quantitative psychology has the potential to empower and enlighten with the training, skills, reasoning, and capabilities to formalize these kinds of questions, helping us to describe and understand the essence of myriad data that come our way. With more advanced quantitative training, we could further develop and analyze intricate theoretical models that help explain and predict complex processes and behaviors, integrating information necessary to inform and improve interventions, policies, and the human condition (Harlow, 2010; Rodgers, 2010).

In this chapter, I highlight some major issues and topics that go into teaching quantitative psychology. First, I provide an overview of quantitative training to get an overarching picture of the field. Second, I suggest a number of issues that should be taken into account when teaching quantitative courses, and review research that investigates ways to address these concerns. Third, I offer several themes that run through many statistical methods to help tie together the numerous, seemingly isolated and obscure quantitative procedures. Finally, I summarize the current field of quantitative teaching and give recommendations for other options to supplement, enrich, and expand statistical learning.

Overview of Quantitative Training

Aiken et al. (1990, 2008) surveyed more than 200 graduate training programs regarding statistics, measurement, and methodology. Only 15% of the graduate programs had a quantitative area, although about 50% offered a minor in quantitative studies. Almost all of the graduate programs required students to take at least one or two quantitative courses during their first year, although more than 25% sent students to other departments to get this training. Required quantitative courses usually included basic analysis of variance (ANOVA) and some regression, with half to two-thirds of programs including limited training in measurement or research. Computer training in SPSS and SAS is often part of the curriculum, with the possibility of exposure to EQS, AMOS, Mplus, LISREL, and other software in more advanced courses (e.g., structural equation modeling) in about half of the programs. Over a span of almost two decades between surveys, quantitative training had not improved much; over time there was slightly more emphasis on measurement and deterioration in coverage of research design. Much improvement in advanced methods training is still needed in most departments to impart the expertise needed to effectively compete in quantitative studies and fill needed positions in the workforce.

It is becoming readily apparent that efforts are needed to increase the number of quantitatively trained individuals. A recent report from an American Psychological Association (APA) Task Force to Increase the Quantitative Pipeline (Aiken et al., 2007) noted that in recent history, whereas there were approximately two applicants for each non-quantitative doctoral-level job (e.g., cognitive, developmental, social), there were almost 2.5 jobs for every quantitatively trained PhD student. Further, the focus of quantitative methodology is moving away from emphasis on learning a set of specific methods and procedures and instead placing greater priority on developing a broader vision of quantitative science through theory building, modeling underlying processes, and integrating information across meta-analytic studies (e.g., Harlow, 2010; Jaccard & Jacoby, 2009; McGrath, 2011; Rodgers, 2010; Rosenthal & DiMatteo, 2001). Consistent with this vision is a focus on encouraging greater statistical reasoning, thinking, and literacy, rather than rote learning (Gal, 2003; Garfield & delMas, 2010; Ridgway, Nicholson, & McCusker, 2007). Researchers are realizing that quantitative literacy is a needed goal in undergraduate studies, beginning with the most basic, introductory statistics courses (Ben-Zvi & Garfield, 2004; Mulhern & Wylie, 2004, 2006; Rumsey, 2002; Watson, 2006). Resources are becoming available to help encourage
106 teaching quantitative psychology
quantitative thinking (Garfield & Ben-Zvi, 2008; Saville, Zinn, Lawrence, Barron, & Andre, 2008; Stark & Krause, 2009), along with a realization of the challenges involved (Ben-Zvi & Garfield, 2004; Sen, 2004).

Garfield and colleagues from the University of Minnesota developed a program in Quantitative Methods in Education that is at the forefront of improving undergraduate statistical training. Much of their research focuses on training statistics teachers (Garfield & Everson, 2009), reforming the way introductory statistics is taught (Garfield, Hogg, Schau, & Whittinghill, 2002), and investigating how to teach and learn statistics (Garfield & Ben-Zvi, 2007). Other researchers (Friedrich, Buday, & Kerr, 2000) also help outline the field by surveying undergraduate programs across the country with respect to quantitative training. In the next section, I review research focusing on strategies for teaching quantitative psychology (e.g., Gelman & Nolan, 2002), including a set of guidelines developed by Garfield and colleagues (Franklin & Garfield, 2006).

Strategies for Teaching Quantitative Psychology

There is an abundance of resources to aid instructors of statistics courses, including best practice volumes (e.g., Dunn, Smith, & Beins, 2007; Hulme, 2007; Hulsizer & Woolf, 2009), compilations of research on teaching statistics (e.g., Bjornsdottir & Garfield, 2010; Zieffler, Garfield, Alt, Dupuis, Holleque, & Chang, 2008), learning objectives for introductory statistics and research (Tomcho, Rice, Foels, Folmsbee, Vladescu, Lissman, Matulewicz, & Bopp, 2009), and suggestions for teaching quantitative courses (Garfield & Everson, 2009; Ware & Johnson, 2000). In this section, I present several strategies that have been researched and recommended to improve statistical literacy.

Active Learning

Research has demonstrated that actively involving students significantly improves performance in quantitative courses (e.g., Helman & Horswill, 2002). Guidelines to improve statistical learning (e.g., encouraging statistical thinking, using examples with technology and real data, emphasizing concepts, promoting active learning) reiterate this approach of engaging students in the process (Everson, Zieffler, & Garfield, 2008; Franklin & Garfield, 2006). Humor is also a great icebreaker, catching students' attention (e.g., Cobb, 1999, 2007; Zeedyk, 2006), helping to diffuse tension, and surreptitiously calling students back to quantitative learning.

Creating a lively environment is essential for capturing the interest of students. A number of researchers have emphasized the importance of hands-on and interactive learning (Dolinsky, 2001; Kolar & McBride, 2003; Wulff & Wulff, 2004). A simple strategy could be to invite students to write down what is clear and not clear at the end of each lecture, with the faculty clarifying unclear points at the beginning of the next lecture (Harlow, Burkholder, & Morrow, 2006). It takes little time, and students get fairly immediate feedback on how to clear up misunderstandings, whereas faculty get a clearer idea of what the students understood and what needs to be clarified. Moreover, involving students with creative examples (Chew, 2007; Schwartz & Martin, 2004) and analyzing data (e.g., Nie & Lau, 2010; Watkins, Scheaffer, & Cobb, 2004) bring about more in-depth learning than traditional lecture-based approaches. Demonstrating visual images and graphs of procedures also helps improve understanding (e.g., Peden, 2001).

If there is more time, then students can be arranged into small groups and be given a research scenario in which they need to consult among themselves to recommend statistical procedures to address the research question (Harlow, Burkholder, & Morrow, 2006). For a basic example, students could be asked how to assess whether men and women differed on hours of exercise per week (i.e., with a two-sample independent t-test) or whether the number of hours worked at an outside job was related to GPA (i.e., with a correlation). A number of researchers have recommended engaging students in these cooperative small groups (DaRos-Voseles, Collins, Onwuegbuzie, & Jiao, 2008; Onwuegbuzie, Collins, & Jiao, 2009; Peterson & Miller, 2004) or learning communities (e.g., Barron, Benedict, Saville, Serdikoff, & Zinn, 2007) where students work together to understand quantitative material and immerse themselves in the process. Krause, Stark, and Mandl (2009), on the other hand, have found that cooperative groups did not directly improve statistics performance, although students reported greater perceived efficacy when working with others. It may be that group learning is not effective for all students, with more advanced students possibly benefiting the least. For example, Harlow, Burkholder, and Morrow (2002) found that learning activities
harlow 107
that included working in groups and with peer mentors were viewed more favorably when students were more anxious and had lower confidence about quantitative learning. Future research could investigate whether ability or achievement level is a moderator for group learning and performance, to clarify who benefits from group learning.

Technology and Learning

Other research has revealed the value of technology in heightening quantitative learning using computer-assisted analysis (Bartz & Sabolik, 2001), Web-based tutorials (Bliwise, 2005), and specific online learning programs such as Estimating Statistics (EStat; Britt, Sellinger, & Stillerman, 2002), Simulation-Assisted Learning Statistics (SALS; Liu, Lin, & Kinshuk, 2010), the Utah Virtual Lab (Malloy & Jensen, 2001), Statistical Understanding Made Simple (SUMS; Swingler, Bishop, & Swingler, 2009), or the Web Interface for Statistics Education (WISE; Berger, n.d., http://wise.cgu.edu/). In a meta-analysis of 45 studies, there was a small- to medium-sized Cohen's d (i.e., 0.33) performance benefit attributed to learning statistics with computer-assisted instruction versus learning in a lecture-based control group that did not provide such input (Sosa, Berger, Saw, & Mary, 2010). Thus, students who had technology-enhanced instruction demonstrated performance one-third of a standard deviation higher than that of students without such an advantage; further, the improvement was even more salient for more advanced learners (i.e., graduate students) and when more time was allotted for instruction.

Still, whereas research has demonstrated the benefits of e-learning approaches (e.g., Fillion, Limayem, Laferrière, & Mantha, 2008; Hanley, 2004; Sosa et al., 2010; Wender & Muehlboeck, 2003), more research is needed, as others remain unconvinced of the merits of adding technology to the classroom environment (e.g., Härdle, Klinke, & Ziegenhagen, 2007). Even the value of online discussions, and whether or how much faculty should facilitate or contribute, is not entirely clear (e.g., Mazzolini & Maddison, 2007). Instructors are encouraged to consider the literature on cognitive learning and assessment to improve Web-based materials for students, particularly in quantitative learning (e.g., Romero, Berger, Healy, & Aberson, 2000).

Mentors and Role Models

Mentoring can help students get the extra input needed to understand quantitative concepts (e.g., Ferreira, 2001) and can help to supplement class lectures and faculty input (e.g., Katayama, 2001). Fortunately, graduate teaching assistants (TAs) are often provided for undergraduate- and graduate-level quantitative courses. When TAs are not funded, I have found it very effective to invite one or more top achievers from a previous semester to serve as volunteer peer mentors or TAs, offering independent study or teaching practicum credit. Students in the course benefit from having a peer of similar age demonstrating and facilitating expertise in quantitative methods. TAs or peer mentors gain efficacy and greater confidence, often choosing to become even more involved with other quantitative courses and research to continue building their skills (e.g., Harlow, Burkholder, & Morrow, 2002, 2006).

Mentoring can be particularly valuable for women and individuals from under-represented groups, who have few role models in the quantitative field, for providing direction, support, and encouragement (e.g., Kosoko-Lasaki, Sonnino, & Voytko, 2006; Neal-Barnett, Mitchell, & Boeltar, 2002; Zirkel, 2002).

Conceptual Approach to Teaching

Perhaps the most effective idea for conveying complex quantitative material is to focus on the concepts rather than using a strictly mathematical approach (e.g., Atkinson, Catrambone, & Merrill, 2003). Chiou (2009) found that encouraging students to collaborate on conceptually mapping statistical material significantly improved performance compared to having students complete textbook exercises and calculations. In another study, Aberson, Berger, Healy, and Romero (2003) demonstrated that an interactive approach to hypothesis-testing concepts was received more positively and improved performance over traditional laboratory exercises. Similarly, Meletiou-Mavrotheris and Lee (2002) found that helping students to understand concepts, improve statistical reasoning, and build intuitions about statistical ideas facilitated learning more than a traditional teaching approach.

Presenting and encouraging understanding of core concepts—particularly through hands-on engagement and research—can help foster more in-depth insight and greater involvement in inquiry-based future learning (Aulls & Shore, 2008; Dewey, 1997). Inviting students to seek out solutions to quantitative research problems promotes greater statistical awareness and literacy. In the next section, I present a set of conceptual themes that are common
to many statistical methods and that help provide a foundation for understanding introductory and advanced quantitative learning.

Themes that Run Through Quantitative Psychology

Quantitative understanding is increased when common ideas are revealed that occur in many statistical procedures. Harlow (2005) has emphasized a number of themes that run through multivariate methods and that can be extended to encompass univariate methods as well. Three basic themes are presented below to facilitate approaching, analyzing, and interpreting quantitative methods.

Considering the Research Question

First, it is helpful to encourage students to consider the kind of research question that needs to be addressed. Group difference questions can be analyzed with basic z-tests when information is known about the mean and the variability in the population. For example, we could investigate whether an exceptional students' class evinced an IQ that was different from the known average of 100, with a standard deviation of 15. When only a population mean is known, or there are two groups involved, a t-test would be appropriate, requiring that a researcher estimate the population standard deviation(s) from the sample(s). For example, a teacher may want to compare the level of interest in taking math courses for male and female students, which could be examined with a two-sample independent t-test. For studies with two or more groups, an ANOVA would be useful for assessing whether there were differences among the multiple groups; to investigate potential differences in math interest among students from three different college majors, for instance, an ANOVA would be helpful. To examine whether several groups differ on math interest after accounting for the number of previous math classes taken, analysis of covariance (ANCOVA) could be used. Further, group differences across several groups could be assessed on several measures (e.g., math interest, math efficacy, and math anxiety) using multivariate analysis of variance (MANOVA). Similar to the way that ANCOVA extends ANOVA by taking into account another predictor that correlates or covaries with the outcome, a multivariate analysis of covariance (MANCOVA) can extend a MANOVA when wanting to examine whether several groups differ on several outcomes (e.g., the three math attitudes suggested for the MANOVA) after taking into account one or more covariates such as the number of previous math courses or GPA.

For those who are not specifically interested in mean differences across groups, a correlational question may be asked. Researchers interested in finding what is associated with achievement could conduct a simple correlation analysis, assessing whether math interest is related to college GPA. This bivariate procedure (between just two variables) could expand to canonical correlation (CC), which allows an examination of two sets of variables. For example, a researcher could examine whether several math attitude scores (e.g., math interest, math efficacy, and math anxiety) are related to several achievement outcomes (e.g., GPA, the number of conferences attended, and the number of memberships in honor societies). Correlation could also be extended to multiple regression (MR) to relate a linear combination of several continuous predictors (e.g., math ability, verbal ability, previous GPA, achievement motivation) to a single outcome such as current GPA. If the outcome were dichotomous (e.g., success or failure in a course or grade level), then the merit of similar predictors could be examined with logistic regression (LR) or discriminant function analysis (DFA), depending on whether there was interest in conveying the odds or the degree of correlation with the outcome, respectively. Multilevel modeling (MLM) would be useful to predict an outcome such as achievement when there is reason to believe that participants are clustered in separate groups (e.g., classrooms, school districts). Thus, MLM allows, and even models, heterogeneity of variance across several groups in the data, contrary to more restrictive prediction methods (e.g., MR, LR, and DFA) that assume that data are drawn from a homogeneous group.

Other, more advanced, research questions could be addressed with more sophisticated methods. Factor analysis (FA) or principal components analysis (PCA) would be useful when faced with a large set of measures, with a goal of identifying a few key dimensions that explain the underlying structure or associations (e.g., quantitative and verbal intelligence dimensions in an IQ test of multiple subtests). Factor analysis would allow the delineation of unique or error variance in measures before forming factors with the variance that was in common among the measures. In contrast, PCA analyzes all of the variance in the variables when forming components. In practice, there may not be much actual difference in results across these two
methods if the loadings are high for relevant variables on their respective factors, even with the initial difference in whether unique variance is included in the analysis (Velicer & Jackson, 1990). Structural equation modeling (SEM) could be used to examine whether latent factors, each with several measures, could be theoretically modeled to explain hypothesized underlying processes. For example, using Bandura's (1997) social learning theory, we could test whether different factors of intelligence (i.e., quantitative and verbal) would predict degree of self-efficacy in learning, which in turn could predict an achievement factor (measured by homework, quiz, and exam scores). In testing the SEM, other covariates (e.g., socioeconomic status, previous GPA) could be added as possible predictors of the mediator (i.e., self-efficacy) to see if they are important, or to rule them out as predictors of achievement. Further, multiple group analyses could examine whether such a predictive SEM of achievement held in different groups (e.g., men versus women, different ethnicities). If findings differed across groups, then it would indicate that the grouping variable moderated the prediction. Longitudinal modeling (e.g., latent growth curve modeling) could examine whether achievement changed over time in mean level and in the rate of change, and whether predictors (e.g., previous GPA, IQ, gender) could predict the level (called the intercept) and rate of change (called the slope) across time. Other methods discussed in this volume could also be examined to further fine-tune the nature of the question being assessed.

Therefore, the main point of the first theme (the type of research question asked) is that it is the nature of the research and the questions asked of the data that drive the kind of quantitative method that is selected. There is a whole world of methods to consider, and the choice is not so daunting when realizing that certain methods readily lend themselves to different kinds of research questions (e.g., group difference, correlation or prediction, underlying structure, longitudinal, etc.). In the next section, similarities and differences among methods are the second theme that is discussed. Here, it will become apparent that although there are distinctions among methods that make them more likely to be applied to specific research questions, many quantitative procedures share similarities in how they reveal the essence of the data. For example, researchers who find a significant difference in well being between groups who exercise regularly and those who do not will also find that there is a significant relationship between well being and exercise. Thus, both group difference and correlational methods can examine how much variance is shared between independent and dependent variables. In this regard, Cohen, Cohen, West, and Aiken (2003) have described how categorical grouping variables can be coded for use in correlational and regression procedures.

Noting Similarities and Differences in Quantitative Methods

Examining research questions and seeing specific methods that seem relevant to such questions leads into a second theme, which is to notice the similarities and differences among various quantitative procedures. Quilici and Mayer (2002) helped students notice the underlying similarities in different statistical word problems. This helped students know how to analyze specific problems by noting their similarities to other statistical problems that were addressed by certain methods. Similarly, Derryberry, Schou, and Conover (2010) helped students understand how to conduct rank-based statistical tests by revealing their resemblance to already studied parametric tests.

Quantitative methods can be classified in several ways to help delineate common aspects or distinguishing differences. Thus, group-difference methods (e.g., z-test, t-test, ANOVA, ANCOVA, and MANOVA) are similar in allowing an examination of potential mean differences between groups on one or more outcome variables. In contrast, correlational methods (e.g., Pearson's r, CC) do not tend to focus on means but, rather, on assessing association between independent and dependent variables. Prediction methods (e.g., MR, LR, DFA, and MLM) all focus on predicting an outcome from a set of predictors and may place some emphasis on grouping variables—particularly LR and DFA, which explicitly include a categorical dependent variable, and MLM, which takes into account different levels or groups in the data. However, each of these prediction methods differs somewhat from ANOVA-based methods in focusing more on weights linking independent and dependent variables and less on mean scores such as group averages or centroids. Dimensional or structural methods (e.g., FA, PCA) are similar in involving one or more sets of measures, with a smaller set of underlying factors or components posited or revealed to help explain the patterns of relationships among variables. Thus, dimensional methods are similar to correlational methods in their
focus on associations among variables, and different from group difference methods, which are more concerned with mean differences among groups. SEM combines the best of group difference, correlation/prediction, and structural methods by allowing an investigation of the means of latent factors across different samples or groups, while still allowing an examination or confirmation of the structural or correlational relationships among variables. Longitudinal methods (e.g., time series, latent growth modeling) add the element of time to allow an examination of temporal ordering that can help in assessing causal precedence among variables.

The main point of this second theme is to encourage students to see how various quantitative methods are similar and how they are different. Discerning these distinctions and similarities will go a long way toward highlighting the overriding features and underlying aspects of quantitative methods and in selecting an appropriate method to address a specific kind of research question.

Interpreting Findings from Quantitative Methods

Third, after identifying a research question and noticing similarities and differences that lead to selecting a method to analyze one's data, it is important to examine and interpret findings from multiple perspectives. Although the field of quantitative psychology has traditionally focused largely on significance testing, there is growing emphasis on considering other alternatives such as effect sizes and confidence intervals (e.g., Harlow, Mulaik, & Steiger, 1997; Kline, 2004; Wilkinson & The Task Force on Statistical Inference, 1999). Initially, it can be helpful to assess whether the overall variation in the data is significantly different from chance. Significance tests usually involve some ratio of variances. We can think of group difference models as explaining the ratio of how scores vary among groups, relative to how much scores vary within each group. This ratio can be readily recognized as the F-test. The closer this ratio is to 1.0, the less we are apt to consider between-groups variance as anything more meaningful than the within-group error variance. Correlational methods examine whether the covariance between variables is salient when contrasted with the variances within each variable. Thus, a correlation is simply a ratio of the covariance between variables over the product of the standardized variance (i.e., the standard deviations) within each variable. Using an analogy relating individuals, when the covariance between two people is salient, despite their individual variance (or uniqueness), a meaningful relationship emerges.

After assessing significance, it is important to examine the magnitude of an overall finding, called an effect size (ES). For group-difference methods, an ES can indicate the number of standard deviation units of difference there is among group means (i.e., Cohen's d; Cohen, 1988), with small, medium, and large effects having values of about 0.20 (almost a quarter of a standard deviation), 0.50 (half a standard deviation), and 0.80 (almost a full standard deviation), respectively (Cohen, 1988). For correlational methods, we would hope to show a meaningful relationship between pertinent variables, with values of 0.1, 0.3, and 0.5 indicating small, medium, and large Pearson product-moment correlations, respectively. Particularly for prediction methods, as well as other methods, proportion-of-shared-variance effect sizes (e.g., η² or R²) are useful for showing how much the independent and dependent variables have in common, ranging in size from 0 to 1.0. Proportion of variance ES values of 0.01, 0.09, and 0.25 indicate small, medium, and large univariate effects, respectively (obtained by squaring 0.1, 0.3, and 0.5 correlations, respectively); and 0.02, 0.13, and 0.26 or more refer to small, medium, and large multivariate effect sizes, respectively (e.g., Cohen, 1988; Harlow, 2005).

When interpreting an ES, it is also important to provide an indication of the margin of error with confidence intervals (e.g., Cumming & Fidler, 2009; Gilliland & Melfi, 2010; Odgaard & Fowler, 2010). Confidence intervals provide a range of values, with narrower intervals indicating more precision for an estimated effect. Research journals such as the Journal of Consulting and Clinical Psychology and others are beginning to require that statistical findings be supplemented with effect sizes and confidence intervals (e.g., La Greca, 2005; Odgaard & Fowler, 2010).

Thus, the third theme, concerned with interpreting results, involves providing an indication as to whether a result is significantly different from chance, the magnitude of the effect, and the degree of certainty about the result.

Ultimately, understanding quantitative themes and concepts (e.g., Abelson, 1995), including framing initial research questions; noticing similarities and differences when selecting statistical methods; and interpreting the significance, extent, and precision of one's findings, will go a long way toward moving psychology more to the forefront of quantitative science.
harlow 111
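The ideas above, that an F-test is a ratio of between-groups to within-groups variance, that a correlation is the covariance divided by the product of the standard deviations, and that wider samples yield narrower confidence intervals, can be sketched in a few lines of code. This is an illustrative sketch, not part of the chapter; the function names and the Fisher z-based interval for a correlation are choices made here for the example.

```python
import math
import statistics

def f_ratio(groups):
    """One-way ANOVA F: between-groups mean square over within-groups mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def pearson_r(x, y):
    """Correlation as the covariance over the product of the standard deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z-transformation.

    Larger samples give narrower intervals, i.e., a more precise estimate."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Cohen's (1988) benchmarks: correlations of 0.1, 0.3, and 0.5 correspond to
# proportion-of-variance (r squared) values of 0.01, 0.09, and 0.25.
```

As a check on the text's benchmarks, squaring a medium correlation of 0.3 gives the medium proportion-of-variance value of 0.09, and an F-ratio near 1.0 means the between-groups variance is no more meaningful than within-group error.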
Conclusions

This chapter features a number of issues regarding the teaching of quantitative psychology. First, research on quantitative training underscores the need to provide greater guidelines and opportunities for learning more about measurement, research, and statistical methods (e.g., Franklin & Garfield, 2006; Garfield & Everson, 2009). Job opportunities are more available than are applicants for quantitative careers (e.g., Aiken et al., 1990, 2008). An APA task force on increasing the quantitative pipeline (Aiken et al., 2007) is aimed at bringing more people into this field to fill this need.

Second, strategies for effectively teaching quantitative psychology emphasize the advantages of active learning (e.g., Onwuegbuzie, Collins, & Jiao, 2009), technology and Web-based instruction (e.g., Bliwise, 2005; Fillion, Limayem, Laferrière, & Mantha, 2008; Sosa et al., 2010), mentoring and role models (e.g., Ferreira, 2001), and conceptual understanding (e.g., Mulhern & Wylie, 2004; Swingler, Bishop, & Swingler, 2009). All of these strategies implicitly invite students to become more immersed in the quantitative learning process and ultimately to become more statistically literate.

Third, several themes have been offered to bring about greater clarity in quantitative learning (Harlow, 2005). One of the themes includes considering the nature of the research question. Students should be encouraged to ask whether their research question involves group differences, correlation, prediction, underlying structure, or longitudinal studies. Another theme encourages students to notice the similarities and differences among quantitative procedures, as this kind of awareness is helpful in selecting appropriate methods to analyze data and address research questions. Still another theme presents suggestions for interpreting quantitative findings using statistical tests, ES, and confidence intervals.

Being aware of quantitative training practices, encouraging more engaged learning, and focusing on conceptual thinking and underlying themes can help us feature and spotlight quantitative psychology as a highly worthwhile and exciting field in which to take part.

Future Directions

A number of future directions are suggested to point out areas in which quantitative teaching is emerging and needs to grow. First, teaching should focus on actively engaging students and helping them to understand basic, underlying concepts. This would also involve more emphasis on modeling overarching constructs and processes rather than limiting teaching to narrow and isolated methodological procedures (Embretson, 2010; McGrath, 2011; Rodgers, 2010). Building quantitative models of behavior will generate the more integrated and explanatory understanding that is necessary to keep psychology scientifically rigorous.

Second, students, faculty, and professionals should be encouraged to explore options for furthering quantitative learning outside of the classroom. For example, quantitative journals (e.g., Multivariate Behavioral Research, Psychological Assessment, Psychological Methods, Structural Equation Modeling) as well as quantitative handbooks (e.g., Cooper et al., 2012), statistical dictionaries (e.g., Dodge, 2003; Everitt, 2002; Upton & Cook, 2008), and even online resources on quantitative methods (e.g., Journal of Statistics Education, http://www.amstat.org/publications/jse/) provide opportunities for further quantitative study. There are also numerous quantitative volumes that provide input and guidelines for understanding the main aspects of various quantitative methods (Hancock & Mueller, 2010), including multiple regression/correlation (Cohen, Cohen, West, & Aiken, 2003), multivariate statistics (Stevens, 2009), statistical mediation (MacKinnon, 2008), missing data analysis (Enders, 2010), and structural equation modeling (e.g., Kline, 2011), among others. Those interested in more advanced statistical material could venture into the British Journal of Mathematical and Statistical Psychology and Psychometrika, or even the Millsap and Maydeu-Olivares (2009) volume on Contemporary Psychometrics, realizing that these latter sources would not be as accessible as the more translational articles in the former outlets.

Those interested in additional training could consider workshops (e.g., the University of Kansas Summer Institute held by Todd Little) and annual conferences featuring quantitative presentations (e.g., Division 5 talks at the APA convention, Division D talks at the American Educational Research Association convention, and presentations and workshops at the Association for Psychological Science). Websites focusing on quantitative material (e.g., the UCLA statistical website, http://www.ats.ucla.edu/stat/, and a similar one at Georgetown, http://statpages.org/) and online tutorials in quantitative methods (Dinov & Christou, 2009; Garfield & delMas, 2010) offer additional occasions to heighten quantitative proficiency. Further, there are quantitative societies that provide
112 t e a c h i n g q u a n t i tat i v e p s yc h o l o g y
useful opportunities and contacts (e.g., American Psychological Association, Division 5: Evaluation, Measurement, and Statistics; American Statistical Association; Psychometric Society; Society for Mathematical Psychology; and the Society of Multivariate Experimental Psychology), including international statistical societies (e.g., European Association of Methodology, International Association for Statistical Education, International Statistical Institute, Society for Multivariate Analysis in the Behavioral Sciences). Participating in quantitative forums helps to enlarge quantitative networks, providing a stronger foundation to further one's skills.

Third, quantitative teaching should be widened to include individuals from low-income (Kitchen, DePree, Celedón-Pattichis, & Brinkerhoff, 2007) and under-represented groups (Kosoko-Lasaki, Sonnino, & Voytko, 2006). We also need to consider whether traditional approaches to quantitative teaching have helped to shrink or amplify the differences among mainstream and marginalized students (Ceci & Papierno, 2005). Other offerings can be encouraged (e.g., Quantitative Training for Underrepresented Groups) to help bring about greater equity and effectiveness that can open doors to more career options in the field of quantitative science to an even larger and more diverse group of individuals.

[Figure 6.1. Bringing about statistical literacy through quantitative teaching, other quantitative learning, and widening the quantitative reach. Panels: quantitative teaching (active learning, e-learning, mentoring); quantitative concepts (research questions, similarities and differences, interpreting results); other quantitative learning (models, workshops, societies, literature); widening the quantitative reach (disadvantaged, early career, under-represented); all leading to statistical literacy.]

Fourth, it is important to reach out to quantitative researchers who are newly graduated to encourage them to become involved in the larger field. Early career psychologists make up about 10% of the 150,000 members of the APA (2010), and Division 5 (measurement, statistics, and evaluation) involves approximately 1% of APA members (i.e., Gruber, 2010). The APA Quantitative Task Force (Aiken et al., 2007) helps in this regard, with a website (http://www.apa.org/research/tools/quantitative/index.aspx) that provides definitions of the areas of focus within quantitative methods, recommendations for preparing for quantitative training, and lists of quantitative graduate programs that can help in recruiting individuals into quantitative psychology and in retaining those who are already participating.

Finally, continuing efforts to generate greater interest, participation, and performance in quantitative literacy are an important goal. Unfortunately, psychology is not always considered as having the scientific credentials of other disciplines such as physics, chemistry, and biology (Simonton, 2004). In particular, recent efforts to improve technological education in the United States (Chopra, 2010) have focused more on traditionally defined science, technology, engineering, and mathematics that sometimes exclude psychology. Further, research reveals that students from the United States fall in the bottom half regarding quantitative performance when compared to other countries (Gonzales, Guzmán, Partelow, Pahlke, Jocelyn, Kastberg, & Williams, 2003; McQuillan & Kennelly, 2005).

Figure 6.1 depicts the main focus of this chapter, where innovative quantitative teaching (with active learning, e-learning, mentoring, and concepts learning), other quantitative learning (e.g., model building, workshops, quantitative societies, and quantitative literature), and widening the quantitative reach (to the disadvantaged, early career individuals, and the under-represented) can help bring about greater statistical literacy. More emphasis needs to be placed on involving others in the marvels of quantitative psychology, whether at the undergraduate or graduate level, in the larger profession of psychology, or in the general public. Ideas presented in this chapter, and in the larger handbook, are offered to
help in teaching quantitative psychology, within the classroom, in research centers, and in every facet of life that involves quantitative thinking.

Author Note

Lisa L. Harlow is Professor of Quantitative Psychology at the University of Rhode Island. Please address correspondence to Lisa L. Harlow, Department of Psychology, University of Rhode Island, 142 Flagg Rd, Chafee, Kingston, RI 02881-0808.

Acknowledgments

I wish to acknowledge support from the National Science Foundation (NSF) Grant (# 0720063) on Quantitative Training for Underrepresented Groups; Lisa L. Harlow, Principal Investigator (PI). I would like to thank students, faculty, and staff at the University of Rhode Island and the Cancer Prevention Research Center in Rhode Island for their ongoing support, inspiration, and encouragement. Similarly, I would like to thank Dr. Lisa Dierker (PI) and her colleagues at the Quantitative Analysis Center at Wesleyan University. Serving on the advisory board for their NSF Grant (# 0942246) on An Inquiry-Based, Supportive Approach to Statistical Reasoning and Application has opened up my thinking and those of my students on how to productively engage students in the wonders of quantitative inquiry.

References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Aberson, C. L., Berger, D. E., Healy, M. R., & Romero, V. L. (2003). Evaluation of an interactive tutorial for teaching hypothesis testing concepts. Teaching of Psychology, 30, 75–78.
Aiken, L. S., Aguinis, H., Appelbaum, M., Boodoo, G., Edwards, M. C., Gonzalez, R. D., Panter, A., et al. (2007). Report of the task force for increasing the number of quantitative psychologists. Washington, DC: American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/research/tools/quantitative/quant-task-force-report.pdf.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50.
Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721–734.
APA (2010). Early career psychologists. Washington, DC: American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/membership/index.aspx.
Ashcraft, M. H., & Kirk, E. P. (2001). The relationship among working memory, math anxiety, and performance. Journal of Experimental Psychology: General, 130, 224–237.
Atkinson, R. K., Catrambone, R., & Merrill, M. M. (2003). Aiding transfer in statistics: Examining the use of conceptually oriented equations and elaborations during subgoal learning. Journal of Educational Psychology, 95, 762–773.
Aulls, M. W., & Shore, B. M. (2008). Inquiry in education, volume 1: The conceptual foundations for research as a curricular imperative. New York: Taylor & Francis/Erlbaum.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.
Barron, K. E., Benedict, J. O., Saville, B. K., Serdikoff, S. L., & Zinn, T. E. (2007). Innovative approaches to teaching statistics and research methods: Just-in-time teaching, interteaching, and learning communities. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 143–158). Mahwah, NJ: Erlbaum.
Bartz, A. E., & Sabolik, M. A. (2001). Computer and software use in teaching the beginning statistics course. Teaching of Psychology, 28, 147–149.
Ben-Zvi, D., & Garfield, J. (Eds.) (2004). The challenge of developing statistical literacy, reasoning, and thinking. Dordrecht, Netherlands: Kluwer.
Berger, D. (n.d.). WISE: Web Interface for Statistics Education. Retrieved December 20, 2010 from http://wise.cgu.edu/.
Bjornsdottir, A., & Garfield, J. (2010). Teaching bits: Statistics education articles from 2009. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/garfieldtb.pdf
Bliwise, N. G. (2005). Web-based tutorials for teaching introductory statistics. Journal of Educational Computing Research, 33, 309–325.
Britt, M. A., Sellinger, J., & Stillerman, L. M. (2002). A review of ESTAT: An innovative program for teaching statistics. Teaching of Psychology, 29, 73–75.
Budé, L., Van De Wiel, M. W. J., Imbos, T., Candel, M. J. J. M., Broers, N. J., & Berger, M. P. F. (2007). Students' achievements in a statistics course in relation to motivational aspects and study behaviour. Statistics Education Research Journal [Online], 6, 5–21. Retrieved May 9, 2012 from http://www.stat.auckland.ac.nz/~iase/serj/SERJ6(1)_Bude.pdf
Ceci, S. J., & Papierno, P. B. (2005). The rhetoric and reality of gap closing: When the "have nots" gain but the "haves" gain even more. American Psychologist, 60, 149–160.
Chew, S. L. (2007). Designing effective examples and problems for teaching statistics. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 73–91). Mahwah, NJ: Erlbaum.
Chiou, C.-C. (2009). Effects of concept mapping strategy on learning performance in business and economics statistics. Teaching in Higher Education, 14, 55–69.
Chopra, A. (2010). Building a future for technological innovation. White House Office of Science & Technology Policy. Retrieved January 13, 2011 from http://www.whitehouse.gov/sites/default/files/microsites/ostp/CommClubr081710FINAL.pdf.
Cobb, G. (1999). Move over, Bill Nye: George Cobb makes learning about statistics practical and fun. Mount Holyoke College news and events. Retrieved May 9, 2012 from http://www.mtholyoke.edu/offices/comm/csj/990409/gummy.html.
Cobb, G. W. (2007). The introductory statistics course: A Ptolemaic curriculum? Technology Innovations in Statistics Education, 1, 1–15. Retrieved May 9, 2012 from http://escholarship.org/uc/item/6hb3k0nz#page-15
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd edition). Mahwah, NJ: Erlbaum.
Cooper, H., Camic, P. M., Long, D. L., Panter, A. T., Rindskopf, D., & Sher, K. J. (Eds.). (2012). APA handbook of research methods in psychology (Vols. 1–3). Washington, DC: American Psychological Association.
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie/Journal of Psychology, 217, 15–26.
DaRos-Voseles, D. A., Collins, K. M. T., Onwuegbuzie, A. J., & Jiao, Q. G. (2008). Effect of self-perception on performance of graduate-level cooperative groups in research methodology courses. Journal of Instructional Psychology, 35, 254–259.
Derryberry, D. R., Schou, S. B., & Conover, W. J. (2010). Teaching rank-based tests by emphasizing structural similarities to corresponding parametric tests. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/derryberry.pdf.
DeVaney, T. A. (2010). Anxiety and attitude of graduate students in on-campus vs. online statistics courses. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/devaney.pdf.
Dewey, J. (1997). How we think. New York: Dover Publications.
Dinov, I. D., & Christou, N. (2009). Statistics online computational resource for education. Teaching Statistics, 31, 49–51.
Dodge, Y. (Ed.) (2003). The Oxford dictionary of statistical terms. New York: Oxford University Press.
Dolinsky, B. (2001). An active learning approach to teaching statistics. Teaching of Psychology, 28, 55–56.
Dunn, D. S., Smith, R. A., & Beins, B. (Eds.) (2007). Best practices for teaching statistics and research methods in the behavioral sciences. Mahwah, NJ: Erlbaum.
Embretson, S. E. (Ed.) (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford.
Everitt, B. S. (2002). The Cambridge dictionary of statistics (2nd edition). Cambridge, UK: Cambridge University Press.
Everson, M., Zieffler, A., & Garfield, J. (2008). Implementing new reform guidelines in teaching introductory college statistics courses. Teaching Statistics, 30, 66–69.
Ferreira, M. (2001). Building communities through role models, mentors, and hands-on science. The School Community Journal, 11, 27–37.
Fillion, G., Limayem, M., Laferrière, T., & Mantha, R. (2008). Integrating ICT into higher education: A study of onsite vs. online students' and professors' perceptions. International Journal of Web-Based Learning and Teaching Technologies, 3, 48–72.
Franklin, C., & Garfield, J. (2006). The guidelines for assessment and instruction in statistics education (GAISE) project: Developing statistics education guidelines for grades pre K-12 and college courses. In G. F. Burrill (Ed.), Thinking and reasoning with data and chance: Sixty-eighth annual NCTM yearbook (pp. 345–375). Reston, VA: National Council of Teachers of Mathematics.
Friedrich, J., Buday, E., & Kerr, D. (2000). Statistical training in psychology: A national survey and commentary on undergraduate programs. Teaching of Psychology, 27, 248–257.
Gal, I. (2003). Teaching for statistical literacy and services of statistics agencies. The American Statistician, 57, 80–84.
Garfield, J., & Ben-Zvi, D. (2007). How students learn statistics revisited: A current review of research on teaching and learning statistics. International Statistical Review, 75, 372–396.
Garfield, J., & Ben-Zvi, D. (2008). Developing students' statistical reasoning: Connecting research and teaching practice. New York: Springer.
Garfield, J., & delMas, R. (2010). A web site that provides resources for assessing students' statistical literacy, reasoning and thinking. Teaching Statistics, 32, 2–7.
Garfield, J., & Everson, M. (2009). Preparing teachers of statistics: A graduate course for future teachers. Journal of Statistics Education, 17(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v17n2/garfield.html.
Garfield, J., Hogg, B., Schau, C., & Whittinghill, D. (2002). First courses in statistical science: The status of educational reform efforts. Journal of Statistics Education, 10(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v10n2/garfield.html.
Gelman, A., & Nolan, D. (2002). Teaching statistics: A bag of tricks. New York: Oxford University Press.
Gilliland, D., & Melfi, V. (2010). A note on confidence interval estimation and margin of error. Journal of Statistics Education, 18(1). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/gilliland.pdf.
Gonzales, P., Guzmán, J. C., Partelow, L., Pahlke, E., Jocelyn, L., Kastberg, D., & Williams, T. (2003). Highlights from the trends in international mathematics and science study (TIMSS). U.S. Department of Education, Institute of Education Sciences, National Center for Educational Statistics.
Gruber, C. (Ed.) (2010). The score newsletter, Vol. XXXII, No. 3. Division 5, American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/divisions/div5/pdf/July10Score.pdf.
Hancock, G. R., & Mueller, R. O. (Eds.). (2010). The reviewer's guide to quantitative methods in the social sciences. New York: Taylor & Francis.
Hanley, G. L. (2004). E-learning and the science of instruction. Applied Cognitive Psychology, 18, 123–124.
Härdle, W., Klinke, S., & Ziegenhagen, U. (2007). On the utility of e-learning in statistics. International Statistical Review, 75, 355–364.
Harlow, L. L. (2005). The essence of multivariate thinking: Basic themes and methods. Mahwah, NJ: Erlbaum.
Harlow, L. L. (2010). On scientific research: Invited commentary on the role of statistical modeling and hypothesis testing. Journal of Modern Applied Statistical Methods, 9, 348–358.
Harlow, L. L., Burkholder, G., & Morrow, J. (2002). Evaluating attitudes, skill and performance in a learning enhanced quantitative methods course: A structural modeling approach. Structural Equation Modeling Journal, 9, 413–430.
Harlow, L. L., Burkholder, G., & Morrow, J. (2006). Engaging students in learning: An application with quantitative psychology. Teaching of Psychology, 33, 231–235.
Harlow, L. L., Mulaik, S., & Steiger, J. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
Helman, S., & Horswill, M. S. (2002). Does the introduction of non-traditional teaching techniques improve psychology undergraduates' performance in statistics? Psychology Learning & Teaching, 2, 12–16.
Hulme, J. (2007). Review of Handbook of demonstrations and activities in the teaching of psychology, volume 1: Introductory, statistics, research methods and history. Psychology Learning & Teaching, 6, 164–165.
Hulsizer, M. R., & Woolf, L. M. (2009). Guide to teaching statistics: Innovations and best practices. New York: Wiley-Blackwell.
Jaccard, J., & Jacoby, J. (2009). Theory construction and model building skills: A practical guide for social scientists. New York: Guilford Press.
Katayama, A. D. (2001). Bi-modal instructional practices in educational psychology: Mentoring and traditional instruction. Journal of Instructional Psychology, 28, 171–177.
Kitchen, R. S., DePree, J., Celedón-Pattichis, S., & Brinkerhoff, J. (2007). Mathematics education at highly effective schools that serve the poor: Strategies for change. Mahwah, NJ: Erlbaum.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd edition). New York: Guilford.
Kolar, D. W., & McBride, C. A. (2003). Creating problems to solve problems: An interactive teaching technique for statistics courses. Teaching of Psychology, 30, 67–68.
Kosoko-Lasaki, O., Sonnino, R. E., & Voytko, M. L. (2006). Mentoring for women and underrepresented minority faculty and students: Experience at two institutions of higher education. Journal of the National Medical Association, 98(9), 1449–1459.
Krause, U.-M., Stark, R., & Mandl, H. (2009). The effects of cooperative learning and feedback on e-learning in statistics. Learning and Instruction, 19, 158–170.
La Greca, A. M. (2005). Editorial. Journal of Consulting and Clinical Psychology, 73, 3–5.
Liu, T.-C., Lin, Y.-C., & Kinshuk (2010). The application of Simulation-Assisted Learning Statistics (SALS) for correcting misconceptions and improving understanding of correlation. Journal of Computer Assisted Learning, 26, 143–158.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Taylor & Francis.
Malloy, T. E., & Jensen, G. C. (2001). Utah Virtual Lab: JAVA interactivity for teaching science and statistics on line. Behavior Research Methods, Instruments & Computers, 33, 282–286.
Mazzolini, M., & Maddison, S. (2007). When to jump in: The role of the instructor in online discussion forums. Computers & Education, 49, 193–213.
McGrath, R. E. (2011). Quantitative models in psychology. Washington, DC: American Psychological Association.
McQuillan, L., & Kennelly, L. (2005). New study finds U.S. math students consistently behind their peers around the world: Findings challenge conventional wisdom about U.S. math success in early grades. Washington, DC: American Institutes for Research.
Meletiou-Mavrotheris, M., & Lee, C. (2002). Teaching students the stochastic nature of statistical concepts in an introductory statistics course. Statistics Education Research Journal, 1, 22–37.
Mills, J. D. (2004). Students' attitudes toward statistics: Implications for the future. College Student Journal, 38, 349–361.
Millsap, R. E., & Maydeu-Olivares, A. (Eds.) (2009). The Sage handbook of quantitative methods in psychology. Thousand Oaks, CA: Sage Publications.
Mulhern, G., & Wylie, J. (2004). Changing levels of numeracy and other core mathematical skills among psychology undergraduates between 1992 and 2002. British Journal of Psychology, 95, 355–370.
Mulhern, G., & Wylie, J. (2006). Mathematical prerequisites for learning statistics in psychology: Assessing core skills of numeracy and mathematical reasoning among undergraduates. Psychology Learning & Teaching, 5, 119–132.
Neal-Barnett, A., Mitchell, M., & Boeltar, C. (2002). Faculty of color serving students, serving self: The psychology group. Teaching of Psychology, 29, 44–45.
Nie, Y., & Lau, S. (2010). Differential relations of constructivist and didactic instruction to students' cognition, motivation, and achievement. Learning and Instruction, 20, 411–423.
Odgaard, E. C., & Fowler, R. L. (2010). Confidence intervals for effect sizes: Compliance and clinical significance in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 78, 287–297.
Onwuegbuzie, A. J. (2000). Attitudes toward statistics assessments. Assessment and Evaluation in Higher Education, 25, 321–339.
Onwuegbuzie, A. J., Collins, K. M. T., & Jiao, Q. G. (2009). Performance of cooperative learning groups in a postgraduate education research methodology course: The role of social interdependence. Active Learning in Higher Education, 10, 265–277.
Onwuegbuzie, A. J., & Wilson, V. A. (2003). Statistics anxiety: Nature, etiology, antecedents, effects, and treatments: A comprehensive review of literature. Teaching in Higher Education, 8, 195–209.
Peden, B. F. (2001). Correlational analysis and interpretation: Graphs prevent gaffes. Teaching of Psychology, 28, 129–131.
Peterson, S. E., & Miller, J. A. (2004). Quality of college students' experiences during cooperative learning. Social Psychology of Education, 7(2), 161–183.
Piotrowski, C., Bagui, S. C., & Hemasinha, R. (2002). Development of a measure of statistics anxiety in graduate-level psychology students. Journal of Instructional Psychology, 29, 97–100.
Quilici, J. L., & Mayer, R. E. (2002). Teaching students to recognize structural similarities between statistics word problems. Applied Cognitive Psychology, 16, 325–342.
Rajecki, D. W., Appleby, D., Williams, C. C., Johnson, K., & Jeschke, M. P. (2005). Statistics can wait: Career plans activity and course preferences of American psychology undergraduates. Psychology Learning & Teaching, 4, 83–89.
Ridgway, J., Nicholson, J., & McCusker, S. (2007). Teaching statistics—Despite its applications. Teaching Statistics, 29, 44–48.
Rodarte-Luna, B., & Sherry, A. (2008). Sex differences in the relation between statistics anxiety and cognitive/learning strategies. Contemporary Educational Psychology, 33, 327–344.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12.
Romero, V. L., Berger, D. E., Healy, M. R., & Aberson, C. L. (2000). Using cognitive learning theory to design effective online statistics tutorials. Behavior Research Methods, Instruments & Computers, 32, 246–249.
Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59–82.
Rumsey, D. J. (2002). Statistical literacy as a goal for introductory statistics courses. Journal of Statistics Education, 10(3). Retrieved May 9, 2012 from http://www.amstat.org/publications/jse/v10n3/rumsey2.html.
Saville, B. K., Zinn, T. E., Lawrence, N. K., Barron, K. E., & Andre, J. (2008). Teaching critical thinking in statistics and research methods. In Dunn, D. S., Halonen, J. S., & Smith, R. A. (Eds.), Teaching critical thinking in psychology: A handbook of best practices (pp. 149–160). New York: Wiley-Blackwell.
Schwartz, D., & Martin, T. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22, 129–184.
Sen, S. (2004). Teaching quantitative thinking: Comment. Science, 304(5670), 518.
Simonton, D. K. (2004). Psychology's status as a scientific discipline: Its empirical placement within an implicit hierarchy of the sciences. Review of General Psychology, 8, 59–67.
Sosa, S., Berger, D. E., Saw, A. T., & Mary, J. C. (2010). Effectiveness of computer-assisted instruction in statistics: A meta-analysis. Review of Educational Research, published online August 12, 2010. DOI: 10.3102/0034654310378174
Stark, R., & Krause, U.-M. (2009). Effects of reflection prompts on learning outcomes and learning behaviour in statistics
Tomcho, T. J., Rice, D., Foels, R., Folmsbee, L., Vladescu, J., Lissman, R., Matulewicz, R., et al. (2009). APA's learning objectives for research methods and statistics in practice: A multimethod analysis. Teaching of Psychology, 36, 84–89.
Tremblay, P. F., Gardner, R. C., & Heipel, G. (2000). A model of the relationships among measures of affect, aptitude, and performance in introductory statistics. Canadian Journal of Behavioural Science, 32, 40–48.
Upton, G., & Cook, I. (2008). A dictionary of statistics. New York: Oxford University Press.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis vs. common factor analysis: Some issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25, 1–28.
Ware, M. E., & Johnson, D. E. (Eds.) (2000). Handbook of demonstrations and activities in the teaching of psychology: Introductory, statistics, research methods, and history, Vol. I (2nd ed.). Mahwah, NJ: Erlbaum.
Watkins, A. E., Scheaffer, R. L., & Cobb, G. W. (2004). Statistics in action: Understanding a world of data. New York: Key College Publishing.
Watson, J. M. (2006). Statistical literacy at school: Growth and goals. Mahwah, NJ: Erlbaum.
Wender, K. F., & Muehlboeck, J.-S. (2003). Animated diagrams in teaching statistics. Behavior Research Methods, Instruments & Computers, 35, 255–258.
Wilkinson, L., & The Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wulff, S. S., & Wulff, D. H. (2004). "Of course I'm communicating, I lecture every day": Enhancing teaching and learning in introductory statistics. Communication Education, 53, 92–103.
Zeedyk, M. S. (2006). Detective work on statistics street: Teaching statistics through humorous analogy. Psychology Learning & Teaching, 5, 97–109.
Zieffler, A., Garfield, J., Alt, S., Dupuis, D., Holleque, K., & Chang, B. (2008). What does research suggest about the teaching and learning of introductory statistics at the
education. Learning Environments Research, 12, 209–223. college level? A review of literature. Journal of Statistics
Stevens, J. P. (2009). Applied multivariate statistics for the social Education, 16 (2), Retrieved May 9, 2012 from http://
sciences (5th ed.). New York: Routledge. www/amstat.org/publications/jse/v16n2/zieffler.html.
Swingler, M. V., Bishop, P., & Swingler, K. M. (2009). SUMS: Zirkel, S. (2002). Is there a place for me? Role models and aca-
A flexible approach to the teaching and learning of statistics. demic identity among white students and students of color.
Psychology Learning & Teaching, 8, 39–45. Teachers College Record, 104, 357–376.
CHAPTER
Abstract
This chapter provides a unified treatment of seven major topics in test theory. It relates modern test theory, based on item response modeling, to classical linear modeling through the use of the common factor model. The topics include choosing a metric; measurement and measurement error (reliability); item selection; homogeneity and dimensionality; validity; equating tests; and comparing populations. The treatment of these problems makes no distributional assumptions.
Key Words: Test theory, metric